GPU inference

Consumer GPUs. Production prices.

Batch inference and fine-tuning on idle RTX 4090s, RTX 5090s, and Apple Silicon Macs (M3 Max, M4). $0.20 / GPU-hour, roughly 10x cheaper per token than H100 spot pricing for typical 7B–34B workloads. Hugging Face TGI, vLLM, and MLX templates included. Bring your model. We route the work.

Start at $0.20 / GPU-hour

Hardware mix

NVIDIA 4090 / 5090

Consumer cards with 24 GB or more of VRAM, supplied by gamers renting out idle hardware. Great for 7B–34B models, batch inference, and LoRA fine-tunes.
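A rough way to check whether a model fits on one card is to multiply parameter count by bytes per parameter. The sketch below estimates weight memory only. The 4 GB headroom for KV cache and activations is an assumption for illustration, not a figure from our scheduler, and the function names are hypothetical.

```python
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * (bits / 8) bytes / 1e9 bytes per GB
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: int, vram_gb: float,
         headroom_gb: float = 4.0) -> bool:
    """True if weights plus an assumed headroom fit in the card's VRAM.

    headroom_gb is a placeholder for KV cache, activations, and runtime
    overhead; real usage depends on batch size and context length.
    """
    return weight_vram_gb(params_billion, bits_per_param) + headroom_gb <= vram_gb


if __name__ == "__main__":
    print(fits(7, 16, 24))   # 7B at 16-bit on a 24 GB card
    print(fits(34, 4, 24))   # 34B at 4-bit quantization on a 24 GB card
    print(fits(34, 16, 24))  # 34B at 16-bit on a 24 GB card
```

Under these assumptions a 7B model fits at 16-bit precision, while a 34B model needs 4-bit quantization to fit on a single 24 GB card.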

Apple Silicon MLX

M3 Max and M4 Macs running MLX. Especially good for Mistral and Llama models and for Whisper transcription.

Per-second billing

The first minute is rounded up to a full minute; after that, billing is per second. A good fit for short inference bursts.
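To make the arithmetic concrete, here is a minimal sketch of how a job's cost could be estimated at the listed $0.20 / GPU-hour. It assumes "rounded" means a one-minute minimum and that partial seconds round up. Both are assumptions, and `job_cost_usd` is a hypothetical helper, not part of any published API.

```python
import math

RATE_PER_GPU_HOUR = 0.20  # USD, the listed price


def job_cost_usd(runtime_seconds: float, gpus: int = 1) -> float:
    """Estimate job cost under an assumed one-minute minimum, then per-second billing."""
    if runtime_seconds <= 0:
        return 0.0
    # Assumption: anything under 60 s is billed as a full minute,
    # and fractional seconds are rounded up to the next whole second.
    billable_seconds = max(60, math.ceil(runtime_seconds))
    return round(billable_seconds * gpus * RATE_PER_GPU_HOUR / 3600, 6)


if __name__ == "__main__":
    for seconds in (20, 60, 90, 3600):
        print(seconds, job_cost_usd(seconds))
```

Under these assumptions a 20-second job and a 60-second job cost the same, and a full hour on one GPU comes to $0.20.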

Pre-flight benchmark

We benchmark each provider’s GPU before dispatch and only charge once we’ve confirmed advertised performance.
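The gate can be pictured as a simple comparison between advertised and measured throughput. This is an illustrative sketch only: the tokens-per-second metric, the 10% tolerance, and the names `BenchmarkResult` and `should_dispatch` are all assumptions, not a description of the production check.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    advertised_tokens_per_s: float  # what the provider claims (hypothetical metric)
    measured_tokens_per_s: float    # what the pre-flight run observed


def should_dispatch(result: BenchmarkResult, tolerance: float = 0.10) -> bool:
    """Dispatch (and start billing) only if measured throughput is within
    an assumed tolerance of the advertised figure."""
    floor = result.advertised_tokens_per_s * (1 - tolerance)
    return result.measured_tokens_per_s >= floor


if __name__ == "__main__":
    print(should_dispatch(BenchmarkResult(100.0, 95.0)))
    print(should_dispatch(BenchmarkResult(100.0, 80.0)))
```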

Templates

vLLM, Hugging Face TGI, Ollama, ComfyUI, AUTOMATIC1111 — start a job with a template name, no Dockerfile needed.

BYO weights

Mount weights from S3 or HF Hub. Provider hardware can’t exfiltrate model parameters — VRAM is wiped at job exit.

When to use us

Yes: batch inference (parallel embedding, document scoring, image generation), short fine-tunes, RAG re-ranking, audio transcription.
Yes: bursty workloads that don’t fit a 24/7 reserved-capacity contract.
Maybe: latency-sensitive real-time inference. Best-effort SLA in Phase 1; Phase 2 ships region-pinned reserved capacity.
No (yet): training from scratch on H100-class capacity. We’re consumer-GPU first — see RunPod or Lambda for H100/B200 needs.

Try one inference job

First $5 is on us. Run a vLLM job against Llama 3 8B and tell us how the latency compares.

See pricing