Production hardware
The production Mako-32B Conductor runs on:

| Spec | Value |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell |
| VRAM | 96 GB |
| vCPUs | 23 |
| RAM | 256 GB |
Setup
Create a Hugging Face account
Sign up at huggingface.co and subscribe to a PRO or Enterprise plan for access to Inference Endpoints.
Create an Inference Endpoint
Go to Inference Endpoints → New Endpoint. Configure:
- Model: Set to the Mako-32B Conductor model repository
- Instance type: Select an instance with at least 48 GB VRAM (RTX PRO 6000 Blackwell recommended)
- Region: Choose the region closest to your gateway server
- Scaling: Configure min/max replicas based on expected load
Get your credentials
- Endpoint URL: Found on the endpoint dashboard page
- API Token: Generate under Settings → Access Tokens
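
As a minimal sketch of how the two credentials fit together, the snippet below builds an authenticated request using only the Python standard library. The function names `build_request` and `send` are hypothetical, and the `{"inputs": ...}` JSON body is an assumption about the request schema; check the handler deployed on your endpoint for the exact payload it expects.

```python
import json
import urllib.request


def build_request(endpoint_url: str, api_token: str, prompt: str) -> urllib.request.Request:
    """Build an authenticated POST request for the endpoint (illustrative helper)."""
    # Assumption: the endpoint accepts a JSON body with an "inputs" field.
    body = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            # The API token is sent as a bearer token.
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 120.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8")
```

Keeping the token in an environment variable rather than in source code avoids committing it by accident.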
Cold start considerations
- First request after scaling from zero may take 30–90 seconds while the GPU instance starts
- Subsequent requests to a running instance typically complete in 3–10 seconds
- The gateway returns a `503` status during cold starts, allowing frontends to display a loading indicator
- Configure a minimum replica count of 1 to eliminate cold starts (at the cost of higher GPU spend)
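
One way a client could handle the cold-start window is to retry while the gateway answers `503`. The sketch below is an assumption about client behavior, not part of the gateway itself: `call_with_cold_start_retry` is a hypothetical helper, `send` is any caller-supplied function that performs one request and returns `(status, body)`, and the 120-second budget and 5-second delay are illustrative values chosen to cover the 30–90 second startup range mentioned above.

```python
import time
from typing import Callable, Tuple


def call_with_cold_start_retry(
    send: Callable[[], Tuple[int, str]],
    max_wait_s: float = 120.0,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry a request while the gateway reports 503 (instance still starting)."""
    waited = 0.0
    status, body = send()
    # Keep polling only while the response is 503 and the wait budget remains.
    while status == 503 and waited < max_wait_s:
        sleep(delay_s)
        waited += delay_s
        status, body = send()
    # Returns the first non-503 response, or the last 503 if the budget ran out.
    return status, body
```

A frontend can show its loading indicator for as long as this call is pending, and surface an error if the final status is still `503`.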