Skip to main content
The Mako-32B Conductor model requires GPU infrastructure to run. Hugging Face Inference Endpoints provides dedicated GPU instances that the gateway communicates with via an OpenAI-compatible API.

Production hardware

The production Mako-32B Conductor runs on:
SpecValue
GPUNVIDIA RTX PRO 6000 Blackwell
VRAM96 GB
vCPUs23x
RAM256 GB

Setup

1

Create a Hugging Face account

Sign up at huggingface.co and subscribe to a PRO or Enterprise plan for access to Inference Endpoints.
2

Create an Inference Endpoint

Go to Inference EndpointsNew Endpoint. Configure:
  • Model: Set to the Mako-32B Conductor model repository
  • Instance type: Select an instance with at least 48 GB VRAM (RTX PRO 6000 Blackwell recommended)
  • Region: Choose the region closest to your gateway server
  • Scaling: Configure min/max replicas based on expected load
3

Get your credentials

  • Endpoint URL: Found on the endpoint dashboard page
  • API Token: Generate under SettingsAccess Tokens
4

Configure the gateway

Add these to your gateway’s .env file:
RUNPOD_ENDPOINT_ID=your_hf_endpoint_url
RUNPOD_API_KEY=your_hf_api_token
MODEL_NAME=DeepMako/Mako-32B-Conductor
The gateway routes model: "conductor" requests to the configured endpoint.

Cold start considerations

  • First request after scaling from zero may take 30–90 seconds while the GPU instance starts
  • Subsequent requests to a running instance typically complete in 3–10 seconds
  • The gateway returns a 503 status during cold starts, allowing frontends to display a loading indicator
  • Configure a minimum replica count of 1 to eliminate cold starts (at the cost of higher GPU spend)