Hugging Face Setup

The Mako-32B Conductor model requires GPU infrastructure to run. Hugging Face Inference Endpoints provides dedicated GPU instances that the gateway communicates with via an OpenAI-compatible API.

Production hardware

The production Mako-32B Conductor runs on:

Spec	Value
GPU	NVIDIA RTX PRO 6000 Blackwell
VRAM	96 GB
vCPUs	23x
RAM	256 GB

Setup

Create a Hugging Face account

Create an Inference Endpoint

Go to Inference Endpoints → New Endpoint. Configure:

Model: Set to the Mako-32B Conductor model repository
Instance type: Select an instance with at least 48 GB VRAM (RTX PRO 6000 Blackwell recommended)
Region: Choose the region closest to your gateway server
Scaling: Configure min/max replicas based on expected load

Get your credentials

Endpoint URL: Found on the endpoint dashboard page
API Token: Generate under Settings → Access Tokens

Configure the gateway

Add these to your gateway’s .env file:

RUNPOD_ENDPOINT_ID=your_hf_endpoint_url
RUNPOD_API_KEY=your_hf_api_token
MODEL_NAME=DeepMako/Mako-32B-Conductor

The gateway routes model: "conductor" requests to the configured endpoint.

Cold start considerations

First request after scaling from zero may take 30–90 seconds while the GPU instance starts
Subsequent requests to a running instance typically complete in 3–10 seconds
The gateway returns a 503 status during cold starts, allowing frontends to display a loading indicator
Configure a minimum replica count of 1 to eliminate cold starts (at the cost of higher GPU spend)

Gateway Server

Environment Variables

⌘I

​Production hardware

​Setup

​Cold start considerations

Production hardware

Setup

Cold start considerations