Deploying Qwen3 235B on JarvisLabs with vLLM

Deploying for Indian user data safety

a TypeScript full stack developer shipping scalable web apps and adding AI powered workflows on top.

This guide explains how to deploy Qwen3 235B on JarvisLabs and expose it as an OpenAI-compatible API endpoint.

We will deploy this model:

Qwen/Qwen3-235B-A22B-Instruct-2507-FP8

We will serve it with this API model name:

qwen3-235b-classifier

We will use:

JarvisLabs
4 x NVIDIA H200 GPUs
PyTorch template
vLLM
Hugging Face
OpenAI-compatible API

1. Create a JarvisLabs instance

Go to JarvisLabs and create a new GPU instance.

Choose:

Template: PyTorch
GPU: H200
GPU count: 4
Storage: 1000 GB

The model is large, so do not choose a small disk. We used 1000 GB of storage.
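As a rough sanity check on the disk size, the weight footprint can be estimated from the parameter count. This is a back-of-the-envelope sketch: it assumes about one byte per parameter for FP8 weights and ignores tokenizer files, any layers kept at higher precision, and cache overhead, so the real download size can differ.

```python
def estimate_weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate checkpoint size in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


# Assumption: ~235 billion parameters stored as FP8 (about 1 byte each).
fp8_gb = estimate_weights_gb(235, 1.0)
# For comparison, the same parameter count at 16-bit precision.
bf16_gb = estimate_weights_gb(235, 2.0)

print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
```

Under these assumptions, a 1000 GB disk leaves plenty of headroom for the download, the compile caches, and logs.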

After the instance starts, open the terminal from Jupyter or VS Code.

2. Check GPUs

Run:

nvidia-smi

You should see 4 H200 GPUs.

You can also check from Python:

python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"

Expected output:

True
4
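For a scripted version of this check, `nvidia-smi` can print one CSV line per GPU. The sketch below parses that output. The helper names and the sample text are illustrative assumptions, not output captured from a real instance.

```python
import subprocess


def parse_gpu_list(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory.total' CSV lines into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(memory)))
    return gpus


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for GPU names and total memory (needs NVIDIA drivers)."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_list(output)


# Illustrative sample only; the memory figure on a real instance may differ.
sample = "NVIDIA H200, 143771\nNVIDIA H200, 143771\n"
print(len(parse_gpu_list(sample)), "GPUs parsed from the sample")
```

On the instance, `len(query_gpus())` should equal 4 before you continue.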

3. Install vLLM and Hugging Face tools

Run:

pip install -U pip
pip install -U vllm huggingface_hub openai

Check vLLM:

vllm --version

4. Log in to Hugging Face

The model is downloaded from Hugging Face, so log in first.

Run:

hf auth login

Paste your Hugging Face token.

You can create a token from:

Hugging Face account
Settings
Access Tokens
New token

Use a read token.

Check login:

hf auth whoami

5. Set Hugging Face cache folder

The model is very large, so store it on the workspace disk.

Run:

mkdir -p /workspace/hf-cache

export HF_HOME=/workspace/hf-cache
export HUGGINGFACE_HUB_CACHE=/workspace/hf-cache

Save these settings for future terminal sessions:

echo 'export HF_HOME=/workspace/hf-cache' >> ~/.bashrc
echo 'export HUGGINGFACE_HUB_CACHE=/workspace/hf-cache' >> ~/.bashrc

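The login command saves the token under HF_HOME. If you logged in before setting HF_HOME, the token is in ~/.cache/huggingface/token instead, so run hf auth login again. The token should then be saved here: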

/workspace/hf-cache/token

Export it:

export HF_TOKEN="$(cat /workspace/hf-cache/token)"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"

Check that the token is loaded:

echo "$HF_TOKEN" | cut -c1-8

It should print something like:

hf_xxxxx

Do not share your full Hugging Face token.
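The token location depends on `HF_HOME`: `huggingface_hub` stores it at `$HF_HOME/token` and falls back to `~/.cache/huggingface/token` when `HF_HOME` is unset. The sketch below resolves that path and masks the token for safe display. The helper names are hypothetical, and the sketch ignores the `HF_TOKEN_PATH` and `XDG_CACHE_HOME` overrides.

```python
import os
from pathlib import Path


def token_path(env: dict[str, str]) -> Path:
    """Where the saved login token is expected, given environment variables."""
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "token"


def mask_token(token: str, visible: int = 8) -> str:
    """Show only the first few characters so the token is never printed whole."""
    token = token.strip()
    return token[:visible] + "*" * max(len(token) - visible, 0)


path = token_path(dict(os.environ))
if path.exists():
    print(path, mask_token(path.read_text()))
else:
    print(f"No token file at {path}; run `hf auth login` after setting HF_HOME.")
```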

6. Stop any old vLLM process

Before starting the model, stop any old vLLM process:

pkill -f "vllm serve"

Check GPU memory:

nvidia-smi

If old workers are still using GPU memory, wait a few seconds and check again.

7. Clean old vLLM compile cache

A previous failed start can leave broken compiled kernels in the vLLM cache. Clearing the cache helps avoid hitting the same failure again.

Run:

rm -rf /home/.cache/vllm/deep_gemm
rm -rf /home/.cache/vllm/torch_compile_cache
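The same cleanup can be scripted so that it reports which directories were actually removed. This sketch assumes the cache root used in the commands above (`/home/.cache/vllm`); adjust it if your instance keeps the vLLM cache elsewhere. The function name is hypothetical.

```python
import shutil
from pathlib import Path

# Subdirectories named in the commands above.
STALE_SUBDIRS = ("deep_gemm", "torch_compile_cache")


def clear_vllm_caches(cache_root: str) -> list[str]:
    """Delete the compile-cache subdirectories; return the ones that existed."""
    removed = []
    for name in STALE_SUBDIRS:
        target = Path(cache_root) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed


# Usage on the instance (uncomment to actually delete the caches):
# print(clear_vllm_caches("/home/.cache/vllm"))
```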

8. Start vLLM server

Set your API key:

export VLLM_API_KEY="your-strong-secret-key"

Use a strong secret key. Do not use a simple test key in production.

Now start the model:

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --served-model-name qwen3-235b-classifier \
  --host 0.0.0.0 \
  --port 6006 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85 \
  --dtype auto \
  --trust-remote-code \
  --enforce-eager \
  --linear-backend triton \
  --moe-backend triton \
  --api-key "$VLLM_API_KEY"

9. Why these flags are important

This flag sets the API model name:

--served-model-name qwen3-235b-classifier

This makes vLLM listen on all network interfaces, not only localhost:

--host 0.0.0.0

This uses port 6006:

--port 6006

This splits the model across 4 GPUs:

--tensor-parallel-size 4

This limits max context length:

--max-model-len 16384

This lets vLLM use up to 85% of each GPU's memory and leaves the rest free:

--gpu-memory-utilization 0.85

This runs the model in eager mode without CUDA graph capture, which helped avoid compile issues:

--enforce-eager

These two flags were important for our H200 setup:

--linear-backend triton
--moe-backend triton

Without these, vLLM tried to use a DeepGEMM path and failed with an NVCC compile error.
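To see how `--tensor-parallel-size 4` and `--gpu-memory-utilization 0.85` fit together, here is a rough per-GPU memory budget. It assumes 141 GB of memory per H200 and about 235 GB of FP8 weights split evenly across the GPUs. It ignores activation and runtime overhead, so treat the result as an estimate only.

```python
def per_gpu_budget_gb(gpu_mem_gb: float, utilization: float,
                      weights_gb: float, tp_size: int) -> dict[str, float]:
    """Rough split of each GPU's memory between weights and everything else."""
    usable = gpu_mem_gb * utilization      # cap set by --gpu-memory-utilization
    weights_share = weights_gb / tp_size   # even split under tensor parallelism
    return {
        "usable": usable,
        "weights": weights_share,
        "left_for_kv_cache_and_overhead": usable - weights_share,
    }


# Assumptions: 141 GB per H200, ~235 GB of FP8 weights, flags as in the command above.
budget = per_gpu_budget_gb(gpu_mem_gb=141, utilization=0.85, weights_gb=235, tp_size=4)
for key, value in budget.items():
    print(f"{key}: {value:.1f} GB")
```

Under these assumptions, roughly 61 GB per GPU remains for the KV cache and runtime overhead. The KV cache is where the `--max-model-len` setting matters.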

10. Wait for model loading

The first start takes a while because vLLM has to download and load a large model.

You may see logs like:

Loading safetensors checkpoint shards

Wait until you see:

Starting vLLM server on http://0.0.0.0:6006
Application startup complete

That means the API server is ready.
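Instead of watching the logs, a script can poll the `/health` endpoint until it answers with HTTP 200. This is a minimal sketch using only the standard library; the function name, timeout, and polling interval are arbitrary choices.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 1800, interval_s: float = 5) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` seconds pass."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not listening yet, or still loading the model.
        time.sleep(interval_s)
    return False


# Usage on the instance:
# wait_until_ready("http://localhost:6006/health")
```

The function returns `False` on timeout rather than raising, so the calling script can decide whether to retry or inspect the vLLM logs.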

11. Test inside the JarvisLabs terminal

Run:

curl http://localhost:6006/v1/models \
  -H "Authorization: Bearer your-strong-secret-key"

You should see a model list.

Example:

{
  "object": "list",
  "data": [
    {
      "id": "qwen3-235b-classifier",
      "object": "model"
    }
  ]
}
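A client can confirm that the served name is present before sending requests. The sketch below checks a parsed `/v1/models` response; the payload is the example above, and the helper name is hypothetical.

```python
import json


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


# The example response shown above.
payload = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "qwen3-235b-classifier", "object": "model"}
  ]
}
""")

if "qwen3-235b-classifier" in served_model_ids(payload):
    print("served model name found")
else:
    print("served model name missing; check --served-model-name")
```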

Now test chat completion:

curl http://localhost:6006/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-strong-secret-key" \
  -d '{
    "model": "qwen3-235b-classifier",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'

If this works locally, vLLM is running correctly.

12. Get the public API endpoint

JarvisLabs gives a notebook URL.

It may look like this:

https://your-instance-id.notebooksn.jarvislabs.net

In our setup, the working API URL was directly:

https://your-instance-id.notebooksn.jarvislabs.net/v1

Do not use:

/proxy/6006/v1

That gave a Not Found error in our setup.

The correct test from a local PC was:

curl https://your-instance-id.notebooksn.jarvislabs.net/v1/models \
  -H "Authorization: Bearer your-strong-secret-key"

If it returns the model list, your public endpoint is ready.

13. Use the endpoint with curl from your PC

Run this from your local laptop:

curl https://your-instance-id.notebooksn.jarvislabs.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-strong-secret-key" \
  -d '{
    "model": "qwen3-235b-classifier",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'

14. Use the endpoint with Python

Install the OpenAI SDK on your local machine:

pip install openai

Create a Python file named test_qwen.py:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-instance-id.notebooksn.jarvislabs.net/v1",
    api_key="your-strong-secret-key",
)

response = client.chat.completions.create(
    model="qwen3-235b-classifier",
    messages=[
        {
            "role": "user",
            "content": "Hello"
        }
    ],
    temperature=0,
    max_tokens=100,
)

print(response.choices[0].message.content)

Run:

python test_qwen.py
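If you would rather not install the SDK, the same request can be sent with the Python standard library. This sketch only builds the request. The base URL and key are the placeholders used throughout this guide, and `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 100,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "https://your-instance-id.notebooksn.jarvislabs.net/v1",
    "your-strong-secret-key",
    "qwen3-235b-classifier",
    "Hello",
)
print(request.get_method(), request.full_url)

# To send it against a live endpoint:
# with urllib.request.urlopen(request, timeout=60) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```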

15. Use the endpoint with Node.js

Install the OpenAI SDK:

npm install openai

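Create a JavaScript file named test-qwen.js. The code uses an ES module import and top-level await, so set "type": "module" in package.json if your Node.js version does not detect ES modules automatically: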

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://your-instance-id.notebooksn.jarvislabs.net/v1",
  apiKey: "your-strong-secret-key",
});

const response = await client.chat.completions.create({
  model: "qwen3-235b-classifier",
  messages: [
    {
      role: "user",
      content: "Hello",
    },
  ],
  temperature: 0,
  max_tokens: 100,
});

console.log(response.choices[0].message.content);

Run:

node test-qwen.js

16. Useful debug commands

Check GPU:

nvidia-smi

Check vLLM process:

ps aux | grep vllm

Check port 6006:

ss -ltnp | grep 6006

Check local model list:

curl http://localhost:6006/v1/models \
  -H "Authorization: Bearer your-strong-secret-key"

Check health:

curl http://localhost:6006/health

Check Hugging Face cache size:

du -sh /workspace/hf-cache

Check disk space:

df -h
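The port and disk checks can also be combined into one small script. This is a sketch using only the standard library; the port is the one used in this guide, and the function names are hypothetical.

```python
import shutil
import socket


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def free_disk_gb(path: str) -> float:
    """Free space on the filesystem containing `path`, in GB (1e9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


print("vLLM port 6006 open:", port_is_open("localhost", 6006))
print(f"Free disk at /: {free_disk_gb('/'):.0f} GB")
```

On the instance, pass the cache location (for example `/workspace`) to `free_disk_gb` to check the disk that holds the model.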

17. Common problems and fixes

Problem: Hugging Face warning

You may see:

You are sending unauthenticated requests to the HF Hub

Fix:

hf auth login
export HF_TOKEN="$(cat /workspace/hf-cache/token)"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"

Problem: Connection refused

You may see:

curl: Failed to connect to localhost port 6006

This means vLLM is not ready or it crashed.

Check:

ps aux | grep vllm
nvidia-smi

Wait until you see:

Application startup complete

Problem: Not Found with proxy path

Wrong:

https://your-instance-id.notebooksn.jarvislabs.net/proxy/6006/v1/models

Correct in our setup:

https://your-instance-id.notebooksn.jarvislabs.net/v1/models

Problem: NVCC compilation failed

You may see:

NVCC compilation failed
deep_gemm.py
fp8_gemm_nt_op

Fix it by using the Triton backends:

--linear-backend triton
--moe-backend triton
--enforce-eager

Also clear the cache before restarting:

rm -rf /home/.cache/vllm/deep_gemm
rm -rf /home/.cache/vllm/torch_compile_cache

Final working command

This is the final command that worked:

export HF_HOME=/workspace/hf-cache
export HUGGINGFACE_HUB_CACHE=/workspace/hf-cache
export HF_TOKEN="$(cat /workspace/hf-cache/token)"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
export VLLM_API_KEY="your-strong-secret-key"

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --served-model-name qwen3-235b-classifier \
  --host 0.0.0.0 \
  --port 6006 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85 \
  --dtype auto \
  --trust-remote-code \
  --enforce-eager \
  --linear-backend triton \
  --moe-backend triton \
  --api-key "$VLLM_API_KEY"

Final API format

Base URL:

https://your-instance-id.notebooksn.jarvislabs.net/v1

Model name:

qwen3-235b-classifier

API key:

your-strong-secret-key

Example endpoint:

https://your-instance-id.notebooksn.jarvislabs.net/v1/chat/completions

That is it. The model is now deployed on JarvisLabs and can be used as an OpenAI-compatible API.