Deploying Qwen3 235B on JarvisLabs with vLLM
Deploying for Indian user data safety

This guide explains how to deploy Qwen3 235B on JarvisLabs and expose it as an OpenAI-compatible API endpoint.
We will deploy this model:
Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
We will serve it with this API model name:
qwen3-235b-classifier
We will use:
JarvisLabs
4 x NVIDIA H200 GPUs
PyTorch template
vLLM
Hugging Face
OpenAI-compatible API
1. Create a JarvisLabs instance
Go to JarvisLabs and create a new GPU instance.
Choose:
Template: PyTorch
GPU: H200
GPU count: 4
Storage: 1000 GB
The FP8 checkpoint is roughly 235 GB (about one byte per parameter), so do not use a small disk. We used 1000 GB of storage.
After the instance starts, open the terminal from Jupyter or VS Code.
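As a rough sanity check on the disk size, the arithmetic can be sketched in Python. This assumes about one byte per parameter for FP8 weights and a 2x safety margin for the cache and temporary files. Both numbers are illustrative assumptions, not measurements from our instance, and the helper names are hypothetical.

```python
def estimate_weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate checkpoint size in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


def disk_is_enough(disk_gb: float, weights_gb: float, margin: float = 2.0) -> bool:
    """True if the disk holds the weights with a safety margin (assumed 2x)."""
    return disk_gb >= weights_gb * margin


weights_gb = estimate_weights_gb(235, 1.0)  # FP8: ~1 byte per parameter (assumption)
print(f"Estimated weights: ~{weights_gb:.0f} GB")
print("1000 GB disk is enough:", disk_is_enough(1000, weights_gb))
```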
2. Check GPUs
Run:
nvidia-smi
You should see 4 H200 GPUs.
You can also check from Python:
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"
Expected output:
True
4
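If you prefer a scripted check, the following sketch parses the CSV output of nvidia-smi --query-gpu=name,memory.total --format=csv,noheader. The helper name parse_gpu_list and the sample text are hypothetical; on a real instance you would pass the actual command output instead.

```python
def parse_gpu_list(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, 12345 MiB' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory = (part.strip() for part in line.split(",", 1))
        gpus.append((name, int(memory.split()[0])))  # drop the 'MiB' unit
    return gpus


# Hypothetical sample text. Real output comes from:
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
sample = "NVIDIA H200, 143771 MiB\n" * 4
gpus = parse_gpu_list(sample)
print(len(gpus), "GPUs found")
```

On the instance, the text could come from subprocess.run(..., capture_output=True, text=True).stdout.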
3. Install vLLM and Hugging Face tools
Run:
pip install -U pip
pip install -U vllm huggingface_hub openai
Check vLLM:
vllm --version
4. Log in to Hugging Face
The model will download from Hugging Face, so log in first.
Run:
hf auth login
Paste your Hugging Face token.
You can create a token from:
Hugging Face account
Settings
Access Tokens
New token
Use a read token.
Check login:
hf auth whoami
5. Set Hugging Face cache folder
The model is very large, so store it in the workspace disk.
Run:
mkdir -p /workspace/hf-cache
export HF_HOME=/workspace/hf-cache
export HUGGINGFACE_HUB_CACHE=/workspace/hf-cache
Save these settings for future terminal sessions:
echo 'export HF_HOME=/workspace/hf-cache' >> ~/.bashrc
echo 'export HUGGINGFACE_HUB_CACHE=/workspace/hf-cache' >> ~/.bashrc
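If HF_HOME was already set when you logged in, the token is saved here:
/workspace/hf-cache/token
If you logged in before setting HF_HOME (as in step 4), the token is in the default location, ~/.cache/huggingface/token. In that case, run hf auth login again so the token is written to the workspace disk.
Export it:
export HF_TOKEN="$(cat /workspace/hf-cache/token)"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"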
Check that the token is loaded:
echo "$HF_TOKEN" | cut -c1-8
It should print something like:
hf_xxxxx
Do not share your full Hugging Face token.
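The same precaution applies inside scripts that log their configuration. A minimal sketch, assuming you only ever want the first few characters of a token to appear in output (mask_token is a hypothetical helper, and the token below is made up):

```python
def mask_token(token: str, visible: int = 8) -> str:
    """Return only the first `visible` characters, followed by '...'."""
    if len(token) <= visible:
        return "*" * len(token)  # too short to show any part safely
    return token[:visible] + "..."


print(mask_token("hf_exampleTokenValue123"))  # made-up token, not a real one
```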
6. Stop any old vLLM process
Before starting the model, stop any old vLLM process:
pkill -f "vllm serve"
Check GPU memory:
nvidia-smi
If old workers are still using GPU memory, wait a few seconds and check again.
7. Clean old vLLM compile cache
This helps avoid old failed kernel cache issues.
Run:
rm -rf /home/.cache/vllm/deep_gemm
rm -rf /home/.cache/vllm/torch_compile_cache
8. Start vLLM server
Set your API key:
export VLLM_API_KEY="your-strong-secret-key"
Use a strong secret key. Do not use a simple test key in production.
Now start the model:
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --served-model-name qwen3-235b-classifier \
  --host 0.0.0.0 \
  --port 6006 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85 \
  --dtype auto \
  --trust-remote-code \
  --enforce-eager \
  --linear-backend triton \
  --moe-backend triton \
  --api-key "$VLLM_API_KEY"
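If you launch the server from a script, it can help to keep the settings in one place. The sketch below rebuilds the command from a dictionary, using only the flags from this guide. The helper build_vllm_command is hypothetical, and the API key is deliberately left out so the secret does not end up in printed output; append --api-key "$VLLM_API_KEY" when you actually run it.

```python
import shlex


def build_vllm_command(model: str, options: dict, switches: list[str]) -> list[str]:
    """Build the argv list for `vllm serve` from key/value options and bare switches."""
    argv = ["vllm", "serve", model]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    argv += [f"--{switch}" for switch in switches]
    return argv


argv = build_vllm_command(
    "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    {
        "served-model-name": "qwen3-235b-classifier",
        "host": "0.0.0.0",
        "port": 6006,
        "tensor-parallel-size": 4,
        "max-model-len": 16384,
        "gpu-memory-utilization": 0.85,
        "dtype": "auto",
        "linear-backend": "triton",
        "moe-backend": "triton",
    },
    ["trust-remote-code", "enforce-eager"],
)
print(shlex.join(argv))  # shell-safe, single-line form of the command
```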
9. Why these flags are important
This flag sets the API model name:
--served-model-name qwen3-235b-classifier
This binds vLLM to all network interfaces, so it is reachable from outside localhost:
--host 0.0.0.0
This uses port 6006:
--port 6006
This splits the model across 4 GPUs:
--tensor-parallel-size 4
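This limits the maximum context length (prompt plus generated tokens) to 16,384 tokens:
--max-model-len 16384
This caps vLLM at 85% of each GPU's memory and leaves the rest free as headroom:
--gpu-memory-utilization 0.85
This runs the model in eager mode and skips CUDA graph capture, which helped avoid compile issues in our setup: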
--enforce-eager
These two flags were important for our H200 setup:
--linear-backend triton
--moe-backend triton
Without these, vLLM tried to use a DeepGEMM path and failed with an NVCC compile error.
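To see why 4 GPUs and a 0.85 memory fraction fit this model, here is a back-of-the-envelope calculation. It assumes 141 GB of memory per H200, about 235 GB of FP8 weights, and an even split of the weights across the tensor-parallel GPUs. These are approximations for illustration, not figures measured on our instance.

```python
def memory_budget_gb(gpu_count: int, gpu_mem_gb: float, utilization: float) -> float:
    """Total GPU memory vLLM may use across all GPUs."""
    return gpu_count * gpu_mem_gb * utilization


def per_gpu_weights_gb(weights_gb: float, tensor_parallel: int) -> float:
    """Weights held by each GPU, assuming an even tensor-parallel split."""
    return weights_gb / tensor_parallel


budget = memory_budget_gb(4, 141, 0.85)  # 141 GB per H200 (assumption)
weights = per_gpu_weights_gb(235, 4)     # ~235 GB of FP8 weights (assumption)
print(f"vLLM budget: ~{budget:.0f} GB total, ~{budget / 4:.0f} GB per GPU")
print(f"Weights: ~{weights:.2f} GB per GPU")
print(f"Left for KV cache and activations: ~{budget / 4 - weights:.0f} GB per GPU")
```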
10. Wait for model loading
The first start will take time.
vLLM will download and load a large model.
You may see logs like:
Loading safetensors checkpoint shards
Wait until you see:
Starting vLLM server on http://0.0.0.0:6006
Application startup complete
That means the API server is ready.
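Instead of watching the logs, a script can poll the server until it answers. The sketch below is one way to do that, assuming the /health route on port 6006 used in this guide. The function names, the 30-minute timeout, and the 10-second interval are hypothetical choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 1800.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def health_probe(url: str = "http://localhost:6006/health") -> bool:
    """True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or still loading
```

On the instance, wait_until_ready(health_probe) returns True once the server responds, or False after the timeout.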
11. Test inside the JarvisLabs terminal
Run:
curl http://localhost:6006/v1/models \
-H "Authorization: Bearer your-strong-secret-key"
You should see a model list.
Example:
{
  "object": "list",
  "data": [
    {
      "id": "qwen3-235b-classifier",
      "object": "model"
    }
  ]
}
Now test chat completion:
curl http://localhost:6006/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-strong-secret-key" \
  -d '{
    "model": "qwen3-235b-classifier",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'
If this works locally, vLLM is running correctly.
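A script can confirm that the served model name is present in the /v1/models response. This is a minimal sketch that works on the response body as a string; served_model_ids is a hypothetical helper, and the body below mirrors the example response shown in this step.

```python
import json


def served_model_ids(models_json: str) -> list[str]:
    """Extract model ids from a /v1/models response body."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]


# Same shape as the example model list in this step.
body = '{"object": "list", "data": [{"id": "qwen3-235b-classifier", "object": "model"}]}'
ids = served_model_ids(body)
print("served models:", ids)
```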
12. Get the public API endpoint
JarvisLabs gives a notebook URL.
It may look like this:
https://your-instance-id.notebooksn.jarvislabs.net
In our setup, the working API URL was directly:
https://your-instance-id.notebooksn.jarvislabs.net/v1
Do not use:
/proxy/6006/v1
That gave a Not Found error in our setup.
The correct test from a local PC was:
curl https://your-instance-id.notebooksn.jarvislabs.net/v1/models \
-H "Authorization: Bearer your-strong-secret-key"
If it returns the model list, your public endpoint is ready.
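Client code can guard against the wrong URL form. The sketch below normalizes the instance URL into a base URL ending in /v1 and rejects the /proxy/ path that failed in our setup. The helper api_base_url is hypothetical, and the rule reflects only what we observed on our instance.

```python
def api_base_url(notebook_url: str) -> str:
    """Turn the instance URL into the OpenAI-compatible base URL ending in /v1."""
    base = notebook_url.rstrip("/")
    if "/proxy/" in base:
        raise ValueError("do not use a /proxy/<port> path; use the instance URL directly")
    return base if base.endswith("/v1") else base + "/v1"


print(api_base_url("https://your-instance-id.notebooksn.jarvislabs.net/"))
```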
13. Use the endpoint with curl from your PC
Run this from your local laptop:
curl https://your-instance-id.notebooksn.jarvislabs.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-strong-secret-key" \
  -d '{
    "model": "qwen3-235b-classifier",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'
14. Use the endpoint with Python
Install the OpenAI SDK on your local machine:
pip install openai
Create a Python file named test_qwen.py:
from openai import OpenAI

# Point the OpenAI client at the vLLM endpoint instead of api.openai.com.
client = OpenAI(
    base_url="https://your-instance-id.notebooksn.jarvislabs.net/v1",
    api_key="your-strong-secret-key",
)

response = client.chat.completions.create(
    model="qwen3-235b-classifier",
    messages=[
        {
            "role": "user",
            "content": "Hello",
        }
    ],
    temperature=0,
    max_tokens=100,
)

print(response.choices[0].message.content)
Run:
python test_qwen.py
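To keep the API key out of source files, the client settings can be read from environment variables. A minimal sketch follows; the variable names QWEN_BASE_URL and QWEN_API_KEY and the helper load_client_config are hypothetical, not something JarvisLabs or vLLM defines.

```python
import os


def load_client_config(env=os.environ) -> dict:
    """Read base URL and API key from environment variables (hypothetical names)."""
    missing = [name for name in ("QWEN_BASE_URL", "QWEN_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {"base_url": env["QWEN_BASE_URL"], "api_key": env["QWEN_API_KEY"]}


# Demo with an in-memory mapping; in real use, call load_client_config() with no arguments.
demo_env = {
    "QWEN_BASE_URL": "https://your-instance-id.notebooksn.jarvislabs.net/v1",
    "QWEN_API_KEY": "your-strong-secret-key",
}
config = load_client_config(demo_env)
print(config["base_url"])
```

The result can be passed straight to the SDK as OpenAI(**load_client_config()).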
15. Use the endpoint with Node.js
Install the OpenAI SDK:
npm install openai
Create a JavaScript file named test-qwen.mjs. The .mjs extension tells Node.js to treat the file as an ES module, which import and top-level await require:
import OpenAI from "openai";

// Point the OpenAI client at the vLLM endpoint instead of api.openai.com.
const client = new OpenAI({
  baseURL: "https://your-instance-id.notebooksn.jarvislabs.net/v1",
  apiKey: "your-strong-secret-key",
});

const response = await client.chat.completions.create({
  model: "qwen3-235b-classifier",
  messages: [
    {
      role: "user",
      content: "Hello",
    },
  ],
  temperature: 0,
  max_tokens: 100,
});

console.log(response.choices[0].message.content);
Run:
node test-qwen.mjs
16. Useful debug commands
Check GPU:
nvidia-smi
Check vLLM process:
ps aux | grep vllm
Check port 6006:
ss -ltnp | grep 6006
Check local model list:
curl http://localhost:6006/v1/models \
-H "Authorization: Bearer your-strong-secret-key"
Check health:
curl http://localhost:6006/health
Check Hugging Face cache size:
du -sh /workspace/hf-cache
Check disk space:
df -h
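The disk check can also be done from Python before starting a large download. This sketch uses shutil.disk_usage from the standard library; the helper names are hypothetical, and the 235 GB threshold mentioned below is the rough FP8 checkpoint size assumed in this guide.

```python
import shutil


def free_space_gb(path: str) -> float:
    """Free space on the filesystem containing `path`, in GB (1e9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_free_space(path: str, needed_gb: float) -> bool:
    """True if at least `needed_gb` GB are free at `path`."""
    return free_space_gb(path) >= needed_gb


print(f"Free space in current directory: {free_space_gb('.'):.1f} GB")
```

On the instance, has_free_space("/workspace/hf-cache", 235) would tell you whether the checkpoint is likely to fit.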
17. Common problems and fixes
Problem: Hugging Face warning
You may see:
You are sending unauthenticated requests to the HF Hub
Fix:
hf auth login
export HF_TOKEN="$(cat /workspace/hf-cache/token)"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
Problem: Connection refused
You may see:
curl: Failed to connect to localhost port 6006
This means vLLM is not ready or it crashed.
Check:
ps aux | grep vllm
nvidia-smi
Wait until you see:
Application startup complete
Problem: Not Found with proxy path
Wrong:
https://your-instance-id.notebooksn.jarvislabs.net/proxy/6006/v1/models
Correct in our setup:
https://your-instance-id.notebooksn.jarvislabs.net/v1/models
Problem: NVCC compilation failed
You may see:
NVCC compilation failed
deep_gemm.py
fp8_gemm_nt_op
Fix by using Triton backend:
--linear-backend triton
--moe-backend triton
--enforce-eager
Also clean cache before restarting:
rm -rf /home/.cache/vllm/deep_gemm
rm -rf /home/.cache/vllm/torch_compile_cache
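The problems above can be collected into a small lookup that scans log text for the known error strings. This is only a sketch: the KNOWN_PROBLEMS table and the diagnose helper are hypothetical, and the hints simply restate the fixes from this section.

```python
# Error markers and fixes taken from the problems listed in this section.
KNOWN_PROBLEMS = {
    "unauthenticated requests to the HF Hub":
        "Run `hf auth login` and export HF_TOKEN.",
    "Failed to connect to localhost port 6006":
        "vLLM is not ready or crashed; check the process and wait for startup.",
    "NVCC compilation failed":
        "Use the Triton backends with --enforce-eager and clean the vLLM cache.",
}


def diagnose(log_text: str) -> list[str]:
    """Return the fix hints whose error marker appears in the log text."""
    return [fix for marker, fix in KNOWN_PROBLEMS.items() if marker in log_text]


for hint in diagnose("... NVCC compilation failed in deep_gemm.py ..."):
    print(hint)
```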
Final working command
This is the final command that worked:
export HF_HOME=/workspace/hf-cache
export HUGGINGFACE_HUB_CACHE=/workspace/hf-cache
export HF_TOKEN="$(cat /workspace/hf-cache/token)"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
export VLLM_API_KEY="your-strong-secret-key"
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --served-model-name qwen3-235b-classifier \
  --host 0.0.0.0 \
  --port 6006 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85 \
  --dtype auto \
  --trust-remote-code \
  --enforce-eager \
  --linear-backend triton \
  --moe-backend triton \
  --api-key "$VLLM_API_KEY"
Final API format
Base URL:
https://your-instance-id.notebooksn.jarvislabs.net/v1
Model name:
qwen3-235b-classifier
API key:
your-strong-secret-key
Example endpoint:
https://your-instance-id.notebooksn.jarvislabs.net/v1/chat/completions
That is it. The model is now deployed on JarvisLabs and can be used like an OpenAI-compatible API.





