Before starting the Ray cluster, purge the memory caches on all three Sparks; stale caches often cause contiguous memory allocation failures during vLLM's memory-profiling phase. Writing 3 to vm.drop_caches frees both the page cache and reclaimable slab objects (dentries and inodes); sync first so dirty pages are flushed to disk. Run this on every node:
sudo sync; sudo sysctl -w vm.drop_caches=3
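If the Sparks are reachable over SSH, the purge can be scripted from a single machine. This is a minimal sketch; the hostnames spark-01 through spark-03 are placeholders for your actual node names, and it assumes passwordless sudo for the sysctl call.

```shell
# Hypothetical hostnames -- substitute your own.
NODES="spark-01 spark-02 spark-03"

for node in $NODES; do
  # Flush dirty pages, then drop page cache + dentries/inodes on each Spark.
  ssh "$node" 'sudo sync && sudo sysctl -w vm.drop_caches=3'
done
```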
Establish the communication layer. The head node coordinates the pipeline, while the workers attach to it.
First, start the Head node on Spark-01:
ray start --head --port=6379 --num-gpus=4
Next, attach the workers by running the following on each of Spark-02 and Spark-03:
ray start --address=<SPARK_01_IP>:6379 --num-gpus=4
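Before moving on, it is worth confirming that both workers actually joined. A quick sketch, run on any node in the cluster (it assumes the same Ray version is installed everywhere): `ray.nodes()` returns one info dict per node, and counting the alive ones should give 3.

```shell
# Connect to the running cluster and count alive nodes; expect 3.
python3 -c "import ray; ray.init(address='auto'); \
print('alive nodes:', sum(n['Alive'] for n in ray.nodes()))"
```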
Invoke vLLM on the head node (Spark-01); Ray distributes the weights across all 12 GPUs. We use --tensor-parallel-size 4 to shard each layer across the 4 GPUs within a node, keeping the bandwidth-hungry tensor-parallel traffic on the fast intra-node links, and --pipeline-parallel-size 3 to split the layer stack across the 3 physical Sparks.
vllm serve Qwen/Qwen3.5-379B-Instruct-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 3 \
    --quantization nvfp4 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768 \
    --trust-remote-code
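Loading a model this size takes a while, and requests sent too early are refused. vLLM's OpenAI-compatible server exposes a /health endpoint that returns 200 once the server is up, so a small polling loop can block until it is ready. A sketch, assuming it runs on Spark-01 (substitute the head node's IP otherwise):

```shell
# Poll the vLLM /health endpoint until the server answers.
# -s silences progress output; -f makes curl exit nonzero on HTTP errors.
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "vLLM still loading..."
  sleep 10
done
echo "vLLM server is up"
```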
While the model is loading (which can take several minutes), verify that all 12 GPUs across the 3 nodes are registered with the cluster. Note that this reports registered resources, not live utilization; run nvidia-smi on each node to confirm the GPUs are actually busy.
ray status
Once the vLLM server is listening on port 8000, send a chat completion request to confirm that activations are flowing through the pipeline from Spark-01 to 02 to 03.
curl http://<SPARK_01_IP>:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3.5-379B-Instruct-NVFP4",
          "messages": [
            {"role": "system", "content": "You are an expert cluster architect."},
            {"role": "user", "content": "Explain pipeline parallelism across 3 physical nodes."}
          ],
          "max_tokens": 512
        }'
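The raw response is verbose JSON. To print only the generated text, save the curl output (add `-o response.json` to the command above) and extract the message with a short Python one-liner; the field path follows the OpenAI chat completions schema.

```shell
# Print just the assistant's reply from a saved chat-completions response.
# Path: choices[0].message.content (OpenAI chat completions schema).
python3 -c 'import json; \
print(json.load(open("response.json"))["choices"][0]["message"]["content"])'
```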