Before starting the Ray cluster, purge the memory caches on all three Sparks; stale caches often cause contiguous memory allocation failures during vLLM's memory-profiling phase. Writing 3 to vm.drop_caches frees both the page cache and reclaimable slab objects (dentries and inodes); sync first so dirty pages are flushed to disk. Run this on every node:
sudo sync; sudo sysctl -w vm.drop_caches=3
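If the Sparks are reachable over SSH, the purge can be scripted from a single machine. This is a minimal sketch; the hostnames spark-01 through spark-03 are placeholders for your actual node names, and it assumes passwordless sudo for the sysctl call.

```shell
# Hypothetical hostnames -- substitute your own.
NODES="spark-01 spark-02 spark-03"

for node in $NODES; do
  # Flush dirty pages, then drop page cache + dentries/inodes on each Spark.
  ssh "$node" 'sudo sync && sudo sysctl -w vm.drop_caches=3'
done
```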
Establish the communication layer. The head node coordinates the pipeline, while the workers attach to it.
First, start the Head node on Spark-01:
ray start --head --port=6379 --num-gpus=4
Next, attach the workers by running the following on each of Spark-02 and Spark-03:
ray start --address=<SPARK_01_IP>:6379 --num-gpus=4
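Before moving on, it is worth confirming that both workers actually joined. A quick sketch, run on any node in the cluster (it assumes the same Ray version is installed everywhere): `ray.nodes()` returns one info dict per node, and counting the alive ones should give 3.

```shell
# Connect to the running cluster and count alive nodes; expect 3.
python3 -c "import ray; ray.init(address='auto'); \
print('alive nodes:', sum(n['Alive'] for n in ray.nodes()))"
```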
Invoke vLLM on the head node (Spark-01); Ray distributes the weights across all 12 GPUs. We use --tensor-parallel-size 4 to shard each layer across the 4 GPUs within a node, keeping the bandwidth-hungry tensor-parallel traffic on the fast intra-node links, and --pipeline-parallel-size 3 to split the layer stack across the 3 physical Sparks.
vllm serve Qwen/Qwen3.5-379B-Instruct-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 3 \
    --quantization nvfp4 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768 \
    --trust-remote-code
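Loading a model this size takes a while, and requests sent too early are refused. vLLM's OpenAI-compatible server exposes a /health endpoint that returns 200 once the server is up, so a small polling loop can block until it is ready. A sketch, assuming it runs on Spark-01 (substitute the head node's IP otherwise):

```shell
# Poll the vLLM /health endpoint until the server answers.
# -s silences progress output; -f makes curl exit nonzero on HTTP errors.
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "vLLM still loading..."
  sleep 10
done
echo "vLLM server is up"
```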
While the model is loading (which can take several minutes), verify that all 12 GPUs across the 3 nodes are registered with the cluster. Note that this reports registered resources, not live utilization; run nvidia-smi on each node to confirm the GPUs are actually busy.
ray status
Once the vLLM server is listening on port 8000, send a chat completion request to confirm that activations are flowing through the pipeline from Spark-01 to 02 to 03.
curl http://<SPARK_01_IP>:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3.5-379B-Instruct-NVFP4",
          "messages": [
            {"role": "system", "content": "You are an expert cluster architect."},
            {"role": "user", "content": "Explain pipeline parallelism across 3 physical nodes."}
          ],
          "max_tokens": 512
        }'
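The raw response is verbose JSON. To print only the generated text, save the curl output (add `-o response.json` to the command above) and extract the message with a short Python one-liner; the field path follows the OpenAI chat completions schema.

```shell
# Print just the assistant's reply from a saved chat-completions response.
# Path: choices[0].message.content (OpenAI chat completions schema).
python3 -c 'import json; \
print(json.load(open("response.json"))["choices"][0]["message"]["content"])'
```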