Bond the dual MikroTik switch topology across all eight SFP56 ports, with the QSFP-DD link serving as the inter-switch interconnect. Apply jumbo frames (MTU 9000) to the primary high-speed interface on all 8 nodes.
network:
  version: 2
  ethernets:
    enp1s0f1np0:
      dhcp4: false
      addresses:
        - 192.168.100.1X/24   # Replace X with node designation (1-8)
      mtu: 9000
      # The 192.168.100.0/24 subnet is directly connected once the address
      # is assigned, so no static route via a gateway is needed for it.
      nameservers:
        addresses: [8.8.8.8, 1.1.1.1]
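After applying the netplan config (`sudo netplan apply`) on at least two nodes, jumbo frames can be sanity-checked end to end with a don't-fragment ping. This is a sketch: the peer address `192.168.100.12` is an example worker IP, and the 8972-byte payload is the 9000-byte MTU minus 20 bytes of IP header and 8 bytes of ICMP header.

```shell
# Assumed names: enp1s0f1np0 as the high-speed interface,
# 192.168.100.12 as an example peer node on the data plane
IFACE=enp1s0f1np0
PEER=192.168.100.12

# Max ICMP payload for a 9000-byte MTU: 9000 - 20 (IP) - 8 (ICMP) = 8972
PAYLOAD=$((9000 - 20 - 8))

# -M do sets the Don't-Fragment bit, so oversized frames fail loudly
# instead of being silently fragmented by the kernel
if ping -c 3 -M do -s "$PAYLOAD" "$PEER" >/dev/null 2>&1; then
  echo "jumbo frames OK ($PAYLOAD-byte payload to $PEER)"
else
  echo "jumbo frames NOT passing; check MTU on $IFACE and the switch path" >&2
fi
```

A reply at this payload size proves the full path (NIC, both switches, peer NIC) honors MTU 9000; a plain `ping` without `-M do` would pass even on a misconfigured path.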
The DGX Spark's URAM is unified between the GPU and the host OS, so host overhead directly reduces what the GPU can allocate. Aggressively drop kernel caches and disable the GUI to keep ~119.6 GiB usable per node for the FP16/BF16 tensor allocation.
# Disable GUI target to save ~4GB URAM
sudo systemctl set-default multi-user.target
sudo systemctl isolate multi-user.target
# Force kernel memory cache flush
sync && sudo sysctl -w vm.drop_caches=3
# Verify the available memory pool exceeds 118 GiB
free -h
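The eyeball check with `free -h` can be scripted into a pre-flight gate. A minimal sketch, using the 118 GiB floor from the comment above (the threshold value is an assumption taken from this guide, not a hard platform limit):

```shell
# Fail loudly if available memory is below the 118 GiB floor
THRESHOLD=$((118 * 1024 * 1024 * 1024))   # 118 GiB in bytes

# Column 7 of the "Mem:" row of `free -b` is the kernel's "available" estimate
AVAILABLE=$(free -b 2>/dev/null | awk '/^Mem:/ {print $7}')

if [ "$AVAILABLE" -ge "$THRESHOLD" ] 2>/dev/null; then
  echo "OK: $AVAILABLE bytes available"
else
  echo "FAIL: available memory ($AVAILABLE bytes) below 118 GiB floor" >&2
fi
```

Running this on every node before starting Ray catches a forgotten `systemctl isolate` early, instead of failing mid-model-load.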
Establish the Ray cluster fabric. You must bind the head node explicitly to the enp1s0f1np0 interface to prevent NCCL traffic from routing over the management network.
# Pull the DGX Spark-optimized vLLM image
docker pull scitrera/dgx-spark-vllm:0.16.0-t5
# Define High-Speed Interface IP
export HEAD_NODE_IP=$(ip -4 addr show enp1s0f1np0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
# Start Ray Head
ray start --head --node-ip-address=${HEAD_NODE_IP} --port=6379 --num-cpus=32 --num-gpus=1
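Before attaching workers, it helps to confirm the head's GCS port is actually listening on the data-plane address rather than the management interface. A sketch using bash's `/dev/tcp` (a bash-only feature; the helper name and timeout are illustrative):

```shell
# Poll a TCP port until it opens or the timeout (seconds) expires
wait_for_port() {
  local host=$1 port=$2 deadline=$((SECONDS + ${3:-15}))
  while [ "$SECONDS" -lt "$deadline" ]; do
    # the subshell closes the socket again as soon as it exits
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Verify Ray's GCS port on the head's high-speed IP
wait_for_port "${HEAD_NODE_IP:-127.0.0.1}" 6379 5 \
  && echo "Ray head reachable on port 6379" \
  || echo "Ray head not reachable; check interface binding" >&2
```

If this fails from a worker but succeeds locally on the head, the head bound to the wrong interface; re-check the `--node-ip-address` value.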
# On each worker node: define the head's IP and the local high-speed interface
export HEAD_NODE_IP="192.168.100.11" # Node 1's high-speed IP
export WORKER_IP=$(ip -4 addr show enp1s0f1np0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
# Attach Worker to cluster
ray start --address=${HEAD_NODE_IP}:6379 --node-ip-address=${WORKER_IP} --num-cpus=32 --num-gpus=1
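Once the last worker has joined, the cluster can be verified from any node. The grep below assumes `ray status` prints one `node_<id>` entry per active node, which holds for recent Ray releases but may vary across versions:

```shell
# Expect 8 active nodes (1 head + 7 workers) before deploying the model
EXPECTED_NODES=8
ACTIVE=$(ray status 2>/dev/null | grep -c 'node_' || true)

if [ "$ACTIVE" -eq "$EXPECTED_NODES" ]; then
  echo "cluster complete: $ACTIVE/$EXPECTED_NODES nodes"
else
  echo "cluster incomplete: $ACTIVE/$EXPECTED_NODES nodes" >&2
fi
```

Deploying with fewer than 8 nodes present makes the `--tensor-parallel-size 8` launch hang waiting for placement, so gate on this check first.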
Deploy the inference server from the head node using the same container image. The command below shards the model with tensor parallelism across all 8 nodes (--tensor-parallel-size 8) over the Ray backend, and strictly enforces BF16 precision to maintain reasoning capabilities without NVFP4 quantization loss.
# NCCL_SOCKET_IFNAME pins NCCL traffic to the high-speed netdev;
# note NCCL_IB_HCA expects an HCA name such as mlx5_0, not an interface name
docker run --privileged --gpus all -it --rm \
  --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e NCCL_SOCKET_IFNAME=enp1s0f1np0 \
  -e NCCL_IB_GID_INDEX=3 \
  scitrera/dgx-spark-vllm:0.16.0-t5 \
  vllm serve Qwen/Qwen3.5-397B-A17B \
    --dtype bfloat16 \
    --tensor-parallel-size 8 \
    --distributed-executor-backend ray \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768 \
    --port 8000 --host 0.0.0.0 \
    --trust-remote-code
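Once the server reports readiness, it exposes vLLM's OpenAI-compatible API on port 8000. A smoke test from any machine on the data network (the head IP below is the example Node 1 address from earlier):

```shell
# Request body; the model field must match the served model ID
BODY='{
  "model": "Qwen/Qwen3.5-397B-A17B",
  "messages": [{"role": "user", "content": "Reply with the word ready."}],
  "max_tokens": 16,
  "temperature": 0
}'

# POST to the OpenAI-compatible chat endpoint on the head node
curl -s --connect-timeout 5 http://192.168.100.11:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$BODY" \
  || echo "request failed; is the server up and reachable?" >&2
```

A JSON response with a `choices` array confirms the full path: network fabric, Ray placement, and the TP=8 model shards all working together.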