8-Spark GB10 FP16/BF16 Deployment

Distributed Hugging Face Container Deployment via Ray Fabric
Architectural Substrate

This playbook orchestrates an 8-node DGX Spark cluster utilizing the scitrera/dgx-spark-vllm:0.16.0-t5 Hugging Face container. It targets large-parameter models (e.g., Qwen3.5 397B, GLM-4.7) that require full BF16 mathematical coherence without FP4 quantization degradation. Total effective VRAM pool: 956 GiB.
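A quick back-of-the-envelope check shows why the 956 GiB pool is sufficient: at BF16 (2 bytes per parameter), the full weight set of a 397B-parameter model (the MoE model above loads all expert weights, not just the active ones) occupies roughly 740 GiB, leaving headroom for KV cache and activations. The figures below are derived from the numbers in this playbook, not measured:

```shell
# BF16 weight footprint for a 397B-parameter model, in GiB (2 bytes/param)
awk 'BEGIN{printf "weights: %.1f GiB\n", 397e9 * 2 / 2^30}'

# Headroom left in the 956 GiB cluster pool after weights
awk 'BEGIN{printf "headroom: %.1f GiB\n", 956 - (397e9 * 2 / 2^30)}'
```

The ~216 GiB of headroom is what the KV cache and CUDA graphs draw from under --gpu-memory-utilization 0.95.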
01 Network Fabric: 800Gbps ConnectX-7 Initialization

Ensure the dual MikroTik switch topology is bonded using all 8x SFP56 ports alongside the QSFP-DD interconnect. Apply jumbo frames (MTU 9000) to the primary high-speed interfaces across all 8 nodes.

/etc/netplan/80-cx7-fabric.yaml (Apply to Nodes 1-8)
network:
  version: 2
  ethernets:
    enp1s0f1np0:
      dhcp4: false
      addresses:
        - 192.168.100.1X/24  # Replace X with node designation (1-8)
      mtu: 9000
      nameservers:
        addresses: [8.8.8.8, 1.1.1.1]
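After writing the file on each node, apply it and confirm the MTU actually took effect. A don't-fragment ping at the maximum payload for a 9000-byte MTU (9000 minus the 20-byte IPv4 and 8-byte ICMP headers) verifies jumbo frames end to end; 192.168.100.11 below assumes Node 1's address under the 192.168.100.1X scheme:

```shell
# Apply the netplan configuration (run on each node)
sudo netplan apply

# Confirm the fabric interface came up with MTU 9000
ip link show enp1s0f1np0 | grep -q 'mtu 9000' && echo "MTU OK"

# Max ICMP payload for a 9000-byte MTU: 9000 - 20 (IPv4) - 8 (ICMP)
PAYLOAD=$((9000 - 20 - 8))

# Don't-fragment ping to Node 1; fails if any hop lacks jumbo frames
ping -M do -c 3 -s "${PAYLOAD}" 192.168.100.11
```

If the ping reports "Message too long", a switch port or peer interface is still at the default 1500 MTU.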
02 Memory Reclamation & Host OS Stabilization

The DGX Spark's URAM is highly sensitive to host OS overhead. You must aggressively drop kernel caches to guarantee ~119.6 GiB usable memory per node for the FP16/BF16 tensor allocation.

Critical Constraint: Failure to execute cache drops will result in Ray worker OOM (Out of Memory) evictions when loading 300B+ parameter weights in BF16.
Bash Executable (Run on all 8 nodes)
# Disable GUI target to save ~4GB URAM
sudo systemctl set-default multi-user.target
sudo systemctl isolate multi-user.target

# Force kernel memory cache flush
sync && sudo sysctl -w vm.drop_caches=3

# Verify available memory pool exceeds 118GiB
free -h
03 Ray Orchestrator Initialization

Establish the Ray cluster fabric. You must bind the head node explicitly to the enp1s0f1np0 interface to prevent NCCL traffic from routing over the management network.

Node 1 (Head Node) Execution
# Pull the optimal Hugging Face vLLM image
docker pull scitrera/dgx-spark-vllm:0.16.0-t5

# Define High-Speed Interface IP
export HEAD_NODE_IP=$(ip -4 addr show enp1s0f1np0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')

# Start Ray Head
ray start --head --node-ip-address=${HEAD_NODE_IP} --port=6379 --num-cpus=32 --num-gpus=1
Nodes 2-8 (Worker Node) Execution
# Define target Head IP and local interface
export HEAD_NODE_IP="192.168.100.11" # Set to Node 1's fabric IP (per the 192.168.100.1X scheme)
export WORKER_IP=$(ip -4 addr show enp1s0f1np0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')

# Attach Worker to cluster
ray start --address=${HEAD_NODE_IP}:6379 --node-ip-address=${WORKER_IP} --num-cpus=32 --num-gpus=1
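Once all workers have attached, verify from Node 1 that the cluster reports the full resource pool before launching anything. The grep-based node count below assumes the output format of Ray's state CLI (ray list nodes), which may vary across Ray versions:

```shell
# Expected cluster-wide resources: 8 nodes x 1 GB10 GPU each
EXPECTED_GPUS=$((8 * 1))
echo "expecting ${EXPECTED_GPUS} GPUs and 8 nodes"

# Human-readable summary; should show 8.0 total GPUs
ray status

# Hard check: count nodes registered as ALIVE with the head
ALIVE=$(ray list nodes 2>/dev/null | grep -c ALIVE || true)
echo "alive nodes: ${ALIVE}"
```

If fewer than 8 nodes show as alive, recheck that each worker bound to its enp1s0f1np0 address rather than the management interface.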
04 Distributed vLLM Launch (BF16 Precision)

Deploy the inference server from the Head Node using the Hugging Face container. This command utilizes Tensor Parallelism across all 8 nodes (--tensor-parallel-size 8, one GPU per node) and strictly enforces BF16 precision to maintain reasoning capabilities without NVFP4 quantization loss.

vLLM Serve Command (Execute on Node 1)
docker run --privileged --gpus all -it --rm \
  --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e NCCL_SOCKET_IFNAME=enp1s0f1np0 \
  -e NCCL_IB_GID_INDEX=3 \
  scitrera/dgx-spark-vllm:0.16.0-t5 \
  vllm serve Qwen/Qwen3.5-397B-A17B \
  --dtype bfloat16 \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --port 8000 --host 0.0.0.0 \
  --trust-remote-code
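Weight loading at this scale takes a while; once the server is up it exposes vLLM's standard OpenAI-compatible API on port 8000. A minimal smoke test, run from any node on the fabric and assuming Node 1 at 192.168.100.11:

```shell
# Wait for the server's health endpoint, then issue one short completion
HOST="192.168.100.11"

until curl -sf "http://${HOST}:8000/health" > /dev/null; do
  echo "waiting for vLLM..."
  sleep 10
done

# Request body; the model name must match the served model exactly
BODY='{"model": "Qwen/Qwen3.5-397B-A17B", "prompt": "BF16 vs FP4 in one sentence:", "max_tokens": 48}'

curl -s "http://${HOST}:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d "${BODY}"
```

A JSON response containing a choices array confirms the full path: fabric, Ray scheduling, and the BF16 tensor-parallel shards.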