# Nakshatra

> Distributed LLM inference across heterogeneous workers (NVIDIA / AMD / Apple Silicon / CPU). Splits one model by layer ranges; patched llama.cpp + gRPC chain protocol. (Inspired by [Petals](README_PETALS.md); v0.1 design is independent.)

Status (2026-05-06): **v0.1 functionally alive.** Two-worker cluster on Tailscale produces the same top-1 token as a single-machine `llama-cli` reference. See [`experiments/v0.0/m6_findings.md`](experiments/v0.0/m6_findings.md) for the empirical result.

**Paper:** [Nakshatra: Vendor-Agnostic Distributed Inference on Heterogeneous Consumer Hardware](https://pnl.market/research/6a017d83b86a1bf1c69ea714) (Bastola, 2026).

---

## What this is

Nakshatra splits one transformer model across multiple machines. Each worker holds a contiguous range of the model's layers — say worker A holds layers 0–13 and worker B holds layers 14–27 of a 28-layer model. A request flows: tokenizer on the client → worker A computes its layers and emits a hidden-state vector → worker A ships the vector to worker B over the network → worker B finishes the layers, applies the language-model head, returns the next token.

Per-token network traffic is the size of one hidden-state vector (12 KB for a 3B model, 16 KB for a 70B model) crossing one hop per worker boundary, which is trivial bandwidth. Weights stay local on each worker; they are never streamed during inference.

The architectural commitments and roadmap live in [`docs/`](docs/). Start with [`petals-architecture.md`](docs/petals-architecture.md) for the design and [`v0.1-implementation-plan.md`](docs/v0.1-implementation-plan.md) for the milestones and acceptance test.

---

## Quickstart — stand up a 2-worker cluster

The walkthrough below takes a fresh pair of machines from "we have Python and a GGUF" to "we're seeing the right token come back." It targets the v0.1 §7 ship gate: under one hour for an external operator.

### Prerequisites

On **both** machines:

- Python 3.9+
- `git`, `cmake`, a C++17 compiler (gcc 13+ on Linux, Xcode CLT on macOS)
- About 4 GB of free disk per worker for a sub-GGUF of a 3B-class model (more for larger models)
- Tailscale or any other point-to-point IP transport between the two machines
- SSH between them (for shipping the sub-GGUF)

On the **first machine** only:

- The full GGUF of the model you want to run. Llama-family architectures only for v0.1. Tested with Llama-3.2-3B-class fine-tunes (28 layers, hidden_size=3072).

### 1. Clone and build patched llama.cpp on each machine

```bash
git clone https://github.com/ggml-org/llama.cpp.git ~/llama.cpp
cd ~/llama.cpp
git checkout c46583b   # or close enough; recent llama.cpp commits work
```

Apply the M3+M4 patches (in [`experiments/v0.0/m4_patches/`](experiments/v0.0/m4_patches/)) on each machine:

```bash
cd ~/llama.cpp
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-model.h.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-model.cpp.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-model-loader.cpp.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-graph.cpp.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/models_llama.cpp.patch
```

Set up the worker daemon as a CMake target by dropping the source into `examples/nakshatra-spike/`:

```bash
mkdir -p examples/nakshatra-spike
cp /path/to/nakshatra/experiments/v0.0/worker_daemon.cpp examples/nakshatra-spike/
cat > examples/nakshatra-spike/CMakeLists.txt <<EOF
set(TARGET llama-nakshatra-worker)
add_executable(\${TARGET} worker_daemon.cpp)
install(TARGETS \${TARGET} RUNTIME)
target_link_libraries(\${TARGET} PRIVATE common llama \${CMAKE_THREAD_LIBS_INIT})
target_compile_features(\${TARGET} PRIVATE cxx_std_17)
EOF

# Add nakshatra-spike to examples/CMakeLists.txt's add_subdirectory list, then:
mkdir -p build && cd build
cmake -DGGML_METAL=ON ..   # macOS Metal. For Linux: -DGGML_HIPBLAS=ON (ROCm) or -DGGML_CUDA=ON (NVIDIA).
                           # For deterministic regression tests only, build with ALL GPU backends OFF.
cmake --build . --target llama-nakshatra-worker -j
```

You should now have `~/llama.cpp/build/bin/llama-nakshatra-worker`.

### 2. Generate sub-GGUFs (on the machine that has the full model)

```bash
cd /path/to/nakshatra
python3 -m venv ~/nakshatra-venv && source ~/nakshatra-venv/bin/activate
pip install ~/llama.cpp/gguf-py grpcio grpcio-tools pyyaml

mkdir -p /tmp/cuts   # make sure the output directory exists
python experiments/v0.0/partial_gguf.py \
    /path/to/full/model.gguf \
    /tmp/cuts/w0.gguf --start 0 --end 14
python experiments/v0.0/partial_gguf.py \
    /path/to/full/model.gguf \
    /tmp/cuts/wlast.gguf --start 14 --end 28 --keep-token-embd
```

The `--keep-token-embd` flag on the last worker is required for **tied-embedding** models (Llama-3.2 family included), where the lm_head falls back to using token_embd as its output projection.

Cut points (`--start`, `--end`) must form a contiguous partition of the model's layers: the first cut starts at 0, each worker's `--end` equals the next worker's `--start`, and the last `--end` equals the layer count. For an N-layer model with K workers, divide as evenly as you can.

### 3. Ship the back-half sub-GGUF to the second machine

```bash
scp /tmp/cuts/wlast.gguf user@host2:/tmp/wlast.gguf
```

### 4. Generate gRPC stubs and copy worker scripts

On **both** machines:

```bash
cd /path/to/nakshatra
bash scripts/generate.sh   # produces scripts/nakshatra_pb2*.py
```

### 5. Write the cluster YAML

```yaml
# scripts/cluster.yaml — adjust addresses, ports, paths to your setup
model:
  id: my-model-q4
  hidden_size: 3072        # match the model
  num_blocks: 28
  wire_dtype: f32

workers:
  - id: machine-a
    address: 100.X.Y.Z     # Tailscale IP of host 1
    port: 5530
    layer_range: [0, 14]
    sub_gguf_path: /tmp/cuts/w0.gguf
    mode: first
  - id: machine-b
    address: 100.A.B.C     # Tailscale IP of host 2
    port: 5531
    layer_range: [14, 28]
    sub_gguf_path: /tmp/wlast.gguf
    mode: last
```

The first worker must have `mode: first`, the last `mode: last`, and any intermediates `mode: middle`. Layer ranges must form a contiguous `[0, num_blocks)` partition with no gaps.

### 6. Start the workers

On **host 1**:

```bash
python scripts/worker.py --port 5530 \
    --sub-gguf /tmp/cuts/w0.gguf --mode first \
    --layer-start 0 --layer-end 14 --model-id my-model-q4 --n-ctx 256
```

On **host 2**:

```bash
python scripts/worker.py --port 5531 \
    --sub-gguf /tmp/wlast.gguf --mode last \
    --layer-start 14 --layer-end 28 --model-id my-model-q4 --n-ctx 256 \
    --daemon-bin /Users/.../llama.cpp/build/bin/llama-nakshatra-worker
```

Each worker prints `M5 listening on :PORT` once its daemon has loaded the sub-GGUF (a few seconds for 3B-class models).

### 7. Run the chain

From any machine that can reach both Tailscale IPs:

```bash
python scripts/client.py --config scripts/cluster.yaml \
    --model-path /path/to/full/model.gguf \
    --prompt "The capital of France is" --max-tokens 10
```

Expected output:

```
[chain] OK: contiguous coverage of [0, 28)
[chain] step 1: id=12366 ' Paris'
[chain] step 2: id=13 '.'
...
[chain] generated 10 tokens in 3.7s (2.7 tok/s)
[chain] full: 'The capital of France is Paris. The capital of France is Paris. The'
TOPTOKS_CHAIN 12366 13 578 6864 315 9822 374 12366 13 578
```

If the first generated token matches what `llama-cli` produces on the same prompt with greedy decoding, the cluster is operating correctly.

### Pre-flight check

Before starting the chain, validate your YAML and worker reachability:

```bash
python scripts/validate_cluster.py --config scripts/cluster.yaml
```

This connects to each worker, verifies `Info` returns the expected layer range, checks the partition is contiguous, and reports the first/last worker flags. It does **not** load any models — it's a fast network + config sanity check.

---

## Repository layout

```
docs/
├── petals-architecture.md          v0.1 design (the "what")
├── path-a-vs-path-b-memo.md        C++ feasibility memo (the "why this approach")
├── petals-deep-read.md             upstream-Petals source-reading notes
├── north-star.md                   L1-L4 vision (substrate / engine / OS / agents)
├── v0.0-validation-plan.md         Phase 0 gates (both RESOLVED)
├── v0.1-implementation-plan.md     7 milestones, ship gate, resolved decisions
└── m4-decode-patch-design.md       M4 patch points

experiments/v0.0/
├── partial_gguf.py                 sub-GGUF generator
├── partial_gguf_findings.md        Phase 0a evidence — loader rejects partials
├── spike.cpp + spike_findings.md   Phase 0b evidence — orchestration protocol
├── cb_eval_probe.py + findings     Python pivot story
├── m4_patches/                     the 5 patches that make llama.cpp do partial-load
├── m4_chain.cpp + findings         M4 single-process chain validation
├── worker_daemon.cpp               C++ daemon that runs patched libllama
├── m5_findings.md                  M5 — gRPC chain on localhost
└── m6_findings.md                  M6 — cross-machine acceptance test

proto/
└── nakshatra.proto                 v0.1 wire contract

scripts/
├── worker.py                       Python gRPC worker (spawns C++ daemon)
├── client.py                       chain-walking client
├── validate_cluster.py             pre-flight YAML + reachability check
├── generate.sh                     regenerates Python protobuf stubs
├── cluster_localhost.yaml          example: 2 workers on one host
└── cluster_crossmachine.yaml       example: 2 workers on Tailscale
```

## Architecture in one paragraph

Each worker is a Python gRPC process that spawns a long-lived C++ daemon (`llama-nakshatra-worker`). The daemon holds the model slice and KV cache, accepts framed binary messages over stdin/stdout, and runs `llama_decode` per request. The Python worker pumps gRPC requests through to the daemon and back. The client tokenizes the prompt locally, calls the first worker with token IDs, ferries the returned hidden state through any middle workers, and gets back a token id from the last worker. The v0.3 federation extends this with sub-GGUF auto-fetch (workers download missing slices from peers), latency-aware chain assembly via a pillar registry, and Metal / ROCm GPU offload across heterogeneous machines. **Outputs on GPU paths are reproducible *in distribution*, not byte-for-byte**, because of kernel-level non-determinism in current backends. Bit-identical reproducibility is available via CPU-only workers (`--gpu-backend cpu`), kept for regression tests. See `docs/v0.5-design-lock.md` for the full property statement.

## Status — what's shipped, what's coming

**Shipped (v0.1 → v0.3):**

- GPU acceleration — Metal on Macs, ROCm on Linux. Live on the 5-machine lab cluster.
- Streaming KV-cache reuse (M2.5) — 5× speedup on multi-token generation vs naive re-prefill.
- Sub-GGUF distribution: workers advertise their cached slices and auto-fetch missing ones from peers over HTTP byte-range (Phase 4 / 4a).
- Latency-aware chain assembly via a pillar registry (Phase H / I).
- `/healthz` endpoint per worker with rich state (identity, daemon liveness, recent RPC latency, GPU offload, GPU inventory via ioreg).

**In progress (v0.5 protocol foundations, partial):**

- `Inference` streaming RPC (`--use-streaming`) — bidi streams replace per-step Forward calls (M0.5.1).
- Idempotency cache on workers (M0.5.2) — a replayed `(session_id, step_id)` returns the cached response without re-decoding.
- Server-to-server activation push (`--use-streaming-push`, M0.5.3 v1) — 2-worker chains working today; multi-hop is v2.
- Client-side recovery on stream failure (M0.5.4 v0) — replays history through fresh streams; continuation is plausible but not bit-identical (per the non-determinism property).

**Deferred:**

- Alternate-worker recovery from a registry of redundant peers (M0.5.4 v1).
- Latency-based silent-failure detection (M0.5.4 v2).
- Non-Llama model architectures — the partial-load patches today gate only `models_llama.cpp`; Qwen3 / Gemma / etc. need their own per-arch patch.
- DHT-based public-network peer discovery (v1.0+).
- Opt-in cryptographic verification (v1.0+).

See [`docs/v0.5-design-lock.md`](docs/v0.5-design-lock.md) for the v0.5 design contract and acceptance criteria.

## License & attribution

Forked from [Petals](README_PETALS.md). The Nakshatra additions (everything in `docs/`, `experiments/v0.0/`, `proto/`, `scripts/`, plus the patches in `experiments/v0.0/m4_patches/`) are MIT-licensed (per the upstream Petals repository's license).
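
## Appendix: layer-partition sketch

The Quickstart asks for cut points that divide the model "as evenly as you can" and for layer ranges that form a contiguous `[0, num_blocks)` partition. The snippet below is a minimal sketch of both rules, assuming half-open `[start, end)` ranges as in the example YAML. The helper names `even_layer_ranges` and `is_contiguous_partition` are hypothetical and not part of this repository; `scripts/validate_cluster.py` may well implement its check differently.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end), as in the cluster YAML


def even_layer_ranges(num_blocks: int, num_workers: int) -> List[Range]:
    """Split [0, num_blocks) into num_workers contiguous, near-equal ranges.

    Hypothetical helper: earlier workers absorb the remainder when the
    layer count does not divide evenly.
    """
    if not 1 <= num_workers <= num_blocks:
        raise ValueError("need 1 <= num_workers <= num_blocks")
    base, extra = divmod(num_blocks, num_workers)
    ranges: List[Range] = []
    start = 0
    for i in range(num_workers):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def is_contiguous_partition(ranges: List[Range], num_blocks: int) -> bool:
    """True if ranges, in chain order, cover [0, num_blocks) with no gaps or overlaps."""
    cursor = 0
    for start, end in ranges:
        if start != cursor or end <= start:
            return False
        cursor = end
    return cursor == num_blocks


if __name__ == "__main__":
    cuts = even_layer_ranges(28, 2)
    print(cuts)                                  # [(0, 14), (14, 28)]
    print(is_contiguous_partition(cuts, 28))     # True
    print(is_contiguous_partition([(0, 14), (15, 28)], 28))  # False: layer 14 is missing
```

For the 28-layer, 2-worker example in the Quickstart this reproduces the `--start 0 --end 14` and `--start 14 --end 28` cuts.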