# Nakshatra

> Distributed LLM inference across heterogeneous workers (NVIDIA / AMD / Apple Silicon / CPU). Splits one model by layer ranges; patched llama.cpp + gRPC chain protocol. (Inspired by [Petals](README_PETALS.md); v0.1 design is independent.)

Status (2026-05-06): **v0.1 functionally alive.** Two-worker cluster on Tailscale produces the same top-1 token as a single-machine `llama-cli` reference. See [`experiments/v0.0/m6_findings.md`](experiments/v0.0/m6_findings.md) for the empirical result.

**Paper:** [Nakshatra: Vendor-Agnostic Distributed Inference on Heterogeneous Consumer Hardware](https://pnl.market/research/6a017d83b86a1bf1c69ea714) (Bastola, 2026).

---

## What this is

Nakshatra splits one transformer model across multiple machines. Each worker holds a contiguous range of the model's layers — say worker A holds layers 0–13 and worker B holds layers 14–27 of a 28-layer model. A request flows: tokenizer on the client → worker A computes its layers and emits a hidden-state vector → worker A ships the vector to worker B over the network → worker B finishes the layers, applies the language-model head, returns the next token.
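
The split can be pictured with a toy model. The sketch below is not Nakshatra code: `make_layer`, `run_layers`, and the arithmetic inside each layer are hypothetical stand-ins for real transformer blocks. It only illustrates that running layers 0–13 in one place and layers 14–27 in another gives the same hidden state as running all 28 together.

```python
from typing import Callable, List

Hidden = List[float]
Layer = Callable[[Hidden], Hidden]


def make_layer(index: int) -> Layer:
    """Return a toy 'layer': a deterministic function of the hidden state."""
    def layer(h: Hidden) -> Hidden:
        return [0.5 * x + index for x in h]
    return layer


def run_layers(layers: List[Layer], start: int, end: int, h: Hidden) -> Hidden:
    """Apply layers[start:end] in order, as one worker would for its slice."""
    for layer in layers[start:end]:
        h = layer(h)
    return h


layers = [make_layer(i) for i in range(28)]
h0 = [1.0, -2.0, 3.0]

single = run_layers(layers, 0, 28, h0)          # single-machine reference
after_a = run_layers(layers, 0, 14, h0)         # worker A: layers 0-13
chained = run_layers(layers, 14, 28, after_a)   # worker B: layers 14-27

print(chained == single)
```

The only thing that crosses the boundary is `after_a`, which plays the role of the hidden-state vector sent from worker A to worker B.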

Per-token network traffic is one hidden-state vector per worker boundary, that is `hidden_size` values at the wire dtype: 12 KB for a 3B model (hidden_size=3072 at 4 bytes per value) and 16 KB for a 70B model (hidden_size=8192 at 2 bytes per value; 32 KB at `f32`). That is trivial bandwidth. Weights stay local on each worker; they are never streamed during inference.
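
The traffic figures follow from simple arithmetic. The sketch below is illustrative only: `hidden_state_bytes` and `bytes_per_token` are hypothetical helpers, the `f16` entry assumes a 2-byte wire dtype, and `hidden_size=8192` for a 70B-class Llama model is a fact about that model family rather than something stated in this README.

```python
# Bytes per element for the wire dtypes considered in this sketch.
DTYPE_BYTES = {"f32": 4, "f16": 2}


def hidden_state_bytes(hidden_size: int, wire_dtype: str = "f32") -> int:
    """Size of one hidden-state vector on the wire."""
    return hidden_size * DTYPE_BYTES[wire_dtype]


def bytes_per_token(hidden_size: int, num_workers: int, wire_dtype: str = "f32") -> int:
    """One hidden state crosses each worker boundary once per generated token."""
    return hidden_state_bytes(hidden_size, wire_dtype) * (num_workers - 1)


print(hidden_state_bytes(3072, "f32") / 1024)   # 12.0 (KB, 3B-class model)
print(hidden_state_bytes(8192, "f16") / 1024)   # 16.0 (KB, 70B-class model)
print(hidden_state_bytes(8192, "f32") / 1024)   # 32.0 (KB, 70B-class model)
print(bytes_per_token(3072, num_workers=2))     # 12288 bytes per token
```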

The architectural commitments and roadmap live in [`docs/`](docs/). Start with [`petals-architecture.md`](docs/petals-architecture.md) for the design and [`v0.1-implementation-plan.md`](docs/v0.1-implementation-plan.md) for the milestones and acceptance test.

---

## Quickstart — stand up a 2-worker cluster

The walkthrough below takes a fresh pair of machines from "we have Python and a GGUF" to "we're seeing the right token come back." It targets the v0.1 §7 ship gate: under one hour for an external operator.

### Prerequisites

On **both** machines:

- Python 3.9+
- `git`, `cmake`, a C++17 compiler (gcc 13+ on Linux, Xcode CLT on macOS)
- About 4 GB of free disk per worker for a sub-GGUF of a 3B-class model (more for larger models)
- Tailscale or any other point-to-point IP transport between the two machines
- SSH between them (for shipping the sub-GGUF)

On the **first machine** only:

- The full GGUF of the model you want to run. Llama-family architectures only for v0.1. Tested with Llama-3.2-3B-class fine-tunes (28 layers, hidden_size=3072).

### 1. Clone and build patched llama.cpp on each machine

```bash
git clone https://github.com/ggml-org/llama.cpp.git ~/llama.cpp
cd ~/llama.cpp
git checkout c46583b   # or close enough; recent llama.cpp commits work
```

Apply the M3+M4 patches (in [`experiments/v0.0/m4_patches/`](experiments/v0.0/m4_patches/)) on each machine:

```bash
cd ~/llama.cpp
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-model.h.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-model.cpp.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-model-loader.cpp.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-graph.cpp.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/models_llama.cpp.patch
```

Set up the worker daemon as a CMake target by dropping the source into `examples/nakshatra-spike/`:

```bash
mkdir -p examples/nakshatra-spike
cp /path/to/nakshatra/experiments/v0.0/worker_daemon.cpp examples/nakshatra-spike/
cat > examples/nakshatra-spike/CMakeLists.txt <<EOF
set(TARGET llama-nakshatra-worker)
add_executable(\${TARGET} worker_daemon.cpp)
install(TARGETS \${TARGET} RUNTIME)
target_link_libraries(\${TARGET} PRIVATE common llama \${CMAKE_THREAD_LIBS_INIT})
target_compile_features(\${TARGET} PRIVATE cxx_std_17)
EOF

# Add nakshatra-spike to examples/CMakeLists.txt's add_subdirectory list, then:
mkdir -p build && cd build
cmake -DGGML_METAL=ON ..   # macOS Metal. For Linux: -DGGML_HIPBLAS=ON (ROCm) or -DGGML_CUDA=ON (NVIDIA).
                           # Recent llama.cpp trees name the ROCm option -DGGML_HIP=ON.
                           # For deterministic regression tests only, build with ALL GPU backends OFF.
cmake --build . --target llama-nakshatra-worker -j
```

You should now have `~/llama.cpp/build/bin/llama-nakshatra-worker`.

### 2. Generate sub-GGUFs (on the machine that has the full model)

```bash
cd /path/to/nakshatra
python3 -m venv ~/nakshatra-venv && source ~/nakshatra-venv/bin/activate
pip install ~/llama.cpp/gguf-py grpcio grpcio-tools pyyaml

python experiments/v0.0/partial_gguf.py \
    /path/to/full/model.gguf \
    /tmp/cuts/w0.gguf --start 0 --end 14
python experiments/v0.0/partial_gguf.py \
    /path/to/full/model.gguf \
    /tmp/cuts/wlast.gguf --start 14 --end 28 --keep-token-embd
```

The `--keep-token-embd` on the last worker is required for **tied-embedding** models (Llama-3.2 family included) — the lm_head falls back to using token_embd as its output projection.

Cut points (`--start`, `--end`) must form a contiguous partition of the model's layers: the first slice starts at 0, each slice starts where the previous one ends, and the last slice ends at the layer count. As the commands above show, `--end` is exclusive. For an N-layer model with K workers, divide as evenly as you can.
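
One way to compute evenly sized cut points is sketched below. `even_cuts` is a hypothetical helper, not part of `partial_gguf.py`; it returns `(start, end)` pairs with an exclusive `end`, matching the `--start` / `--end` values used above, and gives any leftover layers to the earliest workers.

```python
from typing import List, Tuple


def even_cuts(num_layers: int, num_workers: int) -> List[Tuple[int, int]]:
    """Split [0, num_layers) into num_workers contiguous, near-equal ranges."""
    if not 1 <= num_workers <= num_layers:
        raise ValueError("need between 1 and num_layers workers")
    base, extra = divmod(num_layers, num_workers)
    cuts = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers each take one additional layer.
        end = start + base + (1 if worker < extra else 0)
        cuts.append((start, end))
        start = end
    return cuts


print(even_cuts(28, 2))   # [(0, 14), (14, 28)]
print(even_cuts(28, 3))   # [(0, 10), (10, 19), (19, 28)]
```

An even split treats all workers as equal. On a mixed cluster you may prefer to hand more layers to the machine with more memory; that weighting is outside this sketch.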

### 3. Ship the back-half sub-GGUF to the second machine

```bash
scp /tmp/cuts/wlast.gguf user@host2:/tmp/wlast.gguf
```

### 4. Generate gRPC stubs and copy worker scripts

On **both** machines, in a checkout of this repository (the worker scripts live in its `scripts/` directory) and with the Python packages from step 2 (`grpcio`, `grpcio-tools`, `pyyaml`) installed:

```bash
cd /path/to/nakshatra
bash scripts/generate.sh     # produces scripts/nakshatra_pb2*.py
```

### 5. Write the cluster YAML

```yaml
# scripts/cluster.yaml — adjust addresses, ports, paths to your setup
model:
  id: my-model-q4
  hidden_size: 3072       # match the model
  num_blocks: 28
  wire_dtype: f32

workers:
  - id: machine-a
    address: 100.X.Y.Z      # Tailscale IP of host 1
    port: 5530
    layer_range: [0, 14]
    sub_gguf_path: /tmp/cuts/w0.gguf
    mode: first
  - id: machine-b
    address: 100.A.B.C      # Tailscale IP of host 2
    port: 5531
    layer_range: [14, 28]
    sub_gguf_path: /tmp/wlast.gguf
    mode: last
```

The first worker must have `mode: first`, the last `mode: last`, intermediates `middle`. Layer ranges must form a contiguous `[0, num_blocks)` partition with no gaps.
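
These rules can be checked mechanically. The sketch below is not `scripts/validate_cluster.py`; `check_cluster` is a hypothetical helper that works on a dict shaped like the YAML above (for example the result of `yaml.safe_load`). It assumes workers are listed in chain order and that there are at least two of them.

```python
from typing import Any, Dict


def check_cluster(config: Dict[str, Any]) -> None:
    """Raise ValueError unless the workers tile [0, num_blocks) in order."""
    num_blocks = config["model"]["num_blocks"]
    workers = config["workers"]
    if len(workers) < 2:
        raise ValueError("this sketch assumes at least two workers")
    expected_start = 0
    for i, worker in enumerate(workers):
        start, end = worker["layer_range"]
        if start != expected_start:
            raise ValueError(
                f"{worker['id']}: starts at {start}, expected {expected_start}")
        if end <= start:
            raise ValueError(f"{worker['id']}: empty range [{start}, {end})")
        if i == 0:
            want = "first"
        elif i == len(workers) - 1:
            want = "last"
        else:
            want = "middle"
        if worker["mode"] != want:
            raise ValueError(f"{worker['id']}: mode should be {want!r}")
        expected_start = end
    if expected_start != num_blocks:
        raise ValueError(f"coverage ends at {expected_start}, not {num_blocks}")


config = {
    "model": {"num_blocks": 28},
    "workers": [
        {"id": "machine-a", "layer_range": [0, 14], "mode": "first"},
        {"id": "machine-b", "layer_range": [14, 28], "mode": "last"},
    ],
}
check_cluster(config)   # passes silently for the example cluster
```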

### 6. Start the workers

On **host 1**:

```bash
python scripts/worker.py --port 5530 \
    --sub-gguf /tmp/cuts/w0.gguf --mode first \
    --layer-start 0 --layer-end 14 --model-id my-model-q4 --n-ctx 256
```

On **host 2**:

```bash
python scripts/worker.py --port 5531 \
    --sub-gguf /tmp/wlast.gguf --mode last \
    --layer-start 14 --layer-end 28 --model-id my-model-q4 --n-ctx 256 \
    --daemon-bin /Users/.../llama.cpp/build/bin/llama-nakshatra-worker
```

Each worker prints `M5 listening on :PORT` once its daemon has loaded the sub-GGUF (a few seconds for 3B-class models).
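
The worker flags repeat values that already sit in the cluster YAML, so the two are easy to let drift apart. The sketch below derives each command line from a config entry. `worker_command` is a hypothetical helper that uses only the flags shown above; it leaves out `--daemon-bin` because that path is machine-specific.

```python
import shlex
from typing import Any, Dict, List


def worker_command(model_id: str, worker: Dict[str, Any], n_ctx: int = 256) -> List[str]:
    """Build the scripts/worker.py argument list for one cluster.yaml entry."""
    start, end = worker["layer_range"]
    return [
        "python", "scripts/worker.py",
        "--port", str(worker["port"]),
        "--sub-gguf", worker["sub_gguf_path"],
        "--mode", worker["mode"],
        "--layer-start", str(start),
        "--layer-end", str(end),
        "--model-id", model_id,
        "--n-ctx", str(n_ctx),
    ]


machine_a = {
    "port": 5530,
    "layer_range": [0, 14],
    "sub_gguf_path": "/tmp/cuts/w0.gguf",
    "mode": "first",
}
print(shlex.join(worker_command("my-model-q4", machine_a)))
```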

### 7. Run the chain

From any machine that can reach both Tailscale IPs:

```bash
python scripts/client.py --config scripts/cluster.yaml \
    --model-path /path/to/full/model.gguf \
    --prompt "The capital of France is" --max-tokens 10
```

Expected output:

```
[chain] OK: contiguous coverage of [0, 28)
[chain] step 1: id=12366 ' Paris'
[chain] step 2: id=13 '.'
...
[chain] generated 10 tokens in 3.7s  (2.7 tok/s)
[chain] full: 'The capital of France is Paris. The capital of France is Paris. The'
TOPTOKS_CHAIN 12366 13 578 6864 315 9822 374 12366 13 578
```

If the first generated token matches what `llama-cli` produces on the same prompt with greedy decoding, the cluster is operating correctly.
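
The comparison can be scripted by reading the `TOPTOKS_CHAIN` line from the client output. The sketch below is an illustration: `parse_toptoks` is a hypothetical helper, and `reference_first_id` must come from your own `llama-cli` run (the value here is copied from the sample output only so that the example runs).

```python
from typing import List


def parse_toptoks(output: str) -> List[int]:
    """Return the token ids listed on the TOPTOKS_CHAIN line of client output."""
    for line in output.splitlines():
        parts = line.split()
        if parts and parts[0] == "TOPTOKS_CHAIN":
            return [int(tok) for tok in parts[1:]]
    raise ValueError("no TOPTOKS_CHAIN line in output")


sample = """[chain] generated 10 tokens in 3.7s  (2.7 tok/s)
TOPTOKS_CHAIN 12366 13 578 6864 315 9822 374 12366 13 578
"""

chain_ids = parse_toptoks(sample)
reference_first_id = 12366   # placeholder: take this from your llama-cli run
print("match" if chain_ids[0] == reference_first_id else "MISMATCH")
```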

### Pre-flight check

Once the workers from step 6 are running, and before running the chain in step 7, validate your YAML and worker reachability:

```bash
python scripts/validate_cluster.py --config scripts/cluster.yaml
```

This connects to each worker, verifies `Info` returns the expected layer range, checks the partition is contiguous, and reports the first/last worker flags. It does **not** load any models; it's a fast network + config sanity check.

---

## Repository layout

```
docs/
├── petals-architecture.md          v0.1 design (the "what")
├── path-a-vs-path-b-memo.md        C++ feasibility memo (the "why this approach")
├── petals-deep-read.md             upstream-Petals source-reading notes
├── north-star.md                   L1-L4 vision (substrate / engine / OS / agents)
├── v0.0-validation-plan.md         Phase 0 gates (both RESOLVED)
├── v0.1-implementation-plan.md     7 milestones, ship gate, resolved decisions
└── m4-decode-patch-design.md       M4 patch points

experiments/v0.0/
├── partial_gguf.py                 sub-GGUF generator
├── partial_gguf_findings.md        Phase 0a evidence — loader rejects partials
├── spike.cpp + spike_findings.md   Phase 0b evidence — orchestration protocol
├── cb_eval_probe.py + findings     Python pivot story
├── m4_patches/                     the 5 patches that make llama.cpp do partial-load
├── m4_chain.cpp + findings         M4 single-process chain validation
├── worker_daemon.cpp               C++ daemon that runs patched libllama
├── m5_findings.md                  M5 — gRPC chain on localhost
└── m6_findings.md                  M6 — cross-machine acceptance test

proto/
└── nakshatra.proto                 v0.1 wire contract

scripts/
├── worker.py                       Python gRPC worker (spawns C++ daemon)
├── client.py                       chain-walking client
├── validate_cluster.py             pre-flight YAML + reachability check
├── generate.sh                     regenerates Python protobuf stubs
├── cluster_localhost.yaml          example: 2 workers on one host
└── cluster_crossmachine.yaml       example: 2 workers on Tailscale
```

## Architecture in one paragraph

Each worker is a Python gRPC process that spawns a long-lived C++ daemon (`llama-nakshatra-worker`). The daemon holds the model slice and KV cache, accepts framed binary messages over stdin/stdout, and runs `llama_decode` per request. The Python worker relays each gRPC request to the daemon and returns its reply. The client tokenizes the prompt locally, calls the first worker with token IDs, ferries the returned hidden state through any middle workers, and gets back a token ID from the last worker. The v0.3 federation extends this with sub-GGUF auto-fetch (workers download missing slices from peers), latency-aware chain assembly via a pillar registry, and Metal / ROCm GPU offload across heterogeneous machines. **Outputs on GPU paths are reproducible *in distribution*, not byte-for-byte**, because of kernel-level non-determinism in current backends. Bit-identical reproducibility is available via CPU-only workers (`--gpu-backend cpu`), which are kept for regression tests. See [`docs/v0.5-design-lock.md`](docs/v0.5-design-lock.md) for the full property statement.
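
To make "framed binary messages" concrete, here is a minimal length-prefixed framing sketch. The actual frame layout used by `worker_daemon.cpp` is not described in this README, so the 4-byte little-endian length header and the helper names `write_frame`, `read_exact`, and `read_frame` are assumptions chosen for illustration.

```python
import io
import struct
from typing import BinaryIO

# Assumed header: 4-byte little-endian payload length.
HEADER = struct.Struct("<I")


def write_frame(stream: BinaryIO, payload: bytes) -> None:
    """Write one frame: length header followed by the payload bytes."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)
    stream.flush()


def read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes; pipes may return short reads."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream closed mid-frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frame(stream: BinaryIO) -> bytes:
    """Read one frame written by write_frame."""
    (length,) = HEADER.unpack(read_exact(stream, HEADER.size))
    return read_exact(stream, length)


# Round-trip one f32 hidden state (hidden_size=3072) through an in-memory pipe.
hidden = struct.pack("<3072f", *([0.0] * 3072))
pipe = io.BytesIO()
write_frame(pipe, hidden)
pipe.seek(0)
received = read_frame(pipe)
print(len(received))   # 12288
```

A length prefix lets the reader know where each message ends without scanning for a delimiter, which matters because hidden-state bytes can contain any value.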

## Status — what's shipped, what's coming

**Shipped (v0.1 → v0.3):**

- GPU acceleration — Metal on Macs, ROCm on Linux. Live on the 5-machine lab cluster.
- Streaming KV-cache reuse (M2.5) — 5× speedup on multi-token generation vs naive re-prefill.
- Sub-GGUF distribution: workers advertise their cached slices and auto-fetch missing ones from peers over HTTP byte-range (Phase 4 / 4a).
- Latency-aware chain assembly via a pillar registry (Phase H / I).
- `/healthz` endpoint per worker with rich state (identity, daemon liveness, recent RPC latency, GPU offload, GPU inventory via ioreg).

**In progress (v0.5 protocol foundations, partial):**

- `Inference` streaming RPC (`--use-streaming`) — bidi streams replace per-step Forward calls (M0.5.1).
- Idempotency cache on workers (M0.5.2) — replayed `(session_id, step_id)` returns cached response without re-decoding.
- Server-to-server activation push (`--use-streaming-push`, M0.5.3 v1) — 2-worker chains working today; multi-hop is v2.
- Client-side recovery on stream failure (M0.5.4 v0) — replays history through fresh streams; continuation is plausible but not bit-identical (per the non-determinism property).

**Deferred:**

- Alternate-worker recovery from a registry of redundant peers (M0.5.4 v1).
- Latency-based silent-failure detection (M0.5.4 v2).
- Non-Llama model architectures — the partial-load patches today gate only `models_llama.cpp`; Qwen3 / Gemma / etc. need their own per-arch patch.
- DHT-based public-network peer discovery (v1.0+).
- Opt-in cryptographic verification (v1.0+).

See [`docs/v0.5-design-lock.md`](docs/v0.5-design-lock.md) for the v0.5 design contract and acceptance criteria.

## License & attribution

Forked from [Petals](README_PETALS.md). The Nakshatra additions (everything in `docs/`, `experiments/v0.0/`, `proto/`, `scripts/`, plus the patches in `experiments/v0.0/m4_patches/`) are MIT-licensed (per the upstream Petals repository's license).