<p align="center">
    <img src="https://i.imgur.com/7eR7Pan.png" width="400"><br>
    Run large language models at home, BitTorrent-style.<br>
    Fine-tuning and inference <a href="https://github.com/bigscience-workshop/petals#benchmarks">up to 10x faster</a> than offloading
    <br><br>
    <a href="https://pypi.org/project/petals/"><img src="https://img.shields.io/pypi/v/petals.svg?color=green"></a>
    <a href="https://discord.gg/tfHfe8B34k"><img src="https://img.shields.io/discord/865254854262652969?label=discord&logo=discord&logoColor=white"></a>
    <br>
</p>

Generate text with distributed **Llama 3.1** (up to 405B), **Mixtral** (8x22B), **Falcon** (40B+) or **BLOOM** (176B) and fine‑tune them for your own tasks, right from your desktop computer or Google Colab:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Choose any model available at https://health.petals.dev
model_name = "meta-llama/Meta-Llama-3.1-405B-Instruct"

# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run the model as if it were on your computer
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat...
```

<p align="center">
    🚀 &nbsp;<b><a href="https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing">Try now in Colab</a></b>
</p>

🦙 **Want to run Llama?** [Request access](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to its weights, then run `huggingface-cli login` in the terminal before loading the model. Or just try it in our [chatbot app](https://chat.petals.dev).

🔏 **Privacy.** Your data will be processed with the help of other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.

💬 **Any questions?** Ping us in [our Discord](https://discord.gg/KdThf2bWVU)!

## Connect your GPU and increase Petals capacity

Petals is a community-run system: we rely on people sharing their GPUs. You can help by serving one of the [available models](https://health.petals.dev) or by hosting a new model from the 🤗 [Model Hub](https://huggingface.co/models)!

As an example, here is how to host a part of [Llama 3.1 (405B) Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) on your GPU:

🦙 **Want to host Llama?** [Request access](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to its weights, then run `huggingface-cli login` in the terminal before loading the model.

🐧 **Linux + Anaconda.** Run these commands for NVIDIA GPUs (or follow [this guide](https://github.com/bigscience-workshop/petals/wiki/Running-on-AMD-GPU) for AMD):

```bash
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install git+https://github.com/bigscience-workshop/petals
python -m petals.cli.run_server meta-llama/Meta-Llama-3.1-405B-Instruct
```

🪟 **Windows + WSL.** Follow [this guide](https://github.com/bigscience-workshop/petals/wiki/Run-Petals-server-on-Windows) on our Wiki.

🐋 **Docker.** Run our [Docker](https://www.docker.com) image for NVIDIA GPUs (or follow [this guide](https://github.com/bigscience-workshop/petals/wiki/Running-on-AMD-GPU) for AMD):

```bash
sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
    learningathome/petals:main \
    python -m petals.cli.run_server --port 31330 meta-llama/Meta-Llama-3.1-405B-Instruct
```

🍏 **macOS + Apple M1/M2 GPU.** Install [Homebrew](https://brew.sh/), then run these commands:

```bash
brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server meta-llama/Meta-Llama-3.1-405B-Instruct
```

<p align="center">
    📚 &nbsp;<b><a href="https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#running-a-server">Learn more</a></b> (how to use multiple GPUs, start the server on boot, etc.)
</p>

🔒 **Security.** Hosting a server does not allow others to run custom code on your computer. Learn more [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety).

💬 **Any questions?** Ping us in [our Discord](https://discord.gg/X7DgtxgMhc)!

🏆 **Thank you!** Once you load and host 10+ blocks, we can show your name or link on the [swarm monitor](https://health.petals.dev) as a way to say thanks. You can set it with `--public_name YOUR_NAME`.

## How does it work?

- You load a small part of the model, then join a [network](https://health.petals.dev) of people serving the other parts. Single‑batch inference runs at up to **6 tokens/sec** for **Llama 2** (70B) and up to **4 tokens/sec** for **Falcon** (180B), which is enough for [chatbots](https://chat.petals.dev) and interactive apps.
- You can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of **PyTorch** and **🤗 Transformers**.
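
To make the first bullet concrete, here is a toy, standard-library-only sketch of the idea: the model's blocks are split into contiguous slices, each held by a different peer, and the client passes the hidden state through the peers in block order. This is not Petals code. `ToyServer`, `build_swarm`, and `run_pipeline` are hypothetical names invented for illustration, each "block" just adds a number, and the sketch ignores networking, routing, and fault tolerance.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for one transformer block (not a real layer).
Block = Callable[[float], float]


def make_block(index: int) -> Block:
    """Return a toy 'layer' that adds its own index to the hidden state."""
    return lambda hidden: hidden + index


@dataclass
class ToyServer:
    """A peer holding the contiguous slice [start, end) of the model's blocks."""
    name: str
    start: int
    end: int
    blocks: List[Block]

    def forward(self, hidden: float) -> float:
        # Apply every block this peer hosts, in order.
        for block in self.blocks:
            hidden = block(hidden)
        return hidden


def build_swarm(num_blocks: int, names: List[str]) -> List[ToyServer]:
    """Split num_blocks as evenly as possible across the named peers."""
    servers = []
    base, extra = divmod(num_blocks, len(names))
    start = 0
    for i, name in enumerate(names):
        end = start + base + (1 if i < extra else 0)
        blocks = [make_block(j) for j in range(start, end)]
        servers.append(ToyServer(name, start, end, blocks))
        start = end
    return servers


def run_pipeline(servers: List[ToyServer], hidden: float) -> float:
    """Client side: send the hidden state through the peers in block order."""
    for server in sorted(servers, key=lambda s: s.start):
        hidden = server.forward(hidden)
    return hidden


swarm = build_swarm(num_blocks=8, names=["alice", "bob", "carol"])
for s in swarm:
    print(f"{s.name}: blocks {s.start}..{s.end - 1}")

result = run_pipeline(swarm, hidden=0.0)
print(result)  # same value as applying all 8 blocks on one machine
```

No single peer holds the whole model, yet the client gets the same result as running every block locally, because the slices are applied in the original order.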

<p align="center">
    <img src="https://i.imgur.com/RTYF3yW.png" width="800">
</p>
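
As a rough feel for what the throughput figures quoted above mean for an interactive app, the sketch below converts tokens per second into wall-clock time for one reply. The 120-token reply length is an assumption chosen for illustration, and because the quoted rates are upper bounds ("up to"), the printed times are best-case estimates rather than measurements.

```python
def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to emit num_tokens one after another at a steady decoding rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec


# Upper-bound single-batch rates quoted in this README.
rates = {"Llama 2 (70B)": 6.0, "Falcon (180B)": 4.0}
reply_tokens = 120  # assumed chat reply length, for illustration only

for model, rate in rates.items():
    seconds = seconds_to_generate(reply_tokens, rate)
    print(f"{model}: about {seconds:.0f} s for {reply_tokens} tokens")
```

Tokens can be shown to the user as they arrive, so the perceived latency of a chatbot is closer to the time per token than to the total.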

<p align="center">
    📜 &nbsp;<b><a href="https://arxiv.org/pdf/2209.01188.pdf">Read paper</a></b>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    📚 &nbsp;<b><a href="https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions">See FAQ</a></b>
</p>

## 📚 Tutorials, examples, and more

Basic tutorials:

- Getting started: [tutorial](https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing)
- Prompt-tune Llama-65B for text semantic classification: [tutorial](https://colab.research.google.com/github/bigscience-workshop/petals/blob/main/examples/prompt-tuning-sst2.ipynb)
- Prompt-tune BLOOM to create a personified chatbot: [tutorial](https://colab.research.google.com/github/bigscience-workshop/petals/blob/main/examples/prompt-tuning-personachat.ipynb)

Useful tools:

- [Chatbot web app](https://chat.petals.dev) (connects to Petals via an HTTP/WebSocket endpoint): [source code](https://github.com/petals-infra/chat.petals.dev)
- [Monitor](https://health.petals.dev) for the public swarm: [source code](https://github.com/petals-infra/health.petals.dev)

Advanced guides:

- Launch a private swarm: [guide](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm)
- Run a custom model: [guide](https://github.com/bigscience-workshop/petals/wiki/Run-a-custom-model-with-Petals)

### Benchmarks

Please see **Section 3.3** of our [paper](https://arxiv.org/pdf/2209.01188.pdf).

### 🛠️ Contributing

Please see our [FAQ](https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#contributing) on contributing.

### 📜 Citations

Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel.
[Petals: Collaborative Inference and Fine-tuning of Large Models.](https://arxiv.org/abs/2209.01188)
_Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)._ 2023.

```bibtex
@inproceedings{borzunov2023petals,
  title = {Petals: Collaborative Inference and Fine-tuning of Large Models},
  author = {Borzunov, Alexander and Baranchuk, Dmitry and Dettmers, Tim and Riabinin, Maksim and Belkada, Younes and Chumachenko, Artem and Samygin, Pavel and Raffel, Colin},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
  pages = {558--568},
  year = {2023},
  url = {https://arxiv.org/abs/2209.01188}
}
```

Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, and Colin Raffel.
[Distributed inference and fine-tuning of large language models over the Internet.](https://arxiv.org/abs/2312.08361)
_Advances in Neural Information Processing Systems_ 36 (2023).

```bibtex
@inproceedings{borzunov2023distributed,
  title = {Distributed inference and fine-tuning of large language models over the {I}nternet},
  author = {Borzunov, Alexander and Ryabinin, Max and Chumachenko, Artem and Baranchuk, Dmitry and Dettmers, Tim and Belkada, Younes and Samygin, Pavel and Raffel, Colin},
  booktitle = {Advances in Neural Information Processing Systems},
  volume = {36},
  pages = {12312--12331},
  year = {2023},
  url = {https://arxiv.org/abs/2312.08361}
}
```

--------------------------------------------------------------------------------

<p align="center">
    This project is a part of the <a href="https://bigscience.huggingface.co/">BigScience</a> research workshop.
</p>
<p align="center">
    <img src="https://petals.dev/bigscience.png" width="150">
</p>