Release v2.2.0: Falcon, macOS support, and more · bigscience-workshop/petals

Highlights

🦅 Falcon support. Petals now supports all models based on Falcon, including Falcon 180B released today. We improved the 🤗 Transformers FalconModel implementation to be up to 40% faster on recent GPUs. Our chatbot app runs Falcon 180B-Chat at ~2 tokens/sec.

Falcon-40B is licensed under Apache 2.0, so you can load it by specifying tiiuae/falcon-40b or tiiuae/falcon-40b-instruct as the model name. Falcon-180B is licensed under a custom license, and it is not clear if we can provide a Python interface for inference and fine-tuning of this model. Right now, it is only available in the chatbot app, and we are waiting for further clarifications from TII on this issue.
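
As a rough illustration of loading one of the Apache 2.0 Falcon checkpoints by name, here is a minimal sketch that follows the usual Petals client pattern (`AutoDistributedModelForCausalLM` plus a 🤗 Transformers tokenizer). The helper name `generate_with_petals`, the `FALCON_MODELS` tuple, and the default of 5 new tokens are illustrative assumptions, not part of the release. Running it requires network access to the public swarm and servers that host the chosen model.

```python
# Falcon-40B checkpoints named in this release (both Apache 2.0).
FALCON_MODELS = ("tiiuae/falcon-40b", "tiiuae/falcon-40b-instruct")


def generate_with_petals(model_name: str, prompt: str, max_new_tokens: int = 5) -> str:
    """Generate a short continuation through the Petals swarm (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require
    # petals or transformers to be installed.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Connects to the swarm; transformer blocks run on remote servers.
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])
```

Calling `generate_with_petals(FALCON_MODELS[1], "A cat sat on")` would then return the prompt followed by a few generated tokens, provided the swarm currently serves that model.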

🍏 Native macOS support. You can now run Petals clients and servers natively on macOS: install Homebrew, then run these commands:

brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2

If your computer has an Apple M1/M2 chip, the Petals server will use the integrated GPU automatically. We recommend hosting only Llama-based models, since other supported architectures do not run efficiently on M1/M2 chips yet. We also recommend using Python 3.10+ on macOS (Homebrew installs it automatically).
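
To check in advance which accelerator PyTorch can see on your machine, a small sketch like the following may help. It is not the Petals server's own device-selection code; `pick_device` is a hypothetical helper that uses only the standard PyTorch availability checks and falls back to the CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch we cannot probe accelerators; assume CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # The MPS backend covers Apple-silicon integrated GPUs; older
    # PyTorch builds may not expose it at all, hence the getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(pick_device())
```

On an M1/M2 Mac with a recent PyTorch build, this would be expected to print `mps`.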

🔌 Serving custom models. Custom models now automatically show up at https://health.petals.dev as "not officially supported" models. As a reminder, you are not limited to the models listed at https://health.petals.dev: you can run a server hosting any model based on the BLOOM, Llama, or Falcon architecture (provided the model license allows it), or even add support for a new architecture yourself. This release also improves Petals compatibility with some popular Llama-based models (e.g., models from NousResearch).
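
If you prefer to start a server for a custom model from a Python script, one possible sketch wraps the same `petals.cli.run_server` module used in the macOS commands above. The helper names `build_server_command` and `launch_server` are hypothetical, and the sketch assumes the model repository is passed as the only positional argument, as in those commands.

```python
import subprocess
import sys
from typing import List


def build_server_command(model_name: str) -> List[str]:
    """Build the command line that starts a Petals server for model_name."""
    # sys.executable keeps the server in the same Python environment
    # where Petals was installed.
    return [sys.executable, "-m", "petals.cli.run_server", model_name]


def launch_server(model_name: str) -> None:
    """Run the server in the foreground; raises if it exits with an error."""
    subprocess.run(build_server_command(model_name), check=True)
```

For example, `launch_server("petals-team/StableBeluga2")` mirrors the command shown earlier; substitute the repository name of any BLOOM-, Llama-, or Falcon-based model whose license permits hosting.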

🐞 Bug fixes. This release also fixes inference of prefix-tuned models, which was broken in Petals 2.1.0.

What's Changed

Require transformers>=4.32.0 by @borzunov in #479
Fix requiring transformers>=4.32.0 by @borzunov in #480
Rewrite MemoryCache alloc_timeout logic by @justheuristic in #434
Refactor readme by @borzunov in #482
Support macOS natively by @borzunov in #477
Remove no-op process in PrioritizedTaskPool by @borzunov in #484
Fix .generate(input_ids=...) by @borzunov in #485
Wait for DHT storing state OFFLINE on shutdown by @borzunov in #486
Fix race condition in MemoryCache by @borzunov in #487
Replace dots in repo names when building DHT prefixes by @borzunov in #489
Create model index in DHT by @borzunov in #491
Force use_cache=True by @borzunov in #496
Force use_cache=True in config only by @borzunov in #497
Add Falcon support by @borzunov in #499
Fix prompt tuning after #464 by @borzunov in #501
Optimize the Falcon block for inference by @mryab in #500

Full Changelog: v2.1.0...v2.2.0
