Highlights
🦅 Falcon support. Petals now supports all models based on Falcon, including Falcon 180B released today. We improved the 🤗 Transformers FalconModel implementation to be up to 40% faster on recent GPUs. Our chatbot app runs Falcon 180B-Chat at ~2 tokens/sec.
Falcon-40B is licensed under Apache 2.0, so you can load it by specifying tiiuae/falcon-40b or tiiuae/falcon-40b-instruct as the model name. Falcon-180B is licensed under a custom license, and it is not clear if we can provide a Python interface for inference and fine-tuning of this model. Right now, it is only available in the chatbot app, and we are waiting for further clarifications from TII on this issue.
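As a rough illustration of loading Falcon-40B by model name, the sketch below follows the usual Petals client pattern (AutoTokenizer plus AutoDistributedModelForCausalLM). The helper name generate_with_falcon and the FALCON_MODELS tuple are hypothetical and exist only for this example. Running it assumes the petals package is installed and that public peers are currently serving the chosen model.

```python
# Sketch: continue a prompt with Falcon-40B served by remote Petals peers.
# Assumes `pip install petals` and network access; nothing runs on import.

# Model names taken from the release notes (Apache 2.0 licensed checkpoints).
FALCON_MODELS = ("tiiuae/falcon-40b", "tiiuae/falcon-40b-instruct")


def generate_with_falcon(prompt: str,
                         model_name: str = FALCON_MODELS[0],
                         max_new_tokens: int = 16) -> str:
    """Hypothetical helper: return `prompt` plus a generated continuation."""
    if model_name not in FALCON_MODELS:
        # Falcon-180B is deliberately excluded: per the notes above, it is
        # only available in the chatbot app for now.
        raise ValueError(f"expected one of {FALCON_MODELS}, got {model_name!r}")

    # Heavy imports are deferred so that defining this function stays cheap.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0])
```

Calling generate_with_falcon("A cat sat") would download the tokenizer, connect to the swarm, and return the decoded text. Speed depends on how many peers are online.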
🍏 Native macOS support. You can run Petals clients and servers on macOS natively - just install Homebrew and run these commands:
brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2
If your computer has an Apple M1/M2 chip, the Petals server will use the integrated GPU automatically. We recommend hosting only Llama-based models, since other supported architectures do not work efficiently on M1/M2 chips yet. We also recommend using Python 3.10+ on macOS (installed by Homebrew automatically).
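To see which device PyTorch would offer on your machine, a check along these lines may help. This is not the Petals server's actual selection logic, only a minimal sketch of one plausible preference order. The helper names pick_device and detect_device are hypothetical.

```python
# Sketch: choose a compute device in the order CUDA, Apple GPU (MPS), CPU.
# This is an assumed preference order, not code taken from Petals.

def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer an NVIDIA GPU, then the Apple integrated GPU, then the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends (needs torch 1.12 or newer)."""
    import torch  # deferred so that pick_device() works without PyTorch

    return pick_device(torch.cuda.is_available(),
                       torch.backends.mps.is_available())
```

On an M1/M2 machine with a recent PyTorch build, detect_device() should report "mps", the backend for the integrated GPU.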
🔌 Serving custom models. Custom models now automatically show up at https://health.petals.dev as "not officially supported" models. As a reminder, you are not limited to models available at https://health.petals.dev and can run a server hosting any model based on the BLOOM, Llama, or Falcon architecture (provided the model license allows it), or even add support for a new architecture yourself. We also improved Petals compatibility with some popular Llama-based models (e.g., models from NousResearch) in this release.
🐞 Bug fixes. This release also fixes inference of prefix-tuned models, which was broken in Petals 2.1.0.
What's Changed
- Require transformers>=4.32.0 by @borzunov in #479
- Fix requiring transformers>=4.32.0 by @borzunov in #480
- Rewrite MemoryCache alloc_timeout logic by @justheuristic in #434
- Refactor readme by @borzunov in #482
- Support macOS natively by @borzunov in #477
- Remove no-op process in PrioritizedTaskPool by @borzunov in #484
- Fix .generate(input_ids=...) by @borzunov in #485
- Wait for DHT storing state OFFLINE on shutdown by @borzunov in #486
- Fix race condition in MemoryCache by @borzunov in #487
- Replace dots in repo names when building DHT prefixes by @borzunov in #489
- Create model index in DHT by @borzunov in #491
- Force use_cache=True by @borzunov in #496
- Force use_cache=True in config only by @borzunov in #497
- Add Falcon support by @borzunov in #499
- Fix prompt tuning after #464 by @borzunov in #501
- Optimize the Falcon block for inference by @mryab in #500
Full Changelog: v2.1.0...v2.2.0
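The changelog entry about replacing dots in repo names (#489) can be illustrated with a small sketch. It assumes that a dot separates the model prefix from the block index in a DHT key, and that dots in the repo name are swapped for hyphens. All three helper names are hypothetical, and the real implementation may differ.

```python
# Sketch: why a dot inside a repo name is a problem for dot-delimited keys.
from typing import Tuple

UID_DELIMITER = "."  # assumed separator between model prefix and block index


def make_dht_prefix(repo_name: str) -> str:
    """Hypothetical helper: make a repo name safe to use as a key prefix."""
    return repo_name.replace(UID_DELIMITER, "-")


def block_uid(dht_prefix: str, block_index: int) -> str:
    """Build a key such as "<prefix>.3" for the block with index 3."""
    return f"{dht_prefix}{UID_DELIMITER}{block_index}"


def parse_block_uid(uid: str) -> Tuple[str, int]:
    """Split a key back into (prefix, index); fails if the prefix has dots."""
    prefix, index = uid.split(UID_DELIMITER)  # ValueError on extra dots
    return prefix, int(index)
```

With a repo name like "org/model-v1.5", the unsanitized key "org/model-v1.5.3" cannot be split unambiguously, while the sanitized key round-trips through parse_block_uid.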