Releases: ollama/ollama
v0.1.41
What's Changed
- Fixed issue on Windows 10 and 11 with Intel CPUs with integrated GPUs where Ollama would encounter an error
Full Changelog: v0.1.40...v0.1.41
v0.1.40
New models
- Codestral: Mistral AI’s first-ever code model, designed for code generation tasks.
- IBM Granite Code: now in 3B and 8B parameter sizes.
- Deepseek V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
What's Changed
- Fixed out of memory and incorrect token issues when running Codestral on 16GB Macs
- Fixed issue where full-width characters (e.g. Japanese, Chinese, Russian) were deleted at the end of the line when using `ollama run`
New Examples
New Contributors
- @zhewang1-intc made their first contribution in #3278
Full Changelog: v0.1.39...v0.1.40
v0.1.39
New models
- Cohere Aya 23: A new state-of-the-art, multilingual LLM covering 23 different languages.
- Mistral 7B 0.3: A new version of Mistral 7B with initial support for function calling.
- Phi-3 Medium: a 14B-parameter, lightweight, state-of-the-art open model by Microsoft.
- Phi-3 Mini 128K and Phi-3 Medium 128K: versions of the Phi-3 models that support a context window size of 128K
- Granite code: A family of open foundation models by IBM for Code Intelligence
Llama 3 import
It is now possible to import and quantize Llama 3 and its finetunes from Safetensors format to Ollama.
First, clone a Hugging Face repo with a Safetensors model:
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
cd Meta-Llama-3-8B-Instruct
Next, create a Modelfile:
FROM .
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
Then, create and quantize a model:
ollama create --quantize q4_0 -f Modelfile my-llama3
ollama run my-llama3
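Once created, the model can also be queried over Ollama's REST API; a minimal sketch, assuming `ollama serve` is running on the default port and the model was created as `my-llama3` above:

```shell
# Ask the newly imported model a question via the generate endpoint
# (assumes the server is listening on the default port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "my-llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```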
What's Changed
- Fixed issues with wide characters in languages such as Chinese, Korean, Japanese and Russian
- Added new `OLLAMA_NOHISTORY=1` environment variable that can be set to disable history when using `ollama run`
- New experimental `OLLAMA_FLASH_ATTENTION=1` flag for `ollama serve` that improves token generation speed on Apple Silicon Macs and NVIDIA graphics cards
- Fixed error that would occur on Windows when running `ollama create -f Modelfile`
- `ollama create` can now create models from I-Quant GGUF files
- Fixed `EOF` errors when resuming downloads via `ollama pull`
- Added a `Ctrl+W` shortcut to `ollama run`
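As a sketch, the two new environment variables above can be combined in a typical session (model name illustrative):

```shell
# Start the server with the experimental flash attention path enabled
OLLAMA_FLASH_ATTENTION=1 ollama serve

# In another terminal: run a model without recording readline history
OLLAMA_NOHISTORY=1 ollama run llama3
```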
New Contributors
- @rapmd73 made their first contribution in #4467
- @sammcj made their first contribution in #4120
- @likejazz made their first contribution in #4535
Full Changelog: v0.1.38...v0.1.39
v0.1.38
New Models
- Falcon 2: A new 11B-parameter causal decoder-only model built by TII and trained over 5T tokens.
- Yi 1.5: A new high-performing version of Yi, now licensed as Apache 2.0. Available in 6B, 9B and 34B sizes.
What's Changed
ollama ps
A new command is now available: `ollama ps`. This command displays currently loaded models, their memory footprint, and the processors used (GPU or CPU):
% ollama ps
NAME ID SIZE PROCESSOR UNTIL
mixtral:latest 7708c059a8bb 28 GB 47%/53% CPU/GPU Forever
llama3:latest a6990ed6be41 5.5 GB 100% GPU 4 minutes from now
all-minilm:latest 1b226e2802db 585 MB 100% GPU 4 minutes from now
/clear
To clear the chat history for a session when running `ollama run`, use `/clear`:
>>> /clear
Cleared session context
- Fixed issue where switching loaded models on Windows would take several seconds
- Running `/save` will no longer abort the chat session if an incorrect name is provided
- The `/api/tags` API endpoint will now correctly return an empty list `[]` instead of `null` if no models are present
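The `/api/tags` behavior can be checked with a quick request; a sketch assuming a freshly started server with no models pulled:

```shell
# List installed models; with none present, the "models" field
# is now an empty list rather than null
curl http://localhost:11434/api/tags
```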
New Contributors
- @fangtaosong made their first contribution in #4387
- @machimachida made their first contribution in #4424
Full Changelog: v0.1.37...v0.1.38
v0.1.37
What's Changed
- Fixed issue where models with uppercase characters in the name would not show with `ollama list`
- Fixed usage string for `ollama create`
- Fixed `finish_reason` being `""` instead of `null` in the OpenAI-compatible chat API
New Contributors
- @todashuta made their first contribution in #4362
Full Changelog: v0.1.36...v0.1.37
v0.1.36
What's Changed
- Fixed `exit status 0xc0000005` error with AMD graphics cards on Windows
- Fixed rare out of memory errors when loading a model to run with CPU
Full Changelog: v0.1.35...v0.1.36
v0.1.35
New models
- Llama 3 ChatQA: A model from NVIDIA based on Llama 3 that excels at conversational question answering (QA) and retrieval-augmented generation (RAG).
What's Changed
- Quantization: `ollama create` can now quantize models when importing them using the `--quantize` or `-q` flag:
ollama create -f Modelfile --quantize q4_0 mymodel
Note: `--quantize` works when importing `float16` or `float32` models:
- from a binary GGUF file (e.g. `FROM ./model.gguf`)
- from a library model (e.g. `FROM llama3:8b-instruct-fp16`)
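For instance, quantizing a float16 library model could look like this (a sketch; the output model name is illustrative):

```shell
# Import fp16 weights from the model library and quantize to 4-bit
cat > Modelfile <<'EOF'
FROM llama3:8b-instruct-fp16
EOF
ollama create -f Modelfile --quantize q4_0 my-quantized-llama3
ollama run my-quantized-llama3
```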
- Fixed issue where inference subprocesses wouldn't be cleaned up on shutdown
- Fixed a series of out of memory errors when loading models on multi-GPU systems
- Ctrl+J characters will now properly add newlines in `ollama run`
- Fixed issues when running `ollama show` for vision models
- `OPTIONS` requests to the Ollama API will no longer result in errors
- Fixed issue where partially downloaded files wouldn't be cleaned up
- Added a new `done_reason` field in responses describing why generation stopped
- Ollama will now more accurately estimate how much memory is available on multi-GPU systems, especially when running different models one after another
New Contributors
- @fmaclen made their first contribution in #3884
- @Renset made their first contribution in #3881
- @glumia made their first contribution in #3043
- @boessu made their first contribution in #4236
- @gaardhus made their first contribution in #2307
- @svilupp made their first contribution in #2192
- @WolfTheDeveloper made their first contribution in #4300
Full Changelog: v0.1.34...v0.1.35
v0.1.34
New models
- Llava Llama 3: A new high-performing LLaVA model fine-tuned from Llama 3 Instruct.
- Llava Phi 3: A new small LLaVA model fine-tuned from Phi 3.
- StarCoder2 15B Instruct: A new instruct fine-tune of the StarCoder2 model
- CodeGemma 1.1: A new release of the CodeGemma model.
- StableLM2 12B: A new 12B version of the StableLM 2 model from Stability AI
- Moondream 2: Moondream 2's runtime parameters have been improved for better responses
What's Changed
- Fixed issues with LLaVa models where they would respond incorrectly after the first request
- Fixed out of memory errors when running large models such as Llama 3 70B
- Fixed various issues with Nvidia GPU discovery on Linux and Windows
- Fixed a series of Modelfile errors when running `ollama create`
- Fixed `no slots available` error that occurred when cancelling a request and then sending follow-up requests
- Improved AMD GPU detection on Fedora
- Improved reliability when using the experimental `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED` flags
- `ollama serve` will now shut down quickly, even if a model is loading
New Contributors
- @drnic made their first contribution in #4116
- @bernardo-bruning made their first contribution in #4111
- @Drlordbasil made their first contribution in #4174
- @Saif-Shines made their first contribution in #4119
- @HydenLiu made their first contribution in #4194
- @jl-codes made their first contribution in #3621
- @Nurgo made their first contribution in #3473
- @adrienbrault made their first contribution in #3129
- @Darinochka made their first contribution in #3945
Full Changelog: v0.1.33...v0.1.34
v0.1.33
New models:
- Llama 3: a new model by Meta, and the most capable openly available LLM to date
- Phi 3 Mini: a new 3.8B-parameter, lightweight, state-of-the-art open model by Microsoft.
- Moondream: a small vision language model designed to run efficiently on edge devices.
- Llama 3 Gradient 1048K: A Llama 3 fine-tune by Gradient to support up to a 1M token context window.
- Dolphin Llama 3: The uncensored Dolphin model, trained by Eric Hartford and based on Llama 3 with a variety of instruction, conversational, and coding skills.
- Qwen 110B: The first Qwen model over 100B parameters in size with outstanding performance in evaluations
What's Changed
- Fixed issues where the model would not terminate, causing the API to hang.
- Fixed a series of out of memory errors on Apple Silicon Macs
- Fixed out of memory errors when running Mixtral architecture models
Experimental concurrency features
New concurrency features are coming soon to Ollama. They are available as experimental features:
- `OLLAMA_NUM_PARALLEL`: Handle multiple requests simultaneously for a single model
- `OLLAMA_MAX_LOADED_MODELS`: Load multiple models simultaneously
To enable these features, set the environment variables when starting `ollama serve`:
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
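With `OLLAMA_NUM_PARALLEL` set as above, simultaneous requests to the same model are served in parallel rather than queued; a minimal sketch using `curl` (model name illustrative):

```shell
# Fire two generations concurrently against the same loaded model
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3","prompt":"Why is the sky blue?","stream":false}' &
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3","prompt":"What is an LLM?","stream":false}' &
wait
```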
New Contributors
- @hmartinez82 made their first contribution in #3972
- @Cephra made their first contribution in #4037
- @arpitjain099 made their first contribution in #4007
- @MarkWard0110 made their first contribution in #4031
- @alwqx made their first contribution in #4073
- @sidxt made their first contribution in #3705
- @ChengenH made their first contribution in #3789
- @secondtruth made their first contribution in #3503
- @reid41 made their first contribution in #3612
- @ericcurtin made their first contribution in #3626
- @JT2M0L3Y made their first contribution in #3633
- @datvodinh made their first contribution in #3655
- @MapleEve made their first contribution in #3817
- @swuecho made their first contribution in #3810
- @brycereitano made their first contribution in #3895
- @bsdnet made their first contribution in #3889
- @fyxtro made their first contribution in #3855
- @natalyjazzviolin made their first contribution in #3962
Full Changelog: v0.1.32...v0.1.33
v0.1.32
New models
- WizardLM 2: State of the art large language model from Microsoft AI with improved performance on complex chat, multilingual, reasoning and agent use cases.
  - `wizardlm2:8x22b`: large 8x22B model based on Mixtral 8x22B
  - `wizardlm2:7b`: fast, high-performing model based on Mistral 7B
- Snowflake Arctic Embed: A suite of text embedding models by Snowflake, optimized for performance.
- Command R+: a powerful, scalable large language model purpose-built for RAG use cases
- DBRX: A large 132B open, general-purpose LLM created by Databricks.
- Mixtral 8x22B: the new leading Mixture of Experts (MoE) base model by Mistral AI.
What's Changed
- Ollama will now better utilize available VRAM, leading to less out-of-memory errors, as well as better GPU utilization
- When running larger models that don't fit into VRAM on macOS, Ollama will now split the model between GPU and CPU to maximize performance.
- Fixed several issues where Ollama would hang upon encountering an error
- Fixed issue where using quotes in `OLLAMA_ORIGINS` would cause an error
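Relatedly, `OLLAMA_ORIGINS` takes a comma-separated list of allowed origins and does not need surrounding quotes; a sketch (origins illustrative):

```shell
# Allow browser requests from a local dev server and a deployed site
OLLAMA_ORIGINS=http://localhost:3000,https://example.com ollama serve
```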
New Contributors
- @sugarforever made their first contribution in #3400
- @yaroslavyaroslav made their first contribution in #3378
- @Nagi-ovo made their first contribution in #3423
- @ParisNeo made their first contribution in #3436
- @philippgille made their first contribution in #3437
- @cesto93 made their first contribution in #3461
- @ThomasVitale made their first contribution in #3515
- @writinwaters made their first contribution in #3539
- @alexmavr made their first contribution in #3555
Full Changelog: v0.1.31...v0.1.32