Hey, can anyone help me export a fine-tuned model to GGUF using the Gradio UI or CLI? #3539
Replies: 1 comment
From what I know, only GPTQ quantization is supported. See `LLaMA-Factory/src/llmtuner/model/utils/quantization.py` (lines 27 to 36 in `bd095ee`), and, for the CLI, `LLaMA-Factory/examples/merge_lora/quantize.sh` (lines 4 to 11 in `845d5ac`). GGUF quantization is not supported from the UI either. That said, I already quantize to GGUF myself, as described below.
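For reference, a GPTQ export via the CLI looks roughly like the sketch below. The script name, flags, and paths are assumptions based on the `export_model.py` interface from that era of LLaMA-Factory, not a copy of the referenced file, so verify them against `examples/merge_lora/quantize.sh` in your own checkout:

```shell
# Hypothetical sketch of a GPTQ export with LLaMA-Factory's CLI.
# Script name, flags, and paths are assumptions -- check
# examples/merge_lora/quantize.sh in your checkout before running.
CUDA_VISIBLE_DEVICES=0 python src/export_model.py \
    --model_name_or_path path/to/merged_model \
    --template default \
    --export_dir path/to/gptq_model \
    --export_quantization_bit 4 \
    --export_quantization_dataset data/c4_demo.json
```

The quantization dataset is the calibration data GPTQ needs; the merged (LoRA-free) model directory is the input.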
Alternatively, if you still encounter problems, you can try the demo space gguf-my-repo.

For a more detailed guide (untested), see the GGUF quantization tutorial from `text-generation-webui` PR #5935, which covers how to convert a model to GGUF and quantize it on Windows and Linux. If you can't find a quantized version of the model you need on Hugging Face, you can quantize it yourself using that guide.

**Requirements**
**Windows**
- Preparation
- Convert
- Quantize

**Linux**
- Preparation
- Convert
- Quantize
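On Linux, the Convert and Quantize steps above boil down to running llama.cpp's conversion script and then its quantize tool. A minimal sketch, assuming a merged Hugging Face model directory and a recent llama.cpp checkout (script and binary names vary between llama.cpp versions, so check your clone):

```shell
# Sketch only: script/binary names differ across llama.cpp versions.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
make quantize                      # build the quantize tool

# Convert: Hugging Face weights -> f16 GGUF
python convert-hf-to-gguf.py /path/to/merged-model \
    --outfile model-f16.gguf --outtype f16

# Quantize: f16 GGUF -> 4-bit GGUF (Q4_K_M)
./quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

The resulting `model-q4_k_m.gguf` can then be loaded by llama.cpp-based runtimes.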