Evaluation results of llama2 with executorch #3568
Comments
I think you should try with …
Thank you @mergennachin.
@Jack-Khuu can you reproduce these numbers on your end?
About the wikitext dataset: I downloaded it from https://huggingface.co/datasets/wikitext/tree/main/wikitext-2-raw-v1. Is that the same dataset download as yours? Thank you.
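(For reference, a minimal sketch of one way to pull that exact config programmatically via the Hugging Face `datasets` library, so both sides can confirm they are evaluating the same text — the raw-file download above should be equivalent, but that equivalence is an assumption here:)

```python
# Sketch: load the same wikitext config through Hugging Face datasets.
# Assumes `pip install datasets`; the thread downloads raw files instead,
# which should contain the same text.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-2-raw-v1")
print(wikitext)             # splits: train / validation / test
print(wikitext["test"][0])  # peek at the first test record
```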
@l2002924700 Which commit hash are you using?
@Jack-Khuu the commit hash is 8aaa8b27d493dba10b8553290236799e6dc57829
@l2002924700 Can you try rerunning with `-qmode 8da4w`? The numbers provided in the README were for groupwise 4b quantization.
Context: I ran your command on main and got reasonable numbers:

```
wikitext: {'word_perplexity,none': 9.168552791655282, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.5134046290382204, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.5977977630089415, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
```

And I was able to repro our README numbers with `-qmode 8da4w`.
Thank you @Jack-Khuu. Following your suggestion, I reran the test with `-qmode 8da4w`. The command is as follows:

```
python -m examples.models.llama2.eval_llama -c /home/LLM-Models/Llama-2-7b/consolidated.00.pth --params /home/LLM-Models/Llama-2-7b/params.json -t /home/LLM-Models/Llama-2-7b/tokenizer.model -qmode 8da4w -G 128 --max_seq_len 2048 --limit 1000
```

I got even weirder results.
I have tested the llama2 model with the wikitext dataset using llama.cpp, and those results are similar to the values on GitHub.

My test results:

So I think my model and dataset are the same ones that llama.cpp uses. I don't know why the test results differ so widely between ours.
We're using the HF download of 7b: https://huggingface.co/meta-llama/Llama-2-7b/tree/main. Are you using the tokenizer/model/params from there? For evaluation we're using EleutherAI's lm_eval, so the dataset is abstracted away there.
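(For context, a rough sketch of what that abstraction looks like: lm_eval fetches and scores the wikitext task itself, so no manual dataset download is involved. The `my_lm` argument below is a hypothetical lm_eval-compatible wrapper — eval_llama builds its own wrapper internally, so this is an illustration, not the executorch code path:)

```python
# Sketch: lm_eval pulls the wikitext task itself; the caller only
# supplies a model wrapper. `my_lm` is hypothetical here.
import lm_eval

results = lm_eval.simple_evaluate(
    model=my_lm,        # hypothetical lm_eval.api.model.LM instance
    tasks=["wikitext"],
    limit=1000,         # mirrors the --limit 1000 flag used in this thread
)
print(results["results"]["wikitext"])
```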
Hi @Jack-Khuu,

```
python -m examples.models.llama2.eval_llama --checkpoint /home/jackkhuu/llm_files/7b/consolidated.00.pth --params /home/jackkhuu/llm_files/7b/config.json -t /home/jackkhuu/llm_files/7b/tokenizer.model --group_size 128 --quantization_mode int8 --max_seq_len 2048 --limit 1000
```

The params file here is "config.json", but in the HF repo https://huggingface.co/meta-llama/Llama-2-7b/tree/main there is no such file. I think you took the params.json file and renamed it config.json, am I right? Thank you in advance.
Ah, I should've clarified: I meant the hashes of your model/params/tokenizer files. As for the naming: my params file just happens to be named config.json; the contents are the same.
I am sorry, but I don't know how to get the hashes of my model/params/tokenizer files. Could you please show me how to get them?
You can call … For example: …
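(The specific command was elided above. Purely as an illustration — not necessarily the tool @Jack-Khuu had in mind — here is a minimal Python sketch for hashing the three files discussed in this thread; a shell utility like md5sum would work just as well:)

```python
# Sketch: compute an MD5 digest per file so the hashes can be compared.
# The file names are the Llama-2-7b artifacts from this thread; adjust
# the paths to wherever they live locally.
import hashlib

def file_md5(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for name in ["consolidated.00.pth", "params.json", "tokenizer.model"]:
    print(name, file_md5(name))
```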
Thank you @Jack-Khuu.

Would you please help me check whether the hashes of these files are the same as yours?
Those are the exact same hashes that we're using... Are you getting the same results on a clean conda instance and install?
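(To make "same results on a clean install" checkable, a minimal sketch for recording the package versions in each environment — the distribution names below are assumptions about how these packages are published:)

```python
# Sketch: print installed versions so two environments can be diffed.
# Package names are assumptions; adjust to whatever `pip list` shows.
import importlib.metadata as md

for pkg in ["torch", "executorch", "lm_eval"]:
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```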
OK, I will try it. Thank you.
Closing this issue for now, since we aren't able to reproduce your results. Thanks @l2002924700 for surfacing this, and feel free to spin up a new issue should anything else arise!
Hi, kindly helper,

I am new to executorch and I want to test the llama2 model as described in https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md. My test steps are as follows:

```
python -m examples.models.llama2.eval_llama --checkpoint /home/LLM-Models/Llama-2-7b/consolidated.00.pth --params /home/LLM-Models/Llama-2-7b/params.json -t /home/LLM-Models/Llama-2-7b/tokenizer.model --group_size 128 --quantization_mode int8 --max_seq_len 2048 --limit 1000
```

Then I get the following results:
| Tasks | Version | Filter | n-shot | Metric | Value |
|---|---|---|---|---|---|
| wikitext | 2 | none | 0 | word_perplexity | 16.77 |
| | | none | 0 | byte_perplexity | 1.717 |
| | | none | 0 | bits_per_byte | 0.780 |
Are these results normal? I think the values are too high compared with the results in https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md.
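(One sanity check that needs no rerun: the three wikitext metrics are redundant, since byte_perplexity = 2^bits_per_byte and word_perplexity = byte_perplexity^(bytes per word). A minimal sketch, assuming lm_eval's standard wikitext definitions and an average of roughly 5.2 bytes per word — that ratio is an assumption, not a value eval_llama reports:)

```python
# Sketch: check that the three reported wikitext metrics are mutually
# consistent. BYTES_PER_WORD is an assumed corpus average (~5.2 for
# wikitext); eval_llama does not report it directly.
bits_per_byte = 0.780
byte_perplexity = 1.717
word_perplexity = 16.77
BYTES_PER_WORD = 5.2  # assumption

print(2 ** bits_per_byte)                 # ~1.717, matches byte_perplexity
print(byte_perplexity ** BYTES_PER_WORD)  # ~16.6, close to word_perplexity
```

The numbers above pass this check, so the question is why the absolute perplexity level is higher than the README's, not whether the metrics were derived incorrectly.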