Evaluation results of llama2 with exetorch #3568

Closed
l2002924700 opened this issue May 10, 2024 · 19 comments
Labels: llm: evaluation (Perplexity, accuracy) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@l2002924700 commented May 10, 2024

Hi, kind helpers,
I am new to ExecuTorch and I want to evaluate the llama2 model as described in "https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md". My test steps are as follows:

  1. Download the llama2 model from the Hugging Face website.
  2. Install ExecuTorch as described in the GitHub repository "https://github.com/pytorch/executorch".
  3. Run the evaluation with the command:
    "python -m examples.models.llama2.eval_llama --checkpoint /home/LLM-Models/Llama-2-7b/consolidated.00.pth --params /home/LLM-Models/Llama-2-7b/params.json -t /home/LLM-Models/Llama-2-7b/tokenizer.model --group_size 128 --quantization_mode int8 --max_seq_len 2048 --limit 1000"

Then I get the following results:

| Tasks | Version | Filter | n-shot | Metric | Value |
|---|---|---|---|---|---|
| wikitext | 2 | none | 0 | word_perplexity | 16.77 |
| | | none | 0 | byte_perplexity | 1.717 |
| | | none | 0 | bits_per_byte | 0.780 |

Are these results normal? The values seem too high compared with the results reported in "https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md".
@mergennachin (Contributor)

I think you should try with -qmode 8da4w instead of int8

cc @digantdesai @kimishpatel

@l2002924700 (Author) commented May 11, 2024

Thank you @mergennachin,
I have used -qmode 8da4w with the command "python -m examples.models.llama2.eval_llama --checkpoint /home/LLM-Models/Llama-2-7b/consolidated.00.pth --params /home/LLM-Models/Llama-2-7b/params.json -t /home/LLM-Models/Llama-2-7b/tokenizer.model --group_size 256 --quantization_mode 8da4w --max_seq_len 2048 --limit 1000". However, I get even higher values:
word_perplexity 26.321708194170252
byte_perplexity 1.8721607782928849
bits_per_byte 0.9047043366764531
What can I do next?
Thank you

@kimishpatel (Contributor)

@Jack-Khuu can you reproduce these numbers on your end?

@iseeyuan added the "llm: evaluation (Perplexity, accuracy)" and "triaged" labels on May 13, 2024
@l2002924700 (Author)

About the wikitext dataset: I downloaded it from "https://huggingface.co/datasets/wikitext/tree/main/wikitext-2-raw-v1". Is this the same dataset download as yours? Thank you.

@Jack-Khuu (Contributor)

@l2002924700 Which commit hash are you using?

@l2002924700 (Author)

@Jack-Khuu the commit hash is 8aaa8b27d493dba10b8553290236799e6dc57829

@Jack-Khuu (Contributor)

@l2002924700 Can you try rerunning with -qmode 8da4w?

The numbers provided in the README were for groupwise 4b

[screenshot: perplexity table from the README for groupwise 4-bit quantization]

@Jack-Khuu (Contributor)

Context: I ran your command on main and got reasonable numbers

python -m examples.models.llama2.eval_llama --checkpoint /home/jackkhuu/llm_files/7b/consolidated.00.pth --params /home/jackkhuu/llm_files/7b/config.json -t /home/jackkhuu/llm_files/7b/tokenizer.model --group_size 128 --quantization_mode int8 --max_seq_len 2048 --limit 1000

wikitext: {'word_perplexity,none': 9.168552791655282, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.5134046290382204, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.5977977630089415, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}

And was able to repro our README with 8da4w

@l2002924700 (Author)

Thank you @Jack-Khuu. Following your suggestion, I reran the test with -qmode 8da4w. The command is as follows:

```
python -m examples.models.llama2.eval_llama -c /home/LLM-Models/Llama-2-7b/consolidated.00.pth --params /home/LLM-Models/Llama-2-7b/params.json -t /home/LLM-Models/Llama-2-7b/tokenizer.model -qmode 8da4w -G 128 --max_seq_len 2048 --limit 1000
```

I got even more puzzling results:
wikitext: {'word_perplexity,none': 25.471536960494742, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.8604115013524496, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.8956217639672349, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
Since we use the same program to evaluate the model, I think my odd results may come from the dataset or the model files.
So could you please share the download URLs for your wikitext dataset and your Llama-2-7b model?
Thank you in advance for your kind help.

@l2002924700 (Author)

I have also tested the llama2 model on the wikitext dataset with llama.cpp. My results are close to the values published in that repository.

Published llama.cpp values:

| Model | Measure | F16 | Q4_0 | Q4_1 |
|---|---|---|---|---|
| 7B | perplexity | 5.9066 | 6.1565 | 6.0912 |

My test results:

| Model | Measure | F16 | Q4_0 | Q4_1 |
|---|---|---|---|---|
| 7B | perplexity | 5.7962 | 5.9625 | 6.0008 |

So I think my model and dataset are the same ones llama.cpp uses. I don't understand why our results differ so widely.
Thank you in advance for your kind answer.
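
(For reference, a wikitext perplexity run in llama.cpp along these lines is presumably what produced the numbers above; the binary name, model path, and flags are assumptions and depend on the llama.cpp version.)

```bash
# Assumed llama.cpp perplexity invocation (paths and quantized model file are placeholders):
./perplexity -m ./models/llama-2-7b/ggml-model-q4_0.gguf \
             -f ./wikitext-2-raw/wiki.test.raw \
             -c 2048   # context size, roughly matching --max_seq_len 2048 above
```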

@Jack-Khuu (Contributor)

We're using the HF download of 7b: https://huggingface.co/meta-llama/Llama-2-7b/tree/main

Are you using the tokenizer/model/params from there?
If so can you share the hashes, just as a sanity check?

For evaluation we're using EleutherAI's lm_eval, so the dataset download is abstracted away there.
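
(A rough sanity check, under the assumption that the harness fetches the wikitext config through Hugging Face `datasets` as usual; the split and column names below are standard for that dataset but are not taken from this thread.)

```bash
# Spot-check the wikitext-2-raw-v1 test split that lm_eval would download on its own:
python -c "from datasets import load_dataset; \
ds = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test'); \
print(len(ds), repr(ds[0]['text'])[:80])"
```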

@l2002924700 (Author)

Hi @Jack-Khuu,
Thank you very much.
The commit hash of my download is 69656aac4cb47911a639f5890ff35b41ceb82e98. In your command

python -m examples.models.llama2.eval_llama --checkpoint /home/jackkhuu/llm_files/7b/consolidated.00.pth --params /home/jackkhuu/llm_files/7b/config.json -t /home/jackkhuu/llm_files/7b/tokenizer.model --group_size 128 --quantization_mode int8 --max_seq_len 2048 --limit 1000

the params file is "config.json", but in the HF repo https://huggingface.co/meta-llama/Llama-2-7b/tree/main there is no such file. I think you renamed the params.json file to config.json, am I right?
My params.json is as follows:
{"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-05, "vocab_size": 32000}.
Is my params.json file the same as your config.json?

Thank you in advance.

@Jack-Khuu (Contributor)

Ah I should've clarified, I meant the hashes of your model/params/tokenizer files.
That way we can verify which files are different.

As for the naming: my params file just happens to be named config.json; the contents are the same.

@l2002924700 (Author)

I am sorry, but I don't know how to get the hashes of my model/params/tokenizer files. Could you please show me how to get them?
Thank you in advance.

@Jack-Khuu (Contributor)

You can call md5sum on the files

For example: md5sum tokenizer.model
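
(To hash all three files in one pass, something like the following works; the paths mirror the ones used earlier in this thread.)

```bash
# Hash the checkpoint, params, and tokenizer together for easy comparison:
md5sum /home/LLM-Models/Llama-2-7b/consolidated.00.pth \
       /home/LLM-Models/Llama-2-7b/params.json \
       /home/LLM-Models/Llama-2-7b/tokenizer.model
```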

@l2002924700 (Author)

Thank you @Jack-Khuu .
I have computed the hashes of my model/params/tokenizer files with md5sum, and the results are as follows:

md5sum tokenizer.model
eeec4125e9c7560836b4873b6f8e3025  tokenizer.model
md5sum params.json
faeb3d79269b5783e9a9a0e99956c018  params.json
md5sum consolidated.00.pth
daa8e3109935070df7fe8fc42d34525e  consolidated.00.pth

Would you please help me check whether the hashes of these files are the same as yours?
Thank you in advance.

@Jack-Khuu (Contributor)

Those are the exact same hashes that we're using...

Are you getting the same results on a clean conda instance and install?

  • If the files are the same then the only discrepancy I can think of would be either local changes or env differences
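
(A minimal sketch of what a clean retry could look like; the environment name and Python version are arbitrary, and the setup script name follows the ExecuTorch README of that period, so treat it as an assumption and defer to the current instructions.)

```bash
# Create a fresh conda environment and reinstall ExecuTorch from a clean checkout:
conda create -y -n et-clean python=3.10
conda activate et-clean
git clone https://github.com/pytorch/executorch.git
cd executorch
./install_requirements.sh   # setup script name per the ExecuTorch README at the time (assumed)
# then rerun the eval_llama command from above in this environment
```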

@l2002924700 (Author)

OK, I will try it. Thank you.

@Jack-Khuu (Contributor)

Closing this issue for now, since we aren't able to reproduce your results

Thanks @l2002924700 for surfacing and feel free to spin up a new issue should anything else arise!
