# Evaluation

Here are the results of evaluating the models on the different tasks. The results are presented as tables, where the first column is the model name and the remaining columns report performance on Thai (th), Indonesian (id), and Vietnamese (vi), respectively (for M3Exam, Javanese (jv) takes the place of Indonesian). The results of the Sailor models are highlighted in bold.

## Question Answering

| 3-shot (EM / F1) | XQuAD (th) | TydiQA (id) | XQuAD (vi) |
|------------------|------------|-------------|------------|
| Qwen1.5-0.5B | 14.19 / 23.35 | 20.71 / 32.64 | 19.85 / 35.38 |
| **Sailor-0.5B** | **15.84 / 27.58** | **30.44 / 54.74** | **21.13 / 40.57** |
| Qwen1.5-1.8B | 27.24 / 43.56 | 29.73 / 53.76 | 29.17 / 48.15 |
| **Sailor-1.8B** | **32.72 / 48.66** | **40.88 / 65.37** | **34.22 / 53.35** |
| Qwen1.5-4B | 34.03 / 53.40 | 48.32 / 72.68 | 43.71 / 63.86 |
| **Sailor-4B** | **46.82 / 63.34** | **53.98 / 73.48** | **47.65 / 67.09** |
| Llama-2-7B | 30.64 / 43.80 | 56.64 / 72.14 | 46.96 / 66.16 |
| Mistral-7B-v0.1 | 48.48 / 63.27 | 63.54 / 78.73 | 53.72 / 72.75 |
| SeaLLM-7B-Hybrid | 49.70 / 67.62 | 50.62 / 75.21 | 49.62 / 70.74 |
| SeaLLM-7B-v2 | 34.55 / 55.13 | 52.21 / 77.00 | 46.19 / 72.11 |
| Qwen1.5-7B | 53.79 / 69.30 | 57.17 / 77.28 | 56.63 / 76.99 |
| **Sailor-7B** | **57.88 / 71.06** | **60.53 / 75.42** | **53.81 / 74.62** |
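
For reference, EM and F1 follow the standard extractive-QA definitions: EM checks whether the normalized prediction matches the normalized reference exactly, while F1 is the harmonic mean of token-overlap precision and recall. Below is a minimal sketch of the usual SQuAD-style computation; the normalization shown is English-oriented, and the tokenization actually used for Thai, Indonesian, and Vietnamese may differ.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace
    (SQuAD-style; non-English normalization would differ)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```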

## Commonsense Reasoning

| 3-shot (EM) | XCOPA (th) | XCOPA (id) | XCOPA (vi) |
|-------------|------------|------------|------------|
| Random guess | 50.00 | 50.00 | 50.00 |
| Qwen1.5-0.5B | 51.00 | 52.20 | 53.80 |
| **Sailor-0.5B** | **51.00** | **58.20** | **58.00** |
| Qwen1.5-1.8B | 52.60 | 51.60 | 53.40 |
| **Sailor-1.8B** | **53.80** | **64.20** | **63.20** |
| Qwen1.5-4B | 53.40 | 55.00 | 57.80 |
| **Sailor-4B** | **53.40** | **69.20** | **68.20** |
| Llama-2-7B | 52.80 | 64.00 | 62.00 |
| Mistral-7B-v0.1 | 57.20 | 62.40 | 61.60 |
| SeaLLM-7B-Hybrid | 58.20 | 71.60 | 67.60 |
| SeaLLM-7B-v2 | 56.80 | 64.00 | 64.60 |
| Qwen1.5-7B | 54.20 | 62.20 | 66.20 |
| **Sailor-7B** | **59.00** | **72.20** | **72.20** |

## Reading Comprehension

| 3-shot (EM) | Belebele (th) | Belebele (id) | Belebele (vi) |
|-------------|---------------|---------------|---------------|
| Random guess | 25.00 | 25.00 | 25.00 |
| Qwen1.5-0.5B | 29.89 | 26.89 | 30.22 |
| **Sailor-0.5B** | **32.22** | **30.89** | **32.33** |
| Qwen1.5-1.8B | 30.11 | 32.00 | 31.33 |
| **Sailor-1.8B** | **34.22** | **34.89** | **35.33** |
| Qwen1.5-4B | 32.78 | 36.22 | 35.22 |
| **Sailor-4B** | **36.11** | **41.33** | **38.89** |
| Llama-2-7B | 31.78 | 39.78 | 38.00 |
| Mistral-7B-v0.1 | 34.33 | 41.33 | 41.33 |
| SeaLLM-7B-Hybrid | 37.78 | 43.11 | 43.00 |
| SeaLLM-7B-v2 | 36.33 | 43.11 | 47.00 |
| Qwen1.5-7B | 38.33 | 42.00 | 42.89 |
| **Sailor-7B** | **41.56** | **44.33** | **45.33** |

## Examination (Generation)

We have observed that the performance discrepancy of Sailor on M3Exam stems from a significant option bias, which leads the Sailor models to favor certain option IDs (e.g., always C) when making predictions; a simple diagnostic sketch follows the table below. For a fuller explanation of the Generation and Perplexity settings, please refer to our paper.

| 3-shot (EM) | M3Exam (th) | M3Exam (jv) | M3Exam (vi) |
|-------------|-------------|-------------|-------------|
| Qwen1.5-0.5B | 22.38 | 22.10 | 29.12 |
| **Sailor-0.5B** | **21.87** | **28.84** | **23.53** |
| Qwen1.5-1.8B | 23.81 | 26.15 | 36.39 |
| **Sailor-1.8B** | **23.90** | **29.65** | **27.67** |
| Qwen1.5-4B | 26.26 | 30.19 | 40.02 |
| **Sailor-4B** | **27.23** | **29.11** | **31.58** |
| Llama-2-7B | 21.13 | 23.99 | 34.15 |
| Mistral-7B-v0.1 | 29.59 | 31.00 | 43.54 |
| Typhoon-7B | 36.71 | -- | -- |
| VinaLLaMA-7B | -- | -- | 36.95 |
| Sea-Lion-7B | 23.90 | 21.56 | 26.89 |
| SeaLLM-7B-Hybrid | 25.98 | 24.53 | 38.79 |
| SeaLLM-7B-v2 | 35.60 | 29.92 | 50.36 |
| Qwen1.5-7B | 35.88 | 33.15 | 51.09 |
| **Sailor-7B** | **38.33** | **35.85** | **51.98** |
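
As a rough illustration of the option-bias diagnostic mentioned above, one can count how often each option ID appears among a model's generated answers; the answer parsing below is a hypothetical simplification of what an evaluation harness does.

```python
from collections import Counter

def option_id_distribution(predictions: list[str]) -> Counter:
    """Count predicted option IDs (e.g., A/B/C/D) across a test set.

    `predictions` holds the raw generated answers; taking the first
    character after stripping is a hypothetical, simplified parse.
    A heavily skewed distribution (e.g., mostly "C") signals option bias.
    """
    return Counter(p.strip().upper()[:1] for p in predictions)
```

An unbiased model should produce a distribution close to the distribution of gold option IDs in the test set.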

## Examination (Perplexity)

| 3-shot (EM) | M3Exam (th) | M3Exam (jv) | M3Exam (vi) |
|-------------|-------------|-------------|-------------|
| Random guess | 22.90 | 25.00 | 25.21 |
| Qwen1.5-0.5B | 22.93 | 25.07 | 26.66 |
| **Sailor-0.5B** | **24.41** | **26.15** | **30.91** |
| Qwen1.5-1.8B | 24.04 | 24.26 | 28.68 |
| **Sailor-1.8B** | **25.38** | **28.30** | **34.71** |
| Qwen1.5-4B | 24.50 | 24.26 | 30.02 |
| **Sailor-4B** | **27.88** | **31.27** | **40.69** |
| Llama-2-7B | 23.67 | 25.07 | 33.15 |
| Mistral-7B-v0.1 | 26.03 | 26.68 | 36.11 |
| SeaLLM-7B-Hybrid | 27.18 | 26.95 | 36.50 |
| SeaLLM-7B-v2 | 28.48 | 29.92 | 39.18 |
| Qwen1.5-7B | 25.75 | 26.15 | 36.28 |
| **Sailor-7B** | **30.00** | **32.88** | **44.10** |
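
Under the Perplexity setting, each candidate option is scored by its likelihood given the question, and the most likely option is chosen, which side-steps the option-ID bias noted above. Below is a minimal sketch, assuming a Hugging Face causal LM; the checkpoint name, prompt construction, and the prompt/option token boundary are illustrative and may differ from the actual harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name is illustrative; substitute the model under evaluation.
tokenizer = AutoTokenizer.from_pretrained("sail/Sailor-7B")
model = AutoModelForCausalLM.from_pretrained("sail/Sailor-7B")
model.eval()

@torch.no_grad()
def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the prompt.

    Assumes tokenizing `prompt + option` splits cleanly at the prompt
    boundary, which does not hold for every tokenizer.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    # log P(token_i | tokens_<i): logits at position i-1 predict token i.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that belong to the option continuation.
    return token_lp[0, prompt_len - 1:].sum().item()

def pick_option(prompt: str, options: list[str]) -> int:
    """Return the index of the highest-likelihood option."""
    return max(range(len(options)),
               key=lambda i: option_logprob(prompt, options[i]))
```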