Cantoformer
廣東話嘅語言 AI

Recent advances in AI enable smarter text-based applications. However, most of them target English, owing to the abundance of English text available on the Internet.

This repository explores language models (LMs) for Cantonese (Yue Chinese, 廣東話), a language predominantly spoken in Guangzhou, Hong Kong and Macau whose linguistic properties are very challenging for AI to learn.

AI 喺呢幾年發展得好快,好多嘢都話用 AI 處理會醒好多,但其實喺「語言處理」嘅領域入面,好多嘅資源都只係得英文,所以要落手做廣東話嘅 NLP,其實唔容易。

This repo was therefore opened to encourage more people to develop AI for Cantonese.
所以諗住喺呢度開個 Repo ,鼓勵更多人開發廣東話 AI。

Challenges

  • Mixed Languages (English, Chinese, Yue)
    夾雜多種語言
  • Complex Syntax
    語法複雜
  • Scarce Resource
    資源稀少
  • Many Homonyms & Homophones in online texts
    網上嘅字通常有好多一語多義/同音異字

Remediation

We apply the following preprocessing to text before it reaches the model:
用呢個 model 前我哋會對文字做一啲嘅處理:

  • WordPiece Tokenizer from forked 🤗Tokenizers which,

    • strips accents like the original BERT
      除去組合附加符號 (e.g. à → a)

    • uses lower casing
      使用細階英文

    • treats each symbol/number as a separate token
      符號/數字全部當係一個 token

    • converts Simplified Chinese → Traditional Chinese (since most of our corpus is in Traditional Chinese)
      簡轉繁(因為文本大部分都係繁體字)

      Using OpenCC v1.1.1 from here

    • normalizes Unicode Characters (Some are hand-crafted) by
      統一中文字符(其中一啲係人手分類)

      • Symbols of the same functionality 相同功能嘅符號 (e.g. [ )
      • Variant Chinese characters 異體字 (e.g. )
      • Decomposing rare characters 將罕見字拆開 (e.g. 亻春 )

      (Mapping here)

  • Newlines are regarded as a token, i.e. <nl>
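
The accent stripping, lower casing, symbol/number splitting and newline handling listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the forked tokenizer's implementation: the function names are hypothetical, emitting each digit as its own token is an assumption about what "separate token" means, and the Simplified-to-Traditional conversion, the hand-crafted character mapping and the WordPiece step are all left out.

```python
import unicodedata
from typing import List

NEWLINE_TOKEN = "<nl>"  # newline marker named in the list above


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition (e.g. 'à' -> 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_standalone(ch: str) -> bool:
    """True for digits, punctuation and symbols, which become their own token."""
    return ch.isdigit() or unicodedata.category(ch)[0] in ("P", "S")


def pre_tokenize(text: str) -> List[str]:
    """Lower-case, strip accents, then split into coarse pre-tokens.

    Hypothetical helper: it does not split CJK characters and does not
    run WordPiece; it only mirrors the normalization bullets.
    """
    text = strip_accents(text.lower())
    tokens: List[str] = []
    word: List[str] = []

    def flush() -> None:
        # Emit the word collected so far, if any.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if ch == "\n":
            flush()
            tokens.append(NEWLINE_TOKEN)  # a newline is kept as a token
        elif ch.isspace():
            flush()
        elif is_standalone(ch):
            flush()
            tokens.append(ch)  # one token per symbol or digit (assumption)
        else:
            word.append(ch)
    flush()
    return tokens
```

For example, `pre_tokenize("Café 2020!\nOK")` yields `cafe`, four single-digit tokens, `!`, `<nl>` and `ok`.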

Framework to be used

  • TensorFlow
  • PyTorch

Libraries to be used

  • OpenCC (Simpl-to-Trad, 簡轉繁) @ v1.1.1
  • 🤗Tokenizers (forked version is used for normalization)

# Install OpenCC v1.1.1
sudo bash ./install_opencc.sh

# Install the forked 🤗 Tokenizers
pip3 install 'git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python'
# This takes some time!

# The fork is based on tokenizers@v0.8.1,
# with the Python package renamed to tokenizers_zh

Corpus

| zh | en |
|---|---|
| ~ 80 GB (incl. ~ 20 GB Cantonese) | ~ 100 GB |

Evaluation

Since we have no evaluation datasets in Cantonese, we evaluate the models on both English and Chinese datasets:

  • MNLI (Entailment Prediction)
  • DRCD (Reading Comprehension)
  • SQuAD-v2 (Reading Comprehension)
  • CMRC2018 (Reading Comprehension)
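
The reading-comprehension results further down are reported as EM/F1. The sketch below shows how these two scores are commonly computed for a single prediction, under simplifying assumptions: the function names and the `tokenize` parameter are hypothetical, and the official scripts of each dataset apply extra answer normalization and take the best score over several reference answers, which is omitted here.

```python
from collections import Counter
from typing import Callable, List


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the two answers are identical after trimming whitespace."""
    return float(prediction.strip() == reference.strip())


def f1_score(
    prediction: str,
    reference: str,
    tokenize: Callable[[str], List[str]] = str.split,
) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Pass tokenize=list to compare character by character, a common
    choice for Chinese text that has no spaces between words.
    """
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score would be the average of these per-question values, expressed as a percentage.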

Something to explore

  1. Sentence Order Prediction (SOP)

    SOP is a pretraining objective used in ALBERT. StructBERT also introduces a Sentence Structural Objective, but since the ELECTRA code reads the data sequentially, this repo explores SOP first.

  2. Cluster Objective

    DocProduct is a cool project that trains a BERT model to cluster similar Q&A: if a text A answers the question Q, then Q and A will be close in vector representation.

    This means the model must predict the possible contexts (before and after) in order to embed a vector that minimizes the cost function.

    For details, refer to the DocProduct repo.
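
Both objectives above can be illustrated with a small sketch. The first function builds SOP examples the way ALBERT describes them: two consecutive segments in their original order are a positive, and the same two segments swapped are a negative. The second function is one common way to pull matching Q and A vectors together, an in-batch softmax over dot products; it is an assumption chosen for illustration and not necessarily the loss DocProduct uses. All names and the 1/0 label convention are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def make_sop_examples(
    segments: Sequence[str], rng: random.Random
) -> List[Tuple[str, str, int]]:
    """Turn consecutive segments into (first, second, label) triples.

    label 1: the pair is in its original order.
    label 0: the same pair with the two segments swapped.
    """
    examples = []
    for first, second in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))
        else:
            examples.append((second, first, 0))
    return examples


def in_batch_contrastive_loss(
    q_vecs: Sequence[Sequence[float]], a_vecs: Sequence[Sequence[float]]
) -> float:
    """Mean cross-entropy where answer i is the positive for question i.

    Every other answer in the batch acts as a negative, so the loss is
    small only when each Q scores highest against its own A.
    """
    losses = []
    for i, q in enumerate(q_vecs):
        scores = [sum(x * y for x, y in zip(q, a)) for a in a_vecs]
        max_score = max(scores)  # subtracted for numerical stability
        log_norm = max_score + math.log(
            sum(math.exp(s - max_score) for s in scores)
        )
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)
```

Because SOP only needs neighbouring segments, it fits a data pipeline that reads text sequentially; the contrastive loss needs paired inputs and a batch to draw negatives from.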

To Do List

  • Normalize Chinese characters
  • ELECTRA-small
  • ELECTRA-base
  • ELECTRA-base-sop
  • ELECTRA-albert-base
  • ELECTRA-albert-xlarge
  • ELECTRA-base-cluster
  • ELECTRA-large
  • Evaluation in Cantonese dataset
  • Upload to 🤗Huggingface

Model Comparisons

| Model | params # | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) | CMRC2018-dev (EM/F1) |
|---|---|---|---|---|---|---|
| 🐤 BERT (s) | 12M | 12/256 | 77.6 | | 60.5/64.2🤗 | |
| 🐦 BERT (b) | 110M | 12/768 | 84.3 | 85.0/91.2 | 72.4/75.8🤗 | |
| 🦅 BERT (l) | 334M | 12/1024 | 87.1 | | 92.8/86.7 | |
| 🐦 roBERTa (b) | 110M | 12/768 | 87.6 | 86.6/92.5 | 78.5/81.7🤗 | |
| 🦅 roBERTa (l) | 335M | 24/1024 | 90.2 | | 88.9/94.6 | |
| 🐤 alBERT (b) | 12M | 12/768 | 84.6 | | 79.3/82.1 | |
| 🐤 alBERT (l) | 18M | 24/1024 | 86.5 | | 81.8/84.9 | |
| 🐦 alBERT (xl) | 60M | 24/2048 | 87.9 | | 84.1/87.9 | |
| 🦅 alBERT (xxl) | 235M | 12/4096 | 90.6 | | 86.9/89.8 | |
| 🐤 ELECTRA (s) | 14M | 12/256 | 81.6 | 83.5/89.2 | 69.7/73.4🤗 | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 88.5 | 89.6/94.2 | 80.5/83.3 | 69.3/87.0 |
| 🦅 ELECTRA (l) | 335M | 24/1024 | 90.7 | 88.8/93.3 | 88.0/90.6 | |
| 🐦 XLM-R (b) | 270M | 12/768 | | | | |
| 🦅 XLM-R (l) | 550M | 24/1024 | 89.0 | | | |
| Ours (1.2M) | | | | | | |
| 🐤 ELECTRA (s) | 14M | 12/256 | 80.7 | 82.1/88.0 | 69.4/72.1 | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.3 | 88.2/92.5 | 80.4/83.3 | |
| 🐦 albert (xl) | 60M | 12/2048 | 87.7 | 89.9/94.7 | 82.9/85.9 | |
| Ours (1.5M) | | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 | 67.4/86.7 |
| + finetuned after SQuAD | | | | 89.5/94.1 | | 70.2/88.5 |

Individual Comparisons

Small Models 🐤

| Model | params # | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) |
|---|---|---|---|---|---|
| 🐤 BERT (s) | 12M | 12/256 | 77.6 | | 60.5/64.2🤗 |
| 🐤 alBERT (b) | 12M | 12/768 | 84.6 | | 79.3/82.1 |
| 🐤 alBERT (l) | 18M | 24/1024 | 86.5 | | 81.8/84.9 |
| 🐤 ELECTRA (s) | 14M | 12/256 | 81.6 | 83.5/89.2 | 69.7/73.4🤗 |
| Ours | | | | | |
| 🐤 ELECTRA (s) | 14M | 12/256 | 80.7 | 82.1/88.0 | 69.4/72.1 |

Base Models 🐦

| Model | params # | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) | CMRC2018-dev (EM/F1) |
|---|---|---|---|---|---|---|
| 🐦 BERT (b) | 110M | 12/768 | 84.3 | 85.0/91.2 | 72.4/75.8🤗 | |
| 🐦 roBERTa (b) | 110M | 12/768 | 87.6 | 86.6/92.5 | 78.5/81.7🤗 | 67.4/87.2 |
| 🐦 ELECTRA (b) | 110M | 12/768 | 88.5 | 89.6/94.2 | 80.5/83.3 | 69.3/87.0 |
| Ours | | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.3 | 88.2/92.5 | 80.4/83.3 | |
| Ours (1.5M) | | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 | 67.4/86.7 |
| + finetuned after SQuAD | | | | 89.5/94.1 | | 70.2/88.5 |

Downloads 🐤🐦

ELECTRA checkpoints are available here on Google Drive.

ELECTRA-albert checkpoints are available here on Google Drive.


Explorations

| Model | params # | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) |
|---|---|---|---|---|---|
| Ours (1.5M) | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 |
| + finetuned after SQuAD | | | | 89.5/94.1 | |
| Ours (1.5M) + SOP | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 87.1 | 88.6/93.6 | 80.4/83.2 |
| + finetuned after SQuAD | | | | 89.7/94.1 | |

References

Expected Losses / Training Curves during Pre-Training.

google-research/electra#3


Credits

Special thanks to Google's TensorFlow Research Cloud (TFRC) for providing TPU-v3 for all the training in this repo!