GitHub - paramiai/cantoformer: Transformers for Cantonese

Cantoformer
Language AI for Cantonese (廣東話嘅語言 AI)

Recent advances in AI have enabled smarter text-based applications, but most of them target English because English text is so abundant on the Internet. Language-processing resources for other languages are much scarcer, which makes Cantonese NLP hard to get started with.

This repository explores language modeling (LM) in Cantonese (Yue Chinese, 廣東話), a language predominantly spoken in Guangzhou, Hong Kong and Macau, whose linguistic properties are very challenging for AI to learn. We opened it to encourage more people to develop Cantonese AI.

Challenges

Mixed languages (English, Chinese, Yue)
Complex syntax
Scarce resources
Many homonyms and homophones in online texts

Remediation

We apply the following preprocessing to text before it reaches the model:

WordPiece tokenizer from a forked 🤗Tokenizers, which
- strips accents like the original BERT (e.g. à → a)
- lowercases English text
- treats symbols and numbers as separate tokens
- converts Simplified Chinese to Traditional Chinese with OpenCC v1.1.1, since most of our corpus is in Traditional Chinese
- normalizes Unicode characters (some mappings are hand-crafted):
  - Symbols with the same function (e.g. 【 → [ )
  - Variant Chinese characters (e.g. 俢 → 修 )
  - Decomposing rare characters (e.g. 偆 → 亻春 )
  (Mapping here)
Newlines are regarded as a token, i.e. <nl>
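The normalization steps above can be sketched roughly as follows. This is a minimal standard-library illustration, not the fork's implementation: `CHAR_MAP`, `normalize` and `pre_tokenize` are hypothetical names, the mapping holds only the three examples listed above, the Simplified-to-Traditional step is left out because it needs OpenCC, and splitting every digit into its own piece is just one reading of the symbols/numbers rule. WordPiece would then run on the resulting pieces.

```python
import re
import unicodedata
from typing import List

# Hypothetical, tiny mapping holding only the three examples given above;
# the repo's real mapping is much larger.
CHAR_MAP = {
    "【": "[",     # symbols with the same function
    "俢": "修",    # variant Chinese character
    "偆": "亻春",  # rare character decomposed into components
}

NEWLINE_TOKEN = "<nl>"


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop combining marks, e.g. à -> a."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize(text: str) -> str:
    """Strip accents, lowercase, then apply the character mapping.

    Simplified-to-Traditional conversion is intentionally omitted here.
    """
    text = strip_accents(text).lower()
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def pre_tokenize(text: str) -> List[str]:
    """Split normalized text into pieces, isolating symbols, digits and newlines."""
    pieces: List[str] = []
    for line_no, line in enumerate(normalize(text).split("\n")):
        if line_no > 0:
            pieces.append(NEWLINE_TOKEN)
        # Either a run of letters, or any single non-space character
        # (assumption: each symbol or digit becomes its own piece).
        pieces.extend(re.findall(r"[^\W\d_]+|\S", line))
    return pieces
```

Calling `pre_tokenize` on a string that contains a line break yields the pieces of each line with `<nl>` between them.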

Frameworks to be used

TensorFlow
PyTorch

Libraries to be used

OpenCC (Simplified-to-Traditional conversion) @ v1.1.1
🤗Tokenizers (a forked version is used for normalization)

# Install OpenCC v1.1.1
sudo bash ./install_opencc.sh

# Install the forked 🤗 Tokenizers
pip3 install 'git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python'
# This takes some time!

# The fork is based on tokenizers@v0.8.1,
# with the Python package renamed to tokenizers_zh
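Because the build takes a while, it can help to confirm that the renamed package is visible to Python before running anything else. The snippet below is only a small sketch: `is_installed` is a hypothetical helper, and it checks that the module can be located, not that the fork works correctly.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # `tokenizers_zh` is the package name stated above for the fork.
    print("tokenizers_zh installed:", is_installed("tokenizers_zh"))
```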

Corpus

Chinese (zh)	English (en)
~ 80 GB (incl. ~ 20 GB Cantonese)	~ 100 GB

Evaluation

Since we have no evaluation datasets in Cantonese, we evaluate the models on both English and Chinese datasets:

MNLI (Entailment Prediction)
DRCD (Reading Comprehension)
SQuAD-v2 (Reading Comprehension)
CMRC2018 (Reading Comprehension)
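The reading-comprehension results in this README are reported as EM/F1. As a rough guide to what those two numbers measure, here is a simplified sketch of exact match and overlap F1 for a single prediction. It is not any dataset's official evaluation script (those apply extra text normalization and handle multiple reference answers), and the `by_char` switch for Chinese text is an assumption of this sketch.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())


def f1_score(prediction: str, reference: str, by_char: bool = False) -> float:
    """Harmonic mean of precision and recall over shared tokens.

    Tokens are whitespace-separated words, or single characters when
    by_char=True (a common choice for Chinese answers).
    """
    if by_char:
        pred_tokens = list(prediction.replace(" ", ""))
        ref_tokens = list(reference.replace(" ", ""))
    else:
        pred_tokens = prediction.split()
        ref_tokens = reference.split()

    # Empty answers (e.g. unanswerable questions) only match each other.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are the averages of these per-example values, expressed as percentages.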

Things to explore

Sentence Order Prediction (SOP)

SOP is a pretraining objective used in ALBERT. StructBERT also introduces a Sentence Structural Objective, but since the ELECTRA code reads the data sequentially, this repo explores SOP first.
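To make the objective concrete, the sketch below builds SOP training pairs from consecutive text segments: a pair kept in its original order is a positive example, and the same pair swapped is a negative one. `make_sop_examples`, the 50% swap rate and the label convention (1 = in order, 0 = swapped) are assumptions of this illustration, not code from this repo.

```python
import random
from typing import List, Optional, Tuple


def make_sop_examples(
    segments: List[str], rng: Optional[random.Random] = None
) -> List[Tuple[str, str, int]]:
    """Pair consecutive segments; label 1 if kept in order, 0 if swapped."""
    rng = rng or random.Random()
    examples = []
    for first, second in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))  # original order
        else:
            examples.append((second, first, 0))  # swapped order
    return examples
```

Because consecutive segments are already adjacent when data is read sequentially, this kind of pairing needs no lookup elsewhere in the corpus.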
Cluster Objective

DocProduct is a project that trains a BERT model to cluster similar Q&A pairs: if a text A answers the question Q, then Q and A should be close in vector representation.

This means the model must predict the possible contexts (before and after) in order to produce an embedding that minimizes the cost function.

For details, refer to the DocProduct repo.
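One common way to express "matching Q and A should be close" is an in-batch contrastive loss, sketched below in plain Python. This is an illustrative assumption and not necessarily the loss DocProduct uses: each question is scored against every answer in the batch by dot product, and cross-entropy pushes the matching answer (same index) to score highest. `in_batch_loss` and `dot` are hypothetical names.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_loss(
    q_vecs: List[Sequence[float]], a_vecs: List[Sequence[float]]
) -> float:
    """Mean cross-entropy where question i should score highest with answer i."""
    total = 0.0
    for i, q in enumerate(q_vecs):
        scores = [dot(q, a) for a in a_vecs]
        max_score = max(scores)  # subtract the max for numerical stability
        log_norm = max_score + math.log(
            sum(math.exp(s - max_score) for s in scores)
        )
        total += log_norm - scores[i]  # negative log-softmax of the true pair
    return total / len(q_vecs)
```

The loss is lower when each question vector points toward its own answer vector than when the pairs are mismatched.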

To Do List

Model Comparisons

	Model	params #	L/H	MNLI-en	DRCD-dev (EM/F1)	SQuADv2-dev (EM/F1)	CMRC2018-dev (EM/F1)
🐤	BERT (s)	12M	12/256	77.6		60.5/64.2🤗
🐦	BERT (b)	110M	12/768	84.3	85.0/91.2	72.4/75.8🤗
🦅	BERT (l)	334M	24/1024	87.1		92.8/86.7

🐦	roBERTa (b)	110M	12/768	87.6	86.6/92.5	78.5/81.7🤗
🦅	roBERTa (l)	335M	24/1024	90.2		88.9/94.6

🐤	alBERT (b)	12M	12/768	84.6		79.3/82.1
🐤	alBERT (l)	18M	24/1024	86.5		81.8/84.9
🐦	alBERT (xl)	60M	24/2048	87.9		84.1/87.9
🦅	alBERT (xxl)	235M	12/4096	90.6		86.9/89.8

🐤	ELECTRA (s)	14M	12/256	81.6	83.5/89.2	69.7/73.4🤗
🐦	ELECTRA (b)	110M	12/768	88.5	89.6/94.2	80.5/83.3	69.3/87.0
🦅	ELECTRA (l)	335M	24/1024	90.7	88.8/93.3	88.0/90.6

🐦	XLM-R (b)	270M	12/768
🦅	XLM-R (l)	550M	24/1024	89.0

	Ours (1.2M)
🐤	ELECTRA (s)	14M	12/256	80.7	82.1/88.0	69.4/72.1
🐦	ELECTRA (b)	110M	12/768	86.3	88.2/92.5	80.4/83.3
🐦	albert (xl)	60M	12/2048	87.7	89.9/94.7	82.9/85.9

	Ours (1.5M)
🐦	ELECTRA (b)	110M	12/768	86.8	88.5/93.3	80.8/83.7	67.4/86.7
	+ finetuned after SQuAD				89.5/94.1		70.2/88.5

Individual Comparisons

Small Models 🐤

	Model	params #	L/H	MNLI-en	DRCD-dev (EM/F1)	SQuADv2-dev (EM/F1)
🐤	BERT (s)	12M	12/256	77.6		60.5/64.2🤗

🐤	alBERT (b)	12M	12/768	84.6		79.3/82.1
🐤	alBERT (l)	18M	24/1024	86.5		81.8/84.9

🐤	ELECTRA (s)	14M	12/256	81.6	83.5/89.2	69.7/73.4🤗

	Ours
🐤	ELECTRA (s)	14M	12/256	80.7	82.1/88.0	69.4/72.1

Base Models 🐦

	Model	params #	L/H	MNLI-en	DRCD-dev (EM/F1)	SQuADv2-dev (EM/F1)	CMRC2018-dev (EM/F1)
🐦	BERT (b)	110M	12/768	84.3	85.0/91.2	72.4/75.8🤗

🐦	roBERTa (b)	110M	12/768	87.6	86.6/92.5	78.5/81.7🤗	67.4/87.2

🐦	ELECTRA (b)	110M	12/768	88.5	89.6/94.2	80.5/83.3	69.3/87.0

	Ours
🐦	ELECTRA (b)	110M	12/768	86.3	88.2/92.5	80.4/83.3
	Ours (1.5M)
🐦	ELECTRA (b)	110M	12/768	86.8	88.5/93.3	80.8/83.7	67.4/86.7
	+ finetuned after SQuAD				89.5/94.1		70.2/88.5

Downloads 🐤🐦

ELECTRA checkpoints are available here on Google Drive.

ELECTRA-ALBERT checkpoints are available here on Google Drive.

Explorations

	Model	params #	L/H	MNLI-en	DRCD-dev (EM/F1)	SQuADv2-dev (EM/F1)
	Ours (1.5M)
🐦	ELECTRA (b)	110M	12/768	86.8	88.5/93.3	80.8/83.7
	+ finetuned after SQuAD				89.5/94.1

	Ours (1.5M) + SOP
🐦	ELECTRA (b)	110M	12/768	87.1	88.6/93.6	80.4/83.2
	+ finetuned after SQuAD				89.7/94.1

References

Expected Losses / Training Curves during Pre-Training.

google-research/electra#3

Credits

Special thanks to Google's TensorFlow Research Cloud (TFRC) for providing TPU-v3 for all the training in this repo!
