murali1996/nlp-notes

A curated list of papers and experiments in the field of Natural Language Processing (NLP)

UPDATE

This file is no longer actively maintained. If you are interested in maintaining/updating it, feel free to raise PRs or reach out to jsaimurali001 [at] gmail [dot] com

READINGS_NLP

Word and Sentence Embeddings

word-level representations

  1. Natural Language Processing (almost) from Scratch, Collobert et al. 2011
  2. Word2Vec, Efficient Estimation of Word Representations in Vector Space, Mikolov et al. 2013a
  3. Word2Vec, Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. 2013b

💡 see Ruder's three-part explanation (p1, p2, p3) along with his Aylien blog, Chris McCormick's take on Negative Sampling here (along with resources to re-implement it), see here for backprop derivations in word2vec, and here to download pretrained embeddings (a minimal loading sketch follows this list)

  4. GloVe: Global Vectors for Word Representation, Pennington et al. 2014
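
💡 a minimal loading sketch (my own example, not part of the list above), assuming the gensim library; the file name below is only a placeholder for whichever pretrained embeddings you download:

# Assumes the gensim library and a locally downloaded pretrained vector file
# (the GoogleNews file name is just an example placeholder).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
print(vectors.most_similar("king", topn=5))   # nearest neighbours in the embedding space
print(vectors.similarity("king", "queen"))    # cosine similarity between two word vectors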

character-level representations

  1. Sequence Tagging with TensorFlow, https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html (Lample et al. 2016)
  2. ELMo, Deep contextualized word representations, Peters et al. 2018
  3. FLAIR, Contextual String Embeddings for Sequence Labeling, Akbik et al. 2018 [CODE]
  4. Character-Level Language Modeling with Deeper Self-Attention, Al-Rfou et al. 2018

subword-level representations

  1. FastText, Enriching Word Vectors with Subword Information, Bojanowski et al. 2016
  2. Neural Machine Translation of Rare Words with Subword Units, Sennrich et al. 2015 [also see this and this]

additional objectives

💡 de-biasing, robustness to spelling errors, etc.

  1. Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network, Sakaguchi et al. 2016
  2. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, Bolukbasi et al. 2016
  3. Learning Gender-Neutral Word Embeddings, Zhao et al. 2018
  4. Combating Adversarial Misspellings with Robust Word Recognition, Pruthi et al. 2019
  5. Misspelling Oblivious Word Embeddings, Edizel et al. 2019 [Facebook AI]

sentence representations

  1. Skip-Thought Vectors, Kiros et al. 2015
  2. A Structured Self-attentive Sentence Embedding, Lin et al. 2017
  3. InferSent, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, Conneau et al. 2017
  4. Hierarchical Attention Networks for Document Classification, Yang et al. 2016
  5. DisSent: Sentence Representation Learning from Explicit Discourse Relations, Nie et al. 2017
  6. USE, Universal Sentence Encoder, Cer et al. 2018 [also see Multilingual USE]

Multi-lingual word embeddings

  1. [fasttext embeddings]
  2. Polyglot: Distributed Word Representations for Multilingual NLP, Al-Rfou et al. 2013
  3. Density Matching for Bilingual Word Embedding, Zhou et al. 2019
  4. Word Translation Without Parallel Data, Conneau et al. 2017 [repo]
  5. Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion, Joulin et al. 2018
  6. Unsupervised Multilingual Word Embeddings, Chen & Cardie 2018 [repo]

Evaluation

💡 NLU and XLU

  1. GLUECoS: An Evaluation Benchmark for Code-Switched NLP, Khanuja et al. 2020
  2. XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation, Liang et al. 2020
  3. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization, Hu et al. 2020
  4. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, Wang et al. 2019
  5. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, Wang et al. 2018 [Site]
  6. XNLI: Evaluating Cross-lingual Sentence Representations, Conneau et al. 2018c
  7. SentEval: An Evaluation Toolkit for Universal Sentence Representations, Conneau et al. 2018a [Site]
  8. CLUE: A Language Understanding Evaluation Benchmark for Chinese

Interpretability and Ethics

💡 inductive bias, distillation and pruning, adversarial attacks, fairness and bias
💡 distillation (can be thought of as a MAP estimate with a prior, rather than a plain MLE objective); see the sketch below
↗️ see some papers on bias in Word and Sentence Embeddings section
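
💡 a toy numpy sketch (my illustration, not from these papers) of the standard soft-target distillation loss referenced in the note above: the student is trained to match the teacher's temperature-softened distribution; the temperature T and the T² scaling follow the usual convention:

import numpy as np

def softmax(logits, T=1.0):
    # temperature-scaled softmax, computed in a numerically stable way
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # cross-entropy between the softened teacher distribution and the
    # student's softened distribution, scaled by T^2 (the common convention)
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T

teacher = np.array([[5.0, 1.0, -2.0]])   # toy teacher logits
student = np.array([[2.0, 0.5, -1.0]])   # toy student logits
print(distillation_loss(student, teacher))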

inductive bias and generalization

  1. Dissecting Contextual Word Embeddings: Architecture and Representation, Peters et al. 2018b
  2. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties, Conneau et al. 2018b
  3. Troubling Trends in Machine Learning Scholarship, Lipton and Steinhardt 2018
  4. How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks, Kaushik and Lipton 2018
  5. Are Sixteen Heads Really Better than One?, Michel et al. 2019 [Blogpost]
  6. No Training Required: Exploring Random Encoders for Sentence Classification, Wieting et al. 2019
  7. BERT Rediscovers the Classical NLP Pipeline, Tenney et al. 2019
  8. Compositional Questions Do Not Necessitate Multi-hop Reasoning, Min et al. 2019
  9. Probing Neural Network Comprehension of Natural Language Arguments, Niven & Kao 2019 and [this] related article
  10. The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives, Voita et al. 2019
  11. Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study, Fu et al. 2019

interpreting attention

  1. Attention is not Explanation, Jain and Wallace 2019
  2. Is Attention Interpretable?, Serrano and Smith 2019
  3. Attention is not not Explanation, Wiegreffe and Pinter 2019
  4. Learning to Deceive with Attention-Based Explanations, Pruthi et al. 2020

adversarial attacks

  1. Combating Adversarial Misspellings with Robust Word Recognition, Pruthi et al. 2019
  2. Universal Adversarial Triggers for Attacking and Analyzing NLP, Wallace et al. 2019
  3. Weight Poisoning Attacks on Pre-trained Models, Kurita et al. 2020

model distillation and pruning

  1. Understanding Knowledge Distillation in Non-autoregressive Machine Translation, Zhou et al. 2019
  2. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, Tang et al. 2019. Also see a related work from HuggingFace here, and work on quantization-based compression by RASA here
  3. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes [also see this article]
  4. RoBERTa, A Robustly Optimized BERT Pretraining Approach, Liu et al. 2019
  5. Patient Knowledge Distillation for BERT Model Compression, Sun et al. 2019
  6. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, Lan et al. 2019

fairness and bias in models

  1. GROVER, Defending Against Neural Fake News, Zellers et al. 2019 [blogpost]

Contextual Representations and Transfer Learning

Language modeling

💡 Similar works are also compiled here: Pre-trained Language Model Papers
💡 Typically, these pre-training methods involve a self-supervised (also called semi-supervised/unsupervised in some works) learning stage followed by supervised fine-tuning. This is unlike the CV domain, where pre-training is mainly supervised learning. (A minimal pretrain-then-fine-tune sketch follows the list below.)

  1. CS224n lecture slides on contextual representations: https://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture13-contextual-representations.pdf
  2. Semi-supervised Sequence Learning, Dai et al. 2015
  3. Unsupervised Pretraining for Sequence to Sequence Learning, Ramachandran et al. 2016
  4. context2vec: Learning Generic Context Embedding with Bidirectional LSTM, Melamud et al. 2016
  5. InferSent, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, Conneau et al. 2017
  6. ULM-FiT, Universal Language Model Fine-tuning for Text Classification, Howard and Ruder 2018
  7. ELMo, Deep contextualized word representations, Peters et al. 2018 [also see previous works: TagLM and CoVe]
  8. GPT-1 aka OpenAI Transformer, Improving Language Understanding by Generative Pre-Training, Radford et al. 2018
  9. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. 2018 [SLIDES] [also see Illustrated BERT]
  10. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Dai et al. 2019
  11. GPT-2, Language Models are Unsupervised Multitask Learners, Radford et al. 2019 [also see Illustrated GPT-2]
  12. ERNIE: Enhanced Language Representation with Informative Entities, Zhang et al. 2019
  13. XLNet: Generalized Autoregressive Pretraining for Language Understanding, Yang et al. 2019
  14. RoBERTa: A Robustly Optimized BERT Pretraining Approach, Liu et al. 2019
  15. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, Sun et al. 2019
  16. CTRL: A Conditional Transformer Language Model for Controllable Generation, Keskar et al. 2019
  17. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, Lan et al. 2019
  18. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, Clark et al. 2019 [Google Blog]
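
💡 a minimal sketch of the pretrain-then-fine-tune recipe noted above, assuming the HuggingFace transformers and PyTorch libraries (neither is prescribed by these papers); the checkpoint name, toy batch and label count are illustrative only:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# start from a self-supervised pretrained encoder and attach a supervised classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["a great movie", "a dull movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)   # supervised fine-tuning loss on top of pretrained weights
outputs.loss.backward()                   # gradients flow into the pretrained encoder as well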

+ supervised objectives

💡 Some people went ahead and asked: how about using supervised (± self-supervised) tasks for pretraining?!

  1. InferSent, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, Conneau et al. 2017
  2. USE, Universal Sentence Encoder, Cer et al. 2018 [also see Multilingual USE]
  3. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks, Phang et al. 2018

BERT & Transformers

💡 see Interpretability and Ethics section for more papers

BERTology

▶️ BERT-related papers compilation

  1. E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT, Poerner et al. 2020
  2. A Primer in BERTology: What we know about how BERT works, Rogers et al. 2020
  3. Comparing BERT against traditional machine learning text classification, Carvajal et al. 2020
  4. Revisiting Few-sample BERT Fine-tuning, Zhang et al. 2020

Transformers

  1. The Evolved Transformer, So et al. 2019
  2. R-Transformer: Recurrent Neural Network Enhanced Transformer, Wang et al. 2019
  3. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Raffel et al. 2019
  4. The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives, Voita et al. 2019
  5. Reformer: The Efficient Transformer, Kitaev et al. 2020

Active Learning

💡 dataset distillation :p

  1. Deep Active Learning for Named Entity Recognition, Shen et al. 2017
  2. Learning how to Active Learn: A Deep Reinforcement Learning Approach, Fang et al. 2017
  3. An Ensemble Deep Active Learning Method for Intent Classification, Zhang et al. 2019

Multi-task learning

  1. decaNLP, The Natural Language Decathlon: Multitask Learning as Question Answering, McCann et al. 2018
  2. HMTL, A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks, Sanh et al. 2018
  3. GenSen, Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning, Subramanian et al. 2018
  4. Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling, Wang et al. 2019
  5. GPT-2, Language Models are Unsupervised Multitask Learners, Radford et al. 2019 [also see Illustrated GPT-2]
  6. Unified Language Model Pre-training for Natural Language Understanding and Generation, Dong et al. 2019
  7. MASS: Masked Sequence to Sequence Pre-training for Language Generation, Song et al. 2019
  8. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, Sun et al. 2019
  9. T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Raffel et al. 2019 [code]

Generation

text generation

  1. Incorporating Copying Mechanism in Sequence-to-Sequence Learning, Gu et al. 2016
  2. Quantifying Exposure Bias for Neural Language Generation, He et al. 2019
  3. CTRL: A Conditional Transformer Language Model for Controllable Generation, Keskar et al. 2019
  4. Plug and Play Language Models: A Simple Approach to Controlled Text Generation, Dathathri et al. 2019

dialogue systems

  1. Zero-shot User Intent Detection via Capsule Neural Networks, Xia et al. 2018
  2. Investigating Capsule Networks with Dynamic Routing for Text Classification, Zhao et al. 2018
  3. BERT for Joint Intent Classification and Slot Filling, Chen et al. 2019
  4. Few-Shot Generalization Across Dialogue Tasks, Vlasov et al. 2019 [RASA Research]
  5. Towards Open Intent Discovery for Conversational Text, Vedula et al. 2019
  6. What makes a good conversation? How controllable attributes affect human judgments, See et al. 2019 [also see this article]

machine translation

  1. Sequence to Sequence Learning with Neural Networks, Sutskever et al. 2014
  2. Addressing the Rare Word Problem in Neural Machine Translation, Luong et al. 2014
  3. Neural Machine Translation of Rare Words with Subword Units, Sennrich et al. 2015
  4. Transformer, Attention Is All You Need, Vaswani et al. 2017
  5. Understanding Back-Translation at Scale, Edunov et al. 2018
  6. Achieving Human Parity on Automatic Chinese to English News Translation, Microsoft Research 2018 [Bites] [also see this and this]

Knowledge Graphs

💡 LMs realized as diverse learners; learning more than what you thought!!

  1. Language Models as Knowledge Bases?, Petroni et al. 2019

Multi-lingual and cross-lingual learning

multilingual

  1. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, Artetxe et al. 2018
  2. How multilingual is Multilingual BERT?, Pires et al. 2019
  3. Multilingual Universal Sentence Encoder (USE) for Semantic Retrieval, Yang et al. 2019
  4. How Language-Neutral is Multilingual BERT?, Libovicky et al. 2020
  5. Universal Phone Recognition with a Multilingual Allophone System, Li et al. 2020

Cross-Lingual

  1. http://ruder.io/cross-lingual-embeddings/index.html
  2. XLM, Cross-lingual Language Model Pretraining, Lample and Conneau 2019
  3. Cross-Lingual Ability of Multilingual BERT: An Empirical Study, Karthikeyan et al. 2019
  4. XQA: A Cross-lingual Open-domain Question Answering Dataset, Liu et al. 2019

Multi-modal learning

  1. Representation Learning with Contrastive Predictive Coding, Oord et al. 2018
  2. M-BERT: Injecting Multimodal Information in the BERT Structure, Rahman et al. 2019
  3. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, Tan and Bansal 2019
  4. BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations, Scialom et al. 2020

Question Answering

  1. A Deep Neural Network Framework for English Hindi Question Answering
  2. DrQA, Reading Wikipedia to Answer Open-Domain Questions, Chen et al. 2017
  3. GoldEn Retriever, Answering Complex Open-domain Questions Through Iterative Query Generation, Qi et al. 2019
  4. BREAK It Down: A Question Understanding Benchmark, Wolfson et al. 2020
  5. XQA: A Cross-lingual Open-domain Question Answering Dataset, Liu et al. 2019

Notes

Quick Bites

  1. Byte Pair Encoding (BPE) is a data compression technique that iteratively replaces the most frequent pair of symbols (originally bytes) in a given dataset with a single unused symbol. In each iteration, the algorithm finds the most frequent adjacent pair of symbols (each of which can be a single character or a sequence of characters) and merges them into a new symbol. All occurrences of the selected pair are then replaced with the new symbol before the next iteration. Eventually, frequent sequences of characters, up to whole words, are replaced with single symbols, until the algorithm reaches the defined number of merges (50k is an example figure). During inference, if a word isn't part of BPE's pre-built dictionary, it is split into subwords that are. The code of BPE can be found here. See the Overall Idea blog-post, the BPE-specific blog-post and the BPE code for more details. The snippet below walks through a few merges by hand, and a compact merge-loop sketch follows it.
import re

# Start from character-level symbols: every word is split into characters plus an
# end-of-word marker </w>; the last two entries are extra toy examples.
words0 = [" ".join([char for char in word] + ["</w>"]) for word in "in the rain in Ukraine".split()] + ["i n"] + ["<w> i n"]
print(words0)

# Merge 1: replace the symbol pair ('i', 'n') with the new symbol 'in'.
# The lookarounds (?<!\S) and (?!\S) ensure we only match whole, space-delimited symbols.
eword1 = re.escape('i n')
p1 = re.compile(r'(?<!\S)' + eword1 + r'(?!\S)')
words1 = [p1.sub('in', word) for word in words0]
print(words1)

# Merge 2: replace the pair ('in', '</w>') with 'in</w>'.
eword2 = re.escape('in </w>')
p2 = re.compile(r'(?<!\S)' + eword2 + r'(?!\S)')
words2 = [p2.sub('in</w>', word) for word in words1]
print(words2)

# Merge 3: replace the pair ('a', 'in</w>') with 'ain</w>'.
eword3 = re.escape('a in</w>')
p3 = re.compile(r'(?<!\S)' + eword3 + r'(?!\S)')
words3 = [p3.sub('ain</w>', word) for word in words2]
print(words3)
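
💡 and a compact sketch of the merge-learning loop itself (my adaptation, in the spirit of the reference code from Sennrich et al.; the toy vocabulary mirrors the example above):

import collections
import re

def get_stats(vocab):
    # count the frequency of every adjacent symbol pair in the current vocabulary
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    # replace every occurrence of the chosen pair with a single merged symbol
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {p.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'i n </w>': 2, 't h e </w>': 1, 'r a i n </w>': 1, 'U k r a i n e </w>': 1}
for _ in range(3):                        # number of merges; ~50k in practice
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best, vocab)
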
  2. re library
re.search(), re.findall(), re.split(), re.sub()
re.escape(), re.compile()

101, Positive and Negative Lookahead/Lookbehind
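
💡 a few toy examples (mine, not from the notes) of the re functions listed above plus lookahead/lookbehind assertions:

import re

text = "price: 30 USD, tax: 5 USD"
print(re.search(r"\d+", text).group())        # '30'  -> first match only
print(re.findall(r"\d+", text))               # ['30', '5'] -> all matches
print(re.split(r",\s*", text))                # split on commas
print(re.sub(r"USD", "EUR", text))            # replace every 'USD'

# (?<=...) positive lookbehind, (?=...) positive lookahead:
print(re.findall(r"(?<=: )\d+(?= USD)", text))   # ['30', '5'] -> digits between ': ' and ' USD'
# (?<!...) / (?!...) are the negative variants, as used in the BPE snippet above.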

  3. Models can be trained on SNLI in two different ways: (i) sentence encoding-based models that explicitly separate the encoding of the individual sentences, and (ii) joint methods that use the encodings of both sentences together (cross-features or attention from one sentence to the other). A toy sketch contrasting the two follows.
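
💡 a toy numpy sketch (my illustration) of the two styles: (i) pool each sentence separately and combine the fixed vectors, InferSent-style, vs (ii) let the model attend across sentences before classifying:

import numpy as np

rng = np.random.default_rng(0)
premise = rng.normal(size=(5, 8))      # 5 tokens, 8-dim embeddings (toy values)
hypothesis = rng.normal(size=(7, 8))   # 7 tokens, 8-dim embeddings

# (i) sentence-encoding based: pool each sentence separately, then combine
u, v = premise.mean(axis=0), hypothesis.mean(axis=0)
features = np.concatenate([u, v, np.abs(u - v), u * v])   # fed to a classifier

# (ii) joint: cross-attention from hypothesis tokens over premise tokens
scores = hypothesis @ premise.T                            # (7, 5) alignment scores
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
aligned = attn @ premise                                   # premise context per hypothesis token
print(features.shape, aligned.shape)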

Food For Thought

  1. How well do ranking algorithms, i.e. the ones with pointwise/pairwise/listwise learning paradigms, perform when the number of test classes grows massively at inference time? KG reasoning using translational/bilinear/DL techniques is one important area under consideration.
  2. While the chosen neural architecture is important, the techniques used for training the objective (e.g. Word2Vec) and the techniques used during loss optimization (e.g. the OpenAI Transformer) play a significant role in achieving both fast and good convergence.
  3. Commonality between language modelling, machine translation and Word2Vec: all of them have a huge vocabulary at the output, and there is a need to alleviate the cost of computing the huge softmax layer! See Ruder's page for a quick read; a toy sketch contrasting the full softmax with negative sampling follows this list.
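
💡 a toy numpy sketch (my illustration) of why the output softmax hurts and how negative sampling sidesteps it: the full softmax scores all V words, while negative sampling only touches the target row plus a handful of sampled rows:

import numpy as np

rng = np.random.default_rng(0)
V, d = 100_000, 64                     # vocabulary size, hidden size
W = rng.normal(scale=0.1, size=(V, d)) # output embedding matrix
h = rng.normal(size=d)                 # hidden state / context vector
target = 42

# full softmax: requires scoring all V words
logits = W @ h
lse = logits.max() + np.log(np.exp(logits - logits.max()).sum())   # log-sum-exp over all V scores
full_loss = lse - logits[target]                                   # -log p(target)

# negative sampling: score the target plus k randomly sampled negatives only
k = 5
negatives = rng.integers(0, V, size=k)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
ns_loss = -np.log(sigmoid(W[target] @ h)) - np.log(sigmoid(-(W[negatives] @ h))).sum()
print(full_loss, ns_loss)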

Bookmarks

codes, videos and MOOCs

each link is either a series of blog posts from an individual/organization, a conference-related link, or a MOOC

selected blog-posts

miscellaneous