NLP Notes (Updating)


Natural Language Generation


PPLM: github code

GEM Benchmark: paper


Story generation



Image Captioning


Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4565-4574).

Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3128-3137).

Code: DenseCap

Sentences to paragraph:

Krause, J., Johnson, J., Krishna, R., & Fei-Fei, L. (2017). A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 317-325).

Other related key words: image description, visual storytelling


Gehrmann, S., Adewumi, T., Aggarwal, K., Ammanamanchi, P. S., Anuoluwapo, A., Bosselut, A., … & Zhou, J. (2021). The gem benchmark: Natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672.

Automated metrics

Lexical similarity:

  • BLEU
    • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation.
  • ROUGE-1/2/L
    • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries.
    • ROUGE scores can be inflated by increasing the model's output length, since ROUGE is recall-oriented
  • METEOR
    • Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments.
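To see why output length matters for ROUGE, here is a minimal sketch of n-gram overlap scoring. It is not the official ROUGE or BLEU implementation: there is no stemming, no brevity penalty, and only a single reference. The function names and the toy sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_overlap(candidate, reference, n=1):
    """Count matching n-grams; each reference n-gram can be matched
    at most as many times as it occurs in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    return sum((cand & ref).values())

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N style recall: matched n-grams / n-grams in the reference."""
    total = len(ngrams(reference, n))
    return clipped_overlap(candidate, reference, n) / total if total else 0.0

def clipped_precision(candidate, reference, n=1):
    """BLEU-style modified precision: matched n-grams / n-grams in the candidate."""
    total = len(ngrams(candidate, n))
    return clipped_overlap(candidate, reference, n) / total if total else 0.0

# Toy data (hypothetical): a short output and a longer, padded output.
reference = "the cat sat on the mat".split()
short_out = "the cat".split()
long_out = "the cat sat there on a mat the the".split()

for name, out in [("short", short_out), ("long", long_out)]:
    r = rouge_n_recall(out, reference)
    p = clipped_precision(out, reference)
    print(f"{name}: recall={r:.3f} precision={p:.3f}")
```

Appending tokens can only add matches, so recall never goes down as the output grows, while precision pays for the padding. This is one reason to report a precision-aware score (F-measure, or BLEU with its brevity penalty) next to recall.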

Semantic equivalence:

  • BERTScore
    • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with BERT.
  • BLEURT
    • Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation.
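The core of BERTScore is greedy matching of token embeddings by cosine similarity. The sketch below shows only that matching step, assuming the embeddings are already computed. The 2-d vectors are toy values, not BERT outputs, and the optional idf weighting and baseline rescaling from the paper are left out. All names are hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(ref_vecs, cand_vecs):
    """sim[i][j] = cosine similarity of reference token i and candidate token j."""
    ref = [normalize(v) for v in ref_vecs]
    cand = [normalize(v) for v in cand_vecs]
    return [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]

def greedy_match_score(ref_vecs, cand_vecs):
    """Return (precision, recall, f1) from greedy best-match similarities."""
    sim = cosine_matrix(ref_vecs, cand_vecs)
    # Recall: every reference token picks its most similar candidate token.
    recall = sum(max(row) for row in sim) / len(ref_vecs)
    # Precision: every candidate token picks its most similar reference token.
    precision = sum(
        max(sim[i][j] for i in range(len(ref_vecs)))
        for j in range(len(cand_vecs))
    ) / len(cand_vecs)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy embeddings (assumption: one vector per token, already contextualised).
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [1.0, 1.0]]
print(greedy_match_score(ref_vecs, cand_vecs))
```

Unlike exact n-gram overlap, a candidate token gets partial credit for being close to a reference token in embedding space, which is what lets the metric reward paraphrases.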


Diversity:

  • Shannon Entropy over unigrams and bigrams (H_1, H_2)
    • Claude E Shannon and Warren Weaver. 1963. A mathematical theory of communication.
  • Mean Segmented Type Token Ratio over segment lengths of 100 (MSTTR)
    • Wendell Johnson. 1944. Studies in language behavior: A program of research.
  • The ratio of distinct n-grams over the total number of n-grams (Distinct_1, 2)
  • The count of n-grams that only appear once across the entire test output (Unique_1, 2)
    • Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models.
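The four metrics above are simple enough to sketch directly. This is a rough version, assuming all system outputs are pooled into one whitespace-tokenised list, entropy is measured in bits (log base 2), and MSTTR drops a trailing segment shorter than the segment length. The function names are made up and the GEM evaluation code may differ in these details.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shannon_entropy(tokens, n=1):
    """H_n: entropy (in bits) of the empirical n-gram distribution."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def msttr(tokens, segment_length=100):
    """Mean type-token ratio over consecutive full segments."""
    last_start = len(tokens) - segment_length
    segments = [tokens[i:i + segment_length]
                for i in range(0, last_start + 1, segment_length)]
    if not segments:
        return 0.0
    return sum(len(set(s)) / len(s) for s in segments) / len(segments)

def distinct_n(tokens, n=1):
    """Distinct_n: number of distinct n-grams / total number of n-grams."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0

def unique_n(tokens, n=1):
    """Unique_n: number of n-grams that occur exactly once."""
    counts = Counter(ngrams(tokens, n))
    return sum(1 for c in counts.values() if c == 1)

# Toy output (hypothetical); real use would pool the whole test output.
tokens = "a b a c".split()
print("H_1       :", shannon_entropy(tokens, 1))
print("H_2       :", shannon_entropy(tokens, 2))
print("MSTTR(2)  :", msttr(tokens, segment_length=2))
print("Distinct_1:", distinct_n(tokens, 1))
print("Unique_1  :", unique_n(tokens, 1))
```

Distinct_n is a ratio and so is sensitive to output length, while Unique_n is a raw count that grows with the size of the test set. Comparing systems is only meaningful on the same inputs.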

Human evaluation

Anya Belz, Simon Mille, and David M. Howcroft. 2020. Disentangling the properties of human evaluation methods: A classification system to support comparability, meta-evaluation and reproducibility testing.

David M. Howcroft, Anya Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, and Verena Rieser. 2020. Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions.

Anastasia Shimorina and Anya Belz. 2021. The human evaluation datasheet 1.0: A template for recording details of human evaluation experiments in NLP.


CEFR level

6 levels: A1 (beginner), A2 (pre-intermediate), B1 (intermediate), B2 (upper-intermediate), C1 (advanced), C2 (proficiency)


Structured prediction

Question answering