Foundations of Natural Language Processing - Lecture Notes

Written by Bora M. Alper in 2020 based on the lecture slides at https://www.inf.ed.ac.uk/teaching/courses/fnlp/.

Table of Contents

14 January 2020

17 January 2020

Corpora

Sentiment Analysis

Building a Sentiment Analyser

  1. What is the input for each prediction? Sentence? Full review text? Text + metadata?

  2. What are the possible outputs? + or -? stars?

  3. How will it decide?

    • When a system’s behaviour is determined solely by manual rules or databases, it is said to be rule-based, symbolic, or knowledge-driven (early days of computational linguistics)

    • Learning is the act of collecting statistics or patterns automatically from corpora to govern the system’s behaviour (dominant in most areas of contemporary NLP)

      • Supervised learning is when the data provides example input-output pairs (main focus of this course)
      • Core behaviour: training Refining behaviour: tuning
  4. How will you measure its effectiveness?

The last one, at least, requires data!

A Simple Sentiment Classification Algorithm

image-20200130082212859

Preprocessing & Normalisation

Choice of Training & Evaluation Data

21 January 2020

Use Cases of a Language Model

Spelling Correction

image-20200427221108590

Automatic Speech Recognition

image-20200427221201587

Machine Translation

image-20200427221235845

Prediction

Estimating Probabilities

Probability Theory vs Estimation

Notation

Relative Frequency Estimation

Problems
Sparse Data

N-Gram Models

Deriving an N-Gram Model

Beginning/End of Sequence
Costs (Negative Log Probabilities)
Problems

24 January 2020

Evaluating a Language Model

Types of Evaluation in NLP

Entropy

Entropy as Yes/No Questions

Estimates and Cross Entropy

Estimating Cross Entropy

For with large , per-word cross-entropy is well approximated by:

Perplexity

Interpreting Measures

Sparse Data, Again

Smoothing

Add-One (Laplace) Smoothing

where is the vocabulary size.

Problems
Add-α (Lidstone) Smoothing

Good-Turing Smoothing

28 January 2020

Problems with Good-Turing

Interpolation

Fitting Interpolation Parameters

Katz Back-Off

Diversity of Histories

Kneser-Ney Smoothing

Kneser-Ney In Practice

Distributed Representations and Word Similarity

SKIPPED. See 05_slides.pdf page 19.

Noisy Channel

image-20200428114346431

ApplicationYX
Speech Recognitionspoken wordsacoustic signal
Machine Translationwords in words in
Spelling Correctionintended wordstyped words

Noisy Channel as Probabilistic Inference

31 January 2020

Edit Distance

Finding an Optimal Alignment

Chart

image-20200430112819360

Filling First Cell

image-20200430113216885

Second Column

image-20200430113345332

Single Best Path

image-20200430113520265

Completed Chart

image-20200430114036749

Uses of Alignment and MED

Catch-22

General Formulation
Expectation-Maximisation (EM)
  1. Initialise parameters to arbitrary values (e.g. set all costs to 1)

  2. Using these parameters, compute optimal values for variables (run MED to get alignments).

  3. Now using those alignments, recompute the parameters

    • Just pretend that alignments are hand annotations.
    • Estimate parameters as from annotated corpus.
  4. Repeat steps 2 and 3 until parameters stop changing.

EM vs Hard EM

Likelihood Function

4 February 2020

Text Classification

Naive Bayes

Modelling (Feature Probabilities)

Calculating the Feature Probabilities

is normally estimated with simple smoothing:

Alternative Features

Costs and Lineraity

Review of Naive Bayes

Advantages
Disadvantages

Maximum Entropy Classifiers

Features

Classification with MaxEnt

Choose the class that has highest probability according to

where normalisation constant

Feature Templates

Training the Model

Review of MaxEnt

7 February 2020

Sequence Labelling (Tagging)

Parts of Speech

PoS Tagging

A Probabilistic Model for Tagging

  1. Choose a tag conditioned on previous tag

    • Transition probability
  2. Choose a word conditioned on its tag

    • Emission probability

      • Because every state emits a word (except <s> and </s>).

Actual Tagging with Hidden Markov Models (HMM)

The Viterbi Algorithm

11 February 2020

Viterbi Table

image-20200501130946800

14 February 2020

Meta lecture. About evaluation mostly and doing NLP in practise.

Annotation

Inter-Annotator Agreement (IAA)

Cross-Validation

Measuring a Model’s Performance

Bounds

Significance

25 February 2020

Theories of Syntax

Context-Free Grammar

Ambiguity

Structural Ambiguity

image-20200502120426280

Attachment Ambiguity
image-20200502120541260image-20200502120557026
“the man with the telescope”“with the telescope I saw the man

Parse Trees

Parsing

Recursive Descent Parsing

Shift-Reduce Parsing

28 February 2020

CKY Parsing

Step by Step

  1. image-20200503114746719

    • We have added all PoS tags that are allowed for each word.

    • Beware that the bottom-left half of the table (excluding the diagonal) is empty.

      • We are interested in cells where .
  2. image-20200503114833898

    • Red shows which children create which parents.

      • Normally we add pointers from parent to child to store this info permanently, but here we omit them for clarity.
  3. image-20200503115033105

    • More unary rule construction.
  4. image-20200503115101204

    • A-ha, this is interesting!
    • Given binary rule , we construct in cell from in cell and from in cell .
  5. Skipping intermediate steps…

    • Remember that, say for cell we consider all of the following:

      • and
      • and
    • In general, for cell [remembering that is always less than ] you should consider all of the following:

      • and
      • and
      • and
  6. image-20200503115333113

    • The top-right cell contains the root of the parse tree, which then can be constructed using backpointers.

CKY Ordering

CKY in Practise

Treebank Grammars

Probabilistic Context Free Grammars (PCFG)

Probabilistic CYK

Best-First Probabilistic Parsing

Lexical Dependencies

3 March 2020

Evaluating Parse Accuracy

image-20200503142727775

Handling Lexical Dependencies (Lexicalisation)

Dependency Trees

1.

image-20200503151921846

  1. Remove phrasal categories

    image-20200503151943291

  2. Remove duplicated terminals

    image-20200503152025716

  3. Collapse chains of duplicates -

    image-20200503152057463 -

    image-20200503152123232

  4. Result

    image-20200503152150150

Head Rules

Pros and Cons

Pros
Cons

Direct Dependency Parsing

Shift-Reduce Parsing

image-20200503154043516

Transition-Based Parsing
Graph-Based Parsing
Comparison

6 March 2020

Meaning

Question Answering

Semantics

Word Sense Disambiguation (WSD)

Semantic Classes

Named Entity Recognition (NER)
Supersense Tagging

10 March 2020

Word Similarity

Distributional Hypothesis

The Context

Questions
Defining the Context
Weighing the Context Words
Pointwise Mutual Information (PMI)

Alternatives to PMI for Finding Collocations