Written by Bora M. Alper in 2020 based on the lecture slides at https://www.inf.ed.ac.uk/teaching/courses/fnlp/.
Syntax, Part of Speech, Words, Morphology, Semantics, and Discourse
Why NLP is hard?
Because there ambiguity at many levels:
There are two ways of dealing with ambiguity:
Non-probabilistic methods, returning all possible answers
Probabilistic methods returning the best possible analysis
Like most other parts of AI, NLP today is dominated by statistical methods
Typically more robust than earlier rule-based methods
Probabilities are learned from data
Why NLP is hard again?
The frequency of a word is inversely proportional to its rank in the frequency table (and it’s also exponential!)
It’s a natural phenomenon that is observed in other languages and other fields as well.
Regardless of the size of your corpus!
Can also be observed for other linguistic structure (e.g. syntactic rules in a CFG).
These mean that we need to find clever ways to estimate probabilities for things we have rarely or never seen during training.
Suppose we train a Part of Speech tagger on the Wall Street Journal, which is nice and neat and copy-edited.
What will happen if we try to use this tagger for social media?
ikr smh he asked fir yo last name
Not only can one form have different meanings (ambiguity) but the same meaning can be expressed with different forms:
A corpus is a body of utterances, as words or sentences, assumed to be representative of and used for lexical, grammatical, or other linguistic analysis.
To understand and model how language works, we need empirical evidence. Ideally, this should be naturally-occurring corpora.
Aside from utterances, corpus datasets include metadata – side information about where the utterances come from, such as author, date, topic, publication, etc.
Corpora with linguistic annotations are of particular interest for core NLP and therefore this course, where humans have read the text and marked categories and/or structures describing their syntax and/or meaning.
Goal predict the opinion expressed in a piece of text.
The simplest way is to count the number of words with positive and negative denotations/connotations.
What is the input for each prediction? Sentence? Full review text? Text + metadata?
What are the possible outputs? + or -? stars?
How will it decide?
When a system’s behaviour is determined solely by manual rules or databases, it is said to be rule-based, symbolic, or knowledge-driven (early days of computational linguistics)
Learning is the act of collecting statistics or patterns automatically from corpora to govern the system’s behaviour (dominant in most areas of contemporary NLP)
How will you measure its effectiveness?
The last one, at least, requires data!
BEFORE you build a system, choose a dataset for evaluation.
Why is data-driven evaluation is important?
Often you should have multiple evaluation datasets: one for development as you hack on your system, and one reserved for FINAL testing.
Gold labels: Correct labels
Simplest rule: Count positive and negative words in the input. Predict whichever is greater.
Hard to know whether words that seem positive or negative tend to actually be used that way:
Opinion words may be describing for example a character’s attitude rather than being an evaluation of the film.
Some words act as semantic modifiers of other opinion-bearing words/phrases so interpreting the full meaning requires sophistication:
Normal written conventions often do not reflect the “logical” organisation of textual symbols.
Given a string of raw text, a tokeniser adds logical boundaries between separate words/punctuation tokens (occurrences) not already separated by spaces:
English tokenisation conventions vary somewhat —e.g., with respect to:
Word-level tokenisation is just part of the larger process of preprocessing or normalisation, which may also include
removal of markup
insertion of markup
sentence boundary detection
It should be evident that a large number of decisions have to be made, many of them dependent on the eventual intended use of the output, before a satisfactory preprocessor for such data can be produced.
Documenting those decisions and their implementation is then a key step in establishing the credibility of any subsequent experiments.
We know that the way people use language varies considerably depending on context. Factors include
Statistical approaches typically assume that the training data and the test data are sampled from the same distribution.
Things can go awry if the test data is appreciably different, e.g.
Domain adaptation techniques attempt to correct for this assumption when something about the source/characteristics of the test data is known to be different.
Probability of a sentence is “how likely is it to occur in natural language”
Consider only a specific language (English) [or even British English]
Not including meta-language (e.g. linguistic discussion)
P(“the cat slept peacefully”) > P(“slept the peacefully cat”)
P(“she studies morphosyntax”) > P(“she studies more faux syntax”)
It’s very difficult to know the true probability of an arbitrary sequence of words.
But we can define a language model to give us good approximations.
Like all models, language models will be good at capturing some things and less good for others.
Language models can be used for prediction as well as correction.
E.g. Predictive text correction/completion on your mobile phone:
Probability theory can solve problems like
But often we don’t know the true probabilities, only have data:
Intuitive way to estimate discrete probabilities where
This method is also known as maximum-likelihood estimation (MLE) for reasons we’ll explain.
Using MLE on full sentences doesn’t work well for language model estimation.
In general, MLE thinks anything that hasn’t occurred will never occur (P=0).
Clearly not true! Such things can have differing and non-zero probabilities.
And similarly for word sequences that have never occurred.
More generally, we use the chain rule.
But many of these conditional probabilities are just as sparse!
So we make an independence assumption: the probability of a word only depends on a fixed number of previous words (called history)
This assumption is not always a good one! But it does reduce the sparse data problem.
If we use MLE, we consider:
Word probabilities are typically very small.
Multiplying lots of small probabilities quickly gets so tiny that we cannot represent the numbers accurately, even with double precision floating point.
So in practice, we typically use negative log probabilities (also called costs)
N-gram models can be too simplistic, length of a “context” often varies: can be shorter or longer than an arbitrary .
Still suffers from assigning zero probabilities to not-seen sequences.
Intuitively, a trigram model captures more context than a bigram model, so should be a “better” model.
Extrinsic: measure performance on a downstream application
Intrinsic: design a measure that is inherent to the current task
Can be much quicker/easier during development cycle
But not always easy to figure out what the right measure is
Let’s consider how to define an intrinsic measure for language models.
Intuitively, a measure of uncertainty/disorder
Example: where the area of a section is proportional to its probability
A good model should have low uncertainty (entropy) about what comes next.
Cross entropy measures how close (estimate) is to (true):
Note that cross-entropy entropy
For with large , per-word cross-entropy is well approximated by:
Language model performance is often reported as perplexity rather than cross-entropy.
Perplexity is simply .
Cross entropy of a language model on some corpus is 5.2
Is this good?
No way to tell! Cross entropy depends on both the model and the corpus.
Some language is simply more predictable (e.g. casual speech vs academic writing)
We can only compare different models on the same corpus.
where is the vocabulary size.
Laplace smoothing steals way too much from seen events.
Assumes that we know the vocabulary size in advance.
Good is the surname of Irving John Good. :)
Previous methods changed the denominator, which can have big effects even on frequent events.
Good-Turing changes the numerator only.
Think of Good-Turing like this:
The actual formula is skipped for brevity. I don’t think it’s actually needed.
There are even better methods!
Assumes we know the vocabulary size (i.e. no unseen words)
Does not allow “holes” in the counts (i.e. if then )
Applies discounts even to high-frequency items
Assigns equal probabilities to all unseen events
A better solution is to use information from lower order N-grams (shorter histories)
Combine higher and lower order N-gram models, since they have different strengths and weaknesses:
If is N-gram estimate (from MLE, GT, etc; ), use:
Nota bene that all s must sum to 1!
“York” almost always directly follows “New”, say in a corpus
So, in unseen bigram contexts, York should have low probability
Kneser-Ney smoothing takes diversity of histories into account
Count of distinct histories for a word
Recall: MLE of unigram language model
In KN smoothing, replace raw counts with count of histories:
The best thing about KN smoothing is that it gives you the probability of “appearing in new contexts”
SKIPPED. See 05_slides.pdf page 19.
Use neural networks to project words into a continuous space, so words that appear in similar contexts have similar representation.
Can tell us anything about ?
|Speech Recognition||spoken words||acoustic signal|
|Machine Translation||words in||words in|
|Spelling Correction||intended words||typed words|
Mathematically, what we want is
Rewrite using Bayes’ Rule:
is the noise model
is the language model
Training conditional probabilities requires input/output pairs which are often limited:
But language models can be trained on huge unannotated corpora: a better model! Can help improve overall performance.
Assume we have a way to compute and . Can we do the following:
The task: find the optimal character alignment between two words (the one with the fewest character changes: the minimum edit distance or MED).
Example: if all changes count equally, MED(stall, table) is 3:
S T A L L T A L L deletion T A B L substitution T A B L E insertion
Written as an alignment:
S T A L L - d | | s | i - T A B L E
There may be multiple best alignments
For now, all costs are equal: cost(ins) = cost(del) = cost(sub) = 1
Brute force doesn’t scale well.
Instead we will use dynamic programming algorithm.
stores two things:
: the MED of substrings of length ,
Backpointer(s): which sub-alignment(s) used to create this one.
Deletion move down
Insertion move right
Substitution move down and right
Sum costs as we expand out from cell (0, 0) to populate the entire matrix.
Moving down in chart means that we had a deletion of S.
Add cost of deletion (1) and backpointer (to where we came from).
Moving right in chart (from (0,0)) means that we had an insertion of T.
Add cost of insertion (1) and backpointer (to where we came from).
Now compute . Take the minimum of three possibilities:
You can enumerate all the optimum MEDs by starting from the bottom right cell and following the arrows until the top left cell.
Computing distances and/or alignments between arbitrary strings can be used for:
Other fields entirely
Related algorithms are also used in speech recognition and timeseries data mining.
This sort of problem actually happens a lot in NLP (and ML).
We have some probabilistic model and want to estimate its parameters (here, the costs).
The model also contains variables whose value is unknown (here, the correct character alignments).
We would be able to estimate the parameters if we knew the values of the variables…
Initialise parameters to arbitrary values (e.g. set all costs to 1)
Using these parameters, compute optimal values for variables (run MED to get alignments).
Now using those alignments, recompute the parameters
Repeat steps 2 and 3 until parameters stop changing.
What we have just described is actually “hard EM” (meaning: no soft/fuzzy decisions)
Step 2 of true EM does not choose optimal values for variables, instead computes expected values.
True EM is guaranteed converge to a local optimum of the likelihood function.
Let’s call the parameters of our model .
For any value of , we can compute the probability of our dataset . This is the likelihood.
The likelihood is a function of , and can have multiple local optima.
Schematically (but is really multi-dimensional):
EM will converge to one of these local optima; hard EM won’t necessarily.
Neither is guaranteed to find the global optimum!
We might want to categorise the content of the text:
Spam detection (binary: spam or not)
Sentiment analysis (binary or multiway)
Topic classification (multiway: sport/finance/travel/etc)
… or we might want to categorise the author of the text (authorship attribution)
N-gram models can sometimes be used for classification but
for many tasks, sequential relationships between words are largely irrelevant: we can just consider the document as a bag of words
on the other hand, we may want to include other kinds of features (e.g. PoS tags) that N-gram models don’t include
Here we consider two alternative models for classification:
Just as in spelling correction, we need to define and .
is the prior probability of class before observing any data.
We represent each document as the set of features (words) it contains: . So:
As in Language Models, we cannot accurately estimate due to sparse data.
So make a naive Bayes assumption: features are conditionally independent given the class:
That is, the probability of word occurring depends only on the class.
Effectively, we only care about the count of each feature in each document.
is normally estimated with simple smoothing:
Use only binary values for : did this word occur in or not?
Use only a subset of the vocabulary for
Ignore stopwords (function words and others with little content)
Choose a small task-relevant set (e.g. using a sentiment lexicon)
Can be tricky:
E.g. sentiment analysis might need domain-specific non-sentiment words: such as quiet for computer product reviews
And for other tasks, stopwords might be very useful features:
Probably better to use too many irrelevant features than not enough relevant ones.
Use more complex features (bigrams, syntactic features, morphological features, …)
Multiplying large numbers of small probabilities together is problematic, thus we use costs (neg log) again.
In which case, we look for the lowest cost overall.
Naive Bayes then:
This amounts to classification using a linear function (in log space) of the input features.
Naive Bayes assumption is Naive.
Consider the following categories: travel, finance, sport
Are the following features independent given the category: beach, sun, ski, snow, pitch, palm, football, relax, ocean
In short, features are not usually independent given the class.
Accuracy of classifier can sometimes still be OK, but it will be highly overconfident in its decisions.
Used widely in many different fields, under many different names.
Most commonly, multinomial logistic regression
Also called: log-linear model, one-layer neural network, single neuron classifier, etc…
Like Naive Bayes, Maximum Entropy assigns a document to class , where
Unlike Naive Bayes, Maximum Entropy does not apply Bayes’ Rule. Instead, it models directly.
Like Naive Bayes, MaxEnt models use features we think will be useful for classification.
However, features are treated differently in two models:
For example, if we have three classes, our features will always come in groups of three. Imagine three binary features:
Each feature has a real-values weight learned in training.
Choose the class that has highest probability according to
where normalisation constant
Given annotated data, choose weights that make the labels most probable under the model.
That is, given examples with labels , choose
Called conditional maximum likelihood estimation (CMLE)
Supervised CMLE in MaxEnt is not so easy:
It is often the first step towards any syntactic analysis (which in turn, is often useful for semantic analysis).
Named Entity Recognition labels words as belonging to persons, organisations, locations, or none of the above
Information Field Segmentation given specific type of text (classified advert, bibliography entry, etc.), identify which words belong to which “fields” (price/size/location, author/title/year)
In sequence labelling, deciding the correct label depends on
Open-Class Words (or Content Words)
Closed-Class Words (or Function Words)
pronouns, determiners, prepositions, connectives
there a limited number of these
mostly functional: to tie the concepts of a sentence together
new ones are rare:
The number of parts of speech (tags) to have is a both linguistic and also a practical consideration.
Morphologically rich (e.g. Turkish) languages often have compound morphosyntactic tags: Noun+A3sg+P2sg+Nom
The problem of finding the best tag sequence for a sentence is also called decoding.
PoS tagging is hard because:
Relevant knowledge for PoS tagging
The word itself
Tags of surrounding words
Choose a tag conditioned on previous tag
Choose a word conditioned on its tag
So the model assumes
Transition Probability Table
Emission Probability Table
In this model, joint probability
Given a sequence of words, what is the most probable state path that generated them?
HMMs are quite similar to what we have seen earlier:
Find the best tag sequence for an untagged sentence :
By Bayes’ Rule:
Brute-force enumeration of all the possible tag sequences takes time for possible tags and words in the sentence.
The topmost row is the word sequence.
The cells below are the PoS tags.
You know Viterbi… Not gonna dwell into it again.
Intuition: the best path of length ending in state must include the best path of length to the previous state. So:
Find the best path of length to each state
Consider extending each of those by 1 step, until state
Take the best of those options as the best path to state
And of course use a chart to store partial results as we go.
And use backtrace to construct the path.
We can add up all probabilities in the last column to get the likelihood of (the probability of the entire) sequence.
As probabilities can get really tiny quickly, thus risking underflow, we use costs instead.
We can use expectation-maximisation to “bootstrap” an HMM in an unsupervised fashion.
Meta lecture. About evaluation mostly and doing NLP in practise.
Annotation costs time and money, you need to decide on
Text might be ambiguous
There may be grey area between categories in the annotation scheme
An important way to estimate the reliability of annotations is to have multiple people independently annotate a common sample, and measure inter-annotator agreement.
Raw agreement rate is the proportion of labels in agreement.
The agreement rate can be thought of as an upper bound (human ceiling) on the accuracy of a system evaluated on that dataset.
What if our dataset is too small to have a nice train/test or train/dev/test split?
-fold cross validation: partition the data into pieces and treat them as mini held-out sets. Each fold is an experiment with a different held-out set, using the rest of the data for training.
Precision proportion of model’s answers that are right
Recall proportion of test data that model gets right
Bora: It is hard to differentiate… Precision seems to be much more common.
When we are evaluating a model against each other or to a bound, how do we decide if the differences we find are significant?
In other words, should we interpret the differences as down to pure chance? Or is something more going on?
Parametric when the underlying distribution is normal
We have seen various ways to model word behaviour
There are often long-range dependencies!
The form of one word often depend on (or agrees with) another, even when arbitrarily many words intervene:
We want models that can capture these dependencies: for translation, or for understanding.
We may also want to capture substitutability at the phrasal level.
POS categories indicate which words are substitutable. For example, substituting adjectives:
Phrasal categories indicate which phrases are substitutable. For example, substituting a noun phrases:
A theory of syntax should explain which sentences are well-formed (grammatical) and which are not.
Note that well-formed is distinct from meaningful.
However, we’ll see shortly that the reason we care about syntax is mainly for interpreting meaning.
Context-Free Grammar (CFG) in today’s lecture
Dependency Grammar following lecture
Two types of grammar symbols:
Rules of the form where is any string of non-terminals and terminals
A CFG in Chomsky Normal Form only has rules of the form
To show that a sentence is well-formed under this CFG, we must provide a parse. One way to do this is by drawing a tree.
Some sentences have more than one parse.
Here, the structural ambiguity is caused by POS ambiguity in several of the words.
|“the man with the telescope”||“with the telescope I saw the man|
Some sentences have structural ambiguity even without PoS ambiguity. This is called attachment ambiguity.
Depends on where different phrases attach in the tree.
Different attachments often have different meanings:
We want to use parse trees as scaffolding for semantics
So ambiguity matters a lot!
Computing the structure(s) for an input string given a grammar.
As usual, ambiguity is a huge problem
Multiple analyses for parts of sentence.
E.g. “The dog bit the child.”
Syntactic ambiguity is rampant; humans don’t even notice because we are good at using context/semantics to disambiguate.
All parsers have two fundamental properties:
Directionality the sequence in which the structures are constructed
Search strategy the order in which the search space of possible analyses is explored
very efficient for unambiguous structures
can be massively inefficient (exponential in sentence length) if faced with local ambiguity
CKY (Cocke, Kasami, Younger) is a bottom-up, breadth-first parsing algorithm.
Original version assumes grammar in Chomsky Normal Form.
Dynamic (chart) programming algorithm.
Add constituent in cell if there is:
Fills chart in order
Takes time , where is the number of grammar rules and is the number of words in the sentence.
We have added all PoS tags that are allowed for each word.
Beware that the bottom-left half of the table (excluding the diagonal) is empty.
Red shows which children create which parents.
Skipping intermediate steps…
Remember that, say for cell we consider all of the following:
In general, for cell [remembering that is always less than ] you should consider all of the following:
We filled in all short entries, then longer ones.
Effectively, we are sweeping out diagonals beginning with the main diagonal and moving up to the right.
Avoids re-computing substructures so much more efficient than depth-first parsers in worst case.
Still may compute a lot of unnecessary partial parses.
Simple version requires converting the grammar to Chomsky Normal Formal.
Various other chart parsing methods avoid these issues by combining top-down and bottom-up approaches.
The big idea: instead of paying linguists to write a grammar, pay them to annotate real sentences with parse trees.
This way, we implicitly get a grammar (for CFG: read the rules off the trees)
And we get probabilities for those rules (using any of our fav estimation techniques).
A probabilistic context free grammar (PCFG) is a CFG where each rule (where is a symbol sequence) is assigned a probability .
The sum over all expansions of must be equal to 1.
Easiest way to create a PCFG from a tree: MLE
Under this model, the probability of a parse is simply the product of all rules in the parse:
Given multiple trees for a sentence, choose the one with the highest probability (or lowest cost).
Probability of a sentence is the sum of the probabilities over all the parses.
Goal: return the highest probability parse of the sentence (analogous to Viterbi)
Inside algorithm computes the probability of the whole sentence (analogous to forward algorithm).
Inside-outside algorithm is a form of EM that learns grammar rule probabilities from unannotated sentences (analogous to forward-backward).
Exhaustive parsing can be really expensive.
Basic idea: use probabilities of subtrees to decide which ones to build up further.
Notice we are no longer filling the chart in any fixed order!
Often limiting the size of the agenda by prunning out low-scoring edges (beam search).
Not as great as you might first think!
Because higher on the tree, lower is the probability so in the majority of the cases, you will expand all the lower constituents before moving up.
If we use raw probabilities for the score, smaller constituents will almost always have higher scores.
Instead, we can divide by the number of words in the constituent.
This works much better, but now not guaranteed to find the best parse first.
Replacing one word with another with the same POS will never result in a different parsing decision, even though it should!
But PCFGs are context-free, so an NP is an NP, and will have the same expansion probabilities regardless of where it appears.
Replacing one word with another with the same PoS will never result in a different parsing decision, even though it should!
Lexicalisation: create new categories, by adding the lexical head of the phrase.
Identifying the head of every rule is not always straightforward.
All this category-splitting makes the grammar much more specific (good!)
Do we really need phrase structure in the first place? Not always!
Remove phrasal categories
Remove duplicated terminals
Collapse chains of duplicates -
Don’t worry if the tree looks stupid to you, because it is!
This demonstrates how dependency trees can help with correct parsing too.
The right one is
Meaning of words within a sentence depend on one another, mostly in asymmetric, binary relations.
Also, in languages with free word order (e.g. Turkish, Russian), phrase structure (constituency) grammars don’t make as much sense.
E.g. we would need both and . Not very informative about what is really going on.
In contrast, dependency relations stay constant:
Edge labels can help us distinguish different kinds of head modifier relations:
A sentence’s dependency parse is said to be projective if every subtree (node and all its descendants) occupies a contiguous span of the sentence:
How can we find each phrase’s head in the first place?
The standard solution is to use head rules
For every non-unary (P)CFG production, designate on RHS non-terminal as containing the head.
We can also employ heuristics to scale this to large grammars: within an NP, last immediate N child is the head
Sensible framework for free word order languages
Identifies syntactic relations directly.
Dependency pairs/chains can make good features in classifiers, information-extractors, etc.
Parsers can be very fast!
The assumption of asymmetric binary relations isn’t always right…
3 possible actions:
Remember, dependency relations point from head to dependent.
Both LeftArc and RightArc leave the head at the top of the stack!
From a fully connected directed graph of all possible edges, choose the best ones that form a tree.
Can be formulated as constraint-satisfaction problem with integer linear programming
One grand goal of artificial intelligence is to understand what people mean when they talk.
But how do we know if we succeeded?
Meaning and understanding can lead to deep “philosophical” questions.
NLP usually takes a more pragmatic view: can the computer behave as though it understands (in order to do what we want)
What issues will we face in building such systems?
We would like to build
Basically, a smarter Google
This is typically called Question Answering
To build our QA system, we will need to deal with issues in semantics, i.e. meaning
Lexical Semantics: the meanings of individual words
Sentential Semantics: how word meanings combine (after that in a sentence)
Some examples to highlight problems in lexical semantics:
plant (flora) vs plant (infrastructure)
vacation and holiday
animals and polar bears
remove vs eliminate
Poland and Central Europe
Some of these problems can be solved with a good ontology, e.g. WordNet.
WordNet is a hand-built resource containing 117k synsets: sets of synonymous words
Synsets are connected by relations such as
Words are typically semantically ambiguous.
Lumping vs Splitting
Another way to define senses is to look if occurrences of the word have different translations.
Eng. Interest German
Polysemous is a word having multiple senses.
For many applications, we would like to disambiguate senses
Task: given a sense ambiguous word, find the sense in a given context
WSD as classification
Given a word token in context, which sense (class) does it belong to?
We can train a supervised classifier, assuming sense-labelled training data
Lots of options available:
Issues with WSD
Not always clear how fine-grained the gold-standard should be
Difficult/expensive to annotate corpora with fine-grained senses
Classifiers must be trained separately for each word
Recognising and classifying proper names in text is important for many applications. A kind of information extraction.
Different inventories of classes:
NER systems typically use some form of feature-based sequence tagging, with features like capitalisation being important.
Lists of known names called gazetteers are also important.
How to know if words have similar meanings?
Can we just use a thesaurus?
Let’s try to compute similarity automatically.
Meaning from Context(s)
Perhaps we can infer meaning just by looking at the contexts a word occurs in!
Perhaps meaning is the contexts a word occurs in (Wittgenstein!)
Either way, similar contexts imply similar meanings
Represent each word as a vector of its contexts
Each dimension is a context word; = 1 if it co-occurs with , otherwise 0.
What defines “context”?
How to weight the context words (boolean? counts? other?)
How to measure similarity between vectors?
There are two kinds of co-occurrence between two words:
First-Order Co-Occurrence (syntagmatic association)
Second-Order Co-Occurrence (paradigmatic association)
Usually ignore stopwords (function words and other very frequent/uninformative words)
Usually use a large window around the target word (e.g. 100 words, maybe even whole document)
But smaller windows allow for relations other than co-occurrence
All of these for semantic similarity
Binary indicators are not very informative.
Presumably more frequent co-occurrences matter more.
Is frequency good enough?
Frequent (overall) words are expected to have high counts in the context vector.
We want to know which words occur unusually often in the context of : more than we’d expect by chance?
Put another way, what collocations include ?
PMI tells us how much more/less likely the cooccurrence is than if the words were independent.
There a lot!