Probabilistic Grammars and Data-Oriented Parsing
Lecturers: Detlef Prescher, Remko Scha, and Khalil Sima'an
Date: Tuesday, 4-6 p.m.
Location: P.018
First Lecture: February 3, 2004
This course will be given in English!
Content: Probabilistic grammars start where formal language
theory stops. Ambiguity resolution, robustness, efficiency,
learning/adaptation, and estimation from data are the starting points for
probabilistic grammars. The subject matter of these mathematical
models is Natural Language Processing (as opposed to Formal Language
Processing, e.g. compilation). The course concentrates on syntactic
parsing, with ambiguity resolution as the core subject for
probabilistic grammars. We will look mainly at the following subjects
for Probabilistic Context-Free Grammars (PCFGs): (1) mathematical
aspects of probabilistic grammars, (2) estimation theory for
supervised and unsupervised learning, (3) algorithms for the estimation of
PCFGs, (4) efficient parsing under PCFGs. Then we will consider
aspects of Data-Oriented Parsing and how it resembles and deviates from
PCFGs: parameter estimation, efficiency, and linguistic issues. The
course includes a slot for student presentations of a paper from a list
provided by the lecturers.
References
Besides Remko Scha's background
material on Data-Oriented Parsing (DOP), (parts of) the
papers and books cited in the syllabus below are especially interesting for the class.
Syllabus
FIRST PART (Lectures 1-7, by Detlef Prescher)
Keywords: Motivation for working with probabilistic grammars (the
need to deal with uncertainty); definition of probabilistic
context-free grammars (PCFG); statistical
estimation theory and how it applies to PCFGs; formal notions of a
corpus and a probability model;
maximum-likelihood estimation (MLE); relation of MLE to
relative-frequency estimation; other kinds of estimators:
maximum-entropy estimation and the expectation-maximization (EM)
algorithm; treebank training for PCFGs and unsupervised training of
PCFGs using the EM algorithm; efficient parsing
under PCFGs: the Viterbi, inside, and outside algorithms; the inside-outside
algorithm for unsupervised training of PCFGs and its relation to EM.
- Lecture 1 (February 3): Formal definition
of a context-free grammar (CFG); Chomsky's very first context-free
syntax tree; CFGs can be used as parsers and generators; sentence
generation with a CFG is a non-deterministic process; This leads to a
formal definition of PCFGs that is based on rule, tree and sentence
probabilities.
Reading: The sub-section "Background: Probabilistic
Modeling of CFGs" of Section 5 of Prescher (2003).
Recommended: Chapter 9 of the book by Jurafsky and Martin.
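As a compact summary of this lecture's central definition (in our notation, which may differ slightly from Prescher's): a PCFG attaches a probability to every rule, normalized per left-hand side, and derives tree and sentence probabilities from the rule probabilities.

    % Rule probabilities, normalized for every nonterminal A:
    \sum_{\alpha : (A \to \alpha) \in R} p(A \to \alpha) = 1
    % Probability of a parse tree t, where f(r; t) counts the occurrences of rule r in t:
    P(t) = \prod_{r \in R} p(r)^{f(r;\,t)}
    % Probability of a sentence s: the sum over all trees whose yield is s:
    P(s) = \sum_{t\,:\,\mathrm{yield}(t) = s} P(t)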
- Lecture 2 (February 10): Non-standard PCFGs; Resolving
ambiguities with PCFGs; Two examples: PP attachment and
conjunctions; PCFGs can deal with these, but a
transformation of the underlying CFG should be done beforehand; The
parent-encoding technique; Definition of treebank grammars.
Reading: The sub-section "Background:
Resolving Ambiguities with PCFGs" of Section 5 of Prescher (2003).
Highly recommended: Charniak (1996), Mark Johnson (1999), and
Klein and Manning (2003).
For freaks: Sanchez and Benedi (1997), and Chi and Geman
(1998).
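Reading off a treebank grammar and applying the parent-encoding transformation are both mechanical operations; the sketch below shows them for a toy tree format. The format ((label, children) tuples with words as plain strings) and the function names are our own invention for illustration, not taken from the readings.

    from collections import defaultdict

    def parent_encode(tree, parent="TOP"):
        """Append the parent category to every nonterminal label (parent encoding)."""
        if isinstance(tree, str):            # a word stays a word
            return tree
        label, children = tree
        return (label + "^" + parent, [parent_encode(c, label) for c in children])

    def read_off_grammar(treebank):
        """Treebank grammar: rule probabilities are relative frequencies per left-hand side."""
        counts = defaultdict(lambda: defaultdict(int))   # lhs -> rhs tuple -> count
        def collect(tree):
            if isinstance(tree, str):
                return
            label, children = tree
            rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
            counts[label][rhs] += 1
            for c in children:
                collect(c)
        for t in treebank:
            collect(t)
        return {(lhs, rhs): n / sum(rhss.values())
                for lhs, rhss in counts.items() for rhs, n in rhss.items()}

    # Example: one toy tree, read off with and without parent encoding.
    tree = ("S", [("NP", ["she"]), ("VP", [("V", ["sleeps"])])])
    print(read_off_grammar([tree]))
    print(read_off_grammar([parent_encode(tree)]))

After parent encoding the read-off grammar distinguishes, for example, NP^S from NP^VP, which is the point of the transformation.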
- Lecture 3 (February 17): General
Estimation Theory: Definitions of a corpus, the relative-frequency
estimate, and (restricted or unrestricted) probability models;
Definition of corpus probabilities and of maximum-likelihood
estimation; Problems for MLE: existence, uniqueness, and efficiency;
MLE and relative-frequency estimation: The relative-frequency
estimate is an ML estimate for a (restricted or unrestricted)
probability model if and only if the relative-frequency estimate is a
model instance; The information inequality of information theory;
MLE and relative-entropy estimation: Maximizing the corpus
probability is equivalent to minimizing the relative entropy with
respect to the relative-frequency estimate.
Reading: Section 2 of Prescher (2003).
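Written out, the central quantities of this lecture are roughly the following (in our notation): for a corpus C with frequencies f(x) and size |C|,

    % Relative-frequency estimate:
    \tilde{p}(x) = \frac{f(x)}{|C|}
    % Corpus probability (likelihood) of a model instance p, maximized by MLE:
    L(C; p) = \prod_{x} p(x)^{f(x)}
    % Information inequality: D(\tilde{p} \,\|\, p) \ge 0, with equality iff p = \tilde{p}.
    % Hence, over a probability model M, maximizing the corpus probability is
    % equivalent to minimizing the relative entropy to the relative-frequency estimate:
    \operatorname*{arg\,max}_{p \in M} L(C; p) = \operatorname*{arg\,min}_{p \in M} D(\tilde{p} \,\|\, p)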
- Lecture 4 (February 24): Review of the
general estimation theory; Estimation of PCFGs: Definition of a CFG's
probability model; A CFG's probability model is restricted if the
CFG is read off from a treebank and if it generates a full parse-tree
outside the treebank: The relative-frequency estimate on the treebank
is NOT an instance of this CFG's probability model! On the one
hand, this is good, as ML estimates for unrestricted models necessarily
overfit their training corpus; On the other hand, this is bad: Is
there an ML estimate for a CFG's probability model on a given
treebank?! Is it unique?! Can it be efficiently calculated?!
Reading: The beginning of sub-section "Maximum-Likelihood
Estimation of PCFGs" of Section 5 of Prescher (2003).
- Lecture 5 (March 2): Estimation of
PCFGs: The ML estimate for a CFG's probability model on a given
treebank is the treebank grammar!
Reading: The end of sub-section "Maximum-Likelihood
Estimation of PCFGs" of Section 5 of Prescher (2003).
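In formulas, the answer to the questions of Lecture 4 (in our notation): for a treebank TB, writing f(A -> alpha; TB) for the number of occurrences of the rule A -> alpha in the treebank,

    \hat{p}(A \to \alpha) = \frac{f(A \to \alpha;\, \mathit{TB})}{\sum_{\alpha'} f(A \to \alpha';\, \mathit{TB})}
    % i.e. the ML estimate for the CFG's probability model on the treebank is
    % exactly the treebank grammar (rule probabilities = relative frequencies
    % per left-hand side), as read off in the sketch after Lecture 2.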
- Lecture 6 (March 9): Definition of the
input of the EM algorithm (a symbolic analyzer, an incomplete-data
corpus, a complete-data model, and a randomly generated starting
instance); Definition of the procedure of the EM algorithm (a loop
over E- and M-steps: an E-step generates a complete-data corpus
expected by the current model instance, and an M-step looks for a
maximum-likelihood estimate for the complete-data model on the
expected complete-data corpus); Theorem about the output of the EM
algorithm (a sequence of instances of the complete-data model,
so-called EM re-estimates, such that the sequence of probabilities
allocated to the incomplete-data corpus is monotonically increasing);
So the EM algorithm aims at doing MLE on incomplete data by
performing a series of MLE steps on complete data. Typically, the EM
algorithm is applied in settings where incomplete-data MLE is
difficult but complete-data MLE is easy! Example: The EM
algorithm for PCFGs.
Reading: Section 3 and the sub-section "EM-Training of PCFGs"
of Section 5 of Prescher (2003).
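The structure of the E-step/M-step loop is easiest to see on a model much smaller than a PCFG. The sketch below uses a mixture of two biased coins rather than a grammar, purely to mirror the cycle described above; the data and the starting values are invented for illustration.

    # Incomplete data: for each session we observe only (heads, tails) out of 10
    # tosses, not which of the two coins A or B was used; the coin is assumed to
    # be chosen uniformly at random, and that part of the model is held fixed.
    data = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

    theta_a, theta_b = 0.6, 0.5          # randomly chosen starting instance

    for step in range(20):
        # E-step: expected complete-data counts under the current model instance.
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h, t in data:
            lik_a = theta_a ** h * (1 - theta_a) ** t
            lik_b = theta_b ** h * (1 - theta_b) ** t
            post_a = lik_a / (lik_a + lik_b)   # posterior probability of coin A
            heads_a += post_a * h
            tails_a += post_a * t
            heads_b += (1 - post_a) * h
            tails_b += (1 - post_a) * t
        # M-step: relative-frequency (ML) estimate on the expected complete data.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)

    print(theta_a, theta_b)   # the incomplete-data likelihood never decreases

For PCFGs the E-step is the expensive part: the expected complete-data corpus consists of expected rule frequencies rather than expected coin counts, which is where the inside-outside algorithm mentioned in the keywords above comes in.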
- Lecture 7 (March 16): The
semiring-parsing framework (which subsumes many standard parsing
algorithms, such as the parse-forest, CKY, count, inside, and
Viterbi algorithms).
Reading: Goodman (1999).
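Goodman's observation is that one generic chart-parsing routine yields many of these algorithms, depending only on the semiring in which rule weights are combined. Below is a minimal sketch of that idea, restricted to grammars in Chomsky normal form; the data format, the toy grammar, and the function name are our own illustration, not Goodman's notation.

    from collections import defaultdict

    # Two semirings over rule weights: (plus, times, zero, one).
    INSIDE  = (lambda x, y: x + y, lambda x, y: x * y, 0.0, 1.0)  # sums over all parses
    VITERBI = (max,               lambda x, y: x * y, 0.0, 1.0)  # best single parse

    def cky(words, lexical, binary, start, semiring):
        """Generic CKY chart parser; the semiring decides what is computed."""
        plus, times, zero, one = semiring
        n = len(words)
        chart = defaultdict(lambda: zero)            # (i, j, symbol) -> weight
        for i, w in enumerate(words):
            for lhs, p in lexical.get(w, []):        # lexical rules  lhs -> w
                chart[i, i + 1, lhs] = plus(chart[i, i + 1, lhs], p)
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for lhs, r1, r2, p in binary:    # binary rules  lhs -> r1 r2
                        left, right = chart[i, k, r1], chart[k, j, r2]
                        if left != zero and right != zero:
                            w_ = times(p, times(left, right))
                            chart[i, j, lhs] = plus(chart[i, j, lhs], w_)
        return chart[0, n, start]

    # A tiny PCFG with a PP-attachment ambiguity (probabilities are invented).
    binary = [('S', 'NP', 'VP', 1.0), ('VP', 'V', 'NP', 0.6), ('VP', 'VP', 'PP', 0.4),
              ('NP', 'NP', 'PP', 0.3), ('PP', 'P', 'NP', 1.0)]
    lexical = {'she': [('NP', 0.2)], 'saw': [('V', 1.0)], 'stars': [('NP', 0.3)],
               'with': [('P', 1.0)], 'telescopes': [('NP', 0.2)]}
    sentence = 'she saw stars with telescopes'.split()
    print(cky(sentence, lexical, binary, 'S', INSIDE))   # total sentence probability
    print(cky(sentence, lexical, binary, 'S', VITERBI))  # probability of the best parse

With this grammar the sentence has two parses, so the two calls differ: the inside semiring sums the parse probabilities, while the Viterbi semiring keeps only the larger one; the count and parse-forest algorithms correspond to yet other choices of semiring.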
MID-TERM PROJECT
(Three weeks: From March 16 to April 6)
The mid-term project consists of an experiment and a report on this
experiment. It counts as the homework for about seven lectures. [Note also
that there are no lectures in the week of March 22-26.] The submission
deadline for the project report is April 13.
SECOND PART (Lectures 8-14, mainly by Remko Scha)
Keywords: Introduction to data-oriented parsing (DOP). This final part
will also include presentations given by the students (relevant
papers will be put on this page; the topics are: shallow parsing,
parsing with PCFGs and unification grammars, data-oriented parsing,
etc.).
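To give a first flavour of what "data-oriented" means: instead of single rewrite rules, DOP takes arbitrarily large subtrees (fragments) of the treebank trees as its elementary units. The sketch below enumerates the fragments of one toy tree; the tree format and the function names are ours, and open substitution sites are represented simply by the bare category label.

    from itertools import product

    # Toy tree format: (label, [children]); a leaf is a word string.
    def fragments(tree):
        """All fragments rooted at this node: at every nonterminal child we may
        either cut (leaving an open substitution site) or continue with one of
        the child's own fragments; words are always kept."""
        label, children = tree
        options = []
        for child in children:
            if isinstance(child, str):
                options.append([child])                        # lexical leaf: keep the word
            else:
                options.append([child[0]] + fragments(child))  # cut here, or recurse
        return [(label, list(choice)) for choice in product(*options)]

    def all_fragments(tree):
        """Fragments rooted at every nonterminal node of the tree."""
        result = fragments(tree)
        _, children = tree
        for child in children:
            if not isinstance(child, str):
                result += all_fragments(child)
        return result

    tree = ("S", [("NP", ["she"]), ("VP", [("V", ["sleeps"])])])
    for frag in all_fragments(tree):
        print(frag)

Even this two-word toy tree yields ten fragments; how probabilities should be assigned to such fragments is exactly the estimation question taken up in Lecture 10.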
- Lecture 8 (March 30): Data-Oriented
Parsing I (ppt)
- Lecture 9 (April 6): Data-Oriented
Parsing II (ppt)
- Lecture 10 (April 13): Data-Oriented
Parsing and Estimation Theory
- Lecture 11 (April 20): Presentation (Klein and Manning, 2003).
- Lecture 12 (April 27): Presentation (Sabine Schulte im Walde, 2000)
- Lecture 13 (May 4): Presentation
(Bod and Kaplan, to appear)
- Lecture 14 (May 11): Presentation (Mike de Kreek, 2003)
FINAL PROJECT (Four weeks: From May 11
to June 8)
The final project is similar to the mid-term project but somewhat larger
in scope. The submission deadline for the project report is June 15.
The final mark is composed of: 40% mid-term project
+ 40% final project + 20% presentation.