Probabilistic Grammars and Data-Oriented Parsing
Lecturers: Detlef Prescher, Remko Scha, and Khalil Sima'an
Date: Tuesday, 4-6 p.m.
Location: P.018
First Lecture: February 3, 2004
This course will be given in English!
Content: Probabilistic grammars start where formal language
theory stops. Ambiguity resolution, robustness, efficiency,
learning/adaptation, and estimation from data are the starting points for
probabilistic grammars. The subject matter of these mathematical
models is Natural Language Processing (as opposed to Formal Language
Processing, e.g. compilation). The course concentrates on syntactic
parsing, with ambiguity resolution as the core subject for
probabilistic grammars. We will look mainly at the following subjects
for Probabilistic Context-Free Grammars (PCFGs): (1) mathematical
aspects of probabilistic grammars, (2) estimation theory for
supervised and unsupervised learning, (3) algorithms for the estimation of
PCFGs, (4) efficient parsing under PCFGs. Then we will consider
aspects of Data-Oriented Parsing and how it resembles and deviates from
PCFGs: parameter estimation, efficiency, and linguistic issues. The
course includes a slot for student presentations of a paper from a list
provided by the lecturers.
References
Besides Remko Scha's background
material on Data-Oriented Parsing (DOP), (parts of) the
papers and books cited in the syllabus below are especially interesting for the class.
Syllabus
FIRST PART (Lectures 1-7, by Detlef Prescher)
Keywords: Motivation for working with probabilistic grammars (the
need to deal with uncertainty); definition of probabilistic
context-free grammars (PCFG); statistical
estimation theory and how it applies to PCFGs; formal notions of a
corpus and a probability model;
maximum-likelihood estimation (MLE); relation of MLE to
relative-frequency estimation; other kinds of estimators:
maximum-entropy estimation and the expectation-maximization (EM)
algorithm; treebank training for PCFGs and unsupervised training of
PCFGs using the EM algorithm; efficient parsing
under PCFGs: the Viterbi, inside, and outside algorithms; the inside-outside
algorithm for unsupervised training of PCFGs and its relation to EM.
- Lecture 1 (February 3): Formal definition
of a context-free grammar (CFG); Chomsky's very first context-free
syntax tree; CFGs can be used as parsers and generators; sentence
generation with a CFG is a non-deterministic process; This leads to a
formal definition of PCFGs that is based on rule, tree and sentence
probabilities.
Reading: The sub-section "Background: Probabilistic
Modeling of CFGs" of Section 5 of Prescher (2003).
Recommended: Chapter 9 of the book by Jurafsky and Martin.
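As a compact summary of this lecture's central definition (in our notation, which may differ slightly from Prescher's): a PCFG attaches a probability to every rule, normalized per left-hand side, and derives tree and sentence probabilities from the rule probabilities.

    % Rule probabilities, normalized for every nonterminal A:
    \sum_{\alpha : (A \to \alpha) \in R} p(A \to \alpha) = 1
    % Probability of a parse tree t, where f(r; t) counts the occurrences of rule r in t:
    P(t) = \prod_{r \in R} p(r)^{f(r;\,t)}
    % Probability of a sentence s: the sum over all trees whose yield is s:
    P(s) = \sum_{t\,:\,\mathrm{yield}(t) = s} P(t)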
- Lecture 2 (February 10): Non-standard PCFGs; Resolving
ambiguities with PCFGs; Two examples: PP attachment and
conjunctions; PCFGs can deal with these, but a
transformation of the underlying CFG should be done beforehand; The
parent-encoding technique; Definition of treebank grammars.
Reading: The sub-section "Background:
Resolving Ambiguities with PCFGs" of Section 5 of Prescher (2003).
Highly recommended: Charniak (1996), Mark Johnson (1999), and
Klein and Manning (2003).
For freaks: Sanchez and Benedi (1997), and Chi and Geman
(1998).
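Reading off a treebank grammar and applying the parent-encoding transformation are both mechanical operations; the sketch below shows them for a toy tree format. The format ((label, children) tuples with words as plain strings) and the function names are our own invention for illustration, not taken from the readings.

    from collections import defaultdict

    def parent_encode(tree, parent="TOP"):
        """Append the parent category to every nonterminal label (parent encoding)."""
        if isinstance(tree, str):            # a word stays a word
            return tree
        label, children = tree
        return (label + "^" + parent, [parent_encode(c, label) for c in children])

    def read_off_grammar(treebank):
        """Treebank grammar: rule probabilities are relative frequencies per left-hand side."""
        counts = defaultdict(lambda: defaultdict(int))   # lhs -> rhs tuple -> count
        def collect(tree):
            if isinstance(tree, str):
                return
            label, children = tree
            rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
            counts[label][rhs] += 1
            for c in children:
                collect(c)
        for t in treebank:
            collect(t)
        return {(lhs, rhs): n / sum(rhss.values())
                for lhs, rhss in counts.items() for rhs, n in rhss.items()}

    # Example: one toy tree, read off with and without parent encoding.
    tree = ("S", [("NP", ["she"]), ("VP", [("V", ["sleeps"])])])
    print(read_off_grammar([tree]))
    print(read_off_grammar([parent_encode(tree)]))

After parent encoding the read-off grammar distinguishes, for example, NP^S from NP^VP, which is the point of the transformation.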
- Lecture 3 (February 17): General
Estimation Theory: Definitions of a corpus, the relative-frequency
estimate, and (restricted or unrestricted) probability models;
Definition of corpus probabilities and of maximum-likelihood
estimation; Problems for MLE: existence, uniqueness, and efficiency;
MLE and relative-frequency estimation: The relative-frequency
estimate is an ML estimate for a (restricted or unrestricted)
probability model if and only if the relative-frequency estimate is a
model instance; The information inequality of information theory;
MLE and relative-entropy estimation: Maximizing the corpus
probability is equivalent to minimizing the relative entropy with
respect to the relative-frequency estimate.
Reading: Section 2 of Prescher (2003).
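Written out, the central quantities of this lecture are roughly the following (in our notation): for a corpus C with frequencies f(x) and size |C|,

    % Relative-frequency estimate:
    \tilde{p}(x) = \frac{f(x)}{|C|}
    % Corpus probability (likelihood) of a model instance p, maximized by MLE:
    L(C; p) = \prod_{x} p(x)^{f(x)}
    % Information inequality: D(\tilde{p} \,\|\, p) \ge 0, with equality iff p = \tilde{p}.
    % Hence, over a probability model M, maximizing the corpus probability is
    % equivalent to minimizing the relative entropy to the relative-frequency estimate:
    \operatorname*{arg\,max}_{p \in M} L(C; p) = \operatorname*{arg\,min}_{p \in M} D(\tilde{p} \,\|\, p)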
- Lecture 4 (February 24): Review of the
general estimation theory; Estimation of PCFGs: Definition of a CFG's
probability model; A CFG's probability model is restricted if the
CFG is read off from a treebank and if it generates a full parse-tree
outside the treebank: The relative-frequency estimate on the treebank
is NOT an instance of this CFG's probability model! On the one
hand, this is good, as ML estimates for unrestricted models necessarily
overfit their training corpus; On the other hand, this is bad: Is
there an ML estimate for a CFG's probability model on a given
treebank?! Is it unique?! Can it be efficiently calculated?!
Reading: The beginning of sub-section "Maximum-Likelihood
Estimation of PCFGs" of Section 5 of Prescher (2003).
- Lecture 5 (March 2): Estimation of
PCFGs: The ML estimate for a CFG's probability model on a given
treebank is the treebank grammar!
Reading: The end of sub-section "Maximum-Likelihood
Estimation of PCFGs" of Section 5 of Prescher (2003).
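In formulas, the answer to the questions of Lecture 4 (in our notation): for a treebank TB, writing f(A -> alpha; TB) for the number of occurrences of the rule A -> alpha in the treebank,

    \hat{p}(A \to \alpha) = \frac{f(A \to \alpha;\, \mathit{TB})}{\sum_{\alpha'} f(A \to \alpha';\, \mathit{TB})}
    % i.e. the ML estimate for the CFG's probability model on the treebank is
    % exactly the treebank grammar (rule probabilities = relative frequencies
    % per left-hand side), as read off in the sketch after Lecture 2.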
- Lecture 6 (March 9): Definition of the
input of the EM algorithm (a symbolic analyzer, an incomplete-data
corpus, a complete-data model, and a randomly generated starting
instance); Definition of the procedure of the EM algorithm (a loop
over E- and M-steps: an E-step generates a complete-data corpus
expected by the current model instance, and an M-step looks for a
maximum-likelihood estimate for the complete-data model on the
expected complete-data corpus); Theorem about the output of the EM
algorithm (a sequence of instances of the complete-data model,
so-called EM re-estimates, such that the sequence of probabilities
allocated to the incomplete-data corpus is monotonically increasing);
So the EM algorithm aims at doing MLE on incomplete data by
performing a series of MLE steps on complete data. Typically, the EM
algorithm is applied in settings where incomplete-data MLE is
difficult but complete-data MLE is easy! Example: The EM
algorithm for PCFGs.
Reading: Section 3 and the sub-section "EM-Training of PCFGs"
of Section 5 of Prescher (2003).
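The structure of the E-step/M-step loop is easiest to see on a model much smaller than a PCFG. The sketch below uses a mixture of two biased coins rather than a grammar, purely to mirror the cycle described above; the data and the starting values are invented for illustration.

    # Incomplete data: for each session we observe only (heads, tails) out of 10
    # tosses, not which of the two coins A or B was used; the coin is assumed to
    # be chosen uniformly at random, and that part of the model is held fixed.
    data = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

    theta_a, theta_b = 0.6, 0.5          # randomly chosen starting instance

    for step in range(20):
        # E-step: expected complete-data counts under the current model instance.
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h, t in data:
            lik_a = theta_a ** h * (1 - theta_a) ** t
            lik_b = theta_b ** h * (1 - theta_b) ** t
            post_a = lik_a / (lik_a + lik_b)   # posterior probability of coin A
            heads_a += post_a * h
            tails_a += post_a * t
            heads_b += (1 - post_a) * h
            tails_b += (1 - post_a) * t
        # M-step: relative-frequency (ML) estimate on the expected complete data.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)

    print(theta_a, theta_b)   # the incomplete-data likelihood never decreases

For PCFGs the E-step is the expensive part: the expected complete-data corpus consists of expected rule frequencies rather than expected coin counts, which is where the inside-outside algorithm mentioned in the keywords above comes in.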
- Lecture 7 (March 16): The
semiring-parsing framework (which subsumes many standard parsing
algorithms, such as the parse-forest, CKY, count, inside, and
Viterbi algorithms).
Reading: Goodman (1999).
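Goodman's observation is that one generic chart-parsing routine yields many of these algorithms, depending only on the semiring in which rule weights are combined. Below is a minimal sketch of that idea, restricted to grammars in Chomsky normal form; the data format, the toy grammar, and the function name are our own illustration, not Goodman's notation.

    from collections import defaultdict

    # Two semirings over rule weights: (plus, times, zero, one).
    INSIDE  = (lambda x, y: x + y, lambda x, y: x * y, 0.0, 1.0)  # sums over all parses
    VITERBI = (max,               lambda x, y: x * y, 0.0, 1.0)  # best single parse

    def cky(words, lexical, binary, start, semiring):
        """Generic CKY chart parser; the semiring decides what is computed."""
        plus, times, zero, one = semiring
        n = len(words)
        chart = defaultdict(lambda: zero)            # (i, j, symbol) -> weight
        for i, w in enumerate(words):
            for lhs, p in lexical.get(w, []):        # lexical rules  lhs -> w
                chart[i, i + 1, lhs] = plus(chart[i, i + 1, lhs], p)
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for lhs, r1, r2, p in binary:    # binary rules  lhs -> r1 r2
                        left, right = chart[i, k, r1], chart[k, j, r2]
                        if left != zero and right != zero:
                            w_ = times(p, times(left, right))
                            chart[i, j, lhs] = plus(chart[i, j, lhs], w_)
        return chart[0, n, start]

    # A tiny PCFG with a PP-attachment ambiguity (probabilities are invented).
    binary = [('S', 'NP', 'VP', 1.0), ('VP', 'V', 'NP', 0.6), ('VP', 'VP', 'PP', 0.4),
              ('NP', 'NP', 'PP', 0.3), ('PP', 'P', 'NP', 1.0)]
    lexical = {'she': [('NP', 0.2)], 'saw': [('V', 1.0)], 'stars': [('NP', 0.3)],
               'with': [('P', 1.0)], 'telescopes': [('NP', 0.2)]}
    sentence = 'she saw stars with telescopes'.split()
    print(cky(sentence, lexical, binary, 'S', INSIDE))   # total sentence probability
    print(cky(sentence, lexical, binary, 'S', VITERBI))  # probability of the best parse

With this grammar the sentence has two parses, so the two calls differ: the inside semiring sums the parse probabilities, while the Viterbi semiring keeps only the larger one; the count and parse-forest algorithms correspond to yet other choices of semiring.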
MID-TERM PROJECT
(Three weeks: From March 16 to April 6)
The mid-term project consists of an experiment and a report on this
experiment. It counts as the homework for about seven lectures. [Note also
that there are no lectures in the week of March 22-26.] The submission
deadline for the project report is April 13.
SECOND PART (Lectures 8-14, mainly by Remko Scha)
Keywords: Introduction to data-oriented parsing (DOP). This final part
will also include presentations given by the students (relevant
papers will be put on this page; the topics are: shallow parsing,
parsing with PCFGs and unification grammars, data-oriented parsing,
etc.).
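To give a first flavour of what "data-oriented" means: instead of single rewrite rules, DOP takes arbitrarily large subtrees (fragments) of the treebank trees as its elementary units. The sketch below enumerates the fragments of one toy tree; the tree format and the function names are ours, and open substitution sites are represented simply by the bare category label.

    from itertools import product

    # Toy tree format: (label, [children]); a leaf is a word string.
    def fragments(tree):
        """All fragments rooted at this node: at every nonterminal child we may
        either cut (leaving an open substitution site) or continue with one of
        the child's own fragments; words are always kept."""
        label, children = tree
        options = []
        for child in children:
            if isinstance(child, str):
                options.append([child])                        # lexical leaf: keep the word
            else:
                options.append([child[0]] + fragments(child))  # cut here, or recurse
        return [(label, list(choice)) for choice in product(*options)]

    def all_fragments(tree):
        """Fragments rooted at every nonterminal node of the tree."""
        result = fragments(tree)
        _, children = tree
        for child in children:
            if not isinstance(child, str):
                result += all_fragments(child)
        return result

    tree = ("S", [("NP", ["she"]), ("VP", [("V", ["sleeps"])])])
    for frag in all_fragments(tree):
        print(frag)

Even this two-word toy tree yields ten fragments; how probabilities should be assigned to such fragments is exactly the estimation question taken up in Lecture 10.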
- Lecture 8 (March 30): Data-Oriented
Parsing I (ppt)
- Lecture 9 (April 6): Data-Oriented
Parsing II (ppt)
- Lecture 10 (April 13): Data-Oriented
Parsing and Estimation Theory
- Lecture 11 (April 20): Presentation (Klein and Manning, 2003).
- Lecture 12 (April 27): Presentation (Sabine Schulte im Walde, 2000)
- Lecture 13 (May 4): Presentation
(Bod and Kaplan, to appear)
- Lecture 14 (May 11): Presentation (Mike de Kreek, 2003)
FINAL PROJECT (Four weeks: From May 11
to June 8)
The final project is similar to the mid-term project but somewhat larger
in scope. The submission deadline for the project report is June 15.
The final mark is composed of: 40% mid-term project
+ 40% final project + 20% presentation.