"Chance favors the prepared mind." (Louis Pasteur)
What has Computational Linguistics to do with weather forecasts?!
Satellite pictures of Europe provided by METEO FRANCE and Institut für Meteorologie, FU Berlin:
(today) | (tomorrow)
The picture on the left-hand side displays the current weather,
whereas the picture on the right-hand side displays a preview of
tomorrow's weather. Meteorologists provide weather forecasts on the
basis of Probability Theory and Statistics. They use a probability
model to determine the most probable outcome among the
different alternatives for tomorrow's weather, and they use Statistics
to infer the required probability model from a large corpus of
empirically observed weather data. In our experience,
meteorologists are quite successful in predicting tomorrow's
weather with this method.
A similar situation comes up in Computational Linguistics. Most
sentences in natural language are ambiguous, i.e., they have
more than one possible reading or analysis (or outcome, in weather terminology). For example, the famous
sentence
"the man saw the woman with the telescope"
has at least two readings: 'saw with the telescope' versus
'woman with the telescope'. The situation gets even worse
when using a formal grammar mimicking natural language. (One job, maybe
even the job, of computational linguists is to create such
grammars...) Then sentences typically have millions of readings! The
crucial problem is that someone has to select the correct one (or
create a disambiguator that is able to do so).
Although some computational linguists still believe that Probability
Theory and Statistics are not appropriate for their discipline,
Probability Theory is the theory of uncertainty and
ambiguity. Thus this theory offers the best chance to resolve the
natural-language disambiguation problem presented above: the most
probable among the different alternatives is selected as the reading
of a given sentence. As in weather forecasting, Statistics helps to
infer the necessary probabilities from a large corpus of linguistic
data.
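The idea can be illustrated with a minimal sketch in Python. It is only a toy under stated assumptions, not an actual disambiguator: the list `observed_readings` is an invented miniature "corpus" recording which reading was intended each time the telescope sentence occurred, and the function names are hypothetical. The probabilities are estimated as simple relative frequencies, and the most probable reading is then selected.

```python
from collections import Counter

# Invented toy "corpus" (an assumption for illustration only): each entry
# records which reading was intended for one occurrence of the sentence
# "the man saw the woman with the telescope".
observed_readings = [
    "saw with the telescope",
    "saw with the telescope",
    "woman with the telescope",
    "saw with the telescope",
    "woman with the telescope",
]


def estimate_probabilities(observations):
    """Estimate the probability of each reading by its relative frequency."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {reading: count / total for reading, count in counts.items()}


def most_probable_reading(probabilities):
    """Select the reading with the highest estimated probability."""
    return max(probabilities, key=probabilities.get)


probabilities = estimate_probabilities(observed_readings)
best = most_probable_reading(probabilities)
print(best)
```

With these invented counts, 'saw with the telescope' is chosen because it was observed in three of the five cases. A realistic system would have to estimate probabilities for readings of sentences it has never seen before, which relative frequencies over whole sentences cannot do.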
Last updated: February 2007.