Detection of Errors and Correction
in Corpus Annotation

Prune Diseased Branches to Get Healthy Trees!

How to Find Erroneous Local Trees in a Treebank and Why It Matters

Markus Dickinson and Walt Detmar Meurers

Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005). Barcelona, Spain.

We present a new method for detecting bracketing and labeling errors in syntactic annotation and demonstrate its effectiveness for the WSJ treebank. The method is inspired by the linguistic concept of endocentricity and its consequence that the list of daughters in a local tree constrains the possible categories of the mother in that local tree. To determine which mother node variations are likely to be errors, we explore several heuristics and demonstrate that frequency by itself is an insufficient predictor of errors. We instead propose a heuristic combining frequency with an expected-ambiguity measure and show that removing the rules thus flagged as errors from the set of rules used by a PCFG parser improves the parser's performance.
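To make the central idea concrete, here is a minimal sketch in Python of how mother node variation could be collected: local trees are grouped by their list of daughters, and any daughters list that occurs with more than one mother category is reported together with its counts. The function name `mother_variation`, the `(mother, daughters)` tuple representation, and the toy rules are assumptions made for illustration; they are not taken from the paper. The sketch does not implement the expected-ambiguity heuristic, which the abstract does not specify; it only produces the candidate set and the raw frequencies that such a heuristic would work from.

```python
from collections import Counter, defaultdict


def mother_variation(local_trees):
    """Group local trees by daughters list and return the lists that
    occur with more than one mother category, with per-mother counts.

    `local_trees` is an iterable of (mother, daughters) pairs, where
    `daughters` is a sequence of category labels (hypothetical format).
    """
    by_daughters = defaultdict(Counter)
    for mother, daughters in local_trees:
        by_daughters[tuple(daughters)][mother] += 1
    # Keep only daughters lists whose mother label varies.
    return {d: c for d, c in by_daughters.items() if len(c) > 1}


# Toy rules, invented for illustration only (not treebank data).
toy_rules = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("PP", ("DT", "NN")),   # same daughters, different mother
    ("VP", ("VBD", "NP")),
    ("PP", ("IN", "NP")),
]

variation = mother_variation(toy_rules)
for daughters, mothers in variation.items():
    print(" ".join(daughters), dict(mothers))
```

In the toy data only the daughters list `DT NN` shows variation. Ranking such cases by frequency alone would flag the rarer mother, which is the baseline the abstract describes as insufficient on its own.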


BibTeX entry:

@InProceedings{dickinson:meurers:05,
  author =       {Markus Dickinson and W. Detmar Meurers},
  title =        {Prune Diseased Branches to Get Healthy Trees!
                  How to Find Erroneous Local Trees in a Treebank
                  and Why It Matters},
  booktitle =    {Proceedings of the Fourth Workshop on Treebanks 
                  and Linguistic Theories (TLT 2005)},
  address =      {Barcelona, Spain},
  url =          {http://ling.osu.edu/~dm/papers/dickinson-meurers-tlt05.html},
  year =         {2005}
}