🗃️ NLPre-GA Dataset


Test datasets


The NLPre-GA benchmark covers a set of linguistic tasks (segmentation, lemmatization, morphological analysis, part-of-speech tagging, and dependency parsing) together with a collection of manually annotated test datasets selected for evaluating NLP models that perform these tasks.

NLPre-GA uses the modern Irish UD treebank, UD_Irish-IDT (Lynn et al., 2012), to evaluate the NLPre tasks. UD_Irish-IDT is a conversion of the Irish Dependency Treebank and contains 4910 sentences (trees), split as follows:

  • test: 454 trees
  • dev: 451 trees
  • train: 4005 trees
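
The split sizes above can be sanity-checked against a local copy of the treebank. The following is a minimal sketch, not part of the NLPre-GA tooling. It assumes the treebank files are in CoNLL-U format, where sentences are separated by blank lines and comment lines start with "#". The file name in the comment and the helper names (count_trees, check_split_file) are hypothetical, and the sample sentences are a toy illustration rather than treebank data.

```python
from pathlib import Path

# Split sizes as stated in this document.
EXPECTED_SPLITS = {"test": 454, "dev": 451, "train": 4005}


def count_trees(conllu_text: str) -> int:
    """Count sentences (trees) in CoNLL-U text.

    A sentence is a run of token lines; comment lines start with '#'
    and sentences are separated by blank lines.
    """
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            in_sentence = False          # blank line closes the sentence
        elif not line.startswith("#") and not in_sentence:
            count += 1                   # first token line of a new sentence
            in_sentence = True
    return count


def check_split_file(path: str, split: str) -> bool:
    """Compare the tree count of one file with the expected split size.

    `path` is whatever your local copy is called (hypothetical example:
    'ga_idt-ud-test.conllu').
    """
    text = Path(path).read_text(encoding="utf-8")
    return count_trees(text) == EXPECTED_SPLITS[split]


# Toy two-sentence sample (made up, annotation columns left as '_').
_row = lambda idx, form: "\t".join([str(idx), form] + ["_"] * 8)
SAMPLE = "\n".join([
    "# text = Dia duit",
    _row(1, "Dia"),
    _row(2, "duit"),
    "",
    "# text = Slán",
    _row(1, "Slán"),
    "",
])

print(count_trees(SAMPLE))                 # number of trees in the toy sample
print(sum(EXPECTED_SPLITS.values()))       # total across the three splits
```

Calling check_split_file on each downloaded file with the matching split name returns True when the file holds the expected number of trees.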

Test textual data


Download the zip file with the textual data to be processed.