Data annotation for low-resource NLP

2026-05-05

It’s that time of the semester when I’m in the throes of assessing student work for classes and trying to submit grades. One of my recent tasks has involved NLP projects related to the taggedPBC, with a particular focus on annotation and (semi-)automation of pos-tagging.

While there are arguments to be made that pos-tagging is not a useful task these days, given the advent of LLMs, I would say this is only really true if you have a lot of data for a language. LLMs can infer word classes (distributional clusters) in a corpus, but this requires many examples. When working with a small dataset (~100k words or less), word class information (parts of speech) can be quite valuable. Since the majority of the world’s 7,000 or so languages have few resources, tasks like pos-tagging and data annotation remain essential.

Data annotation (from Magnific)

Language technologies for low-resource languages

When it comes to language technologies, there have been some good efforts over the years for “smaller” languages (i.e. with fewer speakers). SIL in particular has developed software that can be used to support language research and development, mainly in terms of creating documents (translations, dictionaries, descriptions) that are community-oriented. Other notable tools for language research are of course Elan and Praat.

Attempts at developing technologies for downstream tasks have mostly focused on translation between languages, with a bit of work on specific languages. Again, the amount of data available makes a big difference. For example, while Facebook/Meta’s project ‘No Language Left Behind’ is a good initiative, 200 languages is a drop in the bucket, and many of the “translations” it produces are unintelligible, primarily due to structural differences between languages. The languages it works reasonably well for are those with a large amount of data. Further, support from large companies drops off quickly when there is little commercial benefit: populations for the majority of the world’s languages are quite small, which means there’s not much of a market for any technology developed for them.

Why annotated language data?

Most of the data available for low-resource languages is unannotated, existing primarily in monolingual documents or recordings, some of which may have been transcribed and/or aligned. A notable exception is the DoReCo project, which has led to some interesting work (e.g. segmentation, transcription) leveraging these corpora. But this is not a very large sample of languages.

DoReCo project

The assumption among NLP practitioners seems to be that, given enough data, many of the gains made for English and other high-resource languages can be transferred to low-resource languages. While I’m sympathetic to this view (I used the technique of pos-tag transfer to develop the baseline taggedPBC), it ignores the massive gains made in the initial stages of NLP for languages like English thanks to trained parsers (e.g. via the Penn Treebank) and lexical databases (e.g. WordNet). Such gains were only possible because of annotation of individual languages.

The importance of data annotation for NLP tasks, especially for low-resource languages, is highlighted particularly well in a recent paper by Michael Ginn and others affiliated with LECS lab in Colorado. They show that a model trained to automatically annotate glosses (direct translations of a word, typically into English) improved greatly when 91k sentences were added to their initial 250k-sentence annotated training database. The paper is worth a read for additional insights related to their objectives, particularly for low-resource languages, but one main takeaway is the need for well-annotated data.

All this to say, pos-tagging is still a valuable pursuit for most of these languages. While the taggedPBC provides a baseline via crosslingual transfer, it’s much better to have “eyes on” the data for individual languages. Which brings me to the student projects which focus on this problem.

Student projects and corpora development

I reported previously on the first iteration of the projects for this course, with details about the way the projects are structured. This semester we ended up with 14 fully re-annotated corpora and an additional 28 partially annotated corpora. That’s another 42 languages in the taggedPBC that have some degree of update to the tags that were transferred automatically via crosslingual word alignment.
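The crosslingual word-alignment transfer mentioned above can be sketched in a few lines. This is a minimal illustration, not the actual taggedPBC pipeline: the sentence, tags, and alignment pairs below are invented, and real workflows would produce alignments with a tool such as fast_align or eflomal.

```python
def project_tags(source_tags, alignment, target_len, default="X"):
    """Project POS tags from a tagged source sentence onto its target sentence.

    source_tags: list of POS tags for the source sentence
    alignment: list of (source_index, target_index) pairs
    target_len: number of tokens in the target sentence
    Unaligned target tokens receive the `default` tag.
    """
    target_tags = [default] * target_len
    for src_i, tgt_i in alignment:
        target_tags[tgt_i] = source_tags[src_i]
    return target_tags

# English source "the dog sleeps" tagged DET NOUN VERB, projected onto a
# hypothetical two-token target sentence where the determiner is unexpressed.
source_tags = ["DET", "NOUN", "VERB"]
alignment = [(1, 0), (2, 1)]
print(project_tags(source_tags, alignment, 2))  # ['NOUN', 'VERB']
```

Tags produced this way are only a baseline, which is exactly why the hand re-annotation in these student projects matters.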

Note here that not all of the projects were successful, and the unsuccessful ones are not included in the numbers above. In some cases, the quality of the re-annotation was not sufficient to be accepted as a replacement for the original corpus, and so those languages will be available to be worked on next semester. In other cases, the process highlighted that there was external data that could be leveraged for better quality taggers.

As an example, there has been some work on a tagged corpus for Dzongkha (spoken in Bhutan). The work was initially published in 2010 and 2011, but it seems the training corpus was never released and now the website where the tools were hosted (http://www.panl10n.net/) no longer exists. The student project from this semester at least improves on the baseline, and so I have integrated it into the taggedPBC, but the resources that already exist could greatly improve the quality of the corpus.

This is the reason for multiple quality checks. Through this process, not only does a language corpus get the benefit of having a dedicated student researcher looking for resources to assist in annotation, but I also evaluate each methodology and output.

The projects were quite diverse in their approaches, though most used some combination of rules (derived from linguistic descriptions and other sources) and statistical taggers trained on corpora to (semi-)automate pos-tagging. One team also developed a Naive Bayes classifier using character-based features, which can be particularly useful when trying to predict tags for morphologically complex (but largely regular) languages.
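A character-feature Naive Bayes tagger of the kind one team built can be sketched as follows. This is a toy version under my own assumptions (not the students’ code): features are word-final character n-grams, which often carry tag information in morphologically regular languages, and the tiny training set is invented English for illustration.

```python
import math
from collections import Counter, defaultdict

def features(word, n=3):
    """Word-final character n-grams up to length n."""
    w = word.lower()
    return [w[-k:] for k in range(1, min(n, len(w)) + 1)]

class NaiveBayesTagger:
    def __init__(self):
        self.tag_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, tagged_words):
        for word, tag in tagged_words:
            self.tag_counts[tag] += 1
            for f in features(word):
                self.feat_counts[tag][f] += 1
                self.vocab.add(f)

    def tag(self, word):
        total = sum(self.tag_counts.values())
        best_tag, best_lp = None, float("-inf")
        for tag, count in self.tag_counts.items():
            lp = math.log(count / total)  # log prior
            denom = sum(self.feat_counts[tag].values()) + len(self.vocab)
            for f in features(word):
                # add-one smoothing so unseen n-grams don't zero out a tag
                lp += math.log((self.feat_counts[tag][f] + 1) / denom)
            if lp > best_lp:
                best_tag, best_lp = tag, lp
        return best_tag

train_data = [("walked", "VERB"), ("talked", "VERB"), ("jumped", "VERB"),
              ("running", "VERB"), ("cats", "NOUN"), ("dogs", "NOUN"),
              ("tables", "NOUN"), ("houses", "NOUN")]
tagger = NaiveBayesTagger()
tagger.train(train_data)
print(tagger.tag("painted"))  # unseen word; "-ed" ending points to VERB
```

Because the features are sub-word, a tagger like this can make a reasonable guess for words it has never seen, which is the point of using character-based features on small corpora.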

Results and updates to findings

Part of the reason for the existence of the taggedPBC is to support research on universal properties of human language. For example, I have shown that word lengths derived from corpora can predict basic word order, which has implications for our understanding of language processing and language evolution. So far, the improvements to annotations for individual languages have not changed that finding. But as annotations continue to improve, I’m hopeful that they will allow us to answer many more questions about how different structures in language emerge.