Since 2023 I’ve been teaching a course on natural language processing at NTU Singapore. The goal is for students to learn basic Python programming skills that allow them to work with language data. Typically, in other NLP courses I’ve seen (and in past versions of this course), this involves using toy examples and projects that help to teach the concepts. Last year, I decided on an approach that would introduce students to working with actual low-resource languages. This post outlines the projects that I developed along with some of the things I learned through the process of implementation and the results.
Preliminaries
As part of the course I teach some basics of computer science (components of a computer, how to navigate the folder structure, how to use the command line/terminal). I allow students to use AI tools, and since I emphasize the problem solving and clear-thinking aspects of programming, this has not been an issue when assessing work.
One observation I made early on is that it is very difficult to learn programming if you don’t have a specific task or project to work on. Since the majority of my students are linguistics majors (learning phonetics, phonology, morphology, and syntax, and working with real languages), I wanted to find something for them to work on related to their expertise. On the other hand, I wanted to be able to introduce CS and other students with more technical majors to the importance of linguistic/domain knowledge.
In past years students have written machine translation programs using pairs of languages in the Universal Declaration of Human Rights (leveraging parallel text), or part-of-speech (POS) taggers using annotated corpora (like the Pnar corpus) for different languages. However, availability of resources for most of the 7,000+ languages in the world is rather spotty, which means that students often end up working on the same small set of languages, and it is not clear that the toy projects have much benefit - they are useful for learning but don’t produce much that is meaningful.
However, it just so happens that I’ve been developing a large dataset of parallel texts for investigating crosslinguistic properties of language. The taggedPBC is based on the Parallel Bible Corpus, expanded with additional language data, and with a portion of verses automatically annotated for parts of speech. I wrote a paper on its development (which just got published! View it for free here), but more on that in another post. Suffice it to say that the baseline dataset is sufficiently annotated for broad comparison, but a majority of the languages could use much more detailed annotation.
The projects
The idea for these projects was partly inspired by a post on X/Twitter by Tom McCoy at the beginning of 2025:

> I added a new assignment to my Computational Linguistics class last semester:
>
> - Choose a linguistic phenomenon in a language other than English
> - Give a 3-minute presentation about that phenomenon & how it would pose a challenge for computational models
>
> Would recommend!
>
> — Tom McCoy (@RTomMcCoy) January 28, 2025
Rather than simply a presentation, however, I wanted students to be engaged in developing an actual resource. It needed to be simple enough for a beginner, but complex enough to be engaging, while requiring both programming and critical thinking skills. For this purpose I came up with two projects, both related to the taggedPBC.
Project 1: annotation of a dataset in the taggedPBC
- For Project 1, students choose a language in the taggedPBC that has not been worked on before and that no other student is working on (first come, first served). A list of languages is updated every semester - here is the list for 2025.
- The first task is to use their programming skills to extract a set of verses from the CoNLLU-formatted corpus of the language they chose. These are 21 verses that have been identified as having the largest coverage of diverse parts of speech across all languages (a minimal extraction sketch follows this list).
- The second task is to improve the annotations for these 21 verses. This may involve writing rules to tag the data (semi-)automatically, or finding other resources such as grammatical descriptions, dictionaries, or wordlists for the language, and using those in various ways.
- The final output should be the code to extract the verses, any additional code used to (semi-)automate the annotation process, a well-annotated set of verses, and a writeup that discusses the language and the process that the student followed, with any references to other resources for the language.
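To give a concrete sense of the first task, here is a minimal sketch of the extraction step using the third-party conllu package. The file names, the metadata key, and the verse IDs are placeholders of my own (students receive the actual list of 21 verse IDs), so treat this as an illustration rather than the course solution.

```python
# Minimal sketch: pull a fixed set of verses out of a CoNLLU corpus.
# Assumes verse IDs appear in each sentence's "sent_id" metadata line;
# the IDs and file names below are hypothetical placeholders.
import conllu

TARGET_VERSES = {"40001018", "40001019", "40001020"}  # placeholder IDs

def extract_verses(path, targets):
    """Return the CoNLLU sentences whose sent_id is in `targets`."""
    with open(path, encoding="utf-8") as f:
        sentences = conllu.parse(f.read())
    return [s for s in sentences if s.metadata.get("sent_id") in targets]

if __name__ == "__main__":
    verses = extract_verses("my_language.conllu", TARGET_VERSES)
    with open("selected_verses.conllu", "w", encoding="utf-8") as out:
        for sent in verses:
            out.write(sent.serialize())  # keeps valid CoNLLU for re-annotation
```

Writing the selection back out in CoNLLU format means students can hand-edit the tags in any text editor while keeping the file machine-readable.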
Project 2: developing a POS tagger for a dataset in the taggedPBC
- Project 2 is a group project, where 2-3 students again choose a language from the taggedPBC in order to train a POS tagger.
- The language can either be one that a member of the group worked on for project 1, or it can be one that is present in both the taggedPBC and the UD Treebanks.
- Here the students have a couple of options: they can either a) annotate another 100 verses and train a POS tagger, using the 21 verses from project 1 as a gold-standard evaluation set (a training/evaluation sketch follows this list), or b) use the hand-annotated UDT data for training/evaluation.
- Final output is code for processing the data and training/evaluating a POS tagger, along with annotated datasets and a writeup that describes the process.
- The ultimate goal is a well-annotated, complete dataset to replace the original and significantly reduce the number of unknown tags, but I treated this as a bonus.
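For option a), the core training/evaluation loop can be quite short. The sketch below uses NLTK's averaged perceptron tagger; the loader is a deliberately simplified stand-in for students' own CoNLLU-reading code, and the file names are hypothetical.

```python
# Sketch of option a): train NLTK's perceptron tagger on ~100 annotated
# verses and score it against the 21 gold-standard verses from project 1.
from nltk.tag.perceptron import PerceptronTagger

def load_tagged_sents(path):
    """Read a CoNLLU file into [[(word, upos), ...], ...] (simplified)."""
    sents, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line ends a sentence
                if current:
                    sents.append(current)
                    current = []
            elif line.startswith("#"):        # skip metadata/comments
                continue
            else:
                cols = line.split("\t")
                if cols[0].isdigit():         # skip multiword/empty tokens
                    current.append((cols[1], cols[3]))  # FORM, UPOS
    if current:
        sents.append(current)
    return sents

train_sents = load_tagged_sents("train_100_verses.conllu")  # placeholder name
gold_sents = load_tagged_sents("gold_21_verses.conllu")     # placeholder name

tagger = PerceptronTagger(load=False)         # start from an untrained model
tagger.train(train_sents, nr_iter=10)
# Note: on older NLTK versions this method is .evaluate()
print(f"Accuracy on gold verses: {tagger.accuracy(gold_sents):.3f}")
```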
Reflections
This is the first year that I attempted to integrate dataset development for low-resource languages into my NLP course, and I would say the outcome(s) greatly exceeded my expectations.
What worked:
Just in terms of output, we ended up with 6 fully re-annotated corpora (1,800+ verses) for low-resource languages (from project 2) and another 15 with fully annotated sets of between 21 and 121 verses (from projects 1 & 2). Students were also given the option of including their names as contributors, and many of them were happy to do so, which is encouraging.
From a teaching perspective I also think it worked for multiple reasons:

- Because actual language data is messy, the students are forced to engage with the problem of annotation. This is not something that can be ‘farmed out’ to an AI tool for any decent result.
- Because these are low-resource languages, the only way to effectively annotate the data is to acquire some degree of familiarity with the languages, which means you actually must develop a measure of expertise and apply linguistic knowledge.
- Attempting to write a program to support annotation highlights the difficulty of developing such a system (orthography, word formation, homophones, translation issues, etc.), and grappling with these concerns can lead to some unique solutions.
- POS tagging is not a ‘one size fits all’ kind of problem, and for many languages the more recent approaches using large models don’t work well without significant effort. But for an individual language you can get pretty far with some specialized knowledge, a few rules, and the perceptron statistical tagger in NLTK, which is a decent introduction to ML concepts (see the sketch after this list).
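To make the "a few rules plus specialized knowledge" point concrete, here is a toy sketch of the kind of rule-based pre-tagger a student might write to reduce unknown tags before hand-checking. The lexicon entries, suffix rule, and example tokens are invented for a hypothetical language, not drawn from any real dataset.

```python
# Toy rule-based pre-tagger: known function words and affix patterns get
# tags automatically; everything else stays "X" for manual annotation.
# The entries below are invented placeholders for a hypothetical language.
LEXICON = {"ka": "ADP", "u": "DET"}      # hypothetical closed-class words
SUFFIX_RULES = [("ang", "VERB")]         # hypothetical affix -> tag mapping

def rule_tag(token):
    lowered = token.lower()
    if lowered in LEXICON:
        return LEXICON[lowered]
    for suffix, tag in SUFFIX_RULES:
        if lowered.endswith(suffix):
            return tag
    return "X"                           # unknown: left for hand-checking

print([(w, rule_tag(w)) for w in "u bawang ka mih".split()])
```

Even a handful of such rules, derived from a grammar sketch or wordlist, can cut the manual annotation burden considerably for a morphologically regular language.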
What was problematic:
Apart from the usual few students who just checked out or missed classes and then realized they couldn’t catch up on the material, the main problem I faced in running the projects was helping students understand the task, which got easier as I gave examples. A few other difficulties are worth mentioning.
- The CoNLLU format is not the easiest to work with initially, but it becomes more accessible after some exposure. I view this as a good opportunity to learn a robust annotation format for linguistic data.
- There were some high-resource languages in the “allowed languages list” that were not very well annotated. Specifically, certain varieties of Malay and widely spoken languages like Burmese were still listed among the possible options for students to work on. I ended up making case-by-case decisions about whether it was ok for students to work on them.
- I hadn’t determined before the course started how to assess taggers trained on UDT data; we dealt with this by deciding that evaluation should be done on a held-out portion of the training data (a minimal split sketch follows this list).
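For the UDT option, that held-out evaluation amounts to a simple shuffled split before training. A minimal sketch (with a placeholder toy dataset; in practice the sentences come from the UDT training file) might look like this:

```python
# Minimal held-out split for tagger evaluation. The toy data below is a
# placeholder for tagged sentences loaded from a UDT training file.
import random

def holdout_split(sents, test_frac=0.1, seed=0):
    """Shuffle tagged sentences and split off a test portion."""
    rng = random.Random(seed)            # fixed seed: reproducible grading
    sents = list(sents)
    rng.shuffle(sents)
    cut = int(len(sents) * (1 - test_frac))
    return sents[:cut], sents[cut:]

toy = [[("a", "DET")], [("run", "VERB")], [("dog", "NOUN")]]
train, test = holdout_split(toy, test_frac=0.34)
```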
Student feedback:
Student feedback was quite positive. A number of the students found a language to work on that they had a personal connection to. Some even found native speakers to work with or worked on languages that they were actively learning. One student emailed me to say:
> I thoroughly enjoyed the projects and knowledge this course gave me. It is one of the few courses so far that I was genuinely looking forward to doing the projects and explore more on my own.
Many were excited at having their efforts recognized, as well as contributing to low-resource language development in general. One also mentioned that it made the task of learning to program less intimidating.
Final thoughts
Learning Python or other tools for processing language is very useful, but for a lot of people it is simply a means to an end. These projects were developed in part to find a way to make learning Python, even for beginner linguistics students, a productive process. By working on an actual low-resource language and developing even a small portion of the data, these students are learning valuable skills while supporting language development and helping to preserve some of the 7,000+ languages of the world.
More work is definitely needed - I estimate that, running this course twice a year at the current rate (or even at an increased productivity rate), it would take over 50 years to annotate the majority of the taggedPBC. Clearly we need additional assistance from other sources to complete the task. If you are interested in using the materials for this course, and possibly even getting your students to work on some of these languages, do get in touch.