Contributing to the taggedPBC

2026-06-02

If you’ve followed my blog for any length of time, you probably know that I’m interested in languages of the world. The incredible diversity of linguistic structures found in the world’s roughly 7,000 languages is a long-standing subject of research, and it offers opportunities to uncover much about how we humans process the world around us. But this kind of discovery depends largely on the availability of data, which is lacking for the vast majority of languages.

What are “low-resource” languages?

Low-resource languages (one definition from Felix Lauman)

Most languages spoken in the world today are termed “low-resource”. This term has a couple of different meanings (see an NLP-oriented definition linked to the image above), but it generally refers to the availability of data for a language. For example, the ASJP has wordlists for nearly all known languages of the world, but the majority of languages are represented by fewer than 100 terms. The CHILDES corpora, on the other hand, have hundreds of hours of transcribed texts, but for a much more limited set of languages. A language represented in the CHILDES corpora clearly has more resources than one with only a short wordlist, but if the data is not clearly annotated or labeled it may not be very usable, and so that language might still be considered “low-resource” for a specific use case.

Another consideration is whether there are technologies or tools available for a specific language. Beyond actual data in text form, are there literacy materials to aid learners, or dictionaries, or grammars? Or is there annotated data of any kind? Is there translation software, or web-based resources? You may find a language for which 10 different dictionaries are reported, but are those resources still available, either digitally or in print?

High-resource languages are typically those for which diverse data is available from multiple sources, with a wide variety of annotations and many tools to access the data. Low-resource languages have limited data with little diversity, and what data exists is often difficult to access.

How to address this

Languages represented in the taggedPBC

The taggedPBC was developed in part as an attempt to address some of these concerns. The goal is to provide a baseline of annotated data for a large number of low-resource languages in order to facilitate downstream development. Ultimately, the hope is that all of the corpora will be annotated for part of speech and dependency information, as well as morphology and phonetic transcriptions. This could be further expanded in different ways for various purposes.

Annotations do require some expertise in order to be useful, and differing degrees of expertise result in differing qualities of annotation. Since I am not an expert in the majority of these languages, I started with a statistical approach as a baseline: identifying basic word classes (nouns and verbs) for each language via word alignment. To validate this approach, I compared word order information derived from the resulting corpora (the “N1 ratio”) with expert determinations of word order, and (somewhat surprisingly) the two agreed to a considerable degree.

The N1 ratio and word order in 3 typological databases
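To make the idea of deriving word order information from tagged text a bit more concrete, here is a minimal Python sketch. This post does not spell out how the N1 ratio is computed, so the sketch assumes one plausible reading: count the sentences in which a noun appears before any verb, and divide by the count of sentences in which a verb appears first. The function names, the tag labels, and the toy English corpus are all hypothetical and are not taken from the taggedPBC code.

```python
from typing import List, Optional, Tuple

# A tagged sentence is a list of (word, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def first_class(sentence: TaggedSentence) -> Optional[str]:
    """Return "NOUN" or "VERB", whichever tag occurs first, or None if neither occurs."""
    for _, tag in sentence:
        if tag in ("NOUN", "VERB"):
            return tag
    return None


def noun_first_ratio(corpus: List[TaggedSentence]) -> Optional[float]:
    """Noun-first sentences divided by verb-first sentences.

    ASSUMPTION: this is an illustrative stand-in for the N1 ratio,
    not necessarily the definition used for the taggedPBC.
    Returns None when there are no verb-first sentences.
    """
    noun_first = sum(1 for s in corpus if first_class(s) == "NOUN")
    verb_first = sum(1 for s in corpus if first_class(s) == "VERB")
    if verb_first == 0:
        return None
    return noun_first / verb_first


# Toy corpus (hypothetical): two noun-first sentences, one verb-first,
# and one sentence with neither a noun nor a verb (ignored).
toy_corpus: List[TaggedSentence] = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("runs", "VERB"), ("the", "DET"), ("dog", "NOUN")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("yes", "INTJ")],
]

print(noun_first_ratio(toy_corpus))  # 2.0
```

Under this reading, a value well above 1 would suggest a language that tends to put nouns before verbs, and a value well below 1 the reverse. That is the kind of signal that can then be compared against expert word order classifications.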

Updating the taggedPBC

This dataset is just a starting point, a baseline. I’m continuing to update annotations, in part with the assistance of students. I hope to recruit additional specialists, and have adopted the CoNLL-U annotation framework to support this goal. The dataset is freely available on GitHub, which also means that contributing to a given dataset is as simple as opening an issue or pull request. If you are willing and able to contribute, please do so!
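For anyone who hasn’t worked with CoNLL-U before, the format is plain text: one token per line with ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines starting with `#`, and a blank line between sentences. The sketch below is a rough, dependency-free reader, assuming well-formed input; the function name and the sample sentence are made up for illustration, and the actual files in the repository may contain more or fewer filled-in fields.

```python
from typing import Dict, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():           # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):       # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 tab-separated fields, got {len(cols)}: {line!r}")
        # For simplicity, skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:                        # the file may not end with a blank line
        sentences.append(current)
    return sentences


# Hypothetical sample; "_" marks a field that is not filled in.
rows = [
    ["1", "The", "the", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "dog", "dog", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "runs", "run", "VERB", "_", "_", "0", "root", "_", "_"],
]
sample = "# text = The dog runs\n" + "\n".join("\t".join(r) for r in rows) + "\n"

parsed = parse_conllu(sample)
print([tok["upos"] for tok in parsed[0]])  # ['DET', 'NOUN', 'VERB']
```

Because every annotation layer has its own column, a contributor can correct a single field (say, one part-of-speech tag) without touching the rest of the line, which keeps pull requests small and easy to review.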

I also wonder about expanding the dataset to include other kinds of data. For now, the dataset is composed exclusively of select verses from the Bible (New Testament). This means that all languages are represented by similar contexts, but this also reduces the diversity of the data. This can be good for some use cases and not so good for others.

There is still a lot of work to do to develop this dataset, but as we make incremental gains, I am optimistic that these resources will benefit speakers of these languages as well as our understanding of how languages work.