When working with actual language data, one major concern is how to represent that data. Often the things we as researchers want to compare are not strictly equivalent, so we must use the best approximations available. This introduces noise into any study; noise can be mitigated to some degree by increasing the size of a dataset, but how do you ensure the data is comparable to begin with?
Comparing linguistic data
Much of the comparative data linguists traditionally work with has been oriented toward understanding language history: how sound patterns change and how languages are related to each other. Comparing the sounds of cognate words to group related languages (and reconstruct their ancestral states, or “proto-languages”) has been a central concern of linguists, pursued perhaps most famously by the Brothers Grimm. It’s only relatively recently that linguists have even considered applying similar comparative approaches to other units of analysis, like syntax.
To compare words between languages, linguists developed the International Phonetic Alphabet (IPA), a systematic framework for notating the actual sounds that humans (and presumably other animals) can produce. Developing it took time and effort, with various revisions over the years (a topic deserving its own blog post), but by now it is extremely rare to encounter a speech sound that can’t be transcribed. As anyone who has taken an introductory phonetics course will tell you, once you master the transcription system it is hard to forget, but it does require consistent practice to train your ear to recognize the various sounds and their places and manners of articulation.
Dealing with different scripts
Transcription in IPA is basically the gold standard for comparative data, but even at this fine-grained level noise can creep in, because different linguists will hear slightly different sounds. In general the differences tend to be minor, and cognate sets can still be readily established, but it’s something to keep in mind. It also depends on what you intend to compare: if your interest is mainly syntax, the comparative data may not need to be as detailed or equivalent at the phonetic/phonemic level, but you might need more detail about morpheme breaks, for example.
In the case of the taggedPBC we have a large set of data available to us, but there are a number of different challenges in terms of making it comparable. One concern was that although the original Parallel Bible Corpus (PBC) had a lot of parallel verses, many of the texts used different scripts. The challenge here was to represent each of the texts in a similar fashion so that they could be compared and, ultimately, annotated for additional comparison.
In order to achieve this for the taggedPBC, I decided to convert all scripts to a romanized form. While Roman characters are not as precise as a phonemic transcription, they come closer to representing the actual sounds people say than a logographic or “abjad” representation would. The former may correspond to a complete syllable or word, while the latter traditionally only represents consonants. In the case of Arabic and other scripts that started as abjads, the use of diacritics does allow for a nearly phonemic transliteration. Along these lines, there has been some work toward romanizing various scripts, and there’s even a Python library - uroman - to implement computer transliteration.
The case of Chinese
One particularly tricky case, however, is that of the Chinese varieties. There are many different “dialects” spoken across the major regions of China, united by a common writing system. Although Chinese characters can be read by most educated folks, the local pronunciations differ quite a bit - so much so that a person from one region may find it difficult to understand someone from a faraway region if they each speak their local variety. This is similar to the situation in Germany, where a local from the north might find conversing with a southerner rather difficult unless they speak to each other in Standard (High) German, which is learned in school.
For the PBC there are two Bible translations written in Chinese characters - one in Mandarin and one in Cantonese. But both sets of characters are treated by the uroman tool as “Chinese” and are transliterated accordingly. Despite the pronunciations being rather different in the original, the resulting texts end up looking much more similar than they should. To combat this I found a different tool - pycantonese - to romanize the Cantonese text, with a much better result.
Need for better NLP tools & localization
This highlights the need for better localization efforts for many languages. While some work has been done along these lines (see my paper for a few links to libraries supporting word segmentation for various scripts), there is still more to be done. Imagine collecting data in rural China, for example - would you transcribe in IPA or in Chinese characters? If the latter, how would you then be able to compare varieties? For the purpose of linguistic comparison it is important to have a common framework. Hopefully as we continue to work on the many spoken varieties of the world’s languages, we’ll be able to collect this kind of fine-grained data in order to gain further insight into how languages work.