Romanization of scripts and NLP resources
When I began developing the taggedPBC
I needed to represent all of the roughly 2,000
languages in a similar way, in order to
facilitate comparison. As noted in a previous
post, while the majority of the world’s
languages use roman alphabetic orthographies,
many have non-roman scripts. This meant that I
needed a systematic way to convert scripts to
the same format. This post outlines some of the
issues and provides a brief tutorial on
transliteration using existing Python libraries,
using the example of Japanese.
Converting scripts: the example of Japanese
An existing tool, uroman,
can transliterate most scripts to a
romanized form, which then allows for
comparison. But this tool simply maps between
representations - it does not handle
segmentation or other typical NLP tasks. As an
example, consider the following implementation.
Here we use Python to load the
uroman library and get it to
transliterate a Japanese text string.
import uroman as ur # import the library
uroman = ur.Uroman() # load uroman data (takes about a second or so)
jpnstring = '人のふり見てわがふり直せ' # 'Hito no furi mite waga furi naose'
print("uroman output autodetect:", uroman.romanize_string(jpnstring)) # auto-detect the language
print("uroman output jpn:", uroman.romanize_string(jpnstring, lcode='jpn')) # specify the iso 639-3 language code
## Output:
# uroman output autodetect: rennofurijiantewagafurizhise
# uroman output jpn: rennofurijiantewagafurizhise
There are two issues worth
highlighting here: the first relates to
tokenization, and the second to incorrect
romanization. Japanese is a good example
because it actively uses multiple scripts,
with a complex historical relationship to
Chinese.
Tokenization concerns
Note that in Japanese, tokenization (or segmentation) of a text string into individual words is handled primarily by the reader, as traditional Japanese script does not identify word boundaries. This means that any computational tools need to be rather complex in order to tokenize Japanese text and use it for downstream tasks. One way of handling this is to use a specialized Japanese NLP tool like MeCab to first tokenize the text string.
Language-specific concerns (homographs, heteronyms)
The second issue is that many Japanese words/characters share written forms (and possibly meanings) but have different pronunciations depending on placement in a string. These are ‘heteronyms’ or ‘homographs’ (as opposed to ‘homophones’ which are words that mean different things but share the same pronunciation). The example text we used from Japanese is actually a complete sentence, the well-known proverb “Hito no furi mite waga furi naose”, which corresponds to English “One man’s fault is another’s lesson” (lit: Watch others’ behavior and correct your own behavior [other POSS behavior watch, own behavior correct]).
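To make the ambiguity concrete, here is a minimal sketch of why a context-free character-to-reading mapper cannot resolve homographs. The reading table below is hand-built and purely illustrative, not a real lexical resource:

```python
# Illustrative only: the same written form can have several readings,
# and the correct one depends on context. This table is hypothetical
# and far from exhaustive.
READINGS = {
    "人": ["hito", "jin", "nin"],  # 人 hito 'person'; 日本人 nihonjin; 三人 sannin
    "見": ["mi", "ken"],           # 見て mite 'looking'; 意見 iken 'opinion'
}

def naive_readings(char):
    """Return every attested reading; without context, we cannot pick one."""
    return READINGS.get(char, [])

print(naive_readings("人"))  # ['hito', 'jin', 'nin'] -- ambiguous in isolation
```

A tool like uroman effectively has to make one global choice per character, which is why context-sensitive tokenizers with dictionary support do better.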
Clearly, the uroman tool does
not handle either tokenization or
language-specific homography, instead
implementing a kind of “brute force”
frequency-based replacement approach. This is a
good baseline, but for better results we should
probably use a language-specific
tokenizer/romanizer. In the code below, we use
fugashi
(a wrapper for MeCab) to implement this
pipeline.
from fugashi import Tagger # import the tagger
tagger = Tagger('-Owakati') # instantiate the tagger
res = tagger.parse(jpnstring) # parse the string
print("fugashi tagger output:", res)
print("uroman parsed fugashi output:", uroman.romanize_string(res, lcode='jpn'))
## Output:
# fugashi tagger output: 人 の ふり 見 て わが ふり 直せ
# uroman parsed fugashi output: ren no furi jian te waga furi zhise
We can see from the output that
fugashi introduces spaces between
combinations of characters that are deemed to be
separate words. We can then convert this to a
romanization using uroman, which
gives us a representation of words separated by
whitespace. Again, this is not ideal
because it defaults to representing
人 as “ren” rather
than the correct “hito”, and other
characters are also incorrectly identified
(見 て = “jian te” vs
“mite”; 直せ =
“zhise” vs “naose”).
This is an ok baseline for languages with few
NLP resources, but a better romanization tool
for Japanese would be something like cutlet.
The code below shows the results using the
language-specific tool for romanization
instead.
import cutlet # import the library
katsu = cutlet.Cutlet() # instantiate the tool
katsu.use_foreign_spelling = False # disable using foreign spelling (on by default)
krji = katsu.romaji(jpnstring) # convert to romanized orthography
print("romanized using cutlet:", krji)
## Output:
# romanized using cutlet: Hito no furi mite waga furi naose
Conclusion
While using a language-specific tool is
obviously the preferred option, this is not
possible for the majority of the world’s
languages, which simply don’t have the
resources. A tool like uroman gives
a baseline romanized
representation of individual words in a sentence
for non-roman scripts, which then lets us
compare strings in one language to strings in
other languages with roman(ized) orthographies.
It is not as good a comparison as
phonetic/phonemic transcription, but it at least
allows for the comparison of speech/text in a
standardized way across languages. This can be
important for tasks like machine translation via
word alignment, which can then facilitate other
tasks like part-of-speech transfer, which I
describe in a bit more detail in my
paper.
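As a rough sketch of what such cross-language comparison might look like once strings are romanized, the standard library's SequenceMatcher can score orthographic similarity. The metric here is my own illustrative choice, not part of uroman or the taggedPBC pipeline:

```python
from difflib import SequenceMatcher  # stdlib string similarity

def similarity(a, b):
    """Crude orthographic similarity between two romanized strings (0.0-1.0)."""
    return SequenceMatcher(None, a, b).ratio()

# Compare uroman's frequency-based output against the correct romanization
# (both strings taken from the examples above).
baseline = "ren no furi jian te waga furi zhise"  # uroman via fugashi
correct = "hito no furi mite waga furi naose"     # cutlet
print(round(similarity(baseline, correct), 2))
```

Even with the wrong readings, the romanized strings remain partially comparable, which is what makes a baseline romanizer useful for alignment tasks across many languages.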
The issue of orthography/scripts is an additional layer of complexity that needs to be managed when attempting to compare between/across languages. For many languages, there are general solutions that can be applied, but for others more specialized approaches are needed. This is where domain knowledge of particular languages can assist with making decisions about how a language should be processed, and is a future direction for developing the taggedPBC.
Some libraries for language-specific NLP
The following are some useful libraries for parsing and otherwise working with specific scripts, languages, and groups of languages. This is not an exhaustive list by any means. If you are a specialist in these languages, consider contributing to their development. And if you have expertise in languages not on this list, consider whether you could develop a resource to support the languages you know. Additionally, if you have preferred libraries you use for other languages, do get in touch.