Romanization of scripts and NLP resources
When I began developing the taggedPBC
I needed to represent all of the roughly 2,000
languages in a similar way, in order to
facilitate comparison. As noted in a previous
post, while the majority of the world’s
languages use roman alphabetic orthographies,
many have non-roman scripts. This meant that I
needed a systematic way to convert scripts to
the same format. This post outlines some of the
issues and provides a brief tutorial on
transliteration using existing Python libraries,
using the example of Japanese.
Converting scripts: the example of Japanese
An existing tool, uroman,
can transliterate most scripts to a
romanized form, which then allows for
comparison. But this tool simply maps between
representations - it does not handle
segmentation or other typical NLP tasks. As an
example, consider the following implementation.
Here we use Python to load the
uroman library and get it to
transliterate a Japanese text string.
import uroman as ur # import the library
uroman = ur.Uroman() # load uroman data (takes about a second or so)
jpnstring = '人のふり見てわがふり直せ' # 'Hito no furi mite waga furi naose'
print("uroman output autodetect:", uroman.romanize_string(jpnstring)) # auto-detect the language
print("uroman output jpn:", uroman.romanize_string(jpnstring, lcode='jpn')) # specify the iso 639-3 language code
## Output:
# uroman output autodetect: rennofurijiantewagafurizhise
# uroman output jpn: rennofurijiantewagafurizhise
There are two issues worth
highlighting here: the first relates to
tokenization, and the second to incorrect
romanization. Japanese is a good example
because it actively uses multiple scripts,
with a complex historical relationship to
Chinese.
Tokenization concerns
Note that in Japanese, tokenization (or segmentation) of a text string into individual words is handled primarily by the reader, as traditional Japanese script does not identify word boundaries. This means that any computational tools need to be rather complex in order to tokenize Japanese text and use it for downstream tasks. One way of handling this is to use a specialized Japanese NLP tool like MeCab to first tokenize the text string.
Language-specific concerns (homographs, heteronyms)
The second issue is that many Japanese words/characters share written forms (and possibly meanings) but have different pronunciations depending on placement in a string. These are ‘heteronyms’ or ‘homographs’ (as opposed to ‘homophones’ which are words that mean different things but share the same pronunciation). The example text we used from Japanese is actually a complete sentence, the well-known proverb “Hito no furi mite waga furi naose”, which corresponds to English “One man’s fault is another’s lesson” (lit: Watch others’ behavior and correct your own behavior [other POSS behavior watch, own behavior correct]).
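To make the ambiguity concrete, here is a minimal sketch of why a context-free character-to-reading mapper cannot resolve homographs. The reading table below is hand-built and purely illustrative, not a real lexical resource:

```python
# Illustrative only: the same written form can have several readings,
# and the correct one depends on context. This table is hypothetical
# and far from exhaustive.
READINGS = {
    "人": ["hito", "jin", "nin"],  # 人 hito 'person'; 日本人 nihonjin; 三人 sannin
    "見": ["mi", "ken"],           # 見て mite 'looking'; 意見 iken 'opinion'
}

def naive_readings(char):
    """Return every attested reading; without context, we cannot pick one."""
    return READINGS.get(char, [])

print(naive_readings("人"))  # ['hito', 'jin', 'nin'] -- ambiguous in isolation
```

A tool like uroman effectively has to make one global choice per character, which is why context-sensitive tokenizers with dictionary support do better.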
Clearly, the uroman tool does
not handle either tokenization or
language-specific homography, instead
implementing a kind of “brute force”
frequency-based replacement approach. This is a
good baseline, but for better results we should
probably use a language-specific
tokenizer/romanizer. In the code below, we use
fugashi
(a wrapper for MeCab) to implement this
pipeline.
from fugashi import Tagger # import the tagger
tagger = Tagger('-Owakati') # instantiate the tagger
res = tagger.parse(jpnstring) # parse the string
print("fugashi tagger output:", res)
print("uroman parsed fugashi output:", uroman.romanize_string(res, lcode='jpn'))
## Output:
# fugashi tagger output: 人 の ふり 見 て わが ふり 直せ
# uroman parsed fugashi output: ren no furi jian te waga furi zhise
We can see from the output that
fugashi introduces spaces between
combinations of characters that are deemed to be
separate words. We can then convert this to a
romanization using uroman, which
gives us a representation of words separated by
whitespace. Again, this is not ideal
because it defaults to representing
人 as “ren” rather
than the correct “hito”, and other
characters are also incorrectly identified
(見 て = “jian te” vs
“mite”; 直せ =
“zhise” vs “naose”).
This is an ok baseline for languages with few
NLP resources, but a better romanization tool
for Japanese would be something like cutlet.
The code below shows the results using the
language-specific tool for romanization
instead.
import cutlet # import the library
katsu = cutlet.Cutlet() # instantiate the tool
katsu.use_foreign_spelling = False # disable using foreign spelling (on by default)
krji = katsu.romaji(jpnstring) # convert to romanized orthography
print("romanized using cutlet:", krji)
## Output:
# romanized using cutlet: Hito no furi mite waga furi naose
Conclusion
While using a language-specific tool is
obviously the preferred option, this is not
possible for the majority of the world’s
languages, which simply don’t have the
resources. A tool like uroman gives
a baseline romanized
representation of individual words in a sentence
for non-roman scripts, which then lets us
compare strings in one language to strings in
other languages with roman(ized) orthographies.
It is not as good a comparison as
phonetic/phonemic transcription, but it at least
allows for the comparison of speech/text in a
standardized way across languages. This can be
important for tasks like machine translation via
word alignment, which can then facilitate other
tasks like part-of-speech transfer, which I
describe in a bit more detail in my
paper.
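As a rough sketch of what such cross-language comparison might look like once strings are romanized, the standard library's SequenceMatcher can score orthographic similarity. The metric here is my own illustrative choice, not part of uroman or the taggedPBC pipeline:

```python
from difflib import SequenceMatcher  # stdlib string similarity

def similarity(a, b):
    """Crude orthographic similarity between two romanized strings (0.0-1.0)."""
    return SequenceMatcher(None, a, b).ratio()

# Compare uroman's frequency-based output against the correct romanization
# (both strings taken from the examples above).
baseline = "ren no furi jian te waga furi zhise"  # uroman via fugashi
correct = "hito no furi mite waga furi naose"     # cutlet
print(round(similarity(baseline, correct), 2))
```

Even with the wrong readings, the romanized strings remain partially comparable, which is what makes a baseline romanizer useful for alignment tasks across many languages.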
The issue of orthography/scripts is an additional layer of complexity that needs to be managed when attempting to compare between/across languages. For many languages, there are general solutions that can be applied, but for others more specialized approaches are needed. This is where domain knowledge of particular languages can assist with making decisions about how a language should be processed, and is a future direction for developing the taggedPBC.
Some libraries for language-specific NLP
The following are some useful libraries for parsing and otherwise working with specific scripts, languages, and groups of languages. This is not an exhaustive list by any means. If you are a specialist in these languages, consider contributing to their development. And if you have expertise in languages not on this list, consider whether you could develop a resource to support the languages you know. Additionally, if you have preferred libraries you use for other languages, do get in touch.