Linguistic Tools


This past week I’ve been attending a workshop on the linguistic notion of Affectedness that my co-supervisor Frantisek Kratochvil organized. It has really helped me think about possible ways this feature could be at work in Pnar verbal constructions. And if you didn’t understand that sentence, feel free to ask and I’ll try to explain it better. My brain has been fried most days this week.

While at the workshop and in the evenings, I’ve been working on organizing my linguistic database. A few weeks ago my friend Matt showed me how to use Python scripting to format and search the texts exported from my Toolbox database.

Toolbox allows for interlinearization of linguistic data, which is the standard way of presenting examples in linguistic papers and lets people who don’t know anything about the language see its grammatical structure. An interlinearized example usually includes a line of text in the local orthography, followed by an IPA (International Phonetic Alphabet) representation, a line of word-for-word glosses (translations), and a free translation. Glosses and free translation are usually in English.
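For the curious, here is a rough Python sketch of how one interlinearized record might be read into aligned lines. It is not the actual script. The backslash markers (`\ref`, `\tx`, `\ipa`, `\ge`, `\ft`) are assumptions about how a Toolbox export could be labelled, and the words are made-up placeholders rather than real Pnar data.

```python
# Hypothetical exported record: one backslash marker per line.
# Marker names and words are placeholders, not real corpus data.
SAMPLE_RECORD = "\n".join([
    r"\ref demo.001",
    r"\tx ka foo bar",
    r"\ipa ka fo ba",
    r"\ge G1 G2 G3",
    r"\ft placeholder free translation",
])


def parse_record(text):
    """Map each marker (without the backslash) to the text on its line."""
    tiers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank or unmarked lines
        marker, _, content = line[1:].partition(" ")
        tiers[marker] = content.strip()
    return tiers


def align_tiers(tiers, names=("tx", "ipa", "ge")):
    """Zip the word-level tiers together, one tuple per word."""
    columns = [tiers[name].split() for name in names]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("tiers have different numbers of words")
    return list(zip(*columns))


record = parse_record(SAMPLE_RECORD)
for word, ipa, gloss in align_tiers(record):
    print(word, ipa, gloss)
print(record["ft"])
```

Each word ends up paired with its IPA form and its gloss, which is the alignment an interlinear example needs.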

The script Matt wrote (with my input) allows for regex (regular expression) searching and output. So in my corpus I can find all the verbs followed by nouns, for example, or all the verbs preceded by the form ‘ka’, and output their context.
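To give a flavour of that kind of search, here is a minimal sketch (not Matt's script). It assumes each word already carries a part-of-speech tag, writes every sentence out as `word/TAG` tokens, and runs ordinary regular expressions over the result. The tags `V`, `N` and `PRT` and all the words except ‘ka’ are invented placeholders.

```python
import re

# Hypothetical tagged corpus: each sentence is a list of (word, tag) pairs.
CORPUS = [
    [("ka", "PRT"), ("foo", "V"), ("bar", "N")],
    [("baz", "V"), ("ka", "PRT"), ("qux", "V")],
]

# A verb immediately followed by a noun.
VERB_NOUN = re.compile(r"(?<!\S)(\S+)/V (\S+)/N(?!\S)")
# A verb immediately preceded by the form 'ka' (whatever its tag).
KA_VERB = re.compile(r"(?<!\S)ka/\S+ (\S+)/V(?!\S)")


def as_text(sentence):
    """Flatten a tagged sentence into a 'word/TAG word/TAG' string."""
    return " ".join(f"{word}/{tag}" for word, tag in sentence)


def find_matches(pattern, corpus):
    """Return (sentence index, matched groups, whole sentence) per hit."""
    hits = []
    for index, sentence in enumerate(corpus):
        text = as_text(sentence)
        for match in pattern.finditer(text):
            hits.append((index, match.groups(), text))
    return hits


for index, groups, context in find_matches(KA_VERB, CORPUS):
    print(index, groups, "|", context)
```

Returning the whole sentence alongside each hit is what makes it possible to read every match in its context.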

The script I wrote this week (with his input) takes the whole Toolbox corpus (or a portion thereof) and reformats it so that I can read it with a typesetting program called LyX, a graphical front end for the popular but abstruse typesetting system LaTeX. I still have a bit of work to do, but basically it allows me to turn my corpus database of 90,000+ words into a nicely typeset PDF, with every example interlinearized.
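The reformatting step could look something like the sketch below. This is only an illustration: it assumes the LaTeX package gb4e and its `\gll` and `\glt` macros for aligned glosses, which may not be what the real script targets, and the function names and sample words are hypothetical.

```python
# Characters that LaTeX treats specially, with their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape one character at a time so nothing is escaped twice."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def to_gb4e(words, glosses, translation):
    """Format one example as a gb4e block (assumed target package)."""
    if len(words) != len(glosses):
        raise ValueError("each word needs exactly one gloss")
    return "\n".join([
        r"\begin{exe}",
        r"\ex",
        r"\gll " + " ".join(escape_latex(w) for w in words) + r"\\",
        " ".join(escape_latex(g) for g in glosses) + r"\\",
        r"\glt `" + escape_latex(translation) + "'",
        r"\end{exe}",
    ])


# Placeholder words and glosses, not real Pnar data.
example = to_gb4e(["ka", "foo"], ["G1", "G_2"], "placeholder translation")
print(example)
```

Looping a function like this over every record and writing the results into one file would give a document that LaTeX can turn into a PDF.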

After excluding about 2 hrs of data from my corpus because of parsing issues in Toolbox, my resulting PDF file was over 700 pages of just interlinearized examples, with no other formatting. I don’t think I’ll be including it all in the dissertation I plan to submit in August, but it’s amazing to have such a simple tool for outputting my data in a readable format.

I love technology…