Open Science, Data, and Linguistics


One of the concerns that has occupied my mind for the past few years is the question of data accessibility in the field of Linguistics. I am happy to announce that the data that underpins my grammatical description of Pnar is now freely available as a downloadable archive in audio and text form (anonymized where requested by participants). You can find the link to the dataset at the bottom of this post, but in the meantime I’d like to explain my views surrounding data access and give a brief explanation of the tool I’ve used to make my linguistic data accessible.

Why accessible data?

  Those who are familiar with linguistics understand that traditional descriptions of language are often based on recorded, transcribed and translated interviews and stories by speakers of the language. Although some theoretical work may be based on a few utterances or a single example, most linguistic work is based on many actual examples from utterances that real speakers produce.

One issue here is that there is so much between- and within-speaker variation in speech that, unless the data behind an analysis is actually accessible to other linguists, the veracity of that analysis can easily be questioned. In the interest of scientific enquiry, then, it is incumbent on the analyst-linguist to make their actual data accessible to other researchers in at least some form, whether in an archive or in a database. Having the data accessible to multiple researchers may lead to disagreements about analysis (there may be more than one way of analyzing a particular linguistic structure, for example), but ultimately such disagreements are healthy because they expand our knowledge.

Research verifiability/reproducibility

  This touches on a larger issue in the world of science, that of verifiability and reproducibility of research, which has galvanized the larger scientific community towards Open Science (see this blog post for an explanation, and check out this OSF paper), and in some fields such as Psychology, has actually resulted in a whole journal devoted to “replication studies”. These kinds of studies aim to replicate the results and findings of a particular study by following the same procedure as the original researchers. When replication studies uphold a particular result, it becomes more likely that the original study’s findings were not the result of a statistical anomaly or falsification of data, which is a very serious problem that can lead to erroneous claims requiring retraction.

What this means in the case of linguistic data is that the recordings, transcriptions, and translations that underlie a grammatical description or other study should, whenever possible, be made accessible to other linguists. Data sharing can be a touchy issue because of a) the ethical concerns of the providers of the data, b) potential cultural taboos, and c) the interests of the linguist who initially collected and processed the data.

With proper permissions sought and precautions taken, these concerns can be minimized or dealt with appropriately. A linguist needs to (minimally) communicate to participants how the data will be used, take the time to anonymize recordings and annotations when necessary, and create a license that constrains how the data can be used in the future. Ideally, if you are doing your research correctly, your university’s Institutional Review Board will have already helped you to think through these things. There are also some excellent books, papers and chapters that deal (at least somewhat) with this subject, and there is a set of standards for social science research (with human subjects) and specifically for linguistics that researchers should be aware of.

Some reasons linguists don’t share data

  The final point (C, the interests of the linguist) is really the sticking point for most people. The reality is that many linguists do not want to release data for several reasons:

  1. They haven’t had time to go through it themselves to their satisfaction.

      This is often the main reason data doesn’t get shared. It is common in fieldwork linguistics to collect many hours of recordings that a linguist never has time to annotate. In my case I recorded something like 11 hours of stories and conversations, but I was only able to transcribe and translate (and annotate) around 8 hours or so. The other 3 hours just didn’t get processed, and this is one of my future tasks - to sit with a speaker and spend the time transcribing and translating. Consider, for example, that 5 minutes of recorded speech often takes something like one hour of time to transcribe and annotate with the help of a native speaker. At that rate, my 3 unprocessed hours alone represent roughly 36 hours of work. This is very time-consuming and laborious work, which means that often there are recordings that remain with very little annotation and no transcription.

  2. They are worried that their analysis will be critiqued.

      Many linguists who do fieldwork are essentially apprentices, and are just starting to learn how to analyze linguistic structure on their own (i.e. PhD students). It can be extremely intimidating to know that you have lots of questions about how a language works, even after working on it for several years, and to know at the same time that people who have 20+ years of experience on multiple languages may be critiquing your data collection and analysis. I think this happens in many different disciplines, and it can be a barrier to making data public simply because of the personal fear that individuals can have.

  3. They are worried that their work will be ‘stolen’ or repackaged.

      The fact that this is even a concern is telling about both the field of linguistics and the way data is treated. In the field of linguistics, it is really incumbent on senior linguists to honor data ‘ownership’ or curation by the primary data collector, by citing data properly.

      Ownership is a bit of an issue to work out sometimes, as beginning field linguists are often paid or supported in their work by a supervisor’s grant. I think the best way is to follow a ‘time-spent’ principle. That is, the person who has spent the most time with the data has the largest share of ownership of the final form of the dataset. This is strictly regarding the annotation, transcription, and translation of the dataset (the speakers who speak on the recordings obviously have a different kind of ownership).

      Other kinds of ownership or use ought to be negotiated very early on by interested parties - in my case, for example, my supervisor and I agreed that the data I created would be available to him for research use, and that I would share it relatively freely with others for non-commercial use. Regarding data ownership, linguists can honor this by citing the source or providing proper attribution, but it may be the case that the data cited is not readily available. It may be printed in the back of a grammatical description (or some portions of it may be), but more often it is located in a collection of notebooks on the shelf of the linguist somewhere, gathering dust.

My Data

  This brings me back to the earlier ruminations that started this post, namely that data produced by a linguist and which underpins their work ought to be accessible to other scientists and linguists. When I first submitted my PhD at NTU, I took a look at some of the options for data archiving, and I approached the university library (which keeps digital copies of all theses submitted at the university) to see if they could also store my audio and transcription data (over 1GB). About a year ago, they contacted me to let me know that they were developing such a service, something called DataVerse, and wanted to know if they could use my dataset to test it. I was happy to have them do so, and after some tweaking and time, this tool is now available for use.

DataVerse is a database/archive tool developed at Harvard University that allows researchers to store datasets that other researchers can download for use and testing. It supports the Open Science initiative by making data accessible and open. It also solves one of the problems I noted above, that cited data is often not readily available, by creating a unique URL identifier and citation for the dataset. You can check out my dataset at its DOI here and download it for research and non-commercial purposes.

Further thoughts

As I was thinking about this previously, I realized that what I wanted was not really an archive but a database that would allow me to develop and annotate my data further. Unfortunately DataVerse is not that - it is basically just a storage tool. What is nice is that it provides versioning, so the curator of the dataset can upload and publish changes. I think I may have to create my own database if I want something that will let me explore the data better. But for now, the data is freely accessible for other linguists (even though my analysis isn’t perfect), which is a bit of a load off my mind.