I have been quite lax with posting here, mainly because I have been working very hard on a difficult problem. My current job involves (among other things) trying to automate a psychological coding system for text. This is pretty similar to research in Sentiment Analysis (SA, see this brief introduction), a task in Natural Language Processing (NLP) which attempts to predict the ‘positivity’ or ‘negativity’ of sentences (i.e. classification), such as tweets. Companies find this useful for (among other things) getting a broad understanding of how consumers respond to their products and services.
The coding system I am trying to automate is a bit more complex, but I am using techniques similar to those used in SA. I started out with SVMs and other classifiers from the SciKit-Learn library, and have now moved on to Neural Networks. These are essentially Machine Learning models that allow a computer to generalize patterns in data that correspond to particular outputs. Humans do this pretty naturally - we recognize patterns in language, for example, that allow us to parse sounds into words and words into units of meaning that, when combined, help us communicate.
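As a rough illustration of that starting point (not my actual pipeline - the texts and labels below are made up), a baseline classifier of this kind can be sketched in a few lines of SciKit-Learn:

```python
# Toy baseline sketch: TF-IDF features fed into a linear SVM.
# The example texts and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great service, will buy again", "terrible, never again",
         "really happy with this", "completely useless product"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(TfidfVectorizer(lowercase=True), LinearSVC())
model.fit(texts, labels)
print(model.predict(["very disappointed with the service"]))
```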
A data-driven approach
The idea with neural networks is that, given enough labeled data, the computer’s statistical model can capture the patterns that correspond to the labels and predict what the labels should be for data it has never seen before. This is essentially how optical character recognition (OCR) works - the computer has been fed enough data that it has ‘learned’ what an “A” character looks like across a wide variety of images. But the model can only generalize well if it has a large and diverse dataset.
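To make the idea a bit more concrete, here is a minimal Keras sketch of that principle: a tiny network is shown labeled examples and adjusts its weights so it can predict labels for inputs it has never seen. The shapes and random data are placeholders, not my actual model:

```python
# Minimal sketch of the data-driven idea: a small Keras network fits a
# mapping from (already vectorized) inputs to binary labels.
# The shapes and random data below are hypothetical placeholders.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(1000, 20)            # 1000 examples, 20 features each
y = np.random.randint(0, 2, size=1000)  # binary labels

model = Sequential([
    Dense(16, activation="relu", input_shape=(20,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)
```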
It is a common issue, especially in the field of Linguistics, that the amount of data available for a particular problem is limited (e.g. descriptions of languages are often based on a 5-hour recorded corpus supplemented with elicitation, and psycholinguistic experiments are usually conducted with sample sizes of 5-20 participants, even though larger samples would be preferable). This is partly because of how time-consuming data collection can be - combine that with the fact that you might be dealing with multiple languages and the issue is compounded. But even within a single language, particular problems have only limited datasets, and if we want to automate a system that has typically required trained humans (consider that PhD students who describe a language usually train/read for at least a year before fieldwork, after having completed an MA), there might be even less data than usual available for training, particularly data in the correct format.
Addressing problems in the neural network
This lack of data is a significant problem, and one we have been working to overcome by creating more coded data. In the meantime I have continued building ML models with the existing data, trying to achieve decent results on a small dataset, in the hope that once more coded data is available it can be incorporated to give better results. Along the way I have come across several important learning points that I thought I would write up here as a reference of sorts, to add to other great posts that have been helpful to me.
Check your code for bugs.
I am coding in Python simply because it has lots of great libraries for Machine Learning (Keras, TensorFlow, Theano), data transformation and storage (Pandas, NumPy), and language processing (NLTK, spaCy). Sometimes there are little bugs that give your model the wrong kind of data to generalize over, so it learns nothing. Garbage in, garbage out, unfortunately. Printing to the console at key points helps you identify these bugs.
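One low-tech habit that catches a surprising number of these bugs is printing sanity checks right before the data goes into the model. Something along these lines (the X_train/y_train names are just placeholders):

```python
# Hypothetical sanity checks: print shapes, label counts, and a sample
# example at key points so data bugs surface before training starts.
from collections import Counter
import numpy as np

def sanity_check(name, X, y):
    X, y = np.asarray(X), np.asarray(y)
    print("%s: X shape %s, y shape %s" % (name, X.shape, y.shape))
    print("%s: label counts %s" % (name, Counter(y.tolist())))
    print("%s: first example %r -> %r" % (name, X[0], y[0]))

# e.g. sanity_check("train", X_train, y_train) just before model.fit(...)
```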
Check your data for errors.
I am not a specialist in the coding system I am trying to automate, but I am a linguist, which helps me recognize when the input text is wrong. For the people labeling the data, this is probably the most time-consuming part of the process - checking each other’s work. With linguistic data, I have found that it helps to normalize the text, but not too much. Basic steps like lowercasing and removing punctuation can help, as can simple spelling correction, but part-of-speech tagging and lemmatizing or stemming may not help at all. Some of this has to do with the degree to which NLP tools can accurately lemmatize or stem words. For example, WordNet is commonly used for lemmatizing, but if the POS tag is incorrect it returns entirely the wrong word.
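As a small illustration of the lemmatizing issue, here is an NLTK sketch (assuming the WordNet data has been downloaded); depending on which POS tag ‘leaves’ is given, it comes back as a completely different word:

```python
# Sketch: light normalization plus WordNet lemmatization with NLTK.
# Requires nltk.download("wordnet") to have been run once.
import string
from nltk.stem import WordNetLemmatizer

def normalize(text):
    """Lowercase and strip punctuation - deliberately nothing fancier."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

lemmatizer = WordNetLemmatizer()
print(normalize("The leaves, falling!"))        # "the leaves falling"
print(lemmatizer.lemmatize("leaves", pos="n"))  # "leaf"  - noun reading
print(lemmatizer.lemmatize("leaves", pos="v"))  # "leave" - verb reading
```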
Balance your data.
This is probably the most important thing to do when pre-processing your data before feeding it into your model, especially when working with small datasets. Your data may be highly skewed toward one label. In my case, for one of the binary classification problems I’m working on, the ratio of zeros to ones in the data is roughly 10:1. If you are splitting your data into training and validation sets (which you should be!), the label proportions can change depending on how the split is set up. With a random split, your training data could end up with a label ratio of 15:1 while your validation data ends up at 6:1, which makes it much harder for the model to generalize to unseen data. Just fixing the proportions in the training and validation sets (and then shuffling each set independently) improved my correlations with unseen data by about 0.20 (Pearson’s r) on average.
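One standard way to keep the proportions consistent - not necessarily exactly what I did - is a stratified split, sketched here with SciKit-Learn on a made-up dataset with a 10:1 skew:

```python
# Sketch: a stratified train/validation split preserves the label ratio
# in both sets. The skewed toy data below is a hypothetical placeholder.
from collections import Counter
from sklearn.model_selection import train_test_split

texts  = ["example text %d" % i for i in range(110)]
labels = [0] * 100 + [1] * 10          # roughly 10:1 skew

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
print("train:", Counter(y_train))      # ratio preserved (~10:1)
print("val:  ", Counter(y_val))        # ratio preserved (~10:1)
```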
Predict on completely unseen data.
Many of the papers I have been reading (and I’ve been reading a lot!) report the accuracy of their model on validation data, which is usually a subset of the overall dataset (between 10% and 50%, depending on the problem). This is fine, but the validation data is typically used to tune the model from epoch to epoch (for early stopping, hyperparameter choices, and so on), so it still influences the final model. To my mind, 94% accuracy (which is what some papers report for their SA models) only tells you how well the model has learned from the dataset, not how it performs on real-world instances. Ideally, to test this you should keep a separate dataset out of training and validation altogether and then try to predict its labels with your model. Looking at the correlations between the predicted labels and the actual labels will then give you a much better idea of the model’s accuracy on unseen data.
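A minimal sketch of that workflow, assuming a SciKit-Learn-style model with fit()/predict() and using SciPy for the correlation (the function name and arguments here are hypothetical):

```python
# Sketch: hold out a test set before any training, then report Pearson's r
# between predicted and true labels on that completely unseen data.
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split

def evaluate_on_holdout(model, texts, labels, holdout_size=0.15):
    X_rest, X_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=holdout_size, stratify=labels, random_state=42
    )
    model.fit(X_rest, y_rest)          # training/validation never touch X_test
    predictions = model.predict(X_test)
    r, p_value = pearsonr(y_test, predictions)
    return r, p_value
```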
Final thoughts and further links
These are just a few of the things I have learned in the past few months while trying to sort out my classification problems. Papers and online forums have been extremely helpful in developing my understanding of the issues involved, and I have benefitted particularly from this blog and this one (among others), as well as the example models in the Keras GitHub repository. For a good (though brief) discussion of trying to implement state-of-the-art text classification models, see this post. Ultimately, as one person noted on a forum, developing neural networks is as much an art as a science, requiring experimentation and intuition to figure out how to apply a particular model architecture to solve a particular problem.