Some notes on LLMs in real-world contexts (Part 1)


Large language models (LLMs) seem to be all the rage these days, with the release of ChatGPT last year sparking a great deal of conversation and interest. There have been quite a few efforts aimed at getting LLMs to do human-like tasks, but the primary use cases all seem to revolve around manipulating language in the online space: automating transcription, translation, summarizing longer texts, generating copy, generating similar kinds of data, explaining code, and so on. Even though LLMs frequently produce factually wrong responses, as well as errors in transcription or code, most people who use them consistently and have some domain knowledge of the material being produced seem to be fine with this, which suggests that the tools do increase their productivity.

But what about tasks like generating a description of a scene based on keywords, or identifying particular kinds of imagery in text? These are two real-world applications I have worked on, and honestly, for these particular purposes, LLMs turned out to be more trouble than they're worth. Let me explain.

This blog post (Part 1) deals with the first scenario (text-to-text generation). If you’re interested in the second scenario (text classification), I’ll be writing about that in Part 2.

Descriptions from keywords

Last year when ChatGPT was first released I was approached by an author wondering if an LLM could speed up their workflow. “I love to come up with stories and plots and things, but I don’t like writing scene descriptions,” they told me. “Is there a way to develop a program that I could give keywords and it would write a scene description for me, but in my style?”

I answered that, yes, this was possible, but that the quality would vary depending on the base LLM model and the input (training) data that would be used for fine-tuning. After some back-and-forth we decided to test it out. Over the course of a couple days I came up with the following quick testing solution.

Model selection

I determined that since the author would be providing some of their written work to train the model, I would go with an open-source LLM (a model with a permissive license) as the base model, with training done locally to prevent leakage of copyrighted material. I settled on Flan-T5, which was the best permissively licensed text-to-text model at the time (and is still surprisingly good).

Dataset selection and processing

Since the author didn’t have a ton of material of their own to train the model with, I decided to use data from the public domain for the majority of the training. The go-to for this is Project Gutenberg, so I downloaded a portion of their public domain catalog. This was then supplemented by some of the author’s written material.
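Project Gutenberg texts ship with license boilerplate above and below the actual book, delimited by "*** START OF ..." and "*** END OF ..." markers, which has to be stripped before any further processing. A minimal sketch of that cleanup step (the function name is my own; the marker format is Gutenberg's convention):

```python
import re

def strip_gutenberg_boilerplate(raw: str) -> str:
    """Keep only the body text between Project Gutenberg's
    '*** START OF ...' and '*** END OF ...' markers."""
    start = re.search(r"\*\*\*\s*START OF.*?\*\*\*", raw)
    end = re.search(r"\*\*\*\s*END OF.*?\*\*\*", raw)
    lo = start.end() if start else 0          # fall back to whole file
    hi = end.start() if end else len(raw)     # if markers are missing
    return raw[lo:hi].strip()
```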

After filtering by genre, the next step was to extract descriptive paragraphs from each text to train the model on. This meant getting rid of any dialogue or other non-relevant text. There was also a significant number of character names that had to be masked or filtered out.
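One simple heuristic for this kind of filtering (a sketch, not the exact rules I used) is to split on blank lines and drop any paragraph that contains quotation marks or is too short to be a real description:

```python
# Straight and curly quotes both signal dialogue in Gutenberg texts.
QUOTE_CHARS = ('"', '\u201c', '\u201d')

def descriptive_paragraphs(text: str, min_words: int = 30) -> list[str]:
    """Keep paragraphs that are long enough and contain no dialogue.
    Name filtering (e.g. via NER) would be a separate pass."""
    paras = [p.strip() for p in text.split("\n\n")]
    return [
        p for p in paras
        if p
        and len(p.split()) >= min_words
        and not any(q in p for q in QUOTE_CHARS)
    ]
```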

Once I had processed the text, there was a final step before training: generating keywords! While I could have done this manually, it wasn't feasible given the size of the training dataset, so I settled for automated extraction instead, such as pulling noun phrases out of each paragraph with the spaCy library.
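To give a flavor of what automated keyword extraction looks like without pulling in a full NLP pipeline, here is a dependency-free frequency-based sketch (spaCy's noun-phrase extraction is more robust; the stopword list here is deliberately tiny and illustrative):

```python
import re
from collections import Counter

# Minimal illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "was",
             "were", "with", "to", "its", "it", "at", "through"}

def extract_keywords(paragraph: str, k: int = 5) -> list[str]:
    """Return the k most frequent content words of a paragraph."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    content = [w for w in words if w not in STOPWORDS and len(w) > 3]
    return [w for w, _ in Counter(content).most_common(k)]
```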

Training and results

Now that I had keywords and corresponding descriptions, it was time to train. I fine-tuned the model for roughly a day on an NVIDIA 2080 Ti GPU. Inputs were the keywords, outputs were the descriptions. At the end of 24 hours I had a model that would generate descriptions given a set of keywords. But how good were the descriptions?
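A training pair might look something like the following JSONL record (the prompt template and field names here are illustrative assumptions, not the exact format used; Flan-T5 is instruction-tuned, so phrasing the input as a natural-language request is a common choice):

```python
import json

# Hypothetical prompt template wrapping the extracted keywords.
PROMPT = "Write a scene description using these keywords: {keywords}"

def make_example(keywords: list[str], description: str) -> str:
    """Serialize one (keywords -> description) training pair as JSONL."""
    return json.dumps({
        "input": PROMPT.format(keywords=", ".join(keywords)),
        "target": description,
    })
```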

The proof is in the pudding: my author friend needed to test it out, and since they're not a programmer they needed a way to interface with the model. So I made a small package that would work on their computer, added some settings they could modify (temperature, context length, beams, etc.), and they gave it a whirl. The results were surprisingly good given the short training time and limited input data, and there was a clear stylistic similarity to the author. Unfortunately, as people have found with ChatGPT, they still weren't good enough: my author friend determined that the output would take more time to edit than it saved, and the text it produced was not very imaginative. It also took too long to generate descriptions on their computer.
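The user-modifiable settings in that little package were essentially a thin layer over the text-generation parameters. A sketch of the idea (the defaults and key names here are hypothetical, chosen to mirror the kind of keyword arguments a Hugging Face `generate()` call accepts):

```python
# Hypothetical defaults exposed to the author; in the real package these
# would be passed through to the model's generate() call.
DEFAULT_SETTINGS = {
    "max_new_tokens": 200,  # rough stand-in for "context length"
    "num_beams": 4,
    "do_sample": True,
    "temperature": 0.9,
}

def merge_settings(overrides: dict) -> dict:
    """Combine user overrides with the defaults, rejecting unknown keys
    so a typo in the settings file fails loudly instead of silently."""
    unknown = set(overrides) - set(DEFAULT_SETTINGS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    return {**DEFAULT_SETTINGS, **overrides}
```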


Does this mean that building a keyword-based description generator is not feasible? Not at all. But it does mean the use case should be carefully considered. If you are generating product descriptions it might be fine, but writing as a creative endeavor can't really be replicated by LLMs: training for longer will not make the model produce novel content, since it can only generate something similar to what it has seen in its training data, not create something new. Even generating product descriptions from keywords would require some manual editing, and then how much time and effort is really saved? You'd have to determine this on a case-by-case basis.

This is a relatively straightforward task, with a pretty clear mapping between two different forms. And it doesn’t work as well as we might like. But what about a more complex task, where people are trained to recognize imagery pretty consistently, and you want the computer to do the same? Stay tuned for more discussion on that in Part 2.