Large language models have made impressive progress recently. After the initial excitement around ChatGPT, the open-source community has worked on new models and methods for fine-tuning. The good news is that it has become surprisingly easy to fine-tune models yourself, even on platforms like Colab.

At work, I've had some experience fine-tuning models for specific tasks, which got me thinking: why not try it with chat data? Conveniently, WhatsApp lets you export your chat history. So I exported seven years' worth of chat history with my girlfriend Lisa and trained a model to imitate our conversations. It reminded me a bit of the Black Mirror episode “Be Right Back”.

Spoiler: It worked ok, and I didn't have to put in too much effort. But, of course, I did encounter a few interesting challenges along the way that made the project even more intriguing.

One disclaimer: I am not an expert on chatbots, and building them properly is quite complex. The goal of this small experiment is to use open-source LLMs, fine-tune them a bit, and see what is possible.

How to build it?

You can find the notebook with the code here.

Data

You can export chat histories from WhatsApp quite easily. The export contains one row per message, with a timestamp and username. I was not interested in writing a parser myself, so I relied on whatstk.
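Loading the export then becomes a one-liner. A minimal sketch, assuming whatstk's df_from_whatsapp helper and a file exported as chat.txt:

```python
from whatstk import df_from_whatsapp

# Parse the raw WhatsApp export into a DataFrame with
# one row per message (date, username, message).
df = df_from_whatsapp(filepath="chat.txt")
print(df.head())
```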

Overall, the export included 28k messages, 26k after filtering out files, GIFs and stickers. I then grouped the individual messages into conversations: every message written within one hour of the previous one belongs to the same conversation, as sketched below. I also filtered out one-message conversations. In the end, I had around 3.5k conversations. A typical conversation looks like this:

“Daniel: Hey, when are you coming home? | Lisa: Around 6 pm | Daniel: Shall we go to the Italian restaurant | Lisa: Yes.”
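The grouping itself is plain pandas. A sketch of the one-hour rule, assuming the date, username and message columns from the whatstk export:

```python
import pandas as pd

# A message starts a new conversation if the gap
# to the previous message exceeds one hour.
df = df.sort_values("date")
new_conversation = df["date"].diff() > pd.Timedelta(hours=1)
df["conversation_id"] = new_conversation.cumsum()

# Render each conversation as "Name: message | Name: message | ..."
conversations = df.groupby("conversation_id").apply(
    lambda g: " | ".join(f"{u}: {m}" for u, m in zip(g["username"], g["message"]))
)

# Drop one-message conversations.
sizes = df.groupby("conversation_id").size()
conversations = conversations[sizes > 1]
```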

Model

As a base model, I chose “malteos/bloom-6b4-clp-german” from Hugging Face. It is a BLOOM model with 6.4b parameters, adapted to German. I was surprised that smaller open-source models for German are rare, so I was even more grateful for this one.

I fine-tuned the model using QLoRA, a combination of Low-Rank Adaptation (LoRA) and 4-bit quantization. LoRA is a technique that freezes the base model and adds a small number of additional parameters, the so-called adapter, which approximates the weight updates via low-rank matrix factorisation. You do a regular forward pass through the model and adapter; during the backward pass, only the adapter parameters are updated. Since they usually make up only a low one-digit percentage of the total, you save tons of memory for gradients and optimiser states. Additionally, 4-bit quantization reduces the precision of the frozen parameters, which shrinks the memory footprint further. In the end, you can fine-tune models with up to 10-20b parameters on Google Colab (using a 16GB T4 GPU).
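In code, the setup boils down to a 4-bit model load plus a LoRA config on top. A minimal sketch with transformers, bitsandbytes and peft; the rank, alpha and dropout values here are illustrative, not necessarily the ones I used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "malteos/bloom-6b4-clp-german"

# Load the frozen base model in 4-bit precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute fits a T4
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach the trainable low-rank adapter; only these weights get gradients.
lora_config = LoraConfig(
    r=8,                                  # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # BLOOM's attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a low one-digit percentage
```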

I was impatient, so I trained for just one epoch, which took two hours. Compared to the original tutorial, I added a callback that periodically generates sample predictions. When working with language models, I find it hard to trust loss metrics alone; I want more qualitative feedback from printing intermediate results.
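With the transformers Trainer, this is a small custom callback. A sketch of the idea; the class name, prompt and interval are placeholders:

```python
from transformers import TrainerCallback

class SamplePredictionCallback(TrainerCallback):
    """Print a sample continuation every few hundred steps as a qualitative check."""

    def __init__(self, tokenizer, prompt, every_n_steps=200):
        self.tokenizer = tokenizer
        self.prompt = prompt
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, model=None, **kwargs):
        if model is not None and state.global_step % self.every_n_steps == 0:
            inputs = self.tokenizer(self.prompt, return_tensors="pt").to(model.device)
            output = model.generate(**inputs, max_new_tokens=40, do_sample=True)
            print(self.tokenizer.decode(output[0], skip_special_tokens=True))
```

You register it with trainer.add_callback(SamplePredictionCallback(tokenizer, prompt)) before calling trainer.train().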

Chatbot

The model is trained to predict how the input conversation continues. However, chats between people follow a flexible schema: you don't always have a single-question, single-answer structure, and often someone writes multiple messages without a reply in between.

To avoid those complications, the chatbot works as follows:

  1. Provide the input: “Person1: How is it going? | Person2:”

  2. Get the predicted conversation: “Person1: How is it going? | Person2: Going well! | Person2: How about you? | Person1: All fine, thank you”

  3. Cut 2) after the first answer: “Person1: How is it going? | Person2: Going well!”

  4. Provide the next input: “Person1: How is it going? | Person2: Going well! | Person1: Do you have nice weather? | Person2:”

So basically, you predict the full conversation but throw away everything except the next answer. The history of messages is always fed as input to keep the context.
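Here is a minimal sketch of that loop; the function name, separator handling and sampling parameters are my assumptions, not the notebook's exact code:

```python
import torch

def chat_reply(model, tokenizer, history, speaker):
    """Generate the conversation continuation and keep only the first message."""
    prompt = f"{history} | {speaker}:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.9)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    continuation = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Cut the predicted conversation at the next "|" separator (step 3 above).
    reply = continuation.split("|")[0].strip()
    return f"{prompt} {reply}"

# The history grows with every turn, so the context is preserved:
history = chat_reply(model, tokenizer, "Person1: How is it going?", speaker="Person2")
history = chat_reply(model, tokenizer, history + " | Person1: Do you have nice weather?", speaker="Person2")
```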

Another good thing about this approach is that you can switch Person1 and Person2 and therefore simulate a chat from both perspectives.

Results

Overall, the results were surprisingly good. The model picks up the tone of the conversation and responds in the same form. It also gives answers that make sense: if you send a smiley, you get a smiley back. The answers are logical, often correct, and keep context over multiple messages. Here are a few examples (in German, with English translations in parentheses):

Lisa: 😘| Daniel: 😘😘😘| Lisa: Hast du Hunger?| Daniel: Ja, aber ich hab noch nicht gegessen.| Lisa: Wollen wir Pizza essen?| Daniel: Ja

(Lisa: 😘 | Daniel: 😘😘😘 | Lisa: Are you hungry? | Daniel: Yes, but I haven't eaten yet. | Lisa: Shall we get pizza? | Daniel: Yes)

Daniel: Müssen wir noch was einkaufen?| Lisa: Ja, aber ich hab noch nicht alles| Daniel: Was brauchen wir denn noch?| Lisa: Brot, Käse, Wurst, Milch, Butter, Marmelade, Nutella, Müsli, Obst

(Daniel: Do we still need to buy anything? | Lisa: Yes, but I don't have everything yet | Daniel: What do we still need? | Lisa: Bread, cheese, cold cuts, milk, butter, jam, Nutella, muesli, fruit)

Daniel: Wollen wir gleich spazieren gehen? Das Wetter ist ziemlich gut.| Lisa: Ja, das ist gut.| Daniel: Glaubst du es regnet noch?| Lisa: Nee, ich glaube nicht

(Daniel: Shall we go for a walk in a bit? The weather is pretty good. | Lisa: Yes, that's good. | Daniel: Do you think it's still going to rain? | Lisa: Nah, I don't think so)

The bot can even answer with information that was not contained in the fine-tuning data. For example:

Lisa: Was ist die Hauptstadt von Japan?| Daniel: Tokyo| Lisa: Woher weißt du das?| Daniel: Google

(Lisa: What is the capital of Japan? | Daniel: Tokyo | Lisa: How do you know that? | Daniel: Google)

Challenges

Even though the initial results are promising, I encountered a few challenges:

Variance in Data

When I asked the bot about simple things, like my favourite movie or my birthplace, it did not give a correct answer. This was the case for almost every factual question I posed. That makes sense: we never talked about those things in the chat, so there is nothing about them in the data. However, I was surprised at how narrow our communication on WhatsApp is. Our conversations often revolve around coordinating joint activities or our daily lives. So you would need another data source to teach the bot more facts about yourself.

Chronological order

The bot has no concept of past and present. It once said I would be late because I had to study for university; it doesn't know that I finished my degree a long time ago and no longer study.

Too little knowledge of the world

Sometimes the bot suggests stupid things. For example, Lisa asked it to go skydiving without a parachute, and the bot agreed:

Lisa: Wollen wir am Wochenende Fallschirmspringen?| Daniel: Jo, das wäre cool.| Lisa: Ohne Fallschirm ok?| Daniel: Ja, ohne

(Lisa: Shall we go skydiving this weekend? | Daniel: Yeah, that would be cool. | Lisa: Without a parachute, ok? | Daniel: Yes, without)

I suspect this is an issue with the base model: I would have expected such common-sense knowledge to be encoded there already.

Additional Ideas

Pre-train general conversation model

A good way to get more common sense into the model could be a more general fine-tuning on a larger dataset of chats. That way, the bot could learn how human conversations work in general, across a much wider range of settings, and also pick up typical ground rules.

Train with multiple chats

You could also train the model on multiple chats with different people. That way, the model learns representations that can be combined in new ways: for example, if I added my chats with both Lisa and Peter, I could also simulate a chat between Lisa and Peter.