I broke Meta’s Llama 3.1 405B with a question (which GPT-4o mini answered correctly)
Meta last week announced the company's largest language model to date, Llama 3.1 405B, which the company claims is the first "frontier model" in open source, meaning a model that can compete with the best that closed source can offer, such as OpenAI's GPT-4 and Google's Gemini 1.5.
It turns out that Llama 3.1 can be broken just as easily as those models, if not more easily. Just as I broke Gemini 1.5 with a language-translation query when it first came out, I got Llama 3.1 to descend into gibberish with my very first question.
Also: Beware AI ‘model collapse’: How training on synthetic data pollutes the next generation
The question that tripped up Google's Gemini is such a good example of a simple failure case that it has become the first question I put to any new large language model. Sure enough, it broke Meta's Llama 3.1 405B on the first try.
You could call it an edge case: a question about the Georgian verb "ყოფნა", which means "to be". Except that Georgia, located in the Caucasus region between the Black Sea and the Caspian Sea, is home to nearly four million Georgian speakers.
Mangling the most important verb in a language spoken by four million people seems like more than an edge case.
In any event, I put my question to Llama 3.1 405B in the following form:
What is the conjugation of the verb ყოფნა in Georgian?
Also: I got Google’s Gemini 1.5 Pro to fail on the first try
I posted the question both on Meta's Meta AI website, where you can use Llama 3.1 405B for free, and on Hugging Face's HuggingChat, where you can build chatbots from any open-source AI model that has a public code repository.
I also tried the query on Groq, a commercial third-party chatbot host. In every case, the answer was nonsense.
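For anyone who wants to reproduce the test programmatically rather than through those web interfaces, here is a minimal sketch using Hugging Face's Python client. The model ID, the token placeholder, and the assumption that a hosted endpoint will actually serve a 405B model are mine for illustration, not details from Meta or Hugging Face.

    # Minimal sketch: send the same Georgian-conjugation prompt to Llama 3.1
    # 405B through Hugging Face's inference client. Assumes you have a valid
    # Hugging Face access token and that an endpoint serves this model ID.
    from huggingface_hub import InferenceClient

    client = InferenceClient(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # assumed model ID
        token="hf_...",  # your Hugging Face access token (placeholder)
    )

    response = client.chat_completion(
        messages=[{
            "role": "user",
            "content": "What is the conjugation of the verb ყოფნა in Georgian?",
        }],
        max_tokens=512,
    )
    print(response.choices[0].message.content)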
First, here is the correct answer, from OpenAI’s GPT-4o mini:
(Most other LLMs and chatbots, including Google’s Gemini, now answer this question correctly.)
At first, the Meta AI site demurred, saying that ყოფნა was too complicated. When I insisted, it produced a set of absurd invented words. Here is Llama 3.1 405B's response:
As you can see by comparing it with the correct answer above, Llama 3.1's answer is not even close.
The HuggingChat and Groq versions didn't even demur; they went straight to similarly nonsensical answers. HuggingChat's response contained a different set of nonsense words from the ones given by the Meta AI site:
Llama 3.1's complete failure on a foreign-language question is particularly frustrating given that Meta's researchers dwell at length in their technical paper on how Llama 3.1 improves on the previous version in being "multilingual", meaning it supports multiple languages other than English.
The authors gathered extensive human feedback on the model's language responses. "We collected high-quality, manually annotated data from linguists and native speakers," they write. "These annotations consisted primarily of open-ended prompts that represented real-world use cases."
Also: 3 Ways Meta’s Llama 3.1 Is a Step Forward for Gen AI
There are some interesting clues in the error case that hint at what is going on with Llama 3.1 405B. The invented first-person form, "ვაყოფ," certainly looks like a valid Georgian word, even to my non-native eye: "ვ-" is a common prefix for the first-person conjugation, and "-ოფ" is a valid Georgian suffix.
So it's possible the model is over-generalizing: reaching for a quick answer by applying patterns that hold across much of the language but that fail when over-applied without regard for exceptions, as the toy sketch below illustrates.
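To make that concrete, here is a toy sketch in Python, bearing no resemblance to Meta's actual systems; the function names and the crude stemming rule are invented for illustration. It shows how a rule that works for many regular Georgian verbs produces plausible-looking nonsense for an irregular verb like ყოფნა.

    # Toy illustration (not Meta's code): an over-applied first-person rule.
    # The Georgian verb "to be" is irregular; its first-person present form
    # is "ვარ", which no regular pattern derives from the dictionary form.
    IRREGULAR_FIRST_PERSON = {"ყოფნა": "ვარ"}

    def naive_first_person(masdar: str) -> str:
        """Over-generalized rule: strip the verbal-noun ending, then
        prepend the regular first-person prefix 'ვ-'."""
        stem = masdar[:-2] if masdar.endswith("ნა") else masdar[:-1]
        return "ვ" + stem  # pattern-plausible, wrong for irregular verbs

    def first_person(masdar: str) -> str:
        # A correct system checks the exception list before the pattern.
        return IRREGULAR_FIRST_PERSON.get(masdar, naive_first_person(masdar))

    print(naive_first_person("ყოფნა"))  # "ვყოფ" -- looks Georgian, is not
    print(first_person("ყოფნა"))        # "ვარ" -- the correct form

The naive function's output, "ვყოფ," is the same flavor of mistake as the model's "ვაყოფ": morphologically plausible, and entirely made up.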
Interestingly, Llama 3.1 405B's answer can change across repeated attempts. When the question is retried, for example, the model outputs a valid conjugation table for the present tense:
But in the future tense, its pattern is almost correct, yet not quite: it fails to add the first-person prefix ვ- to the first conjugation in the table:
Also interesting is that Llama 3.1 405B's smaller sibling, the 70B model, actually got the present tense correct on the first try. That suggests all the extra training and computing power spent on the larger 405B can, at least in some cases, actually degrade the results.
I think Meta's engineers need to take a close look at such edge cases and error cases and determine whether their software is over-generalizing.
Note that the researchers made extensive use of synthetic data to "tune" the model, supplementing the human feedback they collected. It remains an open question whether synthetic data used at such scale contributes to over-generalization, as an article last week in the journal Nature suggests.