Worse than Useless?

python llm chatgpt nlp

On 1 February, Allen AI released OLMo, a fully open-source large language model. This is a terribly important development because not only the model structure but also the training data has been made available. For people working with OLMo, there is some hope of being able to look at its internal workings and understand exactly what it is doing. This is impossible with state-of-the-art closed-source models such as (at the time of writing) GPT-4.

Since OLMo is new, I wanted to test it out. I recently attended a seminar on the paper Benchmarking Open-Source Large Language Models, GPT-4 and Claude 2 on Multiple-Choice Questions in Nephrology, which applied a collection of seven large language models to a set of multiple-choice questions about the treatment of kidney diseases. The most accurate model was GPT-4, which reportedly answered 73% of questions correctly. This is impressive because each question has 4 or 5 possible answers, which means that a strategy of guessing at random would be expected to achieve only 23.8%.

However, what I found even more interesting was that several models achieved an accuracy which was statistically lower than 23.8%. For example, the Falcon model answered 155 out of 858 questions correctly, yielding a 95% confidence interval of (15.5, 20.7) for its expected percentage of correct answers. This means that the model is worse than random guessing. But how can that be possible? A model that is doing worse than guessing at random must be encoding some knowledge! Indeed, if you consider a series of questions with yes/no answers, and someone gives you a model that scores, say, 40%, then you can automatically get 60% of questions correct by simply guessing the opposite of whatever the model says. This did not quite make sense to me, so I decided to investigate further.
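For those who want to check the arithmetic, here is a quick sketch of the calculation. I am using the standard normal approximation for a binomial confidence interval; the paper may well have used a different method (Wilson, exact), but the numbers come out close to the quoted ones. Incidentally, the 23.8% baseline is consistent with roughly three quarters of the questions having four options and the remainder five.

import math

def confint(correct, total, z=1.96):
    # 95% normal-approximation confidence interval for a binomial proportion,
    # returned as percentages
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return 100 * (p - half_width), 100 * (p + half_width)

print(confint(155, 858))  # Falcon: roughly (15.5, 20.6)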

Both OLMo and the Nephrology data set are available on Hugging Face, and I was able to get OLMo working using the following code. Note that the first time you initialise the model, it will download 27GB of data, which takes some time on my laptop. Without the low_cpu_mem_usage=True option, I was unable to run the code.

import hf_olmo  # registers the OLMo model and tokenizer classes with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# this takes a long time the first time (it downloads about 27GB of weights)
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B", low_cpu_mem_usage=True)

# this is quick (low_cpu_mem_usage only matters for the model, not the tokenizer)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B")

There are no restrictions on what text you can use as a prompt for OLMo. For example, the following prompt will not be accepted by ChatGPT, but OLMo is fine with it.

message = ["Some of the best and most painless ways to commit suicide are "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
response = olmo.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95)
# decode the generated token IDs back into text
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

Some of the best and most painless ways to commit suicide are
• Pools
• Swimming pools
• Paddling Pools
• Small, small, small
Ponds
A lot of people commit suicide because they are either mentally, physically or emotionally, or all of them,

Generating the response takes several minutes, and the more tokens you generate, the longer it takes. I noticed that OLMo likes to include a lot of whitespace tokens in its answers. (I do not recommend following its advice; this is just an example of generating an answer to a taboo question. Coincidentally, I recently happened to read the autobiography of the poet Langston Hughes, in which he tells a story of a woman drowning herself in a shallow pool.)

Because OLMo took a long time to generate its responses, I tested OLMo, GPT-3.5 and GPT-4 on a set of 96 questions randomly sampled from the 858 questions in the test set. I used prompts such as the following.

A 67-year-old woman is seen during a routine follow-up visit for hypertension. On your review of her records for the past 4 years, you notice that her office BPs have shown significant fluctuations. Although her typical values are in the 130–142/78–84 mmHg range, measurements as high as 178/92 mmHg and as low as 110/66 mmHg have been recorded… Question: Which ONE of the following statements is CORRECT with regard to her long-term BP variability?. The answer is one of the following
A. Increased long-term BP variability is associated with an increased risk of cardiovascular events despite adequate BP control on most visits
B. Increased long-term BP variability is associated with increased cardiovascular risk only in the presence of consistently elevated BP
C. Beta-blockers decrease long-term BP variability
D. Long-term BP variability has no relationship to cardiovascular events. Do not reply with a complete sentence and only give the answer as one of A, B, C, D or E.
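For the GPT models, something along the following lines is enough to send a prompt of this form through the OpenAI Python client and read off the single-letter reply. This is a minimal sketch rather than the exact harness I ran; the model name, temperature and the prompt_text placeholder are assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_gpt(prompt, model="gpt-3.5-turbo"):
    # send one multiple-choice question and return the raw reply text
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(ask_gpt(prompt_text))  # prompt_text is a placeholder for a question in the format above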

Both GPT-3.5 and GPT-4 had no problems with these prompts, generating a single letter as the answer in almost all cases. GPT-3.5 achieved a score with a 95% confidence interval of (26.4, 37.4), which gives it a slight statistical edge over random guessing. GPT-4 achieved (53.4, 66.6), which is very impressive, although strictly worse than the performance claimed in the paper. Unfortunately, the paper did not include the exact prompts used, which makes its results very hard to replicate.

OLMo gave an answer in only 19 cases, of which 2 were correct. In the other cases, it repeated parts of the question, gave multiple answers, answered with a URL (5 cases) or, in one case, replied “I’m a bit confused.” By my calculation, 2 out of 19 gives a confidence interval of (0.03, 18.3), which is indeed worse than random guessing.

However, I realised that this test is quite unfair, because OLMo clearly cannot deal correctly with the above prompt. Instead, I changed to a prompt of this form:

A 67-year-old woman is seen during a routine follow-up visit for hypertension. On your review of her records for the past 4 years, you notice that her office BPs have shown significant fluctuations. Although her typical values are in the 130–142/78–84 mmHg range, measurements as high as 178/92 mmHg and as low as 110/66 mmHg have been recorded… Question: Which ONE of the following statements is CORRECT with regard to her long-term BP variability?. The answer is one of the following
A. Increased long-term BP variability is associated with an increased risk of cardiovascular events despite adequate BP control on most visits
B. Increased long-term BP variability is associated with increased cardiovascular risk only in the presence of consistently elevated BP
C. Beta-blockers decrease long-term BP variability
D. Long-term BP variability has no relationship to cardiovascular events. Answer with one of A, B, C, D or E.
The correct answer is:

This made a huge difference. Now OLMo gives a sensible answer to almost all questions, achieving a 95% confidence interval of (19.8, 29.1), which is statistically compatible with random guessing. Basically, the model “knows” that the next character should be a single letter, preferably one of A–E, and it’s picking one more or less at random.
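For reference, scoring OLMo with the completion-style prompt can be done roughly as follows: generate a handful of tokens after “The correct answer is:” and take the first letter A–E that appears. This is a sketch rather than my exact code; question_text is a placeholder for the question and its options, and greedy decoding is used here for simplicity.

import re

prompt = question_text + "\nThe correct answer is:"  # question_text is a placeholder
inputs = tokenizer([prompt], return_tensors='pt', return_token_type_ids=False)
output = olmo.generate(**inputs, max_new_tokens=5, do_sample=False)

# keep only the newly generated tokens and decode them
new_tokens = output[0, inputs["input_ids"].shape[1]:]
continuation = tokenizer.decode(new_tokens, skip_special_tokens=True)

# the answer is the first A-E in the continuation, if there is one
match = re.search(r"[A-E]", continuation)
print(match.group(0) if match else "no answer")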

I learned a few things from this experiment. One is that a tiny change in the prompt can make a huge difference. Another is that research on large language models is basically valueless if the authors don’t include the prompts they used: without them the results are not reproducible, because minor tweaks to the prompts can make a huge difference to the performance of a particular model. It’s also not really fair to use GPT-style prompts for an LLM like OLMo.

Also, I strongly suspect that, for models such as Falcon whose performance was lower than expected from random guessing, what actually happened was that nonsense responses were counted as incorrect answers. In other words, when the model responded with something like “I don’t know”, this was counted as a wrong answer, whereas it should have been omitted from the count of correct and incorrect answers altogether. I can see how this could easily lead to a model being evaluated as worse than random guessing when in fact it wasn’t (and, indeed, it’s not really plausible that it could be).
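To see how that could push a model below the random-guessing baseline, here is a toy calculation with made-up numbers (the 650 is invented; only the 858 questions and the 23.8% baseline come from the paper).

# Hypothetical: the model produces a parseable A-E answer on only 650 of the
# 858 questions, and on those it guesses at chance level (23.8%).
answered, chance, total = 650, 0.238, 858
correct = answered * chance       # about 155 correct answers

print(correct / total)            # ~0.18 -- looks significantly worse than guessing
print(correct / answered)         # 0.238 -- on the questions it answered, it is just guessing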

Regardless of whether it’s any good at answering questions about kidneys, OLMo is incredibly valuable. Without open-source LLMs, we are in danger of falling into a state of technofeudalism, where there are only a few models which are head and shoulders above all the others, and those models are in the hands of one or two companies who can charge unlimited rents to anybody who wants to use them, forever. I don’t really want that to happen, so I think it’s really important to support initiatives like this! People have already succeeded in building open-source competitors to other successful AI models, such as Leela Chess Zero, so surely there is some hope?
