Aussie study casts doubt on OpenAI benchmarks


AI-generated papers performed below the student average in an Australian law exam late last year, casting doubt on claims by one of the technology’s commercial pioneers, new research has found.

An earlier study, conducted in 2023, found OpenAI’s GPT-4 scored higher than 90 per cent of human test takers on a simulated version of the all-important bar exam in the United States.

The benchmark is one of only two peddled on the company’s website to distinguish its more powerful large language model from GPT-3.5. It is also highlighted in the GPT-4 technical report.


In the new study, published in the Journal of Law, Innovation and Technology on Tuesday, University of Wollongong law expert Dr Armin Alimardani set out to test the claim and benchmark GenAI capabilities in an Australian context.

Dr Alimardani, who is also a consultant at OpenAI, evaluated the performance of OpenAI’s GPT-3.5 and GPT-4 (which has since been superseded by GPT-4o), and Bard, the predecessor to Google’s Gemini, on a final Criminal Law exam in Spring 2023.

The models were used to generate 10 “distinct answers” to the exam question.

Half of the AI-generated papers sought to simulate a student using the “minimum effort for prompt engineering” by providing no instructions to the models beyond the exam question.

A further five papers were generated with “specific instructions on how to formulate a response”, with the models also supplied with transcriptions of lectures and other materials such as slides.
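The published article does not reproduce the study’s prompts or code, but the two conditions are simple to picture. Below is a minimal sketch of what each might look like against the current OpenAI Python SDK; the model name, system instructions, and input files are illustrative assumptions, not the study’s actual setup.

```python
# Sketch of the study's two prompting conditions (assumed setup; the
# article does not publish the actual prompts, code, or model versions).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical input files: the exam question and the course materials
# (lecture transcripts, slides) mentioned in the study.
exam_question = open("exam_question.txt").read()
course_materials = open("course_materials.txt").read()

# Condition 1: "minimum effort" -- the exam question alone, with no other
# instructions, mirroring a student pasting the question into a chatbot.
minimal = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": exam_question}],
)

# Condition 2: prompt-engineered -- explicit instructions on how to
# formulate a response, plus course materials supplied as context.
# This system prompt is illustrative only.
engineered = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You are sitting a university Criminal Law exam. Structure "
                "your answer as issue, rule, application, conclusion, and "
                "rely only on the supplied course materials."
            ),
        },
        {
            "role": "user",
            "content": f"Course materials:\n{course_materials}\n\n"
                       f"Exam question:\n{exam_question}",
        },
    ],
)

print(minimal.choices[0].message.content)
print(engineered.choices[0].message.content)
```

Because the chat completions endpoint samples stochastically by default, repeated runs of the same prompt will produce the kind of variation between papers from the same model that the study went on to observe.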

Dr Alimardani and his research assistant hand-wrote the responses into exam booklets, using fake student names and numbers so the AI-generated papers could later be identified, before mixing them in with real papers and handing them to tutors for grading.

Only two of the five papers generated without special prompting techniques passed, with the lowest fail mark (17 out of 60) given to a paper generated by Bard.

Of the four papers generated by the OpenAI models, two failed (one each for GPT-3.5 and GPT-4), which shows that the “same GenAI engine with the same prompts can yield considerably different outputs”.

“The best performing paper was only better than 14.7 per cent of the students. So, this small sample suggests that if the students simply copied the exam question into one of the OpenAI models, they would have a 50 per cent chance of passing,” he said.

While the papers generated with prompt engineering tricks performed better, the average student score (66.2 per cent) was still higher than the grades of all but two AI-generated papers.

“Three of the papers weren’t that impressive but two did quite well. One of the papers scored 44 (about 73 per cent) and the other scored 47 (about 78 per cent),” Dr Alimardani said.

“Overall, these results don’t quite match the glowing benchmarks from OpenAI’s US bar exam simulation and none of the 10 AI papers performed better than 90 per cent of the students.”

All AI papers did, however, perform better than students in “open-ended question and essay writing tasks”, although the study also acknowledges that “this advantage might not hold in the complexity of real-world practice”.

Dr Alimardani said the study shows that the “reliability of benchmarks may be questionable and the way [companies] evaluate models could differ significantly from how we evaluate students”.

The study also found that none of the five tutors suspected any papers were generated by AI, contrary to “some of the previous studies on humans’ ability to identify AI-generated text”.

Detection was likely complicated by the fact that “none of the 10 AI papers hallucinated legal principles”; three of the tutors also admitted they would have been unlikely to identify the papers as AI-generated even if the exams had not been handwritten.

“The AI-generated answers weren’t as comprehensive as we expected. It seemed to me that the models were fine-tuned to avoid hallucination by playing it safe and providing less detailed answers,” Dr Alimardani said.

Do you know more? Contact James Riley via Email.
