Australian government trial finds AI is much worse than humans at summarizing
As large language models have continued to rise in prominence, many users and companies have focused on their useful ability to quickly summarize lengthy documents for easier human consumption. When the Australian Securities and Investments Commission (ASIC) looked into this potential use case, though, it found that summaries produced by the Llama2-70B model were judged significantly worse than those written by humans.
ASIC’s proof-of-concept study (PDF)—which was run in January and February, written up in March, and published in response to a Senate inquiry in May—has a number of limitations that make it hard to generalize about the summarizing capabilities of today’s state-of-the-art LLMs. Still, the government study highlights many of the potential pitfalls large organizations should consider before simply inserting LLM outputs into existing workflows.
Keeping score
For its study, ASIC teamed up with Amazon Web Services to evaluate LLMs’ ability to summarize “a sample of public submissions made to an external Parliamentary Joint Committee inquiry, looking into audit and consultancy firms.” For ASIC’s purposes, a good summary of one of these submissions would highlight any mention of ASIC, any recommendations for avoiding conflicts of interest, and any calls for more regulation, all with references to page numbers and “brief context” for explanation.