Language agents achieve superhuman synthesis of scientific knowledge

(arxiv.org)

Comments

tantalor 5 hours ago
I have an idea how to test whether AI can be a good scientist:

Train on all published scientific knowledge and observations up to certain point, before a breakthrough occurred. Then see if your AI can generate the breakthrough on its own.

For example, prior to 1900 quantum theory did not exist. Given what we knew then, could AI reproduce the ideas of Planck, Einstein, Bohr etc?

If not, then AI will never be useful for generating scientific theory.
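The proposal above is essentially a temporal holdout evaluation: split the literature at a cutoff date, train only on the earlier portion, and check whether the later breakthroughs can be rediscovered. A minimal sketch of that split, with hypothetical corpus entries and field names (nothing here comes from the paper):

```python
from datetime import date

# Hypothetical corpus: (title, publication_date) pairs for illustration only.
corpus = [
    ("On the Law of Distribution of Energy in the Normal Spectrum", date(1901, 1, 1)),
    ("On a Heuristic Point of View Concerning the Production of Light", date(1905, 6, 9)),
    ("Experiments on the Contact of Water with Hot Metal", date(1897, 3, 1)),
]

cutoff = date(1900, 1, 1)

# Everything before the cutoff is visible to the model during training;
# everything at or after it is held out as the "breakthrough" target.
train = [(title, d) for title, d in corpus if d < cutoff]
held_out = [(title, d) for title, d in corpus if d >= cutoff]
```

The hard part in practice is leakage: later textbooks and reviews restate the breakthrough, so the pre-cutoff corpus has to be curated carefully, not just date-filtered.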

therobot24 6 hours ago
I only gave the paper a quick read, but I couldn't find how many humans they used to generate their expected human performance. This seems to be the main content:

> To ensure that we did not overfit PaperQA2 to achieve high performance on LitQA2, we generated a new set of 101 LitQA2 questions after making most of the engineering changes to PaperQA2. The accuracy of PaperQA2 on the original set of 147 questions did not differ significantly from its accuracy on the latter set of 101 questions, indicating that our optimizations in the first stage generalized well to new and unseen LitQA2 questions (Table 2).

> To compare PaperQA2 performance to human performance on the same task, human annotators who either possessed a PhD in biology or a related science, or who were enrolled in a PhD program (see Section 8.2.1), were each provided a subset of LitQA2 questions and a performance-related financial incentive of $3-12 per question to answer as many questions correctly as possible within approximately one week, using any online tools and paper access provided by their institutions. Under these conditions, human annotators achieved 73.8% ± 9.6% (mean ± SD, n = 9) precision on LitQA2 and 67.7% ± 11.9% (mean ± SD, n = 9) accuracy (Figure 2A, green line). PaperQA2 thus achieved superhuman precision on this task (t(8.6) = 3.49, p = 0.0036) and did not differ significantly from humans in accuracy (t(8.5) = −0.42, p = 0.66).
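The fractional degrees of freedom in the quoted result (t(8.6), t(8.5)) indicate a Welch unequal-variance t-test on summary statistics. A small sketch of that calculation from means, SDs, and sample sizes; the human numbers match the quote, but the model-side values are placeholders, not figures from the paper:

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two groups given mean, standard deviation, and sample size."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Human precision from the quote: 73.8% +/- 9.6% (SD), n = 9.
# The 85.0 +/- 5.0, n = 9 for the model is a made-up placeholder.
t, df = welch_t(85.0, 5.0, 9, 73.8, 9.6, 9)
```

Because the variances differ between groups, df lands between min(n1, n2) - 1 and n1 + n2 - 2 rather than at an integer, which is why the paper reports values like 8.6.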

dcreater 5 hours ago
Academic writing is notoriously hard to read and often poorly written. If this lives up to its billing it will be a game changer: no need to rely on the sporadic, manual, intrinsically limited surveys we get now, whether from academics and analysts or from gym bros and Reddit posters.

PaulKeeble 6 hours ago
One of my big uses of LLMs has been searching through medical research. The issue has been occasionally running into confidence where there shouldn't be any, but I have found it hallucinates a lot less on scientific topics than it does on more common ones.