WikiCrow: Automating Synthesis of Human Scientific Knowledge
Sam Cox, Michael Hammerling, Jakub Lála, Jon Laurent,
Sam Rodriques, Matt Rubashkin, Andrew White
As scientists, we stand on the shoulders of giants. Scientific progress requires curation and synthesis of prior knowledge and experimental results. However, the scientific literature is so expansive that synthesis, the comprehensive combination of ideas and results, is a bottleneck. The ability of large language models to comprehend and summarize natural language will transform science by automating the synthesis of scientific knowledge at scale. Yet current LLMs are limited by hallucinations, lack access to the most up-to-date information, and do not provide reliable references for statements.
Here, we present WikiCrow, an automated system that can synthesize cited, Wikipedia-style summaries for technical topics from the scientific literature. WikiCrow is built on top of FutureHouse’s internal LLM agent platform, PaperQA, which, in our testing, achieves state-of-the-art (SOTA) performance on a retrieval-focused version of PubMedQA and other benchmarks, including LitQA, a new retrieval-first benchmark we developed internally to evaluate systems that retrieve full-text PDFs across the entire scientific literature.
As a demonstration of AI's potential to impact scientific practice, we used WikiCrow to generate draft articles for the 15,616 human protein-coding genes that currently lack Wikipedia articles or have only article stubs. WikiCrow creates each article in about 8 minutes, is far more consistent than human editors at citing its sources, and makes incorrect inferences or statements about 9% of the time, a rate we expect to improve as our systems mature. WikiCrow will be a foundational tool for the AI Scientists we plan to build in the coming years and will help us democratize access to scientific research.
Background
If you’ve spent time in molecular biology, you have probably encountered the “alphabet soup” problem of genomics. Experiments in genomics uncover lists of genes implicated in a biological process, like MGAT5B and ADGRA3. Researchers turn to tools like Google, UniProt, or Wikipedia to learn more, since no single person can know all ~20,000 human genes. However, by our count, only 3,639 of the 19,255 human protein-coding genes recognized by the HGNC have high-quality (non-stub) summaries on Wikipedia; the other 15,616 lack pages or are incomplete stubs. Often, plenty is known about a gene, but no one has taken the time to write up a summary. This is part of a much broader problem: scientific knowledge is hard to access, and often locked up in impenetrable technical reports. To find out about genes like MGAT5B and ADGRA3, you would end up sinking hours into the primary literature.
WikiCrow is a first step towards automated synthesis of human scientific knowledge. As a first demo, we used WikiCrow to generate drafts of Wikipedia-style articles for all 15,616 human protein-coding genes that currently lack articles or have only stubs, using information from full-text articles that we can access through our academic affiliations. We estimate that this task would have taken an expert human ~60,000 hours total (about 6.8 years of around-the-clock work). By contrast, WikiCrow wrote all 15,616 articles in a few days (about 8 minutes per article, with 50 instances running in parallel), drawing on 14,819,358 pages from 871,000 scientific papers that it identified as relevant in the literature.
Our articles are still far from perfect. To evaluate WikiCrow, we randomly selected 100 statements from its articles and asked:
- Is the statement cited? Is there a nearby citation that is clearly intended to support this statement, and is the citation relevant?
- Is the statement correct according to the citation? Does the cited literature contain the information that is presented in the statement being evaluated?
Each statement was thus classified as having a missing or irrelevant citation, as cited and correct, or as cited and incorrect. We then repeated the same process for statements from human-written articles. The results are as follows:
As you read WikiCrow articles, you will see incorrect statements about 9% of the time. You may also see repetitive statements, or citations that aren’t correct. We expect that these errors will become rarer as the underlying models and techniques improve. On the other hand, WikiCrow is much better at providing citations than human authors. Make sure to check any information you read here yourself before relying on it, and please alert us to any errors you may find. For more technical details, read on:
PaperQA as a Platform for WikiCrow
WikiCrow is built on top of PaperQA, a Retrieval-Augmented Generation (RAG) agent that, in our testing, answers questions over the scientific literature better than other LLMs and commercial products (see our paper on PaperQA). PaperQA reduces hallucinations, provides context and references showing how an answer was generated, is orders of magnitude faster than humans, and retains accuracy on par with experts. Check out how PaperQA works on a mix of popular-science and technical questions below:
[Interactive PaperQA demo: ten example questions, covering the function of dreams, the evolution of zebra stripes, the purpose of Stegosaurus plates, Pluto's planetary status, plant communication, the evolution of human color vision, dinosaur coloration, climate change and ocean currents, insect population decline, and the role of black holes in galaxy formation. Each example shows the search keywords and date ranges PaperQA generated, along with the papers it retrieved and cited.]
PaperQA is more than just a search tool; it is an adaptive system that chooses among tools based on the question and its intermediate findings (a minimal sketch of this loop follows the list below). These tools include:
- SEARCH: finding relevant papers in online databases, such as arXiv and PubMed;
- GATHER_EVIDENCE: parsing and summarizing text from these papers;
- ANSWER_QUESTION: ranking the relevance of the gathered context and synthesizing information into a final answer.
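To make this concrete, here is a minimal, illustrative sketch of a tool-driven agent loop in Python. This is not the actual PaperQA implementation: the tool bodies are stubbed out, and `llm_pick_action` stands in for whatever LLM call decides which tool to use next.

```python
def search(keywords: str) -> list[str]:
    """SEARCH: query literature databases (e.g. arXiv, PubMed) for candidate papers."""
    return []  # stub: would return paper texts or identifiers

def gather_evidence(question: str, papers: list[str]) -> list[str]:
    """GATHER_EVIDENCE: summarize each paper's text against the question."""
    return []  # stub: would return question-focused, cited summaries

def answer_question(question: str, evidence: list[str]) -> str:
    """ANSWER_QUESTION: rank the gathered evidence and synthesize a cited answer."""
    return "\n".join(evidence) if evidence else "Insufficient evidence."

def run_agent(question: str, llm_pick_action, max_steps: int = 10) -> str:
    """Repeatedly let the LLM pick a tool until it decides the answer is good enough."""
    papers: list[str] = []
    evidence: list[str] = []
    answer = ""
    for _ in range(max_steps):
        # The LLM sees the question and the current state, then picks the next tool,
        # e.g. searching again with new nomenclature found in an earlier paper.
        action, argument = llm_pick_action(question, papers, evidence, answer)
        if action == "search":
            papers += search(argument)
        elif action == "gather_evidence":
            evidence += gather_evidence(question, papers)
        elif action == "answer_question":
            answer = answer_question(question, evidence)
        else:  # the agent signals it is done
            break
    return answer
```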
This process is non-linear. For example, if PaperQA encounters a paper that uses a different word for a concept, it can go back and search again with the new nomenclature. Compared to a standard RAG pipeline, PaperQA makes four key changes (each of which improved performance in ablation testing):
- PaperQA breaks the retrieval-augmented generation process into tools for an AI agent, letting it perform multiple searches with different keywords whenever the information at hand isn't enough.
- PaperQA employs a map-reduce-inspired approach to summarization: the AI first collects (maps) evidence from a range of sources and then condenses (reduces) this information into an answer (see the sketch after this list). This increases the number of sources that can be considered and lets the LLM form preliminary insights before composing the final answer.
- PaperQA uses a hybrid search approach to work over all accessible papers, which number in the hundreds of millions: LLM-assisted keyword search at the corpus level and semantic search at the granular level of pages of text.
- PaperQA uses prompting strategies to draw out the prior knowledge already embedded in the language model, supplementing it with evidence from the scientific literature when needed, and treats the resulting answer as a form of posterior knowledge.
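As a rough illustration of the map-reduce style summarization in the second point, here is a hedged sketch (not the actual PaperQA code): `llm` stands for any prompt-in, text-out completion call, and the prompts are paraphrased.

```python
def map_reduce_answer(question: str, chunks: list[tuple[str, str]], llm, k: int = 8) -> str:
    """chunks: (citation, text) pairs retrieved for the question.
    llm: any function mapping a prompt string to a completion string."""
    # Map: produce one question-focused summary per retrieved chunk,
    # carrying its citation along with it.
    summaries = [
        (citation, llm(f"Summarize this excerpt only as it bears on the question "
                       f"'{question}':\n\n{text}"))
        for citation, text in chunks
    ]
    # Reduce: condense the strongest evidence into a single, cited answer.
    context = "\n\n".join(f"({c}): {s}" for c, s in summaries[:k])
    return llm(f"Using only the context below, answer '{question}' and cite "
               f"sources inline.\n\nContext:\n{context}")
```

In practice the mapped summaries would also be scored for relevance so that the reduce step sees only the best evidence; the simple top-k slice above is a stand-in for that ranking.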
Importantly, PaperQA builds upon the unique structure of the scientific literature, namely its citation graph and its categorization into journals and fields. This is only possible thanks to the excellent contributions of the Semantic Scholar team at the Allen Institute for AI, whose API for exploring the citation graph of science is a key feature of PaperQA. We plan to make the full WikiCrow and PaperQA code available on GitHub soon. Until then, the essential aspects of the PaperQA algorithm are available (although you will need access to your own repository of full-text scientific articles), as well as the prompts used for WikiCrow.
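For instance, a paper's citation neighborhood can be explored through the public Semantic Scholar Graph API. The sketch below uses the documented paper-citations endpoint, but the particular fields requested, the response parsing, and the lack of paging and error handling are simplifications, and the choice of DOI is arbitrary.

```python
import requests

S2_API = "https://api.semanticscholar.org/graph/v1"

def papers_citing(doi: str, limit: int = 20) -> list[dict]:
    """Return basic metadata for papers that cite the paper with the given DOI."""
    resp = requests.get(
        f"{S2_API}/paper/DOI:{doi}/citations",
        params={"fields": "title,year,externalIds", "limit": limit},
    )
    resp.raise_for_status()
    # Each entry wraps the citing paper's metadata under the "citingPaper" key.
    return [entry["citingPaper"] for entry in resp.json().get("data", [])]
```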
Benchmarking PaperQA
In our evaluations, PaperQA outperforms GPT-4, Perplexity, and other LLMs, as well as commercial RAG systems, on several benchmarks. It performs strongly on two scientific question-answering benchmarks, MedQA-USMLE and PubMedQA Blind; the latter is a modified version of PubMedQA in which the original contexts are removed, so the system must find and retrieve the relevant papers itself. Additionally, PaperQA outperforms a range of systems on LitQA, a new benchmark we developed to validate our performance. LitQA consists of multiple-choice questions that are difficult or impossible to answer accurately without retrieving one or more specific papers, all of which were published after the 2022 training cutoffs of GPT-4 and Claude 2. LitQA is currently small, with only 50 questions, because such questions are extraordinarily time-consuming to generate and validate, but we plan to scale it up in the future. Also note that we performed this testing in October 2023 (except for Gemini Pro, tested in December 2023) and did not try to optimize any of the commercial systems, so they could possibly be engineered for higher performance, or would perform better if tested today.
WikiCrow Mechanics
We carefully prompt the PaperQA agent to collect information on specific genes from scientific papers for each essential Wikipedia article section: Structure, Function, Interactions and Clinical Significance. To develop these prompts, we started with Wikipedia’s existing molecular biology style guide, then made significant changes over several empirical iterations. This highlights the continued importance of prompt engineering and the need for improved alignment strategies.
Afterwards, we use another LLM call to edit these four independent sections into a coherent and concise Wikipedia-style article, adding an Overview paragraph at the top while maintaining all citations. The specific prompts used are available. Additionally, we are in conversations with Wikipedia about hosting these articles, and we will continue to make our versions available programmatically; for example, you can use this gsutil command to list all genes available for download: gsutil ls gs://fh-public/wikicrow/
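A hedged sketch of that two-stage flow follows. The function names and prompt wording are illustrative, not WikiCrow's actual prompts; `paperqa_answer` stands in for a PaperQA query and `llm` for a plain completion call.

```python
SECTIONS = ["Structure", "Function", "Interactions", "Clinical Significance"]

def write_gene_article(gene: str, paperqa_answer, llm) -> str:
    """paperqa_answer(question) -> cited prose; llm(prompt) -> completion text."""
    # Stage 1: query PaperQA independently for each standard section,
    # so every claim arrives with its supporting citations.
    drafts = {
        section: paperqa_answer(
            f"Describe the {section.lower()} of the human gene {gene}, "
            f"citing the primary literature."
        )
        for section in SECTIONS
    }
    # Stage 2: a separate editing call merges the sections into one concise
    # article, adds an Overview at the top, and keeps every citation intact.
    body = "\n\n".join(f"== {name} ==\n{text}" for name, text in drafts.items())
    return llm(
        f"Edit these draft sections about {gene} into a single coherent, concise "
        f"Wikipedia-style article. Add an Overview section first and preserve all "
        f"citations.\n\n{body}"
    )
```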
Statements from human-written Wikipedia articles usually failed evaluation because their citation support was irrelevant, inappropriate, or absent. We believe this stems from the varying quality of authorship, as well as from Wikipedia's format, which does not require every statement to be justified with peer-reviewed articles. Interestingly, statements from WikiCrow-generated articles follow the opposite pattern: the majority of failures come from incorrect transmission of information from the cited article. This was typically due to the model's difficulty distinguishing highly similar gene names (e.g., GSDMD vs. GSDME), or its failure to parse the logic of complex sentences, such as "knockdown of a repressive gene", a clause with stacked negatives.
Evaluating the performance of RAG-powered LLMs is a new area of study, and our evaluation strategy has several limitations and challenges, which we highlight here:
- We do not evaluate absolute statement accuracy: We only evaluate whether statements are cited and whether they are true as cited; we do not evaluate whether statements are objectively accurate. Statements that are accurate but either uncited or incorrectly cited, which are probably more common in human-written Wikipedia articles, are scored as incorrect on either the “properly cited” criterion or the “true as cited” criterion. Trivially correct statements are excluded from evaluation.
- Evaluation is challenging to blind: WikiCrow-written articles use significantly more references to bolster individual claims, so it is usually easy to tell which articles were written by humans and which were written by WikiCrow in evaluations.
- Inconsistent citation strategies: Humans use inconsistent citation strategies which require subjective evaluation. For example, we identified several cases of circular references in human-written Wikipedia articles, and we also identified several cases where human articles would cite large database entries like Entrez, rather than primary literature, which were difficult to evaluate. The need to make subjective decisions about whether to exclude such statements raises bias concerns.
- Sample exclusion: Articles generated both by WikiCrow and by humans often contain trivial statements of fact, which also need to be excluded from evaluation on a subjective basis.
Despite these challenges, we think that our evaluation system is a reasonably accurate reflection of the “ground truth” quality of human-written and WikiCrow-written articles. If you have suggestions about how to improve evaluation, let us know, or consider applying to join our Assessment Team!
Conclusion
We built WikiCrow and PaperQA as foundational tools both for human researchers and for the AI Scientists we are building at FutureHouse. We plan for PaperQA to be one of many tools available to our AI Scientists, aiding in knowledge synthesis, experimental planning, hypothesis generation, and more. Moreover, PaperQA will be part of a closed-loop system, ensuring continuous and informed progression from theory to experimentation.
In addition, we believe that the WikiCrow approach will eventually enable synthesis and curation of all human scientific knowledge, in collaboration with human editors. Some directions we expect to explore include the use of dedicated models that are fine-tuned on Wikipedia edits, and improved alignment strategies to reduce the amount of prompt engineering that is needed for generation of comprehensive and coherent articles for a given topic. In the long run, we even envision a “Super-pedia,” where articles are generated about any topic in real time, on-demand, with the most up-to-date information. If you’re excited to work on this, get in touch.
Interested in using PaperQA or WikiCrow? Fill out the form here