
Do you have ideas for what would make a better experiment? The methodology for the literature search comparison, while simple, is the best I could come up with. We developed ~250 multiple-choice questions that require a deep dive into a paper to answer, ideally with very convincing distractor answers. Then we gave 9 evaluators (post-docs and grad students in biology) a week to answer 40 questions each, without any limitations on their search. The evaluators were incentivized with a base pay per question completed, plus a 50-100% bonus if they got enough questions correct.

Under those circumstances, the evaluators had an answer precision of 73.8%, and the AI system (PaperQA2) was at 85.2%. Both the evaluators and PaperQA2 could choose not to answer a particular question. If you look at accuracy, which takes declining to answer into account, evaluators were at 67.7% and PaperQA2 was at 66%. So in terms of overall accuracy -- humans still did a touch better. But when actually answering, the AI was more precise.
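
For clarity on the two metrics: precision here is correctness over only the questions a system chose to answer, while accuracy is correctness over all questions, counting declined ones against you. A toy sketch with hypothetical outcomes (not our evaluation code):

    # Toy sketch: precision vs. accuracy when declining to answer is allowed.
    # Each outcome is one of "correct", "incorrect", or "unsure" (declined).
    def precision_and_accuracy(outcomes):
        answered = [o for o in outcomes if o != "unsure"]
        correct = outcomes.count("correct")
        precision = correct / len(answered) if answered else 0.0  # over answered only
        accuracy = correct / len(outcomes)                        # over all questions
        return precision, accuracy

    # 10 hypothetical questions: 7 correct, 1 wrong, 2 declined
    print(precision_and_accuracy(["correct"] * 7 + ["incorrect"] + ["unsure"] * 2))
    # -> (0.875, 0.7)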

In terms of the literature synthesis comparison, I think the methodology was pretty solid too, but I would love more feedback. We had PaperQA2 write cited articles for ~19k human genes, of which there are (non-stub) Wikipedia articles for ~3.9k. It's worth noting that this is a particularly technical subset of Wikipedia articles. We sampled 300 articles that were in both sources, then extracted 500 statements from each (basically a paragraph block). The statements could be compound, or even span multiple sentences, and they were shuffled and obfuscated so that the origin could not be determined from the statement alone.

The statements were given to a team of 4 evaluators, who were each asked to evaluate whether the information was correct as cited, i.e. did the source actually support the statement. So they had to access (if they could) and actually read all the sources. After we got the evaluator gradings back, we could compile and map each statement back to its origin for comparison. Under these circumstances, the PaperQA2-written articles were 83% cited and supported, while the Wikipedia articles were 61.5% cited and supported. Wikipedia had comparatively more uncited claims, so if we eliminate those and only focus on the cited claims themselves, then PaperQA2 had 86.1% of claims supported by their sources and Wikipedia had 71.2%. We did an analysis of every single unsupported claim; on Wikipedia, claims are often attributed to arbitrary or very broad sources, like a landing page for a database.
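
To make the blinding step concrete, here is a rough sketch of the shuffling and key-keeping involved (hypothetical code, not what we actually ran):

    # Hypothetical sketch of the blinding step: pool statements from both
    # sources, assign opaque IDs, shuffle, and keep a private key so grades
    # can be mapped back to their origin after evaluation.
    import random
    import uuid

    def blind(statements_by_source):
        """statements_by_source: e.g. {"paperqa2": [...], "wikipedia": [...]}"""
        pool, key = [], {}
        for source, statements in statements_by_source.items():
            for text in statements:
                sid = uuid.uuid4().hex
                key[sid] = source               # hidden from evaluators
                pool.append({"id": sid, "statement": text})
        random.shuffle(pool)
        return pool, key                        # evaluators only ever see `pool`

    def unblind(grades, key):
        """grades: {statement_id: "supported" | "unsupported" | "uncited"}"""
        by_source = {}
        for sid, grade in grades.items():
            by_source.setdefault(key[sid], []).append(grade)
        return by_source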

(here's the paper fwiw: https://arxiv.org/abs/2409.13740)


We measured PaperQA2 (https://github.com/Future-House/paper-qa) against the science portion of the RAG-Arena benchmark (https://arxiv.org/abs/2407.13998). It's the first time we've compared PaperQA2 against other systems based on Cohere or Contextual.ai. PaperQA2 achieves a 12.4% higher score than Contextual.ai on the same dataset (1,404 questions and 1.7M documents).

We're thrilled about this because it's open source, and getting better every day -- check out the code to reproduce this result in our cookbook here: https://futurehouse.gitbook.io/futurehouse-cookbook/paperqa/....


Great article, I’ve had similar findings! LLM-based “document-chunk” ranking is a core feature of PaperQA2 (https://github.com/Future-House/paper-qa) and part of why it works so well for scientific Q&A compared to traditional embedding-ranking-based RAG systems.
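
For readers unfamiliar with the idea: instead of trusting embedding similarity alone, you ask an LLM to score each retrieved chunk for relevance to the query and keep the top scorers. A minimal illustration of the pattern (not PaperQA2's actual implementation), assuming the openai Python client:

    # Minimal illustration of LLM-based chunk relevance ranking (not
    # PaperQA2's actual code). Assumes the `openai` package and an
    # OPENAI_API_KEY in the environment; model choice is arbitrary.
    from openai import OpenAI

    client = OpenAI()

    def score_chunk(question: str, chunk: str) -> int:
        """Ask the model for a 0-10 relevance score for one chunk."""
        prompt = (
            f"Question: {question}\n\nExcerpt:\n{chunk}\n\n"
            "On a scale of 0-10, how useful is this excerpt for answering the "
            "question? Reply with a single integer."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        try:
            return int(resp.choices[0].message.content.strip())
        except ValueError:
            return 0  # treat unparseable replies as irrelevant

    def rerank(question: str, chunks: list[str], top_k: int = 5) -> list[str]:
        return sorted(chunks, key=lambda c: score_chunk(question, c), reverse=True)[:top_k]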


That's awesome. Will take a closer look!


This is awesome! If you’re interested, you could add a search tool client for your backend in paper-qa (https://github.com/Future-House/paper-qa). Then paper-qa users would be able to use your semantic search as part of its workflow.


I advise against it, since binarized Hamming distance isn't that good unless your vector length is, say, a million.


I have the fp32 embeddings saved. I only use the binarised ones on the website, to combat latency.
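
For anyone curious about the trade-off being discussed: binarization keeps only the sign of each embedding dimension and compares vectors by Hamming distance, which is far smaller and faster than fp32 cosine similarity but lossier at modest vector lengths. A rough numpy sketch:

    # Rough sketch of embedding binarization and Hamming-distance search,
    # just to illustrate the trade-off discussed above.
    import numpy as np

    def binarize(embeddings: np.ndarray) -> np.ndarray:
        """Keep only the sign of each dimension, packed into bytes (32x smaller than fp32)."""
        return np.packbits(embeddings > 0, axis=-1)

    def hamming_distances(query_bits: np.ndarray, corpus_bits: np.ndarray) -> np.ndarray:
        """Number of differing bits between the query and every corpus vector."""
        xor = np.bitwise_xor(corpus_bits, query_bits)
        return np.unpackbits(xor, axis=-1).sum(axis=-1)

    # Example with random 1024-d embeddings
    corpus = np.random.randn(10_000, 1024).astype(np.float32)
    query = np.random.randn(1024).astype(np.float32)
    corpus_bits, query_bits = binarize(corpus), binarize(query)
    nearest = np.argsort(hamming_distances(query_bits, corpus_bits))[:10]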


paper-qa looks pretty cool. I will do so!


We used an open-source AI RAG library, PaperQA2 (https://github.com/Future-House/paper-qa), to generate well-cited articles for every gene in the human genome, ~15k of which had no existing prior articles. In terms of factuality, we tested our generated claims against the same gene's human-written Wikipedia article in a blinded study evaluated by PhD biologists. Our system's articles were more precise on average than cited claims from existing articles. (https://paper.wikicrow.ai)

The system is scalable in that we can comfortably generate all 19.2k gene articles once per week, building a repository of cited articles that automatically syncs with all published literature.


We're sharing some experiments in designing RAG systems via the open-source PaperQA2 system (https://github.com/Future-House/paper-qa). PaperQA2's design is interesting because it isn't concerned with cost, so it uses expensive operations like agentic tool calling, LLM-based re-ranking, and contextual summarization for each query.

Even though the costs are higher, we see that the RAG accuracy gains (in question-answering tasks) are worth it. Including LLM chunk re-ranking and contextual summaries in your RAG flow also makes the system robust to changes in chunk sizes, parsing oddities and embedding model shortcomings. It's one of the largest drivers of performance we could find.
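
To illustrate what contextual summarization means here: instead of stuffing raw chunks into the answer prompt, each retrieved chunk is first summarized by an LLM in the context of the specific question, and irrelevant chunks are dropped. A hedged sketch of the general pattern (illustrative only, not the PaperQA2 source):

    # Illustrative sketch of per-query contextual summarization (not the
    # actual PaperQA2 implementation). Assumes the `openai` package; the
    # model name and prompt wording are arbitrary choices.
    from openai import OpenAI

    client = OpenAI()

    def contextual_summary(question: str, chunk: str) -> str | None:
        """Summarize a chunk only insofar as it helps answer the question."""
        prompt = (
            f"Question: {question}\n\nSource text:\n{chunk}\n\n"
            "Summarize only the parts of the source text relevant to the question. "
            "If nothing is relevant, reply with exactly: NOT RELEVANT"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        summary = resp.choices[0].message.content.strip()
        return None if summary == "NOT RELEVANT" else summary

    def build_context(question: str, chunks: list[str]) -> str:
        summaries = [contextual_summary(question, c) for c in chunks]
        return "\n\n".join(s for s in summaries if s)  # only relevant evidence reaches the answer prompt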


We are announcing PaperQA2 (https://github.com/Future-House/paper-qa), the first AI agent to achieve superhuman performance on a variety of different scientific literature search tasks. PaperQA2 is an agent optimized for retrieving and summarizing information over the scientific literature. It has access to a variety of tools that allow it to find papers, extract useful information from those papers, explore the citation graph, and formulate answers. PaperQA2 achieves higher accuracy than PhD- and postdoc-level biology researchers at retrieving information from the scientific literature, as measured using LitQA2, a piece of the LAB-Bench evals set that we released earlier this summer. In addition, when applied to produce Wikipedia-style summaries of scientific information, WikiCrow, an agent built on top of PaperQA2, produces summaries that are more accurate on average than actual Wikipedia articles written and curated by humans, as judged by blinded PhD- and postdoc-level biology researchers.

To get a better feel for how it works, try out the repo or check this tweet thread here (https://x.com/SGRodriques/status/1833908643856818443). It's got some videos of the workflow live.

PaperQA2 allows us to perform analyses over the literature at a scale that is currently unavailable to scientists. At FutureHouse, we previously showed that we could use an older version (PaperQA) to generate a Wikipedia article for all 20,000 genes in the human genome, by combining information from 1 million distinct scientific papers. However, those articles were less accurate on average than existing articles on Wikipedia. Now that the articles we can generate are significantly more accurate than Wikipedia articles, one can imagine generating Wikipedia-style summaries on demand, or even regenerating Wikipedia from scratch with more comprehensive and recent information. In the coming weeks, we will use WikiCrow to generate Wikipedia articles for all 20,000 genes in the human genome, and will release them at wikicrow.ai. In the meantime, wikicrow.ai contains a preview of the 240 articles used in the paper.

In addition, we are very interested in how PaperQA2 could allow us to generate new hypotheses. One approach to that problem is to identify contradictions between published scientific papers, which can point the way to new discoveries. In our paper, we describe how ContraCrow, an agent built on top of PaperQA2, can evaluate every claim in a scientific paper and identify any other papers in the literature that disagree with it. We grade these contradictions on a Likert scale to remove trivial contradictions. In a random subset of biology papers, we find on average 2.34 statements per paper that are contradicted by other papers elsewhere in the literature. Exploring these contradictions in detail may allow agents like PaperQA2 and ContraCrow to generate new hypotheses and propose new pivotal experiments.
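
As a small illustration of the Likert filtering step only (hypothetical data model, not ContraCrow itself), assuming each candidate contradiction has already been graded 1-5:

    # Hypothetical post-processing step: keep only strong contradictions,
    # assuming each candidate already carries a 1-5 Likert grade.
    from collections import defaultdict

    def strong_contradictions(candidates, threshold=4):
        """candidates: [{"paper": ..., "claim": ..., "other_paper": ..., "likert": int}, ...]"""
        per_paper = defaultdict(list)
        for c in candidates:
            if c["likert"] >= threshold:   # drop trivial or weak contradictions
                per_paper[c["paper"]].append(c)
        return per_paper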


Is it possible for you to post the full text of the Alzheimer's article that PaperQA2 actually generated in the YouTube video demo? It would be interesting to compare the PaperQA2 output to what other similar systems can produce.


Sure! Here's the output from the video--

The understanding of amyloid-beta's role in Alzheimer's disease has evolved significantly over time: Initially, amyloid plaques were considered the primary cause of Alzheimer's, leading to the amyloid cascade hypothesis (Fantini2020Progress pages 4-7). Research shifted focus from plaques to small oligomeric structures of amyloid-beta as the neurotoxic culprits (Fantini2020Progress pages 4-7). It was discovered that physiological (low) levels of amyloid-beta enhance memory, while only pathological (high) levels are harmful (Morley2014The pages 1-6). The view changed from Aβ peptides being purely harmful to recognizing their important physiological functions in the brain (Fantini2020Progress pages 1-4). Clinical trials targeting amyloid have had mixed results, leading to questioning of the amyloid hypothesis (Hardy2009The pages 2-3). Some researchers now propose rejecting the amyloid cascade hypothesis as insufficient to explain Alzheimer's complex biology (Herrup2015The pages 1-1). There's growing recognition that tau pathology may correlate better with cognitive impairment than amyloid accumulation (Mullane2020Alzheimer's pages 16-16). Current approaches are shifting towards considering amyloid-beta as one factor among many in the aging process, rather than the central cause of Alzheimer's (Josepha2001Copernicus pages 1-2).

References: Fantini2020Progress: Jacques Fantini, Henri Chahinian, and Nouara Yahi. Progress toward alzheimer’s disease treatment: leveraging the achilles’ heel of aβ oligomers? Protein Science, 29(8):1748–1759, July 2020. URL: http://dx.doi.org/10.1002/pro.3906, doi:10.1002/pro.3906. This article has 48 citations and is from a peer-reviewed journal.

Morley2014The: John E. Morley and Susan A. Farr. The role of amyloid-beta in the regulation of memory. Biochemical Pharmacology, 88(4):479–485, April 2014. URL: http://dx.doi.org/10.1016/j.bcp.2013.12.018, doi:10.1016/j.bcp.2013.12.018. This article has 96 citations and is from a domain leading peer-reviewed journal.

Hardy2009The: John Hardy. The amyloid hypothesis for alzheimer’s disease: a critical reappraisal. Journal of Neurochemistry, 110(4):1129–1134, July 2009. URL: http://dx.doi.org/10.1111/j.1471-4159.2009.06181.x, doi:10.1111/j.1471-4159.2009.06181.x. This article has 615 citations and is from a domain leading peer-reviewed journal.

Josepha2001Copernicus: J Josepha. Copernicus revisited: amyloid beta in alzheimer’s disease. Neurobiology of Aging, 22(1):131–146, January 2001. URL: http://dx.doi.org/10.1016/s0197-4580(00)00211-6, doi:10.1016/s0197-4580(00)00211-6. This article has 146 citations and is from a domain leading peer-reviewed journal.

Hamley2012The: I. W. Hamley. The amyloid beta peptide: a chemist’s perspective. role in alzheimer’s and fibrillization. Chemical Reviews, 112(10):5147–5192, July 2012. URL: http://dx.doi.org/10.1021/cr3000994, doi:10.1021/cr3000994. This article has 775 citations and is from a highest quality peer-reviewed journal.

Herrup2015The: Karl Herrup. The case for rejecting the amyloid cascade hypothesis. Nature Neuroscience, 18(6):794–799, May 2015. URL: http://dx.doi.org/10.1038/nn.4017, doi:10.1038/nn.4017. This article has 593 citations and is from a highest quality peer-reviewed journal.

Jacobs2022It’s: Noortje Jacobs and Bert Theunissen. It’s groundhog day! what can the history of science say about the crisis in alzheimer’s disease research? Journal of Alzheimer’s Disease, 90(4):1401–1415, December 2022. URL: http://dx.doi.org/10.3233/jad-220569, doi:10.3233/jad-220569. This article has 4 citations.

Mullane2020Alzheimer’s: Kevin Mullane and Michael Williams. Alzheimer’s disease beyond amyloid: can the repetitive failures of amyloid-targeted therapeutics inform future approaches to dementia drug discovery? Biochemical Pharmacology, 177:113945, July 2020. URL: http://dx.doi.org/10.1016/j.bcp.2020.113945, doi:10.1016/j.bcp.2020.113945. This article has 68 citations and is from a domain leading peer-reviewed journal.


Thank you! We spent a lot of time trying to keep the characters on rails. I hope your son enjoys it too!


You've got a great point! I would also love to do a registration-less version, but the unfiltered access to my LLM client really worries me since API abuse would be so easy.

I'll brainstorm some ways to protect against abuse and try them out! Maybe some session limits based on cookies.


I work in data at https://www.carrumhealth.com/, and I've been parsing this data for weeks. The transparency prices allow us to meaningfully negotiate with providers, and make tangible, incremental progress toward cheaper health care. Providers and existing insurance carriers leverage information asymmetry to control the market otherwise.

For context, we bundle the hundreds of itemized costs into a single, static bill per surgery type. In doing so, we've built a custom virtual network with the most efficient surgeons. These surgeons are able to meet the volume and quality requirements that allow for lower margins. We're able to get negotiated rates that are 10-40% cheaper than traditional insurance contracts when we have data that we trust.

Unfortunately, this data alone isn't enough to properly determine prices, because organizations will spread costs across procedure and billing codes that often occur in aggregate groups. For example, in a joint replacement surgery, some organizations may dump the cost into the billing for the implant itself, while others may put it under the procedure code. You have to gather billing data en masse to see which charges occur together, then combine this pricing data to determine what costs will actually look like for someone undergoing a procedure.

It's a nightmare!
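
To make the "gather billing data en masse" point concrete, here is a toy sketch of the kind of co-occurrence counting involved (hypothetical code names and data model):

    # Toy sketch of counting which billing/procedure codes are billed together
    # across claims (hypothetical data model and code names).
    from collections import Counter
    from itertools import combinations

    def cooccurrence_counts(claims):
        """claims: list of sets, each the codes billed together on one claim."""
        pairs = Counter()
        for codes in claims:
            for a, b in combinations(sorted(codes), 2):
                pairs[(a, b)] += 1
        return pairs

    # e.g. a joint replacement might appear as procedure + implant + anesthesia codes
    claims = [
        {"PROC_KNEE", "IMPLANT", "ANESTHESIA"},
        {"PROC_KNEE", "IMPLANT"},
        {"PROC_KNEE", "ANESTHESIA"},
    ]
    print(cooccurrence_counts(claims).most_common(3))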


How much do you think it costs to maintain all these negotiated contracts vs. just having a single-payer system with the same price for all procedures?


It's very expensive; carriers have an economic incentive to simplify it, and this is still where they end up. There is a long tail of provider circumstances that the single-payer model will need to figure out. Some examples:

* Small hospitals in low-density, underserved areas have to make up for underutilized equipment and personnel costs. They raise prices on unrelated, common procedures to break even (this is very common).

* CMS (medicare/medicaid) sets a low price for a procedure that's overly common in a particular facility, now that facility loses money for each occurrence. They choose other procedures to raise the price to try to break even.

* Larger hospitals have higher administrative and operations costs (for things like training and research) that benefit society, but need to be averaged out across all procedure costs. This differs from hospital to hospital.

* Smaller professional facilities or physician groups (like Ambulatory Surgery Centers) have much lower administrative costs and a smaller staff, so they have lower overhead per procedure. They are designed to be efficient and can handle lower prices. However, if there are any major complications, they won't be able to care for the patient and have to send them to a hospital. This pushes all the highest-cost, ICU-type procedures into hospitals, where overhead is already higher, so hospitals need separate pricing to cover more complex patients.

A large single-payer price set will probably force efficiencies into the healthcare system. It'll be great for folks' costs, but we may see many facilities close, and lines of care will be consolidated into specialty centers (more travel to get imaging, procedures, or to see a specialist).


What do you think about how Kaiser has handled the whole thing? The insurance company employing the doctors and just paying them a standard salary seems to create all the right incentives.


My experience in talking to people with chronic conditions that aren’t easily treatable is that Kaiser’s model works great until anything that’s slightly out of the ordinary happens, and then it falls apart. If you’re a zebra (as in “when you hear hooves, think horses, not zebras”), their model is pretty horrible.


The best thing about Kaiser, IMHO, is there is never a surprise out-of-network astronomical charge on the bill as I've seen with regular insurance.


Isn't it pretty bad to be a zebra in general though? Certainly there isn't any place where zebras have it better than horses.


Yes, but if you're at Kaiser in San Francisco and have a zebra, there may be only one doc (or a small group) at UCSF who can treat your zebra, and they are not in the Kaiser network. So you either go to Los Angeles where Kaiser's specialist is, get treated by a lesser doc with a virtual-visit assist from LA, or pay cash out of network.


I think their point is that it's relatively better under another system, not that it's amazing there.


Have insurance split into two parts: the 95% of common cases, and the rare and expensive ones?


Sounds like they have intelligently optimized for the common case.


>>* CMS (medicare/medicaid) sets a low price for a procedure that's overly common in a particular facility, now that facility loses money for each occurrence. They choose other procedures to raise the price to try to break even.

This is precisely why most doctors I speak with are vehemently against a single-payer system.


Most doctors I talk to dance around the answer before mumbling that a big way to cut costs (which will surely happen) is to cut doctors' salaries.

Source: once engaged to a doctor who had doctor friends and doctor parents/family.


And there is a reason why we shouldn't go off of anecdotal evidence. It's blatantly false.

Doctors’ salaries account for only about 8% of U.S. healthcare costs. A 40% cut in these salaries would reduce healthcare spending by only about 3% [1][2].

Doctors' salaries are not a huge way to cut costs. If anything, this would make the problem worse.

[1] https://www.latimes.com/opinion/story/2021-09-14/dont-blame-...

[2] https://pnhp.org/news/doctors-salaries-are-not-the-big-cost/


It’s not false that doctors worry about that. Doctors worry that a single-payer system will reduce their salaries. They’re an easy political target. They’re rich and (in this hypothetical case) their salary would come from the taxpayers. Taxpayers don’t like expensive salaries.

It’s irrelevant how much of the budget it is. It’s about perception and power. If you try to cut soap in the operating room or other supplies, you’ll look bad for endangering the patient. If you try to cut procedures you’ll look bad. If you try to cut doctor salaries, those “overpaid” doctors look bad for complaining.

Doctors have a reputation in America for being extremely well paid. If you tell people making $60k a year that their tax bill for medical costs could be lower by taking $50k from a doctor making $500k (taxpayer dollars!), they’ll support that. Even if it’s not a big amount.

Reducing healthcare spending 3% without any systemic change in medical treatments or equipment or negotiation with pharmaceutical companies is a huge and easy win.


PBS put out a documentary ages ago comparing America to other countries. At the time, our administrative overhead was 25%, while Taiwan's was 2%.


Not much.

The net cost of insurance represents 6.4% of all healthcare spending.

https://www.ama-assn.org/delivering-care/patient-support-adv...


Is the data unique, or has it been duplicated across multiple formats? In other words, is there a CSV file right alongside a JSON file and an XML file that contain the exact same data, just in different formats?

Is the data partitioned at all (e.g. by state), so that you can just download the data for California without downloading all the data, loading it into a huge database table, and then querying it (e.g. SELECT * FROM <table> WHERE state = 'California')?


There is some duplication, where different networks under the same carrier could benefit from normalization, but in general duplication isn't the primary issue.

The data is partitioned for some carriers at the network level, but unless that carrier has networks that are unique to a given state, it's difficult to partition by location.

The majority of the data is lumped into very large single JSON files (not newline-delimited), so an initial parsing step is required to break out substructures for parallel processing via warehousing technologies. I think Aetna has a 300 GB compressed (single) JSON file.

After breaking the JSON into a single array entry per provider/network, parsing is still a bit tricky because there are some very "hot" keys. Some provider array entries may only have 1,000 code and cost entries, others may have 100k. We've seen array entries >50 MB for a single provider/network/carrier.
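
For anyone attempting this: one workable approach is to stream-parse the JSON so the whole file never has to fit in memory, e.g. with ijson, then emit newline-delimited records for warehouse loading. A sketch assuming a top-level "in_network" array (as in the CMS transparency-in-coverage schema) and a hypothetical file name; adjust the prefix for the file you actually have:

    # Sketch: stream one very large in-network file with ijson and write each
    # array entry as newline-delimited JSON for parallel downstream processing.
    # Assumes a top-level "in_network" array and a gzip-compressed input file
    # (both are assumptions -- adjust for your actual file).
    import gzip
    import json
    import ijson

    def stream_entries(path, prefix="in_network.item"):
        with gzip.open(path, "rb") as f:          # stream-decompress while parsing
            for entry in ijson.items(f, prefix):  # yields one provider/code entry at a time
                yield entry

    with open("in_network.ndjson", "w") as out:
        for entry in stream_entries("carrier_in_network.json.gz"):
            # ijson yields Decimal for numbers; default=str keeps json.dumps happy
            out.write(json.dumps(entry, default=str) + "\n")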


Sounds like an application for ML, to determine which codes frequently coincide per-patient at each provider and then assign those groupings to cross-provider "Treatment XYZ" buckets to enable apples-to-apples comparisons.


I would think a basic statistical analysis should suffice.


most software billed as having 'ML' capabilities is just basic statistical analysis anyway - but that doesn't make for good marketing-speak.


Great call! Many orgs in health tech use billing/procedure-code embeddings to group, just like you're suggesting.


Calculating a basic median for those groups would be a non-trivial (indeed, probably quite difficult) exercise at this scale.


Applying ML to health care is a guaranteed path to wealth, and later, insanity.


How do you get info on bills for historic procedures? Do patients opt in, or will the hospitals provide that information as part of their cooperation with insurers?

