My 2 cents (disclaimer: I'm talking out of my ass) on why GPTs actually suck at fluid knowledge retrieval (which is kind of their main use case, with them being used as knowledge engines) - it's been shown that if you train a model on 'Tom Cruise was born July 3, 1962', it won't be able to answer the question "Who was born on July 3, 1962?" unless you also feed it that fact in the reverse direction (the so-called 'reversal curse'). It can't really internally correlate the information it has learned unless you train it to, probably via synthetic data, which is probably what OpenAI has done, and that's the kind of knowledge SimpleQA tries to measure.
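To make the synthetic-data idea concrete, here's a toy sketch of what that augmentation could look like - the fact list, templates, and function name are all made up for illustration, not anything OpenAI has described:

```python
# Hypothetical sketch: augment training text with reversed restatements of
# facts, so a model sees "who was born on <date>" style phrasings too,
# not just the forward "<name> was born on <date>" direction.

facts = [("Tom Cruise", "July 3, 1962")]

def augment(facts):
    examples = []
    for name, date in facts:
        # forward direction, as it typically appears in raw web text
        examples.append(f"{name} was born on {date}.")
        # reversed direction, generated synthetically
        examples.append(f"The person born on {date} is {name}.")
    return examples

for line in augment(facts):
    print(line)
```

The point is just that the reversed sentences have to exist somewhere in the training mix; the model won't infer them from the forward direction on its own.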
Probably what happened is that, in doing so, they had to scale either the model size or the training cost to untenable levels.
In my experience, LLMs really suck at fluid knowledge retrieval tasks like book recommendation - I asked GPT-4 to recommend some SF novels with certain characteristics, and what it spat out was a mix of stuff that didn't really match and stuff that was really reaching. When I asked the same question on Reddit, all the answers were relevant and on point - so I guess there's still something humans are good for.
Which is a shame, because I'm pretty sure relevant product recommendation is a many-billion-dollar business - after all, that's what Google built its empire on.
You make a good point: I think these LLMs have a strong bias towards recommending the most popular things in pop culture, since they really only find the most likely tokens and report on that.
So while they may have a chance of answering "What is this non-mainstream novel about?", they may be unable to recommend the novel, since it's not a likely series of tokens in response to a request for a book recommendation.
That's really interesting - it just made me think of an AI guy at Twitter (when it was called that) talking about how hard it is to create a recommender system that doesn't just flood everyone with what's popular right now. Since LLMs are neural networks as well, maybe the recommendation algorithms they learn suffer from the same issues.
Yep. I've often said RLHF'd LLMs seem to be better at recognition memory than recall memory.
GPT-4o will never offhand, unprompted and 'unprimed', suggest a rare but relevant book like Shinichi Nakazawa's "A Holistic Lemma of Science" but a base model Mixtral 8x22B or Llama 405B will. (That's how I found it).
It seems most of the RLHF'd models are biased towards popularity over relevance when it comes to recall. They know about rare people like Tyler Volk... but they will never suggest them unless you prime them really heavily.
I couldn't agree more with your point on recommendations from humans. Humans are the OG and still-undefeated recommendation system, in my opinion.
An LLM on its own isn't necessarily great for fluid knowledge retrieval, as in directly from its training data. But they're pretty good when you add RAG to it.
For instance, asking Copilot "Who was born on July 3, 1962" gave the response:
> One notable person born on July 3, 1962, is Tom Cruise, the famous American actor known for his roles in movies like Risky Business, Jerry Maguire, and Rain Man.
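For anyone unfamiliar with how RAG sidesteps the parametric-memory problem, here's a minimal sketch - the document store, the scoring trick, and the function names are all hypothetical simplifications, and real systems use embedding similarity rather than word overlap:

```python
# Minimal RAG sketch: retrieve the most relevant snippet for a query, then
# prepend it to the prompt so the model can answer from the retrieved text
# instead of relying on what it memorized during training.

documents = [
    "Tom Cruise was born on July 3, 1962.",
    "Risky Business is a 1983 film starring Tom Cruise.",
]

def retrieve(query, docs):
    # toy relevance scorer: count shared lowercase words;
    # production systems would use vector embeddings here
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query, docs):
    context = retrieve(query, docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

print(build_prompt("Who was born on July 3, 1962?", documents))
```

Since the answer arrives in the context window, the model only needs to read it back out - recognition rather than recall, which is exactly the task the thread above says these models are better at.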