I wanted to extract some crime statistics broken down by type of crime and by population group, all normalized by population size, of course. I got a nice set of tables summarizing the data for each year I requested.
When I shared these summaries, I was told they're entirely unreliable due to hallucinations. So my question to you is: how common a problem is this?
I compared results from ChatGPT-4, Copilot and Grok, and they all agree (Gemini says the data is unavailable, btw :)
So, are LLMs reliable for research like this?
The least unreliable LLM I've found by far is Perplexity, in Pro mode. (By the way, if you want to try it out, you get a few free uses a day.)
The reason is that Pro mode doesn't just retrieve and spit out information from its internal memory. Instead, it uses your question to launch multiple search queries, summarises the pages it finds, and then gives you an answer based on those pages.
Other LLMs try to answer "from memory" and then add some links at the bottom for fact checking, but Perplexity's answers come straight from the web, so they're usually quite good.
However, depending on how critical the task is, I still check that each tidbit of information has one or two links next to it and that those links actually talk about the right thing; if getting it right is truly critical, I verify the data myself. I use it as a beefier search engine, and it works great because it limits the possible hallucinations to the summarisation of the pages. But it doesn't eliminate them completely, so you still need to do some checking.
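To make the difference concrete, here's a minimal sketch of the two answering styles described above. Everything here is hypothetical for illustration (the tiny `TOY_WEB` corpus, the `search`/`answer_*` functions); it is not Perplexity's actual implementation, just the general retrieve-then-summarise pattern versus answering from memory with links tacked on afterwards.

```python
# Hypothetical illustration of "answer from memory" vs. "retrieve then summarise".
# The corpus and all function names are made up for this sketch.

TOY_WEB = {
    "https://example.org/stats-2021":
        "In 2021 the reported burglary rate was 271 per 100,000 residents.",
    "https://example.org/stats-2022":
        "In 2022 the reported burglary rate was 254 per 100,000 residents.",
}

def search(query):
    """Pretend search engine: return URLs of pages mentioning a query term."""
    terms = query.lower().split()
    return [url for url, text in TOY_WEB.items()
            if any(t in text.lower() for t in terms)]

def answer_from_memory(query):
    """'From memory' style: a canned answer with no grounding,
    links added after the fact for optional fact checking."""
    return {"answer": "The burglary rate was roughly 300 per 100,000 (unverified).",
            "sources": search(query)}

def answer_from_retrieval(query):
    """Retrieve-then-summarise style: the answer is built only from
    fetched pages, so every claim maps back to a source URL."""
    urls = search(query)
    snippets = [TOY_WEB[u] for u in urls]
    return {"answer": " ".join(snippets), "sources": urls}
```

The point of the second style is that each claim in the answer can carry a citation next to it, which is exactly what you'd spot-check: does the linked page actually say what the summary says?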