Weird because Gemini scores the worst on my personal accuracy testing.
Do you suspect your personal accuracy testing is a lot better than the global lmsys arena leaderboard?
My tests are very basic, but they cover a lot of things relevant to my personal life. Gemini failed every single one of them.
Gemini struggled the most with false information, both creating falsehoods and accepting them without question. The hardest thing for most LLMs is responding “I don’t know” or “that’s not right”. It is my firm belief that any lack of knowledge should be identified and handled by RAG.
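To illustrate the idea (a minimal sketch, not any real product's API): detect when a model's reply signals a lack of knowledge, and fall back to retrieval instead of letting it guess. The model reply, the marker phrases, and the in-memory document store below are all hypothetical stand-ins.

```python
# Sketch: gate a model's answer on uncertainty markers and fall back to
# a (toy) retriever when the model admits it doesn't know.
from difflib import SequenceMatcher

# Stand-in document store; a real system would use a vector database.
DOCS = {
    "granite4": "IBM Granite 4 is a family of small open LLMs.",
    "rag": "Retrieval-augmented generation grounds answers in fetched documents.",
}

# Hypothetical phrases treated as "the model doesn't know".
UNCERTAIN_MARKERS = ("i don't know", "i'm not sure", "cannot verify")

def retrieve(query: str) -> str:
    """Return the stored doc whose key best matches the query (toy retriever)."""
    best = max(DOCS, key=lambda k: SequenceMatcher(None, k, query.lower()).ratio())
    return DOCS[best]

def answer(query: str, model_reply: str) -> str:
    """Pass the model reply through, unless it signals lack of knowledge --
    then ground the answer in retrieved context instead."""
    if any(m in model_reply.lower() for m in UNCERTAIN_MARKERS):
        return f"Based on retrieved context: {retrieve(query)}"
    return model_reply

print(answer("what is rag", "I don't know."))
```

The point isn't the toy string matching; it's that the "I don't know" case is detected explicitly and routed to retrieval rather than left to the model to hallucinate around.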
Same. Right now I rank them: Claude first, then everything else except ChatGPT, Gemini, and NaziAI; then ChatGPT & Gemini; NaziAI last.
I’ve found that IBM’s Granite4 3B works the best for Home Assistant. Claude is my #1 for code generation.
Ooo, I haven’t tried Granite4, I’ll check it out.
Same as Siri then