The model has no sense of what is objectively right. At its core is an LLM that's just chunking words up into tokens, pre-processing them, then pushing them through a neural net to predict the next most likely token. The issue is it thought a cubic millimetre was a millilitre.
You don't need the right training data sets for that; that's just objectively wrong.
Crude but I’d just discount any technology getting something basic wrong like that.
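For what it's worth, the unit relationship it fumbled is trivial to check by hand (a minimal sketch, nothing model-specific):

```python
# 1 mL = 1 cm³ = (10 mm)³ = 1000 mm³,
# so a cubic millimetre is a thousandth of a millilitre, not equal to one.
mm3_per_ml = 10 ** 3        # 1000 mm³ in one mL
ml_per_mm3 = 1 / mm3_per_ml

print(ml_per_mm3)           # 0.001
```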
The cell count itself is first-year biology. When I was in first year I'd open a first-year biology book and find that out using the index. I thought Google had solved finding that kind of thing - without resorting to Boolean search - over 20 years ago. This is a step backwards.
The terrifying bit is that this is a case where I know the answer is wrong. What if it's something I ask about which I know nothing? Which is most things.
It's a good tool for creating more convincing bullshit artists, as if the world hasn't enough of those already.
It does, without a shadow of a doubt, exhibit emergent intelligence.
And the speed of improvement is phenomenal, so it really matters which model you used that gave that answer.
You can make them look stupid easily.
The easiest one until about 6 months ago was to ask how many Rs are in the word "strawberry". Almost all of them used to tell you there were 2, because the chunking into tokens would break the word into something like "straw" as one token and "berry" as another. The exact letter composition of those tokens is effectively compressed away, so the model couldn't reliably count the Rs and would answer 2.
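A minimal sketch of why that happens (the vocabulary and token IDs here are hypothetical, not a real tokenizer's): the model receives opaque integer IDs, and the letters inside each token are no longer directly visible to it.

```python
# Hypothetical two-entry vocabulary; real tokenizers have ~100k entries.
vocab = {"straw": 101, "berry": 202}

# What the model actually sees for "strawberry": a list of IDs.
tokens = [vocab["straw"], vocab["berry"]]   # [101, 202]

# Counting Rs needs the characters, which the IDs no longer expose:
print("strawberry".count("r"))   # 3 at the character level
print(len(tokens))               # only 2 opaque tokens reach the model
```

The mistake isn't that the model literally counts tokens, but that it has no direct character-level view of its input, so spelling questions are unusually hard for it.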
They get it right nowadays, as systems have been put around the models to compensate for that sort of case. And the systems around the likes of ChatGPT Pro are a lot better than those around ChatGPT free. Same with Copilot free vs paid enterprise with GPT-5 enabled.
The benchmarks that were previously created, such as MMLU, are now barely relevant; most good models score above 85% on them.

Humanity's Last Exam is the main benchmark people are tracking (albeit one might argue it's waaay too focused on science & maths), but even the progress there is amazing: GPT-4o could only achieve 2.7%, while GPT-5 is now up to 25.3%.
I'm sceptical whether it'll actually reach human-type intelligence - I don't see how a token prediction engine can be particularly creative when it's trying to output the most probable answer* - but at the current rate of progress I wouldn't rule it out entirely. Additionally, few people are actually creative for the vast, vast majority of their lives, so if they can improve the accuracy and mitigate hallucinations, there's still plenty of improving to be done.
It's amazing that simply using something that predicts the next word is actually as intelligent as it is, for all its flaws.
*Philosophers such as Wittgenstein have argued that while language is not everything, it is one of the fundamental bases of our knowledge and understanding. The Soviets (and, as a result, Orwell) understood this well enough: if you ban the expression of an idea, then long term you suppress the ability to express it, and thus you suppress the idea itself - the inverse also being true, that ideas arise from language. If you can capture our language, you can capture our intelligence. Clearly, it's only partially true...





