> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
Because they are not.
Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
It’s the same reason most of the people who pass your Leetcode tests don’t actually know how to build anything real: they are taught to the test, not to reality.
By 0x20cowboy an hour ago
Did you seriously write all of this to strawman both LLM and Leetcode interviews? Impressive.
Get help.
By wiseowise an hour ago
Please consider a less emotive, flaming/personal tone in the future; Hacker News is much more readable without it!
I would broadly agree that it's a bit far, but the OP's point does have some validity; it's often the same formulaic methodology.
By bloqs 11 minutes ago
I've been testing LLMs on Sokoban-like puzzles (in the style of ARC-AGI-3) and they are completely awful at them. It really highlights how poor their memory is. They can't remember abstract concepts or rules between steps, even if they discover them themselves. They can only be presented with lossy text descriptions of such things which they have to re-read and re-interpret at every step.
LLMs are completely helpless on agentic tasks without a ton of scaffolding. But the scaffolding is inflexible and brittle, unlike the models themselves. Whoever figures out how to reproduce the functions of this type of scaffolding within the models, with some kind of internal test-time-learned memory mechanism, is going to win.
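To make that concrete, here's a toy sketch of the kind of loop I'm describing (`call_llm`, `apply_move`, and `solved` are stand-ins, not any real API). The point is that nothing persists inside the model between steps:

```python
def serialize_board(board):
    # Lossy text rendering: '#' wall, '$' box, '.' goal, '@' player.
    return "\n".join("".join(row) for row in board)

def run_episode(board, rules_discovered, max_steps=50):
    for step in range(max_steps):
        # Any rule the model "discovered" earlier must be re-fed as prose
        # and re-interpreted from scratch on every single step.
        prompt = (
            "You are playing a Sokoban-like puzzle.\n"
            f"Rules you previously inferred: {rules_discovered}\n"
            f"Current board:\n{serialize_board(board)}\n"
            "Reply with one move: up/down/left/right."
        )
        move = call_llm(prompt)          # stand-in for a model call
        board = apply_move(board, move)  # stand-in environment step
        if solved(board):
            return step
    return None
```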
By modeless 2 hours ago
I wonder if scaffolding synthesis is the way to go. Namely, the LLM itself first reasons about the problem and creates scaffolding for a second agent that does the actual solving, all inside a feedback loop that adjusts the scaffolding based on results.
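Something like this, in rough Python (every name here is hypothetical):

```python
# Two-stage loop: a planner model writes the scaffolding, a solver model
# runs inside it, and failures feed back into the planner.
# All three calls (call_planner, call_solver, verify) are stand-ins.

def solve_with_synthesized_scaffolding(task, rounds=3):
    scaffold = call_planner(f"Design step-by-step scaffolding for: {task}")
    for _ in range(rounds):
        result = call_solver(task, scaffold)  # the second agent solves
        ok, feedback = verify(task, result)   # programmatic or LLM-judged
        if ok:
            return result
        # Feed the failure back so the planner can revise its scaffolding.
        scaffold = call_planner(
            f"Scaffolding:\n{scaffold}\nFailed with: {feedback}\nRevise it."
        )
    return None
```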
By M4v3R 2 hours ago
I toyed around with the idea of using an LLM to "compile" user instructions into a kind of AST of scaffolding, which can then be run by another LLM. It worked fairly well for the kind of semi-structured tasks LLMs choke on, like "for each of 100 things, do...", but I haven't taken it beyond a minimal impl.
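Roughly this shape, as a toy sketch (the node types and `call_llm`/`load_items` are illustrative, not my actual impl):

```python
from dataclasses import dataclass

@dataclass
class LLMStep:
    instruction: str        # a leaf task small enough to do reliably

@dataclass
class ForEach:              # "for each of 100 things, do X"
    items: list
    body: LLMStep

def run(node, item=None):
    # The runner walks the compiled plan; a second LLM handles only leaves.
    if isinstance(node, ForEach):
        return [run(node.body, item=i) for i in node.items]
    return call_llm(f"{node.instruction}\nInput: {item}")  # stand-in

# The compiler LLM emits something like:
plan = ForEach(items=load_items(), body=LLMStep("Summarize this record."))
results = run(plan)
```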
By sixo 2 hours ago
I am working on something similar but with an AST for legal documents. So far, it seems promising but still rudimentary.
By harshitaneja an hour ago
In general I think the more of the scaffolding that can be folded into the model, the better. The model should learn problem solving strategies like this and be able to manage them internally.
By modeless 2 hours ago
If you've ever used Claude Code + Plan mode, you know exactly this is true.
By plantain 2 hours ago
> LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones.
blank stare
By justatdotin 37 minutes ago
Actually really promising stuff. I think a lot of the recent advances in the last 6mo-1yr are in the outer loop (for ex. the Google Deepthink model which got IMO gold and the OAI IMO gold both use substantive outer-loop search strategies [though it's unclear what these are] to maybe parallelize some generation/verification process). So there's no reason why we can't have huge advances in this area even outside of the industry labs, in my view (I'm uninformed in general so take this comment with a large grain of salt).
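Purely as illustration, here's one guess at what such an outer loop could look like (speculative; `call_llm` and `verify_score` are made-up stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

def outer_loop_search(problem, n_candidates=16):
    # Fan out many candidate solutions in parallel.
    with ThreadPoolExecutor(max_workers=n_candidates) as pool:
        candidates = list(pool.map(
            lambda _: call_llm(f"Propose a full solution to: {problem}"),
            range(n_candidates),
        ))
    # Verification could be another model, a proof checker, unit tests...
    scored = [(verify_score(problem, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda sc: sc[0])
    return best
```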
By Davidzheng 3 hours ago
This sounds interesting.
I would really like to read a full research paper made out of this, which describes the method in more detail, gives some more examples, does more analysis on it, etc.
Btw, this uses LLMs purely at the text level? Why not images? Most of these patterns are easy to detect at the image level, but I assume when presented as text it's much harder.
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
I think this argument is a bit flawed. Yes, you can define AGI as being better than (average) humans in every possible task. But isn't this very arbitrary? Isn't it more reasonable to expect that different intelligent systems (including animals and humans) can have different strengths, and unreasonable to expect that one system is really better at everything?

Maybe it's more reasonable to define ASI that way, but even for ASI, if a system is already better in a majority of tasks (but not necessarily in every task), I think this should already count as ASI. Maybe really being better in every possible task is just not possible. You could design a task that is very specifically tailored to human intelligence.
By albertzeyer an hour ago
I suspect (to use the language of the author) current LLMs have a bit of a "reasoning dead zone" when it comes to images. In my limited experience they struggle with anything more complex than "transcribe the text" or similarly basic tasks. For example, I tried to create an automated QA agent with Claude 3.5 Sonnet to catch regressions in my frontend, and it will look at an obviously broken frontend component (using Puppeteer to drive and screenshot a headless browser) and confidently proclaim it's working correctly, often making up a supporting argument too. I've had much more success passing the code for the component and any console logs directly to the agent in text form.
My memory is a bit fuzzy, but I've seen another QA agent that takes a similar approach of structured text extraction rather than using images. So I suspect I'm not the only one finding image-based reasoning an issue. Could also be for cost reasons though, so take that with a pinch of salt.
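For what it's worth, the text-first version is roughly this simple (sketch only; `call_llm` is a stand-in for whatever chat API you use):

```python
def qa_check_component(component_path, console_logs):
    # Give the model the component source plus captured console output
    # instead of a screenshot.
    with open(component_path) as f:
        source = f.read()
    prompt = (
        "You are a QA reviewer. Given this frontend component and the "
        "browser console output from rendering it, say whether it is "
        "broken and why.\n"
        f"--- component ---\n{source}\n"
        f"--- console ---\n{console_logs}\n"
        "Answer 'OK' or 'BROKEN: <reason>'."
    )
    return call_llm(prompt)
```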
By bubblyworld 18 minutes ago
Those are bold claims
By jokoon 2 hours ago
Congrats, this solution resembles AlphaEvolve. Text serves as the high-level search space, and genetic mixing (MAP-Elites in AE) merges attempts at lower levels.
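For reference, the core of a MAP-Elites-style archive is tiny; here's a toy sketch (`behavior_bin`, `fitness`, and `call_llm` are placeholders):

```python
import random

archive = {}  # behavior bin -> (fitness, candidate)

def insert(candidate):
    # Candidates are binned by a behavior descriptor; only the fittest
    # per bin survives, keeping diversity while climbing fitness.
    b = behavior_bin(candidate)   # placeholder, e.g. (length, strategy)
    f = fitness(candidate)        # placeholder verifier score
    if b not in archive or f > archive[b][0]:
        archive[b] = (f, candidate)

def step():
    # Pick a random elite, ask the LLM to mutate it, try to re-insert.
    # (Assumes the archive was seeded with at least one initial attempt.)
    _, parent = random.choice(list(archive.values()))
    child = call_llm(f"Improve this attempt:\n{parent}")  # placeholder
    insert(child)
```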
By pilooch 4 hours ago