Hacker News

LLM Year in Review

324 points by swyx a day ago

121 Comments
For me, Claude Code was the most impressive innovation this year. Cursor was a good proof of concept but Claude Code is the tool that actually got me to use LLMs for coding.
The kind of code that Claude produces looks almost exactly like the code I would write myself. It's like it's reading my mind. This is a game changer because I can maintain the code that Claude produces.
With Claude Code, there are no surprises. I can pretty much guess what its code will look like 90% to 95% of the time but it writes it a lot faster than I could. This is an amazing innovation.
Gemini is quite impressive as well. Nano banana in particular is very useful for graphic design.
I haven't tried Gemini with coding yet but TBH, Claude Code does such a great job; if I could code any faster, I would get decision fatigue. I don't like rushing into architecture or UX decisions. I like to sit on certain decisions for a day or two before starting implementation. Once you start in a particular direction, it's hard to undo and you may try to double down on the mistake due to sunk cost fallacy. I try hard to avoid that.

By socketcluster 13 hours ago
I've used all of these tools and for me Cursor works just as well but has tabs, easy ways to abort or edit prompts, great visual diff, etc...
Someone sell me on Claude Code, I just don't get it.

By thefourthchime a few seconds ago
Do you guys all work 100% on open source? Or are you uploading bits of your copyrighted code for future training to Anthropic? I hate patents so copyright is the only IP protection I have.

By yread 34 minutes ago
I don't even see much reason to use Cursor. I am used to IntelliJ IDEA, so I just downloaded the Claude Code plugin and basically now I use the IDE only for navigating in the code, finding references and reviewing the code. I can't even remember the last time I wrote more than 2 lines of code. Claude Code has catapulted my performance at least 5x if not more. And now that the cost of writing tests is so minimal I am also able to achieve much better (and meaningful!) test coverage too. AI agents are where the most productivity is. I just create a plan with Claude, iterate over it, ask questions, then let it implement the plan, review, ask it to make some adjustments. No manual writing of code at all. Zero.

By Daniel_sk 4 hours ago
IntelliJ has its own Claude integration too, but it does not use your Claude subscription: https://blog.jetbrains.com/ai/2025/09/introducing-claude-age...

By esafak 3 hours ago
Nano Banana Pro is legitimately an insane tool if you know how to use it. I still can’t believe they released it in the wild

By spaceman_2020 4 hours ago
What is there to using it more than asking it to generate an image of something?

By rolymath an hour ago
For one: modifying existing images in interesting ways ... adding characters, removing elements, altering or enhancing certain features, creating layers, and so on. Things that would take a while on Photoshop, done almost instantly. Really unlocks the imagination.

By kakapo5672 7 minutes ago
It's decent for things that would take a long time in Photoshop. Like most AI, sometimes it works great and sometimes it goes off the rails completely. Most recently, I used it to process some drone photos that were taken during late fall for the purpose of marketing a commercial property. All of the trees/grass/plants were brown, so I told it to make it look like the photos were taken during the summer but not to change anything else. It did a very good job, not just changing the color, but actually adding leaves to the plants and trees in a way that looked very realistic. It did in seconds what would have taken one of my team members hours, leaving them to work on other more pressing projects.

By IAmGraydon 2 hours ago
I first got into agentic coding properly with the GLM coding plan (it's like $2/month), but I found myself very consistently asking Claude to make the code more elegant and readable. At which point I realized I was being silly and just switched to Claude Code.
(GLM etc. get surprisingly close with good prompting but... $0.60/day to not worry about that is a no brainer.)

By andai 9 hours ago
I don’t have much time to evaluate tools every month and I have settled on Cursor. I’m curious what I’m missing when using the same models?

By tarsinge 11 hours ago
You are missing an entire agentic experience. And I wouldn't call it vibe coding for an engineer; you're more or less empowered to truly orchestrate the development of your system.
Cursor has an agent, but that's like whoever else tried to copy the Model T while Ford was developing it.

By ramoz 5 hours ago
This hasn’t been my experience at all. I’m finding Cursor with Opus 4.5 and plan mode to be just as capable as CC. And I prefer the UI/UX.

By senordevnyc 2 hours ago
I have only compared Claude Code with Crush and a tool of my own design. In my experience, Claude Code is optimized for giant codebases and long tasks. It loves launching dozens of agents in parallel. So it's a bit heavy for smaller, surgical stuff, though it works decently for that too.
If you mostly have small codebases that fit in context, or make many small changes interactively, it's not really great for that (though it can handle it too). It'll just be spending most of its time poking around the codebase, when the whole thing should have just been loaded... (Too bad there's no small repo mode. I made a startup hook that just dumps cat dir into context, but yeah, should be a toggle.)
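A hook like the one described could be as small as the sketch below. This is a hypothetical stand-in, not the commenter's script and not a documented Claude Code feature: `dump_repo`, the skip list, and the size cap are all assumptions for illustration. It walks a directory and concatenates readable text files into one blob that could be piped into context.

```python
import os

# Hypothetical defaults; tune for your own repo.
SKIP_DIRS = {".git", "node_modules", "__pycache__"}
MAX_CHARS = 200_000  # rough cap so the dump stays within a context window


def dump_repo(root: str, max_chars: int = MAX_CHARS) -> str:
    """Concatenate text files under root, each prefixed with its relative path."""
    parts = []
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as fh:
                    text = fh.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            chunk = f"=== {os.path.relpath(path, root)} ===\n{text}\n"
            if total + len(chunk) > max_chars:
                return "".join(parts)  # stop once the cap would be exceeded
            parts.append(chunk)
            total += len(chunk)
    return "".join(parts)
```

Calling `print(dump_repo("."))` from a startup script would emit the whole (small) repo at once; the cap is there because this only makes sense when everything fits in context.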

By andai 9 hours ago
You're not missing much. You can generally use Cursor like Claude Code for normal day to day use. I prefer Cursor because I like reviewing changes in an IDE, and I like being able to switch to the current SOTA model.
Though for more automated work, one thing you miss with Cursor is sub agents. And then to a lesser extent skills (these are pretty easy to emulate in other tools). I'm sure it's only a matter of time though.

By afro88 8 hours ago
Claude Code's VS Code integration is very easy to set up and pretty helpful if you want to see/review changes in an IDE.

By Ozzie_osman 5 hours ago
The big limitation is that you have to approve/disapprove at every step. With Cursor you can iterate on changes and it updates the diffs until you approve the whole batch.

By ollysb 5 hours ago
There is an auto accept diffs mode

By fzzzy 2 hours ago
If you switch to Codex you will get a lot of tokens for $200, enough to more consistently use high reasoning as well. Cursor is simply far more expensive so you end up using less or using dumber models.
Claude Code is overrated as it uses many of its features and modalities to compensate for model shortcomings that are not as necessary for steering state of the art models like GPT 5.2

By wahnfrieden 10 hours ago
I think this is a total misunderstanding of Anthropic’s place in the AI race. Opus 4.5 is absolutely a state of the art model. I won’t knock anyone for preferring Codex, but I think you’re ignoring official and unofficial benchmarks.
See: https://artificialanalysis.ai

By MrOrelliOReilly 9 hours ago
> Opus 4.5 is absolutely a state of the art model.
> See: https://artificialanalysis.ai
The field moves fast. Per artificialanalysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.

By woadwarrior01 7 hours ago
Totally, however OP's point was that Claude had to compensate for deficiencies versus a state of the art model like GPT 5.2. I don't think that's correct. Whether or not Opus 4.5 is actually #1 on these benchmarks, it is clearly very competitive with the other top-tier models. I didn't take "state of the art" here to narrowly mean #1 on a given benchmark, but rather to mean near or at the frontier of current capabilities.

By MrOrelliOReilly 4 hours ago
One thing to remember when comparing ML models of any kind is that single value metrics obscure a lot of nuance and you really have to go through the model results one by one to see how it performs. This is true for vision, NLP, and other modalities.

By gessha 3 hours ago
https://lmarena.ai/leaderboard/webdev
LM Arena shows Claude Opus 4.5 on top

By dr_dshiv 5 hours ago
is x-high fast enough to use as a coding agent?

By fzzzy 2 hours ago
https://x.com/giansegato/status/2002203155262812529/photo/1
https://x.com/METR_Evals/status/2002203627377574113
> Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
What an insane take for anybody who uses these models daily.

By ramoz 4 hours ago
What am I missing? As suspicious as benchmarks are, your link shows GPT 5.2 to be superior.
It is also out of date as it does not include 5.2 Codex.
Per my point about steerability compensated for by modalities and other harness features: Opus 4.5 scores 58% while GPT 5.2 scores 75% for the instruction following benchmark in your link! Thanks for the hard evidence - GPT 5.2 is 30% ahead of Opus 4.5 there. No wonder Claude Code needs those harness features for the user to manually rein in control over its instruction following capability.
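The "30%" figure depends on which baseline is used. A quick check of the two scores quoted above (58% and 75%); the function and key names are only for illustration:

```python
def compare(a: float, b: float) -> dict:
    """Compare two benchmark scores given in percent, with b the higher one."""
    return {
        "points": b - a,                    # absolute gap in percentage points
        "b_ahead_pct": (b - a) / a * 100,   # how far b is ahead, relative to a
        "a_behind_pct": (b - a) / b * 100,  # how far a is behind, relative to b
    }


gap = compare(58, 75)
print({k: round(v, 1) for k, v in gap.items()})
```

That is a 17-point gap: roughly 29% ahead when measured against the lower score, and roughly 23% behind when measured against the higher one.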

By wahnfrieden 2 hours ago
I disagree, the claude models seem the best at tool calling, opus 4.5 seems the smartest, and claude code (+ claude model) seems to make good use of subagents and planning in a way that codex doesn't

By ccmcarey 10 hours ago
Opus 4.5 is so bad at instruction following (30% worse per benchmark shared above) that it requires a manual toggle for plan mode.
GPT 5.2 simply obeys instruction to assemble a plan and avoids the need to compensate for poor steerability that would require the user to manually manage modalities.
Opus has improved though so the plan mode is less necessary than it was before, but it is still far behind state of art steerability.

By wahnfrieden 2 hours ago
I noticed that despite really liking Karpathy and the blog, I am kind of wincing/involuntarily reacting to the LLM-like "It's not X, it's Y"-phrases:
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
> it's not just about the image generation itself, it's about the joint capability coming from text generation
There would be no reaction from me on this 3 years ago, but now this sentence structure is ruined for me

By augment_me 11 hours ago
You’re absolutely right!
Jk jk, now that you pointed it out I can’t unsee it.

By karpathy 2 hours ago
I used to use a lot of em dashes normally in my writing - they were my go-to replacements for commas and semicolons
But I had to change how I write because people started calling my writing “AI generated”

By spaceman_2020 4 hours ago
2026 will be the year of the ;

By athrowaway3z 2 hours ago
Please no that's my go to

By vatsachak an hour ago
so you switched to using hyphens instead?

By fzzzy 2 hours ago
Yeah, came to read Karpathy's thoughts, but might as well ask an LLM myself..

By matsemann 5 hours ago
I hated these sentences way before LLMs, at least in the context of an explanation.
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
This type of sentence, I call rhetorical fat. Get rid of this fat and you obtain a boring sentence that repeats what has been said in the previous one.
Not all rhetorical fats are equal, and I must admit I find myself eyerolling on the "little spirit" part more than about the fatness.
I understand the author wants to decorate things and emphasize key elements, and the hate I feel is only caused by the incompatible projection of my ideals to a text that doesn't belong to me.
> it's not just about the image generation itself, it's about the joint capability coming from text generation.
That's unjustified conceptual stress.
That could be a legitimate answer to a question ("No, no, it's not just about that, it's more about this"), but it's a text. Maybe the text wants you to be focused, maybe the text wants to hype you; this is the shape of the hype without the hype.
"I find image generation is cooler when paired with text generation."

By d-lisp 9 hours ago
It is not a decoration. Karpathy juxtaposes ChatGPT (which feels like a "better google" to most people) to Claude Code, which, apparently, feels different to him. It's a comparison between the two.
You might find this statement non-informative, but without two parts there's no comparison. That's really the semantics of the statement which Karpathy is trying to express.
ChatGPT-ish "it's not just" is annoying because the first part is usually a strawman, something reader considers trite. But it's not the case here.

By killerstorm 5 hours ago
Indeed, I was probably grumpy at the time I wrote the comment. I do find some truth in it still.
You're right! The strawman theory is based.
But I think there's more to it: I dislike the structure of these sentences (which I find a bit sensationalist for nothing, I don't know, maybe I am still grumpy).

By d-lisp 5 hours ago
Karpathy should go back to what he does best: educating people about AI on a deep level. Running experiments and sharing how they work, that sort of stuff. It seems lately he is closer to an influencer who reviews AI-based products. Hopefully it is not too late to go back.

By amelius 4 hours ago
I feel this review stuff is more like a side thing / pastime for him. Look at nanochat for example. My impression is that those are the things he still spends most of his energy on.
After all, he's been an "influencer" for a long time, starting from the "software 2.0" essay.

By flakiness 3 hours ago
I cannot unsee this anymore and it ruins the whole internet experience for me

By yard2010 7 hours ago
Same here, had to configure ChatGPT to stop making these statements. Also had to configure a bunch of other stuff to make it bland when answering questions.

By another_twist 9 hours ago
The way to make AI not sound like ChatGPT is to use Claude.
I realized that's what bothered me. It's not "oh my god, they used ChatGPT." But "oh my god, they couldn't even be bothered to use Claude."
It'll still sound like AI, but 90% of the cringe is gone.
If you're going to use AI for writing, it's just basic decency to use the one that isn't going to make your audience fly into a fit of rage every ten seconds.
That being said, I feel very self conscious using emdashes in the current decade ;)

By andai 8 hours ago
I love em dashes—they basically indicate a more deliberate pause than a … without the tight vibes of a semicolon.

By dr_dshiv 5 hours ago
I don't think I've ever noticed someone use an emdash until ChatGPT appeared

By ionwake 8 hours ago
https://xkcd.com/3126/
I mostly use them in Telegram because it auto converts -- into emdash. They are a pain to type everywhere else though!

By andai 4 hours ago
Same, I cringe when I read this structure.

By huevosabio 10 hours ago
It's not text - it's clickbait distilled to grammar.

By nathias 9 hours ago
I appreciate Andrej’s optimistic spirit, and I am grateful that he dedicates so much of his time to educating the wider public about AI/LLMs. That said, it would be great to hear his perspective on how 2025 changed the concentration of power in the industry, what’s happening with open-source, local inference, hardware constraints, etc. For example, he characterizes Claude Code as “running on your computer”, but no, it’s just the TUI that runs locally, with inference in the cloud. The reader is left to wonder how that might evolve in 2026 and beyond.

By thoughtpeddler 19 hours ago
The CC point is more about the data and environmental and general configuration context, not compute and where it happens to run today. The cloud setups are clunky because of context and UI/UX user-in-the-loop considerations, not because of compute considerations.

By karpathy 19 hours ago
Agree with the GP, though -- you ought to make that clearer. It really reads like you're saying that CC runs locally, which is confusing since you obviously know better.

By CamperBob2 15 hours ago
I think we need to shift our mindset on what an agent is. The LLM is a brain in a vat connected far away. The agent sits on your device, as a mech suit for that brain, and can pretty much do damn near anything on that machine. It's there, with you. The same way any desktop software is.

By ramoz 5 hours ago
Yeah, I made some edits to clarify.

By karpathy 13 hours ago
From what I can gather, llama.cpp supports Anthropic's message format now[1], so you can use it with Claude Code[2].
[1]: https://github.com/ggml-org/llama.cpp/pull/17570
[2]: https://news.ycombinator.com/item?id=44654145

By magicalhippo 19 hours ago
One of the most interesting coding agents to run locally is actually OpenAI Codex, since it has the ability to run against their gpt-oss models hosted by Ollama.
```
  codex --oss -m gpt-oss:20b
```
Or 120b if you can fit the larger model.

By simonw 17 hours ago
What do you find interesting about it, and how does it compare to commercial offerings?

By AlexCoventry 16 hours ago
It's rare to find a local model that's capable of running tools in a loop well enough to power a coding agent.
I don't think gpt-oss:20b is strong enough to be honest, but 120b can do an OK job.
Nowhere NEAR as good as the big hosted models though.

By simonw 15 hours ago
Think of it as the early years of UNIX & PC. Running inference and tools locally and offline opens doors to new industries. We might not even need a client/server paradigm locally. The LLM is just a probabilistic library we can call.

By ontouchstart 13 hours ago
Thanks.

By AlexCoventry 13 hours ago
What he meant was, agents will probably not be these web abstractions that run in deployed services (langchain, crew); agents meaning the Harnesses (software wrapper) specifically that call the LLM API.
It runs on your computer because of its tooling. It can call Bash. It can literally do anything on the operating system and file system. That's what makes it different. You should think of it like a mech suit. The model is just the brain in a vat connected far away.
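To make the "mech suit" picture concrete, here is a minimal sketch of a harness loop, under loud assumptions: the action format, `agent_loop`, and `scripted_model` are invented for illustration and do not reflect how Claude Code or any real agent is implemented. The model is any function that reads the transcript and returns the next action; the loop itself, and the shell access, live on the local machine.

```python
import subprocess
from typing import Callable

# A "model" here is any function from the transcript so far to the next action.
# In a real harness this would be a call to a hosted or local LLM API.
Model = Callable[[list[str]], dict]


def run_bash(command: str) -> str:
    """Local tool: run a shell command and return its combined output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


def agent_loop(model: Model, task: str, max_steps: int = 5) -> list[str]:
    """Ask the model for an action, run it locally, feed the result back."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(transcript)
        if action["type"] == "done":
            transcript.append(f"answer: {action['text']}")
            break
        output = run_bash(action["command"])
        transcript.append(f"$ {action['command']}\n{output}")
    return transcript


def scripted_model(transcript: list[str]) -> dict:
    """Stand-in for the remote 'brain': request one command, then finish."""
    if len(transcript) == 1:
        return {"type": "bash", "command": "echo hello"}
    return {"type": "done", "text": "ran one command"}
```

`agent_loop(scripted_model, "say hello")` runs `echo hello` on the local machine and then stops. Swapping `scripted_model` for a function that calls a remote API changes where the thinking happens, not where the commands run.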

By ramoz 15 hours ago
The section on Claude Code is very ambiguously and confusingly written, I think he meant that the agent runs on your computer (not inference) and that this is in contrast to agents running "on a website" or in the cloud:
> I think OpenAI got this wrong because I think they focused their codex / agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of localhost. [...] CC got this order of precedence correct and packaged it into a beautiful, minimal, compelling CLI form factor that changed what AI looks like - it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer. This is a new, distinct paradigm of interaction with an AI.
However, if so, this is definitely a distinction that needs to be made far more clearly.

By D-Machine 19 hours ago
Well Microsoft had their "localhost" AI before CC but that was a ghost without a clear purpose or skill.

By realcul 15 hours ago
Excellent, more grounded review. A few questions:
> LLMs are emerging as a new kind of intelligence, simultaneously a lot smarter than I expected and a lot dumber than I expected
Isn't this concerning? How can we know which one we get? In the realm of code it's easier to tell when mistakes are being made.
> regular people benefit a lot more from LLMs compared to professionals, corporations and governments
We thought this would happen with things like AppleScript, VB, visual programming. But instead, AI is currently used as a smarter search engine. The issue is that's also the area where it hallucinates the most. What do you think is the solution?

By cheesecompiler 2 hours ago
> In the same way, LLMs should speak to us in our favored format - in images, infographics, slides, whiteboards, animations/videos, web apps, etc.
You think every Electron app out there re-inventing application UX from scratch is bad, wait until LLMs are generating their own custom UX for every single action for every user for every device. What does command-W do in this app? It's literally impossible to predict, try it and see!

By jkubicek 17 hours ago
On the other side of the spectrum, I see some of the latest agents, like Codex, take care to get accessibility right -- something not even many humans bother to do.

By johnfn 11 hours ago
It's an extension of how I've noticed that AIs will generally write very buttoned-down, cross-the-ts-and-dot-the-is code. Everything gets commented, every method has a try-catch with a log statement, every return type is checked, etc. I think it's a consequence of them not feeling fatigue. These things (accessibility included) are all things humans generally know they 'should' do, but there never seems to be enough time in the day; we'll get to it later when we're less tired. But the ghost in the machine doesn't care. It operates at the same level all the time

By becquerel 9 hours ago
>our favored format - in images, infographics, slides, whiteboards, animations/videos, web apps, etc
If you look at how humans actually communicate I'd guess #1 is text/speech, #2 pictures

By tim333 6 hours ago
But that's exactly what an LLM solved.
It's the best ui ever.
It understands a lot of languages and abstract concepts.
It will not be necessary at all to let LLM generate random uis.
I'm not a native English speaker. I sometimes just throw in a German word and it just works.

By Aiisnotabubble 7 hours ago
The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR is the mental model I didn't know I needed to explain the current state of jagged intelligence. It perfectly articulates why trust in benchmarks is collapsing; we aren't creating generally adaptive survivors, but rather over-optimizing specific pockets of the embedding space against verifiable rewards.
I’m also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.

By starchild3001 17 hours ago
> The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR
I don't see these descriptions as very insightful.
The difference between general/animal intelligence and jagged/LLM intelligence is simply that humans/animals really ARE intelligent (the word was created to describe this human capability), while LLMs are just echoing narrow portions of the intelligent output of humans (those portions that are amenable to RLVR capture).
For an artificial intelligence to be intelligent in its own right, and therefore be generally intelligent, it would need - like an animal - to be embodied (even if only virtually), autonomous, predicting the outcomes of its own actions (not auto-regressively trained), learning incrementally and continually, built with innate traits like curiosity and boredom to put and keep itself in learning situations, etc.
Of course not all animals are generally intelligent - many (insects, fish, reptiles, many birds) just have narrow "hard coded" instinctual behaviors, but others like humans are generalists whom evolution has therefore honed for adaptive lifetime learning and general intelligence.

By HarHarVeryFunny 4 hours ago
> I’m also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.
We should keep in mind that currently our LLM use is subsidized. When the money dries up and we have to pay the real prices I’ll be interested to see if we can still consider whipping up one time apps as basically free

By fourside 2 hours ago
I've been doing it for months, it's lovely
https://tech.lgbt/@graeme/115749759729642908
It's a stack based on finishing the job Jupyter started. Fences as functions, callable and composable.
Same shape as an MCP. No training required, just walk them through the patterns.
Literally, it's spatially organized. Turns out a woman named Mrs Curwen and I share some thoughts on pedagogy.
There does in fact exist a functor that maps 18th century piano instruction to context engineering. We play with it

By graemefawcett 15 hours ago
It’s funny how every podcaster/public AI figure is so certain text as a UI will go away and it’s not going anywhere.

By lysecret 7 hours ago
A few days ago I was trying to unsubscribe from a service (notably an AI 3D modeling tool that I was curious about).
I spent 5 minutes trying to find a way to unsubscribe and couldn't. Finally, I found it buried in the plan page as one of those low-contrast ellipses on the plan card.
Instead of unsubscribing me or taking me to a form, it opened a convo with an AI chatbot with a preconfigured "unsubscribe" prompt. I have never felt angrier with a UI; I had to waste more time talking to a robot before it would render the unsubscribe button in the chat.
Why would we bring the most hated feature of automated phone calls to apps? As a frontend engineer I am horrified by these trends.

By devalexwells an hour ago
It's probably increased during my lifetime. People used to talk, now they sit and text into smartphones.

By tim333 6 hours ago
There might be some confusion about the transition to what some call post-literate era: era where text is not the primary medium. That’s not necessarily bad because you get the advantages of other mediums - oral and visual but it is something to keep in mind.

By gessha 3 hours ago
I'm a bit skeptical that a post-literate era is happening. I gather it appears in some sci-fi but I don't see much sign in reality. I mean here we are on a text only site. If anything we seem to be heading for a 100% literate society. Literacy graphs here: https://ourworldindata.org/grapher/cross-country-literacy-ra...

By tim333 2 hours ago
I don’t think the post-literate era means that text will disappear. I think it’s just not going to be dominant anymore but I also have my reservations since I do prefer the text medium.

By gessha 2 hours ago
Notable omission: 2025 is also when the ghosts started haunting the training data. Half of X replies are now LLMs responding to LLMs. The call is coming from inside the dataset.

By victorbuilds 21 hours ago
Any tips to spot this? I want to avoid arguing with an X bot.

By vlod 18 hours ago
Really easy: don't argue on the internet. The approach has many benefits.

By shtack 16 hours ago
Also, don't use X.

By jckahn 14 hours ago
also, please just do not use X

By bdangubic 14 hours ago
Ok, fine, but do you have a better way to build a bot following and expose oneself to trending MAGA memes?

By dr_dshiv 5 hours ago
I would love Andrej's take on the fast models we got this year. Gemini 3 flash and Grok 4 fast have no business being as good + cheap + fast as they are. For Andrej's prediction about LLMs communicating with us via a visual interface we're going to need fast models, but I feel like AI twitter/HN has mostly ignored these.

By mips_avatar 17 hours ago
Just guessing here, but these small models may well be essentially distillations of larger ones, with this being where their power comes from. e.g. Use a large model to generate synthetic reasoning traces, then train a small model on those.

By HarHarVeryFunny 4 hours ago
check out Sasha Luccioni

By gnerd00 14 hours ago
Do you have a link to anything they wrote about this?

By mips_avatar 14 hours ago
I think one of the things that is missing from this post is engaging a bit in trying to answer: what are the highest priority AI-related problems that the industry should seek to tackle?
Karpathy hints at one major capability unlock being UI generation, so instead of interacting with text the AI can present different interfaces depending on the kind of problem. That seems like a severely underexplored problem domain so far. Who are the key figures innovating in this space so far?
In the most recent Demis interview, he suggests that one of the key problems that must be solved is online / continuous learning.
Aside from that, another major issue is probably reducing hallucinations and increasing reliability. Ideally you should be able to deploy an LLM to work on a problem domain, and if it encounters an unexpected scenario it reaches out to you in order to figure out what to do. But for standard problems it should function reliably 100% of the time.

By TheAceOfHearts 21 hours ago
Google is doing that with A2UI. LLM will be able to decide how to present info to the user.

By lukax 5 hours ago
The bit about o3 being the turning point is very interesting. I heard someone say that o3 (or perhaps the cheaper o4-mini) should have been called gpt-5, and that people would have been mind blown. Instead it kind of went under the radar as far as the mainstream goes.
Whereas we just got the incremental progress with gpt-5 instead and it was very underwhelming. (Plus like 5 other issues at launch, but that's a separate story ;)
I'm not sure if o4-mini would have made a good default gpt though. (Most use is conversational and its language is very awkward.) So they could have just called it gpt-5 pro or something, and put it on the $20 tier. I don't know.

By andai 9 hours ago
I agree with this fwiw, for many months I talked to people who never used o3 and didn’t know what it was because it sounded weird. Maybe it wasn’t obvious at the time but that was a good major point release to make then.

By karpathy 2 hours ago
> I like this version of the meme for pointing out that human intelligence is also jagged in its own different way.
The idea of jaggedicity seems useful to advancing epistemology. If we could identify the domains that have useful data that we fail to extract, we could fill those holes and eventually become a general intelligence ourselves. The task may be as hard as making a list of your blind spots. But now we have an alien intelligence with an outside perspective. While making AI less jagged it might return the favor.
If we keep inventing different kinds of intelligence the sum of the splats may eventually become well rounded.

By delichon 18 hours ago
I don't think it will become well rounded because that is not cost sensitive. Intelligence is sensitive to cost, it is the core constraint shaping it. Any action has a cost - energy, materials, time, opportunity or social. Intelligence is solving the cost equation, if we can't solve it we die. Cost is also why we specialize, in a group we can offload some intelligence to others. LLMs also have their own costs, and are shaped by it into some kind of jagged intelligence, they are no spherical cows either.

By visarga 12 hours ago
> In this world view, nano banana is a first early hint of what that might look like.
What is he referring to here? Is nano banana not just an image gen model? Is it because it's an LLM-based one, and not diffusion?

By mvkel 18 hours ago
What's interesting about Nano Banana (and even more so video models like Veo 3) is that they act as a weird kind of world model when you consider that they accept images as input and return images as output.
Give it an image of a maze, it can output that same image with the maze completed (maybe).
There's a fantastic article about that for image-to-video models here: https://video-zero-shot.github.io/
> We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more.

By simonw 17 hours ago
I think he is referring to capability, not architecture, and saying that NB is at the point that it is suggestive of the near-future capability of using GenAI models to create their own UI as needed.
NB (Gemini 2.5 Flash Image) isn't the first major-vendor LLM-based image gen model, after all; GPT Image 1 was first.

By dragonwriter 17 hours ago
Here's the source for the jagged spiky intelligence diagram:
https://x.com/colin_fraser/status/1994235521812328695
https://karpathy.bearblog.dev/the-space-of-minds/

By andai 8 hours ago
Beyond graduating students, I see model labs as “accelerators/incubators” bundling, launching, and productizing observed ideas that gain traction. The sheer strength of their platforms, the number of eyes watching them, near-zero marginal costs, and seemingly unlimited budgets mean that only slow decision-making can prevent them from becoming the next Amazons of everything.

By nkko 10 hours ago
Something I’ve been thinking about is how, as end-stage users (e.g. building our own “thing” on top of an LLM), we can broadly verify it’s doing what we need without benchmarks. Does a set of custom evals built out over time solve this? Is there more we can do?

By dandelionv1bes 9 hours ago
xposted to https://x.com/karpathy/status/2002118205729562949

By swyx a day ago
And also accessible sans login via https://xcancel.com/karpathy/status/2002118205729562949 .

By CamperBob2 15 hours ago
LLMs still need to bring clear added value to enterprise and corporate work; otherwise, they remain a geek’s toy.
Big media agencies that claim to use AI rely on strong creative teams who fine-tune prompts and spend weeks doing so. Even then, they don’t fully trust AI to slice long videos into shorter clips for social media.
Heavy administrative functions like HR or Finance still don’t get approval to expose any of their data to LLMs.
What I’m trying to say is that we are still in the early stages of LLM development, and as promising as this looks, it’s still far from delivering the real value that is often claimed.

By alexgotoi 10 hours ago
I think their non-deterministic nature is what’s making them difficult to adopt. It’s hard to train somebody in the old way of “if you see this, do this” because when you call the LLM twice you most likely get different results.
It took a long time to computerize businesses and it might take some time to adopt/adapt to LLMs.

By gessha 3 hours ago
Friendly reminder: There is no ghost in the machine. It is a system executing code, not a being having thoughts. Let’s admire the tool without projecting a personality onto it.

By distalx 7 hours ago
For me, that’s kind of the point. It’s similar to how the characters in a novel don’t really exist, and yet you can’t really discuss what happens in a novel without pretending that they do. It doesn’t really make sense to treat the author’s motivations and each character’s motivations as the same.
Similarly, we’re all talking to ghosts now, which aren’t real, and yet there is something there that we can talk about. There are obvious behavioral differences depending on what persona the LLM is generating text for.
I also like the hint of danger in “talking to ghosts.” It’s difficult to see how a rational adult could be in any danger from just talking, but I believe the news reports that some people who get too deep into it get “possessed.”

By skybrian 4 hours ago
Consciousness is weird and nobody understands it. There is no good reason to assume that these systems have it. But there is also no good reason to rule it out.

By ngruhn 4 hours ago
You sound as if you have grounds for certainty about this. What are they?

By squidbeak 4 hours ago
That’s the old way of thinking about it. There is a new way.

By dr_dshiv 4 hours ago
find on page:slop=0

By metalman 3 hours ago
Vibe coding is sufficient for job hoppers who never finish anything and leave when the last 20% has to be figured out. Much easier to promote oneself as an expert and leave the hard parts to other people.

By bgwalter 19 hours ago
I’ve found incredible productivity gains writing (vibe coding) tools for myself that will never need to be “productionised” or even used by another person. Heck, even I will probably never use the latest log retrieval tool, which exists purely for Claude Code to invoke it. There is a ton of useful software yet to be written for which there _is_ no “last 20%”.

By zingar 18 hours ago
These tools are so useful and make you so much more "productive" that you don't think anyone else would want to pay anything for them huh? Did your boss at least give you a big raise for your "productivity" increase, or maybe lay off some of your underperforming coworkers bc you are just so much better now?

By diamond559 10 hours ago
Not all software is meant to be open source, in production, and working on 100 platforms.
Sometimes the point of the software is to make an app with 2 buttons for your mom, to help her do her grocery shopping more easily.

By augment_me 11 hours ago
Do you mean vibe coding as in producing unreviewed code with LLMs and prompting at it until it appears to work, or vibe coding as a catch-all for any time someone uses AI assistance to help them write code?

By simonw 17 hours ago
tl;dr seems like LLMs are maturing on the product side and for day-to-day usage

By ausbah 16 hours ago