I would argue that it's going to be the opposite. At re:Invent, one of the popular sessions was about creating a trio of SRE agents: one that did nothing but read logs and report errors, one that analyzed and triaged the errors and proposed fixes, and one that did the work and submitted PRs to your repo.
Then, as part of the session, you would artificially introduce a bug into the system and then run into it in your browser. You'd see the failure happen in the browser, and looking at the CloudWatch logs you'd see the error get logged.
Two minutes later, the SRE agents had the bug fixed and ready to be merged.
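For a sense of scale, the wiring for that kind of trio is not much more than a polling loop; a rough sketch, where the log, LLM, and repo clients are hypothetical stand-ins rather than whatever the re:Invent session actually used:

    import time

    def watch_logs(log_client) -> list[dict]:
        """Agent 1: read recent log events and report anything that looks like an error."""
        return [e for e in log_client.fetch_recent() if e.get("level") == "ERROR"]

    def triage(errors: list[dict], llm) -> dict | None:
        """Agent 2: analyze the errors, identify a root cause, and propose a fix."""
        if not errors:
            return None
        prompt = ("Given these errors, identify the root cause and propose a patch:\n"
                  + "\n".join(e["message"] for e in errors))
        return llm.complete(prompt)  # e.g. {"patch": ..., "rationale": ...}

    def submit_pr(fix: dict, repo_client) -> str:
        """Agent 3: apply the patch on a branch and open a PR for human review."""
        branch = repo_client.create_branch("sre-agent/autofix")
        repo_client.apply_patch(branch, fix["patch"])
        return repo_client.open_pr(branch, title=fix["rationale"])

    def run_sre_loop(log_client, llm, repo_client, interval: int = 60) -> None:
        while True:
            fix = triage(watch_logs(log_client), llm)
            if fix:
                print("PR opened:", submit_pr(fix, repo_client))
            time.sleep(interval)

The interesting engineering is everything around that loop (permissions, rollback, when a human must look first), which a sketch like this ignores.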
"understand how these systems actually function" isn't incompatible with "I didn't write most of this code". Unless you are only ever a single engineer, your career is filled with "I need to debug code I didn't write". What we have seen over the past few months is a gigantic leap in output quality, such that re-prompting happens less and less. Additionally, "after you've written this, document the logic within this markdown file" is extremely useful for your own reference and for future LLM sessions.
AWS is making a huge, huge bet on this being the future of software engineering, and even though they have their weird AWS-ish lock-in for some of the LLM-adjacent practices, it is an extremely compelling vision, and as these nondeterministic tools get more deterministic supporting functions to help their work, the quality is going to approach and probably exceed human coding quality.
I agree with both you and the GP. Yes, coding is being totally revolutionized by AI, and we don't really know where the ceiling will be (though I'm skeptical we'll reach true AGI any time soon), but I believe there is still an essential element of understanding how computer systems work that is required to leverage AI in a sustainable way.
There is some combination of curiosity about inner workings and precision of thought that has always been essential to becoming a successful engineer. In my very first CS 101 class I remember the professor alluding to two hurdles (pointers and recursion) which a significant portion of the class would not be able to surpass, and they would change majors. Throughout the subsequent decades I saw this pattern again and again with junior engineers, bootcamp grads, etc. There are some people who, no matter how hard they work, can't grok abstraction and unlock a general understanding of computing possibility.
With AI you don't need to know syntax anymore, but to write the right prompts to maintain a system and (crucially) the integrity of its data over time, you still need this understanding. I'm not sure how the AI-native generation of software engineers will develop this without writing code hands-on, but I am confident they will figure it out, because I believe it to be an innate, often pedantic, thirst for understanding that some people have and some don't. This is the essential quality for succeeding in software, both in the past and in the future. Although vibe coding lowers the barrier to entry dramatically, there is a brick wall looming just beyond the toy app/prototype phase for anyone without a technical mindset.
I get it's necessary for investment, but I'd be a lot happier with these tools if we didn't keep making these wild claims, because I'm certainly not seeing 10x the output. When I ask for examples, 90% of the time it's Claude Code (not a beacon of good software anyway, but if nearly everyone points to the same example, it tells you that's the best you can probably expect) and 10% weekend projects, which are cool, but not 10x cool. Opus 4.5 was released in Dec 2025; by this point people should be churning out year-long projects in a month, and I certainly haven't seen that.
I've used them a few times, and they're pretty cool. If it was just sold as that (again, couldn't be, see: trillion dollar investments) I wouldn't have nearly as much of a leg to stand on
Any semi-capable coder could build a Reddit clone by themselves in a week since forever. It's a glorified CRUD app.
The barrier to creating a full-blown Reddit is the huge scaling, not the functionality. But with AWS, Azure, Google Cloud, and backends like S3, CF, etc., this hasn't been a barrier for a decade or more, either.
What I could do in a week is maybe set up an open source clone of reddit (that was written by many people for many months) and customize it a little bit.
And I have a pretty decent career behind me as a software developer, and my peers perceived me as kinda good.
Even capable coders can’t create a Reddit clone in a week. Because it’s not just a glorified CRUD app. And I encourage you to think a bit harder before arguing like that.
Yes you can create a CRUD app in some kind of framework and style it like Reddit. But that’s like putting lines on your lawn and calling it a clone of the Bernabeu.
But even if you were right, the real barrier to building a Reddit clone is getting traction. Even if you went viral and did everything right, you’d still have to wait years before you have the brand recognition and SEO rankings they enjoy.
In what way (that's not related to the difficulty of scaling it, which I already addressed separately)?
The point of my comment was:
"Somebody with AI cloning Reddit in a week is not as special as you make it out to be, all things considered. A Reddit clone is not that difficult, it's basically a CRUD app. The difficult part of replicating it, or at least all the basics of it, is its scaling - and even that wouldn't be as difficult for a dev in 2026, the era of widespread elastic cloud backends".
The Bernabeu analogy handwavingly assumes that Reddit is more challenging than a homegrown clone, but doesn't address in what way Reddit differs from a CRUD app, and how my comment doesn't hold.
And even if it did, it would be moot regarding the main point I make, unless the recent AI-clone also handles those differentiating non-CRUD elements and thus also differs from a CRUD app.
>But even if you were right, the real barrier to building a Reddit clone is getting traction.
True, but not relevant to my point, which is about the difficulty of cloning Reddit coding-wise, not business wise, and whether it's or isn't any great feat for someone using AI to do it.
Calling Reddit a CRUD app isn’t wrong, it’s just vacuous.
It strips away every part that actually makes Reddit hard.
What happens when you sign up?
A CRUD app shows a form and inserts a row.
Reddit runs bot detection, rate limits, fingerprinting, shadow restrictions, and abuse heuristics you don’t even see, and you don’t know which ones, because that knowledge is their moat.
What happens when you upvote or downvote?
CRUD says “increment a counter.”
Reddit says "run a ranking algorithm refined over years, with vote fuzzing, decay, abuse detection, and intentional lies in the UI." The number you see is not the number stored.
What happens when you add a comment?
CRUD says “insert record.”
Reddit applies subreddit-specific rules, spam filters, block lists, automod logic, visibility rules, notifications, and delayed or conditional propagation.
What happens when you post a URL?
CRUD stores a string.
Reddit fingerprints it, deduplicates it, fetches metadata, detects spam domains, applies subreddit constraints, and feeds it into ranking and moderation systems.
Yes, anyone can scaffold a CRUD app and style it like Reddit.
But calling that a clone is like putting white lines on your lawn and calling it the Bernabeu.
You haven’t cloned the system, only its silhouette.
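To make the upvote example concrete: the core of Reddit's old, open-sourced "hot" ranking really is only a few lines; it's everything layered around it that forms the moat. The fuzzing function below is a toy stand-in, not anything Reddit actually ships:

    import random
    from datetime import datetime, timezone
    from math import log10

    EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

    def hot(ups: int, downs: int, created: datetime) -> float:
        """The old open-sourced 'hot' rank: log-scaled score plus a time term."""
        score = ups - downs
        order = log10(max(abs(score), 1))
        sign = 1 if score > 0 else -1 if score < 0 else 0
        seconds = (created - EPOCH).total_seconds() - 1134028003
        return round(sign * order + seconds / 45000, 7)

    def fuzz(ups: int, downs: int) -> tuple[int, int]:
        """Toy vote fuzzing: perturb the displayed counts so the stored totals
        can't be reverse-engineered by spammers. Purely illustrative."""
        noise = random.randint(0, max(1, (ups + downs) // 20))
        return ups + noise, downs + noise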
> Reddit runs bot detection, rate limits, fingerprinting, shadow restrictions, and abuse heuristics you don’t even see, and you don’t know which ones, because that knowledge is their moat.
> Reddit says “run a ranking algorithm refined over years, with vote fuzzing, decay, abuse detection, and intentional lies in the UI.” As the number you see is not the number stored.
> etc...
The question is: is Moltbook doing this? That was the original point. It took a week to build a basic Reddit clone (the silhouette, as you call it) with AI, and that should surely be the point of comparison to what a human could do in that time.
I mean, as has already been pointed out, the fact that it's a clone is a big reason why, but then I also think I could probably churn out a simple clone of Reddit in less than a week. We've been through this before with Twitter: the value isn't the tech (which is relatively straightforward), it's the userbase. Of course Reddit has some more advanced features which would be more difficult, but I think the public db probably tells you that wasn't much of a concern to Moltbook either, so yeh, I reckon I could do that.
Double your estimate and switch the unit of time to the next larger one. That's how programmer time estimates tend to be. So two months, and I'm right there with you.
Even if I am only slightly more productive, it feels like I am flying. The mental toll is severely reduced and the feel good factor of getting stuff done easily (rather than as a slog) is immense. That's got to be worth something in terms of the mental wellbeing of our profession.
FWIW I generally treat the AI as a pair programmer. It does most of the typing and I ask it: why did you do this? Is that the most idiomatic way of doing it? That seems hacky. Did you consider edge case foo? Oh wait, let's call it a BarWidget not a FooWidget - rename everything in all other code/tests/make/doc files. Etc etc.
I save a lot of time typing boilerplate, and I find myself more willing (and a lot less grumpy!!!) to bin a load of things I've been working on but then realise is the wrong approach or if the requirements change (in the past I might try to modify something I'd been working on for a week rather than start from scratch again, with AI there is zero activation energy to start again the right way). Thats super valuable in my mind.
I absolutely share your feelings. And I realise I'm way less hesitant to pick up the drudge tasks: migrating to new major versions of dependencies, adding missing edge case tests, adding CRUD endpoints, nasty refactorings. All these things you usually postpone or go on HN procrastination sprees over are suddenly very simple undertakings that you can trivially review.
Because the world is still filled with problems that would once have been on the wrong side of the is it worth your time matrix ( https://xkcd.com/1205/ )
There are all sorts of things that I, personally, should have automated long ago that I threw at Claude to do for me. What was the cost to me? A prompt and a code review.
Meanwhile, on larger tasks an LLM deeply integrated into my IDE has been a boon. Having an internal debate on how to solve a problem? Try both, write a test, prove out which is going to be better. Pair program, function by function, with your LLM; treat it like a jr dev who can type faster than you if you give it clear instructions. I think you will be shocked at how quickly you can massively scale up your productivity.
Yup, I've already run like 6 of my personal projects including 1 for my wife that I had lost interest in. For a few dollars, these are now actually running and being used by my family. These tools are a great enabler for people like me. lol
I used to complain when my friends and family gave me ideas for something they wanted or needed help with because I was just too tired to do it after a day's work. Now I can sit next to them and we can pair program an entire idea in an evening.
If it is 20% slower for you to write with AI, but you are not stressed out and enjoy it so you actually code then the AI is a win and you are more productive with it.
I think that's what is missing from the conversation. It doesn't make developers faster, nor better, but it can automate what some devs detest and feel burned out having to write and for those devs it is a big win.
If you can productively code 40 hours a week with AI and only 30 hours a week without AI then the AI doesn't have to be as good, just close to as good.
I'm in agreement with you 100%. A lot of my job is coming into projects that have been running already and having to understand how the code was written, the patterns, and everything else. Generating a project with an LLM feels like doing the same thing. It's not going to be a perfect code base, but it's enough.
Last night I was working on trying to find a correlation between some malicious users we had found and information we could glean from our internet traffic and I was able to crunch a ton of data automatically without having to do it myself. I had a hunch but it made it verifiable and then I was able to use the queries it had used to verify myself. Saved me probably 4 or 5 hours and I was able to wash the dishes.
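Roughly that kind of crunch, with made-up column names (the real data and queries were obviously different):

    import pandas as pd

    malicious = pd.read_csv("malicious_users.csv")   # columns: user_id
    traffic = pd.read_csv("traffic_log.csv")         # columns: user_id, asn, user_agent

    flagged = traffic.merge(malicious, on="user_id", how="inner")

    # Which networks / user agents are over-represented among flagged users
    # relative to overall traffic?
    for col in ("asn", "user_agent"):
        lift = (flagged[col].value_counts(normalize=True)
                / traffic[col].value_counts(normalize=True)).dropna()
        print(lift.sort_values(ascending=False).head(10))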
The matrix framing is a very nice way to put it. This morning I asked my assistant to code up a nice debugger for a particular flow in my application. It's much better than what I would have had the time/patience to build myself for a nice-to-have.
I sort of have a different view of that time matrix. If AI is only able to help me do tasks that are of low value, where I previously wouldn't have bothered, is it really saving me anything? Before, where I'd simply ignore auxiliary tasks and focus on what matters, I'm now constantly detoured by them, thinking "it'll only take ten minutes."
I also primarily write Elixir, and I have found most Agents are only capable of writing small pieces well. More complicated asks tend to produce unnecessarily complicated solutions, ones that may “work,” on the surface, but don’t hold up in practice. I’ve seen a large increase in small bugs with more AI coding assistance.
When I write code, I want to write it and forget about it. As a result, I’ve written a LOT of code which has gone on to work for years without touching it. The amount of time I spent writing it is inconsequential in every sense. I personally have not found AI capable of producing code like that (yet, as all things, that could change).
Does AI help with some stuff? Sure. I always forget common patterns in Terraform because I don't often have to use it. Writing some initial resources and asking it to "make it normal" is helpful. That does save time. Asking it to write a GenServer correctly is an act of self-harm, because it fundamentally does not understand concurrency in Erlang/BEAM/OTP. It very much looks like it does, but it 100% does not.
tldr; I think the ease of use of AI can cause us to over produce and as a result we miss the forest for the trees.
It excels at this, and if you have it deeply integrated into your workflow and IDE/dev env the loop should feel more like pair programing, like tennis, than it should feel like its doing everything for you.
> I also primarily write Elixir,
I would also venture that it has less to do with the language (it is a factor) and more to do with what you are working on. Domain will matter in terms of sample size (code) and understanding (language to support it). There could be 1000s of examples in its training data of what you want, but if no one wrote a comment that accurately describes what that does...
> I think the ease of use of AI can cause us to over produce and as a result we miss the forest for the trees.
This is spot on. I stopped thinking of it as "AI" and started thinking of it as "power tools". Useful, and like a power tool you should be cautious because there is danger there... It isn't smart, it's not doing anything that isn't in its training data, but there is a lot there, everything, and it can do some basic synthesis.
Like others are saying, AI will accelerate the gap between competent devs and mediocre devs. It is a multiplier. AI cannot replace fundamentals, or a good helmsman with a rational, detail-oriented mind. Having fundamentals (skill & knowledge) + using AI will be the cheat code in the next 10 years.
The only historical analogue of this is perhaps differentiating a good project manager from an excellent one. No matter how advanced, technology will not substitute for competence.
At the company I work for, despite pushing widespread adoption, I have seen exactly a zero percent increase in the rate at which major projects get shipped.
This is what keeps getting me. People here keep posting benchmarks, bragging about 5x, 10x, 20x. None of the companies we work with are shipping anything faster.
The evangelist response is to call it a skill issue, but looking around it seems like no one anywhere is actually pushing out new products meaningfully faster.
Several experiments have shown quality of output at every skill level drops.
In many cases the quantity of output is good enough to compensate, but quality is extremely difficult to improve at scale. Beefing up QA to handle significantly more code of noticeably lower quality only goes so far.
> But something I'd bet money on is that devs are 10x more productive at using these tools.
If this were true, we should be seeing evidence of it by now, either in vastly increased output by companies (and open source projects, and indie game devs, etc), or in really _dramatic_ job losses.
This is assuming a sensible definition of 'productive'; if you mean 'lines of code' or 'self-assessment', then, eh, maybe, but those aren't useful metrics of productivity.
It is tempting to think that we can delegate describing the mental model to AI, but it seems like all of this boils down to humans making bets, and it also seems like the fundamental bets engineers are making are about the formalisms that encode the product and make it valuable.
What an awful professor! When I first tried to learn pointers, I didn't get it. I tried again 6 months later and suddenly it clicked. The same thing happened for another guy I was learning with.
So the professor just gaslit years of students into thinking they were too dumb to get programming, and also left them with the developmental disability of "if you can't figure something out in a few days, you'll never get it".
I don’t think there will be an “AI native” generation of developers. AI will be the entity that “groks pointers” and no one else will know it or care what goes on under the hood.
Speaking as someone who has been a SRE/DevOps at every level from IC to Global Head of a team:
- I 100% believe this is happening and is probably going to be the case in the next 6 months. I've seen Claude and Grok debug issues when they only had half of the relevant evidence (e.g. Given A and B, it's most likely X). It can even debug complex issues between systems using logs, metrics etc. In other words, everything a human would do (and sometimes better).
- The situation described is actually not that different from being a SRE manager. e.g. as you get more senior, you aren't doing the investigations yourself. It's usually your direct reports that are actually looking at the logs etc. You may occasionally get involved for more complex issues or big outages but the direct reports are doing a lot of the heavy lifting.
- All of the above being said, I can imagine errors so weird/complex etc that the LLMs either can't figure it out, don't have the MCP or skill to resolve it or there is some giant technology issue that breaks a lot of stuff. Facebook engineers using angle grinders to get into the data center due to DNS issues comes to mind for the last one.
Which probably means we are all going to start to be more like airline pilots:
- highly trained in debugging AND managing fleets of LLMs
- managing autonomous systems
- kept around "just in case" the LLMs fall over
P.S. I've been very well paid over the years and being a SRE is how I feed my family. I do worry, like many, about how all of this is going to affect that. Sobering stuff.
> Which probably means we are all going to start to be more like airline pilots:
Airline pilots are still employed because of regulations. The industry is heavily regulated and the regulations move very slowly because of its international cooperative nature. The regulations dictate how many crew members should be on board for each plane type, among various other variables. All the airlines have to abide by the rules of the airspace they're flying over to keep flying.
The airlines, on the other hand, along with the technology producers (Airbus for example), are pushing to reduce the number of heads in the cockpit. While their recent attempt to get rid of co-pilots in EASA land has failed [1], you can see the amount of pursuit and investment. The industry will continue to force through cost optimization as long as there's no barrier to prevent it. The cases where automation fails will just be a cost of doing business, since the lives of the deceased are no concern to the company's balance sheet.
Given the lack of regulation in software, I suspect the industry will continue the cost optimization and eliminate humans in the loop, except in regulated domains.
It's crazy how many developers are starry-eyed optimists about all of this, just casually assuming that they'll still be highly-paid, well-respected professionals ("we'll be like pilots, mostly monitoring the autopilot") if this technology doesn't hit a wall in the next year or two, despite lacking any of the legal protections that other professions enjoy.
Regulation and strong unions are the only thing holding airlines back from doing what the cruise lines did long ago: importing all of their labor from cheaper countries and paying them trash while working them to the bone.
In the meantime, captains at legacy airlines are the only ones getting paid well. Everyone else struggles to make ends meet. All while airlines constantly complain that they "can't find enough qualified pilots." Where have I heard this said before...
Also, every pilot is subject to furloughs, which happen every time economic headwinds blow a little too hard, which resets their tenure, and their payscales, if they switch employers.
It'll probably look like the code version of this, an image run through an LLM 101 times with the directive to create a replica of the input image: https://www.reddit.com/r/ChatGPT/comments/1kbj71z/i_tried_th... Despite being provided with explicit instructions, well...
People are still wrongly attributing a mind to something that is essentially mindless.
They do okay-ish for things that don't matter and if you don't look that hard. If you do look, the "features" turn out to be very limited, or not do what they claim or not work at all.
It’s still a collaborative and iterative process. That doesn’t mean they don’t work. I don’t need ai to one shot my entire job for it to be crazy useful.
If you find it helpful, that's fine. I like it as spicy autocorrect, and turn it off when I find it annoying.
I actually do look into what people do because as much fun as being a hater is, it's important not to get lost in the sauce.
From what I've seen, it's basically all:
1. People tricking themselves into feeling productive but they're not, actually
2. People tricking themselves into feeling productive but they're actually doing sloppy work
3. Hobby or toy stuff
4. Stuff that isn't critical to get right
5. Stuff they don't know how to judge the quality of
6. ofc the grifters chasing X payouts and driving FOMO
7. People who find it kinda useful in some limited situations (me)
It has its uses for sure, but I don't find it transformative. It can't do the hard parts and for anything useful, I need to check exactly what it did, and if I do that, it's much faster to do myself. Or make a script to do it.
Sure, if all you ask it to do is fix bugs. You can also ask it to work on code health things like better organization, better testing, finding interesting invariants and enforcing them, and so on.
I have some healthy skepticism on this claim though. Maybe, but there will be a point of diminishing returns where these refactors introduce more problems than they solve and just cause more AI spending.
Code is always a liability. More code just means more problems. There has never been a code generating tool that was any good. If you can have a tool generate the code, it means you can write something on a higher level of abstraction that would not need that code to begin with.
AI can be used to write this better quality / higher level code. That's the interesting part to me. Not churning out massive amounts of code, that's a mistake.
Microsoft will be an excellent real-world experiment on whether this is any good. We so easily forget that giant platform owners are staking everything on all this working exactly as advertised.
Some of my calculations going forward will continue to be along the lines of 'what do I do in the event that EVERYTHING breaks and cannot be fixed'. Some of my day job includes retro coding for retro platforms, though it's cumbersome. That means I'll be able to supply useful things for survivors of an informational apocalypse, though I'm hoping we don't all experience one.
There's an interesting phenomenon I noticed with the "skeptics". They're constantly using what-ifs (aka goalpost moving), but the interesting thing is that those exact same what-ifs were "solved" earlier, but dismissed as "not good enough".
This exact thing about optimisation has been shown years ago. "Here's a function, make it faster". With "glue" to test the function, and it kinda worked even with GPT4 era models. Then came alphaevolve where google found improvements in real algorithms (both theoretical i.e. packing squares and practical i.e. ML kernels). And yet these were dismissed as "yeah, but that's just optimisation, that's easyyyy. Wake me up when they write software from 0 to 1 and it works".
Well, here we are. We now have a compiler that can compile and boot linux! And people are complaining that the code is unmaintainable and that it's slow / unoptimised. We've gone full circle, but forgot that optimisation was easyyyy. Now it's something to complain about. Oh well...
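The "glue" part really was trivial to write, which was the point; a sketch of the kind of harness that was enough even in the GPT-4 era (both functions here are placeholders):

    import random
    import timeit

    def slow_version(xs):
        # Placeholder for the function you hand to the model: O(n^2) duplicate check.
        for i in range(len(xs)):
            for j in range(i + 1, len(xs)):
                if xs[i] == xs[j]:
                    return True
        return False

    def fast_version(xs):
        # Placeholder for the model's "make it faster" rewrite: O(n) via a set.
        return len(set(xs)) != len(xs)

    # Correctness gate first: the rewrite must agree with the original on random inputs.
    for _ in range(1000):
        xs = [random.randint(-50, 50) for _ in range(random.randint(0, 100))]
        assert slow_version(xs) == fast_version(xs), xs

    # Only then compare timings.
    data = list(range(3000))
    print("slow:", timeit.timeit(lambda: slow_version(data), number=5))
    print("fast:", timeit.timeit(lambda: fast_version(data), number=5))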
I use LLM’s daily and agents occasionally. They are useful, but there is no need to move any goal posts; they easily do shit work still in 2026.
All my coworkers use agents extensively in the backend and the amount of shit code, bad tests and bugs has skyrocketed.
Couple that with a domain (medicine) where our customer in some cases needs to validate the application's behaviour extensively and it's a fucking disaster: very expensive iteration instead of doing it well upfront.
I think we have some pretty good power tools now, but using them appropriately is a skill issue, and some people are learning to use them in a very expensive way.
I find that chat is pretty good when you're describing what you want to do, for saying "actually, I wanted something different," or for giving it a bug report. For making fine adjustments to CSS, it would be nice if you could ask the bot for a slider or a color picker that makes live updates.
It doesn't really matter for hobby projects or demos or whatever, but there's this whole group who thinks they can yell at the computer and have a business fall out and no.
I agree but want to interject that "code organization" won't matter for long.
Programming Languages were made for people. I'm old enough to have programmed in z80 and 8086 assembler. I've been through plenty of prog.langs. through my career.
But once building systems becomes prompting an agent to build a flow that reads these two types of Excels, cleans them, filters them, merges them and outputs the result for the web (oh, and make it interactive and highly available)...
Code won't matter. You'll have other agents that check that the system is built right, you'll have agents that test the functionality and agents that ask and propose functionality and ideas.
Most likely the Programming language will become similar to the old Telegraph texts (telegrams) which were heavily optimized for word/token count. They will be optimized to be LLM grokable instead of human grokable.
What you’re describing is that we’d turn deterministic engineering into the same march of 9s that FSD and robotics are going through now - but for every single workflow. If you can’t check the code for correctness, and debug it, then your test system must be absolutely perfect and cover every possible outcome. Since that’s not possible for nontrivial software, you’re starting a march of 9s towards 100% correctness of each solution.
That accounting software will need 100M unit tests before you can be certain it covers all your legal requirements. (Hyperbole but you get the idea) Who’s going to verify all those tests? Do you need a reference implementation to compare against?
Making LLM work opaque to inspection is kind of like pasting the outcome of a mathematical proof without any context (which is almost worthless AFAIK).
There are certainly people working on making this happen. As a hobbyist, maybe I'll still have some retro fun polishing the source code for certain projects I care about? (Using our new power tools, of course.)
The costs for code improvement projects have gone down dramatically now that we have power tools. So, perhaps it will be considered more worthwhile now? But how this actually plays out for professional programming is going to depend on company culture and management.
In my case, I'm an early-retired hobbyist programmer, so I control the budget. The same is true for any open source project.
My unpopular opinion is AI sucks at writing tests. Like, really sucks. It can churn out a lot of them, but they're shitty.
Actually writing good tests that exercise the behavior you want, guard against regressions, and isn't overfitted to your code is pretty difficult, really. You need to both understand the function and understand the structure to do it
Even for hobby projects, it's not great. I'm learning asyncio by writing a matrix scraper and writing good functional tests as you go is worth it to make sure you actually do understand the concepts
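A concrete example of the difference, with a made-up parse_price function; the test below pins down behaviour and survives refactors, which is exactly what the generated ones usually don't do:

    import pytest

    def parse_price(text: str) -> int:
        """Parse a price like '$1,299.99' into integer cents."""
        cleaned = text.strip().lstrip("$").replace(",", "")
        dollars, _, cents = cleaned.partition(".")
        return int(dollars) * 100 + int((cents or "0").ljust(2, "0")[:2])

    # Behaviour-level test: states what the function must do, not how it does it.
    @pytest.mark.parametrize("text,expected", [
        ("$1,299.99", 129999),
        ("  $5 ", 500),
        ("0.5", 50),
    ])
    def test_parses_prices_to_cents(text, expected):
        assert parse_price(text) == expected

    # An overfitted test would instead mock string internals or assert on each
    # intermediate transformation: it breaks on every refactor and catches no
    # real regressions.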
And what happens when these different objectives conflict or diverge ? Will it be able to figure out the appropriate trade-offs, live with the results and go meta to rethink the approach or simply delude itself ? We would definitely lose these skills if it continues like this.
And even if you are the single engineer, I'll be honest, it might as well have been somebody else that wrote the code if I have to go back to something I did seven years ago and unearth wtf.
It would be in danger if LLMs could actually do that for me, but they're still very far from it and they progress slowly. One day I could start worrying, but it's not today.
It's nice that AI can fix bugs fast, but it's better to not even have bugs in the first place. By using someone else's battle tested code (like a framework) you can at least avoid the bugs they've already encountered and fixed.
I spent Dry January working on a new coding project and since all my nerd friends have been telling me to try to code with LLM's I gave it a shot and signed up to Google Gemini...
All I can say is "holy shit, I'm a believer." I've probably got close to a year's worth of coding done in a month and a half.
Busy work that would have taken me a day to look up, figure out, and write -- boring shit like matplotlib illustrations -- they are trivial now.
Things that are ideas I'm not sure how to implement ("what are some different ways to do this weird thing"), where I would have spent a week trying to figure out a reasonable approach: no, it's basically got two or three decent ideas right away, even if they're not perfect. There was one vectorization approach I would have never thought of that I'm now using.
Is the LLM wrong? Yes, all the damn time! Do I need to, you know, actually do a code review when I'm implementing ideas? Very much yes! Do I get into a back and forth battle with the LLM when it starts spitting out nonsense, shut the chat down, and start over with a newly primed window? Yes, about once every couple of days.
It's still absolutely incredible. I've been a skeptic for a very long time. I studied philosophy, and the conceptions people have of language and Truth get completely garbled by an LLM that isn't really a mind that can think in the way we do. That said, holy shit it can do an absolute ton of busy work.
What kind of project / prompts? What's working for you?
I spent a good 20 years in the software world but have been away doing other things professionally for a couple of years. Recently I was in the same place as you, with a new project and wanting to try it out. So I start with a generic Django project in VSCode, use the agent mode, and… what a waste of time. The auto-complete suggestions it makes are frequently wrong, and the actions it takes in response to my prompts tend to make a mess on the order of a junior developer. I keep trying to figure out what I'm doing wrong, as I'm prompting pretty simple concepts at it. If you know Django, imagine concepts like "add the foo module to settings.py" or "Run the check command and diagnose why the foo app isn't registered correctly". Before you know it, it's spiraling out of control with changes it thinks it is making, all of which are hallucinations.
I'm just using Gemini in the browser. I'm not ready to let it touch my code. Here are my last two prompts, for context the project is about golf course architecture:
Me, including the architecture_diff.py file: I would like to add another map to architecture_diff. I want the map to show the level of divergence of the angle of the two shots to the two different holes from each point. That is, when you are right in between the two holes, it should be a 180 degree difference, and should be very dark, but when you're on the tee, and the shot is almost identical, it should be very light. Does this make sense? I realize this might require more calculations, but I think it's important.
Gemini output was some garbage about a simple naive angle to two hole locations, rather than using the sophisticated expected value formula I'm using to calculate strokes-to-hole... thus worthless.
Follow up from me, including the course.py and the player.py files: I don't just want the angle, I want the angle between the optimal shot, given the dispersion pattern. We may need to update get_smart_aim in the player to return the vector it uses, and we may need to cache that info. We may need to update generate_strokes_gained_map in course to also return the vectors used. I'm really not sure. Take as much time as you need. I'd like a good idea to consider before actually implementing this.
Gemini output now has a helpful response about saving the vector field as we generate the different maps I'm trying to create as they are created. This is exactly the type of code I was looking for.
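Once the aim vectors are cached, the divergence map itself is only a few lines of NumPy; the shapes and names below are placeholders, not the actual architecture_diff.py / player.py code:

    import numpy as np

    def angle_divergence_map(aim_a: np.ndarray, aim_b: np.ndarray) -> np.ndarray:
        """Per-point angle (degrees) between the optimal aim vectors to two holes.

        aim_a, aim_b: (H, W, 2) arrays of cached aim vectors. Returns an (H, W)
        array: near 0 at the tee (shots almost identical), near 180 between the holes.
        """
        dot = np.sum(aim_a * aim_b, axis=-1)
        norms = np.linalg.norm(aim_a, axis=-1) * np.linalg.norm(aim_b, axis=-1)
        cos = np.clip(dot / np.maximum(norms, 1e-12), -1.0, 1.0)
        return np.degrees(np.arccos(cos))

    # e.g. plt.imshow(angle_divergence_map(a, b), cmap="Greys", vmin=0, vmax=180)
    # gives light where the shots nearly coincide and dark where they diverge.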
I recently started building a POC for an app idea. As a framework I chose Django, and I did not once write code myself. The whole thing was done in a GitHub Codespace with Copilot in agentic mode, using mostly Sonnet and Opus models.
For prompting, I did not give it specific instructions like "add x to settings". I told it "We are now working on feature X. X should be able to do a, b and c. B has the following constraints. C should work like this." I also have some instructions in the agents.md file which tell the model to, before starting to code, ask me all unclear questions and then make a comprehensive plan of what to implement. I would then go over this plan, clarify or change it if needed, and then let it run for 5-15 minutes. And every time it just did it. The whole thing, with debugging, with tests. Sure, sometimes there were minor bugs when I tested, but then I prompted the problem directly, and sure enough it got fixed in seconds...
Not sure why we had such different experiences. Maybe you are using other models? Maybe you are missing something in your prompts? Letting it start with a plan which I can then check definitely helped a lot. Also a summary of the app's workings and technical decisions (also produced by the model) maybe helped in the long run.
I don't use VSCode, but I've heard that the default model isn't that great. I'd make sure you're using something like Opus 4.5/4.6. I'm not familiar enough with VSCode to know if it's somehow worse than Claude Code, even with the same models, but you can test Claude Code to rule that out. It could also be that you've stumbled upon a problem the AI isn't that good at. For example, I was diagnosing a C++ build issue, and I could tell the AI was off track.
Most of the people that get wowed use an AI on a somewhat difficult task that they're unfamiliar with. For me, that was basically a duplicate of Apple's Live Captions that could also translate. Other examples I've seen are repairing a video file, or building a viewer for a proprietary medical imaging format. For my captions example, I don't think I would have put in the time to work on it without AI, and I was able to get a working prototype within minutes and then it took maybe a couple more hours to get it running smoother.
Also >20 years in software. The VSCode/autocomplete, regardless of the model, never worked well for me. But Claude Code is something else: it doesn't do autocomplete per se, it will make modifications, test, debug if something fails, and iterate until it gets it right.
I'm (mostly) a believer too, and I think AI makes using and improving these existing frameworks and libraries even easier.
You mentioned matplotlib, why does it make sense to pay for a bunch of AI agents to re-invent what matplotlib does and fix bugs that matplotlib has already fixed, instead of just having AI agents write code that uses it.
I mean, the thesis of the post is odd. I'll grant you that.
I work mostly with python (the vast majority is pure python), flask, and htmx, with a bit of vanilla js thrown in.
In a sense, I can understand the thesis. On the one hand, Flask is a fantastic tool, with a reasonable abstraction given the high complexity. I wouldn't want to replace Flask. On the other hand, HTMX is a great tool, but often imperfect for what I'm exactly trying to do. Most people would say "well, just use React!" except that I honestly loathe working with js, and unless someone is paying me, I'll do it in python. I could see working with an LLM to build a custom tool to make a version of HTMX that better interacts with Flask in the way I want it to.
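For what it's worth, that kind of Flask-flavoured HTMX variant mostly boils down to fragment-returning routes plus a small client-side helper; a rough sketch, where the route, template names, and load_fragment_data are all made up:

    from flask import Flask, render_template

    app = Flask(__name__)

    # On the page, plain htmx (or a homegrown helper) requests the fragment:
    #   <div hx-get="/fragments/summary" hx-trigger="load" hx-swap="innerHTML"></div>

    @app.route("/fragments/<name>")
    def fragment(name: str):
        """Return just an HTML fragment; the client-side helper swaps it into the page."""
        data = load_fragment_data(name)  # hypothetical data-access helper
        return render_template(f"_{name}.html", **data)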
In fact, in the project I'm working on now I'm building complex heatmap illustrations that require a ton of data processing, so I've been building a model to reduce the NP-hard aspects of that process. However, the illustrations are the point, and I've already had a back and forth with the LLM about porting the project to HTML, or at least some web-based version of illustration, simply because I'd have much more control over the illustrations. Right now, matplotlib still suits me just fine, but if I had to port it, I could see just building my own tool instead of finding an existing framework and learning it.
Frameworks are mostly useful because of group knowledge. I learn Flask because I don't want to build all these tools from scratch, and because it makes me literate in a very common language. The author is suggesting that these barriers, at least for your own code, functionally don't exist anymore. Learning a new framework is about as labor intensive as learning one you're creating as you go. I think it's short-sighted, yes, but depending on the project, yeah, when it's trivial to build the tool you want, it's tempting to do that instead of learning to use a similar tool that needs two adapters attached to it to work well on the job you're trying to do.
At the same time, this is about scope. Anyone throwing out React because they want to just "invent their own entire web framework" is just being an idiot.
Well maintained, popular frameworks have github issues that frequently get resolved with newly patched versions of the framework. Sometimes bugs get fixed that you didn't even run into yet so everybody benefits.
Will your bespoke LLM code have that? Every issue will actually be an issue in production experienced by your customers, that will have to be identified (better have good logging and instrumentation), and fixed in your codebase.
Frameworks that are (relatively) buggy and slow to address bugs lose popularity, to the point that people will spontaneously create alternatives. This happened too many times.
Have you? Then you know that the number of defects scales linearly with the amount of code. As things stand, models write a lot more code than a skilled human would for a given requirement.
In practice using someone else’s framework means you’re accepting the risk of the thousands of bugs in the framework that have no relevance to your business use case and will never be fixed.
Yet people still use frameworks, before and after the age of LLMs. Frameworks must have done something right, I guess. Otherwise everyone will vibe their own little React in the codebase.
> Unless you are only ever a single engineer, your career is filled with "I need to debug code I didn't write".
True, but there's usually at least one person who knows that particular part of the system that you need to touch, and if there isn't, you'll spend a lot of time fixing that bug and become that person.
The bet you're describing is that the AI will be the expert, and if it can be that, why couldn't it also be the expert at understanding the users' needs so that no one is needed anywhere in the loop?
What I don't understand about a vision where AI is able to replace humans at some (complicated) part of the entire industrial stack is why does it stop at a particular point? What makes us think that it can replace programmers and architects - jobs that require a rather sophisticated combination of inductive and deductive reasoning - but not the PMs, managers, and even the users?
Steve Yegge recently wrote about an exponential growth in AI capabilities. But every exponential growth has to plateau at some point, and the problem with exponential growth is that if your prediction about when that plateau happens is off by a little, the value at that point could be different from your prediction by a lot (in either direction). That means that it's very hard to predict where we'll "end up" (i.e. where the plateau will be). The prediction that AI will be able to automate nearly all of the technical aspects of programming yet little beyond them seems as unlikely to me as any arbitrary point. It's at least as likely that we'll end up well below or well above that point.
> Steve Yegge recently wrote about an exponential growth in AI capabilities
I'm not sure that the current growth rate is exponential, but the problem is that it doesn't need to be exponential. It should have been obvious the moment ChatGPT and stable diffusion-based systems were released that continued, linear progress of these models was going to cause massive disruption eventually (in a matter of years).
That doesn't change the fact that the submission is basically repeating the LISP curse. Best case scenario: you end up with a one-off framework and only you know how it works. The post you're replying to points out why this is a bad idea.
It doesn't matter if you don't use 90% of a framework as the submission bemoans. When everyone uses an identical API, but in different situations, you find lots of different problems that way. Your framework, and its users become a sort of BORG. When one of the framework users discovers a problem, it's fixed and propagated out before it can even be a problem for the rest of the BORG.
That's not true in your LISP curse, one off custom bespoke framework. You will repeat all the problems that all the other custom bespoke frameworks encountered. When they fixed their problem, they didn't fix it for you. You will find those problems over and over again. This is why free software dominates over proprietary software. The biggest problem in software is not writing the software, it's maintaining it. Free software shares the maintenance burden, so everyone can benefit. You bear the whole maintenance burden with your custom, one off vibe coded solutions.
I think back on the ten+ years I spent doing SRE consulting and the thing is, finding the problems and identifying solutions — the technical part of the work — was such a small part of the actual work. So often I would go to work with a client and discover that they often already knew the problem, they just didn’t believe it - my job was often about the psychology of the organization more than the technical knowledge. So you might say “Great, so the agent will automatically fix the problem that the organization previous misidentified.” That sounds great right up until it starts dreaming… it’s not to say there aren’t places for these agents, but I suspect ultimately it will be like any other technology we use where it becomes part of the toolkit, not the whole.
>I would argue that it's going to be the opposite. At re:Invent, one of the popular sessions was about creating a trio of SRE agents: one that did nothing but read logs and report errors, one that analyzed and triaged the errors and proposed fixes, and one that did the work and submitted PRs to your repo.
If you manage a code base this way at your company, sooner or later you will hit a wall. What happens when the AI can't fix an important bug or is unable to add a very important feature? Now you are stuck with a big fat dirty pile of code that no human can figure out, because it wasn't coded by a human and was never designed to be understood by a human in the first place.
I treat code quality, and readability, as one of the goals. The LLM can help with this and refactor code much quicker than a human. If I think the code is getting too complex I change over to architecture review and refactoring until I am happy with it.
What happens when humans can’t fix a bug or build an important feature? That is a pretty common scenario, that doesn’t result in the doomsday you imply.
There will always be bugs you can't fix, that doesn't mean we should embrace having orders of magnitude more of them. And it's not just about bugs, it's also about adding new features.
This is tech debt on steroids. You are building an entire code base that no one can read or understand, and praying that the LLM won't fuck up too much. And when it does, no one in the company knows how to deal with it other than by throwing more LLM tokens at it and praying it works.
As I said earlier, using pure AI agents will work for a while. But when it doesn't you are fucked.
Automatically solving software application bugs is one thing, recovering stateful business process disasters and data corruption is entirely another thing.
Customer A is in a totally unknown database state due to a vibe-coded bug. Great, the bug is fixed now, but you're still f-ed.
The other issue is "fixing" false positives; I've seen it before with some AI tools: they convince you it's a bug, and it looks legit and passes the tests, but later on something doesn't quite work right anymore and you have to triage and revert it... it can be a real time sink.
Because nobody posts the good ones. They're boring, correct, you merge them and move on to the next one. It's like there's a murder in the news every day but generally we're still all fine.
Don't assume that when people make fun of some examples that there aren't thousands more that nobody cares to write about.
Amazon, which employs many thousands of SREs (or, well, pseudo-SREs; AIUI it's not quite the conventional SRE role), is presumably just doing so for charitable purposes, if they are so easy to replace with magic robots.
Many years ago, Java compilers, though billed as a multi-platform write-once-run-anywhere solution, would output different bytecode that behaved in interesting and sometimes unpredictable ways. You would be inside jdb, trying to debug why the compiler did what it did.
This is not exactly that, but it is one step up. Having agents output code that then gets compiled/interpreted/whatever, based upon contextual instruction, feels very, very familiar to engineers who have ever worked close to the metal.
"Old fashioned", in this aspect, would be putting guardrails in place so that you knew that what the agent/compiler was creating was what you wanted. Many years ago, that was binaries or bytecode packaged with lots of symbols for debugging. Today, that's more automated testing.
You are ignoring the obvious difference between errors introduced while translating one near-formal-intent-clear language to another as opposed to ambiguous-natural-language to code done through a non-deterministic intermediary. At some point in the future the non-deterministic intermediary will become stable enough (when temperature is low and model versions won't affect output much) but the ambiguity of the prompting language is still going to remain an issue. Hence, read before commit will always be a requirement I think.
A good friend of mine wrote somewhere that at about 5 agents or so per project is when he is the bottleneck. I respect that assessment. Trust but verify. This way of getting faster output by removing that bottleneck altogether is, at least for me, not a good path forward.
Unfortunately, reading before merge commit is not always a firm part of human team work. Neither reading code nor test coverage by themselves are sufficient to ensure quality.
In that analogy "someone" is an AI, who of course switches from answering questions from humans, to answering questions from other AIs, because the demand is 10x.
> Governments have typically expected efficiency gains to lower resource consumption, rather than anticipating possible increases due to the Jevons paradox
I think that it's true that governments want the efficiency gains but it's false that they don't anticipate the consumption increases. Nobody is spending trillions on datacenters without knowing that demand will increase, that doesn't mean we shouldn't make them efficient.
"You have to assume that any work done outside classroom has used AI."
That is just such a wildly cynical point of view, and it is incredibly depressing. There is a whole huge cohort of kids out there who genuinely want to learn and want to do the work, and feel like using AI is cheating. These are the kids who, ironically, AI will help the most, because they're the ones who will understand the fundamentals being taught in K-12.
I would hope that any "solution" to the growing use of AI-as-a-crutch can take this cohort of kids into consideration, so their development isn't held back just to stop the less-ethical student from, well, being less ethical.
What possible solution could prevent this? The best students are learning on their own anyways, the school can't stop students using AI for their personal learning.
There was a reddit thread recently that asked the question: are all students really doing worse? It basically said that there are still top performers performing toply, but that the middle has been hollowed out.
So I think, I dunno, maybe depressing. Maybe cynical, but probably true. Why shy away from the truth?
And by the way, I would be both. Probably would have used AI to further my curiosity and to cheat. I hated school, would totally cheat to get ahead, and am now wildly curious and ambitious in the real world. Maybe this makes me a bad person, but I don't find cheating in school to be all that unethical. I'm paying for it, who cares how I do it.
Well, it seems the vast majority doesn't care about cheating, and is using AI for everything. And this is from primary school to university.
It's not just that AI makes it simpler, so many pupils cannot concentrate anymore. Tiktok and others have fried their mind. So AI is a quick way out for them. Back to their addiction.
As someone who had a college English assignment due literally just yesterday, I think that "the vast majority" is an overstatement. There are absolutely students in my class who cheat with AI (one of them confessed to it and got a metaphorical slap on the wrist with a 15 point deduction and the opportunity to redo the assignments, which doesn't seem fair but whatever), but the majority of my classmates were actively discussing and working on their essays in class.
Whatever solution we implement in response to AI, it must avoid hurting the students who genuinely want to learn and do honest work. Treating AI detection tools as infallible oracles is a terrible idea because of the staggering number of false reports. The solution many people have proposed in this thread, short one-on-one sessions with the instructor, seems like a great way to check if students can engage with and defend the work they turned in.
Sure, but the point is that if 5% of students are using AI then you have to assume that any work done outside classroom has used AI, because otherwise you're giving a massive advantage to the 5% of students who used AI, right?
11% success rate for what is effectively a spear-phishing attempt isn't that terrible and tbh it'll be easier to train Claude not to get tricked than it is to train eg my parents.
What?! 1 in 10 successfully phished is OK? That's 1 in 10 page views. That has to approach a 100% success rate over a week or a month of browsing the web, with targeted ads and/or link farms to get the page click.
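The compounding works against you fast; assuming independent exposures at 11% each:

    # P(at least one successful phish in n exposures) = 1 - (1 - p)^n
    p = 0.11
    for n in (10, 50, 200):
        print(n, "exposures ->", round(1 - (1 - p) ** n, 3))
    # 10 -> 0.688, 50 -> 0.997, 200 -> 1.0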
One in ten cases that take hours on a phone talking to a person with detailed background info and spoofed things is one issue. One in ten people that see a random message on social media is another.
Like 1 in 10 traders on the street might try and overcharge me is different from 1 in 10 pngs I see can drain my account.
The kind of attack vector is irrelevant here, what's important is the attack surface. Not to mention this is a tool facilitating the attack, with little to no direct interaction with the user in some cases. Just because spear-phishing is old and boring doesn't mean it cannot have real consequences.
(Even if we agree with the premise that this is just "spear-phishing", that's honestly a semantics argument that is irrelevant to the more pertinent question of how important it is to prevent this attack vector.)
>Claude not to get tricked than it is to train eg my parents.
One would think, but apparently from this blog post it is still susceptible to the same old prompt injections that have always been around. So I'm thinking it is not very easy to train Claude like this at all. Meanwhile, with parents you could probably eliminate an entire security vector outright if you merely told them "bank at the local branch," or "call the number on the card for the bank, don't try and look it up."
Roam has always felt like a bit of a chore -- while it's easy enough to set up backlinks, having to do that one step has always felt like a waste of time to me. This is the kind of thing that imo an agentic workflow could do for you:
- Just start typing
- Let the LLM analyze what you're typing, given the RAG database of everything else you've added, and be able to make those kinds of correlations quickly.
- One-button approve the backlinks that it's suggesting (or even go Cursor-style yolo mode for your backlinks).
Then, have a periodic process do some kind of directed analysis; are you keeping a journal, and want to make sure that you're writing enough in your journal? Are you talking about the same subjects over and over again? Should you mix things up? Things like that would be perfect for an LLM to make suggestions about. I don't know if Roam is thinking of doing this or not.
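A rough sketch of what that suggestion step might look like, assuming you keep an embedding index of your notes (embed() and the 0.3 similarity cutoff are placeholders):

    import numpy as np

    def suggest_backlinks(draft: str, note_titles: list[str],
                          note_vecs: np.ndarray, embed, k: int = 5) -> list[str]:
        """Suggest existing notes to backlink from the text being typed.

        embed(): hypothetical embedding function (local model or API).
        note_vecs: precomputed, L2-normalized (N, d) matrix of note embeddings.
        """
        q = embed(draft)
        q = q / np.linalg.norm(q)
        scores = note_vecs @ q                       # cosine similarity
        top = np.argsort(scores)[::-1][:k]
        return [note_titles[i] for i in top if scores[i] > 0.3]

The one-button approve (or yolo mode) is then just UI on top of that list.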
But... backlinks are fully automated, if you just make the forward-links that you'd normally make in the course of writing.
You're thinking of an optional step of adding extra links "just because", but IMO that's a learning process for the beginning, when you're not used to adding any forward-links whatsoever.
IMO the 3 table-stakes features for a notetaking app in 2025 are AI-powered search (including a question-answering capability), showing related / recommended notes (via RAG), and automated clustering (K Means + LLM) to maintain a category hierarchy.
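A minimal sketch of the clustering piece, assuming an embedding already exists for every note; cluster count and the labeling prompt are whatever the app picks:

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_notes(note_vecs: np.ndarray, n_clusters: int = 20):
        """Group note embeddings into clusters; an LLM then names each cluster
        to maintain the category hierarchy. note_vecs: (N, d) array."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels = km.fit_predict(note_vecs)
        return labels, km.cluster_centers_

    # Related/recommended notes (the RAG part) reuse the same embeddings with a
    # nearest-neighbour search; question answering retrieves the top-k notes and
    # hands them to the LLM as context.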
I think this might be the most exciting use-case of LLM's I've seen suggested here. I've struggled with exactly this problem with note-taking and personal knowledge-bases.
I'd love to have this but only if it runs entirely on my own machine or on a server I own. Uploading all my notes to somebody else's cloud is a nonstarter.
I would imagine you could launch a new rack, dump the old one, and connect the new one to the existing solar / cooling array. Hopefully with some sort of re-entry and recycling plan for the old one. The sheer size the arrays are going to need to be feel like they are going to be the more important part of it.
Tesla is only in business today because it was able to sell carbon credits to other automakers. Take that government subsidy away and Tesla would have died in 2009.
The real problem is that American consumers are demanding these gigantic monstrosity SUVs and trucks which literally cannot fit on European streets. When Ford et al were making hot hatchbacks, they were incredibly popular overseas. The inefficiency is at the consumer level.
My town is filled with massive Ford pickups. Pristine and clean, nothing in the beds. These people are not 'utilizing' the thing, it's just a status symbol. Annoys me so much.
European streets? My American city isn't even that old - much of the infrastructure is mid 90s - but modern vehicles just barely fit in the parking lots throughout it. It's common to see some asshole parking their pickup horizontally.
The bespoke UK/EU models are not the priority, again because they aren't being made in the US, so yes the quality drops.
You cannot get, for example, a new Focus in the US market. When you could, they were much higher quality.
The only Chevrolet you can buy in the UK is the Corvette. Chevrolet makes nine SUVs, four trucks (with however many infinite variations), and exactly one shitbox non-Corvette car.
If US automakers started turning their eyes towards smaller more efficient cars, where hauling Brayden to and from their soccer games didn't require multiple tons of steel, then they could compete in the EU market.
TBF, pretty soon you won't be able to buy a new Focus anywhere else, production finishes this year. Stellantis is still making cars of a similar size and could brand them as Chrysler for the US market.