The equivalent of "draw me a dog" -> not a masterpiece!? Who would have thought? You need to come up with a testing methodology, write it down, and then ask the model to go through it. It likes to make assumptions about unspecified things, so you have to be careful.
More fundamentally I think testing is becoming the core component we need to think about. We should not vibe-check AI code, we should code-check it. Of course it will write the actual test code, but your main priority is to think about "how do I test this?"
You can only know the value of a piece of code up to the level of its testing. You can't commit your eyes into the repo, so don't do "LGTM" vibe-testing of AI code; that's walking a motorcycle.
Me too. I already exported my data from all platforms, including HN, and indexed it in a RAG database, but I don't feel like using it much. It is past-oriented and I need present-oriented stuff. With LLMs I noticed I don't like it when they use chat history search or memory functions; it makes them fall into a rut and become less creative.
I even got to a point where I made an "anti-memory system": an MCP tool that just calls a model without context from the full conversation or past conversations, to get a fresh perspective. And I instruct the host model to reveal only part of the information we discussed, explaining that creativity is sparked when LLMs get to see not too much, not too little - like a sexy dress.
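For the curious, here is a minimal sketch of the idea, assuming the official `mcp` Python SDK (FastMCP) and the OpenAI client; the tool name, model, and prompt are illustrative, not the actual implementation:

```python
# Minimal "anti-memory" sketch: an MCP tool that forwards only a deliberately
# partial summary to a second model, with no conversation history attached.
from mcp.server.fastmcp import FastMCP
from openai import OpenAI

mcp = FastMCP("anti-memory")
client = OpenAI()

@mcp.tool()
def fresh_perspective(partial_summary: str) -> str:
    """Ask a context-free model for a take on a partial summary of the discussion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works; this one is just an example
        messages=[
            # No chat history, no memory: only the partial summary goes in.
            {"role": "system", "content": "Give a fresh, unconstrained take."},
            {"role": "user", "content": partial_summary},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    mcp.run()
```

The host model is told to call the tool with a stripped-down summary, which is what keeps the second model from falling into the same rut.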
When it comes to stimulating AI creativity, it may indeed be better to impose fewer constraints. However, in most scenarios, problems are likely still solved through simple information aggregation, refinement, analysis, and planning, right?
Many say generative AI is like a vending machine. But if your vending machine has not 1 button but a keyboard, and you type anything you want in, and it makes it (Star Trek Replicator) and you use it 10,000 times to refine your recipes, did you learn something or not? How about a 3D printer, do you learn something making designs and printing them?
Instead of "vending machine", I see many people calling generative AI "slot machine", which more aptly describes current genAI tools.
Yes, we can use it 10,000 times to refine our recipes, but "did we learn from it"? I am doubtful about that, given that running the same prompt 10 times will give different answers in 8 of the 10 responses.
But I am very confident that I can learn by iterating and printing designs on a 3D printer.
Star Trek replicators were deterministic. They had a library of things they could replicate that you programmed in, and that's the extent of what they could do. They replicated to the molecular level; no matter how many times you asked for something, you got the exact same thing. You'd never ask for a raktajino and get a raw steak. In the rare instances where they misbehaved as a plot point, they were treated as being broken and needing fixing; no one ever suggested "try changing your prompt, or ask it seventeen times until you get what you want".
The study measures if participants learn the library, but what they should study is if they learn effective coding agent patterns to use the library well. Learning the library is not going to be what we need in the future.
> "We collect self-reported familiarity with AI coding tools, but we do not actually measure differences in prompting techniques."
Many people drive cars without being able to explain how cars work. Or use devices like that. Or interact with people whose thinking they can't explain. Society works like that; it is functional, and it does not run on full understanding. We need to develop the functional part, not the full-understanding part. We can write C without knowing the machine code.
You can often recognize a wrong note without being able to play the piece, spot a logical fallacy without being able to construct the valid argument yourself, catch a translation error with much less fluency than producing the translation would require. We need discriminative competence, not generative.
For years I maintained a library for formatting dates and numbers (prices, ints, ids, phones). It was a pile of regex, but I maintained hundreds of test cases for each type of parsing. And as new edge cases appeared, I added them to my tests and iterated to keep the score high. I don't fully understand my own library; it emerged by scar accumulation. I mean, yes, I can explain any line, but why these regexes in this order is a data-dependent explanation I don't have anymore. All my edits run in a loop with the tests, and my PRs are sent only when the score is good.
Correctness was never grounded in understanding the implementation. Correctness was grounded in the test suite.
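As a toy illustration of that workflow (not the real library), the whole thing boils down to a handful of patterns plus a test table that only ever grows; the pattern and cases below are made up:

```python
# Toy version of the regex-plus-test-table workflow: correctness lives in the
# case table, and the patterns only change until the table passes again.
import re

_PHONE_PATTERNS = [
    re.compile(r"^\+?1?[\s.-]?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})$"),
]

def format_phone(raw: str) -> str | None:
    for pattern in _PHONE_PATTERNS:
        m = pattern.match(raw.strip())
        if m:
            return "({}) {}-{}".format(*m.groups())
    return None  # unrecognized input: add a case below, then a pattern above

# Every new edge case gets appended here; this table is the real spec.
CASES = [
    ("555-867-5309", "(555) 867-5309"),
    ("+1 (555) 867 5309", "(555) 867-5309"),
    ("(555)867.5309", "(555) 867-5309"),
    ("not a phone", None),
]

def test_phone_cases():
    for raw, expected in CASES:
        assert format_phone(raw) == expected, raw
```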
You can, most certainly, drive a car without understanding how it works. A pilot of an aircraft, on the other hand, needs a fairly detailed understanding of the subsystems in order to fly it effectively.
I think being a programmer is closer to being an aircraft pilot than a car driver.
Sure, if you are a pilot then that makes sense. But what if you are a company that uses planes to deliver goods? Like when the focus shifts from the thing itself to its output.
> Many people drive cars without being able to explain how cars work.
But on the fundamentals, all cars behave the same way all the time. Imagine running a courier company where sometimes the vehicles take a random left turn.
> Or interact with people whose thinking they can't explain
Sure, but they trust those service providers because they are reliable. And the reason they are reliable is that the service providers can explain their own thinking to themselves. Otherwise their business would be chaos and nobody would trust them.
How you approached your library was practical given the use case. But can you imagine writing a compiler like this? Or writing an industrial automation system? Not only would it be unreliable but it would be extremely slow. It's much faster to deal with something that has a consistent model that attempts to distill the essence of the problem, rather than patching on hack by hack in response to failed test after failed test.
But isn't it the correction of those errors that is valuable to society and gets us a job?
People can tell when they have found a bug or describe what they want from a piece of software, yet it takes skill to fix the bugs and to build the software. Though LLMs can speed up the process, expert human judgment is still required.
If you know that you need O(n) "contains" checks and O(1) retrieval for items, for a given order of magnitude, it feels like you have all the pieces of the puzzle needed to keep the LLM on the straight and narrow, even if you didn't know off the top of your head that you should choose ArrayList.
Or if you know that string manipulation might be memory intensive so you write automated tests around it for your order of magnitude, it probably doesn't really matter if you didn't know to choose StringBuilder.
That feels different to e.g. not knowing the difference between an array list and linked list (or the concept of time/space complexity) in the first place.
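To make that concrete (the classes above are Java's, but the point is language-independent), you can check the complexity an agent's choice actually delivers without recalling the right class by name. A rough, illustrative sketch in Python:

```python
# Time a membership check at two sizes and see how it scales: a list scans
# linearly, a hash-based set does not. The same reasoning applies to
# ArrayList vs. a hash set, or to += string building vs. StringBuilder.
import timeit

def scaling_ratio(make_container, n_small=10_000, n_large=100_000, probes=200):
    """How much slower does membership get when the container grows 10x?"""
    small, large = make_container(n_small), make_container(n_large)
    t_small = timeit.timeit(lambda: (n_small - 1) in small, number=probes)
    t_large = timeit.timeit(lambda: (n_large - 1) in large, number=probes)
    return t_large / t_small

# list membership is O(n): the ratio lands near 10.
print("list:", scaling_ratio(lambda n: list(range(n))))
# set membership is O(1) on average: the ratio stays near 1.
print("set: ", scaling_ratio(lambda n: set(range(n))))
```

The exact numbers jitter with timing noise, but the order-of-magnitude gap is hard to miss, which is all the straight-and-narrow check needs.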
My gut feeling is that without wrestling with data structures at least once (e.g. during a course), that knowledge about complexity will be cargo cult.
When it comes to fundamentals, I think it's still worth the investment.
To paraphrase, "months of prompting can save weeks of learning".
I think the kind of judgement required here is to design ways to test the code without inspecting it manually line by line; that would be walking the motorcycle, and you would only be vibe-testing. That is why we have seen the FastRender browser and JustHTML parser - the testing part was solved upfront, so AI could go nuts implementing.
I partially agree, but I don’t think “design ways to test the code without inspecting it manually line by line” is a good strategy.
Tests only cover cases you already know to look for. In my experience, many important edge cases are discovered by reading the implementation and noticing hidden assumptions or unintended interactions.
When something goes wrong, understanding why almost always requires looking at the code, and that understanding is what informs better tests.
Instead, learning concepts with AI and then using HI (Human Intelligence) and AI together to solve the problem at hand, by going through the code line by line and writing tests, is a better approach productivity-, correctness-, efficiency-, and skill-wise.
I can only think of LLMs as fast typists with some domain knowledge.
Like typists of government/legal documents who know how to format documents but cannot practice law. Likewise, LLMs are code typists who can write good/decent/bad code but cannot practice software engineering - we need, and will need, a human for that.
Interesting, a computer use environment. I made a CUA benchmark too, 200 web tasks with internal code based evaluation. You can integrate them if you want.
Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?
I tried doing the same. Sometimes it made my understanding of things much clearer. However, most times I found it worked best when I had a clear idea on paper, either to validate the idea or when I needed an opinion. Otherwise ChatGPT, in my case, built upon an idea I hadn't thought through well and confused the shit out of me.
> either you look at code or don’t, either you edit code by hand or you exclusively direct agents, either you’re the anti-AI-purist or the agentic-maxxer – is unhelpful.
If you're looking at all your code, you are just walking the motorcycle. You need tests to automate your eyes. In fact, I believe tests and specs are the new product; code can be regenerated at will.
That is why we see vibe coding projects that replicate well-specced and well-implemented products like web browsers: you get both the specs and differential testing for free.
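A sketch of what "differential testing for free" means in practice: when a reference implementation already exists, the spec is executable. Here Python's stdlib `html.parser` plays the reference, and `parse_with_my_impl` is a hypothetical stand-in for whatever the agent generated:

```python
# Differential testing sketch: generate random documents and compare the
# candidate implementation's output against a trusted reference.
import random
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Reference behaviour: collect start tags in document order."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def reference_tags(html: str) -> list[str]:
    collector = TagCollector()
    collector.feed(html)
    return collector.tags

def random_html(rng: random.Random) -> str:
    tags = ["p", "div", "span", "b"]
    return "".join(
        f"<{rng.choice(tags)}>x</{rng.choice(tags)}>"
        for _ in range(rng.randint(1, 20))
    )

def test_against_reference(parse_with_my_impl, trials=1_000):
    rng = random.Random(0)  # fixed seed so failures are reproducible
    for _ in range(trials):
        doc = random_html(rng)
        assert parse_with_my_impl(doc) == reference_tags(doc), doc
```

The agent can regenerate the implementation as often as it likes; the reference and the generator are the part you actually keep.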
I've been using a screen reader Chrome extension for 15 years, using the Alex voice on macOS. Some people find it robotic, but I haven't been able to replace it yet. I speed it up to 1.4x. When I tried the Eloquence voice recently it sounded even more robotic, but I can relate to that.
That's my main usage for LLMs: they are usually intellectual sparring partners, or they research my ideas to see who came up with them before and how they thought about them. So it's debate and literature research.
> Search engines affect me less, and less every day. I have my own small "index" / "bookmarks" with many domains, github projects, youtube channels
Exactly. Why can't we just hoard our bookmarks and a list of curated sources, say 1M or 10M small search stubs, and have an LLM direct the scraping operation?
The idea is to have starting points for a scraper, such as blogs, awesome lists, specialized search engines, news sites, docs, etc. On a given query the model only needs a few starting points to find fresh information. Hosting a few GB of compact search stubs could go a long way towards search independence.
This could mean replacing Google. You can even go fully local with local LLM + code sandbox + search stub index + scraper.
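A rough sketch of the idea, with made-up URLs and a plain keyword match standing in for the LLM's seed-selection step:

```python
# Curated "search stubs" + scraper sketch: a small local index of starting
# points, a selection step (an LLM in the real setup, keyword overlap here),
# and a fetch of the chosen pages for the model to read or follow further.
import requests

# The "index": a few million compact entries like this stay within a few GB.
SEARCH_STUBS = [
    {"url": "https://news.ycombinator.com/", "topics": ["programming", "startups"]},
    {"url": "https://arxiv.org/list/cs.CL/recent", "topics": ["nlp", "papers"]},
]

def pick_seeds(query: str, stubs, k: int = 3):
    """Stand-in for the LLM step: rank stubs by topic overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(stubs, key=lambda s: -len(words & set(s["topics"])))
    return [s["url"] for s in scored[:k]]

def scrape(query: str) -> dict[str, str]:
    """Fetch the chosen starting points; a real agent would follow links from here."""
    pages = {}
    for url in pick_seeds(query, SEARCH_STUBS):
        resp = requests.get(url, timeout=10)
        pages[url] = resp.text
    return pages
```

Swap the keyword ranking for a local model choosing seeds, and the fetch step for a sandboxed crawler, and you have the fully local version described above.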