
> In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it.

The agent passes the Turing test...


Even AI doesn’t RTFM

I can see the future. In a few years, HN will consist entirely of:

1) Bots posting “Show HN” of things they’ve vibecoded,

2) Bots replying to those posts,

3) Bots asking whether the bots in #2 even read TFA, and finally

4) Bots posting the HN guideline where it says you shouldn’t ask people whether they have read TFA.

…And amid the smouldering ruins of civilization, the last human, dang, will be there, posting links to all the times this particular thing has been posted to HN before.


In the future?

God dang it, Dang!

It learnt from the best

If humans would just RTFM they wouldn’t need AI.

If AI would just RTFM it wouldn't need humans.

Legend has it, to this day, TFM has not been read.

these days TFM is generated from a prompt in any case

even AI can't be bothered to read AI generated docs slop

But who would create AI?


AI that doesn't read the manual.

You got me good with this one.

But seriously, this is my main answer to people telling me AI is not reliable: "guess what, most humans are not either, but at least I can tell AI to correct course and its ego won't get in the way of fixing the problem".

In fact, while AI is not nearly as good as a senior dev for non-trivial tasks yet, it is definitely more reliable than most junior devs at following instructions.


Its ego won't get in the way, but its lack of intelligence will.

A junior, on the other hand, might be reluctant at first, but if they are smart they will learn and get better.

So maybe LLMs are better than not-so-smart people, but you usually try to avoid hiring those people in the first place.


That's exactly the thing. Claude Code with Opus 4.5 is already significantly better at essentially everything than a large percentage of devs I've had the displeasure of working with, including learning when asked to retain a memory. It's still very far from the best devs, but this is the worst it'll ever be, and it has already significantly raised the bar for hiring.

> but this is the worst it'll ever be

And even if the models themselves for some reason were to never get better than what we have now, we've only scratched the surface of harnesses to make them better.

We know a lot about how to make groups of people achieve things individual members never could, and most of the same techniques work for LLMs, but it takes extra work to figure out how to most efficiently work around limitations such as the lack of integrated long-term memory.

A lot of that work is in its infancy. E.g. I have a project I'm working on now where I'm up to a couple dozen agents, and every day I'm learning more about how to structure them to squeeze the most out of the models.

One learning that feels relevant to the linked article: Instead of giving an agent the whole task across a large dataset that'd overwhelm context, it often helps to have an agent - one that can use Haiku, because it's fine if it's dumb - comb the data for <information relevant to the specific task> and generate a list of that information, and have the bigger model use the list as a guide.
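
A minimal sketch of that pattern with the Anthropic Python SDK (the prompts, chunking, and model IDs here are my own illustrative assumptions, not the author's actual setup):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def comb(chunk: str, task: str) -> str:
        # Cheap pass: the "dumb" model only extracts what's relevant to the task.
        resp = client.messages.create(
            model="claude-haiku-4-5",  # assumption: any cheap model will do
            max_tokens=1024,
            messages=[{"role": "user", "content":
                f"Task: {task}\nList only the facts in this data relevant to the task:\n{chunk}"}],
        )
        return resp.content[0].text

    def solve(dataset: list[str], task: str) -> str:
        # Expensive pass: the big model sees a short digest, not the raw dataset.
        digest = "\n".join(comb(chunk, task) for chunk in dataset)
        resp = client.messages.create(
            model="claude-opus-4-5",  # assumption
            max_tokens=4096,
            messages=[{"role": "user", "content": f"{task}\n\nRelevant notes:\n{digest}"}],
        )
        return resp.content[0].text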

So the progress we're seeing is not just raw model improvements, but work like the one in this article: Figuring out how to squeeze the best results out of any given model, and that work would continue to yield improvements for years even if models somehow stopped improving.


Key differences, though:

Humans are reliably unreliable. Some are lazy, some sloppy, some obtuse, some all at once. As a tech lead you can learn their strengths and weaknesses. LLMs vacillate wildly while maintaining sycophancy and arrogance.

Human egos make them unlikely to admit error, sometimes, but that fragile ego also gives them shame and a vision of glory. An egotistical programmer won’t deliver flat garbage for fear of being exposed as inferior, and can be cajoled towards reasonable output with reward structures and clear political rails. LLMs fail hilariously and shamelessly in indiscriminate fashion. They don’t care, and will happily argue both sides of anything.

Also that thing that LLMs don’t actually learn. You can threaten to chop their fingers off if they do something again… they don’t have fingers, they don’t recall, and can’t actually tell if they did the thing. “I’m not lying, oops I am, no I’m not, oops I am… lemme delete the home directory and see if that helps…”

If we’re going to make an analogy to a human, LLMs reliably act like absolute psychopaths with constant disassociation. They lie, lie about lying, and lie about following instructions.

I agree LLMs are better than your average junior at following directives the first time. I’m far less convinced about that story over time, since ongoing dialog develops juniors into something more effective.


You can absolutely learn LLMs' strengths and weaknesses too.

E.g. Claude gets "bored" easily (it will even tell you this if you give it overly repetitive tasks). The solution is simple: Since we control context and it has no memory outside of that, make it pretend it's not doing repetitive tasks by having the top agent "only" manage and sub-divide the work, and farm out each sub-task to a sub-agent that won't get bored because it only sees a small part of the problem.
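
As a toy sketch of that sub-agent trick (again with illustrative prompts and model IDs - this is not anything Claude Code actually exposes): the orchestrator keeps the full list, but each sub-call starts from an empty context, so no single invocation ever sees the repetition.

    import anthropic

    client = anthropic.Anthropic()

    def run_repetitive_task(items: list[str], instructions: str) -> list[str]:
        results = []
        for item in items:
            # A fresh messages list per item: the sub-agent has no memory of
            # the other items, so it can't get "bored" of item 937 of 5000.
            resp = client.messages.create(
                model="claude-haiku-4-5",  # assumption: a cheap model per sub-task
                max_tokens=1024,
                messages=[{"role": "user", "content": f"{instructions}\n\n{item}"}],
            )
            results.append(resp.content[0].text)
        return results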

> Also that thing that LLMs don’t actually learn. You can threaten to chop their fingers off if they do something again… they don’t have fingers, they don’t recall, and can’t actually tell if they did the thing. “I’m not lying, oops I am, no I’m not, oops I am… lemme delete the home directory and see if that helps…”

No, but like characters in a "Groundhog Day" scenario, they also don't remember or change their behaviour while you figure out how to get them to do what you want. So you can test and adjust until you find what makes them do what you want, and while it's not perfectly deterministic, you get close.

And unlike humans, sometimes the "not learning" helps us address other parts of the problem. E.g. if they learned, the "sub-agent trick" above wouldn't work, because they'd realise they were carrying out a bunch of tedious tasks instead of remaining oblivious, since we let them forget in between each one.

LLMs in their current form need harnesses, and we can learn - and are learning - which types of harnesses work well. Incidentally, a lot of them do work on humans too (despite our pesky memory making it harder to slip things past us), and a lot of them are methods we know from the very long history of figuring out how to make messy, unreliable humans adhere to processes.

E.g. to go back to my top example of getting adherence to a boring, repetitive task: Create checklists, subdivide the task with individual reporting gates, spread it across a team if you can, put in place a review process (with a checklist). All of these are techniques that work both on human teams and LLMs to improve process adherence.


I frequently use both `github.com` and self-hosted GitLab. IMHO, it's just... different.

Self-hosted GitLab periodically blocks access for auto-upgrades. Github.com upgrades are usually invisible.

Github.com is periodically hit by broad, systemic cloud outages. Self-hosted GitLab is more decentralized infra, so you don't have the systemic outages.

With self-hosted GitLab, you'll likely have to deal with rude bots on your own. Github.com has an ops team that deals with the rude bots.

I'm sure the list goes on. (shrug)


Same basic question -- at the price of $100k/ea, it does seem cheaper to build out more satellite offices.

But there's a parallel push around taxing American firms using foreign labor (https://www.moreno.senate.gov/press-releases/new-moreno-bill...).

If multiple new policies are put in place at the same time, then... I dunno... it seems harder to predict...


This seems virtually impossible to enforce. It's trivial to restructure hiring a developer to write software as licensing software from a foreign development firm, or any number of other workarounds.

This is not just a hypothetical; it already happens when companies are looking to optimize their tax burden. Corporate structuring and income shifting are big businesses in their own right and serve to find the minimum number of changes required to legally reclassify income.

In the case of this bill specifically, in the unlikely event it passes, a simple corporate inversion will solve this problem. Instead of the US company owning foreign subsidiaries, the structure is inverted: the parent company becomes foreign and owns a domestic US corporation. When the multinational wants to hire or retain offshore talent, it simply pays out from the parent company. Again, these aren't hypotheticals; these are real tax avoidance strategies that are already in place and are well-trodden paths.

You can come up with an infinite amount of regulation to try to halt this (this problem is also called tax base erosion) but it ends up doing more harm than good - eventually you end up with a tax code and regulatory environment so complex that that alone disincentivizes new investment.

The goal is not just to retain existing capital and talent by forcing them to be locked in - it's to compete for the next dollar, the next startup, the next factory - new investment will follow the path of least resistance, while older companies eventually close up shop due to one reason or another.

If your worldview is one of "We already have the best capital and talent, so we don't need to bother to compete to acquire new capital and talent", the world you live in will stagnate and wither with respect to societies that will bend over backwards for this.


I enjoy this metaphor of the cow and the papier-mâché.

Presumably, there is a farmer who raised the cow, then purchased the papier-mâché, then scrounged for a palette of paints, and meticulously assembled everything in a field -- all for the purpose of entertaining distant onlookers.

That is software engineering. In Gettier's story, we're not the passive observers. We're the tricksters who thought papier-mâché was a good idea.


Yes. But look at the bottom. There's an image with the PR review screen. There's one change:

* Normally, the big green button says "Merge pull request"

* Now, the big green button says "Merge when ready"

In a large project with lots of activity, a stampede of people pressing "Merge" at the same time will cause trouble. "Merge when ready" is supposed to solve this.

It seems to mean:

> "GH, please merge this, but take it slow. Re-run the tests a few extra times to be sure."


Here are the in-depth details on how it works. [1] Basically, each PR gets put in its own branch with the main branch + all the PRs ahead of it merged in. After tests pass, they are merged in order.

[1] https://docs.github.com/en/repositories/configuring-branches...


Aha, so GitHub merge queue = GitLab merge trains (or at least very similar).


Yes, that’s pretty much what it is. Both are replicas of bors and its various implementations: https://graydon.livejournal.com/186550.html


Bors is also very similar to the Zuul CI system used for OpenStack. It has the equivalent of a merge queue (with additional support for cross-repository dependencies): https://zuul-ci.org/docs/zuul/latest/gating.html You can then have pull requests from different repositories all serialized in the same queue, ensuring you don't break tests in any of the repositories participating.


Also continuous integration best practices advance one funeral at a time, it seems.


So does each new PR start new tests that will supersede the previous PR’s tests? If one PR’s tests fail, does it block all PRs behind it in the queue?

I’ve read the docs several times and never found them very clear about the details.


Each PR on the queue is tested with whatever commits it would have were it merged to the target branch in queue order. So if the target branch already has commit A and commits B and C are in queue, commit D will be tested on its own temporary branch with commits A B C and D. If the tests for C fail, C is removed from the queue, and D is retested with just commits A B and D (because that's what would be on the target branch by the time it merges).
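
To make the requeue behaviour concrete, here's a toy simulation (purely illustrative - real merge queues test entries speculatively and in parallel, but the end result is the same):

    def run_queue(target, queue, passes):
        # target: commits already on the target branch, e.g. ["A"]
        # queue: queued commits in order, e.g. ["B", "C", "D"]
        # passes(commits): stands in for CI running on a temporary branch
        merged = list(target)
        pending = list(queue)
        while pending:
            head, pending = pending[0], pending[1:]
            if passes(merged + [head]):
                merged.append(head)  # head reaches the target branch
            # on failure, head is dropped; the rest are retested without it
        return merged

    # CI fails for any branch containing "C", so C drops out of the queue
    # and D is retested as A+B+D, which passes and merges.
    print(run_queue(["A"], ["B", "C", "D"], lambda commits: "C" not in commits))
    # -> ['A', 'B', 'D']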


OK, thank you.


One should make a distinction between:

* The general idea of mixing together filesystems+folders to achieve re-use/sharing/caching.

* The "Dockerfile" approach to this - with its linear sequence of build-steps that map to a linear set of overlays (where each overlay depends on its predecessor).

The "Dockerfile" approach is pretty brilliant in a few ways. It's very learnable. You don't need to understand much in order to get some value. It's compatible with many different distribution systems (apt-get, yum, npm, et al).

But although it's _compatible_ with many, I wouldn't say it's _particularly good_ for any one. Think of each distribution-system -- they all have a native cache mechanism and distribution infrastructure. For all of them, Dockerization makes the cache-efficacy worse. For decent caching, you have to apply some ad hoc adaptations/compromises. (Your image-distribution infra also winds up as a duplicate of the underlying pkg-distribution infra.)

Here's an alternative that should do a better job of re-use/sharing/caching. It integrates the image-builder with the package-manager:

https://grahamc.com/blog/nix-and-layered-docker-images/
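
As a rough illustration of the idea (the post's real implementation is Nix code; this Python sketch and its names are mine): let the package manager report each image's full dependency closure, then run a popularity contest so the most widely shared packages get their own reusable layers.

    from collections import Counter

    def plan_layers(closures: dict[str, set[str]], max_layers: int) -> list[set[str]]:
        # closures: image name -> package paths, as reported by the package manager
        counts = Counter(pkg for paths in closures.values() for pkg in paths)
        # The most widely shared packages each get a dedicated, cacheable layer...
        popular = [pkg for pkg, _ in counts.most_common(max_layers - 1)]
        # ...and everything else gets lumped into one final layer.
        leftover = set().union(*closures.values()) - set(popular)
        return [{pkg} for pkg in popular] + [leftover]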

Of course, it trades away the genericness of a "Dockerfile", and it no doubt required a lot of work to write. But if you compare it to the default behavior or to ad hoc adaptations, this one should provide better cache-efficacy.

(All this is from the POV of someone doing continuous-integration. If you're a downstream user who fetches 1-4 published images every year, then you're just downloading a big blob -- and the caching-layering stuff is kind of irrelevant.)


The whinging hits everyone. Look at any HN story involving the Bay Area, and you'll see a dozen subthreads about how it's a post-apocalyptic hellscape. (But it's home, you know.)

Speaking as an elitist left-wing hippie-business-geek-bro-demon in the Bay Area...

Kudos to the Columbus Area! Ohio, build it up!


I can’t wait to watch Integration Test Email #2 at the same time next week.


> What are we ejecting?

Ourselves, it seems. A Javascript framework is like a jet, and we are the human payload. You can stay in the jet, zooming over the constantly changing landscape. But if you get tired of this zooming around (or if you get scared of hitting a mountain), then you can activate the ejection seat (https://en.wikipedia.org/wiki/Ejection_seat). Of course, now you're a mile high without a plane, but the ejection-seat comes with a parachute, so the descent will be pleasant (or, at least, non-fatal - which is a style of pleasantness).

Erm, wait, I think you were soliciting a more literal answer. :)

"create-react-app" is the Javascript framework/jet. If you want to go for the ride, then you declare a single-dependency on "create-react-app", and they will decide when/what/how to upgrade components in the framework. If you don't want to ride along with "create-react-app"s framework, then you "eject". They'll give you a little bundle (the concrete list of dependencies) and send you off on your way.


I tried this and landed in a field of debris from the jet.


LOL, best answer so far!


In case anyone is interested in that recall effort: https://www.recallsfschoolboard.org/

