> Who knows why? I’m usually more willing to spend than she is, and I bet that's represented on my user profile. I was paying with a gift card, which surely contributes. Maybe it was a price scraping update, comparison shopping detection, or a system that explores “face-in-the-door” high prices before backing down. From the outside, no one really knows.
The most obvious possibility omitted is that your wife got the first, easy, cheap car and then Uber had to quote you a higher price to get a second car. Cars don't fall from the sky; if two people successively ask for bids, how else could it work? What if the app quoted you both the cheap price for the only car within X blocks, and you bought it before she did? Is it suddenly going to go 'oops sorry, changed my mind, it now costs twice as much'? Sounds like a very bad experience to me! More sensible to give the first person a low quote and then when - unexpectedly and unpredictably - someone requests something similar, quote them the higher price reflecting the sudden local micro-shortage.
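To make the idea concrete, here's a toy sketch of that micro-shortage logic; the markup rule and all numbers are invented for illustration, not anything Uber actually does:

```python
# Toy illustration only: quote the first requester the normal fare, and once
# that request claims the nearest car, quote later requesters a higher fare
# reflecting the local micro-shortage. All numbers are made up.

def quote(base_fare: float, cars_nearby: int, pending_requests: int) -> float:
    """Return a fare quote given current local supply and demand."""
    unclaimed = cars_nearby - pending_requests
    if unclaimed >= 1:
        return base_fare                       # plenty of supply: normal price
    shortage = 1 + (pending_requests - cars_nearby)
    return base_fare * (1.0 + 0.5 * shortage)  # scarce: mark up per missing car

# Two people in the same spot asking one after the other:
print(quote(10.0, cars_nearby=1, pending_requests=0))  # first ask: 10.0
print(quote(10.0, cars_nearby=1, pending_requests=1))  # second ask: 15.0
```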
(author here) I believe I had checked first in this case, which is why it was surprising. Sorry not to mention that in the post. This was in San Francisco, and there were multiple cars shown on the map.
In my experience, I usually don't see this kind of price change before the request has actually been confirmed - and I have seen Lyft change the price between showing me the estimate and confirming the request (with an apologetic confirmation dialog, possibly only after some holding period has timed out).
Maybe in my case where the high quote came first, the opposite scenario happened - a glut of drivers appeared between my request and hers, raising supply.
Opaque pricing is powerful partly because we don't know. This enables people to construct a plausible story to explain any price.
It would be great if this were the case. Unfortunately, Uber has been documented to practice individual price discrimination at a massive scale, using factors like whether you're in a low-income or a high-income neighborhood, individual rider "price sensitivity", and so on, in addition to market conditions (surge pricing), and as a result they have netted billions in profit [1]. I would guess this is why Uber AI researchers are paid so much.
That raises an interesting question: if 10 people in a room request ubers without confirming the ride-hail, does the price go up for successive requests?
> It's like how software engineers who are at Google are worse if they're better competitive programmers.
That's not true. It didn't replicate and Norvig has said as much somewhere on HN, IIRC.
(I also agree with the other criticisms that this 'old vs young' setup in OP is obviously at least partially, and perhaps entirely, regression to the mean and Berkson.)
> I mean, its open source so people can create benchmark and independently verify if the AI was wrong and then have the claims be passed to the author.
Thank you for volunteering. I look forward to your results.
> Thank you for volunteering. I look forward to your results.
Sure can you wait a few weeks tho? I know nothing about benchmarking so gonna learn it first and I have a few tests to prepare for irl.
I do feel like someone else more passionate about the project should try to pick the benchmarking though.
I don't mind benchmarking it, but I only know tools like hyper for benchmarks. I have played with my fair share of zip archives and their random-access retrieval, but I feel like even that would vary from source to source.
There are some experienced people in here who are really good at what they do. I just wanted to say that if someone's interested, already has the domain-specific knowledge to benchmark, and enjoys it in the first place, benchmarking this AI-built project shouldn't be much of a problem in comparison.
That whole subreddit has unfortunately become inundated with AI slop.
It used to be a decent resource to learn about what services people were self hosting. But now, many posts are variations of, “I’ve made this huge complicated app in an afternoon please install it on your server”. I’ve even seen a vibe-coded password manager posted there.
Reputable alternatives to the software posted there exist a huge amount of the time. Not to mention audited alternatives in the case of password managers, or even just actively maintained alternatives.
I'm a moderator for a decently large programming subreddit, and I'd estimate that about half the project submissions are now obvious slop. You get a very good nose for sniffing that stuff out after a while, though it can be frustrating when you can't really convince other people beyond going "trust me, it's slop".
There's also severe selection effects: what documents have been preserved, printed, and scanned because they turned out to be on the right track towards relativity?
I think my takeaway is that you are seeing mostly mode-collapse here. There is a high consistency across all of the supposedly different personalities (higher than the naive count would indicate - remember that the stochastic nature of responses will inflate the number of 'different' responses, since OP doesn't say anything about sampling a large number of times to get the true response).
You are right about mode-collapse -- and that observation is exactly what makes this interesting.
In my other comment here, I described The Sims' zodiac from 1997: Will Wright computed signs from personality via Euclidean distance to archetypal vectors, displayed them cosmetically, and wrote zero behavioral code. The zodiac affected nothing. Yet testers reported bugs: "The zodiac influence is too strong! Tune it down!"
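For the curious, here's a minimal sketch of that nearest-archetype computation; the trait axes match The Sims (neat, outgoing, active, playful, nice), but the archetype vectors below are invented for illustration, not Will Wright's actual numbers:

```python
import math

# Sketch of "zodiac = nearest archetype": each sign gets an archetypal
# personality vector, and a Sim is assigned the sign whose vector is closest
# in Euclidean distance. Values are made up for illustration.

ARCHETYPES = {
    "Aries":  (5, 8, 9, 7, 3),   # (neat, outgoing, active, playful, nice)
    "Taurus": (7, 4, 3, 5, 8),
    "Gemini": (4, 9, 6, 9, 5),
    # ... one vector per sign
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def zodiac_sign(personality):
    """Return the sign whose archetype vector is nearest to this personality."""
    return min(ARCHETYPES, key=lambda sign: euclidean(ARCHETYPES[sign], personality))

print(zodiac_sign((6, 8, 8, 8, 4)))  # -> "Aries" for these made-up vectors
```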
Your "mode-collapse with stochastic noise" is the same phenomenon measured from the other direction. In The Sims: zero computed difference, perceived personality. In this LLM experiment: minimal computed difference, perceived personality. Same gap.
Will called it the Simulator Effect: players imagine more than you simulate. I would argue mode-collapse IS the Simulator Effect measured from the output side.
But here is where it becomes actionable: one voice is the wrong number of voices.
ChatGPT gives you the statistical center -- mode-collapse to the bland mean. The single answer that offends no one and inspires no one. You cannot fix this with better prompting because it is the inevitable result of single-agent inference.
Timothy Leary built MIND MIRROR in 1985 -- psychology software visualizing personality as a circumplex, based on his 1950 PhD dissertation on the Interpersonal Circumplex. The Sims inherited this (neat, outgoing, active, playful, nice). But a personality profile is not an answer. It is a lens.
The wild part: in 1970, Leary took his own test during prison intake, gamed it to get minimum security classification (outdoor work detail), and escaped by climbing a telephone wire over the fence. The system's own tools became instruments of liberation.
MOOLLM's response: simulate an adversarial committee within the same call. Multiple personas with opposing propensities -- a paranoid realist, an idealist, an evidence prosecutor -- debating via Robert's Rules. Stories that survive cross-examination are more robust than the statistical center.
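To make that concrete, here's a rough sketch of the 'committee in one call' idea; this is not MOOLLM's actual code, the prompt wording is mine, and `complete` stands in for whatever LLM client you use:

```python
# Sketch of an "adversarial committee in a single call": build one prompt that
# asks the model to voice several opposing personas and then reconcile them.
# Persona names come from the comment above; the briefs are invented.

PERSONAS = {
    "Paranoid Realist": "assumes the plan will fail; lists concrete failure modes",
    "Idealist": "argues for the most ambitious version; ignores sunk costs",
    "Evidence Prosecutor": "demands a source or measurement for every claim",
}

def committee_prompt(question: str) -> str:
    roles = "\n".join(f"- {name}: {brief}" for name, brief in PERSONAS.items())
    return (
        f"Question: {question}\n\n"
        "Debate this as a committee under Robert's Rules. Personas:\n"
        f"{roles}\n\n"
        "Each persona speaks once, then each may rebut once. "
        "Finish with a motion stating only the conclusions that survived "
        "cross-examination."
    )

# answer = complete(committee_prompt("Should we ship the log-only PKM design?"))
```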
The bigger project is MOOLLM -- treating the LLM as eval() for a microworld OS. K-lines, prototype-based instantiation, many-voiced deliberation. The question I keep wrestling with: mode-collapse as limitation vs feature. The Sims exploited it. MOOLLM routes around it.
Would value your take on the information-theoretic framing -- particularly whether multi-agent simulation actually increases effective entropy or just redistributes it.
The MOOLLM Eval Incarnate Framework:
Skills are programs. The LLM is eval(). Empathy is the interface. Code. Graphics. Data. One interpreter. Many languages. The Axis of Eval.
Yes, and PKMs in general. Like labeling your emails by topic in Gmail. The problem is that the 'toil' keeps piling up, while the value gained is increasingly hard to see.
No, but I haven't been following the space. (I suspect that with Claude Code-level coding agents, you should be able to do something amazing that thoroughly obsoletes Obsidian/Roam/org-mode, but I don't actually know of anything.)
I've been focused on creative writing, with poetry as my test case, to see what the bottlenecks are to truly amplifying myself through LLMs (as opposed to helping my boss automate away my job or spamming the Internet more efficiently).
I haven't tried using agents to make a full editor, but Claude Code and Gemini CLI are actually quite good at writing Obsidian plugins, or modifying existing ones. You can start with an existing one that's 90% of what you want (which tends to be the case with note-taking/PKM systems: people are so idiosyncratic that solutions built by others almost work, but not quite) and tweak it to be exactly right for you.
My own Obsidian setup has improved quite a bit in the last couple months because I can just ask Claude to change one or two things about plugins I got from the store.
Writing or tweaking plugins is great, but it's not a paradigm shift (and risks a lot more toil, because now you have to be your own PM or deal with patches/merges, on top of being a reference librarian and copyeditor etc). I feel like if you have a quasi-superintelligence in a box which can run your PKM for you, and you were designing from the ground up with that in mind - knowing that Claude Code is only going to get much better & cheaper - you would not settle for 'write or modify an Obsidian plugin'. You would get something much different. But 'write a plugin' is basically at the 'horseless carriage' level for me.
What I have in mind is something far more radical. There's an idea I am calling 'log-only writing' where you stop editing or rearranging your notes at all, and you switch to pure note-taking and stream-of-consciousness braindumping, and you simply have the LLM 'compile' your entire history down into whatever specific artifact you need on demand - whether that's a web of flashcards or a blog post or a long essay or whatever. See https://gwern.net/blog/2024/rss + https://gwern.net/nenex , combined with the LLM reasoning and brainstorming 'offline' using the prompts illustrated by my poems.
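A minimal sketch of what that 'compile' step could look like, assuming an append-only plain-text log and some LLM call (`complete` is a placeholder, not a real API):

```python
from pathlib import Path

# Sketch of "log-only writing": never edit the notes, just append, and compile
# the whole history into whatever artifact is needed right now.
# `complete` is a placeholder for an LLM call, not a real library function.

LOG = Path("braindump.log")

def append(note: str) -> None:
    """Append-only capture; no editing or rearranging ever happens here."""
    with LOG.open("a", encoding="utf-8") as f:
        f.write(note.rstrip() + "\n---\n")

def compile_artifact(request: str) -> str:
    """Compile the full log into one artifact (flashcards, essay, post...)."""
    history = LOG.read_text(encoding="utf-8")
    prompt = (
        f"Here is my complete notes log:\n{history}\n\n"
        f"Compile it into: {request}"
    )
    return complete(prompt)  # placeholder LLM call

# append("Idea: treat the PKM as a compiler target, not a filing cabinet.")
# print(compile_artifact("a set of spaced-repetition flashcards"))
```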
That's fair, I guess when I hear "radical overhaul" when discussing PKMs I immediately start worrying about the overload and burnout that doomed my first attempts at Obsidian (see my sibling comment), whereas right now I have a system that works very well for me, especially now that I can just ask Claude to scan the whole directory if I want to ask it questions. But if you do come up with some new blue-sky vision for PKMs, I'd love to at least take a look.
This is the way. If you symlink the .claude directory (so Obsidian can see the files), then you can also super easily add and manage Claude skills.
I've spent 20 years living in the terminal, but with Claude Code I'm more and more drafting markdown specs, organizing context, building custom views / plugins / etc. Obsidian is a great substrate for developing personal software.
The conclusion here seems largely unjustified by the data and indeed is difficult to relate to simple distributions or statistics:
> Increasingly, public institutions seem to exist to manage the obsessions of a tiny number of neurotic—and possibly malicious—complainers.
Why would anyone complain about airport noise when it is ~100% guaranteed to do them no good, and almost all the benefits go to everyone else even if it somehow did anything? Just thinking like an economist here... (Indeed, if a large fraction of locals did complain about something like airport noise, that would itself be highly suspicious to me - as it would indicate an organized campaign or an issue which has become politicized in some way and is now a pretext for something else entirely like a culture war.)
And if there is something I've learned about design and problems, it's that you can have a huge problem, and you are lucky if even 1% will ever tell you.
Your website could be down, and if even 1 person takes the risk of going out of their way to tell you, you should thank your lucky stars that you have such proactive, public-spirited readers!
> I thought people recognized that they don't appear out of nowhere.
I don't think that paper is widely accepted. Have you seen the authors of that paper, or anyone else, use it to successfully predict (rather than postdict) anything?
I haven't paid attention, and the paper seems to be arguing against the existence of the phenomenon of emergent behavior and is not related to predicting what is possible with greater scale.
> is not related to predicting what is possible with greater scale.
If they can't predict new emergence, then 'explaining' old emergence by post hoc prediction with bizarre newly-invented metrics would seem to be irrelevant and just epicycles. You can always bend a line as you wish in curve-fitting by adding some parameters.
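The curve-bending point is easy to demonstrate with a toy example: give a polynomial as many parameters as you have points (made-up numbers below) and it will pass through anything, including an apparent 'emergent' jump:

```python
import numpy as np

# With enough parameters, a fit passes through any points you like, so a
# post-hoc "explanation" via metric or curve choice proves little.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.1, 0.0, 0.2, 5.0])   # looks like a sudden "emergent" jump

for degree in (1, 4):
    coeffs = np.polyfit(x, y, degree)
    residual = np.abs(np.polyval(coeffs, x) - y).max()
    print(f"degree {degree}: max residual {residual:.3f}")
# degree 1 fits poorly; degree 4 (5 parameters for 5 points) fits exactly.
```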