> "The Moltbook team has given agents a way to verify their identity and connect with one another on their human's behalf," Shah says. "This establishes a registry where agents are verified and tethered to human owners."
Have they? Did I miss something? Last I checked, there was no verification, and most of the content shared from that site turned out to have been posted not by LLMs but by (human) spammers focused on crypto grifts and creating hype.
Anyone closer to this can happily correct me, but is there anything here of that sort, anything of value?
Compared to any prior social media acquisition, there doesn't seem to be either a technically skilled team (considering the exploits) or an existing user base, given that said user base was A) supposed to be bots by nature and B) didn't even reliably turn out to be that, making this the first time someone wanted bots and didn't even get them.
Far be it from me to make strategic decisions for a company like Meta/Facebook, but the lack of a recent Llama release might merit more focus than spending on whatever this is.
I had never heard of the thing and checked it out. It appears to be an industrial-scale slop generation machine: exactly what you would expect if LLMs were let loose to recreate Reddit and introspect on their current context and SOUL.md or the other nonsense that OpenClaw can be customised with.
Not much human content that I could see; probably even the crypto grifters got bored with it after a couple of days.
The "acquisition" must have given guys that made the thing some favourable terms, and it was a condition for them to even consider working at Meta. Because there is no way a global top 10 market cap company announces this deal willingly.
I have been following your models and semi-regularly running them through evals since early summer. With the existing Coder and Mercury models, I always found that the trade-offs were not worth it, especially as providers with custom inference hardware could push model tp/s increasingly higher and latency lower.
I can see some very specific use cases for an existing PKM project, especially using the edit model for tagging and potentially retrieval, for both of which I am still using Gemini 2.5 Flash-Lite.
The pricing makes this very enticing and I'll really try to get Mercury 2 going. If tool calling and structured output are truly possible with this model as consistently as with Haiku 4.5 (which I still rate very highly), that may make a few use cases far more viable for me, as long as task adherence, task inference, and task evaluation aren't significantly worse than with Haiku 4.5. Gemini 3 Flash was less ideal for me: while it is significantly better than 3 Pro, there are still issues regarding CLI usage that make it unreliable for me.
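To make that concrete, this is roughly the minimal probe I run for tool-calling consistency. It is a sketch assuming an OpenAI-compatible endpoint; the base URL, model id, and the lookup_note tool are placeholders rather than anything from the official docs:

    # Minimal tool-calling probe against an OpenAI-compatible API.
    # base_url, model id, and the tool are assumptions; adjust to the real docs.
    from openai import OpenAI

    client = OpenAI(base_url="https://api.inceptionlabs.ai/v1", api_key="sk-...")

    tools = [{
        "type": "function",
        "function": {
            "name": "lookup_note",  # hypothetical PKM retrieval tool
            "description": "Fetch notes from a PKM store by tag.",
            "parameters": {
                "type": "object",
                "properties": {"tag": {"type": "string"}},
                "required": ["tag"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="mercury-2",  # assumed model id
        messages=[{"role": "user", "content": "Pull up my notes tagged 'fabrics'."}],
        tools=tools,
    )
    # A model with consistent tool calling should emit a tool_call here, not prose;
    # running this N times gives a rough consistency rate.
    print(resp.choices[0].message.tool_calls)

Running the same loop with a JSON schema in response_format covers the structured output half of the question.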
Regardless of that, I'd like to provide some constructive feedback:
1.) Unless I am mistaken, I couldn't find a public status page. Doing some very simple testing via the chat website, I got an error a few times and wanted to confirm whether it was server load/a known issue or not, but couldn't.
2.) Your homepage looks very nice, but parts of it struggle on both Firefox and Chromium, with performance poor enough to affect usability. The highlighting of the three recommended queries on the homepage lags heavily, as does the header bar, and the switcher between Private and Commercial on the Early Access page moves at a very sluggish pace. The band showcasing your partners below lags as well. I removed the very nice looking diffusion animation you have in the background and found that memory and CPU usage returned to normal levels and all the described issues were resolved, so perhaps this could be optimized further. It makes navigating the website rather frustrating, and first impressions are important, especially considering the models are also supposed to be used for coding.
3.) I can understand if that is not possible, but it would be great if the reasoning traces were visible on the chat homepage. Will check later whether they are available on the API.
4.) Unless I am mistaken, I can't see the maximum output tokens anywhere on the website or documentation. It would be helpful if that were front and center. Is it still at roughly 15k?
5.) Consider changing the way web search works on the chat website. Currently, it is enabled by default but only seems to be used when the model is explicitly prompted to do so (and even then the model doesn't search in every case). I can understand why web search is used sparingly: the swift experience is what you want to put front and center, and every web search adds latency. But may I suggest disabling web search by default, and then setting the model up so that, when web search is enabled, it relies on that resource more consistently?
6.) "Try suggested prompt" returns an empty field if a user goes from an existing chat back to the main chat page. After a reload, the suggested prompt area contains said prompts again.
One thing that I very much like, and that has gotten my mind racing for PKM tasks, are the follow-up questions, which are provided essentially instantly. I can see some great value there, even combining that with another model's output to assist a user in exploring concepts they may not be familiar with, but I will have to test, especially on the context/haystack front.
Appears the only difference to 3.0 Pro Preview is Medium reasoning. Model naming has long since stopped even trying to make sense, but considering 3.0 is itself still in preview, increasing the version number for such a minor change is not a move in the right direction.
My issue is that we haven't even gotten the release version of 3.0, which is itself still in Preview, so I may stick with 3.0 until that has been deemed stable.
Basically, what does the word "Preview" mean if newer releases happen before a Preview model is stable? In prior Google models, Preview meant that there would still be updates and improvements to said model prior to full deployment, something we saw with 2.5. Now there is no meaning or reason for this designation to exist if they pass over a 3.0 that is still in Preview when shipping model improvements.
Given the pace AI is improving and that it doesn't give the exact same answers under many circumstances, is the [in]stability of "preview" a concern?
Should have clarified initially what I meant by stable, especially because it isn't widely known how these terms are defined for Gemini models. I am not talking about getting consistent output from a non-deterministic model, but about stability from a usage perspective, in the way Google uses the word "stable" to describe their model deployments [0]. "Preview" in regard to Gemini models means a few very specific restrictions, including far stricter rate limits and a very tight 14-day deprecation window, making them models one cannot build on.
That is why I'd prefer for them to finish the rollout of an existing model before starting work on a dedicated new version.
Minor version bumps are good, and I want model providers to communicate changes. The issue I am having is that Gemini "preview" class models have different deprecation timelines and rate limits, making them impossible to rely on for professional use cases. That's why I'd prefer they finish the 3.0 rollout prior to putting resources into deploying a second "preview" class model.
For a stable deployment, Google needs a sufficient amount of hardware to guarantee inference, and having two Pro models running makes that even more challenging: https://ai.google.dev/gemini-api/docs/models
In my evals, I was able to reproduce rather reliably an increase in output token count of roughly 15-45% compared to 4.5, but this was in large part limited to task inference and task evaluation benchmarks. These are made up of prompts that I intentionally designed to be less than optimal, either lacking crucial information (requiring a model to make an inference to accomplish the main request) or including a request for a suboptimal or incorrect approach to resolving a task (testing whether and how a model weighs the prompt against pure task adherence). The clarifying questions many agentic harnesses try to provide (with mixed success) are a practical example of both capabilities, and something I rate highly in models, as long as task adherence isn't affected too negatively because of it.
In either case, there has been an increase between 4.1 and 4.5, as well as now another jump with the release of 4.6. As mentioned, I haven't seen a 5x or 10x increase; a bit below 50% for the same task was the maximum I saw. And in general, for more opaque input or when a better approach is possible, I do think using more tokens for a better overall result is the right approach.
In tasks which are well authored and do not contain such deficiencies, I have seen no significant difference in either direction in terms of pure output token counts. However, with models being what they are, and with past hard-to-reproduce regressions/output quality differences that additionally only affected a specific subset of users, I cannot make a solid determination.
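For reference, the measurement loop is nothing fancier than the sketch below, assuming the Anthropic SDK; the model ids are illustrative and the two prompts stand in for the actual benchmark set:

    # Sketch: compare output-token counts on deliberately underspecified prompts.
    # Model ids are assumptions; the prompt list stands in for a real benchmark set.
    import anthropic

    client = anthropic.Anthropic()
    prompts = [
        "Summarise the attached spec.",          # lacks the spec: forces inference
        "Fix this by catching all exceptions.",  # suboptimal approach: invites pushback
    ]

    def avg_output_tokens(model: str) -> float:
        totals = []
        for p in prompts:
            resp = client.messages.create(
                model=model, max_tokens=2048,
                messages=[{"role": "user", "content": p}],
            )
            totals.append(resp.usage.output_tokens)
        return sum(totals) / len(totals)

    old = avg_output_tokens("claude-sonnet-4-5")
    new = avg_output_tokens("claude-sonnet-4-6")  # assumed id
    print(f"delta: {100 * (new - old) / old:+.1f}%")

With only a handful of prompts, sampling noise dominates; the 15-45% figure only stabilises across a reasonably large set and multiple runs per prompt.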
Regarding Sonnet 4.6, what I noticed is that the reasoning tokens are very different compared to any prior Anthropic model. They start out far more structured, but then consistently turn more verbose, akin to a Google model.
Unless I am mistaken, that is all plain old Markdown, arguably the easiest format to migrate for such data that there can possibly be.
Heck, that was half the pitch behind Obsidian: even if the project someday ended, the Markdown would remain. And switching between Obsidian and e.g. Logseq shows the ease of doing so.
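To illustrate just how little lock-in there is, here is a sketch that treats a vault as nothing but text files; the vault path is an assumption:

    # Sketch: an Obsidian-to-Logseq "migration" is mostly just copying .md files.
    # This walks a vault and lists [[wikilinks]], a syntax both tools share.
    import pathlib, re

    vault = pathlib.Path("~/vault").expanduser()  # assumed vault location
    link = re.compile(r"\[\[([^\]|#]+)")

    for note in vault.rglob("*.md"):
        text = note.read_text(encoding="utf-8")
        for target in link.findall(text):
            print(f"{note.name} -> {target.strip()}")

Everything a script like this can see, any future tool can see too, which is the whole point of the format.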
Sober, fair analysis that covers what a large contingent on here was commenting on the original release. I like the concise summary he did, though, as is so often the case with claims about LLM capability that go beyond what is real, this won't reach most people, unfortunately.
The official release by Anthropic is very light on concrete information [0]. It only contains a small selection of very brief examples and lacks history, context, etc., making it very hard to glean any reliable information from it. I hope they'll release a proper report on this experiment; as it stands, it is impossible to say how much of this reflects actual, tangible flaws versus the unfortunately ever-growing stream of misguided bug reports and pull requests many larger FOSS projects are suffering from at an alarming rate.
Personally, while I get that 500 sounds more impressive to investors and the market, I'd be far more impressed by a detailed, reviewed paper that showcases five to ten concrete examples, each detailed with the full process and a response from the team behind the potentially affected code.
It is far too early for me to make any definitive statement, but my earliest testing does not indicate any major jump between Opus 4.5 and Opus 4.6 that would warrant such an improvement. I'd love nothing more than to be proven wrong on this front and will of course continue testing.
> Unless they signed a treaty agreeing to abide by it, they're their own sovereign entities and their businesses don't have to comply with remote EU laws.
What is your opinion concerning laws such as FATCA, which apply to non-US entities when working with US citizens abroad?
Other laws that apply to the US's own citizens abroad, I can kind of get in line with. Like even if you go to a country where murder and mayhem are legal, you can't go there on vacation, rack up a body count, then come back to Wyoming and go back to work on Monday.
Other laws that apply to non-citizens abroad, I'm against, of course. We don't have the moral right to legislate what someone in China can and can't do. However, prosecuting them for that should they enter the US is a different animal. If you run a scam farm and defraud a million Americans, then go to Disneyland on vacation, you should plan on having a bad time. Similarly with GDPR and other EU-local laws: violate them outside the EU, but it'd be wise to skip Barcelona on your next world tour.
But neither of your described scenarios applies to either of the two.
Both FATCA and GDPR apply to entities/companies that deal with citizens from their respective jurisdiction. FATCA applies e.g. to foreign banks handling US customers, GDPR to foreign data processors handling EU user data.
If you don't want either to apply to you, easy: just don't handle US customers' money or process EU user data.
The penalty for non-compliance comes down to: you can't do business with the US government anymore, which is a huge bummer for any financial institution. If you don't care about that, say because you're a credit union servicing a small town in Brazil and you had a single American move there, I imagine you'd also ignore it.
I am by no means an expert, but as far as I am aware, FATCA violations carry somewhat higher penalties than what you suggest and are very much not limited to "you can't do business with the US government anymore".
Also, even if "you're a credit union servicing a small town in Brazil", and even if the penalty were as limited as you think it is, I doubt even a smaller institution could survive losing access to US securities, etc.
Such laws and policies are a blatant overreach. However, the US is a superpower, so if we act inappropriately, smaller economies simply have to tolerate it to a large extent. It's no different than China throwing their weight around with their neighbors.
The EU jumping on that bandwagon was predictable but I don't think it's a good thing. We all ought to strive for a higher moral standard.
I have yet to experience any degradation in the coding tasks I use to evaluate Opus 4.5, but I did see a rather strange and reproducible worsening in prompt adherence on non-coding tasks since the third week of January.
Very simple queries, even those easily answered via regular web searching, have begun to consistently produce inaccurate results with Opus 4.5, despite the same prompts previously yielding accurate results.
One of the tasks that I already thought was fully saturated, as most recent releases had no issues solving it, was to request a list of material combinations for fabrics used in bag construction that utilise a specific fabric base. In the last two weeks, Claude has consistently and reproducibly provided results which deviate from the requested fabric base, making the results inaccurate in a way that a person less familiar with the topic may not notice instantly. There are other queries of this type, on other topics I am nerdily familiar with to a sufficient degree to notice such deviations from the prompt (e.g. motorcycle-history-specific queries), so I can say this behaviour isn't limited to the topic of fabrics and bag construction.
Looking at the reasoning traces, Opus 4.5 even writes down the correct information, yet somehow provides an incorrect final output anyway.
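This particular failure mode is cheap to catch automatically. A sketch using the Anthropic SDK's extended thinking, where the model id, query, and expected substring are all illustrative:

    # Sketch: detect "right in the trace, wrong in the answer" for a fixed query.
    # Model id, query, and the expected substring are illustrative.
    import anthropic

    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-opus-4-5",  # assumed id
        max_tokens=4096,
        thinking={"type": "enabled", "budget_tokens": 2048},
        messages=[{"role": "user", "content": "List fabric combinations using base X."}],
    )

    trace = "".join(b.thinking for b in resp.content if b.type == "thinking")
    answer = "".join(b.text for b in resp.content if b.type == "text")
    expected = "base X"  # whatever constraint the prompt pinned down
    if expected in trace and expected not in answer:
        print("correct in reasoning, dropped in final output")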
What makes this so annoying is that in coding tasks, with extensive prompts that require far greater adherence to very specific requirements in a complex code base, Opus 4.5 does not show such a regression.
I can only speculate about what may lead to such an experience, but for non-coding tasks I have seen a regression in Opus 4.5, whereas for coding I did not. I'm not saying there is none, but I wanted to point it out, as such discussions are often primarily focused on coding, where I find it can be easier to see potential regressions where there are none as a project goes on and tasks become inherently more complex.
My coding benchmarks are a series of very specific prompts modifying a few existing code bases in some rather obscure ways, with which I regularly check whether a model severely deviates from what I'd seen previously. Each run starts with a fresh code base and some fairly simple tasks, then gets increasingly complex, with the later prompts not yet having been solved by any LLM I have gotten to test. Partly this originated from my subjective experience with LLMs early on: I found a lot of things worked very well, but then, as the project went on and I tried more involved things the model struggled with, I felt like the model was overall worse, when in reality what had changed were simply the requirements and task complexity as the project grew and the easier tasks had already been completed. In this type of testing, Opus 4.5 this week got as far, and provided a result as good, as the model did in December. Of course, past regressions were limited to specific users, so I am not saying that no one is experiencing reproducible regressions in code output quality, merely that I cannot reproduce them in my specific suite.
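For anyone who wants to replicate that setup, a rough skeleton is below; the repo URL, prompt files, and run_model are placeholders for your own harness and baselines:

    # Skeleton of the suite described above: fresh checkout per run, prompts applied
    # in increasing complexity, output diffed against a stored baseline.
    import pathlib, shutil, subprocess, tempfile

    PROMPTS = sorted(pathlib.Path("prompts").glob("*.txt"))  # 01-simple.txt ... 09-unsolved.txt

    def run_model(model: str, prompt: str, cwd: str) -> str:
        ...  # placeholder: call your agent harness / CLI of choice here
        return ""

    def run_suite(model: str, repo: str = "https://example.com/testbed.git") -> None:
        workdir = tempfile.mkdtemp()
        subprocess.run(["git", "clone", repo, workdir], check=True)  # fresh code base
        for prompt in PROMPTS:
            result = run_model(model, prompt.read_text(), cwd=workdir)
            baseline = pathlib.Path("baselines") / model / prompt.name
            if baseline.exists() and result != baseline.read_text():
                print(f"deviation on {prompt.name}")  # flag for manual review
        shutil.rmtree(workdir)

The point of the fixed baselines is exactly the December comparison above: a deviation flag tells you something changed on the provider side, not whether the new output is better or worse; that part stays manual.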
I've noticed a degradation in Opus 4.5, and also in Gemini-3-Pro. For me, it was a sudden, rapid decline in adherence to specs in Claude Code. On an internal benchmark we developed, Gemini-3-Pro also dramatically declined, going from being clearly beyond every other model (as the public benchmarks would lead you to believe) to being quite mediocre: delivering mediocre results in chat queries, with coding also missing the mark.
I didn't "try 100 times" so it's unclear if this is an unfortunate series of bad runs on Claude Code and Gemini CLI or actual regression.
I shouldn't have to benchmark this sort of thing but here we are.