Analyzing frontier LLM performance on my favorite daily puzzle game (https://www.nicksypteras.com/blog/cbs-benchmark.html). Next step is to assess how well the LLMs can create their own new, logically satisfiable puzzles in the same style. Then I'll have them battle it out, with one model creating a puzzle and the other attempting to solve it!
Thanks for sharing! I want to have some sort of agentic "helper" to my new puzzles website [1], and I've learned some tips from your post/code, thank you!
Have you given any thought to how to create the puzzles? Do you think it'd be possible to create them using LLMs?
Congrats on launching! One immediate thought is that people will always be wary of running LLM-generated code on their machines even if it's sandboxed. Is one of the future business cases for this to host a remote execution environment that pctx can call out to rather than running the code locally?
Ya interesting thought - would be fascinating if generating games w/solutions is part of the training data pipeline. There's been previous work done on testing LLMs on logic puzzles[1][2][3] so they could possibly be building off those ideas to improve performance.
I'm impressed it recommended so many books I've already read and liked! I have a big reading backlog but once it's whittled down I will likely come back to this. One feature request would be to also show a "why this is recommended" for each recommendation so I can further narrow down the list for what I'm looking for.
"Counter Chinese Influence in International Governance Bodies", and grouping China in with US "adversaries" and "rivals", is quite undiplomatic language to throw in under the "Lead in International AI Diplomacy and Security" section. Diplomacy with China should be an important part of this initiative but will inevitably be bungled.
Even if it’s not perfect, I’m happy to see there’s a focus on AI Security. NIST has been a reliable producer of quality international standards for cybersecurity. Hopefully this action plan will lead to similarly high quality recommendations for AI Security.
China is an adversary of the West, and leading in international security means posing a challenge (or, in an ideal world, a better alternative) to Chinese influence on the international stage.
1984: U.S. withdraws.
2003: U.S. rejoins.
2011: U.S. stops paying dues after Palestine joins.
2017: U.S. announces withdrawal (effective end of 2018).
2023: U.S. rejoins, pledges to repay dues.
2025: U.S. announces withdrawal.
They're getting ready to bomb Iran's UNESCO sites. They did bomb several UNESCO sites in Yugoslavia and other places while they were out of UNESCO. Their boy Grossi also told the whole world that there is a big target on a UNESCO site a short while back.
History mismatch/Mandela effect? Some of the bombed sites were already known as culturally significant but not recognized by UNESCO yet, like Novi Sad, which became a UNESCO Creative City in 2023.
Similar to the Israeli ambassador being recalled from Dublin. They mean it as a big dramatic statement but they've done it that many times it's lost all significance.
She only gets reinstated again for the purpose of making another dramatic exit.
I suppose looking at it from the Israeli government's perspective, Ireland is a very safe place for Israelis and Jewish people in general, but the public and government are vocal on Israel's actions and there's no defence/intelligence links between the two countries. Trade links are on the European level.
There'll never be a reason for them to send a skilled diplomat, so may as well send a shit stirrer who's only good for causing controversy.
They’re never happy about the loss of money. For UN institutions, the US usually contributes a theoretical cap of about 22% but in real terms I think it’s more like a quarter of their annual budget or a little over in some cases. When we’re not paying, that’s a lot of money that UNESCO isn’t getting.
Predictably, if/when China becomes the premier funder of UN organizations, there will be a lot of grousing about it by US politicians. The amount of soft power being trashed is astounding.
We’re the ones seeking to cap our contributions. The formula currently doesn’t allow for any one country to pay more than 22% with America the only one actually paying that much, save for the institutions we’ve cut off. For UN peacekeeping we’re actually assessed at 27% but Congress capped that to 25% back in 1993.
All of our foreign policy prior to January 20th 2025 is in a state of flux. Officially, Congress cares, but the first 7 or so months of this year have been enlightening in a strange way, and with our President taking the lead, there is a strong possibility that Congress will not care if the possibility of the PRC paying more comes up in any policy discussions.
If you abandon it completely, something else might rise up, but funding/participating only up to a point works to suppress it. See Ukraine aid policies as well.
Cycle of politicians appeasing their genocidal masters until the government starts to realize what that means exactly, at which point we pull back to humanity.
Same here! Kiwix comes in clutch on flights. I've used it so many times to get background knowledge on topics mid-read. Plus free and open source. Such a great service.
I think that would be one of the success cases described in the article because HITL is an integral part of good customer support chatbots. Support chats can be escalated to a human whenever the agent is unable to provide a satisfactory answer to the user.
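To make the escalation idea concrete, here's a rough sketch of how a support bot might hand off to a human. Everything in it is an assumption for illustration: `BotReply`, `route_message`, `toy_bot`, the self-reported `confidence` score, and the 0.7 threshold are all hypothetical names and values, not anything from the article or a real chatbot framework.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class BotReply:
    text: str
    confidence: float  # hypothetical self-reported score in [0, 1]


def route_message(
    message: str,
    bot: Callable[[str], BotReply],
    threshold: float = 0.7,  # assumed cutoff, would need tuning in practice
) -> Tuple[str, str]:
    """Return (handler, text), where handler is "bot" or "human"."""
    # An explicit request for a person always escalates.
    if "human" in message.lower():
        return ("human", "Connecting you with a support agent.")
    reply = bot(message)
    # Escalate when the bot is not confident enough in its own answer.
    if reply.confidence < threshold:
        return ("human", "Connecting you with a support agent.")
    return ("bot", reply.text)


def toy_bot(message: str) -> BotReply:
    # Stand-in for a real model call: it only knows about password resets.
    if "password" in message.lower():
        return BotReply("Use the 'Forgot password' link on the login page.", 0.9)
    return BotReply("I'm not sure.", 0.2)
```

With the toy bot, a password question stays with the bot, while anything it doesn't recognize (or any message asking for a human) gets routed to a person. A real system would probably use a better signal than a self-reported confidence number, but the routing shape would likely look similar.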