Hacker News | nsypteras's comments


Analyzing frontier LLM performance on my favorite daily puzzle game (https://www.nicksypteras.com/blog/cbs-benchmark.html) Next step is to assess how well the LLMs can create their own new, logically satisfiable puzzles in the same style. Then I'll have them battle it out, with one model creating a puzzle and the other attempting to solve it!
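
A rough sketch of how that battle round could be scored. Everything here is hypothetical: `Creator` and `Solver` are stand-ins for real model calls, and the sketch assumes each created puzzle has already been checked as satisfiable and has a single canonical solution string.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for model calls:
# a creator returns (puzzle, solution); a solver returns its answer.
Creator = Callable[[], Tuple[str, str]]
Solver = Callable[[str], str]


def battle(creators: Dict[str, Creator], solvers: Dict[str, Solver]) -> Dict[str, int]:
    """Each model attempts every other model's puzzle; returns solve counts."""
    scores = {name: 0 for name in solvers}
    for author, create in creators.items():
        puzzle, solution = create()
        for name, solve in solvers.items():
            if name == author:
                continue  # a model does not attempt its own puzzle
            if solve(puzzle) == solution:
                scores[name] += 1
    return scores
```

With toy lambdas in place of models, `battle({"a": lambda: ("1+1", "2")}, {"b": lambda p: "2"})` credits model `b` with one solve.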


Thanks for sharing! I want to add some sort of agentic "helper" to my new puzzle website [1], and I picked up some tips from your post and code, thank you!

Have you given any thought to how to create the puzzles? Do you think it'd be possible to create them using LLMs?

[1]: https://www.puzzleship.com


Congrats on launching! One immediate thought is that people will always be wary of running LLM-generated code on their machines even if it's sandboxed. Is one of the future business cases for this to host a remote execution environment that pctx can call out to rather than running the code locally?


I don't see a reason to be nervous about running AI on a local system if it's VM encapsulated with cgroups.


yes! coming soon


Ya interesting thought - would be fascinating if generating games w/solutions is part of the training data pipeline. There's been previous work done on testing LLMs on logic puzzles[1][2][3] so they could possibly be building off those ideas to improve performance.

[1] https://huggingface.co/papers/2504.00043 [2] https://huggingface.co/blog/yuchenlin/zebra-logic [3] https://arxiv.org/pdf/2403.12094


I'm impressed it recommended so many books I've already read and liked! I have a big reading backlog but once it's whittled down I will likely come back to this. One feature request would be to also show a "why this is recommended" for each recommendation so I can further narrow down the list to what I'm looking for.


"Counter Chinese Influence in International Governance Bodies" and grouping them in with US "adversaries" and "rivals" is quite undiplomatic language to throw in under "Lead in International AI Diplomacy and Security" section. Diplomacy with China should be an important part of this initiative but will inevitably be bungled.


The language lets you get around a bunch of pesky laws by declaring it a "national defense emergency."


Even if it’s not perfect, I’m happy to see there’s a focus on AI Security. NIST has been a reliable producer of quality international standards for cybersecurity. Hopefully this action plan will lead to similarly high quality recommendations for AI Security.


China is an adversary of the West, and leading in international security means posing a challenge (or, in an ideal world, a better alternative) to Chinese influence on the international stage.


It’s necessary to put pressure on trying to prevent a Taiwan invasion.


1984: U.S. withdraws. 2003: U.S. rejoins. 2011: U.S. stops paying dues after Palestine joins. 2017: U.S. announces withdrawal (effective end of 2018). 2023: U.S. rejoins, pledges to repay dues. 2025: U.S. announces withdrawal.

Seems to be a revolving door


They're getting ready to bomb Iran's UNESCO sites. They did bomb several UNESCO sites in Yugoslavia and other places while they left. Their boy Grossi also told the whole world that there is a big target on a UNESCO site a short while back.


Which site in Yugoslavia did they bomb?


NATO bombings in 1999 damaged a church in Kosovo (post-Yugoslavia) that was later added to UNESCO in 2006.

https://en.wikipedia.org/wiki/Gra%C4%8Danica_Monastery


So it's a time-traveling crime?


History mismatch/Mandela effect? Some of the bombed sites were already known as culturally significant but not yet recognized by UNESCO, like Novi Sad, which became a UNESCO Creative City in 2023.


UNESCO Creative cities are very different from UNESCO world heritage sites.


Makes me wonder if officials at UNESCO even care about the decision. "Oh, that again?" They're probably already used to this.


Similar to the Israeli ambassador being recalled from Dublin. They mean it as a big dramatic statement but they've done it that many times it's lost all significance.

She only gets reinstated again for the purpose of making another dramatic exit.


They always send their most incompetent ambassadors to Dublin, ones that put their foot in their own mouth.


I suppose looking at it from the Israeli government's perspective, Ireland is a very safe place for Israelis and Jewish people in general, but the public and government are vocal on Israel's actions and there's no defence/intelligence links between the two countries. Trade links are on the European level.

There'll never be a reason for them to send a skilled diplomat, so may as well send a shit stirrer who's only good for causing controversy.


When you put it that way, it's pretty logical.


They’re never happy about the loss of money. For UN institutions, the US usually contributes a theoretical cap of about 22% but in real terms I think it’s more like a quarter of their annual budget or a little over in some cases. When we’re not paying, that’s a lot of money that UNESCO isn’t getting.


Predictably, if/when China becomes the premier funder of UN organizations, there will be a lot of grousing about it by US politicians. The amount of soft power being trashed is astounding.


We’re the ones seeking to cap our contributions. The formula currently doesn’t allow for any one country to pay more than 22% with America the only one actually paying that much, save for the institutions we’ve cut off. For UN peacekeeping we’re actually assessed at 27% but Congress capped that to 25% back in 1993.

https://betterworldcampaign.org/us-funding-for-the-un/un-bud...

If any other country wants to step in and fill the gap, I don’t think Congress will care.


> If any other country wants to step in and fill the gap, I don’t think Congress will care

"Countering the PRC Malign Influence Fund Authorization Act of 2023[1]" says otherwise.

1. https://www.congress.gov/bill/118th-congress/house-bill/1157...


All of our foreign policy prior to January 20th 2025 is in a state of flux. Officially, Congress cares, but the first 7 or so months of this year have been enlightening in a strange way, and with our President taking the lead, there is a strong possibility that Congress will not care if the possibility of the PRC paying more comes up in any policy discussions.


Eh, China finances a ton of members, who had better vote in line, as debtors should.


If you abandon it completely, something else might rise up. But by funding/participating only up to a point, you suppress that. See Ukraine aid policies as well.


Look at the years, and see how they match up with the administration in power...


  1984 withdraw Reagan
  2003 rejoin   Bush
  2011 protest  Obama (forced by law)
  2017 withdraw Trump
  2023 rejoin   Biden
  2025 withdraw Trump

Kinda tracks, except for the Bush one.


Tbf, if you remove the Biden 2023 pledge, the rest makes sense:

In the two decades between 1984 and 2003, UNESCO implemented a number of reforms in management+transparency+politicization, and the U.S. returned.

Then Palestine was admitted, and the U.S. left.


Cycle of politicians appeasing their genocidal masters until the government starts to realize what that means exactly, at which point we pull back to humanity.


Same here! Kiwix comes in clutch on flights. I've used it so many times to get background knowledge on topics mid-read. Plus free and open source. Such a great service.


Yes! I’ve used it on flights and long train rides (and generally when travelling) when the network connection might be a bit patchy.


I think that would be one of the success cases described in the article because HITL is an integral part of good customer support chatbots. Support chats can be escalated to a human whenever the agent is unable to provide a satisfactory answer to the user.
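
A minimal sketch of that escalation rule, assuming a hypothetical per-reply confidence score and a made-up threshold (neither comes from the article):

```python
from dataclasses import dataclass

# Hypothetical cutoff below which the chat is handed to a person.
ESCALATION_THRESHOLD = 0.6


@dataclass
class AgentReply:
    text: str
    confidence: float  # assumed score in [0, 1] reported by the agent


def route(reply: AgentReply, user_asked_for_human: bool = False) -> str:
    """Return 'human' when the chat should be escalated, else 'agent'."""
    if user_asked_for_human or reply.confidence < ESCALATION_THRESHOLD:
        return "human"
    return "agent"
```

In practice the trigger could just as well be an explicit "talk to a person" button or repeated failed answers; the point is only that the human handoff is a normal code path, not an exception.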


> The transportation agency has spent years looking for an innovative way to allow passengers to move faster through the security checkpoints.

I think the writer had some fun with this one

