Thanks! Any idea why I'm getting such poor performance on these new models? Whether Small or Tiny, on my 24GB 7900XTX I'm seeing like 8 tokens/s using the latest llama.cpp with vulkan. Even if it was running 4x faster than this I would be asking why I'm getting so few tokens/s when it sounds like the models are supposed to bring increased inference efficiency.
Thank you! No worries at all! Yes! Sleep mode is super cool since this means the allocation of memory for inference can be totally decoupled away from training, which opens the door to many larger RL runs!
The primary goal of the release and our notebook https://colab.research.google.com/github/unslothai/notebooks... was actually to showcase how to mitigate reward hacking in reinforcement learning - for example when RL learns to cheat and output global variables instead like editing the timer to cheat on benchmarking and others. You can edit the notebook to do rl on other powerful models like Qwen, Llama etc automatically with Unsloth as well via our automatic compiler! We also made sink attention and moe inference super optimized for training - note flash attention 3 doesn't have sink backwards support so you'll have to use unsloth.
Gpt-oss tbh in our tests is a truly powerful model, especially the 120b variant - it's extremely popular in western enterprises since yes it's from openai but also because reasoning mode high and the censored nature and its reasoning capabilities are attractive. A big underutilized feature is its web search and internal intermediate tool calling which it can do as part of its reasoning chain just like o3 or gpt5.
RL yes isn't an all powerful hammer, but it can solve so many more new problems. For a financial institution, you can make automatic trading strategies via RL. For an intelligence agency, decryption via RL. For a legal startup, possibly case breakthroughs via RL, automatic drug candidates etc. And yes, big labs want to automate all tasks via massive RL for eg being able to play pokemon and all other games as one example. RL opened so many doors since you don't need any data, just one prompt like "make fast matrix multiplications kernels", and reward functions - it can allow many more interesting use cases where data is a constraint!
Definitely not breaking any modern day standards, but from what I understand, some folks are trying it on simple ciphers or combinations of simple ciphers to first see if RL can help.
I think you can train a model to decrypt an encrypted. My friend tried this only on like simple example tho. As long as we have the environment, we can do these things.
I’m sorry but I don’t buy for a second that you can do meaningful and even close to reliable decryption with RLHF on currently known secure ciphers.
Furthermore, I’m very worried that whoever may be paying for this is barking up the wrong tree. I feel that the damage done with extremely bad decryption attempts would massively outweigh the very few times when whatever it “decrypts” is meaningfully close to what the actual text was.
I’m aware of how easy certain things in surveillance are (I.e n-gram analysis is enough to dox anyone on HN in like 10 words of text) - but even sort of decent decryption of SHA-256 would be a literally front page of the world achievement.
If you’re going to be rude and arrogant, then the level of knowledge you exhibit has to match. SHA-256 decryption would be a front of the world achievement because it would be redefining foundational mathematics since it’s not an encryption algorithm. The words you’d be looking for are either a collision of SHA-256 or breaking encryption algorithms like AES, RSA, ECC etc.
sha-256 is used in the construction of certain encryption algorithms as a primitive but by itself never encrypts anything. If it did you’ve also got middleout compression invented since you could encrypt arbitrary length input into 256 bits of output.
Oh yes if RL breaks SHA-256 that'll be revolutionary - but definitely not that - some folks are for now investigating basic combinations of old school ciphers for now - security applications with RL are most likely for now related to automatically finding attack surfaces and creating defensive layers - I probably should have re-worded "decryption for RL" to just "security for RL" sorry!
To run them locally, we made some GGUFs: https://huggingface.co/unsloth/Qwen-Image-2512-GGUF
reply