The rotary embeddings bit is neat. I wonder whether a complex-number representation would simplify or complicate things (readability, performance, expressive power).
Some implementations do use a complex rotary encoding, but it makes the code harder to port to platforms or frameworks that don't support complex numbers natively.
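For concreteness, here's a minimal sketch of the two formulations (PyTorch; the function names are mine, just illustrative, not from any particular codebase). They compute the same rotation; the real-valued version is what you end up writing when the backend has no complex dtype:

```python
import torch

def rope_complex(x, theta=10000.0):
    # Complex formulation: view adjacent dims as complex pairs and
    # multiply by position-dependent unit phasors e^{i * angle}.
    seq_len, dim = x.shape
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)
    rot = torch.polar(torch.ones_like(angles), angles)          # e^{i*angle}
    xc = torch.view_as_complex(x.reshape(seq_len, dim // 2, 2)) # pairs -> complex
    return torch.view_as_real(xc * rot).reshape(seq_len, dim)

def rope_real(x, theta=10000.0):
    # Equivalent real-valued rotation: an explicit 2x2 rotation per pair,
    # the form you'd port to a framework without native complex support.
    seq_len, dim = x.shape
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = torch.randn(8, 64)
assert torch.allclose(rope_complex(x), rope_real(x), atol=1e-5)
```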
The tensor cores that do the bulk of the FLOPs on most of the GPUs people use only operate on various sizes of floats, I think. We're in a funny position where progress in models and progress in hardware are kind of linked.
As far as expressive power goes, it shouldn't make a difference for the models in common use (the complex and real formulations compute exactly the same function), but I could totally imagine models where it improves readability.