TBH that looks pretty decent to me. GPU programming is very complicated. Even a simple kernel in torch is hundreds to thousands of LOC, with much worse style.
I agree with a poster above: the code looks hard simply because the problem is very hard, not because they want to write it that way.