I agree it’s unlikely, but that only tells you the compiler *can* compile the two with similar performance; it won’t tell you whether the compiler will actually do so in your code.
If you want to be sure, you have to benchmark your production code with production data.
Looking at the generated machine code may also help, but that can be difficult: the generated code may differ yet perform similarly, and judging whether two instruction sequences have similar performance is hard on modern hardware, with its out-of-order execution, multiple layers of caches, etc.
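If you want to try the machine-code comparison yourself, here's a minimal sketch: two semantically equivalent dispatch functions, one using an if/else chain and one a switch. The function names and return values are illustrative, not from the thread. Compile with `gcc -O2 -S` (or paste into Compiler Explorer) and diff the assembly.

```c
/* Compare with: gcc -O2 -S dispatch.c && less dispatch.s
   At -O2, many compilers emit the same (or near-identical) code
   for both; at -O0 they often differ. */

int dispatch_if(int x) {
    if (x == 0) return 10;
    else if (x == 1) return 20;
    else if (x == 2) return 30;
    else return -1;
}

int dispatch_switch(int x) {
    switch (x) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        default: return -1;
    }
}
```

Even when the two compile identically here, that doesn't generalize: larger or sparser case sets may get jump tables for the switch but not the chain.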
Well, to be fair, running an unquantized 70B model is going to take somewhere in the area of 160GB of VRAM (if my quick back-of-the-napkin math is OK). I'm not quite sure of the state of GPUs these days, but getting a 2x A100 80GB (or 4x 40GB) setup is probably going to cost more than a Mac Studio with maxed-out RAM.
If we are talking quantized, I am currently running LLaMA v1 30B at 4 bits on a MacBook Air with 24GB of RAM, which is only a little more expensive than what a 24GB 4090 retails for. The 4090 would crush the MacBook Air in tokens/sec, I am sure. It is, however, completely usable on my MacBook (4 tokens/second, IIRC? I might be off on that).
A 4-bit 70B model should take about 36-40GB of RAM, so a 64GB Mac Studio might still be price-competitive with a dual-4090 or 4090/3090 split setup. The cheapest Studio with 64GB of RAM is $2,399 (USD).
compilers usually optimize these kinds of statements so they end up similar or identical
if you wanna be sure, write two loops that each run a few million iterations over random inputs, one dispatching with if/else and one with switch, and time them