
I can’t speak to the Tesla stuff, but I run an Epyc 7713 with a single 3090, and by creatively splitting the model between the GPU and 8 channels of DDR4 I get about 9 tokens per second on a q4 quant.
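
For a rough sanity check on that figure, here is a back-of-envelope sketch of the memory-bandwidth ceiling on the CPU side. It assumes DDR4-3200 DIMMs (the fastest speed the Epyc 7713 officially supports; the actual DIMM speed isn't stated), a 64-bit data bus per channel, and token generation that is purely bandwidth-bound. The function names are made up for illustration.

```python
# Back-of-envelope: how many GB of weights can be streamed from system RAM
# per generated token at a given tokens/s rate.
# Assumptions (not from the comment): DDR4-3200, 8 bytes per transfer per
# channel, decoding limited only by memory bandwidth.

def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1000 MB)."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


def max_gb_read_per_token(bandwidth_gbs: float, tokens_per_s: float) -> float:
    """Upper bound on GB read from RAM per token at the given rate."""
    return bandwidth_gbs / tokens_per_s


bw = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)
per_token = max_gb_read_per_token(bw, tokens_per_s=9)

print(f"peak DDR4 bandwidth: {bw:.1f} GB/s")
print(f"max RAM traffic per token at 9 tok/s: {per_token:.1f} GB")
```

So at 9 tokens per second the CPU side can be streaming at most roughly 23 GB of weights per token, and less in practice since real-world bandwidth falls short of the theoretical peak. Whatever sits on the 3090 doesn't count against that budget.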


Impressive. Is that a distillation, or the real thing?



