
owlbite · 2026-01-21T15:21:36 1769008896

I think calling VLIW "an abandoned design" is somewhat of an exaggeration; such architectures are pretty common for embedded audio processing.

matt_d · 2026-01-21T17:40:49 1769017249

Worth adding on that note:

From JAX to VLIW: Tracing a Computation Through the TPU Compiler Stack, https://patricktoulme.substack.com/p/from-jax-to-vliw-tracin...

Google’s Training Chips Revealed: TPUv2 and TPUv3, HotChips 2020, https://hc32.hotchips.org/assets/program/conference/day2/Hot...

Ten Lessons From Three Generations Shaped Google’s TPUv4i, ISCA 2021, https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf

mike_hearn · 2026-01-22T09:59:47 1769075987

Thanks, that JAX writeup was interesting.

mike_hearn · 2026-01-21T15:30:33 1769009433

Sure. I did mention DSPs. But how many people write code for DSPs?

HarHarVeryFunny · 2026-01-21T18:23:08 1769019788

x86-64 SSE and AVX are also SIMD

vel0city · 2026-01-21T19:38:38 1769024318

SIMD and VLIW are somewhat similar but very different in the end.

HarHarVeryFunny · 2026-01-21T19:51:50 1769025110

True.

The ISA in this Anthropic machine is actually both, VLIW and SIMD, and both are relevant to the problem.
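A toy model of the difference, as a rough sketch only: the function names and the bundle format below are made up for illustration and do not describe any real ISA. SIMD applies one operation across many data lanes, while a VLIW instruction word carries several different operations that issue together.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}

def simd(op, a, b):
    """One instruction, one operation, applied to every lane of two vectors."""
    return [OPS[op](x, y) for x, y in zip(a, b)]

def vliw_issue(bundle, regs):
    """One instruction word carrying several independent operations.

    Each slot is (op, dst, src1, src2). Every slot reads the register
    state from before the bundle; results are written back afterwards.
    """
    results = [(dst, OPS[op](regs[s1], regs[s2])) for op, dst, s1, s2 in bundle]
    new_regs = dict(regs)
    new_regs.update(results)
    return new_regs

# SIMD: four additions from a single "add" instruction.
print(simd("add", [1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]

# VLIW: an add, a mul and a sub issued together in one bundle.
regs = {"r0": 2, "r1": 3, "r2": 0, "r3": 0, "r4": 0}
bundle = [
    ("add", "r2", "r0", "r1"),
    ("mul", "r3", "r0", "r1"),
    ("sub", "r4", "r1", "r0"),
]
print(vliw_issue(bundle, regs))
```

A machine that is both would, in this toy picture, put vector operands in each slot of the bundle.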

owlbite · 2025-12-01T03:17:36 1764559056

This book provides a high-level overview of many methods without (on a quick skim) really hinting at the practical usage. Basically this reads as an encyclopedia to me, whereas Nocedal and Wright is more of an introductory graduate course going into significantly more detail on a smaller selection of algorithms (generally those that are more commonly used).

Picking on what I'd consider one of the major workhorse methods of continuous constrained optimization, Interior Point Methods get a 2-3 page super high-level summary in this book. Nocedal and Wright give an entire chapter on the topic (~25 pages) (which of course still is probably insufficient detail to implement anything like a competitive solver).

ted_dunning · 2025-12-02T19:33:44 1764704024

It's a bit like the old Numerical Recipes book in that regard.

(but better)

owlbite · 2025-11-06T06:18:22 1762409902

But it can be even worse than that. It's "we assassinated the phone", "algorithm says vehicle has suspicious travel history and must die". There's no real thinking human in the loop for some of this stuff, just some model decided the metadata has a high probability of being associated with an opponent of some flavor and then everyone in the vicinity is blown to bits as computer said kill.

owlbite · 2025-10-31T17:52:37 1761933157

Very true, but a lot of stuff builds on a few core optimized libraries like BLAS/LAPACK, and picking up a build of those targeted at a modern microarchitecture can give you 10x or more compared to a non-targeted build.

That said, most of those packages will just read the hardware capability from the OS and dispatch an appropriate codepath anyway. You maybe save some code footprint by restricting the number of codepaths it needs to compile.
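A minimal sketch of that kind of runtime dispatch. The kernel names and the feature strings are hypothetical stand-ins, and how the feature set is obtained from the OS is left outside the sketch; a real library would ship separately tuned code for each path.

```python
# Hypothetical kernels: all compute the same dot product here, standing in
# for code paths that would be tuned for different instruction sets.
def dot_generic(x, y):
    return sum(a * b for a, b in zip(x, y))

def dot_avx2(x, y):
    return dot_generic(x, y)

def dot_avx512(x, y):
    return dot_generic(x, y)

# Most specialised path first; the generic path requires no features.
KERNELS = [
    ({"avx512f"}, dot_avx512),
    ({"avx2", "fma"}, dot_avx2),
    (set(), dot_generic),
]

def select_kernel(cpu_features):
    """Return the first kernel whose required features are all present."""
    for required, kernel in KERNELS:
        if required <= cpu_features:
            return kernel
    raise RuntimeError("no usable kernel")  # unreachable: generic needs nothing

# Assumed feature set for illustration only.
kernel = select_kernel({"sse2", "avx2", "fma"})
print(kernel.__name__, kernel([1.0, 2.0], [3.0, 4.0]))
```

Restricting the build to one target amounts to shrinking `KERNELS` to a single entry, which is where the code footprint saving would come from.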

owlbite · 2025-10-24T06:15:04 1761286504

They just label such people as Applied Mathematicians, or worse: Physicists and Engineers; and then get back to sensible business such as algebraic geometry, complex analysis and group theory.

owlbite · 2025-10-24T06:10:48 1761286248

Introduction to PhD study: "How hard can it be, I'm sure I could write that in a week"

owlbite · 2025-10-19T00:10:51 1760832651

I thought GPLv3 adoption by GCC was what really lit the flames on moving to llvm by commercial entities?

fsckboy · 2025-10-19T00:36:02 1760834162

you only need to worry about GPLv3 if you are modifying gcc in source and building it and distributing that. Just running gcc does not create a GPLv3 infection. And glibc et al are library licensed so they don't infect what you build either, most especially if you are not modifying its source and rebuilding it.

o11c · 2025-10-19T01:28:47 1760837327

And what we've seen from e.g. Apple is that "make a private fork and only distribute binaries" is exactly what they wanted the whole time.

wmf · 2025-10-19T01:29:51 1760837391

> you only need to worry about GPLv3 if you are modifying gcc in source and building it and distributing that.

That's the context here. If you build a new compiler based on GCC, GPL applies to you. If you build a new compiler based on LLVM it doesn't.

fsckboy · 2025-10-19T04:01:36 1760846496

the context here doesn't actually specify whether we are talking about companies using llvm sources to create proprietary compilers (or maybe integrated with a proprietary IDE) or using llvm to quickly bootstrap and craft a compiler for a new processor, new language, etc., where they will distribute the source to the compiler anyway

but such a compiler or IDE would not GPLv3 infect its users' target sources and binaries.

int_19h · 2025-10-19T07:50:40 1760860240

The main problem with GPLv3 specifically from the perspective of various commercial vendors is the patent clause.

johannes1234321 · 2025-10-19T01:27:05 1760837225

Still some companies try hard to avoid GPLv3, see Apple, who either provide old GPLv2 licensed software or invest in BSD/MIT replacements.

tehjoker · 2025-10-19T00:13:33 1760832813

You might know this history better than me.

owlbite · 2025-10-13T14:59:32 1760367572

What I suspect he really means is that FORTRAN lays out its arrays column-major, whilst C chose row-major. Historically most math software was written in the former, including the de facto standard BLAS and LAPACK APIs used for most linear algebra. Mix-and-matching memory layouts is a recipe for confusion and bugs, so "mathematicians" (which I'll read as people writing a lot of non-ML matrix-related code) tend to prefer to stick with column major.

Of course things have moved on since then and a lot of software these days is written in languages that inherited their array ordering from C, leading to much fun and confusion.

The other gotcha with a lot of these APIs is of course 0 vs 1-based array numbering.
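To make the two gotchas concrete, here is a small sketch of the offset arithmetic for a flat buffer. The helper names are made up for illustration.

```python
def row_major_offset(i, j, cols):
    """0-based offset of element (i, j) when each row is contiguous (C-style)."""
    return i * cols + j

def col_major_offset(i, j, rows):
    """0-based offset of element (i, j) when each column is contiguous (Fortran-style)."""
    return j * rows + i

def fortran_offset(i1, j1, rows):
    """Offset for 1-based Fortran-style indices A(i1, j1) in a column-major buffer."""
    return col_major_offset(i1 - 1, j1 - 1, rows)

# The 2x3 matrix [[1, 2, 3], [4, 5, 6]] stored both ways.
rows, cols = 2, 3
row_major = [1, 2, 3, 4, 5, 6]
col_major = [1, 4, 2, 5, 3, 6]

# Element at row 0, column 1 (0-based) is 2 in either layout.
print(row_major[row_major_offset(0, 1, cols)])  # 2
print(col_major[col_major_offset(0, 1, rows)])  # 2

# The classic bug: indexing a row-major buffer with column-major arithmetic.
print(row_major[col_major_offset(0, 1, rows)])  # 3, the wrong element

# The same element addressed Fortran-style as A(1, 2).
print(col_major[fortran_offset(1, 2, rows)])    # 2
```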

Const-me · 2025-10-13T23:32:09 1760398329

> is written in languages that inherited their array ordering from C

It’s not just C. Modern GPU hardware only supports row major memory layout for 2D and 3D textures (ignoring specialized layouts like swizzling and block compression but none of them are column major either). Modern image and video codecs only support row major layout for bitmaps.

bee_rider · 2025-10-13T16:26:19 1760372779

The MKL blas/lapack implementation also provides the “cblas” interface (I’m sure most blas implementations do, I’m just familiar with MKL—BLIS seems quite willing to provide additional interfaces so I bet they provide it as well) which explicitly accepts arguments for row or column ordering.

Internally the matrix is tiled out anyway (for gemm at least) so column vs row ordering is probably a little less important nowadays (which isn’t to say it never matters).

owlbite · 2025-10-13T23:00:16 1760396416

Oh yes, from an actual implementation POV you can just apply some transpose and ordering transforms to convert from row major to column major or vice-versa. cblas is pretty universal, though I don't think any LAPACK C API ever gained as wide support for non column-major usage (and LAPACK actually has some routines where you can't just pull transpose tricks for the transformation).
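For gemm the trick can be sketched in a few lines. This is an illustrative pure-Python stand-in, not how any particular BLAS implements it: a row-major buffer for X is byte-for-byte a column-major buffer for X^T, so C = A·B on row-major data can be computed as C^T = B^T·A^T by a column-major-only routine with the operands and dimensions swapped.

```python
def gemm_col_major(m, n, k, a, b):
    """C = A @ B with A (m x k), B (k x n) and C (m x n) all column-major."""
    c = [0] * (m * n)
    for j in range(n):
        for p in range(k):
            for i in range(m):
                c[j * m + i] += a[p * m + i] * b[j * k + p]
    return c

def gemm_row_major(m, n, k, a, b):
    """C = A @ B for row-major buffers, using only the column-major routine.

    Row-major A is column-major A^T, and likewise for B and C, so we ask
    the column-major routine for C^T = B^T @ A^T, which is n x m.
    """
    return gemm_col_major(n, m, k, b, a)

# A = [[1, 2, 3], [4, 5, 6]] and B = [[7, 8], [9, 10], [11, 12]], row-major.
a = [1, 2, 3, 4, 5, 6]
b = [7, 8, 9, 10, 11, 12]
print(gemm_row_major(2, 2, 3, a, b))  # [58, 64, 139, 154]
```

No data is copied or transposed; only the argument order changes.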

Certain layouts have performance advantages for certain operations on certain microarchitectures due to data access patterns (especially for level 2 BLAS), but that's largely irrelevant to historical discussion of the API's evolution.

owlbite · 2025-10-08T19:54:27 1759953267

Much better to burn the area for multiple smaller units: it's a bit more area for frontend handling, but worth it for the flexibility (see Apple's M-series chips vs Intel AVX*).

Remnant44 · 2025-10-08T19:59:12 1759953552

Yes and no. I think neon is undersized for today at 128bit registers -- if you're working with doubles for example, that's only two values per register, which is pretty anemic. Things like shuffles and other tricky bitops benefit from wider widths as well (see my other reply)

adgjlsfhk1 · 2025-10-08T22:32:19 1759962739

Agreed that 128 bit is undersized, but 512 feels pretty good for the time being. We're unlikely to see further size increases since going to 1024 would require doubling the cache line, register file, and ram bandwidth, while just adding an extra fma port is far less hardware.
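The arithmetic behind those widths, as a quick sketch. The 64-byte cache line is an assumption (common on mainstream x86 parts, not universal).

```python
CACHE_LINE_BYTES = 64  # assumption: typical x86 line size, not universal

def lanes(register_bits, element_bits):
    """How many elements of a given size fit in one vector register."""
    return register_bits // element_bits

def cache_lines(register_bits):
    """Register size expressed in cache lines."""
    return (register_bits // 8) / CACHE_LINE_BYTES

for width in (128, 256, 512, 1024):
    print(
        f"{width:>4}-bit: {lanes(width, 64)} doubles, "
        f"{lanes(width, 32)} floats, {cache_lines(width)} cache lines"
    )
```

Under that assumption a 512-bit register is exactly one cache line, and a 1024-bit register would span two.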

Remnant44 · 2025-10-08T23:15:53 1759965353

totally - especially given how bandwidth constrained CPUs still are, going wider than 512 doesn't make much sense. 512 itself was a stretch for quite a long time (and all the negative press on the original implementations was a consequence of being not-quite-ready for primetime), but for current hardware I think it's perfect.

But 128bit is just ancient. If you're going to go to significant trouble to rewrite your code in SIMD, you want to at least get a decent perf return on investment!

adgjlsfhk1 · 2025-10-09T00:20:39 1759969239

128 bit is already really nice for things like Int8 comparison (e.g. lots of string operations and Swiss Dict key search)

owlbite · 2025-09-23T15:29:38 1758641378

In the UK specifically the radical reform (read destruction) of council housing by the Thatcher government had a large impact on the housing market in the 1980s.