That's right. They use a fully differentiable renderer, which lets them optimize the properties of the set of 3D Gaussians via backpropagation/gradient descent.
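A toy 2D version of that optimization loop (additive blending, made-up target, isotropic Gaussians — just to show the mechanics, not the paper's actual tile-based rasterizer):

```python
import torch

# Toy 2D illustration (NOT the paper's method: real 3DGS uses 3D Gaussians,
# depth-sorted alpha compositing, and a custom CUDA rasterizer).
H, W, N = 64, 64, 200
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
pix = torch.stack([xs, ys], dim=-1)                   # (H, W, 2) pixel coords

# Learnable per-Gaussian properties
means  = (torch.rand(N, 2) * 2 - 1).requires_grad_()  # positions
log_s  = torch.full((N,), -2.5, requires_grad=True)   # scale (log-space)
colors = torch.rand(N, 3).requires_grad_()            # RGB
logits = torch.zeros(N, requires_grad=True)           # opacity logits

def render():
    # Evaluate every Gaussian at every pixel, blend additively
    # (a stand-in for the paper's depth-ordered alpha compositing).
    d2 = ((pix[None] - means[:, None, None]) ** 2).sum(-1)   # (N, H, W)
    sigma = torch.exp(log_s)[:, None, None]
    w = torch.sigmoid(logits)[:, None, None] * torch.exp(-d2 / (2 * sigma**2))
    return (w[..., None] * colors[:, None, None]).sum(0)     # (H, W, 3)

target = torch.zeros(H, W, 3)
target[16:48, 16:48] = 1.0                            # toy "training photo"

opt = torch.optim.Adam([means, log_s, colors, logits], lr=0.05)
for _ in range(200):
    loss = (render() - target).abs().mean()           # the paper adds a D-SSIM term
    opt.zero_grad()
    loss.backward()                                   # gradients reach every Gaussian
    opt.step()
```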
But ultimately what they end up with is an explicit representation of the scene, so unlike NeRF there's no "inference" during rendering. In that sense it sits somewhere between a traditional mesh representation and an implicit representation like NeRF or signed distance fields. That's also what makes it fast: they get to exploit the rasterization acceleration capabilities of GPUs, unlike NeRFs, which have to be sampled many times along each ray to render a scene.
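Rough numbers on why that matters (illustrative values, not benchmarks from the paper):

```python
# Back-of-the-envelope per-frame cost comparison (made-up but typical numbers).
H, W = 800, 800
samples_per_ray = 128                    # typical NeRF quadrature count
nerf_queries = H * W * samples_per_ray   # one MLP evaluation per sample
print(f"NeRF: {nerf_queries:,} network evaluations per frame")  # 81,920,000

n_gaussians = 3_000_000                  # illustrative scene size
print(f"Splatting: {n_gaussians:,} primitive projections per frame")
```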
They seem to just optimize the positions and shapes of the Gaussian primitives, along with each one's opacity and view-dependent color.
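Concretely, per the paper, each Gaussian carries something like this (N is illustrative):

```python
import torch

N = 100_000                                   # number of Gaussians (illustrative)
SH_DEGREE = 3                                 # the paper uses up to degree-3 SH
sh_terms = (SH_DEGREE + 1) ** 2               # 16 coefficients per color channel

# Per-Gaussian learnable state, roughly as in the 3DGS paper:
gaussians = {
    "means":     torch.zeros(N, 3),           # 3D position
    "scales":    torch.zeros(N, 3),           # anisotropic extent (log-space)
    "rotations": torch.zeros(N, 4),           # quaternion; with scales -> covariance
    "opacities": torch.zeros(N, 1),           # alpha, squashed by a sigmoid
    "sh":        torch.zeros(N, 3, sh_terms), # view-dependent color via spherical harmonics
}
```

So the "reflectance" is really baked-in view-dependent radiance (spherical harmonics), not a physical material model.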
Certainly a lot more "explainable" than a NeRF.