
IMHO there are some serious problems here: the approach won't apply to many situations, what it measures is not really "waste" in the way claimed, and it will probably result in greater spend.

> Memory waste: request - actual usage [0]

Memory "requests" are hints to the kube-scheduler for placement, not a target for expected usage.

> # Memory over-provisioning: limit > 2x request [1]

Memory limits are for enforcement, typically deciding when to call the OOM killer.

Neither placement hints nor OOM-kill limits should have anything to do with normal operating parameters.

> The memory request is mainly used during (Kubernetes) Pod scheduling. On a node that uses cgroups v2, the container runtime might use the memory request as a hint to set memory.min and memory.low [2]
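
To make that concrete, here is a rough sketch of where those two knobs actually land on a cgroup v2 node (the slice path is a placeholder and varies by runtime; this is an illustration, not what the script does):

    # Illustration only: a container with requests.memory=1Gi and limits.memory=2Gi
    POD_SLICE=/sys/fs/cgroup/kubepods.slice/kubepods-podXXXXXXXX.slice  # placeholder path
    cat "$POD_SLICE/memory.max"   # hard cap from the limit; this is what the OOM killer enforces
    cat "$POD_SLICE/memory.min"   # soft protection the runtime *may* derive from the request
    cat "$POD_SLICE/memory.low"   # likewise a hint, not an expected-usage target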

By choosing to label the delta between these two as "waste" you will absolutely suffer from Goodhart's law: you will teach your dev team not just to request memory, but to allocate it and never free it, so that they fit inside this invalid metric's assumptions.

It is going to work against the more reasonable goals: having developers set their limits as low as possible without negative effects, protecting the node and pod from memory leaks, and still gaining the advantages of over-provisioning, which is where the big gains are to be made.

[0] https://github.com/WozzHQ/wozz/blob/main/scripts/wozz-audit....
[1] https://github.com/WozzHQ/wozz/blob/main/scripts/wozz-audit....
[2] https://kubernetes.io/docs/concepts/configuration/manage-res...



You are technically right that requests are scheduling hints, but in a cluster-autoscaler world, requests = bill.

If I request 8GB for a pod that uses 1GB, the autoscaler spins up nodes to accommodate that 8GB reservation. That 7GB gap is capacity the company is paying for but cannot use for other workloads.

Valid point on Goodhart's law, though the goal shouldn't be to fill the RAM, but rather to lower the request to match the working set so we can bin-pack tighter.
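
For what it's worth, a sketch of what "match the working set" could look like in practice, assuming you have Prometheus scraping cAdvisor metrics (the endpoint and pod name are placeholders): size the request against a high percentile over a couple of weeks, not a point-in-time read.

    # Sketch: 99th percentile of working set over 14 days for a given pod
    PROM=http://prometheus:9090   # placeholder endpoint
    QUERY='quantile_over_time(0.99, container_memory_working_set_bytes{pod="my-pod"}[14d])'
    curl -sG "$PROM/api/v1/query" --data-urlencode "query=$QUERY"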


This script does nothing to solve that, and actually exacerbates the problem of people over-speccing.

It makes assumptions about pricing [0], and if you do need a peak of 8GB it would push you toward allocating and holding that 8GB at all times, because it is just reading a current snapshot from /proc/$pid/status:VmSize [1] and says you are wasting memory based on "request - actual usage (MiB)" [2].

What if once a week you run a reconciliation that needs that 8GB? What if you only need 8GB once every 10 seconds? The script won't see that; so to be defensive you can't release that memory, or you will be 'wasting' resources despite that peak need.
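
To be concrete about the snapshot problem: /proc already exposes high-water marks right next to the current values, and a point-in-time read only ever sees the latter (pid and numbers below are made up):

    pid=12345   # placeholder
    grep -E 'VmPeak|VmSize|VmHWM|VmRSS' /proc/$pid/status
    # VmPeak:  8388608 kB   <- peak virtual size (the weekly 8GB reconcile)
    # VmSize:  1048576 kB   <- what a one-off snapshot scores you on
    # VmHWM:   8388608 kB   <- high-water mark of resident memory
    # VmRSS:   1048576 kB   <- resident right now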

What if your program only uses 1GB, but you are working on a lot of parquet files, and with less RAM you start to hit EBS IOPS limits, or you don't finish the nightly DW run because you have to hit disk instead of working from the buffer cache with headroom, etc.?

This is how bad metrics wreck corporate cultures, and the ones in this case encourage overspending. If I use all that RAM I will never hit the "top_offender" list [3], even if I cause 100 extra nodes to be launched.

Without context, and far more sophisticated analytics, "request - actual usage (MiB)" is meaningless and trivial to game.

What incentive does this metric create, except making sure that your pods keep request ~= RES 24x7x365 ~= OOM-kill limit / 2, so they avoid appearing in the "top_offender" list?

Once your skip's skip's skip sees that some consultant labeled you a "top_offender" despite your transient memory needs, how do you work that through? How do you "prove" your case against a team that is gaming the metric?

Also, as a developer you don't have control over the cluster's placement decisions, nor typically over the choice of machine types. So blaming the platform user for the platform team's inappropriate choice of instance types, and shutting down chances of collaboration in the process, isn't a very productive path either.

Minimizing cloud spend is a very challenging problem, which typically depends on collaboration more than anything else.

The point is that these scripts are not providing a valid metric, and they present that metric in a hostile way. They could be changed to help a discovery process, but they absolutely will not in their current form.

[0] https://github.com/WozzHQ/wozz/blob/main/scripts/wozz-audit....
[1] https://github.com/google/cadvisor/blob/master/cmd/internal/...
[2] https://github.com/WozzHQ/wozz/blob/main/scripts/wozz-audit....
[3] https://github.com/WozzHQ/wozz/blob/main/scripts/wozz-audit....


Really fair critique regarding the snapshot approach. You're right that optimizing limits based on a single point in time is dangerous for bursty workloads, like the need-8GB-for-10-seconds scenario.

The intent of this script isn't to replace long-term metric analysis like Prometheus/Datadog trends, but to act as a smoke test for gross over-provisioning: the dev who requested 16GB for a sidecar that has flatlined at 100MB for weeks.
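
Something like this is the spirit of the check, even done by hand (requires metrics-server; the pod name is a placeholder):

    kubectl top pod my-sidecar   # e.g. 100Mi right now
    kubectl get pod my-sidecar -o jsonpath='{.spec.containers[0].resources.requests.memory}'   # e.g. 16Gi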

You make a great point about the hostile framing of the word waste. I definitely don't want to encourage OOM risks. I'll update the readme to clarify that this delta represents potential capacity to investigate rather than guaranteed waste.

Appreciate the detailed breakdown on the safety buffer nuances.



