I got really confused by the title since I assumed it was about running microservices on embedded devices (which sounds like madness) and doing SRE work for that.
Since SahAssar didn't explain it either: the author uses 'embedded' in the same sense as an 'embedded reporter'. That is to say, as a member with a distinct role in an otherwise homogeneous team.
The author talks about adding visualization and Istio, and about an API migration he was part of.
> This was occurring at the same time each hour because a CronJob has been configured to restart the pod each hour, in order to prevent memory leaks in other services using the same GKE cluster
I worked at a place that used to restart their JBoss (haven't used that name in years...) cluster every night at 2 AM. They never did find the real source of the memory leak.
It's been a long time, but I seem to remember that Amazon used to do a similar thing back when they had a giant monolithic webserver process. Maybe 2004? The executable was >1GB back when that was unimaginably huge and it was kind of comical.
The process (Gurupa?) used to leak memory and the solution was to kill the process after a small number of requests, and start a new one. This is second-hand knowledge so it's possible that I'm confused.
Boggles my mind that people would just restart stuff instead of fixing it - classic "devops" antipattern. It never took me more than a day to find the gnarliest memory leak in a Go application; I haven't worked with Java in ages, but I assume it's similar there. C/C++ on the other hand... though that should be better now with the advent of tcmalloc/jemalloc, which have proper instrumentation.
I haven't resorted to the "restart in a cronjob" approach, but on Python/Django apps we pretty much always use the gunicorn/uwsgi setting that automatically restarts worker processes after serving some number of requests (typically 10-50k for us). There are just so many ways a Python app can leak small amounts of memory, and so many 3rd-party libraries involved, that it's much easier to default to the automatic restart than to play whack-a-mole tracking down and fixing leaks.
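For reference, the knobs are max_requests / max_requests_jitter in gunicorn (uwsgi has an equivalent max-requests option). A minimal gunicorn.conf.py sketch, with made-up numbers, looks roughly like this:

    # gunicorn.conf.py -- recycle workers instead of chasing every small leak
    workers = 4
    max_requests = 20000         # restart a worker after ~20k requests
    max_requests_jitter = 2000   # stagger restarts so workers don't all recycle at once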
FWIW, my experience with Go apps is similar to yours. I have apps that handle hundreds of requests per second and run for months at a time without leaking memory. When there is a leak, it's easy to profile and find it.
Those apps just tend to be more constrained in what they do than, e.g., a major user-facing Django app, which is more subject to feature creep and ends up with a huge surface area of infrequently used endpoints.
In my experience, long-running Python was even worse than long-running Java for memory leaks. It often wasn't so much a memory "leak" in the logical sense as the process size "bloating" from tons of mallocs and frees, leading to heap fragmentation (may not be exactly the right term). This was many years ago.
Yes, it was likely fragmentation. You could run Python with jemalloc (not sure what it does by default these days) and get a much nicer memory footprint, especially over time.
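If anyone wants to check whether they're looking at actually leaked objects or allocator-level bloat before swapping in jemalloc, a rough sketch is to compare what tracemalloc tracks against the process RSS (Linux-only here since it reads /proc, and the numbers are illustrative):

    import tracemalloc

    def rss_mb():
        # current resident set size, from /proc (Linux-specific)
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024  # value is in kB
        return 0.0

    def report(label):
        traced, _peak = tracemalloc.get_traced_memory()
        print(f"{label:10s} traced ~{traced / 1e6:6.1f} MB   rss ~{rss_mb():6.1f} MB")

    tracemalloc.start()
    report("baseline")
    junk = [bytes(1000) for _ in range(200_000)]  # ~200 MB of smallish allocations
    report("allocated")
    del junk
    report("freed")  # traced memory falls back; RSS often stays high

If the traced number drops after a workload but RSS stays put, Python has already freed the objects and the growth is happening below it, which is the case where a different allocator can actually help.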