sanposhiho

Born on January 24, 2021•20 Karma

on March 21, 2024

•on: Tortoise: Shell-Shockingly-Good Kubernetes Autosca...

At Mercari, the responsibilities of the Platform team and the service development teams are clearly distinguished. Not all service owners possess expert knowledge of Kubernetes.

Also, Mercari has embraced a microservices architecture, currently managing over 1000 Deployments, each with its dedicated development team.

To effectively drive FinOps across such a sprawling landscape, it's clear that the platform team cannot individually optimize all services. As a result, they provide a plethora of tools and guidelines to simplify Kubernetes optimization for service owners.

But even with these, manually optimizing various parameters across different resources, such as resource requests/limits, HPA parameters, and Golang runtime environment variables, presents a substantial challenge.

Furthermore, this optimization constantly demands engineering effort from each team: adjustments are necessary whenever a change affects resource usage, which can occur frequently. Changes in implementation can alter resource consumption patterns, fluctuations in traffic volume are common, and so on.
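As a rough illustration of why the resource requests and HPA parameters mentioned above cannot be tuned independently, here is a minimal sketch based on the scaling formula in the Kubernetes HPA documentation (desired replicas = ceil(current replicas × current metric / target metric), with a default tolerance of 0.1). It is not Tortoise's logic; the function names and the numbers are hypothetical and chosen only for this example.

```python
import math


def cpu_utilization(usage_millicores: float, request_millicores: float) -> float:
    """CPU utilization as a percentage of the container's CPU request."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     tolerance: float = 0.1) -> int:
    """Replica count per the documented HPA formula (simplified sketch)."""
    ratio = current_utilization / target_utilization
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Same real usage (450m per pod), same HPA target (60%), different requests.
with_500m_request = desired_replicas(4, cpu_utilization(450, 500), 60)
with_300m_request = desired_replicas(4, cpu_utilization(450, 300), 60)
print(with_500m_request, with_300m_request)
```

Because utilization is measured relative to the request, lowering the request without also revisiting the HPA target changes how aggressively the workload scales, which is one reason these parameters have to be tuned together.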

Therefore, to keep our Kubernetes clusters optimized, it would necessitate mandating all teams to perpetually engage in complex manual optimization processes indefinitely, or until Mercari goes out of business.

To address these challenges, the platform team has embarked on developing Tortoise, an automated solution designed to meet all Kubernetes resource optimization needs.

This approach shifts the optimization responsibility from service owners to the platform team (Tortoises), allowing for comprehensive tuning by the platform team to ensure all Tortoises in the cluster adapt to each workload. Service owners, for their part, need to configure only a minimal number of parameters to start autoscaling with Tortoise, which significantly simplifies their involvement.

yolo3000•

on March 21, 2024

I would find it annoying for the platform team to readjust the specs of the pods I'm running on. To give insights is valuable, but otherwise it's an invitation for incidents to happen.

saulrh•

on March 21, 2024

Doing things I don't understand myself is also a recipe for disaster, and in my experience a rather greater one. The platform team is liable to make mistakes like scaling the service wrong or failing to anticipate upcoming changes. These incidents can be easily resolved by improving monitoring and communication, which are fundamentally useful things that I should already be doing for myriad other reasons. The mistakes I'm likely to make are things like "sequenced a complicated change wrong and null-routed the entire application" or "typo'd a volume name and found out that that autodeletes the entire database including backups", which I am simply not good at avoiding and constitute one of the major reasons I am in engineering instead of ops or IT. We are better off if I do the things I am best at and they do the things they are best at.

yolo3000•

on March 21, 2024

I would say both what you said and what I said are recipes for disaster, but letting another team do things behind your back on things that you're responsible for is not something you want to have. How would you feel if the cloud provider's engineers suddenly downgraded your nodes to different specs, causing downtime for your users? I think it's a false premise to assume that the application teams cannot observe their usage patterns and optimize themselves.

AeroNotix•

on March 21, 2024

I think this mostly comes down to whether applications can handle their workloads being restarted or scaled up/down based on demand without downtime.

It happens shockingly often that applications only support running as a single replica, and it's even worse when those applications cannot run concurrently with replicas of themselves, which prevents smooth rolling updates.

IME if applications tolerate restarts or support concurrent replicas, then scaling up and down to meet demand is absolutely fine.

mkl95•

on March 21, 2024

The reality for most engineers is that their CTOs stopped caring about tech somewhere between the late 90s and mid 2000s. You'll have to put up with processes designed by some dude who still views platform orgs as a bunch of sysadmins and webmasters.

almostdeadguy•

on March 21, 2024

Treating performance and reliability (which is inescapably affected by performance characteristics) as externalities is a great way to create perverse incentives for your engineering team.

Also this reads like a cry for help:

> Therefore, to keep our Kubernetes clusters optimized, it would necessitate mandating all teams to perpetually engage in complex manual optimization processes indefinitely, or until Mercari goes out of business.

beeboobaa3•

on March 21, 2024

Or you could learn the platform you are deploying your software to