Kirill Kashin

I am a Site Reliability Engineer (SRE) with a background in technical support, DevOps, Kubernetes, CI/CD, production platforms, and observability. My engineering path started with customer incidents, alerts, and technical problem analysis, then moved into DevOps and later into a more complete SRE practice with SLOs, postmortems, on-call, and service ownership. I have always been interested in the area where technical problems directly affect users or business processes, so most of my experience is connected with incident management, monitoring, automation, infrastructure, and improving production processes.

Starting in Technical Support

My path started in technical support at Tinkoff, where I worked with incident management, alerts, and technical problems affecting customers. This was not a first-line or second-line support team. Our work was closer to incident management and technical support for incidents that could affect a small number of users, hundreds of thousands of customers, or sometimes all users of a service.

At that time, the company was very focused on customer experience and public feedback. For example, if a complaint with a low rating appeared on a major banking review website, it automatically became a high-priority case. At the same time, the company was actively developing chat support inside the mobile app, which created a large number of new requests and increased the load on support teams.

To understand how first-line support really worked, we had to complete a basic training course for inbound support specialists. This was a one-time part of onboarding rather than a regular practice. We also sometimes listened to real customer calls to understand what customers and operators actually faced, where delays appeared, and which explanations really helped.

Most people in our team came through internal hiring from first-line or second-line support, so the team had a good understanding of how support worked from the inside.

Monitoring and Incident Handling

The highest priority was given to failures in infrastructure, integrations with other banks, providers, payments, transfers, deposits, the mobile app, or the website. We had monitors with dashboards where we watched the API platform, website UI, mobile applications, and key financial operations in real time.

The main tool for log analysis was Splunk. In some ways, it felt similar to Elasticsearch or OpenSearch, but with a more powerful search language. It allowed us to do complex operations directly in the search query: deduplication, aggregations, tables, searches by user identifiers, and so on. For example, we could quickly find similar errors, deduplicate them by user_id or phone number, and get a list of customers affected by the problem.

Not all systems were unified. Some teams used Elasticsearch, some used Graylog, but we worked with them less often. Mobile applications were also a separate case: Splunk mostly had logs related to API interaction, while client-side crashes and errors were tracked in Firebase.

It is important to say that we did not prevent incidents at the development or infrastructure level. The incidents had already happened. Our task was different: quickly understand the impact, reduce response time, give first-line support a clear instruction, and minimize the damage for customers and the call center.

If a failure happens and first-line support has no instruction, operators spend too much time on every request. They ask extra questions, search for information, escalate cases, and create tickets. With a large number of requests, this quickly creates a waiting queue. So the main goal was to reduce the handling time for each request related to a known problem.

A typical process looked like this: an alert appeared, or we detected a problem on dashboards. We assessed the impact, created a news post on the internal portal with a description of the issue and an instruction for first-line support, notified the responsible administrators or developers, and gave an estimated time of resolution. If the ETA changed, we updated the instruction, and operators started giving customers the new expected time.

Another part of the work was communication with customers who had already faced the problem or could have been affected, even if they had not contacted support. For this, we created lists of affected customers from logs, deduplicated them by unique fields, exported the data to SAP or other internal tools, and sent SMS or email notifications. The idea was simple: inform not only the people who complained, but also everyone who was actually affected by the incident.
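The deduplication step described above can be sketched roughly as follows. This is only an illustration in Python, not the Splunk queries we actually used; the record fields (user_id, phone, error) and the sample data are hypothetical.

```python
from typing import Iterable


def affected_customers(records: Iterable[dict], key: str = "user_id") -> list[dict]:
    """Return one record per unique customer, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value is None or value in seen:
            continue  # skip records without the key and repeated customers
        seen.add(value)
        unique.append(record)
    return unique


# Hypothetical error records, as they might look after exporting a log search.
sample = [
    {"user_id": "u1", "phone": "+70000000001", "error": "payment_timeout"},
    {"user_id": "u2", "phone": "+70000000002", "error": "payment_timeout"},
    {"user_id": "u1", "phone": "+70000000001", "error": "payment_timeout"},
]

customers = affected_customers(sample)
print([c["user_id"] for c in customers])
```

The same function can deduplicate by phone number by passing key="phone", which matters when one customer appears under several identifiers.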

Knowledge Base, Team, and Training

During that period, there were two equally important parts of the work: daily incident operations and building the foundation that helped the team work in a more stable way.

Incident management itself was very valuable experience. I had to understand impact quickly, communicate clearly, stay calm during failures, and keep contact with technical teams and support. But without documentation and training, this process does not scale well.

At that time, it was not always easy to find technical descriptions of the services we supported. A lot of information had to be collected directly from developers, architects, system analysts, and administrators. So we started building our own knowledge base: service descriptions, real cases, instructions, repeated scenarios, and playbooks for typical incidents.

The team grew from around five people in the office to twenty, and later to thirty, with office-based leads and remote operational specialists. Because of budget limits, we could not rely only on candidates with strong technical skills. Training became very important. I was responsible for preparing and running a two-week onboarding course: tests, video materials, analysis of real cases, and incident management training.

When the process became stable enough, the knowledge base started to live and grow every day. The main knowledge was moved into tests, videos, and training materials. At some point, I had to choose whether to continue growing as a manager and focus more on soft skills, or to go deeper into the technical side. I chose the second option.

Moving to DevOps

The next stage was a DevOps team that supported the internal CRM platform called TCRM — Tinkoff Customer Relationship Management. This system was built to replace Siebel, the old CRM system where call center operators performed almost all actions for customers.

Through TCRM, operators could reissue or block a card, transfer money, open a product, update customer information, create a service request, and perform many other operations. The system was integrated with almost the whole ecosystem: banking products, insurance, investments, travel, tickets, credit history checks, debts, fines, and other services.

Because of this, TCRM stability was critical. If the system was unavailable, operators basically could not work.

When I joined the team, both the department and the system were about two years old. Development was still very active: the team had to move functionality from the old CRM and also build new scenarios. This was an advantage because the architecture and infrastructure were created almost from scratch and were not heavily limited by legacy. The project architect and DevOps tech lead sat nearby, and communication was very close. This helped me understand not only the infrastructure, but also the product itself.

Technology Stack

At that time, the stack looked very large to me, and the first months were mostly about learning it quickly.

The code was stored in Bitbucket, and CI/CD was based on TeamCity. The main services were orchestrated in Rancher v1. Additional infrastructure was deployed with Ansible: installing Docker, running containers, and configuring hosts. Test environments were created in OpenStack with Terraform, while production hosts were requested from the infrastructure team. Later, this process was partly automated through a Slack bot.

For load balancing, we used HAProxy and Nginx. For service discovery, we used Eureka, which made sense because the main backend stack was based on Spring Boot. Databases were PostgreSQL, logs were in ELK, and tracing was in Zipkin. Metrics and alerting were built on Prometheus, Alertmanager, and Grafana.

The backend was mostly Java and Scala, while the frontend was Node.js and Angular. Code organization was also different: the backend used a more classic process with branches and merge requests into master, while the frontend was a monorepo with release tags from master.

Besides TCRM, our team was also responsible for an important chat service. It had a different stack: Python, multithreading, queues, and RabbitMQ. The deployment process was simpler because it used a blue-green approach. We could run two environments and switch the traffic percentage through HAProxy configs. This service gave me my first deeper experience with RabbitMQ, queues, stuck queues, plugins, and upgrades.

There was also an internal portal for first-line support — the same portal where I used to publish incident updates. Later, current incidents were integrated directly into TCRM, which made request handling faster. The portal itself was not a very pleasant service: it was based on Bitrix, written in PHP, and did not receive much attention because it was one of the candidates for decommissioning.

Deployment and Canary Release

The first area I focused on as a DevOps engineer was deployment automation. The main deploy tooling was written in Python, so the entry barrier was acceptable. The most interesting part was canary deployment.

Deployment in Rancher was implemented through Rancher API. For canary, it was important not only to create a new service, but also to track container statuses, logs, metrics, and decide whether the rollout could continue. For this, there was a separate stateful service where pipelines were described in YAML. It provided an API to start deployment, get status, stop deployment, and manage the procedure. TeamCity only called simple scripts and API methods.

There were both historical and technical reasons for this. Network access was not complete everywhere, running containers inside jobs was not properly implemented yet, and the performance of agents was not great. So the main logic was moved out of TeamCity into a separate service.

One deployment was represented as a procedure with a set of statuses and modular features, one of which was canary. Procedure configuration was stored separately, and the runtime state was stored in MongoDB.

In general, the pipeline worked like this: a new service was created in Rancher, some of the new instances were started, metrics and logs were checked, and then the new service was gradually scaled up while the old one was scaled down. But there was an important detail: old instances could not simply be killed. Before scaling down, we had to set a special decommission tag so that no new requests would go to them, while user procedures that had already started could finish.

This time depended on the service, but often it was around 15 minutes. For a service with dozens of containers and limited resources, when you cannot start many new instances at once, deployment could take quite a long time. Frontend applications were easier because they needed very few resources, so we could quickly start new instances and switch traffic to them.
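To show why such a rollout could take so long, here is a minimal sketch of the batch arithmetic. It assumes that batches run strictly one after another and that each batch waits for warmup and then for the drain of old instances. The batch size, the warmup time, and all names are hypothetical; only the drain time of around 15 minutes comes from our real setup.

```python
import math
from dataclasses import dataclass


@dataclass
class Step:
    new_running: int  # new-version instances running after this step
    old_serving: int  # old-version instances still accepting new requests


def rollout_plan(total: int, batch: int) -> list[Step]:
    """Replace `total` old instances with new ones in batches of `batch`."""
    steps = []
    replaced = 0
    while replaced < total:
        replaced = min(total, replaced + batch)
        steps.append(Step(new_running=replaced, old_serving=total - replaced))
    return steps


def estimated_minutes(total: int, batch: int, drain_min: float, warmup_min: float) -> float:
    """Each batch waits for new instances to warm up, then for old ones to drain."""
    batches = math.ceil(total / batch)
    return batches * (drain_min + warmup_min)


plan = rollout_plan(total=30, batch=5)
print(len(plan), estimated_minutes(30, 5, drain_min=15, warmup_min=3))
```

With these assumed numbers, 30 instances in batches of 5 give 6 batches and about 108 minutes in total, which matches the feeling that a deployment could easily run into the evening.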

First Serious Incident

I started doing on-call quite early, even before I had full expertise in the system. This was normal: in the beginning, many problems were solved either by rolling back to a previous version or by finding a problem in an external system and coordinating the incident. My previous incident management experience was very useful for this.

The first incident I clearly remember happened quite soon. I am not sure if it was literally the first one, but I remember it well because I caused it myself.

I had to deploy a new version of a service. The deployment took longer than expected and continued into the late evening. The auto-deploy system sometimes failed because of long timeouts, small batches, long decommission time for old instances, and long warmup time for new ones. So some steps had to be completed manually: send a curl request for decommission, wait, check requests, add new services, remove old ones, and repeat this several times.

I stayed alone in the office in front of the dashboard monitors. After one of the standard actions, which I had checked several times, I saw a fast growth of red and yellow error bars on the graphs. At first, I started checking logs and looking for proof that it was not related to my actions. But after a few minutes, it became clear that the problem was global and most likely related to the new application version.

In that situation, many questions appear at once: why the errors were not caught during deployment, whether to roll back immediately, or whether to investigate the exact reason first. But the procedure is the procedure. I wrote a message about the incident in the common incident channel, collected the impact from logs and dashboards at the same time, and called the tech lead, who had left the office about 15 minutes earlier. Luckily, he lived close to the office.

After that, we rolled back the release together. I waited for his confirmation on the key actions because I did not have enough experience yet. It felt like everything was happening very fast, and there was never enough time. In reality, the incident lasted around 30–40 minutes. It was a good lesson: not because outages stopped being scary after that, but because I understood how this process feels from the inside and what I needed to do to act more confidently.

First Experience with Kubernetes and Kustomize

The migration from Rancher did not happen at once. It took time to fully understand the limits of Rancher v1, especially its overlay network, which did not handle the growing number of containers well. Kubernetes became the target direction, but a migration of this size took more than two years.

My first practical experience with Kubernetes, actually through Rancher 2.0, started with feature environments for frontend applications. The idea was that every developer could create a branch and deploy it as a separate application with a unique URL.

For this, the existing deploy tooling had to be adapted for Kubernetes. In practice, we had to write a new version of the deploy service and describe specs in a Kubernetes-compatible format. At that time, we chose Kustomize because it was relatively simple, locally popular, and better suited for the task than a heavier Helm-based approach.

The new deploy service became an evolution of the previous approach: API, Swagger spec, dynamic configuration, the ability to register a new service through API, UI for checking status, support for different repository types, auto-tagging, and other features.

As a result, deployment happened in Kubernetes, and access to services was provided through Ingress. For each branch, a routing rule was created automatically: the branch name became part of the host or path, for example <branch>.service.domain.
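The mapping from a branch name to a host can be sketched like this. It is an assumption about how such a rule might look, not the code of our deploy service; the function name, the default domain, and the example branch are hypothetical.

```python
import re


def branch_to_host(branch: str, base_domain: str = "service.domain") -> str:
    """Turn a Git branch name into a DNS-safe host name."""
    label = branch.lower()
    label = re.sub(r"[^a-z0-9-]+", "-", label)  # replace anything outside a-z, 0-9, '-'
    label = label[:63].strip("-")  # a DNS label is at most 63 characters long
    if not label:
        raise ValueError(f"branch {branch!r} has no usable characters")
    return f"{label}.{base_domain}"


print(branch_to_host("feature/TCRM-123_new-form"))
```

One limitation of this simple approach is that two different branch names can produce the same label after normalization, so a real implementation would probably need a collision check or a short hash suffix.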

Not everything worked perfectly. Cleanup of old resources was not fully implemented, so unused objects accumulated over time. There were also periodic node-level problems: system resources could be exhausted, or we could face issues with networking, file descriptors, or disks, and the cluster had to be cleaned and fixed manually. But as a first production-adjacent Kubernetes use case, it worked and helped prepare the ground for a wider migration.

Service Migration Details

One of the challenges of the migration was not only infrastructure, but also the behavior of TCRM itself.

The system had an important feature: procedure generation. This component allowed users to create algorithms that operators had to follow, or even flows that customers could complete themselves without a human operator. Such procedures could last for dozens of minutes, and all requests within one procedure had to go to the same service instance where the procedure started.

As the load grew, there were more and more procedure services, and they needed to scale horizontally. But at the same time, it was important to keep each request chain attached to the same instance.

For this, we used a routing tag. It was passed in request headers, and the load balancer used it to understand where to send traffic. The tags themselves were registered during deployment in service discovery: each instance announced itself with a specific tag.
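The idea of tag-based sticky routing can be illustrated with a small model. This is a simplified sketch and not the actual HAProxy or Eureka configuration; the header name, the class, and the addresses are all hypothetical.

```python
import random


class TagRouter:
    """Routes requests to instances by a routing tag carried in a header."""

    HEADER = "X-Routing-Tag"  # hypothetical header name

    def __init__(self) -> None:
        self._by_tag: dict[str, str] = {}  # tag -> instance address
        self._draining: set[str] = set()   # tags that accept no new procedures

    def register(self, tag: str, address: str) -> None:
        """Called when an instance announces itself in service discovery."""
        self._by_tag[tag] = address

    def decommission(self, tag: str) -> None:
        """Stop sending new procedures to the instance, keep existing ones."""
        self._draining.add(tag)

    def route(self, headers: dict[str, str]) -> tuple[str, str]:
        """Return (tag, address) for a request."""
        tag = headers.get(self.HEADER)
        if tag in self._by_tag:
            return tag, self._by_tag[tag]  # sticky: stay on the same instance
        # New procedure: pick any instance that is not draining.
        candidates = [t for t in self._by_tag if t not in self._draining]
        if not candidates:
            raise RuntimeError("no instance available for new procedures")
        tag = random.choice(candidates)
        return tag, self._by_tag[tag]


router = TagRouter()
router.register("proc-a", "10.0.0.1:8080")
router.register("proc-b", "10.0.0.2:8080")
router.decommission("proc-a")
print(router.route({"X-Routing-Tag": "proc-a"}))  # existing procedure stays on proc-a
print(router.route({}))  # a new procedure can only go to proc-b
```

The important property is that decommissioning affects only the choice for new procedures: requests that already carry a tag keep going to the same instance until it is finally removed.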

During the migration to Kubernetes, this had to be handled separately. It was not enough to just start a Pod, create a Service, and call the task done. We had to keep the same routing behavior, the decommission process for old instances, and the ability to finish long-running procedures. So part of the complexity was not Kubernetes itself, but moving the existing traffic model and application lifecycle to the new platform.

TeamCity / Bitbucket → GitLab

An important milestone was the migration of the codebase and CI/CD from Bitbucket and TeamCity to GitLab. By that time, I already understood TeamCity quite well: jobs, inheritance, repository connections, templates, and most of its features. We even started running Docker-in-Docker on agents, which helped simplify some scripts and keep them in one place.

TeamCity already had config-as-code through Kotlin DSL, but visually and operationally it was quite heavy. GitLab YAML looked simpler and easier to understand, and GitLab also had advantages in cost and usability as a single platform.

The migration started with simple internal repositories: utility scripts, build-and-push jobs, report publishing, Elasticsearch index management, Terraform for OpenStack hosts, Ansible playbooks, and other projects without strict production dependencies. This allowed us to make early mistakes safely and learn the new tool.

Secrets were first stored in GitLab variables, and later moved to Vault when it became the standard for secret management.

Metrics, Logs, and Tracing

Metrics were based on Prometheus. Grafana was used for dashboards, and Alertmanager plus Grafana alerts were used for alerting. Logs were stored in the ELK stack, but we constantly had problems with it because of limited resources.

We had to introduce log storage quotas, limit the amount of data, and sometimes intentionally lower application log levels. This helped keep the logging system stable, but sometimes made investigations harder: if an incident happened and the required logs were missing, finding the root cause became more difficult.

The main problem was disk performance and the limited amount of resources. The Elasticsearch cluster was small: some nodes had SSDs, and some used HDDs as cold storage. Hot indices for the current day were placed on SSDs, while cold indices were moved to HDDs. The daily index volume could reach several terabytes.

If there was a large spike in events, it affected write speed. Logstash provided buffering and helped us avoid losing data immediately, but if Elasticsearch became unavailable, recovery was manual and slow: clearing queues, resetting caches, restarting shard allocation, and restoring throughput. Even after the cluster came back, accumulated queues were not drained quickly.

Tracing was more stable. We used standard protocols and Zipkin, the data volume was smaller, and our team maintained a tracing library with small additions on top of a standard Java approach.

The logging problem was fully solved not by further optimizing the existing ELK stack, but by migrating logging to the unified Sage platform, where we did not have the same strict resource limits.

Transformation into SRE

The next stage was the transformation from DevOps to SRE. For me, this was one of the best ways to understand the difference between these approaches in practice.

Some practices already existed before: on-call, incident management, monitoring, deployment automation, and production support. But after the move to SRE, they became more formal and more systematic.

In many ways, the first step was reading the Google SRE Book. Then there was a top-down initiative: the new direction was led by a person from Google, from the environment where the SRE approach was originally formed. The team was reorganized: developers were added to the team to increase our influence on the codebase and reduce the distance between operations, architecture, and development.

SLOs, SLAs, and SLIs were introduced for services. Availability and quality were no longer measured only as “working” or “not working”, but through concrete metrics, error budgets, and user expectations. Postmortems for user-impacting incidents became formalized. A single on-call channel was created to find responsible people faster and coordinate work during incidents. Before joining on-call, SRE engineers had to pass a qualification test.
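To make the error budget idea more concrete, here is a small worked example. The SLO target of 99.9%, the 30-day window, and the request counts are illustrative numbers, not the real targets of our services.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget left, given good and total request counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


print(round(error_budget_minutes(0.999), 1))
print(round(budget_remaining(0.999, good=999_500, total=1_000_000), 2))
```

For a 99.9% SLO, the budget is about 43.2 minutes of full unavailability per 30 days. In the request-based example, 500 failed requests out of one million use half of the budget, so half of it remains.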

But the main change was not in the terminology. It was in responsibility. The SRE team was no longer only supporting services. It became more involved in architecture, reliability practices, planning, observability, and process improvement.

Postmortems

Postmortems are one of the parts of SRE practice that I especially like. I have always been interested in incidents because they quickly show how a system really works: where the weak points are, where monitoring is missing, where the process does not work, and where the architecture is too complex or poorly documented.

A good postmortem is not about finding someone to blame. It is analytical work. It shows the connection between the problem, the impact, and what needs to be improved. It is one of the most effective ways to push changes: improve observability, fix processes, add alerts, change architecture, and close technical debt.

Without a postmortem, many improvements stay at the level of discussions. You need to convince people, explain the problem, and fight for priority. After a good postmortem, the problem and action items are documented, discussed by the team, and become part of the normal engineering backlog.

We used a structure close to the one described in the Google SRE Book: owners, dates, impact, background, timeline, mitigation, and follow-up tasks. We also documented how many users were affected, what errors happened, how fast the problem was detected, and how long recovery took.

An important point is that impact and severity are not the same thing. A problem can affect all users but be low severity, for example a visual defect. And the opposite is also possible: a problem can affect a small group of users but completely block an important operation for them. So one of the first tasks during an incident is to correctly define the overall level of the problem: impact, severity, affected users, business effect, and the required escalation level.

I also liked sections like “what went well” and “what didn’t go well”. They are useful for team self-reflection: what helped, what slowed down the response, where we were lucky, and where the system or process worked worse than expected.

Services, SDKs, and Ownership

Over time, the SRE team became responsible for its own services and libraries. This was an important shift: the team was not only supporting infrastructure, but also building tools used by product teams.

One example was an internal access-control service. It was not just a shared library package, but a separate service that checked whether a user had access to certain types of resources, allowed creation of new resources, and managed permissions.

The team also maintained Java SDKs for logging, metrics, profiling, Swagger, and other observability practices. The goal was not to make every product team invent its own way to do logging, metrics, or profiling. We standardized the basic level and made the right approach easier to use.

Canary deployment also evolved. It was migrated into a single application and started from GitLab pipelines as a separate job. We also used downstream and generated pipelines to manage more complex CI/CD scenarios.

Kubernetes resource management was another important area. Because capacity was limited, we had to define requests and limits carefully. In general, the approach was to make requests equal to limits, so resources were guaranteed for services and behavior was more predictable.
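Setting requests equal to limits corresponds to what Kubernetes calls the Guaranteed QoS class. The sketch below is a simplified model of that classification for CPU and memory only. It compares quantities as plain strings, so "500m" and "0.5" would be treated as different values, and the function name and input format are my own assumptions.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a Pod as Guaranteed, Burstable, or BestEffort.

    Each container is a dict like
    {"requests": {"cpu": "500m", "memory": "1Gi"}, "limits": {...}}.
    """
    resources = ("cpu", "memory")
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in resources:
            if resource in requests or resource in limits:
                any_set = True
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


equal = {"requests": {"cpu": "500m", "memory": "1Gi"},
         "limits": {"cpu": "500m", "memory": "1Gi"}}
lower_request = {"requests": {"cpu": "250m"},
                 "limits": {"cpu": "500m", "memory": "1Gi"}}
print(qos_class([equal]), qos_class([lower_request]), qos_class([{}]))
```

Pods in the Guaranteed class are the last candidates for eviction under node pressure, which is one reason this approach gave more predictable behavior on a cluster with limited capacity.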

Criteo, Observability, and Zabbix Migration

I currently work at Criteo on the Observability team. My focus areas are monitoring, alerting, SLOs, Grafana, Prometheus/VictoriaMetrics, logging, automation, and infrastructure processes.

One of the major projects was migrating teams from the legacy alerting stack, including Zabbix. As part of the merger and infrastructure consolidation, we had to help many teams move to a new monitoring and alerting approach, and unify tools, configurations, and processes. In my resume, I describe this as migrating 50+ teams and a large alert volume from Zabbix to the new system.

I also worked on consolidating many cron jobs into a single CLI application. The goal was not just to “rewrite scripts”, but to standardize the codebase, configuration management, secret handling, and integration with Vault and Kubernetes. For secrets, we used Vault and vault-secrets-webhook.

Another area was an SLO framework based on Sloth. We built an internal layer on top of it, integrated it with a Kubernetes operator written in Go, and used the CRD approach to manage vmalert instances. This made it possible to describe SLOs closer to the Kubernetes-native model and manage them in a more reproducible way.

Bare Metal, Cloud, and Infrastructure Duality

Criteo infrastructure is different from my previous experience because it has a clear duality. On one side, there are Kubernetes, Google Cloud, and modern cloud-native practices. On the other side, a large part of the infrastructure runs on bare metal and VMs, with Puppet and Chef for configuration.

This means that I have to work at different levels of abstraction. Sometimes the task is about Kubernetes manifests, operators, Prometheus rules, or GitLab pipelines. Sometimes it is about Linux, NUMA, IRQ affinity, network interfaces, queues, memory pressure, and the behavior of a specific process on a specific node.

One example is an investigation of an rsyslog problem in a multi-DC log pipeline. On some bare-metal nodes, throughput started to degrade under peak load. The problem was not caused by a single parameter. It involved NIC IRQ distribution, queue parallelism, memory pressure, and process behavior under load. In perf, we saw that page faults dominated, and there were also issues with queues and message processing.

The solution included comparing problematic nodes with healthy ones, checking IRQ distribution, analyzing rsyslog behavior, tuning imptcp and queue dequeue parameters, reducing queue parallelism, and rebalancing interrupts. For me, this is a good example of an SRE task where it is not enough to know one tool. You need to follow the whole path from symptoms on a dashboard down to the Linux system level.

Why I Like SRE

I like SRE because it is a broad engineering area that combines different types of work: postmortems, observability, writing code, helping users and product teams, managing infrastructure, architecture planning, capacity planning, automation, and process improvement.

I like that in this role you need to understand technical details and also think about the system as a whole: how it is deployed, how it is monitored, how it fails, how it recovers, who responds to incidents, what risks exist, and what can be improved so that next time the problem is detected faster or does not happen again.

For me, SRE is not only about “supporting production”. It is about making production systems more reliable, understandable, and manageable through code, processes, infrastructure, and engineering practices.