How to encourage your team to participate in Meetings
When we talk about monitoring systems, the first things that usually come to mind are metrics such as the following (see the sketch after the list):
- Databases (up/down)
- Kubernetes services (restarts, CPU/memory usage)
- HTTP (error count, response duration)
- Logging (error count)
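To make this concrete, here is a minimal sketch of my own (not from the article), assuming the popular `prometheus_client` Python library; the metric names are illustrative:

```python
# A hedged sketch: typical SRE-level metrics from the list above,
# expressed with the prometheus_client library. Names are illustrative.
from prometheus_client import Counter, Gauge, Histogram

db_up = Gauge("database_up", "Database reachable (1) or down (0)")
pod_restarts = Counter("pod_restarts_total", "Kubernetes pod restarts")
http_errors = Counter("http_errors_total", "HTTP error responses")
http_duration = Histogram("http_request_duration_seconds", "HTTP response duration")
log_errors = Counter("log_errors_total", "Error-level log records")
```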
But all these metrics belong to SRE monitoring; this is not DevOps monitoring, and it tells you nothing about application health.
The main thing to understand is that the two approaches do not replace each other; they complement each other. SRE monitoring lets you quickly see which component of the system is not working at all. DevOps monitoring lets you understand the causes, predict problems long before they occur, and assess the health of the project as a whole. The difference becomes especially noticeable when you try to analyze business logic or set up alerting.
### Metrics and alerting
Let's take an imaginary HTTP endpoint /login and set up request-duration metrics to understand when the system starts to degrade. What should the threshold be: 0.5s, 1s, 5s, 10s? All of these values are normal and acceptable, even 10s (last-mile issues on the client side), so we can't use a simple threshold-based alert because it would be spammy.
We can't round the values either; that would give a very inaccurate sample. So what can we do? Let's group all requests by duration and calculate the percentage of each group (see the sketch after the table):
| duration     | % of requests |
|--------------|---------------|
| <= 0.5s      | 15%           |
| > 0.5s <= 1s | 60%           |
| > 1s <= 5s   | 22%           |
| > 5s         | 3%            |
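A minimal sketch of this grouping (my own illustration, not code from the article): bucket raw durations and compute the share of each bucket, exactly as in the table:

```python
# Group request durations (in seconds) into buckets and compute per-bucket shares.
from bisect import bisect_left
from collections import Counter

BUCKETS = [0.5, 1.0, 5.0]  # upper bounds; anything above 5.0 falls into the last group
LABELS = ["<= 0.5s", "> 0.5s <= 1s", "> 1s <= 5s", "> 5s"]

def bucket_shares(durations: list[float]) -> dict[str, float]:
    counts = Counter(bisect_left(BUCKETS, d) for d in durations)
    total = len(durations)
    return {LABELS[i]: 100 * counts.get(i, 0) / total for i in range(len(LABELS))}

print(bucket_shares([0.3, 0.7, 0.8, 2.4, 6.1]))
# {'<= 0.5s': 20.0, '> 0.5s <= 1s': 40.0, '> 1s <= 5s': 20.0, '> 5s': 20.0}
```

This mirrors what histogram metrics with predefined buckets do in most monitoring systems.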
#### And now we can build two metrics
- 95% of all requests take no longer than 5s
- requests longer than 5s do not exceed 3% of all requests
This approach is called working with percentiles (see the sketch below).
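As a hedged illustration (my own, not from the article), here is how a percentile can be computed from raw durations and checked against the two rules above:

```python
import math

def percentile(durations: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value not exceeded by p% of samples."""
    ranked = sorted(durations)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

durations = [0.2, 0.4, 0.6, 0.7, 0.9, 1.2, 3.0, 4.5, 4.8, 7.0]
p95 = percentile(durations, 95)                                    # 7.0 here
slow_share = 100 * sum(d > 5 for d in durations) / len(durations)  # 10.0%

# Alert only when one of the two rules from the list above is violated.
should_alert = p95 > 5 or slow_share > 3
```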
#### How it looks in real life
In the screenshot, you can see the request latency for the DjAdmin service.
99% of all requests complete within 200ms, which is a very good value.
![[files/Pasted image 20250422165138.png]]
0.1% of requests complete within 2s, which is within acceptable limits.
### SRE monitoring is blind to application issues
If we talk about request latency, a value of 5s per request is bad and, of course, should be fixed immediately. But how do we find where the problem is?
Typical SRE monitoring analyzes only the total request duration, but under the hood the real path can look like this (a per-stage timing sketch follows the list):
- External load balancer
- Kubernetes ingress
- Service middlewares
- Service controller
- Database query
- Service business logic
- Response
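A hedged sketch of what per-stage instrumentation could look like (the stage names and handler are hypothetical; this is not the article's code):

```python
import time
from contextlib import contextmanager

stage_timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record how long one stage of the request path took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage] = time.perf_counter() - start

def handle_login(request):  # hypothetical /login handler
    with timed("middlewares"):
        ...  # auth checks, rate limiting
    with timed("database_query"):
        ...  # user lookup
    with timed("business_logic"):
        ...  # credential verification, token issuing
    # Export each stage as a separate metric instead of one total duration,
    # so a slow request can be attributed to a concrete stage.
    return stage_timings
```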
Another example is memory usage. From the SRE perspective, we can easily measure Kubernetes pod or application memory usage.
![[files/Pasted image 20250422165224.png]]
But this metric is totally useless for developers. Each application has several memory types, and each of them is related to different issues. For example, here is the memory structure of the "Prerender-API" service (Node.js).
How it looks in real life:
![[files/Pasted image 20250422165250.png]]
With SRE monitoring you can't even see that an issue exists, and this is a particularly bad example because we have memory leaks. You will not notice the problem at all if the service is deployed often. It is impossible to export metrics from within the application using only the SRE approach and its utilities.
![[files/Pasted image 20250422165328.png]]
Monitoring is primarily a tool for programmers: it should be convenient for them and provide APIs and libraries for integration (a sketch of in-app metric export follows).
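As a hedged illustration (assuming the `prometheus_client` library; the metric and its name are mine, not the article's), exporting an application-internal metric could look like this:

```python
import gc
import time
from prometheus_client import Gauge, start_http_server

# An internal metric that infrastructure-level monitoring cannot see:
# the number of live objects tracked by the Python garbage collector.
heap_objects = Gauge("app_heap_objects", "Live objects tracked by gc")

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for the monitoring system
    while True:
        heap_objects.set(len(gc.get_objects()))
        time.sleep(15)
```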
Therefore, programmers should choose the monitoring stack themselves; it should not be imposed on them, because CPU/memory usage, up/down statuses, etc. are a very small part of a complete monitoring system, where business metrics are the biggest and most important part.
### Business metrics
Business metrics, for the most part, cannot be developed by the SRE team; they are the responsibility of programmers. At the same time, these metrics are extremely important and informative.
Let's look at a simple example:
We have the following flow:
- The user tries to sign up on the frontend; the frontend sends an API request to the Auth0 API
- Auth0 stores the user in its internal database
- Auth0 sends a webhook to our backend API
- The backend API stores the user info in PostgreSQL
- Auth0 sends a verification email to the user through SendGrid
- After clicking the link in the verification email, the user is redirected to Auth0 to confirm the email verification
- The user is then redirected to the backend API, which stores the email confirmation status in PostgreSQL
As you can see, there are a lot of places where something can go wrong, and monitoring, logging, and alerting for all of them is a huge amount of work. But let's see how it can be simplified; a counter-instrumentation sketch follows.
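A hedged sketch of the instrumentation side (the metric names and handlers are my assumptions, again using `prometheus_client`): one counter per stage of the flow, so the charts below can compare them:

```python
from prometheus_client import Counter

signups = Counter("signup_requests_total", "SignUp requests from the frontend")
webhooks = Counter("auth0_webhooks_total", "Webhooks received from Auth0")
registered = Counter("users_registered_total", "Users stored in PostgreSQL")
verified = Counter("users_verified_total", "Email confirmations completed")

def on_auth0_webhook(payload: dict) -> None:  # hypothetical webhook handler
    webhooks.inc()
    # ... store the user in PostgreSQL ...
    registered.inc()

def on_email_confirmed(user_id: str) -> None:  # hypothetical confirmation handler
    # ... update the confirmation status in PostgreSQL ...
    verified.inc()
```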
A simple chart with business metrics. All lines are synchronous, which means that all services work as expected.
#### SignUp: all data growing
But we still might have problems:
![[files/Pasted image 20250422165422.png]]
Potential problems:
- frontend and Okta integration issues
- Okta issues
#### SignUp and webhooks growing disproportionately to all other data
![[files/Pasted image 20250422165618.png]]
Potential problems:
- Okta actions misconfiguration
- Project ingress route unavailable
- User service or DB issues
#### User verification status lagging behind
![[files/Pasted image 20250422165853.png]]
Potential problems:
- Issues with the email service (config, profile or billing, domain reputation, etc.)
### How to build alerting?
Two simple metrics can cover a lot of different cases (a check sketch follows the list):
- the difference between SignUp, Webhooks, and Registered should always be 0
- **the loss between (SignUp, Webhooks, Registered) and "Verified" should be less than 100%**: some users never confirm their email, which looks strange but is a completely normal situation
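As a hedged sketch (the function and its thresholds are my assumptions, not the article's), evaluating both rules could look like this:

```python
def check_signup_flow(signups: int, webhooks: int, registered: int, verified: int) -> list[str]:
    alerts = []
    # Rule 1: SignUp, Webhooks, and Registered must stay in sync (diff == 0).
    if not (signups == webhooks == registered):
        alerts.append(f"pipeline out of sync: signups={signups} "
                      f"webhooks={webhooks} registered={registered}")
    # Rule 2: losing some users at verification is normal,
    # but a 100% loss means the verification emails are broken.
    if registered > 0 and verified == 0:
        alerts.append("100% verification loss: check SendGrid/Auth0 email delivery")
    return alerts

print(check_signup_flow(100, 100, 100, 63))  # [] - 37% never verify, which is normal
print(check_signup_flow(100, 80, 80, 0))     # both alerts fire
```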