How to encourage your team to participate in Meetings
When we talk about monitoring systems, the first things that usually come to mind are metrics such as the following (see the sketch after the list):
- Databases (up/down)
- Kubernetes services (restarts, CPU/memory usage)
- HTTP (error count, response duration)
- Logging (error count)
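To make this concrete, here is a minimal sketch of my own (not from the article), assuming the popular `prometheus_client` Python library; the metric names are illustrative:

```python
# A hedged sketch: typical SRE-level metrics from the list above,
# expressed with the prometheus_client library. Names are illustrative.
from prometheus_client import Counter, Gauge, Histogram

db_up = Gauge("database_up", "Database reachable (1) or down (0)")
pod_restarts = Counter("pod_restarts_total", "Kubernetes pod restarts")
http_errors = Counter("http_errors_total", "HTTP error responses")
http_duration = Histogram("http_request_duration_seconds", "HTTP response duration")
log_errors = Counter("log_errors_total", "Error-level log records")
```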
But all these metrics belong to SRE monitoring; this is not DevOps monitoring, and it tells you nothing about application health.
The main thing to understand is that the two approaches do not replace each other; they complement each other. SRE monitoring lets you quickly see which component of the system is not working at all. DevOps monitoring lets you understand the causes, predict problems long before they occur, and assess the health of the project as a whole. The difference becomes especially noticeable when you try to analyze business logic or set up alerting.
### Metrics and alerting
Let's take an imaginary HTTP endpoint /login and set up request-duration metrics to understand when the system starts to degrade. What should the threshold be: 0.5s, 1s, 5s, 10s? All of these values are normal and acceptable, even 10s (last-mile issues on the client side), so we can't use a simple threshold-based alert because it would be spammy.
We can't round the values either; that would give a very inaccurate sample. So what can we do? Let's group all requests by duration and calculate the percentage of each group (see the sketch after the table):
| duration     | % of requests |
|--------------|---------------|
| <= 0.5s      | 15%           |
| > 0.5s <= 1s | 60%           |
| > 1s <= 5s   | 22%           |
| > 5s         | 3%            |
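A minimal sketch of this grouping (my own illustration, not code from the article): bucket raw durations and compute the share of each bucket, exactly as in the table:

```python
# Group request durations (in seconds) into buckets and compute per-bucket shares.
from bisect import bisect_left
from collections import Counter

BUCKETS = [0.5, 1.0, 5.0]  # upper bounds; anything above 5.0 falls into the last group
LABELS = ["<= 0.5s", "> 0.5s <= 1s", "> 1s <= 5s", "> 5s"]

def bucket_shares(durations: list[float]) -> dict[str, float]:
    counts = Counter(bisect_left(BUCKETS, d) for d in durations)
    total = len(durations)
    return {LABELS[i]: 100 * counts.get(i, 0) / total for i in range(len(LABELS))}

print(bucket_shares([0.3, 0.7, 0.8, 2.4, 6.1]))
# {'<= 0.5s': 20.0, '> 0.5s <= 1s': 40.0, '> 1s <= 5s': 20.0, '> 5s': 20.0}
```

This mirrors what histogram metrics with predefined buckets do in most monitoring systems.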
#### And now we can build two metrics
- 95% of all requests take no longer than 5s
- requests longer than 5s do not exceed 3% of all requests
This approach is called working with percentiles (see the sketch below).
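As a hedged illustration (my own, not from the article), here is how a percentile can be computed from raw durations and checked against the two rules above:

```python
import math

def percentile(durations: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value not exceeded by p% of samples."""
    ranked = sorted(durations)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

durations = [0.2, 0.4, 0.6, 0.7, 0.9, 1.2, 3.0, 4.5, 4.8, 7.0]
p95 = percentile(durations, 95)                                    # 7.0 here
slow_share = 100 * sum(d > 5 for d in durations) / len(durations)  # 10.0%

# Alert only when one of the two rules from the list above is violated.
should_alert = p95 > 5 or slow_share > 3
```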
#### How it looks in real life
In the screenshot, you can see the request latency for the DjAdmin service.
99% of all requests complete within 200ms, which is a very good value.
![[files/Pasted image 20250422165138.png]]
0.1% of requests complete within 2s, which is within acceptable limits.
### SRE monitoring is blind to application issues
If we talk about request latency, a value of 5s per request is bad and, of course, should be fixed immediately. But how do we find where the problem is?
Typical SRE monitoring analyzes only the total request duration, but under the hood the real path can look like this (a per-stage timing sketch follows the list):
- External load balancer
- Kubernetes ingress
- Service middlewares
- Service controller
- Database query
- Service business logic
- Response
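A hedged sketch of what per-stage instrumentation could look like (the stage names and handler are hypothetical; this is not the article's code):

```python
import time
from contextlib import contextmanager

stage_timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record how long one stage of the request path took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage] = time.perf_counter() - start

def handle_login(request):  # hypothetical /login handler
    with timed("middlewares"):
        ...  # auth checks, rate limiting
    with timed("database_query"):
        ...  # user lookup
    with timed("business_logic"):
        ...  # credential verification, token issuing
    # Export each stage as a separate metric instead of one total duration,
    # so a slow request can be attributed to a concrete stage.
    return stage_timings
```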
Another example is memory usage. From the SRE perspective, we can easily measure Kubernetes pod or application memory usage.
![[files/Pasted image 20250422165224.png]]
But this metric is totally useless for developers. Each application has several memory types, and each of them is related to different issues. For example, here is the memory structure of the "Prerender-API" service (Node.js).
How it looks in real life:
![[files/Pasted image 20250422165250.png]]
With SRE monitoring you can't even see that an issue exists, and this is a particularly bad example because we have memory leaks. You will not notice the problem at all if the service is deployed often. It is impossible to export metrics from within the application using only the SRE approach and its utilities.
![[files/Pasted image 20250422165328.png]]
Monitoring is primarily a tool for programmers: it should be convenient for them and provide APIs and libraries for integration (a sketch of in-app metric export follows).
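As a hedged illustration (assuming the `prometheus_client` library; the metric and its name are mine, not the article's), exporting an application-internal metric could look like this:

```python
import gc
import time
from prometheus_client import Gauge, start_http_server

# An internal metric that infrastructure-level monitoring cannot see:
# the number of live objects tracked by the Python garbage collector.
heap_objects = Gauge("app_heap_objects", "Live objects tracked by gc")

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for the monitoring system
    while True:
        heap_objects.set(len(gc.get_objects()))
        time.sleep(15)
```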
Therefore, programmers should choose the monitoring stack themselves; it should not be imposed on them, because CPU/memory usage, up/down statuses, etc. are a very small part of a complete monitoring system, where business metrics are the biggest and most important part.
### Business metrics
Business metrics, for the most part, cannot be developed by the SRE team; they are the responsibility of programmers. At the same time, these metrics are extremely important and informative.
Let's look at a simple example:
We have the following flow:
- The user tries to sign up on the frontend; the frontend sends an API request to the Auth0 API
- Auth0 stores the user in its internal database
- Auth0 sends a webhook to our backend API
- The backend API stores the user info in PostgreSQL
- Auth0 sends a verification email to the user through SendGrid
- After clicking the link in the verification email, the user is redirected to Auth0 to confirm the email verification
- The user is then redirected to the backend API, which stores the email confirmation status in PostgreSQL
As you can see, there are a lot of places where something can go wrong, and monitoring, logging, and alerting for all of them is a huge amount of work. But let's see how it can be simplified; a counter-instrumentation sketch follows.
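A hedged sketch of the instrumentation side (the metric names and handlers are my assumptions, again using `prometheus_client`): one counter per stage of the flow, so the charts below can compare them:

```python
from prometheus_client import Counter

signups = Counter("signup_requests_total", "SignUp requests from the frontend")
webhooks = Counter("auth0_webhooks_total", "Webhooks received from Auth0")
registered = Counter("users_registered_total", "Users stored in PostgreSQL")
verified = Counter("users_verified_total", "Email confirmations completed")

def on_auth0_webhook(payload: dict) -> None:  # hypothetical webhook handler
    webhooks.inc()
    # ... store the user in PostgreSQL ...
    registered.inc()

def on_email_confirmed(user_id: str) -> None:  # hypothetical confirmation handler
    # ... update the confirmation status in PostgreSQL ...
    verified.inc()
```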
A simple chart with business metrics. All lines are synchronous, which means that all services work as expected.
#### SignUp: all data growing
But we still might have problems:
![[files/Pasted image 20250422165422.png]]
Potential problems:
- frontend and Okta integration issues
- Okta issues
#### SignUp and webhooks growing disproportionately to all other data
![[files/Pasted image 20250422165618.png]]
Potential problems:
- Okta actions misconfiguration
- Project ingress route unavailable
- User service or DB issues
#### User verification status lagging behind
![[files/Pasted image 20250422165853.png]]
Potential problems:
- Issues with the email service (config, profile or billing, domain reputation, etc.)
### How to build alerting?
Two simple metrics can cover a lot of different cases (a check sketch follows the list):
- the difference between SignUp, Webhooks, and Registered should always be 0
- **the loss between (SignUp, Webhooks, Registered) and "Verified" should be less than 100%**: some users never confirm their email, which looks strange but is a completely normal situation
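As a hedged sketch (the function and its thresholds are my assumptions, not the article's), evaluating both rules could look like this:

```python
def check_signup_flow(signups: int, webhooks: int, registered: int, verified: int) -> list[str]:
    alerts = []
    # Rule 1: SignUp, Webhooks, and Registered must stay in sync (diff == 0).
    if not (signups == webhooks == registered):
        alerts.append(f"pipeline out of sync: signups={signups} "
                      f"webhooks={webhooks} registered={registered}")
    # Rule 2: losing some users at verification is normal,
    # but a 100% loss means the verification emails are broken.
    if registered > 0 and verified == 0:
        alerts.append("100% verification loss: check SendGrid/Auth0 email delivery")
    return alerts

print(check_signup_flow(100, 100, 100, 63))  # [] - 37% never verify, which is normal
print(check_signup_flow(100, 80, 80, 0))     # both alerts fire
```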