Logging is about more than technical details such as log levels or a Logback configuration. Those details matter too, but they come second. I'll show you what I mean.
The project I joined needed logging. Only errors were logged, and we were aiming for 15 million calls per day. Can you imagine the problems?
And yes, the problems came: customers complained about seeing something twice, another API received the wrong data. We had to sit down and design our logging from scratch.
So, what should we aim for? My recent experience boils it down to three questions:
Why do we want to log?
What do we want to log?
How do we want to log?
WHY
A seemingly useless question, right? You want your precious info in the files or in Kibana, and that's it, so why am I bothering you with this? I thought the same at the beginning. I changed my mind very quickly after discussing it with my colleagues: everyone had a different view on what to log and how much. The source of this schism lies in our expectations, in our unspoken assumptions. That is why we need to talk about the "why".
Our why was simple: we need to solve production incidents, and we need to know what is going on in our system, mostly at its boundaries, the information coming in and going out.
WHAT
Once we had answered why, this should have been straightforward: we wanted to log all incoming requests and responses. Easy. Or is it?
There are two main problems. The first, as always, is disk space. Logging is expensive, especially in the cloud. We have this nice Azure public cloud infrastructure where we pay for everything we use, and we run Kubernetes with many pods, so disk space is pricey. We needed to address this; see how below.
The second: we work in the banking industry, so we must comply with its restrictions. Personal data should not be logged, and where it is, only a few people should have access. Our solution is to hash it. We still need to search the logs and pair related records, and hashing handles that nicely while staying within company guidelines.
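To make that concrete, here is a minimal sketch of what such hashing can look like. It is my illustration, not our production code: the class name `LogHasher` is made up, SHA-256 is just one reasonable algorithm choice, and `HexFormat` assumes Java 17 or newer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical helper: replaces personal data with a stable hash before it hits the log.
public final class LogHasher {

    private LogHasher() {
    }

    // The same input always produces the same hash, so two log lines about the same
    // customer can still be paired and searched, without the raw value ever being logged.
    public static String hash(String personalData) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(personalData.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(bytes);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

A log call then becomes something like `log.info("Created account for customer {}", LogHasher.hash(customerId))`, where `customerId` is a made-up field name. In practice you would probably also mix in a secret salt so a known account number cannot simply be hashed and matched against the logs; your guidelines will differ from ours.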
HOW
There were two problems to solve and one logging implementation ahead of us.
Implementing REST request and response logging is easy: pick a filter or an interceptor. We went with the interceptor, which is easy to test and gives us a single place to worry about. We also ran performance tests, an essential but often ignored step, because logging can slow your server down when a pod handles 200 requests per second.
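Below is a hedged sketch of what such an interceptor can look like, assuming a Spring MVC application on Spring Boot 3 (jakarta servlet namespace) with SLF4J; the class name, attribute key, and log format are mine, not the project's actual code.

```java
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.servlet.HandlerInterceptor;

// Hypothetical request/response logging interceptor: one place to log every call.
public class RequestLoggingInterceptor implements HandlerInterceptor {

    private static final Logger log = LoggerFactory.getLogger(RequestLoggingInterceptor.class);
    private static final String START_ATTR = "loggingStartTime";

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) {
        // Remember when the request arrived and log the incoming call.
        request.setAttribute(START_ATTR, System.currentTimeMillis());
        log.info("IN  {} {}", request.getMethod(), request.getRequestURI());
        return true; // let the request continue to the controller
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response,
                                Object handler, Exception ex) {
        // Log the outgoing status code and how long the call took.
        long start = (Long) request.getAttribute(START_ATTR);
        log.info("OUT {} {} -> {} ({} ms)",
                request.getMethod(), request.getRequestURI(),
                response.getStatus(), System.currentTimeMillis() - start);
    }
}
```

The interceptor still has to be registered through a `WebMvcConfigurer`'s `addInterceptors` method, and capturing full request and response bodies takes a bit more work (for example Spring's `ContentCachingRequestWrapper` in a filter), so treat this purely as a starting point.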
How did we deal with disk space? The expensive way: we decided to roll out the ELK stack. It costs more at the beginning but pays off later. We have Filebeat, Logstash, and Kibana. Another advantage of Kibana is that it lets anyone search through the hordes of log data; with 20 pods there are plenty of log files. Did I mention we have 13 services?
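One small thing that makes searching across many pods and services easier is tagging every log line with where it came from. This is a generic illustration, not necessarily what we did: it assumes SLF4J with Logback, and the field and service names are placeholders.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Hypothetical example: put the service and pod name into the MDC so that,
// with a Logback pattern containing %X{service} and %X{pod}, every log line
// carries fields Kibana can filter on.
public class MdcTaggingExample {

    private static final Logger log = LoggerFactory.getLogger(MdcTaggingExample.class);

    public static void main(String[] args) {
        // In Kubernetes the pod name is typically exposed as the HOSTNAME env variable.
        String pod = System.getenv().getOrDefault("HOSTNAME", "unknown");
        MDC.put("service", "payment-service"); // placeholder service name
        MDC.put("pod", pod);

        log.info("Request processed"); // the MDC fields ride along with this line
    }
}
```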
Hashing mostly solved compliance and data security. We rewrote our interceptor accordingly, and that was it: logging works as expected.
What is your experience with logging? I'd love to read about it. Leave a comment below.