Browsing through logs is always hard, even on a single-node system. You scroll up and down, trying to figure out which events preceded a certain error. Often you want to see what followed the error, and then you go back again to find the actual cause. Most of the time you also want to know how often a similar error has happened in the past. Now imagine these problems in a distributed system, where each of them is multiplied by the number of machines.
We work a lot with Cassandra. A normal cluster installation consists of at least 6 nodes. That is only the horizontal distribution; you also have a lot of layers on top of the cluster, usually on separate machines. To figure out what caused a particular problem, you need to correlate information from multiple Cassandra nodes on the one hand, and figure out what happened above the database on the other.
We have already scratched the surface in the Monitoring Stack for Distributed Systems blog post. In this post, we concentrate on logs alone and provide Ansible scripts to provision machines with all the tools you need to effectively monitor logs in a distributed system.
CsshX - Cluster SSH tool
This nice tool solves a part of the problem. You can SSH to multiple machines at the same time and follow what is happening in multiple terminals.
You provide a property file with the IP addresses of the machines you want to connect to. At the bottom of the screen there is a red area for entering combined commands (executed in each terminal), and you can isolate a single machine by selecting the desired terminal. This saves a lot of time, especially with something like a Cassandra cluster, where files are on the same paths on multiple nodes and you need to run the same commands on all of them. If you want to find out more details about the tool, check out this link. One problem is that it is a Mac-specific tool (even though I believe there are similar tools for other platforms). More importantly, this approach does not solve the problem of browsing logs and going back and forth. A real solution would be a centralized place where logs are stored and indexed for easy browsing.
ELK stack to the rescue
The ELK (Elasticsearch, Logstash, Kibana) stack is a great fit for this problem. It gives you everything you need: agents that send log information (log shippers), a server that collects these log messages, stores them and indexes them so they can be searched easily, and a visualization tool to present them.
You start off by installing the Filebeat agent on each machine you want to collect log events from. The agent sends changes in the configured log files to a configured IP address where the Logstash server listens. This way, all logs end up in one centralized location. Filebeat is configured in a YAML file where you define the paths to the logs, the scan frequency, where to send events and which template to use. Log events are transformed into beats based on the Filebeat template. You can find more details in our smartcat-ops GitHub repository, where we created an Ansible role that installs the Filebeat agent and configures it to send beats to the Logstash server.
The Logstash server accepts log messages and is configured to use Elasticsearch to index them. Usually, the default configuration is just right. Kibana is the last piece of the stack, and it is basically a visualization tool. It gives you a nice interface where you can add search criteria, a time range and all kinds of queries, and see combined results from all machines sorted by time. It uses the Elasticsearch index underneath to pull up data based on the provided criteria. This way you can easily see everything in one place, find exactly what you are looking for and correlate things that happened at the same time on different machines. In distributed systems, an event in one place is often the source of a problem in another place, and this way those kinds of problems can be spotted easily. We used palkan.elk from Ansible Galaxy to provision the monitoring server machine. On their GitHub account you can see more details and templates for the configuration files.
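The idea of seeing "combined results from all machines sorted by time" can be illustrated without the ELK stack at all. The sketch below is not how Logstash or Elasticsearch work internally. It is a minimal Python illustration, assuming that every log line starts with an ISO 8601 timestamp and that each node's log is already ordered by time. The node names and log lines are made up for the example.

```python
import heapq
from datetime import datetime


def parse_line(node, line):
    """Split 'ISO-timestamp message' into (timestamp, node, message)."""
    stamp, message = line.split(" ", 1)
    return datetime.fromisoformat(stamp), node, message


def merge_logs(logs_by_node):
    """Merge per-node logs (each already sorted by time) into one timeline."""
    streams = [
        [parse_line(node, line) for line in lines]
        for node, lines in logs_by_node.items()
    ]
    # heapq.merge combines inputs that are already sorted into one sorted stream
    return list(heapq.merge(*streams))


# Hypothetical log lines from two nodes
logs_by_node = {
    "node1": [
        "2016-05-10T10:00:01 INFO Compaction started",
        "2016-05-10T10:00:07 ERROR Read timeout",
    ],
    "node2": [
        "2016-05-10T10:00:03 WARN GC pause of 800 ms",
        "2016-05-10T10:00:09 INFO Compaction finished",
    ],
}

timeline = merge_logs(logs_by_node)
for stamp, node, message in timeline:
    print(stamp.isoformat(), node, message)
```

Reading the merged timeline, the GC pause on node2 shows up right before the read timeout on node1, which is exactly the kind of cross-machine correlation that is hard to spot when each log is viewed separately.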
Going a step further with Grafana
Having the ELK stack alone is enough for most use cases. However, when you want to see trends and combine log data with data from other sources, things may become tricky and you need additional tools. Our weapon of choice is Grafana. It supports a variety of data sources, has a nice overview of plugins and offers a lot of nice and shiny visualizations to present your data. It integrates with Elasticsearch, which you can set up as a data source. After that, you can pull the log messages already indexed by the ELK stack and plot them on a graph or show them in a table. On the screen below you can see a combination of a metrics graph which plots latency values on the Y axis, a count of error messages in the logs and a table with error message details. Everything reacts to the selected interval, so you get a nice trend of error messages throughout the selected time.
A nice part of this is that with Grafana you can mix different data providers. On this screen, the upper graph is plotted from InfluxDB, while the second and third graphs use the already mentioned integration with Elasticsearch. This brings us to one more Grafana advantage: it can combine logs with metrics to explain why some measurements deviated from their expected values.
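To make the "count of error messages" graph more concrete, here is a small Python sketch of the underlying idea: grouping error events into fixed time buckets. Grafana and Elasticsearch do this bucketing for you (Elasticsearch calls it a date histogram aggregation), so this is only an illustration of the concept. The event tuples and their contents are hypothetical.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(events):
    """Count ERROR events per one-minute bucket.

    `events` is an iterable of (timestamp, level, message) tuples.
    """
    counts = Counter(
        stamp.replace(second=0, microsecond=0)  # truncate to the start of the minute
        for stamp, level, _ in events
        if level == "ERROR"
    )
    return dict(sorted(counts.items()))


# Hypothetical, already parsed log events
events = [
    (datetime(2016, 5, 10, 10, 0, 7), "ERROR", "Read timeout"),
    (datetime(2016, 5, 10, 10, 0, 30), "INFO", "Compaction finished"),
    (datetime(2016, 5, 10, 10, 0, 45), "ERROR", "Read timeout"),
    (datetime(2016, 5, 10, 10, 2, 5), "ERROR", "Write timeout"),
]

trend = errors_per_minute(events)
for bucket, count in trend.items():
    print(bucket.isoformat(), count)
```

Plotting these buckets over the selected interval gives the error trend, and the same time axis can then be shared with a latency graph coming from another data source.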
Combine with metrics
We worked on a project with a really tight SLA: we needed to bring the latency of 99.999% of requests to a Cassandra cluster under a certain threshold. To do that, we created the cassandra-diagnostics project, which hooks into the query path, measures latency and sends all queries above the configured threshold to a metrics server through a configurable reporter. We managed to tune the cluster so that queries stayed under the threshold, but we were still left with peaks. To figure out what caused these peaks, we needed to combine metrics with log information and see what the root cause of the slow queries was at a given point in time. This is just one example; you can probably think of more cases where it would be interesting to attach the context provided by log messages to a value you are measuring at that point in time.
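The kind of correlation described above can be sketched in a few lines of Python. This is not the cassandra-diagnostics API and not what Grafana does internally. It is a minimal illustration, assuming you already have latency samples and log events as timestamped tuples. The function name, the threshold, the time window and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def logs_around_slow_queries(measurements, log_events, threshold_ms,
                             window=timedelta(seconds=5)):
    """For every latency sample above the threshold, collect nearby log messages.

    `measurements` holds (timestamp, latency_ms) tuples and
    `log_events` holds (timestamp, message) tuples.
    """
    context = {}
    for stamp, latency_ms in measurements:
        if latency_ms <= threshold_ms:
            continue  # only peaks are interesting
        context[stamp] = [
            message
            for log_stamp, message in log_events
            if abs(log_stamp - stamp) <= window
        ]
    return context


# Hypothetical latency samples and log events
measurements = [
    (datetime(2016, 5, 10, 10, 0, 0), 12.0),
    (datetime(2016, 5, 10, 10, 0, 10), 250.0),
    (datetime(2016, 5, 10, 10, 0, 20), 15.0),
]
log_events = [
    (datetime(2016, 5, 10, 10, 0, 8), "GC pause of 800 ms"),
    (datetime(2016, 5, 10, 10, 0, 19), "Compaction finished"),
]

peaks = logs_around_slow_queries(measurements, log_events, threshold_ms=100)
for stamp, messages in peaks.items():
    print(stamp.isoformat(), messages)
```

With this sample data, the only peak is the 250 ms sample, and the only log message within five seconds of it is the GC pause. On a dashboard the same effect comes from looking at the metrics graph and the log table over the same selected interval.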
Having logs in a central place is a huge time saver. It really pays off to set everything up once and use it many times. Based on our previous experience, this would be step one for us on every distributed system. Usually people decide to add monitoring only after an error has already occurred. Setting it up in advance gives you two benefits: you are proactive and have all the infrastructure in place to figure out the cause if an error occurs, and you get to know the technology better by visualizing logs in a central place. Stay tuned, because we are preparing another blog post which will go deeper into the monitoring part of this story.