I recently attended an excellent free developer conference called DDD11 and went to Joel Hammond-Turner’s session on Splunk. I know our infrastructure team uses Splunk at work and I knew it’s a powerful framework, but I was still really impressed with what it offers… and all for free!
My use cases
In my current project, our GUIs are deployed globally and client logs are stored locally on the end users’ machines. If we get an unhandled exception or there is an error, we will never know unless the user contacts our support teams.
It would be much better if we could take a proactive approach and deal with errors, or at least be aware of them, before our users call up. I built a Node.js service and a log4net extension that posts any errors or fatals to the service. They are then logged to a MongoDB database. I then built a web front end to display the exceptions and allow us to group them and calculate % occurrence etc. We can then fix common issues that are not reported and keep an eye on the site to monitor our clients’ GUIs. My next task is to integrate socket.io or SignalR to provide a live alerting dashboard.
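The grouping and % occurrence calculation is simple enough to sketch. This is not my actual service code, just a minimal Python illustration; the record fields (`type`, `message`, `user`) and the sample data are made up for the example.

```python
from collections import Counter


def occurrence_percentages(errors):
    """Group error records by (type, message) and return each group's share of the total."""
    counts = Counter((e["type"], e["message"]) for e in errors)
    total = sum(counts.values())
    # most_common() yields groups ordered by count, highest first
    return [
        {"type": t, "message": m, "count": n, "percent": round(100.0 * n / total, 1)}
        for (t, m), n in counts.most_common()
    ]


# Hypothetical records, shaped like what a logging extension might post
sample_errors = [
    {"type": "NullReferenceException", "message": "Object reference not set", "user": "a"},
    {"type": "NullReferenceException", "message": "Object reference not set", "user": "b"},
    {"type": "TimeoutException", "message": "Pricing service timed out", "user": "a"},
    {"type": "NullReferenceException", "message": "Object reference not set", "user": "c"},
]

for row in occurrence_percentages(sample_errors):
    print(f'{row["percent"]:5.1f}%  {row["type"]}: {row["message"]}')
```

Grouping on type plus message is the crudest possible key; grouping on a normalised stack trace would probably give better buckets.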
User actions at the GUI are asynchronously passed through several systems. At some point in the future we will get a callback with the new value. For example if someone manually marks a spread, it will go through the system and then we will get an update with the new spread value.
We have had a few instances of people reporting that their GUI is slow when they adjust the spread, and that it’s taking “ages” to take effect (i.e. the round trip is taking a long time). I then have to spend ages trawling through the logs of various systems to try and work out where the time went.
As a solution, I pushed for the other systems to accept a tracking id. I then pass a unique id when the user performs an action and track this through all the systems, logging various way-points along the way.
We then built a tool that reads the output logs and graphs each request as a stacked bar, showing which system, and where in that system, most of the time went. You’ll need time-synchronised machines to get an accurate picture of what’s going on, or you can just take the round trip time using the client’s clock.
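To give a feel for what the tool works out from the way-points, here is a rough Python sketch. It is not the tool itself, and the field names (`tracking_id`, `system`, `waypoint`, `timestamp_ms`) and the system names are assumptions for illustration. Each stage's duration is the gap between consecutive way-points for the same tracking id, which is only meaningful across machines if their clocks are synchronised; the round trip uses the client's way-points alone.

```python
from collections import defaultdict


def stage_durations(waypoints):
    """Return {tracking_id: [(stage label, milliseconds), ...]} from way-point records."""
    by_id = defaultdict(list)
    for wp in waypoints:
        by_id[wp["tracking_id"]].append(wp)

    result = {}
    for tracking_id, wps in by_id.items():
        # Order way-points by time, then measure the gap between neighbours
        wps.sort(key=lambda wp: wp["timestamp_ms"])
        result[tracking_id] = [
            (f'{a["system"]}:{a["waypoint"]} -> {b["system"]}:{b["waypoint"]}',
             b["timestamp_ms"] - a["timestamp_ms"])
            for a, b in zip(wps, wps[1:])
        ]
    return result


def round_trip_ms(waypoints, tracking_id, client_system="gui"):
    """Round trip measured on the client's clock only, so clock skew does not matter."""
    times = [wp["timestamp_ms"] for wp in waypoints
             if wp["tracking_id"] == tracking_id and wp["system"] == client_system]
    return max(times) - min(times)


# Hypothetical way-points for one user action
waypoints = [
    {"tracking_id": "t1", "system": "gui", "waypoint": "sent", "timestamp_ms": 1000},
    {"tracking_id": "t1", "system": "gateway", "waypoint": "received", "timestamp_ms": 1020},
    {"tracking_id": "t1", "system": "pricing", "waypoint": "published", "timestamp_ms": 1400},
    {"tracking_id": "t1", "system": "gui", "waypoint": "callback", "timestamp_ms": 1450},
]

for label, ms in stage_durations(waypoints)["t1"]:
    print(f"{ms:5d} ms  {label}")
print("round trip:", round_trip_ms(waypoints, "t1"), "ms")
```

Each (label, milliseconds) pair becomes one segment of the stacked bar for that request.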
Splunk or ELK?
To be honest, I haven’t used either in anger yet. The Splunk platform does look slightly better, although it comes at a price. We have a team using it at work, so I might be able to use their instance; if not, it will take months to have the budget signed off.
Rather than waiting months, I thought I would investigate free alternatives. ELK appears to be its main competitor and, from the videos I have seen so far, it looks awesome too.
It turns out I have wasted a bit of time writing my own Node.js service, MongoDB database and web front end (although it was good fun). ELK supports all of this out of the box. I currently post JSON from my log4net extension, so I will post straight to the Elasticsearch REST API. I then don’t need to set up Logstash.
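Posting a log event straight to Elasticsearch is just an HTTP POST of a JSON document. The sketch below is in Python rather than my log4net extension. The host, the index name `client-logs` and the document fields are assumptions. The `/<index>/_doc` path is the document-indexing endpoint in recent Elasticsearch versions; older versions expect a type name in the path instead, so check the docs for the version you run.

```python
import json
import urllib.request


def build_index_request(base_url, index, document):
    """Build (but do not send) a POST that indexes one JSON document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical log event; the field names are illustrative only
event = {
    "level": "ERROR",
    "logger": "SpreadView",
    "message": "Unhandled exception while marking spread",
    "user": "trader42",
}

req = build_index_request("http://localhost:9200", "client-logs", event)
print(req.get_method(), req.full_url)

# To actually send it (needs a running Elasticsearch instance):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```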
I can then configure Kibana to graph our performance statistics. It has some great features, such as sliding windows, so we can see the performance of everyone’s requests in the last 30 minutes, for example, and then render that in a stacked bar. Here is a screenshot of Kibana from the Elastic website:
Elastic also provide another framework called Watcher that can watch Elasticsearch and send alerts. This means we can get an email every time one of our clients logs an error. This would be a great way to annoy us into fixing issues.