Cassandra Datastax C Sharp Driver problems - solution
In the previous post I’ve described a strange problem related to Cassandra Datastax C# Driver which was happening once in the production environment. It’s time to reveal the mystery.
Two root causes
One of the hidden, but important metric, which you won’t find usually in your logs is the CPU usage. What important is that the connection setup to the Cassandra cluster consists of many small steps. In production, when there was a very high CPU usage (around 100% - for reason known and already eliminated), the connection setup process was timing out in such a moment, that the final result was reported as NoHostAvailableException. This shows, how important is to track and prevent 100% CPU usage.
But why, let me quote myself:
Things get back to normal after the client restart… and gets back to madness few hours later, at higher load. Incredible high number of NoHostAvailableExceptions, like almost any connection to the Cassandra fails.
The problem is here:
1 2 3 4 5 6 7 8 9 10 11 12
private readonly Cluster _cluster;
private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache
public ISession GetSession(string keyspaceName)
_sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
var result = _sessions[keyspaceName];
Do you see it? It turns out that when an exception is thrown in _cluster.Connect(key) method, the Lazy<T> will cache this exception. Therefore all invocations to Lazy<T>.Value will result in the same, cached exception instead of retrying the connection to the Cassandra cluster. If you are planning to use the Lazy<T> class, there are more “gotchas”. Read the documentation on MSDN.
CPU usage is very, very important and critical metric. Do not ignore it, as it may lead to numerous, strange errors.
RTFM! Read the documentation when using any class for the first time. Especially when copying&pasting code from the StackOverflow.
Bob is a very successful guy. He is auto scaling his service by automatically adding hosts when the CPU increases, and he is removing them when CPU goes down. Dear Bob, there is a trap waiting for you around the corner.
Monitoring services is crucial, if you care about the application uptime. There are hundreds if not thousands parameters which you can (and should) monitor, related to CPU, network, hosts, application and so on. What are they? What are the non-obvious choices?
It seems that most people know the importance of software design patterns, best practices or continuous integration. While those subjects are important, there is one more equally essential term, which yields only one relevant result link on the first Google page. Meet Operational Excellence.
First of all, this post won’t be for people who think developer’s job is to design, write code and test it. It’s far beyond that. One of the important responsibilities is to ship your code to production. How to do that safely?