The dark side of Cassandra.


The five stages of coming to terms with JavaScript are:
  1. Denial: “I won’t need this language.”
  2. Anger: “Why does the web have to be so popular?”
  3. Bargaining: “OK, at least let me compile a reasonable language to JavaScript.”
  4. Depression: “Programming is not for me, I’ll pursue a career in masonry, like I always wanted.”
  5. Acceptance: “I can’t fight it, I may as well prepare for it.”

The same goes for Cassandra - however, IMO, in the opposite order:
  1. Acceptance: “I will use Cassandra. It’s… AMAZING! Let me just quote the Apache Cassandra landing page:”
       The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
  2. Depression: “Damn, it’s so well designed, but it’s a complex piece of software and it doesn’t work as expected.”
  3. Bargaining: “OK, at least let me try to tune it or report some bugs.”
  4. Anger: “Why is it so popular? Why does it have such good PR?”
  5. Denial: “I won’t use it or recommend it ever again.”

The context

I’ve done the research and checked multiple websites - I read about performance, architecture, hosting, maintenance, TCO, libraries, popularity… and Cassandra seemed to be a good database for time-series log storage, with 95% writes (with an SLA) and only 5% reads (without an SLA). I chose the prepared DataStax Cassandra virtual disk image on Amazon with bootstrap scripts, built a proof-of-concept solution and read a book or two about Cassandra. All seemed good. However, this is not a post about the good. So …fast forward…
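For the record, here is roughly the kind of data model that workload calls for - a minimal sketch using the DataStax Python driver, assuming a write-heavy time-series log table. The keyspace, table and column names are hypothetical, not the actual production schema:

    # Sketch only: a write-heavy time-series log table, using the DataStax
    # Python driver (pip install cassandra-driver). Keyspace, table and
    # column names are hypothetical.
    from datetime import datetime, timezone
    from uuid import uuid4

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])   # contact point of the C* cluster
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS logs
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)

    # Partition by source and day bucket so writes spread across the ring;
    # cluster by timestamp so the rare reads can scan a time range.
    session.execute("""
        CREATE TABLE IF NOT EXISTS logs.events (
            source  text,
            day     text,
            ts      timestamp,
            id      uuid,
            payload text,
            PRIMARY KEY ((source, day), ts, id)
        ) WITH CLUSTERING ORDER BY (ts DESC, id ASC)
    """)

    # The write path - roughly 95% of the traffic in this workload.
    insert = session.prepare(
        "INSERT INTO logs.events (source, day, ts, id, payload) VALUES (?, ?, ?, ?, ?)")
    now = datetime.now(timezone.utc)
    session.execute(insert, ('web-01', now.strftime('%Y-%m-%d'), now, uuid4(), '{"level": "INFO"}'))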

The bad

Some stories which I remember:
  • The Cassandra cluster is in production (along with the parallel, old solution for the same purpose). The phone rings at 2 AM. The C* cluster is down. A quick look at the logs - an OutOfMemoryError in a random place in the JVM. Outage time: 1h - let me just remind you about the “proven fault-tolerance”. After a cluster restart, it works again.
  • The next day at work, at a random hour, the same thing. Related bug: OutOfMemoryError after few hours from node restart
  • After a few days… firing a repair - the standard C* operation, which you have to run at least once every gc_grace_seconds, 10 days by default (see the sketch after this list). Usually it worked, but then the server unexpectedly died, and later again and again. Related issue: “Unknown type 0” Stream failure on Repair. Outage time: I stopped counting.
  • Because of the failing servers in the cluster I decided to scale it out a little. Unfortunately, the issue above also made the scaling impossible.
  • After a while, I encountered a second (third?) problem with the repair. Related bug: Repair session exception Validation failed
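For those unfamiliar with it, the repair mentioned above is the routine anti-entropy operation that has to complete on every node at least once within gc_grace_seconds (864000 seconds, i.e. 10 days, by default), otherwise tombstones can expire before being propagated and deleted data may resurrect. Here is a minimal sketch of how such a routine can be wired up around nodetool - the host names and the 7-day cadence are illustrative assumptions:

    # Sketch only: keeping up with the routine repair described above.
    # `nodetool repair -pr` has to complete on every node at least once
    # within gc_grace_seconds (default 864000 s = 10 days).
    # Host names and the 7-day cadence are assumptions.
    import subprocess
    import time

    NODES = ['cass-node-1', 'cass-node-2', 'cass-node-3']   # hypothetical hosts
    REPAIR_INTERVAL = 7 * 24 * 3600                          # stay well under 10 days

    def repair_all_nodes():
        for host in NODES:
            # -pr repairs only the node's primary token ranges, so running it
            # once per node covers the whole ring without redundant work.
            subprocess.run(['nodetool', '-h', host, 'repair', '-pr'], check=True)

    while True:
        repair_all_nodes()
        time.sleep(REPAIR_INTERVAL)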

Fail

Let’s get back to the landing page:
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
Now, let’s look at the critical JIRA issue dates:

This means that for around one month at least a few people could not scale or repair their Cassandra clusters. I fully understand - it's free, open-source software. However, even if something is free, you expect it to work - that's the harsh reality. If it doesn't work, you just look for something else. No offence to the DataStax/Apache Cassandra teams - you are doing truly amazing work - however, in resilient software, stability is the TOP 1 requirement.

Maybe it's me? Maybe I'm the only one having problems?

Fortunately (for me) not:
  1. Here is a presentation on how the guys at Adform switched from Cassandra to Aerospike: Big Data Strategy Minsk 2014 - Tadas Pivorius - Married to Cassandra
  2. A friend working at a different company also told me that they had used Cassandra and abandoned it.
  3. Just look at the linked issues and the number of watchers.
In all cases the problems were similar to mine.