The five stages of coming to terms with Cassandra
The dark side of Cassandra.
The five stages of coming to terms with JavaScript are:
- Denial: “I won’t need this language.”
- Anger: “Why does the web have to be so popular?”
- Bargaining: “OK, at least let me compile a reasonable language to JavaScript.”
- Depression: “Programming is not for me, I’ll pursue a career in masonry, like I always wanted.”
- Acceptance: “I can’t fight it, I may as well prepare for it.”
The same goes for Cassandra - however, IMO, in the opposite order:
- Acceptance: “I will use Cassandra. It’s… AMAZING! Let me just quote the Apache Cassandra landing page:”
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
- Depression: “Damn, it’s so well designed, but it’s a complex piece of software and it doesn’t work as expected.”
- Bargaining: “OK, at least let me try to tune it or report some bugs.”
- Anger: “Why is it so popular? Why does it have such good PR?”
- Denial: “I won’t use it or recommend it ever again.”
The context
I’ve done the research, checked multiple websites - read about performance, architecture, hosting, maintenance, TCO, libraries, popularity… and Cassandra seemed to be a good database for time-series log storage, with 95% writes (with SLA) and only 5% reads (without SLA). I chose a prepared DataStax Cassandra virtual disk image on Amazon with bootstrap scripts, built a proof-of-concept solution and read a book or two about Cassandra. All seemed good. However, this is not a post about the good. So… fast forward…
The bad
Some stories which I remember:
- The Cassandra cluster is in production (along with the parallel, old solution for this purpose). The phone rings at 2AM. The C* cluster is down. Quick look at the logs - OutOfMemoryException in a random place in the JVM. Outage time: 1h - let me just remind you: “proven fault-tolerance”. Cluster restart, it works again.
- Next day at work, random hour, the same thing. Related bug: OutOfMemoryError after few hours from node restart
- After a few days… firing repair - the standard C* operation, which you have to run at least once every gc_grace_seconds (10 days by default). Usually it worked, but then, unexpectedly, the server died, and later again and again. Related issue: “Unknown type 0” Stream failure on Repair. Outage time: stopped counting.
- Because of the failing servers in the cluster I decided to scale it out a little. Unfortunately, the issue above also made the scaling impossible.
- After a while, I encountered a second (third?) problem with the repair. Related bug: Repair session exception Validation failed
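The repair rule from the stories above (run repair at least once every gc_grace_seconds, 10 days by default) boils down to a simple deadline check. Here is a minimal sketch in Python. The function names and the idea of recording the last successful repair time yourself are my assumptions for illustration, not a Cassandra API; only the 864000-second default comes from Cassandra itself.

```python
from datetime import datetime, timedelta

# Cassandra's default gc_grace_seconds: 864000 s = 10 days.
GC_GRACE_SECONDS = 864_000


def repair_deadline(last_repair: datetime,
                    gc_grace_seconds: int = GC_GRACE_SECONDS) -> datetime:
    """Latest moment by which the next repair should have completed.

    `last_repair` is a hypothetical timestamp you track yourself
    (e.g. written by your own maintenance script).
    """
    return last_repair + timedelta(seconds=gc_grace_seconds)


def repair_overdue(last_repair: datetime, now: datetime,
                   gc_grace_seconds: int = GC_GRACE_SECONDS) -> bool:
    """True if more than gc_grace_seconds passed since the last repair."""
    return now > repair_deadline(last_repair, gc_grace_seconds)
```

With a check like this, a repair that keeps crashing (as in my case) shows up as an overdue deadline instead of going unnoticed.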
Fail
Let’s get back to the landing page:
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
Now, let’s look at the dates of the critical JIRA issues:
This means that for around one month at least a few people couldn't scale or repair their Cassandra clusters. I fully understand - it's free, open-source software. However, even if something is free, you expect it to work - that's the harsh reality. If it doesn't work, you just look for something else. No offence, Cassandra DataStax/Apache teams, you are doing truly amazing work; however, in resilient software, stability is the number one requirement.
Maybe it's me? Maybe I'm the only one having problems?
Fortunately (for me) not:
- Here is a presentation on how the guys at Adform switched from Cassandra to Aerospike: Big Data Strategy Minsk 2014 - Tadas Pivorius - Married to Cassandra
- A friend of mine working at a different company also told me that they used Cassandra and then abandoned it.
- Just look at the linked issues and the number of watchers.
In all cases the problems were similar to mine.