This post will be about my journey with fixing nasty Cassandra Datastax C# driver problem, which took me a lot more time than expected… Can you guess the problem source?

Once upon a time, I’ve been fixing following exception:

Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: x.x.x.x:9042, x.x.x.x:9042, x.x.x.x:9042)
   at Cassandra.ControlConnection.Connect(Boolean firstTime)
   at Cassandra.Cluster.Connect(String keyspace)
   at Company.Code.CassandraSessionCache.GetSession(String keyspaceName)
   at Company...

The CassandraSessionCache looked like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
public class CassandraSessionCache
{
    private readonly Cluster _cluster;
    private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions; //lockless session cache

    public CassandraSessionCache(Cluster cluster)
    {
        _cluster = cluster;
        _sessions = new ConcurrentDictionary<string, Lazy<ISession>>();
    }

    public ISession GetSession(string keyspaceName)
    {
        if (!_sessions.ContainsKey(keyspaceName))
        {
            _sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() => _cluster.Connect(key)));
        }
        var result = _sessions[keyspaceName];
        return result.Value;
    }
}
Nothing fancy, however let me give you an insight about the architecture and circumstances of the error:
  • Cassandra cluster is in Amazon
  • The client is Cassandra Datastax C# Driver 2.6.0, also on server in Amazon
  • Both the client and the Cassandra cluster is the same Amazon Region
  • Amazon Region had no availability issues during given period
  • The solution was working fine for over 1 month! The client process is being restarted ~every week for various reasons
  • The client follows Cassandra Datastax C# Driver Best Practices
  • Heartbeat is turned on, so the connection should be alive, all the times
  • Things get back to normal after the client restart… and gets back to madness few hours later, at higher load. Incredible high number of NoHostAvailableExceptions, like almost any connection to the Cassandra fails.
  • Of course, it works on my machineĀ®

What didn’t happen?

There are plenty of questions about Cassandra.NoHostAvailableException on StackOverflow. So let’s get through some of them and exclude them:
  • [1][2] - no, because following C# Driver best practices excludes this
  • [3] - no, because we are using default retry strategy from driver version 2.6.0
  • [4][5][6] - no, because we are able to connect to the Cluster at the beginning
  • [7] - no, because we are not misusing batches

Debugging…

Logs on server revealed that the client closed the connection:

INFO  [SharedPool-Worker-3] yyyy-mm-dd 11:04:45,625 Message.java:605 - Unexpected exception during request; channel = [id: 0x9eaf52c5, /x.x.x.x:y :> /x.x.x.x:9042]
java.io.IOException: Error while read(...): Connection reset by peer
  at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
  at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollRdHupReady(EpollSocketChannel.java:689) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
 at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]

While the message on the client says that there are no hosts available, what is confirmed by debug logs on the client side. Pretty interesting, huh? Being confused, I’ve decided to give an update from 2.6.3 to 2.7 a try… but that didn’t help.
Accoring to yet another issue regarding NoHostsAvailableException on StackOverflow I’ve started to log whole exception, with serialized errors property. This is what I’ve logged:

System.Exception: NoHostAvailableException happened. Errors: {
  "x.x.x.x:9042": {
    "NativeErrorCode": 10060,
    "ClassName": "System.Net.Sockets.SocketException",
    "Message": "A connection attempt failed because the connected party did not 
	properly respond after a period of time, or established connection 
	failed because connected host has failed to respond",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "   
  at Cassandra.Connection.<Open>b__9(Task`1 t)
  at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
  at System.Threading.Tasks.Task.Execute()",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
    "HResult": -2147467259,
    "Source": "Cassandra",
    "WatsonBuckets": null
  },
  "x.x.x.x:9042": {
    "NativeErrorCode": 10060,
    "ClassName": "System.Net.Sockets.SocketException",
    "Message": "A connection attempt failed because the connected party did not 
	properly respond after a period of time, or established connection 
	failed because connected host has failed to respond",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "
  at Cassandra.Connection.<Open>b__9(Task`1 t)
  at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
  at System.Threading.Tasks.Task.Execute()",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\n<Open>b__9\nCassandra, Version=2.7.0.0, Culture=neutral, PublicKeyToken=10b231fbfc8c4b4d\nCassandra.Connection\nCassandra.AbstractResponse <Open>b__9(System.Threading.Tasks.Task`1[Cassandra.AbstractResponse])",
    "HResult": -2147467259,
    "Source": "Cassandra",
    "WatsonBuckets": null
  }
}

Unfortunately, no interesting data is here.

So what could possibly go wrong?

Can you spot the error? I couldn’t. Any guesses? Find the answer in next post.