Little-known, useful, charming and beautiful algorithms - part 1
Warning: this post won’t be about “boring” or “typical” algorithms from Computer Science which we all have learned on studies (like quick sort, merge sort, xxx sort, A*, FFT). Instead, this will be about other little-known, especially USEFUL algorithms, which people working as professional developers should know or heard of.
As we at Twitter move away from Mysql towards Cassandra, we’ve needed a new way to generate id numbers. There is no sequential id generation facility in Cassandra, nor should there be.they designed a system for generating ID’s with following requirements (source):
- minimum 10k ids per second per process
- response rate 2ms (plus network latency)
For high availability within and across data centers, machines generating ids should not have to coordinate with each other.
(Roughly) Time Ordered
We have a number of API resources that assume an ordering (they let you look things up “since this id”).
However, as a result of a large number of asynchronous operations, we already don’t guarantee in-order delivery.
We can guarantee, however, that the id numbers will be k-sorted (references: http://portal.acm.org/citation.cfm?id=70413.70419 and http://portal.acm.org/citation.cfm?id=110778.110783) within a reasonable bound (we’re promising 1s, but shooting for 10’s of ms).
The ids should be sortable without loading the full objects that the represent. This sorting should be the above ordering.
There are many otherwise reasonable solutions to this problem that require 128bit numbers. For various reasons, we need to keep our ids under 64bits.
The id generation scheme should be at least as available as our related services (like our storage services).
Concealed algorithm, which .NET developers use every day
Nagle’s document, Congestion Control in IP/TCP Internetworks (RFC 896) describes what he called the “small packet problem”, where an application repeatedly emits data in small chunks, frequently only 1 byte in size. Since TCP packets have a 40 byte header (20 bytes for TCP, 20 bytes for IPv4), this results in a 41 byte packet for 1 byte of useful information, a huge overhead. (…)
Nagle’s algorithm works by combining a number of small outgoing messages, and sending them all at once. Specifically, as long as there is a sent packet for which the sender has received no acknowledgment, the sender should keep buffering its output until it has a full packet’s worth of output, so that output can be sent all at once.
Little algorithmsThere are two “short” algorithms which I personally like. They aren’t a big deal and by googling the problem, you probably will use one.
- Box–Muller transform which you probably want to use, when you want to turn standard C# Random class to Gaussian distributed random generator. C# code here: Random Gaussian Variables
- Fisher–Yates shuffle which is a solution for randomly shuffling an array. C# code here: Best way to randomize an array with .NET