As this was the biggest NoSQL event in the world and the biggest gathering of Cassandra professionals, we were full of expectations, and this year’s summit fulfilled them all. Before the first day of the summit we attended the world’s first Cassandra Developer Certification and both of us passed, which makes SmartCat one of the first companies in the world with certified Cassandra developers. After passing that test we could relax and enjoy two days of lectures, fresh ideas and great networking.
But let's start from the beginning. Here is my selection of the most interesting lectures:
Keynote - like the event itself, it was huge and extraordinary. It started with Patrick McFadin and Rachel Pedreschi explaining the key features of Cassandra: resilience, scalability and no single point of failure, which make this database a natural choice for more and more companies. There were two data centres on stage, built from Raspberry Pis, and they set fire to one of them to prove that the application would continue to work. Then chaos monkeys came on stage and started trashing random nodes. This part was both entertaining and educational. Billy Bosworth gave a keynote speech and invited the Microsoft guys on stage, which came as a surprise to many of us. DataStax and Microsoft worked on the Azure cloud platform to make it easy for developers to spawn nodes and clusters in the cloud. In a live demo, a 90-node cluster was created in 5 minutes, with nice online monitoring. Even the command line and log information were shown in the web browser on the Azure platform. At the end, Jonathan Ellis took the stage to explain some of the features in version 2.2 and some upcoming features in version 3.0. In version 2.2, the JSON API is the most exciting feature: instead of normal CQL insert and update statements, plain JSON with the proper structure can be inserted into Cassandra, which should save serialisation and deserialisation overhead and will most probably speed up the development of REST APIs on top of Cassandra. As for version 3.0, everybody was excited about materialised views, which should solve most of the problems with duplicated data and with updating it in multiple places.
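To make the JSON feature a bit more concrete, here is a minimal sketch in Python of how a REST layer might turn an incoming JSON document into a CQL `INSERT ... JSON` statement (the syntax introduced in Cassandra 2.2). The `users` table, the row contents and the `build_json_insert` helper are hypothetical names chosen for illustration, not anything shown in the keynote.

```python
import json


def build_json_insert(table: str, row: dict) -> str:
    """Build a CQL `INSERT ... JSON` statement (Cassandra 2.2+) for one row."""
    # The JSON keys are expected to match the column names of the table.
    payload = json.dumps(row, sort_keys=True)
    # Inside a CQL string literal, a single quote is escaped by doubling it.
    escaped = payload.replace("'", "''")
    return f"INSERT INTO {table} JSON '{escaped}';"


# Hypothetical table and row, for illustration only.
statement = build_json_insert("users", {"id": 42, "name": "O'Brien"})
print(statement)
```

In a real application the JSON payload would more likely be passed as a bound parameter of a prepared statement rather than spliced into the query string; the string version above is only meant to show the shape of the statement.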
7 Deadly Sins for Cassandra Ops (Rachel Pedreschi) - a really great talk for the guys who have to install and maintain a cluster. She promised to hold the same kind of talk for Cassandra developers and we are looking forward to that. She covered memory management and also explained why using a SAN is not a good idea. We really liked the explanation of why it is important to stress test a Cassandra cluster with the stress tool, since it is quite easy to forget to do so. Cassandra ops guys should not just hope that everything will work, but back that up with real-world testing using a tool designed solely for that purpose. She also explained why and how to configure the operating system to get the most out of Cassandra, and provided scripts and configuration examples.
Old dogs, new tricks. Teaching your relational DBA to fetch (Patrick McFadin) - a really nice comparison of relational modelling techniques and modelling techniques specific to Cassandra. Modelling in the relational world is much easier than modelling for Cassandra, but this is because relational databases do all kinds of joins on read: you do not need a clear vision and understanding of application usage and workflows up front, and you can add more joins and indexes along the way. With Cassandra, you need to know all the facts (or at least most of them) to build a model optimised for reads, which makes it much harder. He explained some key differences: why there are no sequences in distributed systems, why to use UUIDs, and when to use reverse ordering so data can be accessed more efficiently. An important takeaway is that the data model in Cassandra will evolve along the way; new requirements will appear and not all tables will be designed well from the start. We should remodel, create new tables and transfer data to them. This is nice proof that we are on the right track with our Cassandra Migration Tool, which does exactly that: it allows developers to change schema and data from within the application.
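The point about UUIDs and reverse ordering can be illustrated with a small sketch in Python. It is not taken from the talk: it only assumes that time-based (version 1) UUIDs are generated locally by each writer, with no shared sequence, and that reading newest-first is the common access pattern (in CQL this would typically be expressed with a descending `CLUSTERING ORDER BY`). The names `event_ids` and `uuid_timestamp` are made up for the example.

```python
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns ticks.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid_timestamp(u: uuid.UUID) -> datetime:
    """Recover the creation time embedded in a version 1 (time-based) UUID."""
    unix_seconds = (u.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


# Each writer generates ids on its own; no central counter is consulted,
# which is why this works where a database sequence would not.
event_ids = [uuid.uuid1() for _ in range(5)]

# Reverse (newest-first) ordering by the embedded timestamp.
newest_first = sorted(event_ids, key=lambda u: u.time, reverse=True)

for event_id in newest_first:
    print(event_id, uuid_timestamp(event_id).isoformat())
```

Because the timestamp is part of the identifier, the same value serves both as a unique key and as the sort order, so "latest N events" becomes a cheap read from the start of the partition.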
Titan 1.0: Scalable real time and analytic graph query (Matthias Broecheler) - a nice addition to the Cassandra ecosystem, this is a graph database built on top of Cassandra. It is a tough problem to map a graph to a sequential system such as a hard disk, and the second tough problem is how to deal with memory management. They decided to build a property graph data model and annotate vertices and edges (Apache TinkerPop). If you really want to leverage a graph, you need a good engine, a good underlying structure and some kind of query language. This is where Gremlin, a graph query language, comes into play. Use cases where a graph database is a perfect fit are social networks and recommendation engines. There are also less obvious use cases, such as security and fraud detection, where you need to connect a lot of dots to see anomalies and potential fraud. A graph solves problems for highly connected data. But there is one important thing to take from this talk, as Matthias also emphasised: these are the early days of graphs. Anyway, he got me interested, and TitanDB is now on my list of things to explore in the near future.
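For readers who have not met the property graph model before, here is a toy in-memory sketch in Python. It is not Titan's or TinkerPop's API; `PropertyGraph`, `recommend_friends` and the sample people are hypothetical. It only shows the idea of vertices and labelled edges that both carry properties, and a two-hop traversal of the kind a recommendation engine relies on (roughly what a Gremlin traversal over two `knows` hops would express).

```python
class PropertyGraph:
    """Toy property graph: vertices and labelled edges both carry properties."""

    def __init__(self):
        self.vertices = {}   # vertex id -> properties dict
        self.out_edges = {}  # vertex id -> list of (label, target id, properties)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, label, dst, **props):
        self.out_edges.setdefault(src, []).append((label, dst, props))

    def out(self, vid, label):
        """Ids of vertices reachable from `vid` over outgoing edges with `label`."""
        return [dst for (lbl, dst, _) in self.out_edges.get(vid, []) if lbl == label]


def recommend_friends(graph, person):
    """Friends of friends who are not already friends (a two-hop traversal)."""
    direct = set(graph.out(person, "knows"))
    candidates = set()
    for friend in direct:
        candidates.update(graph.out(friend, "knows"))
    return sorted(candidates - direct - {person})


# Hypothetical sample data.
g = PropertyGraph()
for name in ["alice", "bob", "carol", "dave"]:
    g.add_vertex(name, kind="person")
g.add_edge("alice", "knows", "bob", since=2013)
g.add_edge("bob", "knows", "alice", since=2013)
g.add_edge("bob", "knows", "carol", since=2014)
g.add_edge("carol", "knows", "dave", since=2015)

print(recommend_friends(g, "alice"))
```

The hard parts mentioned in the talk, laying such a structure out on disk and managing memory for large traversals, are exactly what this toy version ignores.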
Kafka, Spark and Cassandra - the new streaming data troika (Typesafe & William Hill) - Back in the day, Hadoop was the main player, and it is still a powerful tool, mature and with a great ecosystem. But there are more and more use cases where everything should be viewed as real-time streams, because time to respond has become so important. Even batch views should be processed as streams, and Kafka, Spark and Cassandra enable that. Kafka is a durable message queue that is resilient, scalable and distributed. It can replay messages if something goes wrong. Spark is used for rich analytics and aggregation, and it is the most popular tool right now. Cassandra is great for streaming data and has a good connector for Spark, which makes it a great fit as the storage engine. The guys from William Hill presented Omnia, a DPM based framework built on the lambda architecture with a speed layer and a batch layer. It consists of Chronos, the first component, which collects data in time (it uses Akka, which sends messages to Kafka); Fates, a batch layer built from timelines and views; and Neo Cortex, a speed layer built with reactivity in mind. The last component is Hermes, a distributed cache with push communication. The idea we liked the most is how they treat user interaction: it goes through Hermes back to Chronos and through the whole pipeline, so interactions can be monitored and personalisation can be applied.
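The replay property is what makes a durable log such a good front door for a streaming pipeline. The sketch below is a deliberately simplified in-memory model in Python, not Kafka's client API: `ReplayableLog`, the `bet-N` messages and the simulated crash are all assumptions made for illustration. It shows a consumer that commits its offset after each processed message and resumes from that offset after a failure.

```python
class ReplayableLog:
    """Append-only log addressed by offset, kept in memory for illustration."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return the offset it was written at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Yield (offset, message) pairs starting at the given offset."""
        for position in range(offset, len(self._messages)):
            yield position, self._messages[position]


log = ReplayableLog()
for bet in ["bet-1", "bet-2", "bet-3", "bet-4"]:
    log.append(bet)

committed_offset = 0
processed = []

# First run: two messages are processed and committed, then the consumer
# "crashes" before handling the message at offset 2.
for offset, message in log.read_from(committed_offset):
    if offset == 2:
        break  # simulated failure
    processed.append(message)
    committed_offset = offset + 1

# After a restart, reading resumes from the last committed offset,
# so the remaining messages are replayed and nothing is lost.
for offset, message in log.read_from(committed_offset):
    processed.append(message)
    committed_offset = offset + 1

print(processed, committed_offset)
```

Messages stay in the log after they are read, which is the difference from a classic queue: the consumer, not the broker, decides where to continue from.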
Escaping Disco Era Data Modelling (Aaron Ploetz) - a nice talk which, once again, emphasised a couple of mistakes, mostly carried over from the relational world, that we make when doing data modelling for Cassandra. He is very active on Stack Overflow, so he gave a couple of examples and possible solutions, and I liked the problem-solution approach of this talk. There are a couple of things to take from it, and here are some of them. Do not use artificial partition keys to place all data in a single partition, since this will create hot spots, and grouping the data in one place makes data loss more likely. Avoid modelling for SELECT *; always try to group data somehow. Use a relational database if you are solving a relational problem; Cassandra is not a silver bullet. There is also an accompanying article related to this talk, which is a great read.
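The hot spot problem is easy to see with a few lines of Python. This is a simplified model, not how Cassandra actually places data: it hashes the partition key with MD5 and takes it modulo the node count, whereas a real cluster uses a partitioner and token ranges. The node count, row count and key names are made up for the example.

```python
import hashlib
from collections import Counter


def node_for(partition_key: str, node_count: int) -> int:
    """Map a partition key to a node by hashing it (simplified placement)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


NODES = 4
ROWS = 1000

# Artificial key: every row shares the same partition key, so every row
# lands on the same node.
single = Counter(node_for("all", NODES) for _ in range(ROWS))

# Natural key: each row has its own partition key, so rows spread out.
natural = Counter(node_for(f"user-{i}", NODES) for i in range(ROWS))

print("single artificial key:", dict(single))
print("natural keys:", dict(natural))
```

With the constant key, one node holds all 1000 rows and serves every read and write for them, while the other nodes sit idle. That is the hot spot the talk warns about.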
This is only a selection of several talks. There were more, but this post would be too long. The food was great, the after-events were very good, and the people attending the summit were happy to talk about any topic, so we really had a great time. We are looking forward to the Cassandra 3.0 release and next year's summit.