Streaming Data Stores

So this week we’ve been looking at streaming data stores - technologies for the buffering and long-term storage of continuous data streams for consumption by multiple downstream consumers.

And as part of that look, I’ve updated our Apache Kafka pages, and we’ve taken a look at some new technologies - Pravega, the new kid on the block, and Confluent Open Source and Confluent Enterprise, Confluent’s offerings built around Kafka.

So let’s spout some thoughts on streaming data stores and the technologies we’ve looked at this week.

I’m going to make a very bold statement - I have a sneaking suspicion that, in a few years’ time, streaming data stores will be seen as the biggest shakeup in thirty years of the technology space for getting data into analytical systems. Part of this is driven by the rise of streaming analytics, where these technologies are pretty much a de facto standard for connecting streaming data flows together, but I think they’re going to become the standard for connecting any data processing chains together (most of which I think will become more real-time continuous flows anyway). What really interests me about these technologies is the way they address some of the issues in building batch pipelines - without significant engineering effort, batch jobs are often extremely tightly coupled together, causing significant issues if jobs fail or you want to change your pipeline by adding in new flows or re-generating data stores. Persistent data buffers or queues help solve a lot of these problems by decoupling jobs, and although this concept exists in a number of commercial data integration tools, it’s never been a mainstream concept until now.
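To make the decoupling idea a bit more concrete, here’s a minimal sketch of a persistent buffer in Python. This is a toy in-memory model and not how Kafka or any of the other products are actually implemented - the StreamLog class, its method names and the consumer names are all made up for illustration. The point is simply that records are retained after being read, and each consumer tracks its own position, so a new flow can be added or a data store re-generated without touching the producer.

```python
from collections import defaultdict


class StreamLog:
    """Toy append-only log with independent per-consumer offsets (hypothetical)."""

    def __init__(self):
        self._records = []                # records are retained, not removed on read
        self._offsets = defaultdict(int)  # next offset to read, keyed by consumer name

    def append(self, record):
        """Producer side: add a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Consumer side: return unread records and advance this consumer's offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer's position, e.g. back to 0 to re-generate a data store."""
        if not 0 <= offset <= len(self._records):
            raise ValueError("offset out of range")
        self._offsets[consumer] = offset


log = StreamLog()
for i in range(5):
    log.append({"event_id": i})

# An existing job reads everything that has been written so far.
warehouse_batch = log.poll("warehouse_loader")

# A new flow added later starts from the beginning, at its own pace.
search_batch = log.poll("search_indexer", max_records=2)

# Re-generating a data store is just a rewind for that one consumer.
log.seek("warehouse_loader", 0)
replayed = log.poll("warehouse_loader")
```

Neither consumer knows about the other, and the producer knows about neither, which is exactly the coupling that’s hard to avoid when batch jobs hand data directly to each other.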

Apache Kafka is obviously the technology that’s driven most of this change, and is now seeing significant traction, with commercial support available from Confluent and inclusion in most Hadoop distributions. However, the market for these technologies is extremely immature, and Kafka itself has a number of potential limitations depending on your use case - the primary one being that it’s probably not as elastic as it could be.

Which is why it’s nice to see new technologies appearing in this space, showing there’s a healthy demand for these technologies, and providing competition and diversity. MapR Streams is one, providing a Kafka-compatible API over the MapR file system, and there are a range of cloud-based services listed on the streaming data stores technology page. However, the one we looked at this week was Pravega - an open source product coming out of Dell EMC. They’re aiming to address what they see as the limitations of Kafka (summarised here), and I think it’s going to be an interesting technology to watch, but it’s still extremely early days for them - they don’t have a production release yet, there is a significant capability gap to Kafka, and they’re going to need some significant commercial backing and vendor support if they’re going to be successful. What’s clear is that they’re off to a great start, and have already built a significant development community.

Confluent are the biggest backers of Kafka, and we looked at some of their offerings this week as well. They have what they call their Confluent Platform, which comes in two flavours - Confluent Open Source and Confluent Enterprise. If you’re planning on using the open source version of Kafka, Confluent Open Source may well be worth a look, even if it’s just to adopt some of their open source components. Confluent Enterprise then gives you full support and extra management features, and what’s interesting here is that although the Hadoop vendors have started bundling Kafka, they don’t yet have an equivalent range of capabilities to Confluent.