The Week That Was - 05/05/2017

It’s nearly the weekend, which means it’s time to summarise the week.

We started late this week (let’s hear it for public holidays), finishing off our look at MapR, before taking a quick look at Cloudera’s Data Science Workbench. No technology summary today, but we’ll take a look at Apache Metron as part of this post.

What I’ve really liked about MapR is their strategy of using a common data platform to underpin a bunch of different data storage capabilities. I talked a little bit about their data platform last time, but this week, as part of looking at MapR-DB and MapR-Streams, I’ve been thinking about how this compares and contrasts with Hadoop. Firstly, they’re both aiming to provide a common data platform - a single cluster that gives you flexibility and value for money by letting you exploit the same infrastructure for multiple use cases. MapR appears to have fully embraced this, ensuring they can scale, partition and manage the platform in ways that Hadoop can’t yet, and providing capabilities that Hadoop (and more specifically HDFS) doesn’t - full random read and write access for starters - that actually make it work as a general purpose data platform. I’m also taken by MapR’s ability to provide access to the common data platform at different layers - rather than just building capabilities on top of their file system API, they’ve integrated (for example) MapR-DB at a much lower level, providing a range of benefits over HBase running on HDFS. It’s clear that Hadoop still has a long way to go to fulfil its potential, and without addressing some of its limitations we’re going to continue to see new technologies opting to implement their own storage systems from scratch (Kudu being a great example), leading to Hadoop clusters running multiple independent storage stacks on the same data nodes, which feels like it’s defeating the point.
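To make the random read and write point concrete: because MapR-FS presents a POSIX-style file system (it can be mounted over NFS), ordinary in-place update code just works against it, whereas HDFS’s API is append-only and forces a rewrite of the file. A minimal sketch of the kind of in-place update HDFS can’t do, run here against a local temp file standing in for a mounted cluster path:

```python
# Illustration of a random (in-place) write - the operation HDFS's
# append-only API doesn't offer, but which a POSIX-style file system
# (such as MapR-FS mounted over NFS) supports. A local temp file is
# used here as a stand-in for a mounted cluster path.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "records.dat")
with open(path, "wb") as f:
    f.write(b"AAAABBBBCCCC")

# Rewrite the middle four bytes in place - no rewrite-the-whole-file dance.
with open(path, "r+b") as f:
    f.seek(4)
    f.write(b"XXXX")

with open(path, "rb") as f:
    print(f.read())  # b'AAAAXXXXCCCC'
```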

I’ve also started wondering why there aren’t more common storage sub-systems that multiple technologies leverage - not necessarily so that they could all co-exist on the same cluster (although this would be a nice side benefit), but just because storage systems are hard and complex, and it feels like there should be huge wins in having a strong and robust solution that can be leveraged for multiple capabilities. There are very few data platforms that don’t have some limitation or constraint, and it feels like a world class storage system with a range of APIs implemented on top could be instantly competitive against a wide range of technologies. MapR are starting to demonstrate this - there’s certainly some evidence that MapR-Streams leverages their data platform and a Kafka compatible API to provide a solution that addresses a number of Kafka’s limitations.
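The Kafka compatible API is worth dwelling on - from the client’s point of view, the main visible difference with MapR-Streams is that a topic is addressed via a path into the data platform (a stream path plus a topic name) rather than a flat topic name. A rough sketch of that naming difference (the stream path and topic names here are made up for illustration):

```python
# Sketch of how a Kafka-style client addresses a MapR-Streams topic.
# The stream path and topic names below are hypothetical examples.

def mapr_topic(stream_path: str, topic: str) -> str:
    """Build the 'stream-path:topic' form MapR-Streams clients use."""
    return f"{stream_path}:{topic}"

# A plain Kafka deployment would use the bare topic name...
kafka_style = "web-clicks"
# ...while the same topic on MapR-Streams lives at a path in the platform.
mapr_style = mapr_topic("/apps/pipeline/events", "web-clicks")

print(mapr_style)  # /apps/pipeline/events:web-clicks

# The producer/consumer calls themselves then look like standard Kafka
# client code, just with the path-qualified topic name, e.g.:
#   producer.send(mapr_style, b"payload")
```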

Moving on, Cloudera’s Data Science Workbench is now generally available. Their use of Docker seems inspired - the flexibility this gives you to use different versions of different libraries in different notebooks, and to have this execution environment follow the notebook around, feels like a huge win. It’s still early days for the product however - the number of interpreters seems light (not being able to run SQL or Solr searches directly in the notebook feels like a gap), and it remains to be seen how it will fare as a commercial product against the open source Apache Zeppelin and Jupyter.

I said on Wednesday I’d do a technology summary of Apache Metron, however I’ve decided that, as a packaged analytical application, it’s probably slightly outside the conceptual remit of this site. It’s definitely worth digging into if you have time however, as it’s an interesting use case for what can be done with the Hadoop ecosystem, and an interesting capability in its own right. If you’re looking for reading material, there’s the Apache Metron homepage, Hortonworks’ overview of Metron, and their user documentation, but as good a place to start as any is the architecture overview in the Apache Metron Wiki. In summary, the architecture’s probably fairly standard - Apache Kafka as an input point, fed by custom probes and Apache NiFi; data then processed using Apache Storm, supporting a level of data transformation and enrichment, real time alerting and built in scoring using machine learning models; persistence into HDFS and HBase; and a range of dashboards and visualisation capabilities over the top. Apache Spark is in there somewhere as well.
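The heart of that architecture is the processing Metron does in Storm: parsing raw telemetry into a common schema, enriching it with reference data, and then scoring (triaging) it. A conceptual sketch of those three stages - this is illustrative Python, not Metron’s actual implementation, and the record format, field names and scoring rule are invented for the example:

```python
# Conceptual sketch of the parse -> enrich -> triage flow Metron runs in
# Storm. Illustrative only - the record format, lookup tables and scoring
# rule below are invented for the example.

GEO_LOOKUP = {"203.0.113.7": "AU"}   # stand-in geo enrichment source
THREAT_FEED = {"203.0.113.7"}        # stand-in threat intelligence feed

def parse(raw: str) -> dict:
    """Normalise a raw probe record into a common field schema."""
    src_ip, dest_ip, action = raw.split(",")
    return {"src_ip": src_ip, "dest_ip": dest_ip, "action": action}

def enrich(event: dict) -> dict:
    """Attach context (here a geo lookup) from reference data."""
    event["src_country"] = GEO_LOOKUP.get(event["src_ip"], "unknown")
    return event

def triage(event: dict) -> dict:
    """Score the event - a real deployment might use a trained model."""
    event["threat_score"] = 1.0 if event["src_ip"] in THREAT_FEED else 0.0
    return event

event = triage(enrich(parse("203.0.113.7,10.0.0.5,DENY")))
print(event["src_country"], event["threat_score"])  # AU 1.0
```

In Metron itself each stage is a Storm topology reading from and writing back to Kafka, with the scored events then landing in HDFS and HBase for the dashboards to query.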

That’s all (for this week) folks - see you on Monday.