So last week Hortonworks announced their Open Hybrid Architecture Initiative, and I said we’d take a look, so let’s take a look…
So to start with, this is another “big” initiative from Hortonworks à la their Stinger initiative (which aimed to turn Hive into an interactive analytical database platform), looking to drive a bunch of significant architectural changes into the core Hadoop platform.
But before we start, let’s do a quick potted history of Hadoop. Version 1 was about allowing companies to run batch analytics (using MapReduce) over vast amounts of data using very large numbers of inexpensive machines. It started off as a pretty niche technology - you only needed it if you had the data volumes (and the skills) that meant there was no alternative technology you could cost effectively use. However, version 2 brought YARN, allowing multiple different analytical tools (hello Spark, Hive on Tez, Impala etc.) to be used and nicely co-exist, and suddenly the Hadoop proposition changed from being about massive scale to being about flexibility - it was now a general purpose platform that gave you storage and compute, over which you could run a wide range of analytical workloads.
And as great as that was on-premises (I can use new analytical tools without deploying new infrastructure), in the cloud it doesn’t work (why would I provision a persistent Hadoop cluster when I’m not using it half the time) and isn’t needed (because storage and compute are already abstracted away). In the cloud Object Stores are the de-facto standard for storage, and Kubernetes (and maybe a bit of Mesos) the emerging standard for compute. And if you need to abstract away storage and compute on-premises, it’s going to make sense to standardise on those technologies going forward. This moves us away from the original Hadoop premise of having compute and storage co-located, but with the explosion of high-bandwidth network connectivity that doesn’t feel like a huge issue any more.
So what does this mean for Hadoop? Well firstly it will have to evolve if it wants to remain relevant as a shared infrastructure platform, and it feels like this will involve support for an object store interface and the integration of Kubernetes, so workloads you run in the cloud using these technologies can also be run on your local Hadoop cluster. But Hadoop (and more specifically the Hadoop ecosystems sold by the big vendors) is more than just the underlying infrastructure - it’s a set of analytical tools bound together by some shared standards - the Hadoop Compatible Filesystem specification, YARN, the Hive Metastore, and common security, audit and metadata integrations (although these aren’t standard across the Hadoop vendors). And there will still be lasting value in this, however these tools and workloads will need to evolve to exist in the new world.
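That filesystem abstraction is the piece that makes this evolution plausible: Hadoop’s FileSystem API resolves the scheme of a URI (hdfs://, s3a://, and so on) to a concrete implementation, so the same client code can run against HDFS or an object store. Here’s a toy Python sketch of that dispatch pattern - the class names and “read” method are illustrative stand-ins, not Hadoop’s actual API:

```python
from urllib.parse import urlparse

# Toy stand-ins for concrete filesystem implementations.
# In Hadoop these would be classes like DistributedFileSystem
# (for hdfs://) or S3AFileSystem (for s3a://).
class HdfsClient:
    def read(self, path):
        return f"hdfs read of {path}"

class ObjectStoreClient:
    def read(self, path):
        return f"object store read of {path}"

# Scheme -> implementation registry, loosely analogous to
# Hadoop's fs.<scheme>.impl configuration keys.
REGISTRY = {
    "hdfs": HdfsClient,
    "s3a": ObjectStoreClient,
}

def get_filesystem(uri):
    """Resolve a URI to a filesystem client via its scheme."""
    scheme = urlparse(uri).scheme
    return REGISTRY[scheme]()

# The calling code is identical regardless of where the data lives -
# which is why tools written against the abstraction can follow
# their data from HDFS to an object store.
for uri in ["hdfs://namenode/data/events", "s3a://bucket/data/events"]:
    fs = get_filesystem(uri)
    print(fs.read(urlparse(uri).path))
```

The point of the sketch is that swapping storage backends is a configuration change, not a code change - which is exactly the property the Hadoop Compatible Filesystem specification exists to guarantee.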
And this is what Hortonworks are planning to do via their three point plan:
- Containerisation of workloads - this means you’ll be able to spin up individual services (HBase, Hive etc.) or transient workloads (e.g. Spark) on demand, when you need them, for as long as you need them, and with exactly the dependencies you need. For Hortonworks this will be managed through their DataPlane Service (DPS), and one presumes you’ll (eventually) be able to do this on-site on your persistent Hadoop cluster or in the cloud on dynamically provisioned clusters.
- Add an object store interface to HDFS - this is the new Hadoop Ozone project, which is introducing a new, more scalable block storage layer (the Hadoop Distributed Data Store) that will underpin both Ozone and HDFS. Have a read of our new technology summaries on the other side of these links for more details. However, it’s worth noting here that the move from HDFS to object stores is not trivial - there are still plenty of issues to address before object stores support the range of features that HDFS does, such as consistency and fine-grained access control. And it’s also worth noting that Ozone will be an object store optimised for fast IO scans (as it uses the HDFS storage layer), which makes it unusual in the object store world.
- Add support for container management to Hadoop, with the suggestion that this means bringing Kubernetes into your on-premises Hadoop cluster, turning it into a platform that gives you object storage and Kubernetes - in effect, your on-site cloud platform.
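One concrete example of the HDFS-to-object-store gap mentioned above: a filesystem like HDFS renames a file with a single metadata update, while a typical object store has a flat key space and no rename primitive at all, so a “rename” becomes a copy of the object under the new key followed by a delete of the old one - slower, and not atomic, which matters to any job-commit protocol built on rename. A toy Python illustration of the two semantics (in-memory stand-ins, not real client APIs):

```python
# A filesystem rename is a metadata operation: the data doesn't move.
class ToyFileSystem:
    def __init__(self):
        self.entries = {}  # path -> data

    def rename(self, src, dst):
        # A single atomic metadata update, O(1) regardless of file size.
        self.entries[dst] = self.entries.pop(src)

# An object store has a flat key space and no rename primitive:
# "rename" is emulated as copy-to-new-key then delete-old-key.
class ToyObjectStore:
    def __init__(self):
        self.objects = {}  # key -> data

    def rename(self, src, dst):
        self.objects[dst] = self.objects[src]  # copy: cost scales with object size
        # A failure between the copy and the delete leaves BOTH keys
        # present - the operation is not atomic, unlike the filesystem case.
        del self.objects[src]

fs = ToyFileSystem()
fs.entries["/tmp/part-0"] = b"rows"
fs.rename("/tmp/part-0", "/data/part-0")

store = ToyObjectStore()
store.objects["tmp/part-0"] = b"rows"
store.rename("tmp/part-0", "data/part-0")
```

The end state looks the same in both cases; the difference is the cost and the failure modes along the way - and closing gaps like this is part of what makes the HDFS-to-object-store transition non-trivial.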
So that’s Hortonworks, but what are the other big Hadoop vendors doing? Cloudera have already started their move to the cloud with Cloudera Altus, which allows you to programmatically spin up Hadoop workloads on transient clusters. And although they’re not outwardly supporting this Hortonworks initiative, you can bet they’re interested in the containerisation of Hadoop workloads. ZDNet have a good interview with Mike Olson here that’s worth a read.
And MapR have kind of been here for a while. MapR-FS is pretty much a cloud scale storage solution already, and they’re heavy sponsors of Apache Myriad for running YARN jobs on Mesos clusters, so they’re heading in the same direction.
If you fancy some more reading, Hortonworks’ announcement is here, and ZDNet have a good write-up here. There are also views from Datanami, GeekWire, Silicon Angle and Search Data Management if you want more.