A distributed storage and compute platform consisting of a distributed filesystem (HDFS) and a cluster workload and resource management layer (YARN), along with MapReduce, a solution built on HDFS and YARN for massive-scale parallel processing of data. Has an extensive ecosystem of compatible technologies. An Apache Open Source project, started in January 2006 as a Lucene sub-project, becoming a top-level project in January 2008, with a 1.0 release in December 2011 (containing HDFS and MapReduce), and a 2.2 release (the first 2.x GA release) in October 2013 (adding YARN). Work is currently underway to split out the data storage layer of HDFS (the HDDS sub-project) and to implement an object store on top of this that can co-exist with HDFS (the Ozone sub-project). Very active, with a deep and broad range of contributors, and backing from multiple commercial vendors.
Other Names: Hadoop
Vendors: The Apache Software Foundation
Type: Commercial Open Source
Last Updated: January 2019 - v3.2
Apache Hadoop > HDFS
A highly resilient distributed cluster file system proven at extreme scale. Consists of a single NameNode service (responsible for all metadata management, including the filesystem namespace and block management) plus DataNode services that run on all storage nodes and manage block IO. Supports NameNode high availability, metadata resilience (via a transaction log), data resilience (via block replication or erasure coding), user authentication, extended ACLs, snapshots, quotas, central caching, a REST API, an NFS gateway, rolling upgrades, rack awareness, transparent encryption, NameNode federation (support for multiple independent NameNodes on the same cluster serving different namespaces) and heterogeneous storage. Part of the original Hadoop code base, becoming an Apache Hadoop sub-project in July 2009. Currently being updated to run over the new HDDS (Hadoop Distributed Data Storage) layer, moving block management from the NameNode to a new Storage Container Manager to increase scalability.

Apache Hadoop > MapReduce
A data transformation and aggregation technology proven at extreme scale that works on key value pairs and consists of three stages: map (a general transformation of the input key value pairs), shuffle (brings all pairs with the same key together) and reduce (an aggregation of all pairs with the same key). Part of the original Hadoop code base, becoming an Apache Hadoop sub-project in July 2009.

Apache Hadoop > YARN
Resource management, job scheduling and monitoring for the Hadoop ecosystem. Includes support for capacity guarantees amongst other scheduling options, long running services, GPU and FPGA scheduling and isolation, and experimental support for launching applications within Docker containers. Started in January 2008 and added as an Apache Hadoop sub-project as part of Hadoop 2.x (with a GA release as part of 2.2 in October 2013).
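The map, shuffle and reduce stages described under MapReduce above can be illustrated with a small single-process sketch. This is not Hadoop's API: the function names (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_word_count`) are hypothetical and chosen only for illustration, and the whole pipeline runs in memory rather than across a cluster. It shows how data moves through the three stages using the familiar word count example.

```python
from collections import defaultdict


def map_stage(records, mapper):
    """Apply the mapper to every input (key, value) pair.

    Each mapper call may emit zero or more output pairs.
    """
    for key, value in records:
        yield from mapper(key, value)


def shuffle_stage(pairs):
    """Bring together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(groups, reducer):
    """Aggregate the grouped values for each key into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(_line_number, line):
    # Emit (word, 1) for every word on the line.
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(_word, counts):
    # Sum the 1s emitted for this word.
    return sum(counts)


def run_word_count(lines):
    # Input pairs are (line number, line text).
    records = enumerate(lines)
    mapped = map_stage(records, word_count_mapper)
    grouped = shuffle_stage(mapped)
    return reduce_stage(grouped, word_count_reducer)


if __name__ == "__main__":
    print(run_word_count(["the quick fox", "the lazy dog"]))
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; the sketch only mirrors the logical data flow.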
Apache Hadoop > HDDS
A common distributed and resilient block storage layer that will eventually underpin HDFS and Ozone, delivering increased scalability. Implemented as a Storage Container Manager (SCM) service (that performs block management) and DataNode services (inherited from HDFS, which run on storage nodes and manage block IO). Blocks are arranged into containers (with the replication strategy defined at the container level). Currently under active development as part of the development of Ozone. Previously known as HDSL (Hadoop Distributed Storage Layer).

Apache Hadoop > Ozone
An object store built on top of the new Hadoop HDDS block storage layer that can co-exist with HDFS. Implemented as an Ozone Manager (OM) service that manages the object store namespace, utilising the HDDS Storage Container Manager for block management. Objects are arranged into buckets, which themselves are arranged into volumes. Supports consistent writes, an RPC API, an Amazon S3 compatible REST API, a CLI, a load generation tool (Freon, previously Corona), and a Hadoop Compatible File System (OzoneFS), with a stated plan for mountable LUN storage (Quadra). Originally announced in October 2014, re-invigorated under the Hortonworks Open Hybrid Architecture Initiative in September 2018, and currently under active development with a suggested release as part of HDP 3.2.
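The Ozone namespace hierarchy described above (objects in buckets, buckets in volumes) can be pictured with a minimal in-memory model. This is only a sketch of the hierarchy, not the Ozone client API: the class and method names (`ObjectStore`, `put`, `get`, `list_keys`) are hypothetical, and for brevity volumes and buckets are created implicitly on first write, which is a simplification made here.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    name: str
    # Maps an object key to its contents.
    objects: dict = field(default_factory=dict)


@dataclass
class Volume:
    name: str
    # Maps a bucket name to its Bucket.
    buckets: dict = field(default_factory=dict)


class ObjectStore:
    """Toy namespace: volume -> bucket -> object key (hypothetical model)."""

    def __init__(self):
        self.volumes = {}

    def put(self, volume, bucket, key, data):
        # Simplification: create the volume and bucket if they are missing.
        vol = self.volumes.setdefault(volume, Volume(volume))
        bkt = vol.buckets.setdefault(bucket, Bucket(bucket))
        bkt.objects[key] = data

    def get(self, volume, bucket, key):
        return self.volumes[volume].buckets[bucket].objects[key]

    def list_keys(self, volume, bucket):
        return sorted(self.volumes[volume].buckets[bucket].objects)


if __name__ == "__main__":
    store = ObjectStore()
    store.put("vol1", "bucket1", "logs/a.txt", b"hello")
    print(store.list_keys("vol1", "bucket1"))
```

Every object is addressed by the triple of volume, bucket and key; in the sketch a slash inside a key is just part of the name and does not create a directory.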
Packaged by Apache Bigtop, Hortonworks Data Platform, Cloudera CDH, Amazon EMR, Google Cloud Dataproc and Qubole Data Service.
| version | release date | release links | release comment |
|---------|--------------|---------------|-----------------|
| 2.8 | 2017-03-22 | summary | Note that 2.8.2 is the first GA version for production use |
| 2.9 | 2017-11-17 | summary | |
| 3.0 | 2017-12-14 | summary; announcement | |
| 3.1 | 2018-04-06 | summary; announcement; Hortonworks post | Support for containerised workloads, GPU/FPGA support |
| 3.2 | 2019-01-23 | summary; announcement; Datanami | |