The Apache Software Foundation

The Apache Software Foundation is a non-profit organisation that supports a wide range of open source projects by providing and mandating a standard governance model (including the use of the Apache License), holding all trademarks for project names and logos, and providing legal protection to developers. It was founded in 1999 and now oversees nearly 200 projects.

Vendor Information

Other Names: Apache

Analytical Query Capabilities

HAWQ: A port of the Greenplum MPP database (which itself is based on PostgreSQL) to run over YARN and HDFS.
Tajo: Distributed analytical database engine supporting queries over data in HDFS, Amazon S3, Google Cloud Storage, OpenStack Swift and local storage, and querying over Postgres, HBase and Hive tables.
Kudu: Columnar storage technology for tables of structured data, supporting low latency reads, updates and deletes by primary key, as well as analytical column/table scans.
Quickstep (Retired): High performance database engine supporting SQL queries, based on a University of Wisconsin-Madison project; now retired
Hive: Supports the execution of SQL queries over data in HDFS using MapReduce, Spark or Tez based on tables defined in the Hive Metastore
Pig: Technology for running analytical and data processing jobs written in Pig Latin against data in Hadoop using MapReduce, Tez and Spark
MRQL (Incubating): Supports the execution of MRQL queries over data in Hadoop using MapReduce, Hama, Spark or Flink
Impala: An MPP query engine that supports the execution of SQL queries over data in HDFS, HBase, Kudu and S3 based on tables defined in the Hive Metastore
Drill: An MPP query engine that supports queries over one or more underlying databases or datasets without first defining a schema and with the ability to join data from multiple datastores together.
Lens: Provides a federated view over multiple data stores using a single shared schema server based on the Hive Metastore
Kylin: Supports the creation and querying of OLAP cubes on Hadoop, building cubes from star schema data in Hive into HBase, and then providing a SQL interface that queries across Hive and HBase as required

Analytical Search Capabilities

Solr: A search server built on Apache Lucene with a REST-like API for loading and searching data.

Compute Cluster Management

Hadoop/YARN: Resource management and job scheduling & monitoring for the Hadoop ecosystem.
Slider (Incubating): Application for deploying long running cluster applications on YARN, now effectively dead following the plan to add support for long running services directly into YARN
Twill: Abstraction over YARN that reduces the complexity of developing distributed applications
Mesos: Resource management over large clusters of machines
Aurora: Mesos framework for long-running services and cron jobs
ZooKeeper: Service for managing coordination (e.g. configuration information and synchronisation) of distributed and clustered systems.
Curator: A set of Java libraries that make using Apache ZooKeeper much easier
Myriad (Incubating): Tool that allows YARN applications to run over Apache Mesos, allowing them to co-exist and share cluster resources.
REEF: A framework for developing distributed apps on top of cluster frameworks such as YARN or Mesos

Data Formats

Avro: Data serialisation framework that supports both messaging and data storage, primarily using a compact binary format but also supporting a JSON format.
Parquet: Data serialisation framework that supports a columnar storage format to enable efficient querying of data.
Arrow: In memory columnar data format supporting high performance data exchange and fast analytical access
ORCFile: Evolution of RCFile, spun out into its own Apache project
CarbonData: Columnar format created by Huawei to address a number of perceived shortcomings in existing formats
Iceberg (Incubating): File based table format for large, slow-moving tabular data

Data Ingestion

NiFi: General purpose technology for the movement of data between systems, including the ingestion of data into an analytical platform.
Gobblin (Incubating): Framework for managing big data ingestion, including replication, organization and lifecycle management
Flume: Specialist technology for the continuous movement of data using a set of independent agents connected together into pipelines.
Sqoop: Specialist technology for moving bulk data between Hadoop and structured (relational) databases.
ManifoldCF: Framework for replicating data from content repositories to analytical search technologies

Data Processing

Hadoop/MapReduce: A data transformation and aggregation technology proven at extreme scale that works on key value pairs
Spark: A high performance general purpose distributed data processing engine based on directed acyclic graphs that primarily runs in memory, but can spill to disk if required
Tez: Data processing framework based on Directed Acyclic Graphs (DAGs), that runs natively on YARN and was designed to be a replacement for the use of MapReduce within Hadoop analytical tools
Crunch: An abstraction layer over MapReduce (and now Spark) that provides a high level Java API for creating data transformation pipelines
Nemo (Incubating): A runtime for data processing languages that dynamically adjusts to the runtime environment
Crail (Incubating): High performance distributed and tiered (in memory, flash and disk) storage layer for temporary data that provides memory, storage and network access that bypasses the JVM and OS, and support for Spark and Hadoop

Graph Technologies

Giraph: An iterative, highly scalable graph processing system built on top of MapReduce and based on Pregel
Hama: A general purpose BSP (Bulk Synchronous Parallel) processing engine inspired by Pregel and DistBelief that runs over Mesos or YARN.
Commons RDF: Commons library for working with RDF data
Jena: Framework for developing Semantic Web and Linked Data applications in Java
Rya (Incubating): RDF triple store built on Apache Accumulo
S2Graph (Incubating): OLTP graph database built on Apache HBase
TinkerPop: Graph compute framework for transactional and analytical use cases that’s integrated with a number of graph database technologies
Spark/GraphX: Spark library for processing graphs and running graph algorithms
Hadoop: A distributed storage and compute platform consisting of a distributed filesystem (HDFS), a cluster resource management layer (YARN), and MapReduce, a solution built on HDFS and YARN for massive scale parallel processing of data
Bigtop: Apache open source distribution of Hadoop
Ambari: Platform for installing, managing and monitoring Apache Hadoop clusters
Atlas: A metadata and data governance solution for Hadoop.
Knox: A stateless gateway for the Apache Hadoop ecosystem that provides perimeter security
Ranger: A centralised security framework for managing access to data in Hadoop
Sentry: A centralised security framework for managing access to data in Hadoop
Eagle: Security and performance monitoring solution for Hadoop, donated by eBay
Falcon: Data feed management system for Hadoop, although it no longer appears to be under development and is deprecated in HDP.

In Memory Technologies

Ignite: A distributed in-memory data fabric/grid, supporting a range of different use cases and capabilities
Geode: In memory data management platform, derived from Pivotal GemFire
Mnemonic: Hybrid memory / storage object model framework

Machine Learning Technologies

Spark/MLlib: Spark library for running Machine Learning algorithms
Mahout: Machine learning technology comprising a Scala based linear algebra engine (codenamed Samsara) with an R-like DSL/API that runs over Spark (with experimental support for H2O and Flink)
MADlib: Machine learning in SQL for PostgreSQL, Greenplum and Apache HAWQ
OpenNLP: Machine learning based toolkit for the processing of natural language text
SAMOA (Incubating): Machine learning framework that runs over multiple stream processing engines including Storm, Flink and Samza
SINGA (Incubating): Framework for developing machine learning libraries over a range of hardware
SystemML: Declarative machine learning over local, Spark or MapReduce execution engines
Hivemall (Incubating): Scalable machine learning library implemented as Hive UDFs/UDAFs/UDTFs

NoSQL Wide Column Stores

Accumulo: NoSQL wide-column datastore based on Google BigTable that runs on Hadoop and HDFS
Cassandra: Distributed wide-column datastore based on Amazon Dynamo and Google BigTable
HBase: NoSQL wide-column datastore based on Google BigTable that runs on Hadoop and HDFS
Fluo: Implementation of Google Percolator for maintaining aggregations in Accumulo
Omid (Incubating): ACID transaction support over MVCC key/value NoSQL datastores with support for Apache HBase
Tephra (Incubating): ACID transaction support over Apache HBase, used by Tigon and Apache Phoenix

OLTP Databases

Phoenix: An OLTP SQL query engine over Apache HBase tables that supports a subset of SQL 92 (including joins), and comes with a JDBC driver.
Trafodion: OLTP on Hadoop solution based on Tandem NonStop database IP with commercial support from Esgyn

IoT Databases

IoTDB (Incubating): Massive scale IoT time series DB

Streaming Analytics

Storm: Specialised distributed stream processing technology based on a single record (not micro batch) model with at least once processing semantics.
Flink: Specialised stream processing technology inspired by the Google Dataflow model, based on a single record (not micro batch) model, with exactly once processing semantics (for supported sources and sinks) via lightweight checkpointing, and with support for batch processing.
Spark/Streaming: Spark library for continuous stream processing that allows stream and batch processing (including Spark SQL and MLlib operations) to be combined
Apache Kafka Streams: Stream processing framework built over Apache Kafka, with support for stateful tables
Beam: Model and SDKs for running batch and streaming workflows over Apex, Flink, Spark and Google Dataflow
Apex: Data transformation engine based on Directed Acyclic Graph (DAG) flows configured through a Java API or via JSON that runs over YARN and HDFS with native support for both micro-batch streaming and batch use cases
Heron (Incubating): The stream processing framework that Twitter built after Storm, with a Storm compatible API
Samza: Stream processing framework built on Kafka and YARN
Bahir: A suite of streaming connectors for Spark and Flink, including support for Akka, MQTT, Twitter and ZeroMQ
Gearpump (Retired): Real-time streaming engine based on the micro-service Actor model, now retired

Streaming Data Stores

Kafka: Technology for buffering and storing real-time streams of data between producers and consumers, with a focus on high throughput at low latency.
BookKeeper: Distributed log storage service from Yahoo
DistributedLog: Distributed log service from Twitter supporting durability, replication and strong consistency built over Apache BookKeeper
Pulsar: Distributed pub-sub messaging from Yahoo, with persistent message storage based on Apache BookKeeper

Workflow Management

Oozie: Technology for managing workflows of jobs on Hadoop clusters.
Airflow: Workflow automation and scheduling system that can be used to author and manage data pipelines

Other Technologies

DataFu: A set of libraries for working with data in Hadoop, consisting of two sub-projects: DataFu Pig (a set of Pig User Defined Functions) and DataFu Hourglass (a framework for incremental processing using MapReduce).
AsterixDB: Scalable “Big Data Management System”
Chukwa: Specialist technology for the ingestion of continuous data flows into a Hadoop cluster, and the subsequent management and analysis of the data
Edgent (Incubating): Stream processing programming model and lightweight runtime to execute analytics at devices on the edge or at the gateway, previously known as Quarks
Gora: ORM with support for a range of NoSQL, Search and Hadoop data formats
Helix: A framework for building long lived persistent distributed systems
Kerby: Java Kerberos binding
MetaModel: Technology for reading and writing database metadata with connectors for a wide range of databases
Toree (Incubating): Framework to allow interactive applications to communicate with a remote Spark cluster
Calcite: A framework for building SQL based data access capabilities, supporting a SQL parser and validator and tools for the transformation and (cost based) optimisation of SQL expression trees.
Livy (Incubating): A service that allows Spark jobs (pre-compiled JARs) or code snippets (Scala or Python) to be executed by remote systems over a REST API or via clients for Java, Scala and Python.
Superset (Incubating): Web based tool for interactive exploration of OLAP style data, supporting interactive drag and drop querying, composable dashboards and a SQL workspace (SQL Lab).
Zeppelin: A web based notebook for interactive data analytics.
Commons Compress: Suite of Java libraries for working with a range of compression and packaging formats
Commons CSV: Suite of Java libraries for working with CSV files
Griffin: Data Quality Service platform built on Apache Hadoop and Apache Spark
Tika: Toolkit for extracting text from a wide range of document formats
UIMA: Framework for unstructured data analysis


Blog Posts