Java-based framework for ingesting data into Hadoop. Ingestion jobs are defined through job configuration files and are made up of a number of stages. A Source identifies the work to be done and generates Work Units, which are then processed by Tasks. Each Task consists of an Extractor (reads the records to be processed), one or more Converters (each a 1:N transformation of records), a Quality Checker (covers both record-level and file-level checks), a Fork Operator (allows data to be written to multiple targets) and a Writer (writes out completed records). The output of a completed task is committed by a Publisher. Gobblin ships with a number of standard components, including support for a range of sources and targets, and also supports custom implementations of any stage.

Jobs can be run using a number of frameworks, including MapReduce (with all tasks running as map-only jobs), YARN, and as Java threads within a single JVM, with some modes also supporting an internal scheduler and job management engine.

Supports job locks (to ensure multiple instances of the same job don't run at the same time), job history metadata (via a job execution history store with a REST API that can be used to monitor jobs), exactly-once processing (via Publisher commits), failure handling (retrying both within and across jobs), capture and forwarding of execution and data quality metrics, post-processing of data (e.g. to remove duplicates or generate aggregations), partitioned writers, job configuration file templates, Hive table registration, high availability, data retention management (automatically deleting old data according to a number of retention rules), and data purging (Gobblin Compliance).

Developed at LinkedIn from late 2013, first announced in November 2014 and open sourced shortly afterwards, before being donated to the Apache Software Foundation in February 2017. Has stated deployments at a number of large organisations.
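
The flow of stages described above (Source, Work Units, Tasks, Publisher) can be illustrated with a minimal, self-contained sketch. This is not Gobblin's API: the class `PipelineSketch` and every method name in it are hypothetical, chosen only to mirror the stage names, and each stage is reduced to a toy operation on strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: hypothetical names, not Gobblin classes or interfaces.
public class PipelineSketch {

    /** Hypothetical work unit: one raw input line to be processed by a task. */
    public static final class WorkUnit {
        final String rawLine;
        WorkUnit(String rawLine) { this.rawLine = rawLine; }
    }

    /** Source: identifies the work to be done and generates work units. */
    public static List<WorkUnit> createWorkUnits(List<String> rawLines) {
        List<WorkUnit> units = new ArrayList<>();
        for (String line : rawLines) {
            units.add(new WorkUnit(line));
        }
        return units;
    }

    /** Extractor: reads the records belonging to a work unit. */
    public static List<String> extract(WorkUnit unit) {
        return List.of(unit.rawLine);
    }

    /** Converter: a 1:N transformation (splits on commas, upper-cases each field). */
    public static List<String> convert(String record) {
        List<String> out = new ArrayList<>();
        for (String field : record.split(",")) {
            out.add(field.trim().toUpperCase(Locale.ROOT));
        }
        return out;
    }

    /** Quality checker (record level): rejects empty records. */
    public static boolean passesQualityCheck(String record) {
        return !record.isEmpty();
    }

    /** Task: runs one work unit and stages each accepted record in every fork branch. */
    public static void runTask(WorkUnit unit, List<List<String>> stagedBranches) {
        for (String raw : extract(unit)) {
            for (String record : convert(raw)) {
                if (!passesQualityCheck(record)) {
                    continue; // dropped by the quality checker
                }
                for (List<String> branch : stagedBranches) {
                    branch.add(record); // fork operator + writer: one copy per target
                }
            }
        }
    }

    /** Job: runs every task, then "publishes" the staged output in a final step. */
    public static List<List<String>> runJob(List<String> rawLines, int branches) {
        List<List<String>> staged = new ArrayList<>();
        for (int i = 0; i < branches; i++) {
            staged.add(new ArrayList<>());
        }
        for (WorkUnit unit : createWorkUnits(rawLines)) {
            runTask(unit, staged);
        }
        // Publisher: output only becomes visible once all tasks have completed.
        List<List<String>> published = new ArrayList<>();
        for (List<String> branch : staged) {
            published.add(List.copyOf(branch));
        }
        return published;
    }

    public static void main(String[] args) {
        System.out.println(runJob(List.of("a,b", "", "c"), 2));
    }
}
```

In this sketch, writers only stage records and the final publish step exposes them. That loosely mirrors the idea of committing output via the Publisher, though a real deployment would involve file systems, state stores and failure handling that are omitted here.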
Other Names: Apache Gobblin, Gobblin
Vendors: The Apache Software Foundation
Type: Open Source - Active
Last Updated: December 2018 - v0.14
| Version | Release Date | Release Links | Release Comment |
|---------|--------------|---------------|-----------------|
| 0.5 | 2015-09-28 | Announcement | |
| 0.6 | 2015-12-21 | GitHub release page | |
| 0.7 | 2016-05-18 | GitHub release page, Announcement | Dataset lifecycle features |
| 0.8 | 2016-09-03 | GitHub release page | |
| 0.9 | 2016-12-19 | GitHub release page | |
| 0.10 | 2017-05-05 | GitHub release page | First Apache release |
| 0.11 | 2017-07-20 | GitHub release page | |
| 0.12 | 2018-07-02 | GitHub release page | |
| 0.13 | 2018-09-20 | GitHub release page | |
| 0.14 | 2018-12-08 | GitHub release page | |