Apache Gobblin (Incubating)

Java-based framework for ingesting data into Hadoop. Ingestion jobs are defined through job configuration files and are made up of a number of stages. A Source identifies the work to be done and generates Work Units, which are then processed by Tasks. Each Task consists of an Extractor (reads the records to be processed), one or more Converters (a 1:N transformation of records), a Quality Checker (covers both record and file checks), a Fork Operator (allows data to be written to multiple targets) and a Writer (writes out completed records). The output of a completed Task is committed by a Publisher. Gobblin ships with a number of standard components, including support for a range of sources and targets, and also supports custom implementations of any stage.

Jobs can be run using a number of frameworks, including MapReduce (with all tasks running as mapper-only jobs), YARN, and Java threads within a single JVM, with some modes also supporting an internal scheduler and job management engine.

Supports job locks (to ensure multiple instances of the same job don't run at the same time), job history metadata (via a job execution history store with a REST API that can be used to monitor jobs), exactly-once processing (via Publisher commits), failure handling (retrying both within and across jobs), capture and forwarding of execution and data quality metrics, post-processing of data (e.g. to remove duplicates or generate aggregations), partitioned writers, job configuration file templates, Hive table registration, high availability, data retention management (automatically deleting old data according to a number of retention rules), and data purging (Gobblin Compliance).

Developed at LinkedIn from late 2013, first announced in November 2014 and open sourced shortly afterwards, then donated to the Apache Software Foundation in February 2017. Has stated deployments at a number of large organisations.
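To make the Source, Task and Publisher stages described above more concrete, the following is a minimal, self-contained sketch of that flow in plain Java. It is an illustration only and does not use Gobblin's real classes or interfaces: every name in it (PipelineSketch, WorkUnit, planWorkUnits, runTask, publish, runJob) is hypothetical, the quality check is record-level only, and "publishing" is simplified to copying staged output once all tasks have finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Illustrative model of the stages described above; NOT Gobblin's real API. */
final class PipelineSketch {

    /** Hypothetical work unit: a batch of raw records for one task. */
    static final class WorkUnit {
        final List<String> rawRecords;
        WorkUnit(List<String> rawRecords) { this.rawRecords = rawRecords; }
    }

    /** "Source" stage: split the input into work units of at most batchSize records. */
    static List<WorkUnit> planWorkUnits(List<String> input, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<WorkUnit> units = new ArrayList<>();
        for (int start = 0; start < input.size(); start += batchSize) {
            int end = Math.min(start + batchSize, input.size());
            units.add(new WorkUnit(new ArrayList<>(input.subList(start, end))));
        }
        return units;
    }

    /** "Task" stage: extract, convert (1:N), quality check, then fork to every branch. */
    static void runTask(WorkUnit unit,
                        Function<String, List<String>> converter,
                        Predicate<String> qualityCheck,
                        List<List<String>> stagingBranches) {
        for (String record : unit.rawRecords) {                   // extractor: read each record
            for (String converted : converter.apply(record)) {    // converter: one record in, N out
                if (qualityCheck.test(converted)) {               // record-level quality check
                    for (List<String> branch : stagingBranches) { // fork: one writer per branch
                        branch.add(converted);
                    }
                }
            }
        }
    }

    /** "Publisher" stage: expose staged output only after all tasks have run. */
    static List<String> publish(List<String> staged) {
        return new ArrayList<>(staged);
    }

    /** Runs the whole job: plan work units, run one task per unit, then publish each branch. */
    static List<List<String>> runJob(List<String> lines, int batchSize, int branches) {
        Function<String, List<String>> splitWords =
                line -> Arrays.asList(line.trim().split("\\s+"));
        Predicate<String> nonEmpty = word -> !word.isEmpty();

        List<List<String>> staging = new ArrayList<>();
        for (int i = 0; i < branches; i++) {
            staging.add(new ArrayList<>());
        }
        for (WorkUnit unit : planWorkUnits(lines, batchSize)) {
            runTask(unit, splitWords, nonEmpty, staging);
        }
        List<List<String>> published = new ArrayList<>();
        for (List<String> staged : staging) {
            published.add(publish(staged));
        }
        return published;
    }

    public static void main(String[] args) {
        // Prints [[a, b, c], [a, b, c]]: the empty line is dropped by the quality check.
        System.out.println(runJob(Arrays.asList("a b", "", "c"), 2, 2));
    }
}
```

In this sketch the tasks run sequentially in one thread, which loosely corresponds to the single-JVM mode mentioned above; distributing the work units across MapReduce or YARN is outside its scope.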

Technology Information

Other Names: Apache Gobblin, Gobblin
Vendors: The Apache Software Foundation
Type: Open Source - Active
Last Updated: December 2018 - v0.14

Release History

Version | Release Date | Release Links                     | Release Comment
0.5     | 2015-09-28   | Announcement                      |
0.6     | 2015-12-21   | GitHub release page               |
0.7     | 2016-05-18   | GitHub release page, Announcement | Dataset lifecycle features
0.8     | 2016-09-03   | GitHub release page               |
0.9     | 2016-12-19   | GitHub release page               |
0.10    | 2017-05-05   | GitHub release page               | First Apache release
0.11    | 2017-07-20   | GitHub release page               |
0.12    | 2018-07-02   | GitHub release page               |
0.13    | 2018-09-20   | GitHub release page               |
0.14    | 2018-12-08   | GitHub release page               |

News