The first step in this journey is to create a catalogue of the technologies that will be of interest to us as we explore the world of data engineering.
The aim is to provide a useful reference for anyone starting a technology evaluation who wants to understand what options they have for a given situation, or whether and how a technology might fit into an existing ecosystem.
For each technology, the plan is therefore to provide a short summary describing the technology, along with its background and current status. As mentioned in my previous post, for some technologies I also want to do a deep dive: a longer, more detailed summary that gives a solid introduction to the technology. My hope is that we’ll get contributions to provide these for the vast majority of the technologies that I won’t get time to look at.
As part of this we’ll need to categorise the technologies, although making this useful is going to be challenging given that technologies from several different categories could be used to meet a given use case. Broadly, however, the technologies we’ll look at fall into three groups:
- Data transformation tools, be they one of the established commercial products or a new open source technology
- Data platforms, be they a traditional relational database, a Hadoop-based data platform, a real-time broker such as Kafka, or a NoSQL data platform
- Technologies that address the other supporting capabilities around these, such as data catalogues or metadata management
The plan is to start by looking at the key vendors in the data engineering space, including:
- The pure-play Hadoop distributions - Apache Bigtop, Hortonworks, Cloudera and MapR
- The rest of the Apache Data Engineering ecosystem
- The major cloud vendors - Amazon, Google and Microsoft (Azure)
- The big multi-play vendors - IBM, Oracle, Teradata, Microsoft, Pivotal, SAP and SAS
- The big specialist commercial data integration vendors - Ab Initio and Informatica
- Other commercial data integration and data platform vendors - perhaps based at least partially on looking at the latest Gartner and Forrester reports
- The major cloud-scale companies that open source their technology, if only to see what they do in this space - Facebook, Netflix, LinkedIn, Google, Yahoo, eBay and Twitter
This will only be a start on the technologies I’d expect to see in our catalogue, however, so once this is done the plan is to work through the rest by technology category.
This is obviously going to take some time given the vast range of technologies, so if you’re interested in contributing in whatever form, if you spot any issues or omissions, or if you have any comments to add, then please do share your thoughts.