The Plan


One more post before we get started.

The following are my current thoughts on some of the topics I’d like to cover on this site. They serve both as a reference for my future self to look back on my naive optimism, and as an invitation for anyone who wants to start contributing to any of these now, or to start a discussion on any of the later topics to begin framing and exploring them.

Theme 1 - the technology catalogue

The plan here is to start building up a technology catalogue by looking at the key vendors in the Data Engineering space. This will only be a start on the technologies I’d expect to see in our catalogue, however, so once this is done the plan is to go through by technology category to complete the catalogue.

I’d also like to provide a concise yet detailed introduction to some technologies, describing exactly what each one is, how it works, and what its key features are. So much of the material on the internet is marketing material that glosses over the information I need to understand whether a technology might meet my use cases and integrate into my environment, and my hope is that I can use this site to address that.

Theme 2 - data engineering use cases

One thing I don’t want to do on this site is define another data ecosystem architecture - there are too many already, most of them are designed to sell specific technologies, and none of them will fit the range of different requirements and constraints that different organisations will have.

However, what I do want to do is look at the range of different use cases that you might use data engineering technologies for, from a Data Lake (and we’ll look at what that overloaded term actually means) to a Data Warehouse (and why they’re still relevant), from the acquisition of data to the preparation of a Query Focused Dataset, and from the management of a data catalogue to the monitoring of data quality metrics.

I’d then like to look at how different technologies and architectural patterns can support these use cases: how you implement a Data Lake using Hadoop, which technologies support data governance and data catalogues, and how the various streaming frameworks compare.

As part of this I also want to look at the core principles behind Data Transformation, what the state of the art in this space looks like, and how the established enterprise technologies compare to the new Open Source upstarts.

Theme 3 - delivery

As if the above isn’t already massively ambitious enough, I’d also like to talk about the delivery of data solutions, including:

  • How we can use best practice delivery concepts (e.g. configuration management, continuous integration and testing, automated deployment, infrastructure and database management) and what these mean within a data solution
  • How we can bring some of the new best practices from Lean and Agile into the data space, and what data transformation tools need to do in order to be able to support this
  • Why data projects can have a reputation for late delivery, cost overruns, poor quality data and a high cost of change, and what can be done about this

I think that will more than do us. Getting through that lot will take some time, but with help and contributions I think this site could be hugely valuable.