So you’re building that data-intensive solution for your Machine Learning / Data Analytics Startup / Corporate transformation project or product test… Regardless of where you sit, if you’re in the data engineering space this is frankly essential reading.
Organisations often want to jump into this space without considering the basics first. It's a lot more than just data preparation. An organisation that is not serious about data quality is also not serious about data governance and data integrity.
What follows here is a practical guide for CTOs, Data Engineers, Data Scientists, their teams, Entrepreneurs and business folks. I will also outline important considerations, both in terms of business deliverables and the basic approaches and principles you need to consider in this space.
About me and this article
As a Consultant CTO, I have worked with a number of startups, all of which are considering data-driven solutions, machine learning capabilities and data engineering requirements. Most of them are in heavily regulated sectors such as financial services and healthcare, where data accuracy is critical to business success, reputation, compliance and security.
For the vast majority of businesses in this space, data is their IP: it's the knowledge from which they need to build and evolve their capabilities. It is also critical to monetisation. As more and more application-side software is commoditised, making IP accessible via an API for a fee is an increasingly appealing business model.
Data Engineering — MVP to Enterprise
In order to achieve data integrity, you first need to focus on data quality. For anyone new to this space this is really priority number 1.
The key to tackling data quality is to start early and not run before you can walk. A typical scenario is a startup team that builds an MVP, with business folks talking about data without understanding its function or its impact, either technically or in terms of business outcomes. Data extraction sources get added, it gets messy, nothing is planned and, frankly, all of these execution issues could be avoided.
Start with the basics
In AWS a data lake is simply S3 buckets strung together. That’s because a data lake is a centralized repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data. It can store data in its native format and process any variety of it, regardless of size. Well, that’s also what an object store does, and an object store is what an S3 bucket is.
Data lakes allow you to import any amount of data, collect it from multiple sources and move it into the data lake in its original format. This gives you a lot of flexibility to scale while saving time on defining data structures, schema, and transformations.
My own experience is that people just believe you can throw anything into a data lake and it can then just be ingested into some kind of analytics solution which will magically show you accurate results without any effort…
Oh my sweet summer child, how naive this is…
First things first.
Writing your Data Schema
It's a bit difficult to define how to store your data if you don’t know what queries you want to run. Anyone who has an MVP together should know this. The earlier you document this, the better position you are in. It really settles a lot of arguments between engineering teams fairly quickly. Query types will dictate database types, as will the need to scale them and possibly add new query types over time. That will also dictate what you select for initial storage and become the first part of your roadmap.
How you document this is up to you. Some DB solutions allow you to export your schema, which would give you a starter for ten. Another critical step is defining your entity relationships. Generating a diagram for this is a really good exercise for backend programmers. If you’re in an early-stage business, creating or reviewing the diagram as a group activity can be really helpful for forming a collective understanding of the data model within your team.
This will then need to be periodically reviewed against your immediate needs as a business and your longer-term goals. It is also important to do this before you start building your first data pipeline.
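One lightweight way to keep a documented schema close to the code is to express it as typed records. The sketch below is only an illustration: the `Customer` and `Order` entities, their fields and the `validate_record` helper are all hypothetical names, and a real schema would come from your own queries and entity relationships.

```python
from dataclasses import dataclass
from datetime import date
from typing import get_type_hints


@dataclass(frozen=True)
class Customer:
    """Hypothetical entity: one row per customer."""
    customer_id: int
    email: str


@dataclass(frozen=True)
class Order:
    """Hypothetical entity: each order belongs to exactly one customer."""
    order_id: int
    customer_id: int  # refers to Customer.customer_id
    placed_on: date
    total_pence: int


def validate_record(schema, record: dict) -> list:
    """Return a list of problems found when checking a raw dict against a schema."""
    problems = []
    for name, expected in get_type_hints(schema).items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(
                f"wrong type for {name}: expected {expected.__name__}, got {actual}"
            )
    return problems
```

Calling `validate_record(Order, some_dict)` returns an empty list when every documented field is present with the documented type, which makes it easy to review the schema and the checks together as a team.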
Building your first pipeline
The point at which to start considering an initial pipeline is when you don’t yet have huge amounts of data to ingest, transform and consume, but anticipate that this will be a future architectural requirement.
The key thing is to ensure you separate raw data from clean, transformed and consumed data (a minimum of three S3 buckets; for an architecture reference see below). The next challenge is how you move the data along the pipeline. Assuming you don’t have any “real-time” or “near real-time” data consumption demands and your data consumption and processing is fairly limited, there is a range of approaches you can use to move the data between the buckets. Many of these approaches do not involve a significant learning curve. As soon as you scale up to large volumes of data and data processing, this becomes a significant task and requires a lot of learning of complex products and toolkits (Apache Spark and AWS Glue, for example) in order to tackle it efficiently. This needs to be planned for: engineers need to learn gradually (or, preferably, experienced talent needs to be hired in) in order to be effective. The learning curve for these enterprise solutions is frankly hard.
But what is true in any scenario is that you should start simple, with one or two use cases, and work out how you will scale it. This is just as relevant and important in big corporates as in early-stage businesses.
Architecture and a technical roadmap are key: understand the bare minimum needed to meet your business requirements now, and have a plan for how you will gradually move towards a longer-term goal.
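To make the raw / clean / transformed separation concrete, here is a minimal sketch that uses local folders as stand-ins for the three buckets. Everything in it is an assumption for illustration: the stage names, the `user` and `amount` fields, and the cleaning and aggregation rules. In AWS the same steps would read from and write to S3 rather than the local disk.

```python
import json
from pathlib import Path

# Local folders stand in for the three buckets; the names are hypothetical.
STAGES = ("raw", "clean", "transformed")


def init_stages(root: Path) -> dict[str, Path]:
    """Create one folder per pipeline stage and return them by name."""
    paths = {name: root / name for name in STAGES}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def clean(raw_file: Path, clean_dir: Path) -> Path:
    """Write valid, normalised records to the clean stage; the raw file is never modified."""
    records = json.loads(raw_file.read_text())
    valid = [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in records
        if r.get("user") and r.get("amount") is not None
    ]
    out = clean_dir / raw_file.name
    out.write_text(json.dumps(valid))
    return out


def transform(clean_file: Path, transformed_dir: Path) -> Path:
    """Aggregate clean records into the shape the consumer queries (total per user)."""
    totals: dict[str, float] = {}
    for r in json.loads(clean_file.read_text()):
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    out = transformed_dir / clean_file.name
    out.write_text(json.dumps(totals, sort_keys=True))
    return out
```

Because each stage only ever writes to the next one, you can re-run cleaning or transformation from the untouched raw data whenever a rule changes.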
Early-stage startup — limited data volumes
There are of course a gazillion slightly different approaches, depending on your use case, but here are some straightforward guides to get started.
For early-stage startups/product prototypes and easy learning:
Basic data pipeline using step functions:
Automate a data pipeline using Lambda
https://medium.com/codex/automating-an-end-to-end-data-pipeline-on-aws-cloud-7b5489e2925d
Once you have decided to progress towards enterprise scale and have requirements for transforming large amounts of data, you can make a start by setting up the AWS Glue Data Catalog and implementing your first Glue crawler. Get started early on the learning curve and gradually build up to this over time where possible.
https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html
Later stage startup — increasing data volumes
A good basic design pattern — (regardless of whether you will use serverless or not)
This is taken from Serverless Data Pipelines — see the blog post and links to white papers here: https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/
I like serverless; however, it's not always appropriate. If you have a lot of event-triggered asynchronous tasks, it could be the right approach for your product. However, a fully serverless approach is a big jump for most teams… see pitfalls.
Other tools, solutions and services for data pipelines and data ingestion at scale can vary wildly depending on your use case, so here is a mix of open-source and managed solutions.
Data Processing at Scale
This is very dependent on your use case of course, but here are some of the most popular solutions on the market. Make sure you work with someone experienced in this space to ensure you select the right tool for the right context. Mistakes here are not just costly in terms of infrastructure but also in terms of learning time for individuals and also entire teams.
An important word on cost…
Data processing at scale is expensive. Make sure you carefully map everything that can be optimised for cost efficiency, and exploit things like spot instances for non-production workloads. Data streaming in particular is costly. Do the users really need “real-time” data, or is it a nice-to-have? Really drill down on ABSOLUTE requirements.
There are benchmarks thrown around for SaaS businesses suggesting that infrastructure should cost no more than 10% of ARR. For an early-stage business that hasn’t yet generated enough turnover this is often entirely unrealistic, and heavy Data Analytics and ML products and solutions will be costly to set up and maintain. It's simply unrealistic to think this will fit neatly into 10% of ARR. No one who has experience of running a real business in this space has this expectation. Experienced talent costs money, and so do heavy compute workloads.
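To see how quickly the 10% benchmark breaks down, it helps to run the arithmetic on your own numbers. The sketch below is a trivial calculation; the function name and the figures in the usage note are made up for illustration only.

```python
def infra_share_of_arr(monthly_infra_cost: float, arr: float) -> float:
    """Annualised infrastructure spend as a fraction of annual recurring revenue."""
    if arr <= 0:
        raise ValueError("ARR must be positive")
    return (monthly_infra_cost * 12) / arr
```

With a hypothetical 8,000 a month of infrastructure spend against 400,000 of ARR, the result is 96,000 / 400,000 = 0.24, or 24% of ARR, which is well over the benchmark before any heavy processing has been added.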
Open source does not necessarily help you. The overheads are very high (and never ever factored in by OS obsessives) because of the endless maintenance, security patching and fiddly configuration, for which teams need a lot of time and effort (this additional overhead gets hidden). OS in principle is brilliant and it can save you money, but you must factor in the learning curve and additional effort for maintenance.
Some commonly used products and solutions to consider (depending on your use case, of course) are listed below.
Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. It is a good potential combination with Kinesis for real-time data processing workloads, for which AWS offers a managed integration.
Apache Flink is a distributed processing engine and a scalable data analytics framework. You can use Flink to process data streams at a large scale and to deliver real-time analytical insights about your processed data with your streaming application.
Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly. It is similar to Spark, with some key differences: Spark uses random access memory (RAM) instead of reading and writing intermediate data to disk, while Hadoop stores data across multiple machines and processes it in batches via MapReduce.
AWS Kinesis — You can use Amazon Kinesis Data Streams to collect and process large streams of data records in real time. You can create data-processing applications, known as Kinesis Data Streams applications. A typical Kinesis Data Streams application reads data from a data stream as data records. This is part of a family of related products which are all tailored for different use cases at scale.
AWS EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform for running big data frameworks, such as Apache Hadoop and Apache Spark. It performs some similar ETL functions to Glue; however, EMR is a more highly configurable solution.
AWS Glue is a fully managed ETL service for preparing and loading data for analytics. You can create and run an initial ETL job with a few clicks in the AWS Management Console. Glue uses parallel processing to run large workloads faster than Lambda (developers often get confused here). Glue is not an easy solution to learn, however, and most of the configuration needs to be set up via the AWS CLI (console functionality is actually limited).
Snowflake is a cloud-based data warehouse designed for SQL-based querying. It is often confused with Hadoop (a fundamentally different solution, see above), hence it is mentioned here. It is better compared to the AWS data warehousing solution, Redshift.
Google BigQuery is really a Redshift or Snowflake alternative. I’m not sure why people confuse this one with the data processing solutions above…
Google Dataflow is a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing. It can also be used for ETL workloads, and it is a solution to consider for applications that require low-latency data processing where Spark Streaming may not be suitable.
Data Pipelines with Google Cloud
Intro to Google Cloud’s batch-based pipelines here:
https://www.cloudskillsboost.google/course_templates/53
Azure Data Factory is an ETL solution that has similar functions to AWS Glue, as well as some fundamental differences. It's possible to build pipelines in Data Factory; a quick start guide is below.
https://learn.microsoft.com/en-us/azure/data-factory/quickstart-get-started
Azure Synapse is another Redshift/Snowflake alternative.
I like Microsoft’s infrastructure products as they are often more flexible and configurable than many of the AWS offerings, whilst Google has been very good at offering a developer-centric approach. However, outside of big corporate enterprise environments the costs of both Microsoft’s and Google’s offerings are high. This is because their solutions are deliberately packaged as groups of products for sale to big corporates. AWS costs are only ever high when engineers have no idea how to configure and run those services. This knowledge is key to cost management (not to mention good security).
Summary of a data pipeline approach using Apache Spark and Airflow.
https://towardsdatascience.com/how-to-build-data-engineering-pipelines-at-scale-6f4dd3746e7e
Pitfalls
Startups in particular really struggle as soon as they have to consider enterprise scale. I have seen businesses with a simple web app and limited data ingestion whose junior engineering teams were busily implementing a data mesh, only for it to be ditched later once the incumbent (and inappropriately skilled) leadership moved on…
Architecture
To do a competent job in leadership here you need architecture which is appropriate for the business and scenario. Adopting inappropriate solutions in anticipation of future needs is a very common mistake. It is very common for inexperienced teams but also happens routinely with experienced engineers. One challenge is engineering teams wanting to upskill and seeing something anticipated in the roadmap as an opportunity to do so.
Learning curve
Learning enterprise solutions takes time and patience, and learning on the job needs to be mixed with guidance from senior engineers who can train less experienced team members over time. Assuming your business is well funded and scaling, this should be treated as an evolutionary approach. Admittedly this is a luxury that’s not always possible in startups.
Quality management
Another major pitfall is not considering quality management of data at an early stage (hence this article). Building this in is absolutely key to evolving significant data-driven IP as well as good-quality product outputs.
Major architectural shifts — move to serverless
Assuming that, just because you “know it well”, it's appropriate and simple for other people in the team is a mistake. A major architectural shift like serverless is a big jump even for an experienced team. To expect that an inexperienced team will manage this is wishful thinking. Again, either a lot of experienced support has to be procured or a gradual approach has to be taken to evolve the solution to this architecture. The move to serverless, much like the move to microservices, has been popular, poorly applied in many scenarios and misunderstood. Make sure it's appropriate for your scenario. No approach is perfect…
Understanding Quality Metrics
The way to make a start here is to first understand the types of problems that may arise. As a development team, you need to write down the data processing and quality requirements PER DATA SOURCE. These need to be mapped into your pipeline. The earlier you do this, the better position you are in to reduce error rates.
This will define your strategy.
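Written-down requirements are easier to map into a pipeline when they are also machine-readable. The sketch below shows one possible shape, assuming a simple registry keyed by source name. The source names (`crm_api`, `clickstream`), field names and thresholds are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRequirements:
    """Quality requirements for one data source (all values are illustrative)."""
    source: str
    required_fields: frozenset
    max_missing_rate: float  # tolerated share of records lacking a required field


# Hypothetical sources and thresholds; replace with your own per-source rules.
REQUIREMENTS = {
    "crm_api": SourceRequirements("crm_api", frozenset({"id", "email"}), 0.0),
    "clickstream": SourceRequirements("clickstream", frozenset({"session", "ts"}), 0.05),
}


def missing_rate(records: list, required_fields: frozenset) -> float:
    """Share of records with at least one required field absent or None."""
    if not records:
        return 0.0
    bad = sum(
        1 for r in records
        if any(r.get(f) is None for f in required_fields)
    )
    return bad / len(records)


def batch_passes(source: str, records: list) -> bool:
    """True when the batch meets the documented requirements for its source."""
    req = REQUIREMENTS[source]
    return missing_rate(records, req.required_fields) <= req.max_missing_rate
```

A pipeline step can then call `batch_passes` before promoting a batch from raw to clean, so the documented requirement and the enforced one cannot drift apart.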
Mapping data integrity requirements
This needs to be defined per specific use case and scenario. However, I have listed some of the data quality challenges you are likely to encounter and how to approach each area.
False positives
This is quite a common scenario in cybersecurity, where you are scanning for vulnerabilities and there are a number of tools on the market for doing so. However, it's also a scenario in machine learning. In order to eliminate false positives (or reduce them) you need to make sure they are properly defined. Review your dataset to assess whether bias or drift is occurring, eliminate any rules you don’t need, and tune rules to any environment-specific thresholds.
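Defining false positives properly usually starts with counting them against a labelled sample. The following is a minimal sketch that assumes you have a list of flags raised by a rule or model (`predicted`) and a manually verified list of what was really true (`actual`); both names are illustrative.

```python
def false_positive_stats(predicted: list, actual: list) -> dict:
    """Count outcomes for binary flags and derive precision and false positive rate."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # flagged and really an issue
    fp = sum(1 for p, a in pairs if p and not a)      # flagged but not an issue
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly left alone
    return {
        "false_positives": fp,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
    }
```

Tracking these two ratios over time, per rule or per environment, gives you something concrete to tune thresholds against.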
Missing fields
Are these fields missing from the consumption layer? Or are they missing because you extracted data records from an API which either had poor data records or added new fields? One thing is certain: you must have a defined data schema and understand what queries you need to run.
Missing data
Not all data records are consistent. What do you do when fields are simply missing? How will this be handled in your data consumption solution? How do you check for missing data records?
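Both of the situations above, fields that have gone missing and fields a source has quietly added, can be detected by comparing each record against the documented field set. This is a sketch under the assumption that records arrive as plain dictionaries; the helper names and example fields are hypothetical.

```python
from collections import Counter


def diff_fields(expected: set, record: dict) -> dict:
    """Compare one record against the documented field set."""
    present = {k for k, v in record.items() if v is not None}
    return {
        "missing": sorted(expected - present),         # documented but absent or null
        "unexpected": sorted(set(record) - expected),  # fields the source has added
    }


def summarise_batch(expected: set, records: list) -> dict:
    """Count how often each field is missing or unexpected across a batch."""
    missing, unexpected = Counter(), Counter()
    for record in records:
        diff = diff_fields(expected, record)
        missing.update(diff["missing"])
        unexpected.update(diff["unexpected"])
    return {"missing": dict(missing), "unexpected": dict(unexpected)}
```

A rising count under `unexpected` is often the first sign that an upstream API has changed and the schema needs reviewing.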
Disambiguation
When data is extracted from an API, you are often faced with extracting the entire data record and then disambiguating the required data fields from the original raw data source. These fields may often also need to be transformed. You will need to ensure that all data that is extracted and transformed has a consistent data type post-transformation.
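One way to keep extraction and type conversion consistent is to declare, in one place, where each required field lives in the raw record and what type it should end up as. The sketch below assumes a nested JSON-style record; the field map, paths and converters are invented for illustration and do not refer to any particular API.

```python
from datetime import datetime

# Hypothetical mapping: output field -> (path into the raw record, converter).
FIELD_MAP = {
    "user_id": (("user", "id"), int),
    "email": (("user", "contact", "email"), str.lower),
    "signed_up": (("meta", "created"), lambda s: datetime.fromisoformat(s).date()),
}


def dig(raw: dict, path: tuple):
    """Follow a path of keys into nested dicts; return None if any key is absent."""
    current = raw
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract(raw: dict) -> dict:
    """Pull only the required fields from a raw record and convert each to its target type."""
    out = {}
    for name, (path, convert) in FIELD_MAP.items():
        value = dig(raw, path)
        out[name] = None if value is None else convert(value)
    return out
```

Every record that passes through `extract` has the same keys and the same types, regardless of how much extra material the raw record carried.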
Corrupt data
Many SQL database engines have built-in tools for checking for corruption, and there are other specific toolkits on the market for automating data assessment; however, manual QA may often be required. Monitoring of
Data transformation
Solutions like AWS Glue are ideal for transforming large amounts of data. However, for small amounts that don’t have scaling demands it's possible to run scripts in Python and use data frames. For enterprise workloads, you need tools like Glue.
Incorrect data type
This is so basic, and yet it is something that early-stage teams get wrong. It is critical that you have your data schema documented, your queries documented and that your engineering teams assess the languages and frameworks used to ensure that those queries are efficient. Data typing varies a lot between programming languages; some are flexible and will allow data to be consumed even if it's the wrong “type”.
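Python is a good example of a language that will happily consume the wrong type without complaining. The two functions below are a small illustration (the names are made up): the first silently produces nonsense when a quantity arrives as a string, the second rejects it at the boundary.

```python
def double_quantity(quantity):
    """Naive arithmetic that accepts any type supporting multiplication."""
    return quantity * 2


def double_quantity_checked(quantity):
    """Same arithmetic, but refuses anything that is not a true integer."""
    # bool is a subclass of int in Python, so it has to be excluded explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        raise TypeError(f"quantity must be int, got {type(quantity).__name__}")
    return quantity * 2
```

`double_quantity("5")` returns the string `"55"` rather than the number 10, and no error is raised. That kind of value can travel a long way down a pipeline before anyone notices.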
Repetition
Again, you can run tooling to detect data records that are repeated. At its most basic, this means running scripts to identify possible repetition in datasets and then manually checking and removing duplicates.
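A basic script of that kind might look like the sketch below. It assumes records are dictionaries and that you choose which fields identify a record (`key_fields`). The normalisation rule (trim whitespace, ignore case) is an assumption you would adapt to your own data.

```python
def find_duplicates(records: list, key_fields: tuple) -> dict:
    """Group record positions by a normalised key; keep only keys seen more than once."""
    seen = {}
    for index, record in enumerate(records):
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        seen.setdefault(key, []).append(index)
    return {key: idxs for key, idxs in seen.items() if len(idxs) > 1}


def drop_duplicates(records: list, key_fields: tuple) -> list:
    """Keep the first record for each key, preserving the original order."""
    to_drop = {
        i
        for idxs in find_duplicates(records, key_fields).values()
        for i in idxs[1:]
    }
    return [r for i, r in enumerate(records) if i not in to_drop]
```

`find_duplicates` gives you the candidate groups to check by hand; `drop_duplicates` is only worth running once you trust the chosen key.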
Metadata
You will need to be able to query your data, and for that you need to formalise your data governance layer. In AWS, solutions like Lake Formation and the AWS Glue Data Catalog facilitate this.
Data Governance
Data governance means setting internal standards (data policies) that apply to how data is gathered, stored, processed, and disposed of. It governs who can access what kinds of data and what kinds of data are under governance. It's important not just to understand how to exploit tooling like the AWS Glue Data Catalog, but to have documented approaches to how data is gathered, stored and processed, and to ensure business and technical staff understand them. You will need to define this per data source, data type and storage/consumption solution.
A quick start guide to Data Governance from AWS here:
https://aws.amazon.com/what-is/data-governance/
Tools for managing data quality
For early-stage business propositions and MVPs there is realistically a lot you can do with Python scripting, Lambdas, Step Functions and data frames. However, as soon as you start to scale these become very restrictive, and you will need to hire an architect and start to consider tooling in your technology roadmap.
A great list of open-source data-quality Python libraries can be found here:
If you’re an early-stage business, MVP or similar I would recommend starting there.
However, as soon as you start to scale you start to need industrial tooling. For big data you might consider solutions like Apache Griffin or AWS Deequ (which is built on Apache Spark, if you’re looking at enterprise data processing). AWS now also offers Glue Data Quality, a managed service built on Deequ.
https://docs.aws.amazon.com/glue/latest/ug/gs-data-quality-chapter.html
Other tools that could be worth considering…
Enterprise Scale and Data Integrity
For early-stage businesses with limited propositions, data quality is just as important as it is for later-stage organisations. But as your business scales to the enterprise level, you need to start to consider data integrity across all your business functions.
So…. what do we mean by data integrity?
“Data integrity is the overall accuracy, completeness, and consistency of data. Data integrity also refers to the safety of data in regard to regulatory compliance — such as GDPR compliance — and security. It is maintained by a collection of processes, rules, and standards implemented during the design phase.”
There are a number of lenses you need to apply to ensure that your data both is and remains complete, accurate and reliable.
Physical Integrity
How is your data stored, managed and retrieved? What happens if your data centre crashes, if the power goes down in your building (if you have any privately networked on-site servers) or if a disk drive crashes? Or if someone is simply daft and manages to delete or compromise your data in some way… what are your mechanisms for handling this?
Consider contracts and business/cyber insurance for network and data centre outages, training and procedures for staff members to ensure security risks and basic human errors are minimised, and escalation procedures if something does go wrong. Who is responsible, and how will you solve it? Document possible scenarios and tackle the risks one by one.
Logical Integrity
Logical integrity protects data from human error and hackers as well, but in a much different way than physical integrity does.
- Entity integrity. Entity integrity relies on the creation of primary keys to ensure that data isn’t listed more than once and that no primary key field in a table is null.
- Referential integrity. Referential integrity refers to the series of processes that make sure data is stored and used uniformly, typically by requiring that references between tables (foreign keys) point at records that exist. Rules may be embedded into your DB and may include, for example, constraints that eliminate the entry of duplicate data, guarantee that data entry is accurate, and/or disallow the entry of data that doesn’t apply.
- Domain integrity. Domain integrity is the collection of processes that ensure the accuracy of each piece of data in a domain, for example by applying constraints and other measures that limit the format, type, and amount of data entered.
- User-defined integrity. User-defined integrity involves the rules and constraints created by the user to fit their particular needs. This could mean incorporating specific business rules into your data integrity measures.
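The four kinds of logical integrity above map fairly directly onto database constraints. Here is a small sketch using SQLite from the Python standard library as a stand-in for whatever database you actually run. The `customer` and `purchase` tables, their columns and the status values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,           -- entity integrity
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE purchase (
    purchase_id  INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL
        REFERENCES customer(customer_id),      -- referential integrity
    amount_pence INTEGER NOT NULL
        CHECK (amount_pence > 0),              -- domain integrity
    status       TEXT NOT NULL
        CHECK (status IN ('pending', 'paid'))  -- user-defined business rule
);
""")


def try_insert(sql: str, params: tuple) -> str:
    """Run one insert and report whether the database accepted or rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"
```

With this in place, inserting a purchase for a customer that does not exist, a negative amount, or a status outside the allowed list is rejected by the database itself rather than relying on every application path to remember the rule.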
What data integrity is not
Data integrity is often confused with data security and data quality; however, each term has a distinct meaning.
Data integrity is not data security
Data security is a collection of measures, policies, solutions and procedures to ensure data is not corrupted or exploited for malicious use.
Data integrity is not data quality
Does the data in your database meet company-defined standards and the needs of your business? Data quality answers this and should be in line with your product proposition.
Data integrity and compliance
Data integrity is key to complying with data protection regulations (for example GDPR). With regulations increasing across the markets, having an internal map of all the compliance requirements, risk levels and approaches to auditing and ensuring ongoing compliance is essential. The integrity of your data is key to that approach.
Conclusion
Each one of these areas deserves an entire article on its own. If you don’t know where to start with Data Analytics or Machine Learning I hope this was a useful starter for ten. As a Consultant CTO, where I most consistently see technical leadership fail is in not considering the context. Most engineering leaders are former software developers, and a CTO role is really a business-based role as soon as you scale. A lack of basic project/programme delivery skills and architecture knowledge compounds this issue for many engineering leaders.
However, the basis of all good engineering is nailing the basics first and making sure they are robust and consistent.
If this is your first attempt at scaling I can honestly say it's an experience-based art form. One you will improve at over time. Good luck.