Introduction

Do you want to build, or work for, a data product company? This matters especially if you are a data engineer trying to build the Spark muscles to do heavy lifting at giant magnitudes. To get there, you need to understand how these companies optimize for the large cost of extracting information from data.

Without big data analytics, companies are blind and deaf, wandering out onto the Web like deer on a freeway. — Geoffrey Moore

What could go wrong with a house by the lake, right? Imagine you live there alone and the fish keep multiplying. You don’t own any fishing equipment that scales, so the lake gets polluted, and instead of earning returns from your fish you end up paying an even higher cost to the services company that will fix the lake for you.

  1. If we don’t design a scalable ingestion pipeline for our data lakehouse platform, we will lead the business into broken data lineage and high data management costs.
  2. The data team will end up doing more firefighting than innovation.
  3. To earn the business’s trust, the data platform must maintain quality, schema, lineage, freshness, and volume.
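Those trust requirements can be turned into concrete checks. Below is a minimal sketch, assuming a table is represented as a list of dicts with an `updated_at` timestamp; all function names and thresholds here are hypothetical, not any particular framework’s API.

```python
from datetime import datetime, timedelta

def check_freshness(rows, max_age_hours=24, now=None):
    """Pass only if the newest row is within the freshness SLA."""
    now = now or datetime.utcnow()
    newest = max(row["updated_at"] for row in rows)
    return (now - newest) <= timedelta(hours=max_age_hours)

def check_volume(row_count, expected, tolerance=0.2):
    """Pass only if today's row count is within tolerance of the expected volume."""
    return abs(row_count - expected) / expected <= tolerance

def check_schema(rows, required_columns):
    """Pass only if every row carries all required columns."""
    return all(required_columns <= row.keys() for row in rows)
```

In practice these checks would run after every ingestion batch and block publishing when one fails.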


CDC and hidden risks

CDC (Change Data Capture) is the idea that when relational or transactional tables are modified, you emit a stream of updates. This lets you keep copies in sync by capturing changes to tables as they happen.
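To make the idea concrete, here is a minimal sketch of replaying a CDC stream onto an in-memory replica. The event shape (`op`, `key`, `row`) is an assumption chosen for illustration, not the format of any specific CDC tool.

```python
def apply_cdc_event(table, event):
    """Apply one change event to a replica table keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        table[key] = event["row"]   # upsert the new row image
    elif op == "delete":
        table.pop(key, None)        # drop the row if present
    return table

def replay(events):
    """Replaying the stream in order keeps the replica in sync with the source."""
    table = {}
    for event in events:
        apply_cdc_event(table, event)
    return table
```

Real CDC systems add ordering guarantees, schema metadata, and exactly-once delivery on top of this basic loop.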

Although CDC has many advantages, there are also some problems that make it difficult:

The Problems with ingesting tabular data in a file format

  1. Hidden cost in data lakes:

    With the shift toward cloud object storage that followed AWS’s release of S3, data lakes became an affordable data warehousing alternative. The economics behind the budgeting decisions that shape data architecture depend on platform expectations and on the impact on the organization’s revenue. However, business analysts will never accept an excuse for stale data, and urgency in data retrieval requires fast computation, which implies high cost.

    When we store data in a file format like CSV or Parquet in cloud storage, we need a catalog to persist the addresses of all the files that represent a table, and a query engine to make the data accessible. Please read AWS’s cost modeling guide for data lakes here.
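A toy illustration of why the catalog is needed: a “table” in object storage is just a set of files, and something must track which files are current. The table name, paths, and catalog layout below are hypothetical.

```python
# A minimal file-based "catalog": a table is whatever files the catalog lists.
catalog = {
    "sales": {
        "format": "parquet",
        "files": [
            "s3://lake/sales/part-000.parquet",
            "s3://lake/sales/part-001.parquet",
        ],
    }
}

def plan_scan(catalog, table_name):
    """A query engine resolves a table name to the exact files it must read."""
    entry = catalog[table_name]
    return entry["files"]
```

If the catalog entry drifts out of sync with the files actually in storage, queries silently read stale or partial data, which is one of the hidden costs the section describes.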


  2. Table maintenance and recovery time

    1. Organizations have huge fact tables that are almost never immutable. In simple business terms, real-world facts change. A user might have made a mistake and can ask the business to update how that information is remembered. A business might lazily process certain orders once a month and update all the stacked cases at once. A compliance requirement might force you to delete facts for one entity across years of data. No matter how loudly you scream “bad practice,” as a data engineer you’ll have to learn to solve for these use cases.
    2. Large updates arrive in a skewed fashion: the majority of them may belong to just a few partitions.
    3. Dimension tables have frequent schema changes that require a complete rewrite of files. This risks breaking downstream queries or losing backward compatibility.
    4. If a routine ingestion job fails, recovery might require a complete reload when an old version of the table is not maintained. Such failures are often caused by bad partitioning or too many small files.
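Point 1 above is easy to reproduce locally: immutable file formats offer no in-place delete, so removing one entity’s facts means rewriting every affected file. A minimal sketch using CSV (the same constraint applies to Parquet); the column names are made up for the example:

```python
import csv
import io

def delete_entity(csv_text, entity_id):
    """'Delete' rows for one entity by rewriting the whole file without them."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [row for row in reader if row["entity_id"] != entity_id]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(kept)
    # The result is a brand-new file: every surviving row was copied.
    return out.getvalue()
```

Scale this to years of partitioned history and the cost of a single compliance delete becomes obvious.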

Getting Started

It helps to take a local-first approach when learning a concept, so let’s run everything locally and replicate the issues.


Every data pipeline can be expressed as a source, a pipe, and a sink.
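That decomposition can be expressed as a trivial composition of functions — a sketch of the mental model, not a framework:

```python
def run_pipeline(source, pipes, sink):
    """source yields records, each pipe transforms the stream, sink consumes it."""
    stream = source()
    for pipe in pipes:
        stream = pipe(stream)
    return sink(stream)

# Example: read numbers, double them, collect the result.
source = lambda: iter([1, 2, 3])
double = lambda stream: (x * 2 for x in stream)
collect = list
```

Spark gives you the same shape at scale: a reader as the source, transformations as the pipes, and a writer as the sink.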

We will run the pipeline in a Spark notebook and write the data to MinIO storage hosted inside Docker on our laptop.
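A sketch of the session configuration that points Spark’s S3A connector at a local MinIO container. The endpoint, bucket name, and credentials are placeholders for whatever your Docker setup defines (the keys shown are MinIO’s defaults); treat this as a config fragment rather than a tested setup, and note that the S3A connector jars must be on Spark’s classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-lake-ingestion")
    # Point the S3A filesystem at the MinIO container instead of AWS.
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")        # MinIO default
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")        # MinIO default
    .config("spark.hadoop.fs.s3a.path.style.access", "true")       # required for MinIO
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

# Writes then target the local "lake", e.g. (bucket name is a placeholder):
# df.write.parquet("s3a://my-bucket/sales/")
```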

<aside> 💡 How many x rows, y columns and z chars make a GB of data?

</aside>
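A back-of-the-envelope answer to the aside, assuming plain ASCII text where one character is one byte and ignoring delimiters, newlines, and any compression:

```python
def rows_per_gb(columns, chars_per_value):
    """Approximate rows that fit in 1 GB of raw ASCII text (1 char = 1 byte)."""
    gb = 1024 ** 3
    bytes_per_row = columns * chars_per_value  # ignores delimiters and newlines
    return gb // bytes_per_row

# e.g. 10 columns of 10-character values is roughly 100 bytes per row,
# so on the order of ten million rows per GB.
```

Columnar formats like Parquet compress and encode values, so real files hold far more rows per GB than this raw-text estimate.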