Hello everyone,
I'm currently developing a data ingestion pipeline for three regions: APAC, UK, and US, each with multiple SQL servers. Each server manages approximately 400 tables, as shown in the attached screenshot.
Notes:
1. The table schema is consistent across all servers.
2. There is no data residency requirement; data from all regions can be stored and analyzed together.
3. Unfortunately, I can't use the Database Mirroring or SQL Server CDC features due to IT security restrictions.
The goal is to extract delta (incremental) data from each server, merge it into a single Delta table, and perform transformations following a medallion architecture.
I'm considering a metadata-driven pipeline with a hybrid approach (data pipelines for extraction, PySpark notebooks for merging):
1. Store the metadata of each table in a Lakehouse or Data Warehouse (a possible layout is sketched in the first example after this list).
2. The data pipeline references this metadata table to iterate through each source table, extracting delta data from each server and landing it in OneLake (see the extraction sketch below).
3. Use PySpark notebooks to merge the delta data from all servers into the target Delta tables (see the merge sketch below).
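
For reference, here is a minimal sketch of what the metadata (control) table could look like, written as a PySpark cell. The column names (server_name, schema_name, table_name, watermark_column, last_watermark) and the table_metadata name are only illustrative assumptions, not a fixed design.

```python
# Minimal sketch of a metadata (control) table maintained in a Lakehouse.
# Column names and the table name are illustrative assumptions only.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

metadata_rows = [
    Row(server_name="apac-sql-01", schema_name="dbo", table_name="orders",
        watermark_column="last_modified", last_watermark="2024-01-01T00:00:00"),
    Row(server_name="uk-sql-01", schema_name="dbo", table_name="orders",
        watermark_column="last_modified", last_watermark="2024-01-01T00:00:00"),
]

(spark.createDataFrame(metadata_rows)
      .write.format("delta")
      .mode("overwrite")
      .saveAsTable("table_metadata"))  # hypothetical control table name
```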
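Building on that, a hedged sketch of the per-table extraction using a high-watermark filter (since CDC and mirroring are not available). The JDBC connection details, database name, credentials, and OneLake paths are placeholders; in practice this step could equally be a pipeline Copy activity driven by a ForEach over the same metadata.

```python
# Sketch of per-table delta extraction using a high-watermark filter.
# Server/database names, credentials, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

for m in spark.table("table_metadata").collect():
    # Pull only rows changed since the last recorded watermark
    query = (f"SELECT * FROM {m.schema_name}.{m.table_name} "
             f"WHERE {m.watermark_column} > '{m.last_watermark}'")

    delta_df = (spark.read.format("jdbc")
                .option("url", f"jdbc:sqlserver://{m.server_name};databaseName=SalesDB")
                .option("query", query)
                .option("user", "<service-account>")   # keep secrets in a key vault, not in code
                .option("password", "<secret>")
                .load()
                .withColumn("source_server", lit(m.server_name)))

    # Land the raw extract in the bronze layer of OneLake (placeholder path)
    (delta_df.write.format("delta")
             .mode("append")
             .save(f"Tables/bronze_{m.table_name}_{m.server_name.replace('-', '_')}"))

    # Afterwards, write the new high watermark back to the metadata table (omitted here)
```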
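Finally, a sketch of the notebook merge step using the Delta Lake MERGE API. The order_id key, the source_server column, and the table names are hypothetical; the real merge keys would come from the metadata table.

```python
# Sketch of merging extracted rows from all servers into one target Delta table.
# The merge keys (order_id, source_server) and table names are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Placeholder path to one bronze extract produced by the extraction step
updates_df = spark.read.format("delta").load("Tables/bronze_orders_apac_sql_01")

target = DeltaTable.forName(spark, "silver_orders")  # hypothetical target table

(target.alias("t")
       .merge(updates_df.alias("s"),
              "t.order_id = s.order_id AND t.source_server = s.source_server")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```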
I'm seeking an efficient and scalable way to implement this in Microsoft Fabric. I would appreciate your thoughts and comments.