File-based Incremental Loading — A Practical Approach with MS Fabric, dltHub, MotherDuck & Python
Context
Think of writing a journal: instead of rewriting everything from page one every time you want to add something new, you simply continue writing from where you left off.
That’s how incremental ingestion works. Instead of copying all the data again, you add only the new or updated information, saving time and effort.
dlt from dlthub.com helps solve the incremental ingestion problem by automating how new or changed data is added to databases.
How does dlt solve this?
- Detects new data: It identifies what’s new or updated instead of copying everything again.
- Processes data efficiently: It cleans, transforms, and loads only the required data, making the process faster.
- Automates the workflow: It handles the entire pipeline, so engineers don’t have to manually track changes.
- Supports multiple sources: It works with databases, cloud storage, APIs, and more to ingest data seamlessly.
In short, dlt makes incremental ingestion easy by handling the complex work behind the scenes, keeping data fresh without wasting time or resources.
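To make that concrete, here is a minimal, hedged sketch of a dlt incremental resource. The resource, pipeline, and cursor field names (events, updated_at) are illustrative and not part of the example built later in this article:

```python
import dlt

# Hypothetical resource: the "updated_at" cursor tells dlt what is new since the last run.
@dlt.resource(write_disposition="append")
def events(updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01")):
    # In a real pipeline this would call an API or read files; records at or below the
    # stored cursor value are skipped automatically on subsequent runs.
    yield {"id": 1, "updated_at": "2025-01-15"}
    yield {"id": 2, "updated_at": "2025-02-01"}

pipeline = dlt.pipeline(pipeline_name="incremental_demo", destination="duckdb", dataset_name="demo")
print(pipeline.run(events()))
```

Running this twice appends the two records once; the second run finds nothing newer than the stored cursor and loads nothing.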
How the filesystem source works (adapted from the dltHub documentation)
The Filesystem source doesn’t just give you an easy way to load data from both remote and local files — it also comes with a powerful set of tools that let you customize the loading process to fit your specific needs.
The filesystem source loads data in two steps:
- It accesses the files in your remote or local file storage without actually reading the content yet. At this point, you can filter files by metadata or name. You can also set up incremental loading to load only new files.
- The transformer reads the files’ content and yields the records. At this step, you can filter out the actual data, enrich records with metadata from files, or perform incremental loading based on the file content.
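A rough sketch of those two steps is below, assuming the filesystem source is installed (pip install "dlt[filesystem]"); the folder path and table name are placeholders:

```python
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# Step 1: enumerate matching files (metadata only - nothing is read yet)
files = filesystem(bucket_url="/path/to/landing_zone", file_glob="*.csv")

# Step 2: the read_csv transformer opens each file and yields its records
records = (files | read_csv()).with_name("my_table")

pipeline = dlt.pipeline(pipeline_name="fs_demo", destination="duckdb", dataset_name="raw")
print(pipeline.run(records, write_disposition="append"))
```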
Common data landing pattern
In modern data engineering pipelines, ingestion mechanisms play a critical role, especially when working with raw data coming from systems like SAP. One such approach is a pull-based mechanism, where data lands in a dedicated Data Landing Zone (completely separate from the core data engineering area).
Once data arrives in the landing zone, the goal is to incrementally ingest only the latest files (new or modified) into the raw layer of the Medallion Architecture — ensuring data is append-only and aligned with data lake best practices.
The key question becomes:
How do you know which files have already been ingested and which are still pending?
Overall Solution Workflow
This is not about the internal workings of MS Fabric, but rather a hands-on practical demonstration of how you can track, detect, and ingest new/modified files into your pipeline in a robust and repeatable way.
The goal is to track new/modified files arriving in the Fabric Lakehouse (the Data Landing Zone) and append their contents to MotherDuck.
In this approach, you’ll be using a pure Python-based method to handle incremental file loading and ingestion within the MS Fabric environment.
📓 Example notebook (along with sample files) is available at: https://github.com/sketchmyview/fabric-dlthub-demo/tree/main/incremental-ingestion
You have three repossession files containing data for the UK market. The goal is to upload these files one at a time, while a Python script tracks and ingests only the latest file into MotherDuck, ensuring no duplicate ingestion.
Step 1: Setup the Environment
- Import the example notebook from the GitHub repo.
- Securely retrieve your MotherDuck secret from Azure Key Vault (because security comes first!).
- Provide the AKV name and configure the notebook for execution.
- Finally, set your lakehouse `bucket_url`, which acts as the source location where files will be uploaded.
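A hedged sketch of what this setup step can look like inside a Fabric notebook. The Key Vault URL and secret name are placeholders, `notebookutils` is assumed to be the utility module available in the Fabric notebook runtime, and the environment variable is one of several ways dlt can receive the MotherDuck token (a secrets.toml entry works too):

```python
import os
import notebookutils  # assumed available inside MS Fabric notebooks

# Placeholders: replace with your Key Vault URL and secret name
akv_url = "https://<your-akv-name>.vault.azure.net/"
md_token = notebookutils.credentials.getSecret(akv_url, "motherduck-token")

# Hand the MotherDuck token to dlt via environment-style configuration
os.environ["DESTINATION__MOTHERDUCK__CREDENTIALS__PASSWORD"] = md_token

# Lakehouse folder acting as the data landing zone (source location for uploads)
bucket_url = "/lakehouse/default/Files/autoloader/repossession_data"
```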
Step 2: First File Upload & Ingestion
- Upload `Repossession-2025-09.csv` to the data landing zone.
- Run the Python script using “Run All”.
- If everything works correctly, dlt will track the file, and you’ll see 10 rows appended to MotherDuck.
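Under the hood of “Run All”, the core of the script looks roughly like the sketch below, assuming `dlt[motherduck]` and the filesystem source are installed; the pipeline, dataset, and table names are illustrative, and the incremental hint mirrors the snippet shown later in this article:

```python
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# Enumerate CSV files in the landing zone and track them by modification_date
files = filesystem(bucket_url=bucket_url, file_glob="*.csv")
files.apply_hints(incremental=dlt.sources.incremental("modification_date"))

# Read the new files and append their rows to MotherDuck
pipeline = dlt.pipeline(
    pipeline_name="repossession_ingest",
    destination="motherduck",
    dataset_name="raw",
)
load_info = pipeline.run((files | read_csv()).with_name("repossession"), write_disposition="append")
print(load_info)  # for the first file, 10 rows should land in MotherDuck
```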
Step 3: Second File Upload & Ingestion
- Upload `Repossession-2024-12.csv` to the landing zone.
- Run the script again with “Run All”.
- This time, dlt detects that the previous file has already been ingested.
- Only the new file gets processed, adding 2 new rows to MotherDuck.
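If you want to verify the running totals after each upload, one option (a sketch, assuming the pipeline and table names from the previous sketch) is to query the destination through dlt's SQL client:

```python
# Count rows in the target table after each run ("repossession" is the illustrative table name)
with pipeline.sql_client() as client:
    with client.execute_query("SELECT count(*) FROM repossession") as cursor:
        print(cursor.fetchone())  # 10 after the first file, 12 after the second, and so on
```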
Step 4: Third File Upload & Ingestion
- Upload `Repossession-2024-11.csv` to the landing zone.
- Run the script once more using “Run All”.
- dlt verifies that the first two files are already processed and ingests only the new file, appending 1111 rows to MotherDuck.
If you run the Python code again, you will see no updates and hence no new records appended.
What’s Happening Under the Hood?
- The `bucket_url` points to a specific directory (in this case, `repossession_data`).
- dlt continuously monitors this directory and tracks changes using metadata.
- File-based ingestion works by leveraging `modification_date` (or another attribute supported by the file system).
- By configuring `apply_hints` and specifying `modification_date`, dlt can accurately identify new files requiring ingestion.
```python
import dlt
from dlt.sources.filesystem import filesystem

# Point at the CSV files in the landing-zone folder (metadata only at this stage)
filesystem_pipe = filesystem(
    bucket_url="/lakehouse/default/Files/autoloader/repossession_data",
    file_glob="*.csv"
)
# Incremental hint: only files with a newer modification_date are loaded on each run
filesystem_pipe.apply_hints(incremental=dlt.sources.incremental("modification_date"))
```
Beyond Basics
This example demonstrates a foundational pattern, but dlt can be further configured to apply custom filters based on additional requirements — giving you the flexibility to handle more complex scenarios if needed. Check out the dlt documentation at
https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem/basic
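For example, a custom filter can restrict which files are even considered. This is a hedged sketch building on the `filesystem_pipe` resource above; the filename rule is purely illustrative:

```python
# Keep only files whose name matches an additional, custom rule
filesystem_pipe.add_filter(lambda item: item["file_name"].startswith("Repossession-"))
```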
I hope you have enjoyed reading through.