Setting up ingestion with MS Fabric, dltHub, and Lakehouse
This is the first part of a six-part series on building end-to-end analytics in Microsoft Fabric — without Spark.
In this post, you’ll focus on the Ingestion phase of the ELT process, where raw data is extracted from various sources and loaded into a centralized repository such as a Lakehouse.
The Focus Area
In this part of the series, you will:
✅ Ingest the Northwind dataset into Microsoft Fabric using dltHub
✅ Store the data as JSONL files in a Lakehouse
✅ Leverage Fabric’s Python environment for orchestration and automation
Throughout the series, you’ll build the Northwind analytics solution using the Medallion Architecture.
Reference: https://learn.microsoft.com/en-us/training/modules/describe-medallion-architecture/
GitHub repo
All the necessary code can be found in the GitHub repository: [https://github.com/sketchmyview/fabric-dlthub-series] 🚀
Why Not Spark?
Not every analytics workload involves big data, and for smaller ones, bringing in Spark adds unnecessary complexity. Microsoft Fabric offers a pure Python environment, which intrigued me when I first explored it.
With Fabric’s built-in orchestration, monitoring, security, lakehouse, and warehouse capabilities, you can execute small to medium-scale analytics workflows without leaving the ecosystem.
Why dltHub for Ingestion?
dltHub simplifies data ingestion by:
✔ Automating schema creation and file compression
✔ Supporting pure Python-based ingestion tasks
✔ Seamlessly integrating with Fabric Pipelines for enhanced automation
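To make this concrete, here is a minimal sketch of what a dltHub ingestion notebook could look like. This is not the exact code from the repo: the endpoint URL is a placeholder, the table list is abbreviated, and it assumes the notebook has a default Lakehouse attached (mounted at /lakehouse/default) so dlt’s filesystem destination can write compressed JSONL files into the Files area.

```python
import os
import dlt
from dlt.sources.helpers import requests

# Assumptions: placeholder endpoint and an abbreviated table list; in the real
# notebook these values arrive as pipeline parameters.
NORTHWIND_ENDPOINT = "https://<northwind-api-endpoint>"
TABLES = ["categories", "customers", "orders"]

# Point dlt's filesystem destination at the attached Lakehouse's Files area
# (mounted at /lakehouse/default in Fabric notebooks).
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "/lakehouse/default/Files/bronze"

def northwind_table(endpoint: str, name: str):
    """Yield the rows of one Northwind entity from the REST API."""
    response = requests.get(f"{endpoint}/{name}")
    response.raise_for_status()
    yield from response.json()

pipeline = dlt.pipeline(
    pipeline_name="northwind_bronze",
    destination="filesystem",
    dataset_name="raw_northwind",
)

for table in TABLES:
    load_info = pipeline.run(
        northwind_table(NORTHWIND_ENDPOINT, table),
        table_name=table,
        loader_file_format="jsonl",   # compressed JSONL files in the bronze layer
        write_disposition="replace",
    )
    print(load_info)
```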
What’s Next?
Once the JSONL files are stored in the bronze layer, you’ll convert them into Delta tables. Although dltHub offers this functionality, the goal is to explore what Fabric’s Python environment provides out of the box.
Interestingly, Fabric’s Python environment comes pre-installed with popular Python packages and even includes Spark, allowing for further experimentation. If needed, you can even install the latest Delta Lake version, making Fabric a powerful and flexible data engineering platform.
The highlighted area in the diagram represents the ingestion phase of the ELT process within Microsoft Fabric.
The primary goal is to ingest the Northwind dataset into Fabric using dltHub, store it as JSONL files in a Lakehouse, and then explore Fabric’s Python environment.
This series is designed to be practical and hands-on, enabling anyone with some data engineering knowledge to follow along.
While I won’t be diving into software engineering best practices like Git workflows, DataOps, or full end-to-end solution architecture, I will be asking you to step up and explore those practices throughout this series.
The goal is to keep things approachable while leaving room for you to apply best practices as needed.
How do you start?
Begin by signing up for Microsoft Fabric at https://app.fabric.microsoft.com/.
Keep in mind that a business email address is required. The free 60-day trial should give you ample time to explore and familiarize yourself with all its features.
After signing up, your screen will look like this. I recommend creating your own workspace to get started.
Creating new items
For this practical exercise, you will set up:
1️⃣ A Pipeline — To orchestrate Python-based ingestion jobs.
2️⃣ A Lakehouse — To store data ingested from the source (Northwind REST API).
3️⃣ Python Notebooks — Where the ingestion logic is implemented using dltHub.
Creating pipelines and the parameter concept
In this example, you’ll create a single pipeline with two activities:
1️⃣ The first activity calls a Python notebook to execute the ingestion jobs.
2️⃣ The second activity iterates through all necessary folders in the Northwind bronze lakehouse, runs a Python job again, and creates Delta tables.
In real-world scenarios, you often need to pass dynamic values to Python jobs. In Microsoft Fabric, these are known as parameters, a concept borrowed from Azure Data Factory (ADF).
For this exercise, you’ll pass three dynamic parameters:
✅ Northwind REST API endpoint
✅ Table names (as an array of values)
✅ Fabric workspace name
This is a standout feature of the Fabric Python environment, inherited from Synapse Analytics Spark and Microsoft Fabric Spark, enhancing the overall development experience.
More on parameters: https://learn.microsoft.com/en-us/fabric/data-factory/parameters
More on Fabric Python experience: https://learn.microsoft.com/en-us/fabric/data-engineering/using-python-experience-on-notebook
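Inside the notebook itself, these values arrive through a parameter cell (marked via the cell’s “Toggle parameter cell” option), whose variables the pipeline’s Notebook activity overrides at run time. A minimal sketch, with illustrative names and a placeholder endpoint:

```python
import json

# Parameter cell: mark this cell as a parameter cell in the Fabric notebook so the
# pipeline's Notebook activity can override these defaults at run time.
northwind_endpoint = "https://<northwind-api-endpoint>"   # placeholder
raw_tables = '["raw__northwind__categories", "raw__northwind__orders"]'  # arrays are typically passed as a JSON string
workspace_name = "<your-fabric-workspace>"

# Parse the table list back into a Python list for the ingestion loop.
table_list = json.loads(raw_tables)
```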
Parameter references
The parameters created in the previous steps need to be passed down to the activities inside the pipeline. This is done by selecting “Add dynamic content” inside the parameter textbox.
Additionally, Pipelines offer an advanced expression builder, allowing for more complex logic — a very cool and powerful feature!
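For example, a Notebook activity parameter can reference the pipeline-level values with an expression such as @pipeline().parameters.northwind_endpoint, or wrap them in functions like @string(pipeline().parameters.raw_tables) inside the expression builder (the parameter names here are illustrative).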
Export and import schema
One of the standout features of dltHub is its ability to export and import schemas, making it a powerful tool for metadata-driven ingestion.
While dltHub automatically generates schemas for you, you also have the flexibility to modify or customize them to fit your specific needs. There is a self-assignment on this topic later in the post.
Reference: https://dlthub.com/docs/walkthroughs/adjust-a-schema
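As a rough sketch of how this looks in code (paths and names are illustrative, following the walkthrough linked above): passing export_schema_path and import_schema_path to dlt.pipeline() makes dlt write the inferred schema as YAML after a run and pick up your edited copy on subsequent runs.

```python
import dlt

# Illustrative paths: dlt exports the inferred schema as YAML to export_schema_path
# after each run; place an edited copy under import_schema_path and later runs will
# apply your adjustments (renamed columns, type hints, excluded fields, and so on).
pipeline = dlt.pipeline(
    pipeline_name="northwind_bronze",
    destination="filesystem",
    dataset_name="raw_northwind",
    export_schema_path="schemas/export",
    import_schema_path="schemas/import",
)
```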
Tip: This schema is also useful for creating and maintaining an up-to-date data catalog. Microsoft Purview is a great option, and Microsoft Fabric integrates seamlessly with it.
Microsoft Purview’s data catalog is built on Apache Atlas https://atlas.apache.org/#/ and integrates well with the Apache Atlas API. This means that if you plan to run dltHub locally, you can deploy Apache Atlas in a Docker container to experiment with data cataloging and governance features. Since Apache Atlas is a production-grade solution, it can be used beyond just a local test environment.
Ingestion completion
After the first activity in the pipeline has executed, you should see the bronze layer successfully created. Give yourself a pat on the back: you have put Lakehouse concepts to use successfully.
Two cents on Fabric Lakehouse
1️⃣ Unified Storage for Structured & Unstructured Data
- Stores a variety of file formats, including CSV, Parquet, JSON, JSONL, Avro, and more.
- Supports both raw and processed data layers (e.g., bronze, silver, and gold layers).
2️⃣ Direct Integration with OneLake
- OneLake is the underlying storage foundation of Microsoft Fabric.
- Eliminates the need for separate data silos by allowing centralized data access across Fabric workloads.
3️⃣ Delta Lake Support for ACID Transactions
- Enables schema enforcement, versioning, and time travel through Delta tables.
- Ensures data consistency when multiple users or processes interact with the same dataset.
4️⃣ No-Copy Data Access Across Fabric Workloads
- Works seamlessly with Notebooks, SQL Analytics, Data Engineering, and Power BI without needing to move or duplicate data.
- Reduces storage costs and increases query efficiency.
5️⃣ Managed & Self-Service Data Exploration
- Users can browse, preview, and analyze files directly from the Fabric UI.
- Supports metadata-driven governance and integration with Microsoft Purview for data lineage and cataloging.
6️⃣ Optimized for Performance & Scalability
- Automatic indexing, caching, and optimization ensure fast queries on large datasets.
- Compatible with Spark, SQL, and Python-based workloads within Fabric.
OneLake explorer
OneLake Explorer is a file management tool within Microsoft Fabric that allows users to navigate, organize, and interact with data stored in OneLake — the unified storage layer in Fabric.
It provides a familiar file explorer-like experience while offering deep integration with Lakehouses, Warehouses, Notebooks, Pipelines, and Power BI.
Download it from https://www.microsoft.com/en-us/download/details.aspx?id=105222
Create Delta tables from Lakehouse files
The Northwind bronze layer, created via dltHub pipelines, is stored in a compressed JSONL format. The second activity in the pipeline iterates through the specified Lakehouse folders and generates Delta tables under the “Tables” section.
One of the parameters, raw_tables, is an array of folder names within the bronze layer. This list can be sourced from a CSV file or dynamically generated during dltHub pipeline execution — a task worth exploring. The Fabric pipeline can then reference this list using a lookup function for further processing.
Example array:
["raw__northwind__categories", "raw__northwind__category_details", "raw__northwind__customers", "raw__northwind__employees", "raw__northwind__employee_territories", "raw__northwind__order_details", "raw__northwind__orders", "raw__northwind__products", "raw__northwind__regions", "raw__northwind__shippers", "raw__northwind__suppliers", "raw__northwind__territories"]
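A rough sketch of what the second notebook could do, assuming the bronze files are gzip-compressed JSONL under Files/bronze/<table>/ and that the default Lakehouse is attached. It uses pandas and the deltalake package (installable if not already present in the Fabric Python environment); adjust the paths, glob pattern, and compression to whatever dlt actually produced.

```python
import glob
import pandas as pd
from deltalake import write_deltalake

# Illustrative value: in the pipeline, this list arrives as the raw_tables parameter.
raw_tables = ["raw__northwind__categories", "raw__northwind__orders"]

for table in raw_tables:
    # The dlt filesystem destination writes one folder per table under the bronze area.
    files = glob.glob(f"/lakehouse/default/Files/bronze/{table}/*.jsonl*")
    if not files:
        continue

    # Read the (gzip-compressed) JSONL files into a single DataFrame.
    df = pd.concat(
        (pd.read_json(f, lines=True, compression="gzip") for f in files),
        ignore_index=True,
    )

    # Writing under Tables/ surfaces the result as a Delta table in the Lakehouse.
    write_deltalake(f"/lakehouse/default/Tables/{table}", df, mode="overwrite")
```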
View run histories and monitoring pipeline runs
Run history and pipeline monitoring are seamlessly integrated into the Microsoft ecosystem, providing visibility into pipeline execution details.
You can click on a specific run to open a dialog box with detailed insights and links for further exploration.
Examine raw data in Power BI
One of the standout features in Microsoft Fabric is the Lakehouse SQL Endpoint, which enables seamless connectivity between your Fabric Lakehouse and Power BI (or any client that supports SQL endpoints).
By clicking on the three-dot menu (ellipsis), you can retrieve the SQL endpoint to establish a connection from Power BI or other tools.
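Once connected, a quick sanity check such as SELECT TOP 10 * FROM raw__northwind__orders (table names follow the array shown earlier) confirms the bronze data is queryable before you start modeling it in Power BI.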
Unleash Your Creativity
Congratulations! You’ve successfully completed “Ingestion with Microsoft Fabric, dltHub, and Lakehouse.”
If you’re eager to dive deeper, here are some self-assignments to further enhance your learning.
Feel free to share your findings with others along the way! 🚀
1️⃣ Simplify the process — Use dltHub to create Delta tables while eliminating unnecessary steps. (ref — https://dlthub.com/docs/dlt-ecosystem/table-formats/delta)
2️⃣ Modify metadata for compliance — Adjust the schema to remove PII data from the employee table. (ref — https://dlthub.com/docs/general-usage/schema)
3️⃣ Trace and visualize data flow — Capture data lineage and create an insightful visualization in Power BI. (ref — https://dlthub.com/docs/general-usage/destination-tables)
4️⃣ Automate with Scheduling & Incremental Loading — Configure the Fabric pipeline to run the ingestion process daily at 6:00 AM UTC and modify the dltHub Python scripts to support incremental loading; a starting-point sketch for the incremental part is shown below. (ref — https://dlthub.com/docs/general-usage/incremental-loading)
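For the incremental-loading part of assignment 4, a rough starting point (not a full solution) is dlt’s incremental cursor. The endpoint, the orderDate field, and the since query parameter below are assumptions to adapt to the actual Northwind API:

```python
import dlt
from dlt.sources.helpers import requests

@dlt.resource(name="orders", write_disposition="append")
def orders(order_date=dlt.sources.incremental("orderDate", initial_value="1996-01-01")):
    # dlt stores the highest "orderDate" seen so far; use it to request only newer rows.
    response = requests.get(
        "https://<northwind-api-endpoint>/orders",
        params={"since": order_date.last_value},  # hypothetical query parameter
    )
    response.raise_for_status()
    yield from response.json()
```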
How to use the GitHub repo
All the necessary code can be found in the GitHub repository, organized by week: [https://github.com/sketchmyview/fabric-dlthub-series] 🚀
Import the pipeline and notebooks from their respective Pipeline and Notebook folders into the Fabric environment.