Ditch the Overhead: Metadata-Driven Data Transformation with DuckDB & Polars in Microsoft Fabric
Context
When building a data platform — whether you’re working alone as a data engineer or as part of a team — you have probably run into a few common challenges. These issues seem especially prevalent among engineers coming from a pure SQL background; at least, that’s been my observation so far.
This discussion follows Part 1 of the series: Setting Up Ingestion in MS Fabric DLT Hub Lakehouse.
Now, you’ve ingested data into a lakehouse and are about to build your clean layer. This step involves data quality checks, transformations, and several other processing tasks. However, as you progress, you may run into the following challenges:
- One Notebook per Transformation — Each transformation is handled in a separate notebook, leading to fragmentation.
- Standardizing the Development Process — even with a structured workflow in place, the fragmentation from point #1 introduces inefficiencies.
- Code Duplication Across Workspaces — Redundant code makes maintenance and scalability difficult.
Sound familiar?
Let’s embrace Metadata-Driven Data Transformation
What is metadata-driven data transformation?
This approach defines transformation logic using metadata — essentially “data about data” — rather than hardcoding it into applications. This allows for more flexible and dynamic transformations, making it easier to manage and adapt as data sources and requirements evolve.
In simple terms: Instead of writing separate Python scripts for each transformation, you use a single script or notebook in Microsoft Fabric. The transformation logic is dictated by a metadata file, which contains details such as:
- SQL queries to execute
- Source file locations
- Fabric workspace name
- Target Delta table name
- Other relevant transformation parameters
Here’s what a metadata file looks like:
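As a rough illustration, a metadata CSV for this setup might look something like the following (the column names, paths, and queries below are placeholder assumptions for the sake of the example, not the exact schema used in the repo):

```
sql_query,source_path,workspace_name,target_table
"SELECT * FROM read_parquet('{source_path}') WHERE is_active = true",/lakehouse/default/Files/raw/customers/*.parquet,sales_workspace,dim_customer
"SELECT order_id, SUM(amount) AS total_amount FROM read_parquet('{source_path}') GROUP BY order_id",/lakehouse/default/Files/raw/orders/*.parquet,sales_workspace,fact_orders
```

Each row drives one transformation: the `{source_path}` placeholder in the query is filled in at runtime before the query is executed.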
Why DuckDB & Polars Instead of Spark?
One of the things I appreciate about Microsoft Fabric is that it offers a pure Python environment, which comes with built-in libraries like DuckDB, Polars, Matplotlib, and many others in addition to Spark. For moderate-sized datasets, this provides an ideal environment for developing mid-scale data analytics products without the overhead of a full-scale distributed system.
That said, if I were working with big data analytics, I would absolutely choose Spark. However, Fabric offers a compelling alternative, enabling enterprise-grade orchestration through Fabric Pipelines while ensuring robust observability within the Fabric ecosystem.
What Are We Building?
Here’s a high-level view of the workflow (solid colour box):
Orchestration Part
Once data ingestion is completed using dlt from dltHub, the pipeline reads the metadata CSV file, iterates through its rows, and passes each row to a transformation job, which runs inside a Python notebook.
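In Fabric this loop is handled by pipeline activities (a Lookup over the metadata file feeding a ForEach that invokes the notebook), but conceptually it reduces to something like the Python sketch below, where `run_transformation` is the function described in the next section and the file path and column names are assumptions:

```python
import polars as pl

# Read the metadata CSV that drives all transformations
# (path and column names are illustrative).
metadata = pl.read_csv("/lakehouse/default/Files/metadata/transformations.csv")

# Hand each metadata row to the transformation job, mirroring
# what the pipeline's ForEach activity does with the notebook.
for row in metadata.iter_rows(named=True):
    run_transformation(
        sql_query=row["sql_query"],
        source_path=row["source_path"],
        target_table=row["target_table"],
    )
```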
Data Transformation Part
Once the transformation job receives the metadata, it extracts the relevant details and forwards them to a Python function.
This function takes in key parameters such as:
- SQL queries to execute
- Source file locations
- Fabric workspace name
- Target Delta table name
- Other transformation-related configuration metadata coming from the Fabric pipeline
The function then processes the SQL query stored in the metadata, injects the necessary parameters, and executes it using DuckDB. Finally, the transformed data is saved as a Delta table using Polars.
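As a minimal sketch of that idea (the function and parameter names, the lakehouse mount path, and the `{source_path}` placeholder convention are assumptions, not the exact code from the repo):

```python
import duckdb
import polars as pl

def run_transformation(sql_query: str, source_path: str, target_table: str) -> None:
    """Run a metadata-supplied SQL query with DuckDB and save the result
    as a Delta table with Polars (illustrative sketch)."""
    # Inject the source file location into the SQL template from the metadata.
    query = sql_query.format(source_path=source_path)

    # DuckDB executes the query in-process and returns an Arrow table,
    # which Polars wraps without copying the data.
    df = pl.from_arrow(duckdb.sql(query).arrow())

    # Persist the result to the lakehouse Tables area as a Delta table
    # (assumes the default lakehouse is mounted at /lakehouse/default).
    df.write_delta(f"/lakehouse/default/Tables/{target_table}", mode="overwrite")
```

Because DuckDB hands the result over as Arrow, the handoff to Polars is essentially free, and `write_delta` produces a Delta table that the lakehouse and its SQL endpoint can read.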
This approach eliminates unnecessary overhead, making it lightweight, efficient, and well-suited for moderate-scale data processing needs.
Minimal Maintenance Overhead
On top of that, this approach significantly reduces maintenance overhead. By leveraging a metadata-driven framework, transformations become more modular, reusable, and scalable, eliminating the need for managing multiple notebooks or redundant code across workspaces.
View run histories and monitor pipeline runs
Run history and pipeline monitoring are seamlessly integrated into the Microsoft ecosystem, providing visibility into pipeline execution details.
You can click on a specific run to open a dialog box with detailed insights and links for further exploration.
Examine data in Power BI
One of the standout features in Microsoft Fabric is the Lakehouse SQL Endpoint, which enables seamless connectivity between your Fabric Lakehouse and Power BI (or any client that supports SQL endpoints).
By clicking on the three-dot menu (ellipsis), you can retrieve the SQL endpoint to establish a connection from Power BI or other tools.
Unleash Your Creativity
Congratulations! You’ve successfully completed “Ditch the Overhead: Metadata-Driven Data Transformation with DuckDB & Polars in Microsoft Fabric”.
If you’re eager to dive deeper, here are some self-assignments to further enhance your learning.
1) Simplify the process — Use dlt from dltHub to create Delta tables while eliminating unnecessary steps. (ref — https://dlthub.com/docs/dlt-ecosystem/table-formats/delta)
2) Enhance the Lookup activity — modify the Lookup activity to supply dynamic values for the connection and file path.
Feel free to share your findings with others along the way!
How to use the GitHub repo
All the necessary code can be found in the GitHub repository, organized by week; this post covers week 2 of the series.
[https://github.com/sketchmyview/fabric-dlthub-series]
Import the pipeline and notebooks from their respective Pipeline and Notebook folders into the Fabric environment.