Addressing Data Transformation Challenges: A Strategic Initiative in MS Fabric

Rakesh Gupta
3 min read · Feb 12, 2025

Through extensive experience working with diverse consulting clients and teams, I have observed a recurring pattern: the use of multiple notebooks for individual data transformation tasks. This approach presents several challenges:

  • One Notebook per Transformation: Each data transformation is handled in a separate notebook, leading to fragmentation.
  • No Standardized Development Process: Teams working within a Fabric Workspace lack a uniform development workflow.
  • Code Duplication Across Workspaces: Redundant code is scattered across different Microsoft Fabric Workspaces, complicating maintenance.
  • Prolonged Data Transformation Cycle: The time required for development, testing, and deployment is extended, impacting efficiency.

Recognizing these challenges, we launched a strategic initiative within a specific project to streamline the data transformation process.

The Initiative

In this project, we leveraged well-structured, reusable PySpark code for data transformation. Upon closer examination, we identified a common pattern across notebooks (a sketch of this pattern follows the list):

  1. Defining Sources and Reference Schemas: Notebooks established their data sources and referenced schemas defined as Spark structs.
  2. Applying Spark SQL Transformations: Transformations were executed using externally stored Spark SQL scripts, with the SQL script path and other dependencies specified in the notebook.
  3. Writing to a Delta Location: The transformed data was saved in a Delta location.
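For illustration, a typical notebook following this pattern might have looked roughly like the sketch below. The column names, SQL script path, and Delta location are hypothetical placeholders, and the external SQL script is assumed to contain a single SELECT statement.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.getOrCreate()

# 1. Define the source and its reference schema as a Spark struct
#    (in practice the struct was referenced from a schema definition, e.g. a JSON file).
trip_schema = StructType([
    StructField("trip_id", StringType()),
    StructField("pickup_datetime", TimestampType()),
    StructField("fare_amount", DoubleType()),
])

raw_df = (
    spark.read
    .option("header", True)
    .schema(trip_schema)
    .csv("/bronze/nytaxi.csv")
)
raw_df.createOrReplaceTempView("vw_raw_trips")

# 2. Apply a transformation kept in an externally stored Spark SQL script.
with open("/sqlcatalog/clean_trips.sql") as f:  # hypothetical script path
    transformed_df = spark.sql(f.read())

# 3. Write the result to a Delta location.
transformed_df.write.format("delta").mode("append").save("/silver/cleaned_trips")

Multiply this by dozens of transformations, and each one becomes its own notebook carrying its own copy of this boilerplate.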

While the existing code was structured efficiently, the increasing number of notebooks highlighted the need for a more scalable approach. To address these challenges, we implemented a metadata-driven data transformation framework, optimizing management, reusability, and efficiency.

Consolidating Reusable PySpark Code

To streamline development and improve efficiency, all reusable PySpark code was consolidated into a structured PySpark package, following standard Python packaging practices. This allowed developers to:

  • Develop and test code locally.
  • Package the code into a wheel (.whl) file through an Azure DevOps pipeline for seamless deployment across environments (a minimal packaging sketch follows this list).
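As a rough illustration, the packaging itself can stay very small. The setup.py below is a minimal sketch with a hypothetical package name, layout, and dependency list; the wheel can then be produced in the Azure DevOps pipeline with a standard build step (for example, python -m build --wheel).

# setup.py - minimal packaging sketch; name, layout, and dependencies are hypothetical
from setuptools import setup, find_packages

setup(
    name="fabric_transform_framework",
    version="0.1.0",
    package_dir={"": "src"},
    packages=find_packages(where="src"),
    install_requires=[
        "PyYAML>=6.0",  # parsing the transformation metadata files
    ],
)

The resulting .whl can be attached to the Fabric Spark environment (or installed with %pip install inside a notebook), so every workspace runs the same tested code instead of its own copy.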

Metadata-Driven Transformation

A key aspect of our approach was leveraging metadata to drive data transformations. Below is an example of a YAML file containing transformation metadata:

reader:
  nytaxi_ds:
    type: "csv_batch"
    path: '/bronze/nytaxi.csv'
    header: true
    infer_schema: true
    view_name: "vw_raw_trips"
    schema: 'datacatalog/trip.json'

transformation:
  add_current_timestamp_col:
    type: "sql_step"
    query: "select *, current_timestamp() as ingested_date from vw_raw_trips"
    querypath: '/sqlcatalog/vw_add_current_timestamp.sql'
    view_name: "vw_add_current_timestamp"

  add_is_active_col:
    type: "sql_step"
    query: "select *, '1' as is_active from vw_add_current_timestamp"
    view_name: "vw_add_is_active"

  add_partitioned_col:
    type: "sql_step"
    query: "select *, year(ingested_date) as year, month(ingested_date) as month from vw_add_is_active"
    view_name: "vw_add_partitioned"

writer:
  writer_1:
    write_view_name: "vw_add_partitioned"
    type: "delta_batch"
    path: '/silver/cleaned_trips'
    mode: "append"
    partition_by: ["year", "month"]
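With metadata in this shape, the packaged framework only needs a small, generic engine that walks the reader, transformation, and writer sections in order. The sketch below shows one way such an engine might look; the key names mirror the YAML above, but the function name, the assumption that querypath takes precedence over an inline query, and the omission of schema and error handling are all simplifications, not the project's actual implementation.

from pyspark.sql import SparkSession


def execute_job(spark: SparkSession, config: dict) -> None:
    """Run the reader -> transformation -> writer steps described by the metadata."""
    # Readers: register each source as a temporary view.
    for name, src in config.get("reader", {}).items():
        if src["type"] == "csv_batch":
            df = (
                spark.read
                .option("header", src.get("header", False))
                .option("inferSchema", src.get("infer_schema", False))
                .csv(src["path"])
            )
            # (loading the reference schema from the data catalog JSON is omitted for brevity)
            df.createOrReplaceTempView(src["view_name"])

    # Transformations: each SQL step materializes a new view, in YAML order.
    for name, step in config.get("transformation", {}).items():
        if step["type"] == "sql_step":
            if "querypath" in step:  # assume the external SQL file wins over the inline query
                with open(step["querypath"]) as f:
                    sql = f.read()
            else:
                sql = step["query"]
            spark.sql(sql).createOrReplaceTempView(step["view_name"])

    # Writers: persist the final view(s) to their Delta locations.
    for name, sink in config.get("writer", {}).items():
        if sink["type"] == "delta_batch":
            (
                spark.table(sink["write_view_name"])
                .write.format("delta")
                .mode(sink["mode"])
                .partitionBy(*sink.get("partition_by", []))
                .save(sink["path"])
            )

Adding a new transformation then becomes a YAML change (plus, at most, a new SQL file) rather than a new notebook.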

To streamline execution, a single main notebook was developed to handle all data transformation tasks dynamically. This notebook:

  • Accepts the job details (such as the path to the YAML metadata file) as notebook parameters.
  • Loads that metadata and invokes the packaged PySpark framework to run the reader, transformation, and writer steps it describes, as sketched below.

This eliminated the need for multiple transformation-specific notebooks, simplifying maintenance and execution.
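A sketch of what that single main notebook can reduce to is shown below; the parameter name, default path, and the execute_job import refer to the hypothetical package and engine sketched earlier, not to a Fabric or project-specific API.

# Parameter cell: the value below is a default and is overridden by the pipeline at run time.
metadata_path = "/metadata/nytaxi_trips.yml"  # hypothetical path to the job's YAML metadata

import yaml
from pyspark.sql import SparkSession
from fabric_transform_framework import execute_job  # hypothetical packaged engine

spark = SparkSession.builder.getOrCreate()

with open(metadata_path) as f:
    job_config = yaml.safe_load(f)

execute_job(spark, job_config)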

Orchestration Integration

To align with the new approach, existing orchestrations in Fabric Pipelines were updated to:

  • Reference the main notebook instead of multiple individual notebooks.
  • Pass the appropriate job details from the YAML metadata file.

This integration improved automation and scalability while reducing operational complexity.
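The pipeline's Notebook activity now points at the main notebook and passes the job's metadata reference as a parameter. For ad hoc runs or notebook-to-notebook orchestration, the same parameterized call can be made with notebookutils, as in the sketch below; the notebook and parameter names are hypothetical.

# notebookutils is built into Fabric notebooks (mssparkutils is the older alias).
# The same main notebook serves every job; only the metadata parameter changes.
notebookutils.notebook.run(
    "nb_main_transformation",                        # hypothetical main notebook name
    600,                                             # timeout in seconds
    {"metadata_path": "/metadata/nytaxi_trips.yml"},
)

notebookutils.notebook.run(
    "nb_main_transformation",
    600,
    {"metadata_path": "/metadata/green_taxi_trips.yml"},
)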

Metrics Collection for Azure Monitor

To enhance observability, metrics collection was implemented for Azure Monitor (see the sketch after this list), allowing:

  • Flexible monitoring based on project-specific requirements.
  • Improved insights into job performance, errors, and execution trends.
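There are several ways to ship such metrics; the sketch below assumes an Application Insights resource behind Azure Monitor and the opencensus-ext-azure log exporter, with a placeholder connection string and hypothetical metric fields. It wraps the hypothetical execute_job call from the earlier sketches.

import logging
import time

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("transform_metrics")
logger.setLevel(logging.INFO)
logger.addHandler(
    AzureLogHandler(connection_string="InstrumentationKey=<your-app-insights-key>")
)

start = time.time()
status = "succeeded"
try:
    execute_job(spark, job_config)  # hypothetical engine and config from the earlier sketches
except Exception:
    status = "failed"
    raise
finally:
    # Custom dimensions become queryable fields in Azure Monitor / Log Analytics.
    logger.info(
        "transformation_job_completed",
        extra={
            "custom_dimensions": {
                "job_name": "nytaxi_trips",
                "status": status,
                "duration_seconds": round(time.time() - start, 2),
            }
        },
    )

Because every job flows through the same main notebook, these custom dimensions provide a single place to query run durations, failures, and execution trends.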

By adopting this unified approach, we significantly improved efficiency, maintainability, and monitoring across our data transformation pipeline.

Benefits

The implemented changes led to several key advantages:

  • Elimination of Separate Notebooks: A dedicated notebook is no longer required for each data transformation, simplifying management.
  • Code Consolidation Across Environments: The use of a Python wheel (.whl) package eliminates code duplication across multiple Fabric Workspaces, enabling seamless deployment and installation on Spark clusters.
  • Shortened Data Transformation Cycle: The time required for development, testing, and deployment has been significantly reduced, improving overall efficiency.

By adopting this approach, we enhanced scalability, maintainability, and cost efficiency across our data transformation processes.

(Diagram: from multi-notebook data transformation to metadata-driven, single-notebook data transformation in MS Fabric)

Hope this approach helps your data engineering team get started with Microsoft Fabric!

With minor adjustments to imports, this methodology can also be applied to Azure Synapse and Databricks, making it a versatile solution for modern data transformation workflows.

If you’re looking for assistance in implementing metadata-driven data transformation in Microsoft Fabric or similar platforms, feel free to reach out — I’d be happy to help!
