DATAENG 05c (Repositories): Multiple Outputs with Data Transforms in Code Repositories

Learn how to use multi-output and generated transforms to produce more than one dataset output from a single transform file.

About this course

After a Datasource Project has generated a set of clean outputs, the next stage in a pipeline — the Transform Project — prepares data to feed into the Ontology layer. These projects import the cleaned datasets from one or more Datasource Projects, join them with lookup datasets to expand values, normalize or de-normalize relationships to create object-centric or time-centric datasets, or aggregate data to create standard, shared metrics.

Up to this point in the Data Engineering Learning Path, you’ve authored code-based data transformations that output a single dataset. Foundry transform APIs provide at least two ways to generate multiple outputs from a single transform file, which is helpful when you want to programmatically break inputs into distinct parts. In this tutorial, you’ll explore one of the available methods for outputting multiple datasets from a single transform as you take your pipeline into the Transform Project phase.
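To make the two approaches concrete before you open a repository, here is a minimal plain-Python sketch of the difference in shape. It deliberately does not use the Foundry transforms API: the rows, the `country` column, and the function names `multi_output` and `generate_transforms` are all hypothetical and only stand in for the datasets and transform functions you will write in the exercises.

```python
# Hypothetical sample rows standing in for an input dataset.
ROWS = [
    {"alert_id": 1, "country": "US"},
    {"alert_id": 2, "country": "FR"},
    {"alert_id": 3, "country": "US"},
]
COUNTRIES = ["US", "FR"]


def multi_output(rows, countries):
    """Multi-output shape: one function reads the input once and fills every output."""
    outputs = {country: [] for country in countries}
    for row in rows:
        if row["country"] in outputs:
            outputs[row["country"]].append(row)
    return outputs


def generate_transforms(countries):
    """Generated shape: a loop builds one single-output function per country."""
    transforms = {}
    for country in countries:
        # Bind the loop variable as a default argument so each function
        # keeps its own country instead of the loop's final value.
        def compute(rows, country=country):
            return [row for row in rows if row["country"] == country]

        transforms[country] = compute
    return transforms


multi = multi_output(ROWS, COUNTRIES)
generated = {
    country: compute(ROWS)
    for country, compute in generate_transforms(COUNTRIES).items()
}
```

Both shapes end up with the same per-country rows in this sketch. What differs is the structure: the first is a single function with several outputs, while the second is several independent functions produced by a loop.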

⚠️ Course Prerequisites

  • Publishing and Using Shared Libraries in Code Repositories: If you have not completed the previous course, please do so now.

Outcomes

The exercises in this tutorial take the clean outputs from your Datasource Project: Flight Alerts and Datasource Project: Passengers projects and process them further using a multi-output Python transform. You’ll first create an intermediate transform that joins the flight alerts data with the passenger data. Then you’ll create a multi-output transform that writes a separate dataset of alerts for each passenger country.
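The join-then-split flow described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the course solution: the column names `passenger_id` and `country`, the sample rows, and both helper functions are hypothetical, and in the exercises the same steps would run against datasets inside Foundry transforms.

```python
def join_alerts_to_passengers(alerts, passengers):
    """Inner-join alerts to passengers on the hypothetical key passenger_id."""
    passengers_by_id = {p["passenger_id"]: p for p in passengers}
    joined = []
    for alert in alerts:
        passenger = passengers_by_id.get(alert["passenger_id"])
        if passenger is not None:  # drop alerts with no matching passenger
            joined.append({**alert, "country": passenger["country"]})
    return joined


def split_by_country(joined_rows):
    """Return one list of alert rows per passenger country."""
    outputs = {}
    for row in joined_rows:
        outputs.setdefault(row["country"], []).append(row)
    return outputs


# Hypothetical sample data; alert 4 has no matching passenger.
alerts = [
    {"alert_id": 1, "passenger_id": "p1"},
    {"alert_id": 2, "passenger_id": "p2"},
    {"alert_id": 3, "passenger_id": "p1"},
    {"alert_id": 4, "passenger_id": "p9"},
]
passengers = [
    {"passenger_id": "p1", "country": "US"},
    {"passenger_id": "p2", "country": "FR"},
]

joined = join_alerts_to_passengers(alerts, passengers)
outputs_by_country = split_by_country(joined)
```

Keeping the join as its own intermediate step means the per-country split only has to filter on one column, which mirrors the two-stage structure of the exercises.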

🥅 Learning Objectives

  1. Gain familiarity with the Transform Project stage of a production pipeline.
  2. Understand the difference between a multi-output and a generated transform, both of which are capable of producing more than one dataset output from a single transform file.

💪 Foundry Skills

  • Create, schedule, and document the Transform Project portion of a production data pipeline.
  • Write generated and multi-output Python transforms.

Curriculum

  • About this Course
  • Create a Transform Project and Multiple Output Transforms
      • Create Your Folder Structure and Repository
      • Add Code for Your “Transformed” Datasets
      • Multiple Outputs with “Generated” Transforms
      • Multi-output Transforms
      • Exercise Summary
  • Document and Schedule Your Pipeline
      • Add a README File
      • Add a Data Lineage Graph for Documentation
      • Configure a Connecting Build Schedule
      • Take Stock of Your Pipeline
      • Exercise Summary
  • Conclusion
      • Key Takeaways
      • Next Steps
