DATAENG 05b (Repositories): Publishing and Using Shared Libraries in Code Repositories

Write, publish, and use a Python library in Foundry.

About this course

Raw datasets are typically highly restricted, because they often contain malformed or sensitive data unfit for downstream consumption. As you’ve learned in this training track, the chief output of a datasource project is a clean dataset that can be used in multiple cases, including as the next step in a production data pipeline. In the previous tutorial, you transformed raw JSON and CSV files into preprocessed “passenger” datasets contained in Datasource Project: Passengers. The next step is to generate a clean dataset output.

Your organization may have common data formats that would benefit from a standardized set of cleaning utilities that can be applied across transform use cases. Rather than inefficiently repeating the same cleaning utility code for each use, you can develop and publish Python code libraries to share across the enterprise.
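As a rough illustration of the kind of cleaning utility described above, here is a minimal sketch in plain Python. The function names (`normalize_column_name`, `clean_string`) and the placeholder values treated as missing are assumptions made for this example, not the functions used in the course. In a Foundry transform, logic like this would typically be applied to DataFrame columns rather than to individual values.

```python
import re
from typing import Optional


def normalize_column_name(name: str) -> str:
    """Convert a raw column header to lower snake_case (hypothetical helper)."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop leading/trailing underscores left over from punctuation.
    return cleaned.strip("_").lower()


def clean_string(value: Optional[str]) -> Optional[str]:
    """Trim whitespace and map empty or placeholder strings to None."""
    if value is None:
        return None
    trimmed = value.strip()
    # The placeholder set below is an assumption for illustration only.
    if trimmed == "" or trimmed.lower() in {"null", "n/a"}:
        return None
    return trimmed
```

Keeping helpers like these in one published library means every transform applies the same rules, and a fix made once reaches every consumer when they upgrade.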

⚠️ Course Prerequisites

  • DATAENG 05a: Working with Raw Files in Code Repositories: If you have not completed the previous course, please do so now.

Outcomes

Publishing and consuming shared Python code libraries across an organization is an important part of a Foundry data engineer’s toolkit. In the process of creating clean passenger data outputs from your datasource project (i.e., passengers_clean and passengers_flight_alerts_clean), you’ll also create a cleaning utility, publish it, and make use of it in another transform. Specifically, you’ll move the cleaning functions from Introduction to Data Transformation with Code Repositories into a shared library and reference them in both of your datasource repositories. After cleaning the passenger data, you’ll create an output passenger dataset that unions the JSON and CSV pipelines.
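The final union step can be pictured with a small, self-contained sketch. It uses plain Python lists of dictionaries in place of datasets, and the helper name `union_by_name` and the column names are hypothetical. In an actual transform you would more likely union Spark DataFrames (PySpark provides `DataFrame.unionByName` for name-based unions), so treat this only as an illustration of the idea.

```python
from typing import Dict, Iterable, List, Optional

Row = Dict[str, Optional[str]]


def union_by_name(*sources: Iterable[Row]) -> List[Row]:
    """Stack rows from several sources, aligning them on column name.

    Columns missing from a given row are filled with None.
    """
    materialized = [list(src) for src in sources]
    # Collect column names in first-seen order across all sources.
    columns: List[str] = []
    for rows in materialized:
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Re-emit every row with the full, consistent set of columns.
    return [
        {col: row.get(col) for col in columns}
        for rows in materialized
        for row in rows
    ]


# Hypothetical cleaned rows from the JSON and CSV pipelines.
json_rows = [{"passenger_id": "1", "country": "US"}]
csv_rows = [{"passenger_id": "2"}]
passengers = union_by_name(json_rows, csv_rows)
```

Aligning on column names rather than position keeps the union stable even if the two pipelines emit their columns in a different order.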

🥅 Learning Objectives

  1. Understand how Foundry generally makes packages available.
  2. Know how to write, publish, and use a Python library.
  3. Gain additional practice generating clean dataset outputs from a datasource project.

💪 Foundry Skills

  • Write a cleaning utility function.
  • Publish your cleaning utility as a shared Python library.
  • Implement a shared library in another code repository.

Curriculum

  • About this course
  • How Packages are Made Available
  • How does Foundry make packages available?
  • Creating a Shared Repository
  • Exercise Summary
  • Writing and Publishing a Shared Python Function
  • Add Your Package and Modules
  • Publishing Shared Code
  • Exercise Summary
  • Using a Shared Python Library
  • Add a Reference to Your Library
  • Replace Code References
  • Are your datasets up-to-date?
  • Exercise Summary
  • Updating and Extending Your Passengers Pipeline
  • Create clean “passengers” output datasets, part 1
  • Create clean “passengers” output datasets, part 2
  • Document Your Pipeline with a Data Lineage Graph
  • Document Your Pipeline with a README File
  • Add a Schedule to Your Pipeline
  • Exercise Summary
  • Conclusion
  • Key Takeaways
  • Next Steps
