DATAENG 05b (Repositories): Publishing and Using Shared Libraries in Code Repositories

About this course

Raw datasets are typically highly restricted, because they often contain malformed or sensitive data unfit for downstream consumption. As you’ve learned in this training track, the chief output of a datasource project is a clean dataset that can be used in multiple cases, including as the next step in a production data pipeline. In the previous tutorial, you transformed raw JSON and CSV files into preprocessed “passenger” datasets contained in Datasource Project: Passengers. The next step is to generate a clean dataset output.

Your organization may have common data formats that would benefit from a standardized set of cleaning utilities that can be applied across transform use cases. Rather than inefficiently repeating the same cleaning utility code for each use, you can develop and publish Python code libraries to share across the enterprise.
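As a rough illustration of the kind of cleaning utility such a library might hold, here is a minimal sketch in plain Python. The function names (normalize_column_name, clean_record) and the cleaning rules are hypothetical and are not taken from the course materials; in a Foundry transform the same logic would more likely be applied to DataFrame columns.

```python
import re


def normalize_column_name(name: str) -> str:
    """Convert a raw column header to snake_case (hypothetical helper)."""
    # Replace each run of non-alphanumeric characters with a single underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop leading/trailing underscores and lowercase the result.
    return cleaned.strip("_").lower()


def clean_record(record: dict) -> dict:
    """Normalize keys, trim string values, and map empty strings to None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            # A string that is empty after trimming becomes None.
            value = value.strip() or None
        cleaned[normalize_column_name(key)] = value
    return cleaned
```

Keeping helpers like these in one shared package means every transform applies the same rules, and a fix to a rule only has to be made once.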

⚠️ Course Prerequisites

DATAENG 05a: Working with Raw Files in Code Repositories: If you have not completed the previous course, please do so now.

Outcomes

Publishing and consuming shared Python code libraries across an organization is an important part of a Foundry data engineer’s toolkit. In the process of creating clean passenger data outputs from your datasource project (i.e., passengers_clean and passengers_flight_alerts_clean), you’ll also create a cleaning utility, publish it, and make use of it in another transform. Specifically, you'll transition the cleaning functions from Introduction to Data Transformation with Code Repositories into a shared library and reference them in both of your datasource repositories. After cleaning the passenger data, you'll create an output passenger dataset that unions the JSON and CSV pipelines together.
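The idea of unioning two pipelines can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper union_by_name and the sample column names are hypothetical, and in a PySpark transform the same result would typically come from DataFrame.unionByName rather than hand-written code.

```python
def union_by_name(*datasets):
    """Union lists of dict rows, aligning on column names.

    Columns missing from a row are filled with None (hypothetical helper).
    """
    # Collect every column name in first-seen order.
    columns = []
    for rows in datasets:
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Emit every row with the full, consistently ordered column set.
    return [
        {col: row.get(col) for col in columns}
        for rows in datasets
        for row in rows
    ]


# Hypothetical sample rows standing in for the JSON and CSV pipelines.
json_rows = [{"passenger_id": 1, "name": "Ada"}]
csv_rows = [{"name": "Grace", "passenger_id": 2, "country": "US"}]
combined = union_by_name(json_rows, csv_rows)
```

Aligning on column names rather than column position avoids silently mixing up fields when the two sources list their columns in a different order.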

🥅 Learning Objectives

Understand how Foundry generally makes packages available.
Know how to write, publish, and use a Python library.
Gain additional practice generating clean dataset outputs from a datasource project.

💪 Foundry Skills

Write a cleaning utility function.
Publish your cleaning utility as a shared Python library.
Implement a shared library in another code repository.

Curriculum

About this course
How Packages are Made Available
How does Foundry make packages available?
Creating a Shared Repository
Exercise Summary
Writing and Publishing a Shared Python Function
Add Your Package and Modules
Publishing Shared Code
Exercise Summary
Using a Shared Python Library
Add a Reference to Your Library
Replace Code References
Are your datasets up-to-date?
Exercise Summary
Updating and Extending Your Passengers Pipeline
Create clean “passengers” output datasets, part 1
Create clean “passengers” output datasets, part 2
Document Your Pipeline with a Data Lineage Graph
Document Your Pipeline with a README File
Add a Schedule to Your Pipeline
Exercise Summary
Conclusion
Key Takeaways
Next Steps

Write, publish, and use a Python library in Foundry.
