DATAENG 03 (Repositories): Creating a Project Output in Code Repositories

Engineer a clean output for your project to be consumed by downstream pipelines and use cases.

About this course

In this tutorial, you’ll engineer a “clean” output for your project to be consumed by downstream pipelines and use cases. The code you’ll implement uses common PySpark features for transforming data inputs, and a significant portion of the tutorial will ask you to explore selected documentation entries that expand on PySpark best practices. As a reminder, however, teaching PySpark syntax patterns is outside the scope of this course.

⚠️ Course prerequisites

  • DATAENG 02: Introduction to Data Transformation: If you have not completed the previous course in this track, please do so now.
  • Necessary permissions to create Code Repositories: Please reach out to your program administrator or Palantir point of contact if you need authorization.
  • General familiarity with code-based data transformation: This course provides PySpark code snippets, so PySpark-specific knowledge is not necessary, though a basic understanding of using code (e.g., SQL, Java, Python, R) to transform data will provide a conceptual advantage.
  • General familiarity with source code management workflows in Git (branching and merging) is useful but not required.

Learning Objectives

  1. Understand the distinction between preprocessing and cleaning.
  2. Document the datasource stage of your pipeline.

Foundry Skills

  • Create a multi-input transform file.
  • Use Contour to validate a proposed data transform.
  • Generate a Data Lineage graph as documentation for the Datasource project segment of your production pipeline.
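The first skill above, a multi-input transform file, can be sketched with Foundry's transforms-python API. This is a hypothetical illustration, not the course's solution: the dataset paths, column names, and function name are invented, and a transform file like this only builds inside a Foundry Code Repository, not as a standalone script.

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    # All dataset paths below are hypothetical placeholders.
    Output("/Example Project/datasource/flight_alerts_clean"),
    alerts=Input("/Example Project/datasource/flight_alerts_preprocessed"),
    priority_map=Input("/Example Project/datasource/priority_mapping_preprocessed"),
)
def clean_flight_alerts(alerts, priority_map):
    """Join two preprocessed inputs and normalize a column for the clean output."""
    return (
        alerts
        # Enrich each alert with its human-readable priority label.
        .join(priority_map, on="priority_code", how="left")
        # Parse the raw string timestamp into a proper timestamp type.
        .withColumn("alert_ts", F.to_timestamp("alert_ts"))
    )
```

Note how the decorator declares one Output and multiple named Inputs; the function body receives each input as a PySpark DataFrame and returns the DataFrame to be written to the output dataset.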

Curriculum

  • Introduction
  • About this Course
  • Add a Cleaned Dataset
  • Updating Your Repository Folder Structure
  • Creating, Previewing, and Building your Code
  • Brief Code Review
  • Exercise Summary
  • Analyze Data on a Branch
  • Using Contour for Data Validation
  • Exercise Summary
  • Update the Master Branch & Document your Pipeline
  • Merge into Master
  • Approving your Pull Request
  • Documenting your Pipeline with a Data Lineage Graph
  • Exercise Summary
  • Add Written Pipeline Documentation
  • Add a README File to your Source Repository
  • Rendering Markdown in a Repository
  • Exercise Summary
  • Conclusion
  • Key Takeaways
  • Next Steps
