DATAENG 03 (Repositories): Creating a Project Output in Code Repositories

Engineer a clean output for your project to be consumed by downstream pipelines and use cases.

About this course

In this tutorial, you’ll engineer a “clean” output for your project to be consumed by downstream pipelines and use cases. The code you’ll implement uses common PySpark features for transforming data inputs, and a significant portion of the tutorial will ask you to explore selected documentation entries that expand on PySpark best practices. As a reminder, however, teaching PySpark syntax patterns is outside the scope of this course.

⚠️ Course prerequisites

  • DATAENG 02: Introduction to Data Transformation: If you have not completed the previous course in this track, please do so now.
  • Necessary permissions to create Code Repositories: Please reach out to your program administrator or Palantir point of contact if you need authorization.
  • General familiarity with code-based data transformation: This course provides PySpark code snippets, so PySpark-specific knowledge is not necessary, though a basic understanding of using code (e.g., SQL, Java, Python, R) to transform data will provide a conceptual advantage.
  • General familiarity with source code management workflows in Git (branching and merging) is useful but not required.

Learning Objectives

  1. Understand the distinction between preprocessing and cleaning.
  2. Document the datasource stage of your pipeline.

Foundry Skills

  • Create a multi-input transform file.
  • Use Contour to validate a proposed data transform.
  • Generate a Data Lineage graph as documentation for the Datasource project segment of your production pipeline.
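The first skill above, a multi-input transform file, can be sketched with Foundry's transforms-python API. This is a hypothetical illustration, not the course's solution: the dataset paths, column names, and function name are invented, and a transform file like this only builds inside a Foundry Code Repository, not as a standalone script.

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    # All dataset paths below are hypothetical placeholders.
    Output("/Example Project/datasource/flight_alerts_clean"),
    alerts=Input("/Example Project/datasource/flight_alerts_preprocessed"),
    priority_map=Input("/Example Project/datasource/priority_mapping_preprocessed"),
)
def clean_flight_alerts(alerts, priority_map):
    """Join two preprocessed inputs and normalize a column for the clean output."""
    return (
        alerts
        # Enrich each alert with its human-readable priority label.
        .join(priority_map, on="priority_code", how="left")
        # Parse the raw string timestamp into a proper timestamp type.
        .withColumn("alert_ts", F.to_timestamp("alert_ts"))
    )
```

Note how the decorator declares one Output and multiple named Inputs; the function body receives each input as a PySpark DataFrame and returns the DataFrame to be written to the output dataset.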

Curriculum

  • Introduction
  • About this Course
  • Add a Cleaned Dataset
  • Updating Your Repository Folder Structure
  • Creating, Previewing, and Building your Code
  • Brief Code Review
  • Exercise Summary
  • Analyze Data on a Branch
  • Using Contour for Data Validation
  • Exercise Summary
  • Update the Master Branch & Document your Pipeline
  • Merge into Master
  • Approving your Pull Request
  • Documenting your Pipeline with a Data Lineage Graph
  • Exercise Summary
  • Add Written Pipeline Documentation
  • Add a README File to your Source Repository
  • Rendering Markdown in a Repository
  • Exercise Summary
  • Conclusion
  • Key Takeaways
  • Next Steps
