DATAENG 02 (Repositories): Introduction to Data Transformation with Code Repositories

Use Code Repositories to normalize and format data using some basic transforms.

About this course

Once your team has agreed on the datasets and transformation steps needed to achieve your outcome, it’s time to start developing your data assets in a Foundry code repository. The Code Repository application contains a fully integrated suite of tools that let you write, publish, and build data transformations as part of a production pipeline. There are several Foundry applications capable of transforming and outputting datasets (e.g., Contour, Code Workbook, Preparation, Fusion). In this tutorial, we will assume you want to learn how to use Code Repositories.
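
To preview what the course builds toward, here is a minimal sketch of a Python data transform in a Foundry code repository, along the lines of the “identity” transform covered in the first exercise. The dataset paths shown are hypothetical placeholders, not the ones used in the course:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Tutorial/datasets/preprocessed/example_preprocessed"),  # hypothetical output path
    source_df=Input("/Tutorial/datasets/raw/example_raw"),           # hypothetical input path
)
def compute(source_df):
    # An identity transform: write the input dataset through unchanged.
    return source_df
```

The decorator declares the input and output datasets, and the function body is ordinary PySpark operating on a DataFrame; the course walks through each of these pieces in turn.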

⚠️ Course prerequisites

  • Data Pipeline Foundations: If you have not completed the previous course, please do so now.
  • Necessary permissions to create Code Repositories. Please reach out to your program administrator or Palantir point of contact if you need authorization.
  • General familiarity with code-based data transformation: This course provides PySpark code snippets, so PySpark-specific knowledge is not necessary, though a basic understanding of using code (e.g., SQL, Java, Python, R) to transform data will provide a conceptual advantage.
  • General familiarity with source code management workflows in Git (branching and merging) is useful but not required.

Outcomes

In the previous course, you created a series of folders that implement a recommended pipeline project structure. You’ll now use the Code Repositories application to generate the initial datasets in your pipeline.

In this course, you’ll use PySpark to normalize and format the data using some basic cleaning utilities. You’ll stop short of doing any mapping between the raw files—your first goal is simply to pre-process them for further cleaning and eventual joining downstream (in a subsequent tutorial).

In short, the inputs to this course are the simulated raw datasets from an upstream source, and the outputs are “pre-processed” datasets formatted for further cleaning in the next course.
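
For a concrete taste of the pre-processing involved, below is an illustrative sketch of the kind of reusable PySpark utilities this course has you build. The function names and formats are hypothetical; the course exercises define their own:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def normalize_column_names(df: DataFrame) -> DataFrame:
    """Lowercase column names and replace spaces with underscores."""
    return df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])


def cast_to_date(df: DataFrame, column: str, fmt: str = "yyyy-MM-dd") -> DataFrame:
    """Cast a string column to a proper date type using the given format."""
    return df.withColumn(column, F.to_date(F.col(column), fmt))
```

Utilities like these live in their own files in the repository so that every transform can import and reuse them; that pattern is the subject of the “Creating Cleaning Utilities” and “Creating Type Utilities” exercises later in the course.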

🥅 Learning Objectives

  1. Navigate the Code Repositories environment.
  2. Learn the basic anatomy of a data transform.
  3. Understand how code management works in a Foundry code repository.
  4. Practice writing PySpark data transformations.
  5. Understand the importance of pre-processing and cleaning in data pipeline development.
  6. Understand basic patterns for creating and configuring a Code Repository for transforming data.

💪 Foundry Skills

  • Bootstrap a Foundry Code Repository.
  • Create and implement reusable code utilities.
  • Implement branching and pipeline documentation best practices.

Curriculum

  • Introduction
  • About this Course
  • Getting Started
  • Preview the Project in Data Lineage
  • Creating a Code Repository
  • Branching your Code
  • Your First (“Identity”) Transform
  • Testing and committing your code
  • Building your dataset
  • Repeat the process for your other raw datasets
  • Exercise Summary
  • Merging your Code
  • Creating a Pull Request
  • Approving a PR and Merging into Master
  • Building on Master
  • Exercise Summary
  • Preprocessing your Data
  • Creating Cleaning Utilities
  • Creating Type Utilities
  • Applying Utility Files to Preprocess your Data
  • Exercise Summary
  • Conclusion
  • Key Takeaways
  • Next Steps
