DATAENG 02 (Repositories): Introduction to Data Transformation with Code Repositories

Use Code Repositories to normalize and format data using some basic transforms.

About this course

Once your team has agreed on the datasets and transformation steps needed to achieve your outcome, it’s time to start developing your data assets in a Foundry code repository. The Code Repository application contains a fully integrated suite of tools that let you write, publish, and build data transformations as part of a production pipeline. There are several Foundry applications capable of transforming and outputting datasets (e.g., Contour, Code Workbook, Preparation, Fusion). In this tutorial, we will assume you want to learn how to use Code Repositories.
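
To preview what the course builds toward, here is a minimal sketch of a Python data transform in a Foundry code repository, along the lines of the “identity” transform covered in the first exercise. The dataset paths shown are hypothetical placeholders, not the ones used in the course:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Tutorial/datasets/preprocessed/example_preprocessed"),  # hypothetical output path
    source_df=Input("/Tutorial/datasets/raw/example_raw"),           # hypothetical input path
)
def compute(source_df):
    # An identity transform: write the input dataset through unchanged.
    return source_df
```

The decorator declares the input and output datasets, and the function body is ordinary PySpark operating on a DataFrame; the course walks through each of these pieces in turn.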

⚠️ Course prerequisites

  • Data Pipeline Foundations: If you have not completed the previous course, please do so now.
  • Necessary permissions to create Code Repositories. Please reach out to your program administrator or Palantir point of contact if you need authorization.
  • General familiarity with code-based data transformation: This course provides PySpark code snippets, so PySpark-specific knowledge is not necessary, though a basic understanding of using code (e.g., SQL, Java, Python, R) to transform data will provide a conceptual advantage.
  • General familiarity with source code management workflows in Git (branching and merging) is useful but not required.

Outcomes

In the previous course, you created a series of folders that implement a recommended pipeline project structure. You’ll now use the Code Repositories application to generate the initial datasets in your pipeline.

In this course, you’ll use PySpark to normalize and format the data using some basic cleaning utilities. You’ll stop short of doing any mapping between the raw files—your first goal is simply to pre-process them for further cleaning and eventual joining downstream (in a subsequent tutorial).

In short, the inputs to this course are the simulated raw datasets from an upstream source, and the outputs are “pre-processed” datasets formatted for further cleaning in the next course.
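
For a concrete taste of the pre-processing involved, below is an illustrative sketch of the kind of reusable PySpark utilities this course has you build. The function names and formats are hypothetical; the course exercises define their own:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def normalize_column_names(df: DataFrame) -> DataFrame:
    """Lowercase column names and replace spaces with underscores."""
    return df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])


def cast_to_date(df: DataFrame, column: str, fmt: str = "yyyy-MM-dd") -> DataFrame:
    """Cast a string column to a proper date type using the given format."""
    return df.withColumn(column, F.to_date(F.col(column), fmt))
```

Utilities like these live in their own files in the repository so that every transform can import and reuse them; that pattern is the subject of the “Creating Cleaning Utilities” and “Creating Type Utilities” exercises later in the course.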

🥅 Learning Objectives

  1. Navigate the Code Repositories environment.
  2. Learn the basic anatomy of a data transform.
  3. Understand how code management works in a Foundry code repository.
  4. Practice writing PySpark data transformations.
  5. Understand the importance of pre-processing and cleaning in data pipeline development.
  6. Understand basic patterns for creating and configuring a Code Repository for transforming data.

💪 Foundry Skills

  • Bootstrap a Foundry Code Repository.
  • Create and implement reusable code utilities.
  • Implement branching and pipeline documentation best practices.

Curriculum

  • Introduction
  • About this Course
  • Getting Started
  • Preview the Project in Data Lineage
  • Creating a Code Repository
  • Branching your Code
  • Your First (“Identity”) Transform
  • Testing and committing your code
  • Building your dataset
  • Repeat the process for your other raw datasets
  • Exercise Summary
  • Merging your Code
  • Creating a Pull Request
  • Approving a PR and Merging into Master
  • Building on Master
  • Exercise Summary
  • Preprocessing your Data
  • Creating Cleaning Utilities
  • Creating Type Utilities
  • Applying Utility Files to Preprocess your Data
  • Exercise Summary
  • Conclusion
  • Key Takeaways
  • Next Steps
