DATAENG 05a (Repositories): Working with Raw Files in Code Repositories

Use Foundry APIs and packages to parse non-linear files into Parquet in Code Repositories.

About this course

The computation engine behind data transformations in Foundry is Spark: an open-source, distributed cluster-computing framework for fast, large-scale data processing and analytics. Spark works most efficiently on a file format called Parquet, and by default, Foundry transforms write their output datasets as a series of distributed Parquet files.

All other things being equal, Spark computes datasets composed of Parquet files more efficiently than datasets in other formats. You may, however, need to process files in non-linear formats such as XML or JSON. This tutorial reviews the essentials of reading and writing raw files in Foundry datasets using the @transform() decorator (in contrast to the @transform_df() decorator used in the previous tutorial).
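
For reference, the DataFrame-based pattern from the previous tutorial looks like the minimal sketch below. The dataset paths and the filter column are hypothetical; the @transform_df() decorator and the Input/Output classes come from Foundry's transforms.api package, and the output of such a transform is written as Parquet files by default.

    from transforms.api import transform_df, Input, Output


    # A typical DataFrame-based transform: the code never touches files directly,
    # and Foundry writes the output dataset as distributed Parquet files.
    @transform_df(
        Output("/Airline/clean/flight_alerts"),      # hypothetical output dataset path
        alerts=Input("/Airline/raw/flight_alerts"),  # hypothetical input dataset path
    )
    def compute(alerts):
        # 'alerts' arrives as a Spark DataFrame; the return value becomes the output.
        return alerts.filter(alerts["status"] == "open")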

The files necessary for the next step in the development of your pipeline are in non-Parquet formats and must be directly accessed by your code for transformation.
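
When a transform needs that kind of direct file access, the @transform() decorator hands your function input and output objects instead of DataFrames, and their filesystem() method exposes the raw files backing each dataset. The sketch below illustrates the shape of such a transform; the dataset paths and the one-column output are hypothetical stand-ins for real parsing logic.

    from transforms.api import transform, Input, Output


    @transform(
        processed=Output("/Airline/clean/passengers"),    # hypothetical output dataset path
        raw_files=Input("/Airline/raw/passenger_files"),  # hypothetical input dataset path
    )
    def compute(ctx, processed, raw_files):
        # List every file stored in the input dataset; each entry carries the
        # file's logical path within the dataset.
        statuses = list(raw_files.filesystem().ls())

        # Open one raw file as a text stream and read a line from it directly.
        with raw_files.filesystem().open(statuses[0].path) as f:
            first_line = f.readline().strip()

        # Real code would parse every file into rows; here we write a one-row
        # DataFrame so the output is still a Parquet-backed dataset.
        df = ctx.spark_session.createDataFrame([(first_line,)], ["first_line"])
        processed.write_dataframe(df)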

⚠️ Course Prerequisites

  • DATAENG 04: Scheduling Data Pipelines: If you have not completed this previous course, please do so now.
  • A basic understanding of Spark and distributed computing will provide an advantage as we begin talking about dataset anatomy, but it is not required.

Outcomes

Your data pipeline consists of clean flight alert data enhanced with some mapping files, but there’s another data source you’d like to incorporate into the overarching project: the passengers associated with these flight alerts. Your team may have decided, for example, to enable a downstream workflow that assigns travel vouchers based on flight delay/alert severity and customer status; integrating passenger data into your pipeline is a necessary step toward building the Ontology framework that supports that interaction pattern.

The goal of this tutorial is to introduce another data transformation pattern: directly accessing and parsing CSV and JSON files in Foundry. Whether your non-linearly formatted data was uploaded in an ad hoc manner or originates from an external source, the methods in this course are an important part of a data engineer’s arsenal of transformation techniques.
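
As a preview of that pattern, the sketch below parses a hypothetical JSON file of passenger records into a Parquet-backed dataset using only the standard-library json module and the @transform() decorator; the dataset paths, file layout, and field names are illustrative, not the actual course data.

    import json

    from transforms.api import transform, Input, Output


    @transform(
        passengers=Output("/Airline/clean/passengers"),  # hypothetical output dataset path
        raw_json=Input("/Airline/raw/passenger_json"),   # hypothetical input dataset path
    )
    def parse_passengers(ctx, passengers, raw_json):
        rows = []
        # Walk the JSON files in the input dataset and flatten each record into
        # a (flight_alert_id, passenger_id) tuple. Field names are illustrative.
        for status in raw_json.filesystem().ls(glob="*.json"):
            with raw_json.filesystem().open(status.path) as f:
                for record in json.load(f):
                    rows.append((record["flight_alert_id"], record["passenger_id"]))

        df = ctx.spark_session.createDataFrame(rows, ["flight_alert_id", "passenger_id"])
        passengers.write_dataframe(df)  # written to Foundry as Parquet files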

🥅 Learning Objectives

  1. Understand raw file access from a transform in the Code Repositories application.
  2. Use Foundry APIs and packages to parse non-linear files into Parquet.

💪 Foundry Skills

  • Use the @transform() decorator to access raw files in Foundry.
  • Use additional Python libraries to parse non-Parquet data.
  • Use the Foundry Explorer helper.

Curriculum

  • About this course
  • Create Your Project Resources and Raw Data
  • Create Your Project Structure
  • Create Your Repository
  • Identify Your Raw Data
  • Copy the Raw Data Source into Your Project
  • Exercise Summary
  • Preprocess Your Non-standard Files
  • Preprocess Your Data, Part 1
  • Preprocess Your Data, Part 2
  • Viewing Module Contents
  • Exercise Summary
  • Python Unit Tests
  • Python Unit Tests
  • Conclusion
  • Key Takeaways
  • Next Steps
