DATAENG 05a (Repositories): Working with Raw Files in Code Repositories

Use Foundry APIs and packages to parse non-linear files into Parquet in Code Repositories.

About this course

The computation engine behind data transformations in Foundry is Spark: an open-source, distributed cluster-computing framework for fast, large-scale data processing and analytics. Spark works most efficiently on a file format called Parquet, and by default, Foundry transforms write their output datasets as a series of distributed Parquet files.

All other things being equal, Spark computes datasets composed of Parquet files more efficiently than datasets in other formats. You may, however, need to process files in non-linear formats such as XML or JSON. This tutorial reviews the essentials of reading and writing raw files in Foundry datasets using the @transform() decorator (in contrast to the @transform_df() decorator used in the previous tutorial).
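
For reference, the DataFrame-based pattern from the previous tutorial looks like the minimal sketch below. The dataset paths and the filter column are hypothetical; the @transform_df() decorator and the Input/Output classes come from Foundry's transforms.api package, and the output of such a transform is written as Parquet files by default.

    from transforms.api import transform_df, Input, Output


    # A typical DataFrame-based transform: the code never touches files directly,
    # and Foundry writes the output dataset as distributed Parquet files.
    @transform_df(
        Output("/Airline/clean/flight_alerts"),      # hypothetical output dataset path
        alerts=Input("/Airline/raw/flight_alerts"),  # hypothetical input dataset path
    )
    def compute(alerts):
        # 'alerts' arrives as a Spark DataFrame; the return value becomes the output.
        return alerts.filter(alerts["status"] == "open")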

The files necessary for the next step in the development of your pipeline are in non-Parquet formats and must be directly accessed by your code for transformation.
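
When a transform needs that kind of direct file access, the @transform() decorator hands your function input and output objects instead of DataFrames, and their filesystem() method exposes the raw files backing each dataset. The sketch below illustrates the shape of such a transform; the dataset paths and the one-column output are hypothetical stand-ins for real parsing logic.

    from transforms.api import transform, Input, Output


    @transform(
        processed=Output("/Airline/clean/passengers"),    # hypothetical output dataset path
        raw_files=Input("/Airline/raw/passenger_files"),  # hypothetical input dataset path
    )
    def compute(ctx, processed, raw_files):
        # List every file stored in the input dataset; each entry carries the
        # file's logical path within the dataset.
        statuses = list(raw_files.filesystem().ls())

        # Open one raw file as a text stream and read a line from it directly.
        with raw_files.filesystem().open(statuses[0].path) as f:
            first_line = f.readline().strip()

        # Real code would parse every file into rows; here we write a one-row
        # DataFrame so the output is still a Parquet-backed dataset.
        df = ctx.spark_session.createDataFrame([(first_line,)], ["first_line"])
        processed.write_dataframe(df)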

⚠️ Course Prerequisites

  • DATAENG 04: Scheduling Data Pipelines: If you have not completed this previous course, please do so now.
  • A basic understanding of Spark and distributed computing will provide an advantage as we begin talking about dataset anatomy, but it is not required.

Outcomes

Your data pipeline consists of clean flight alert data enhanced with some mapping files, but there’s another data source you’d like to incorporate into the overarching project: the passengers associated with these flight alerts. Your team may have decided, for example, to enable a downstream workflow that assigns travel vouchers based on flight delay/alert severity and customer status; integrating passenger data into your pipeline is a necessary step toward building the Ontology framework that supports that interaction pattern.

The goal of this tutorial is to introduce another data transformation pattern: directly accessing and parsing CSV and JSON files in Foundry. Whether your non-linearly formatted data was uploaded in an ad hoc manner or originates from an external source, the methods in this course are an important part of a data engineer’s arsenal of transformation techniques.
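
As a preview of that pattern, the sketch below parses a hypothetical JSON file of passenger records into a Parquet-backed dataset using only the standard-library json module and the @transform() decorator; the dataset paths, file layout, and field names are illustrative, not the actual course data.

    import json

    from transforms.api import transform, Input, Output


    @transform(
        passengers=Output("/Airline/clean/passengers"),  # hypothetical output dataset path
        raw_json=Input("/Airline/raw/passenger_json"),   # hypothetical input dataset path
    )
    def parse_passengers(ctx, passengers, raw_json):
        rows = []
        # Walk the JSON files in the input dataset and flatten each record into
        # a (flight_alert_id, passenger_id) tuple. Field names are illustrative.
        for status in raw_json.filesystem().ls(glob="*.json"):
            with raw_json.filesystem().open(status.path) as f:
                for record in json.load(f):
                    rows.append((record["flight_alert_id"], record["passenger_id"]))

        df = ctx.spark_session.createDataFrame(rows, ["flight_alert_id", "passenger_id"])
        passengers.write_dataframe(df)  # written to Foundry as Parquet files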

🥅 Learning Objectives

  1. Understand raw file access from a transform in the Code Repositories application.
  2. Use Foundry APIs and packages to parse non-linear files into Parquet.

💪 Foundry Skills

  • Use the @transform() decorator to access raw files in Foundry.
  • Use additional Python libraries to parse non-Parquet data.
  • Use the Foundry Explorer helper.

Curriculum

  • About this course
  • Create Your Project Resources and Raw Data
  • Create Your Project Structure
  • Create Your Repository
  • Identify Your Raw Data
  • Copy the Raw Data Source into Your Project
  • Exercise Summary
  • Preprocess Your Non-standard Files
  • Preprocess Your Data, Part 1
  • Preprocess Your Data, Part 2
  • Viewing Module Contents
  • Exercise Summary
  • Python Unit Tests
  • Python Unit Tests
  • Conclusion
  • Key Takeaways
  • Next Steps
