Joining Datasets and Performing Aggregations with Spark

This notebook shows how to join datasets in Spark and perform a variety of aggregations using Spark SQL by completing a series of common tasks.

The Data

There are two data files that are used in this notebook (flights.parquet and airport-codes.csv), which can be found in the data folder of the GitHub repository.

File Schema

flights.parquet (3,606,803 total records)
- origin_airport_code: string (nullable = true)
- destination_airport_code: string (nullable = true)
- origin_city: string (nullable = true)
- destination_city: string (nullable = true)
- passengers: long (nullable = true)
- seats: long (nullable = true)
- flights: long (nullable = true)
- distance: double (nullable = true)
- origin_population: long (nullable = true) * destination_population: long (nullable = true)
- flight_year: long (nullable = true)
- flight_month: long (nullable = true)
- __index_level_0__: long (nullable = true)
airport-codes.csv (54,591 total records)
- ident: string (nullable = true)
- type: string (nullable = true)
- name: string (nullable = true)
- elevation_ft: double (nullable = true)
- continent: string (nullable = true)
- iso_country: string (nullable = true)
- iso_region: string (nullable = true)
- municipality: string (nullable = true)
- gps_code: string (nullable = true)
- iata_code: string (nullable = true)
- local_code: string (nullable = true)
- coordinates: string (nullable = true)

Tasks

Load the data and print the schema
Join the flight data to airport codes data by matching the IATA code of the originating flight to the IATA code in the airport codes file, then print the schema
Remove the following columns from the joined DataFrame:
- __index_level_0__
- ident
- local_code
- continent
- iso_country
- iata_code
Rename the following columns:
- type: origin_airport_type
- name: origin_airport_name
- elevation_ft: origin_airport_elevation_ft
- iso_region: origin_airport_region
- municipality: origin_airport_municipality
- gps_code: origin_airport_gps_code
- coordinates: origin_airport_coordinates
Repeat steps 2-4 for destinations:
- Join the data on destination airport instead of the origin airport
- Drop the same columns
- Rename the same columns using the prefix destination_airport_ instead of origin_airport_
- Print the schema
Create a dataframe using only data from 2008. This dataframe will be a report of the top ten airports by the number of inbound passengers. This dataframe should contain the following fields:
- Rank (1-10)
- Name
- IATA code
- Total Inbound Passengers
- Total Inbound Flights
- Average Daily Passengers
- Average Inbound Flights
Create a user-defined function in Python that will convert the string coordinates into numeric coordinates
Add new columns for:
- destination_airport_longitude
- destination_airport_latitude
- origin_airport_longitude
- origin_airport_latitude
  - Then, print the schema

Rather than providing all of the schema printouts here, they can all be found in the notebook located in the GitHub repository for this project.

Author

Xander Hieken

Share on

Twitter Facebook LinkedIn

Spark Aggregations

Xander Hieken

Joining Datasets and Performing Aggregations with Spark

The Data

File Schema

Tasks

Rather than providing all of the schema printouts here, they can all be found in the notebook located in the GitHub repository for this project.

Author

Share on

You may also enjoy

PUBG Behavioral Analysis

Twitter Getter

Spark Data Warehouse

Predicting NHL Contracts