# IMDb Data Analysis with Spark

Tags: big data, distributed pipeline

## Overview

A university project extended into a full benchmarking study, processing the IMDb dataset to extract insights about genre growth trends, top-rated directors, and the most prolific crew members across decades.

The original implementation used the Scala Spark RDD API with optimisations including HashPartitioner, early filtering, and reduceByKey aggregations, and it received a 95% grade. It was later re-implemented with the PySpark SQL/DataFrame API on the full 2.28 GB dataset of more than 41M records, with additional optimisations (Adaptive Query Execution, column pruning, broadcast joins), and packaged as a reproducible Docker container with a CLI benchmarking tool.
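
The DataFrame-side optimisations named above can be illustrated with a minimal sketch. It is not the project's actual code: the application name, the function names, and the assumption that the input is the public IMDb `title.basics` TSV file (columns `tconst`, `titleType`, `startYear`, `genres`, with `\N` marking missing values) are all illustrative. The configuration keys are standard Spark SQL settings.

```python
# Standard Spark SQL settings; the chosen values are illustrative assumptions.
SPARK_CONF = {
    # Adaptive Query Execution: re-optimise the plan using runtime statistics.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Tables below this size (bytes) are broadcast in joins automatically.
    # 10 MB is Spark's default, written out here so it is easy to tune.
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
}


def build_session(app_name="imdb-analysis"):
    """Create a SparkSession with the settings above (hypothetical helper)."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def genre_counts_by_decade(spark, basics_path):
    """Count movies per (decade, genre); a hypothetical example query."""
    from pyspark.sql import functions as F

    titles = (
        spark.read.option("sep", "\t").option("header", True).csv(basics_path)
        # Column pruning: keep only the columns the query needs.
        .select("tconst", "titleType", "startYear", "genres")
        # Early filtering: drop non-movies and rows with a missing year.
        .filter((F.col("titleType") == "movie") & (F.col("startYear") != "\\N"))
    )
    return (
        titles
        .withColumn("decade", F.floor(F.col("startYear").cast("int") / 10) * 10)
        .withColumn("genre", F.explode(F.split("genres", ",")))
        .groupBy("decade", "genre")
        .count()
    )
```

For an explicit broadcast join, the smaller side of a join can be wrapped in `pyspark.sql.functions.broadcast(...)` instead of relying on the size threshold.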

## Results

Both implementations were benchmarked across 10 trials on identical Spark configurations and business logic.
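
A repeated-trial measurement of this kind can be sketched as follows. This is a generic timing harness under stated assumptions, not the project's CLI benchmarking tool; the `benchmark` function and its return format are hypothetical.

```python
import statistics
import time


def benchmark(task, trials=10):
    """Run `task` `trials` times and summarise wall-clock runtimes in seconds."""
    runtimes = []
    for _ in range(trials):
        start = time.perf_counter()
        # For Spark, `task` must end in an action (e.g. count or collect);
        # otherwise only lazy plan construction is timed.
        task()
        runtimes.append(time.perf_counter() - start)
    return {
        "trials": trials,
        "mean": statistics.mean(runtimes),
        "stdev": statistics.stdev(runtimes) if trials > 1 else 0.0,
        "runtimes": runtimes,
    }


# Stand-in workload; a real run would pass a function that executes one task.
result = benchmark(lambda: sum(range(100_000)))
print(f"mean {result['mean']:.4f} s over {result['trials']} trials")
```

Reporting the standard deviation alongside the mean helps show whether early trials, which include JVM warm-up and cold caches, skew the average.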

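| Task | Scala RDD (mean) | PySpark SQL (mean) |
|------|------------------|--------------------|
| Task 1 | 3.51 s | 0.20 s |
| Task 2 | 11.08 s | 4.17 s |
| Task 3 | 17.97 s | 1.86 s |
| Task 4 | 26.70 s | 8.18 s |
| Total | 59.26 s | 14.41 s |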

PySpark SQL reduced the total mean runtime by 75.7% (59.26 s → 14.41 s), primarily driven by the Catalyst optimiser's query planning and Tungsten's cache-aware memory management. The low-level RDD API bypasses both, so it cannot benefit from them.
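
The headline figure can be reproduced from the per-task means reported in the results. The sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Mean runtimes in seconds, as reported in the results.
scala_rdd = {"Task 1": 3.51, "Task 2": 11.08, "Task 3": 17.97, "Task 4": 26.70}
pyspark_sql = {"Task 1": 0.20, "Task 2": 4.17, "Task 3": 1.86, "Task 4": 8.18}


def reduction_pct(before, after):
    """Percentage runtime reduction going from `before` to `after`."""
    return (before - after) / before * 100


per_task = {
    task: reduction_pct(scala_rdd[task], pyspark_sql[task]) for task in scala_rdd
}
total_before = sum(scala_rdd.values())
total_after = sum(pyspark_sql.values())
total_reduction = reduction_pct(total_before, total_after)

for task, pct in per_task.items():
    print(f"{task}: {pct:.1f}% faster")
print(f"Total: {total_before:.2f} s -> {total_after:.2f} s ({total_reduction:.1f}%)")
# Last line printed: Total: 59.26 s -> 14.41 s (75.7%)
```

The 75.7% is the reduction in summed runtime, so it is weighted toward the slower tasks. Per-task reductions range from about 62% (Task 2) to about 94% (Task 1).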

## Links

Source code and setup instructions on GitHub