A university project extended into a full benchmarking study. It processes the IMDb dataset to extract insights about genre growth trends, top-rated directors, and the most prolific crew members across decades.
The original implementation used the Scala Spark RDD API with optimisations including HashPartitioner, early filtering, and reduceByKey aggregations, and achieved a 95% grade. It was later re-implemented with the PySpark SQL/DataFrame API on the full 2.28 GB dataset of more than 41 million records, with additional optimisations (Adaptive Query Execution, column pruning, broadcast joins), and packaged as a reproducible Docker container with a CLI benchmarking tool.
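As a rough illustration of how the SQL-side optimisations might be switched on, the sketch below collects a few Spark SQL options in a plain dictionary and renders them as `spark-submit` flags. The option keys are standard Spark SQL settings, but the values, the `SPARK_CONF` name, and the `to_submit_args` helper are assumptions made for this example, not the project's actual configuration. Column pruning has no switch of its own: the optimiser applies it when a query selects only the columns it needs.

```python
# Assumed values for illustration only; the keys are standard Spark SQL options.
SPARK_CONF = {
    # Adaptive Query Execution: re-plan stages using runtime statistics.
    "spark.sql.adaptive.enabled": "true",
    # Let AQE merge small shuffle partitions after a stage completes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Broadcast tables below this size (bytes) instead of shuffling them.
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf flags."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(SPARK_CONF)))
```

Keeping the settings in one place makes it easier to run both implementations under the same configuration, which matters when comparing timings.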
Both implementations were benchmarked over 10 trials each, using identical Spark configurations and the same business logic.
| Task | Scala RDD (mean) | PySpark SQL (mean) |
|---|---|---|
| Task 1 | 3.51 s | 0.20 s |
| Task 2 | 11.08 s | 4.17 s |
| Task 3 | 17.97 s | 1.86 s |
| Task 4 | 26.70 s | 8.18 s |
| Total | 59.26 s | 14.41 s |
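Means like those in the table can be produced by a simple repeated-trial timer. The following is a minimal sketch of that idea, not the project's CLI tool: `benchmark` and `placeholder_task` are hypothetical names, and the placeholder stands in for a real Spark job.

```python
import statistics
import time


def benchmark(task, trials=10):
    """Run task() `trials` times and return (mean, stdev) in wall-clock seconds."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def placeholder_task():
    # Hypothetical stand-in for one Spark task; replace with the real job.
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    mean_s, stdev_s = benchmark(placeholder_task)
    print(f"mean {mean_s:.4f} s, stdev {stdev_s:.4f} s over 10 trials")
```

When timing Spark code this way, the task should end in an action (for example a count or a write), because transformations are lazy and would otherwise return almost immediately.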
PySpark SQL reduced total mean runtime by 75.7% (59.26 s → 14.41 s). The gain is primarily driven by the Catalyst optimiser's query planning and Tungsten's cache-aware memory management, advantages the low-level RDD API cannot leverage.
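The headline figure can be re-derived from the per-task means in the table. This short check uses only the published numbers; the `MEANS` dictionary and the `reduction_pct` helper are names introduced for the example.

```python
# Per-task mean runtimes in seconds, copied from the table: (Scala RDD, PySpark SQL).
MEANS = {
    "Task 1": (3.51, 0.20),
    "Task 2": (11.08, 4.17),
    "Task 3": (17.97, 1.86),
    "Task 4": (26.70, 8.18),
}


def reduction_pct(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


total_rdd = sum(rdd for rdd, _ in MEANS.values())
total_sql = sum(sql for _, sql in MEANS.values())

if __name__ == "__main__":
    for task, (rdd, sql) in MEANS.items():
        print(f"{task}: {rdd / sql:.1f}x faster, {reduction_pct(rdd, sql):.1f}% lower")
    print(f"Total: {reduction_pct(total_rdd, total_sql):.1f}% lower")
```

The 75.7% figure is the reduction in the summed means. Per-task gains differ widely, from under 3x on Task 2 to more than 17x on Task 1.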
Source code and setup instructions on GitHub