Asynchronous Join Order Optimization and Job Scheduling in Apache Spark

Ulm University

Master's thesis final presentation, Milos Babic, Location: O27/545, Date: 29.11.2016, Time: 14:00

With the exponential growth of data generation and consumption, cluster-based computing systems such as Apache Spark are built to handle large amounts of data. Spark defines the Resilient Distributed Dataset (RDD), a distributed data structure that represents distributed data and provides operations on it, including joins. When a query is executed, the join order has a great impact on execution performance. Whereas a sequential join order executes one join after another, a bushy join order allows multiple joins to be executed in parallel on a cluster.
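The difference between the two join orders can be sketched with a small plan model. This is an illustration in plain Python, not Spark code and not taken from the thesis: the plan representation (nested tuples) and the helper names `left_deep`, `bushy`, and `depth` are assumptions made for this example, and it ignores join predicates and costs. It only shows that a bushy plan over the same relations has a shorter chain of dependent joins.

```python
# A join plan is either a relation name (str) or a pair (left, right) of sub-plans.
# All names here are hypothetical and chosen for illustration only.

def left_deep(relations):
    """Build ((R1 join R2) join R3) join ...: each join waits for the previous one."""
    plan = relations[0]
    for rel in relations[1:]:
        plan = (plan, rel)
    return plan

def bushy(relations):
    """Build a balanced tree by pairing neighbouring sub-plans level by level."""
    level = list(relations)
    while len(level) > 1:
        paired = [(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # odd one out moves up unchanged
        level = paired
    return level[0]

def depth(plan):
    """Number of join steps on the longest path from a leaf to the root."""
    if isinstance(plan, str):
        return 0
    left, right = plan
    return 1 + max(depth(left), depth(right))

tables = ["A", "B", "C", "D"]
sequential_plan = left_deep(tables)  # ((('A', 'B'), 'C'), 'D')
bushy_plan = bushy(tables)           # (('A', 'B'), ('C', 'D'))
```

For four relations, the sequential plan has three dependent join steps, while the bushy plan has two, because the joins of A with B and of C with D do not depend on each other.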

The goal of this thesis is to find ways to enable Spark to execute a query with a bushy join order asynchronously, in order to utilize the cluster's resources as fully as possible. The result is that Spark does provide such functionality, but only under certain conditions and only when the RDDs are manually rearranged into a bushy join order.
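The idea of asynchronous execution can be sketched as follows. This is a minimal, hedged illustration using only the Python standard library, not the mechanism developed in the thesis: the toy relations `a` to `d`, the `join` helper, and the use of a thread pool are all assumptions for this example. In Spark itself, the documented analogue is that jobs submitted from separate threads of one application can run concurrently; whether the thesis relies on exactly this is not stated in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor

def join(left, right):
    """Inner equi-join of two lists of (key, values) pairs, using a hash index."""
    index = {}
    for key, values in right:
        index.setdefault(key, []).append(values)
    return [(key, lv + rv) for key, lv in left for rv in index.get(key, [])]

# Hypothetical toy relations: each row is (key, tuple_of_values).
a = [(1, ("a1",)), (2, ("a2",))]
b = [(1, ("b1",)), (2, ("b2",))]
c = [(1, ("c1",)), (3, ("c3",))]
d = [(1, ("d1",)), (2, ("d2",))]

def run_sequential():
    """Sequential order: every join waits for the result of the previous one."""
    return join(join(join(a, b), c), d)

def run_bushy_async():
    """Bushy order: the two independent sub-joins are submitted concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ab = pool.submit(join, a, b)  # does not depend on c or d
        cd = pool.submit(join, c, d)  # does not depend on a or b
        # The final join can only start once both sub-results are available.
        return join(ab.result(), cd.result())
```

Both functions return the same rows for these inputs; only the shape of the dependency chain differs. On a real cluster the benefit depends on whether free resources exist to run the independent sub-joins at the same time.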

In addition, Spark SQL, the top-level module built to process relational data more efficiently on Spark, can also build queries with a bushy join order. However, the user must again build the join order manually, because Spark SQL's relational optimizer does not yet optimize the join order. This thesis therefore provides findings that can serve as a basis for further research on automating join ordering in Spark and Spark SQL.