Enhancing Apache Spark for Enhanced Flexibility and Performance

This proposal aims to introduce the physical plan conversion, validation, and fallback mechanisms from the Gluten project into Apache Spark.

Jun 8th, 2024 10:00am by Mark Andreev

Featued image for: Enhancing Apache Spark for Enhanced Flexibility and Performance

Image by Ulrike Mai from Pixabay.

This proposal aims to introduce the physical plan conversion, validation, and fallback mechanisms from the Gluten project into Apache Spark. This will allow Spark to have greater flexibility and robustness when executing physical plans, while also taking advantage of the performance optimizations provided by Gluten.

Motivation

Apache Spark currently lacks an official mechanism to support cross-platform execution of physical plans. The Gluten project offers a mechanism that utilizes the Substrait standard to convert and optimize Spark’s physical plans. By introducing Gluten’s plan conversion, validation, and fallback mechanisms into Spark, we can significantly enhance the portability and interoperability of Spark’s physical plans, enabling them to operate across a broader spectrum of execution environments without requiring users to migrate, while also improving Spark’s execution efficiency through the utilization of Gluten’s advanced optimization techniques.

Design Proposal

To integrate Gluten’s plan conversion mechanism within Spark, we first defined an interface called TransformSupport, which inherits from SparkPlan. The TransformSupport interface includes two methods: def validate(): Boolean for validating if this operator/expression is supported in native code, and def doTransform(): SubstraitPlan for performing the plan conversion. Next, we defined three interfaces — LeafTransformSupport, UnaryTransformSupport, and BinaryTransformSupport — to adapt to different types of operators. For example, ProjectExecTransformer will inherit from UnaryTransformSupport, HashJoinTransformer will inherit from BinaryTransformSupport, and DatasourceScanTransformer will inherit from LeafTransformSupport. This way, we can extend SparkPlan to support plan conversion.

2. During the physical plan validation process, we will pass the converted Substrait plan in Protobuf format to different native backends for validation. This process includes checking supported function signatures and data types. If the validation is successful, it returns true; otherwise, it returns false.

3. If the validation fails, we will continue to use Spark’s own operators to execute the plan, but at this point, support for column-to-row (C2R) and row-to-column (R2C) data transformations is required. Taking the Velox backend as an example, we need to convert Velox’s columnar data format to Spark’s UnsafeRow format in the C2R transformation; conversely, in the R2C transformation, we convert Spark’s UnsafeRow format to Velox’s columnar data format.

To implement the execution of Spark physical plans on a native engine, we need to complete three key steps: plan transformation, validation, and fallback. Given that these steps involve substantial modifications to Spark, we have decided to first focus on the plan transformation phase. By utilizing Substrait, we will convert Spark plans into a common format, thereby achieving compatibility with various backends. After the first step is completed, we will further consider the remaining validation and fallback steps.

Conclusions

We have adeptly incorporated this conversion methodology into Apache Gluten, enabling seamless support for both ClickHouse and Velox backends. Our integration efforts with either backend have culminated in a substantial performance breakthrough. For an in-depth look at the enhancements, please refer to the information provided at this link. Moreover, it is with great pride that we acknowledge the successful deployment of Gluten by numerous customers within their production settings.

Mark Andreev is a Senior Software Engineer living in London, UK, specializing in data-intensive applications. His focus is Java related projects that help to process a large amount of data. During his career Mark was a data scientist which influenced...