Skip to content

[SPARK-38841][SQL] Enable Bloom filter Joins by default#36102

Closed
andylam-db wants to merge 7 commits into
apache:masterfrom
andylam-db:ENABLE_BLOOM_FILTER
Closed

[SPARK-38841][SQL] Enable Bloom filter Joins by default#36102
andylam-db wants to merge 7 commits into
apache:masterfrom
andylam-db:ENABLE_BLOOM_FILTER

Conversation

@andylam-db

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Enable Bloom Filter Joins by default.

Why are the changes needed?

Improve the performance of some joins by pre-filtering one side of a join using a Bloom filter. See: https://issues.apache.org/jira/browse/SPARK-32268 for performance benchmarks.

How was this patch tested?

N/A

@github-actions github-actions Bot added the SQL label Apr 7, 2022
@andylam-db andylam-db force-pushed the ENABLE_BLOOM_FILTER branch from 6b42b47 to 722f881 Compare April 7, 2022 17:32
@AmplabJenkins

Copy link
Copy Markdown

Can one of the admins verify this patch?

@andylam-db

Copy link
Copy Markdown
Contributor Author

@sigmod

.version("3.3.0")
.booleanConf
.createWithDefault(false)
.createWithDefault(true)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is this change for master or 3.3? Seems this is a new feature added in 3.3, isn't it safer to keep it as false by default? Usually we choose to be conservative instead of aggressive on enabling new feature.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for master only, and will not be merged to 3.3.
This way, new PRs to the master will less likely to break this feature.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. makes sense.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunchao Maybe you can enable vectorized Parquet reader by default in master branch only too. So it won't be silently broken.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's a good idea. Let me switch it.

@sigmod

sigmod commented Apr 8, 2022

Copy link
Copy Markdown
Contributor

cc @wangyum

@wangyum

wangyum commented Apr 8, 2022

Copy link
Copy Markdown
Member

Could we use a new ticket ID? Otherwise users might get confused: same ticket ID, but different behavior in 3.3 and 3.4.

@andylam-db andylam-db changed the title [SPARK-32268][SQL][FOLLOWUP] Enable Bloom filter Joins by default [SPARK-38841][SQL][FOLLOWUP] Enable Bloom filter Joins by default Apr 9, 2022
@sigmod

sigmod commented Apr 9, 2022

Copy link
Copy Markdown
Contributor

Could we use a new ticket ID? Otherwise users might get confused: same ticket ID, but different behavior in 3.3 and 3.4.

@andylam-db has updated this PR with a new ticket ID.

@wangyum wangyum changed the title [SPARK-38841][SQL][FOLLOWUP] Enable Bloom filter Joins by default [SPARK-38841][SQL] Enable Bloom filter Joins by default Apr 9, 2022
@wangyum wangyum closed this in 1cb3bcd Apr 9, 2022
@wangyum

wangyum commented Apr 9, 2022

Copy link
Copy Markdown
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants