[SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce #46597
base: master
Conversation
Force-pushed 01c6706 → 365e639
Force-pushed 63d22f2 → 3758e43
Force-pushed 9329234 → ec22116
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Review thread on sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala (resolved)
Review thread on sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala (resolved)
@uros-db This is all cleaned up. Shall we get some of the other reviewers to look at it?
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Since the Mode expression works with any child expression, and you special-cased the handling of Strings, how do we handle Array(String), Struct(String), etc.?
In my local tests, I found that Mode performs a byte-by-byte comparison for structs, which does not consider collation. So that is still outstanding. Good catch! @uros-db There are several strategies we might adopt to handle structs with collation fields. I am looking into implementations. It is potentially straightforward, though it has some gotchas. Do you feel I should solve that in a separate PR or in this one? I assume you would prefer that this gets solved in this PR and not in a follow-up PR, right?
I have added an implementation for mode to support structs whose fields use the various collations. Performance is not great so far. I will add the benchmark results from GHA once I get your feedback. I haven't yet added support for collation for mode on array types, as the "Collation Support in Spark" design doc says support for that is TBD. So I wanted to check in on whether you think I should add support for that now or as a follow-up.
What I would really like to try is to move from this implementation to an approach that has the collation-support logic in the PartialAggregation stage. But as it has already been a couple of weeks of development on this, I believe we should, for this PR, confine all the collation logic to the stage that can't be serialized and deserialized.
I wouldn't say there's a preference on whether to include support for string types and complex types within the same PR -- if you think the changes might end up being too large, it's fine to split them into separate PRs. However, we need to make sure there's no unexpected behaviour: for example, MODE shouldn't have correct support for collated StringType but incorrect behaviour for ArrayType(StringType), StructType(...StringType...), etc. With that in mind, it seems that we should adopt one of two approaches:
Also note that covering StringTypes which are fields of a StructType is not by itself enough -- suppose there's a field of a StructType that is another StructType that has a field of collated StringType, and so on. The same goes for arrays: handling ArrayType(StringType) is not enough by itself -- we also need ArrayType(ArrayType(StringType)). In short, I would say that we need a recursive approach to properly handle all possible collated string instances.
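The recursive approach described above can be sketched roughly as follows. This is a minimal illustration, not Spark's internal representation: the `Value` ADT and the lowercase "collation key" are stand-ins (a real implementation would walk Spark's `DataType` tree and use the actual collator).

```scala
// Map every string leaf of an arbitrarily nested value to its collation key,
// so two values differing only by collation-equivalent strings become equal.
sealed trait Value
final case class Str(s: String) extends Value
final case class Arr(items: List[Value]) extends Value
final case class Struct(fields: List[Value]) extends Value

object RecursiveCollationKey {
  private def collationKey(s: String): String = s.toLowerCase // stand-in

  // Recurse through arrays and structs, rewriting only the string leaves.
  def key(v: Value): Value = v match {
    case Str(s)         => Str(collationKey(s))
    case Arr(items)     => Arr(items.map(key))
    case Struct(fields) => Struct(fields.map(key))
  }
}
```

With this, `key(Struct(List(Str("A"))))` equals `key(Struct(List(Str("a"))))`, so arbitrarily nested values can be grouped by their key.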
But then of course there's the problem of preserving one of the actual values -- you correctly noticed that we can't just return the collationKey, as that value might not be present in the original array. I suppose a separate map might do the trick here (mapping collationKey to the original string value), and since we don't have a preference towards which value gets returned, simply returning the first one that appeared is considered correct behaviour.
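A hedged sketch of that suggestion (not Spark's actual Mode code): group values by a collation key, with lowercase standing in for the real collator, count per equivalence class, and return the first original value seen in the winning class -- never the key itself, which may not occur in the input.

```scala
object CollatedModeSketch {
  private def collationKey(s: String): String = s.toLowerCase // stand-in

  def mode(values: Seq[String]): Option[String] = {
    // key -> original values; groupBy preserves encounter order per group
    val groups = values.groupBy(collationKey)
    if (groups.isEmpty) None
    else Some(groups.maxBy(_._2.size)._2.head) // first original in largest class
  }
}
```

For example, `CollatedModeSketch.mode(Seq("a", "A", "b"))` returns `Some("a")`: the class {"a", "A"} wins with count 2, and "a" appeared first.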
@uros-db If you are fine with me splitting it into two PRs, that's what I will do! I will modify this PR to fail for complex types that have collated strings, and I will get the PR implementing full (recursive) support for said complex types ready for review right after this one is merged. I appreciate your flexibility!
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (resolved)
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Force-pushed d460964 → 51f397c
@uros-db I have made changes for all but your latest suggestion (re whitelists -- will add that soon).
Force-pushed a80a394 → 1fae9d9
Squashed commit messages: latest review; added checkInputDataTypes to not support complex types containing non-binary collations; added struct test stuff; tests pass; test structs; fix scalastyle; Collation Support for Mode.
Force-pushed 1fae9d9 → 0bab248
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (outdated, resolved)
Force-pushed 8e365a1 → b071d17
@uros-db Should I also add collation support to …? The only difference will be …
Applied review suggestion to …essions/aggregate/Mode.scala (Co-authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>)
Force-pushed b071d17 → f054589
@uros-db ?
We can leave it now that you've explored various options and finished the …
@uros-db When should I add back support for complex types? Should I wait until we have buy-in for the current approach from @dbatomic @nikolamand-db @stefankandic @stevomitric, or should I do it now?
(I no longer think the code for support for complex types needs to be a separate PR.)
What changes were proposed in this pull request?
SPARK-47353
Comparing experimental approaches:
- Scala TreeMap (RB Tree)
- GroupMapReduce <- most performant
- GroupMapReduce (cleaned up; this PR) <- most performant
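The two approaches compared above can be sketched on plain strings as follows. This is illustrative only, not Spark code: a case-insensitive ordering stands in for the real collation comparator, and both functions compute the mode of a sequence of strings.

```scala
import scala.collection.immutable.TreeMap

object ApproachSketch {
  // Case-insensitive ordering as a stand-in for a real collation comparator.
  val ci: Ordering[String] = Ordering.by(_.toLowerCase)

  // TreeMap (RB Tree) approach: the collation-aware Ordering merges
  // equivalent keys into a single bucket as counts are accumulated.
  def modeTreeMap(values: Seq[String]): Option[String] = {
    val counts = values.foldLeft(TreeMap.empty[String, Long](ci)) {
      (m, v) => m.updated(v, m.getOrElse(v, 0L) + 1L)
    }
    counts.maxByOption(_._2).map(_._1)
  }

  // GroupMapReduce approach: one pass, counting per collation key while
  // carrying along one original value per equivalence class.
  def modeGroupMapReduce(values: Seq[String]): Option[String] =
    values.groupMapReduce(_.toLowerCase)(v => (v, 1L))((a, b) => (a._1, a._2 + b._2))
      .values.maxByOption(_._2).map(_._1)
}
```

Both return a member of the most frequent equivalence class; `groupMapReduce` avoids the tree's log-time updates, which is consistent with it benchmarking as the most performant here.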
Central change to Mode: the `eval` method now checks if the column being looked at is a string with non-default collation and, if so, uses a grouping (GroupMapReduce) approach.

Minor change to Mode: `collationId`, a new lazy value computed from the `dataType` of the `child` expression, used to fetch the appropriate collation comparator when `collationEnabled` is true.

This PR will fail for complex types containing collated strings; a follow-up PR will implement that.
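The eval-time branch described above can be sketched as follows. This is a simplification, not the actual `Mode.eval` (which works over Spark's internal count buffer, with the key function derived from the collator for `collationId`): with no collation key, the plain max-count lookup runs; with one, counts are re-grouped by collation key first.

```scala
object ModeEvalSketch {
  def modeEval[T](buffer: Map[T, Long], key: Option[T => T]): Option[T] =
    key match {
      case None => buffer.maxByOption(_._2).map(_._1)
      case Some(k) =>
        buffer.toSeq
          .groupMapReduce { case (v, _) => k(v) } (vc => vc) {
            (a, b) => (a._1, a._2 + b._2) // merge counts, keep one original value
          }
          .values.maxByOption(_._2).map(_._1)
    }
}
```

For example, with counts `Map("a" -> 2, "A" -> 2, "b" -> 3)` and a lowercase key, the "a"/"A" class merges to count 4 and wins over "b".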
Unit test enhancements: significant additions to `CollationStringExpressionsSuite` to test new functionality, including the `Mode` function when handling strings with different collation settings.

Benchmark updates: extended the `CollationBenchmark` classes to include benchmarks for the new mode functionality, with and without collation settings, as well as numerical types.

Why are the changes needed?
Does this PR introduce any user-facing change?
Yes, this PR introduces the following user-facing changes:
- Adds a `collationEnabled` property to the `Mode` expression.
- Allows the `Mode` expression to customize its behavior.

How was this patch tested?
This patch was tested through a combination of new and existing unit and end-to-end SQL tests.
- The `Mode` function correctly handles strings with different collation settings.
Out of scope: special Unicode cases (higher planes).
Tests do not need to include null handling.
Benchmark tests:
Manual testing:
Was this patch authored or co-authored using generative AI tooling?
Nope!