inject chukonu #46607

Open
wants to merge 891 commits into base: master
Conversation

fanzi2009

inject chukonu

Hisoka-X and others added 30 commits June 29, 2023 16:38
…se array as struct using PERMISSIVE mode with corrupt record

### What changes were proposed in this pull request?
Cherry-pick apache#41662 to fix the parse-array-as-struct bug on branch 3.4.
### Why are the changes needed?
Fixes the bug when parsing an array as a struct using PERMISSIVE mode with a corrupt record column.
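
For context, a hypothetical illustration of PERMISSIVE-mode parsing with a corrupt-record column (this is not the exact reproduction from the PR; it assumes a spark-shell `spark` session and illustrative data):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import spark.implicits._

// A struct schema that also declares the corrupt-record column; under
// PERMISSIVE mode, rows that cannot be parsed into the struct should land
// in _corrupt_record instead of failing the query.
val schema = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType)

Seq("""[{"a": 1}]""", """{"a": 2}""").toDF("json")
  .select(from_json($"json", schema, Map("mode" -> "PERMISSIVE")).as("parsed"))
  .show(false)
```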

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add new test.

Closes apache#41784 from Hisoka-X/SPARK-44079_3.4_cherry_pick.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ationTimeout to zero or negative will cause incessant executor cons/destructions

### What changes were proposed in this pull request?

This PR treats negative values of io.connectionTimeout/connectionCreationTimeout as zero. Zero here means:
- connectionCreationTimeout = 0: an unlimited CONNECTION_TIMEOUT for connection establishment
- connectionTimeout = 0: the `IdleStateHandler` for triggering `IdleStateEvent` is disabled.
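
A minimal sketch of the clamping described above (illustrative helper; not the actual Spark transport configuration code):

```scala
// Negative timeouts are clamped to 0, so ChannelFuture.await is never called
// with a negative value and the IdleStateHandler is simply disabled.
def sanitizeTimeoutMs(configuredMs: Long): Long =
  if (configuredMs < 0) 0L else configuredMs

val connectionCreationTimeoutMs = sanitizeTimeoutMs(-1L) // => 0: unlimited connection establishment
val connectionTimeoutMs = sanitizeTimeoutMs(0L)          // => 0: IdleStateHandler disabled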

### Why are the changes needed?

1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to netty, but ChannelFuture.await(0) fails directly and inappropriately.
2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new unit tests

Closes apache#41785 from yaooqinn/SPARK-44241.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 38645fa)
Signed-off-by: Kent Yao <yao@apache.org>
… merge dir is less than conf

### What changes were proposed in this pull request?
Fixed a minor issue with diskBlockManager after push-based shuffle is enabled

### Why are the changes needed?
This bug affects the efficiency of push-based shuffle.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes apache#40412 from Stove-hust/feature-42784.

Authored-by: meifencheng <meifencheng@meituan.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 35d5157)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
…a RuntimeException

### What changes were proposed in this pull request?
The executor expects `numChunks` to be > 0. If it is zero, then we see that the executor fails with
```
23/06/20 19:07:37 ERROR task 2031.0 in stage 47.0 (TID 25018) Executor: Exception in task 2031.0 in stage 47.0 (TID 25018)
java.lang.ArithmeticException: / by zero
	at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:128)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:1047)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:90)
	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
```
Because this is an `ArithmeticException`, the executor doesn't fall back. It's not a `FetchFailure` either, so the stage is not retried and the application fails.

### Why are the changes needed?
The executor should fall back to fetching the original blocks rather than fail, because a zero `numChunks` indicates an issue with the push-merged block.
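
A minimal sketch of the guard (illustrative; not the exact Spark code):

```scala
// A push-merged meta response reporting zero chunks is treated as a corrupt
// merged block, so the fetch iterator can fall back to the original blocks
// instead of failing later with a divide-by-zero.
def validateNumChunks(shuffleId: Int, reduceId: Int, numChunks: Int): Unit = {
  if (numChunks <= 0) {
    throw new RuntimeException(
      s"Invalid meta response for push-merged block (shuffleId=$shuffleId, " +
      s"reduceId=$reduceId): numChunks=$numChunks")
  }
}
```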

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Modified the existing UTs to validate that a RuntimeException is thrown when numChunks is 0.

Closes apache#41762 from otterc/SPARK-44215.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 3e72806)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
…ere is a char/varchar column in the schema

### What changes were proposed in this pull request?

apache#38823 added support for defining generated columns in create table statements. This included generation expression validation. This validation currently erroneously fails when there are char or varchar columns anywhere in the table schema since the checkAnalysis fails here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GeneratedColumn.scala#L123.

This PR replaces any char/varchar columns in the schema with a string before analysis.
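
A minimal sketch of the idea, assuming the existing `CharVarcharUtils.replaceCharVarcharWithString` helper (the exact call site in `GeneratedColumn` differs):

```scala
import org.apache.spark.sql.catalyst.util.CharVarcharUtils
import org.apache.spark.sql.types.StructType

// Before analyzing the generation expression, replace char/varchar with plain
// string so checkAnalysis does not reject the schema.
def schemaForValidation(tableSchema: StructType): StructType =
  CharVarcharUtils.replaceCharVarcharWithString(tableSchema).asInstanceOf[StructType]
```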

### Why are the changes needed?

The following statement should not fail:
```
CREATE TABLE default.example (
    name VARCHAR(64),
    tstamp TIMESTAMP,
    tstamp_date DATE GENERATED ALWAYS AS (CAST(tstamp as DATE))
)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Adds a unit test.

Closes apache#41868 from allisonport-db/validateGeneratedColumns-charvarchar.

Authored-by: Allison Portis <allison.portis@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit f0e1828)
Signed-off-by: Kent Yao <yao@apache.org>
…ll outer USING join

### What changes were proposed in this pull request?

For full outer joins employing USING, set the nullability of the coalesced join columns to true.

### Why are the changes needed?

The following query produces incorrect results:
```
create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values (2, 3) as (c1, c2);

select explode(array(c1)) as x
from v1
full outer join v2
using (c1);

-1   <== should be null
1
2
```
The following query fails with a `NullPointerException`:
```
create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values ('2', 3) as (c1, c2);

select explode(array(c1)) as x
from v1
full outer join v2
using (c1);

23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
...
```
The above full outer joins implicitly add an aliased coalesce to the parent projection of the join: `coalesce(v1.c1, v2.c1) as c1`. In the case where only one side's key is nullable, the coalesce's nullability is false. As a result, the generator's output has nullable set as false. But this is incorrect: If one side has a row with explicit null key values, the other side's row will also have null key values (because the other side's row will be "made up"), and both the `coalesce` and the `explode` will return a null value.

While `UpdateNullability` actually repairs the nullability of the `coalesce` before execution, it doesn't recreate the generator output, so the nullability remains incorrect in `Generate#output`.
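
A minimal sketch of the intended fix (illustrative; the real change is in how the analyzer builds the coalesced USING column for full outer joins):

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Coalesce}

// For a full outer USING join, treat both join keys as nullable before
// coalescing, so the resulting column (and anything derived from it, such as
// a generator's output) correctly reports nullable = true.
def coalescedUsingColumn(leftKey: Attribute, rightKey: Attribute): Alias = {
  val coalesced = Coalesce(Seq(leftKey.withNullability(true), rightKey.withNullability(true)))
  Alias(coalesced, leftKey.name)()
}
```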

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

Closes apache#41809 from bersprockets/using_oddity2.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 7a27bc6)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
…imeZone

### What changes were proposed in this pull request?

Apply `ResolveTimeZone` for the plan generated by `DistributionAndOrderingUtils#prepareQuery`.
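
A minimal sketch of the idea (illustrative; the real change applies the rule inside `DistributionAndOrderingUtils#prepareQuery`):

```scala
import org.apache.spark.sql.catalyst.analysis.ResolveTimeZone
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// After type coercion inserts implicit casts, resolve time zones so any
// TimeZoneAwareExpression (e.g. a Cast to DATE) has its zoneId populated
// before code generation runs.
def finishPreparedQuery(coercedPlan: LogicalPlan): LogicalPlan =
  ResolveTimeZone(coercedPlan)
```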

### Why are the changes needed?

In SPARK-39607, we only applied `typeCoercionRules` to the plan generated by `DistributionAndOrderingUtils#prepareQuery`. This is not enough: the following exception is thrown if a `TimeZoneAwareExpression` participates in the implicit cast.

```
23/06/25 07:30:58 WARN UnsafeProjection: Expr codegen error and falling back to interpreter mode
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:63)
	at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:63)
	at org.apache.spark.sql.catalyst.expressions.Cast.zoneId$lzycompute(Cast.scala:491)
	at org.apache.spark.sql.catalyst.expressions.Cast.zoneId(Cast.scala:491)
	at org.apache.spark.sql.catalyst.expressions.Cast.castToDateCode(Cast.scala:1655)
	at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeCastFunction(Cast.scala:1335)
	at org.apache.spark.sql.catalyst.expressions.Cast.doGenCode(Cast.scala:1316)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:200)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:195)
	at org.apache.spark.sql.catalyst.expressions.Cast.genCode(Cast.scala:1310)
	at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.$anonfun$prepareArguments$3(objects.scala:124)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.prepareArguments(objects.scala:123)
	at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.prepareArguments$(objects.scala:91)
	at org.apache.spark.sql.catalyst.expressions.objects.Invoke.prepareArguments(objects.scala:363)
	at org.apache.spark.sql.catalyst.expressions.objects.Invoke.doGenCode(objects.scala:414)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:200)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:195)
	at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.$anonfun$prepareArguments$3(objects.scala:124)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.prepareArguments(objects.scala:123)
	at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.prepareArguments$(objects.scala:91)
	at org.apache.spark.sql.catalyst.expressions.objects.Invoke.prepareArguments(objects.scala:363)
	at org.apache.spark.sql.catalyst.expressions.objects.Invoke.doGenCode(objects.scala:414)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:200)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:195)
	at org.apache.spark.sql.catalyst.expressions.HashExpression.$anonfun$doGenCode$5(hash.scala:304)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.sql.catalyst.expressions.HashExpression.doGenCode(hash.scala:303)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:200)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:195)
	at org.apache.spark.sql.catalyst.expressions.Pmod.doGenCode(arithmetic.scala:1068)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:200)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:195)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
	at scala.collection.immutable.List.map(List.scala:293)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
	at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:290)
	at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:338)
	at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.generate(GenerateUnsafeProjection.scala:327)
	at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.createCodeGeneratedObject(Projection.scala:124)
	at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.createCodeGeneratedObject(Projection.scala:120)
	at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
	at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:151)
	at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:161)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.getPartitionKeyExtractor$1(ShuffleExchangeExec.scala:316)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$13(ShuffleExchangeExec.scala:384)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$13$adapted(ShuffleExchangeExec.scala:383)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:875)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:875)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

New tests are added.

Closes apache#41725 from pan3793/SPARK-44180.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d1d9760)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…Like`

### What changes were proposed in this pull request?
In the PR, I propose to check the number of argument types in the `InvokeLike` expressions. If the input types are provided, the number of types should be exactly the same as the number of argument expressions.

This is a backport of apache#41954.
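
A minimal sketch of the contract being enforced (illustrative helper; the actual check lives in the `InvokeLike` expressions):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.AbstractDataType

// When argument types are declared, there must be exactly one type per
// argument expression; otherwise the expression implementation is broken.
def checkArgumentTypes(arguments: Seq[Expression], inputTypes: Seq[AbstractDataType]): Unit = {
  require(inputTypes.isEmpty || inputTypes.length == arguments.length,
    s"Expected ${arguments.length} argument types, but found ${inputTypes.length}")
}
```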

### Why are the changes needed?
1. This PR checks the contract described in the comment explicitly:
https://github.com/apache/spark/blob/d9248e83bbb3af49333608bebe7149b1aaeca738/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L247

that can prevent the errors of expression implementations, and improve code maintainability.

2. It also fixes an issue in `UrlEncode` and `UrlDecode`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the related tests:
```
$ build/sbt "test:testOnly *UrlFunctionsSuite"
$ build/sbt "test:testOnly *DataSourceV2FunctionSuite"
```

Authored-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 3e82ac6)

Closes apache#41985 from MaxGekk/fix-url_decode-3.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

This PR upgrades snappy-java to 1.1.10.2. snappy-java 1.1.10.2 includes the following changes:

https://github.com/xerial/snappy-java/releases/tag/v1.1.10.2

### Why are the changes needed?

It seems that 1.1.10.1's libsnappy.so fails to uncompress on s390x.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes apache#41994 from wangyum/SPARK-44415.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 0a0c367)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
…ws that have Null as first column value

Ports back apache#42046 to 3.4.

### What changes were proposed in this pull request?
Change the serialization format for group-by-with-state outputs: include an explicit hidden column indicating how many data and state records there are.

### Why are the changes needed?
The current implementation of ApplyInPandasWithStatePythonRunner cannot handle outputs where the first column of the row is null: it cannot distinguish a genuinely null column from a field left unset because there are fewer data records than state records. This causes incorrect results in the former case.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add unit tests that cover null cases and different other scenarios.

Closes apache#42074 from siying/pandas34.

Authored-by: Siying Dong <siying.dong@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…-tests`

### What changes were proposed in this pull request?
This PR changes `k8s-integration-tests` on GitHub Actions to use `minikube` v1.30.1 as a temporary solution.
This PR also leaves a TODO:

- SPARK-44495: Resume to use the latest minikube for `k8s-integration-tests` on GitHub Action

### Why are the changes needed?
Restore `k8s-integration-tests` GA testing

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- `k8s-integration-tests` test pass on GitHub Action

Closes apache#42162 from LuciferYang/SPARK-44494-34.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?
cherry-pick apache#42146 to 3.4

### Why are the changes needed?
It cannot be cherry-picked cleanly, so this PR was made.

### Does this PR introduce _any_ user-facing change?
no, infra-only

### How was this patch tested?
updated CI

Closes apache#42172 from zhengruifeng/cp_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… testing

### What changes were proposed in this pull request?
The PR aims to ignore the `connect-check-protos` logic in GA testing for branch-3.4.

### Why are the changes needed?
Make GA happy.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes apache#42166 from panbingkun/branch-3.4_SPARK-44553.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…struct type

### What changes were proposed in this pull request?

This is a partial backport of apache#42161.

Fixes protobuf conversion from an empty struct type.

### Why are the changes needed?

The empty struct type was not properly converted to the protobuf message.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes apache#42179 from ueshin/issues/SPARK-44479/3.4/empty_schema.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…pip packaging test in GitHub Actions

### What changes were proposed in this pull request?

This PR proposes to remove untracked/ignored files before running pip packaging test in GitHub Actions.

### Why are the changes needed?

In order to fix the flakiness in the test such as:

```
...
creating dist
Creating tar archive
error: [Errno 28] No space left on device
Cleaning up temporary directory - /tmp/tmp.CvSzgB7Kyy
```

See also https://github.com/apache/spark/actions/runs/5665869112/job/15351515539.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

GitHub Actions build in this PR.

Closes apache#42159 from HyukjinKwon/debug-ci-failure.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 77aab6e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- The pr is for branch-3.4.
- The pr aims to upgrade snappy-java from  1.1.10.2 to 1.1.10.3.

### Why are the changes needed?
1. The newest version includes a bug fix:
- Fix the GLIBC_2.32 not found issue of libsnappyjava.so in certain Linux distributions on s390x by kun-lu20 in xerial/snappy-java#481

2. Release notes:
https://github.com/xerial/snappy-java/releases/tag/v1.1.10.3

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes apache#42127 from panbingkun/branch-3.4_snappy_1_1_10_3.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
run `run_python_packaging_tests` when there are any changes in PySpark

### Why are the changes needed?
apache#42146 made CI run `run_python_packaging_tests` only within `pyspark-errors` (see https://github.com/apache/spark/actions/runs/5666118302/job/15359190468 and https://github.com/apache/spark/actions/runs/5668071930/job/15358091003)

![image](https://github.com/apache/spark/assets/7322292/aef5cd4c-87ee-4b52-add3-e19ca131cdf1)

but it overlooked that `pyspark-errors` may be skipped (when there are no related source changes), so `run_python_packaging_tests` may also be skipped unexpectedly (see https://github.com/apache/spark/actions/runs/5666523657/job/15353485731)

![image](https://github.com/apache/spark/assets/7322292/c2517d39-efcf-4a95-8562-1507dad35794)

This PR makes CI run `run_python_packaging_tests` even if `pyspark-errors` is skipped.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
updated CI

Closes apache#42173 from zhengruifeng/infra_followup.

Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
(cherry picked from commit f794734)
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…cgAk

### What changes were proposed in this pull request?

This PR fixes the condition to raise the following warning in MLLib's RankingMetrics ndcgAk function: "# of ground truth set and # of relevance value set should be equal, check input data"

The logic for raising warnings is faulty at the moment: it raises a warning if the `rel` input is empty and `lab.size` and `rel.size` are not equal.

The logic should be to raise a warning if `rel` input is **not empty** and `lab.size` and `rel.size` are not equal.

This warning was added in the following PR: apache#36843
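
A minimal sketch of the corrected condition (illustrative; the actual check is inside `RankingMetrics`'s ndcgAt):

```scala
// Only warn about mismatched sizes when relevance values were actually
// supplied (the "non-binary" mode); an empty rel set is the binary mode and
// should not trigger the warning.
def maybeWarn(labSize: Int, relSize: Int, relIsEmpty: Boolean)(warn: String => Unit): Unit = {
  if (!relIsEmpty && labSize != relSize) {
    warn("# of ground truth set and # of relevance value set should be equal, check input data")
  }
}
```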

### Why are the changes needed?

With the current logic, RankingMetrics will:
- raise an incorrect warning when a user is using it in the "binary" mode (i.e. no relevance values in the input)
- not raise a warning (that could be necessary) when the user is using it in the "non-binary" mode (i.e. with relevance values in the input)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No change made to the test suite for RankingMetrics: https://github.com/uchiiii/spark/blob/a172172329cc78b50f716924f2a344517deb71fc/mllib/src/test/scala/org/apache/spark/mllib/evaluation/RankingMetricsSuite.scala

Closes apache#42207 from guilhem-depop/patch-1.

Authored-by: Guilhem Vuillier <101632595+guilhem-depop@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 72af2c0)
Signed-off-by: Sean Owen <srowen@gmail.com>
…ted behavior

Add documentation covering unexpected behavior of concat and concat_ws with respect to null values.

### What changes were proposed in this pull request?
Adds additional documentation to `concat` and `concat_ws`.

### Why are the changes needed?
The behavior of `concat` and `concat_ws` was unexpected w.r.t. null values, and the documentation did not help make their behavior clear.
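
A short illustration of the behavior being documented (assumes a spark-shell `spark` session; results shown in comments):

```scala
// concat returns NULL as soon as any input is NULL:
spark.sql("SELECT concat('ab', NULL, 'cd')").show()         // -> null

// concat_ws skips NULL inputs instead of propagating them:
spark.sql("SELECT concat_ws('-', 'ab', NULL, 'cd')").show() // -> ab-cd
```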

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Closes apache#42153 from watfordkcf/patch-1.

Lead-authored-by: Christopher Watford <132389385+watfordkcf@users.noreply.github.com>
Co-authored-by: Christopher Watford <christopher.watford@kcftech.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit ff022e5)
Signed-off-by: Kent Yao <yao@apache.org>
…ffle blocks

### What changes were proposed in this pull request?

Fix double encryption issue for migrated shuffle blocks

Shuffle blocks upon migration are sent without decryption when io.encryption is enabled. The code on the receiving side ends up using serializer.wrapStream on the OutputStream to the file which results in the already encrypted bytes being encrypted again when the bytes are written out.

This patch removes the usage of serializerManager.wrapStream on the receiving side and also adds tests that validate that this works as expected. I have also validated that the added tests will fail if the fix is not in place.

Jira ticket with more details: https://issues.apache.org/jira/browse/SPARK-44588
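
A minimal sketch of the receiving-side change (illustrative; the actual fix is in the shuffle block migration upload path):

```scala
import java.io.OutputStream

// Migrated shuffle block bytes are already encrypted on disk at the sender,
// so the receiver must write them through unchanged. Wrapping the file stream
// again with serializerManager.wrapStream would encrypt twice.
def sinkForMigratedShuffleBlock(fileOutputStream: OutputStream): OutputStream =
  fileOutputStream // intentionally NOT serializerManager.wrapStream(blockId, fileOutputStream)
```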

### Why are the changes needed?

Migrated shuffle blocks will be double encrypted when `spark.io.encryption = true` without this fix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests were added to test shuffle block migration with spark.io.encryption enabled and also fixes a test helper method to properly construct the SerializerManager with the encryption key.

Closes apache#42279 from henrymai/branch-3.4_backport_double_encryption.

Authored-by: Henry Mai <henrymai@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ormance of MapOutputTracker.updateMapOutput"

### What changes were proposed in this pull request?

This reverts commit f68ece9 from `branch-3.4` only.

### Why are the changes needed?

SPARK-43043 (apache#40690) was an improvement PR but it introduced a regression when it landed at Apache Spark 3.4.1.
```scala
  test("getMapOutputLocation should not throw NPE") {
    val tracker = newTrackerMaster()
    try {
      tracker.registerShuffle(0, 1, 1)
      tracker.registerMapOutput(0, 0, MapStatus(BlockManagerId("exec-1", "hostA", 1000),
        Array(2L), 0))
      tracker.removeOutputsOnHost("hostA")
      assert(tracker.getMapOutputLocation(0, 0) == None)
    } finally {
      tracker.stop()
    }
  }
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes apache#42285 from dongjoon-hyun/SPARK-44630.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

[SPARK-42730][CONNECT][DOCS]
Add start-connect-server.sh/stop-connect-server.sh to this list and cover Spark Connect sessions - other changes needed here.

### Why are the changes needed?

[SPARK-42730][CONNECT][DOCS]
Add start-connect-server.sh/stop-connect-server.sh to this list and cover Spark Connect sessions - other changes needed here.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Spark Document related patch tested.

Closes apache#42307 from pegasas/doc.

Authored-by: pegasas <616672335@qq.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3b3e301)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
This PR aims to add test coverage for Apache Spark 4.0/3.5/3.4.
It depends on SPARK-44658 (apache#42323) but is created separately because this aims to land on `branch-3.4` too.

To prevent a future regression.

No.

Pass the CIs.

Closes apache#42326 from dongjoon-hyun/SPARK-44661.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 9fbf0b4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
We have a long-standing tricky optimization in `Dataset.union`, which invokes the optimizer rule `CombineUnions` to pre-optimize the analyzed plan. This is to avoid an overly large analyzed plan for the specific dataframe query pattern `df1.union(df2).union(df3).union...`.

This tricky optimization can break dataframe caching, but we thought it was fine as people usually won't cache an intermediate dataframe in a union chain. However, `CombineUnions` gets improved from time to time (e.g. apache#35214) and now it can optimize a wide range of Union patterns. Now it's possible that people union two dataframes, do something with `select`, and cache the result. Then the dataframe is unioned again with other dataframes and people expect the df cache to work, but the cache won't work due to the tricky optimization in `Dataset.union`.

This PR updates `Dataset.union` to only combine adjacent Unions to match the original purpose.
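
An illustration of the query pattern whose cache now survives (hypothetical dataframes `df1`, `df2`, `df3`; assumes a spark-shell session):

```scala
// Cache an intermediate union result...
val cached = df1.union(df2).select($"id", $"value").cache()
cached.count() // materialize the cache

// ...then union it with another dataframe. Previously Dataset.union re-ran
// CombineUnions over the whole analyzed plan, flattening through the select,
// so the cached plan no longer matched and the cache was silently skipped.
// Combining only adjacent Unions keeps the cached subtree intact.
val result = cached.union(df3)
```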

Fix perf regression due to breaking df caching

no

new test

Closes apache#42315 from cloud-fan/union.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ce1fe57)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…DataFrameConversionTest.get_excel_dfs' test to work with Python 3.7

### What changes were proposed in this pull request?

The fix is to use openpyxl by default instead of xlrd.

### Why are the changes needed?

The test_to_excel test case was failing in Python 3.7, asking for xlrd, which is unable to read xlsx files. This improves test coverage.

### Does this PR introduce _any_ user-facing change?

No, it's test only

### How was this patch tested?

Manually

Closes apache#42339 from Madhukar98/branch-3.4.

Authored-by: madlnu <madlnu@visa.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…eans with type arguments

### What changes were proposed in this pull request?
This is a port of [42327](apache#42327)

This PR fixes a regression introduced in Spark 3.4.x where Encoders.bean is no longer able to process nested beans having type arguments. For example:

```
class A<T> {
   T value;
   // value getter and setter
}

class B {
   A<String> stringHolder;
   // stringHolder getter and setter
}

Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: [ENCODER_NOT_FOUND]..."
```

### Why are the changes needed?
The main match in JavaTypeInference.encoderFor does not handle the ParameterizedType and TypeVariable cases. I think this is a regression introduced after getting rid of the guava TypeToken usage: [SPARK-42093 SQL Move JavaTypeInference to AgnosticEncoders](apache@1867200#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b) cc hvanhovell cloud-fan

In this PR I'm leveraging commons-lang3 TypeUtils functionality to resolve ParameterizedType type arguments for classes.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests have been extended to check correct encoding of a nested bean having type arguments.

Closes apache#42379 from gbloisi-openaire/spark-44634-branch-3.4.

Authored-by: Giambattista Bloisi <giambattista.bloisi@openaire.eu>
Signed-off-by: Herman van Hovell <herman@databricks.com>
…not triggered

This PR makes sure we use unique partition values when calculating the final partitions in `BatchScanExec`, to make sure no duplicated partitions are generated.

When `spark.sql.sources.v2.bucketing.pushPartValues.enabled` and `spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled` are enabled, and SPJ is not triggered, currently Spark will generate incorrect/duplicated results.

This is because with both configs enabled, Spark will delay the partition grouping until the time it calculates the final partitions used by the input RDD. To calculate the partitions, it uses partition values from the `KeyGroupedPartitioning` to find out the right ordering for the partitions. However, since grouping is not done when the partition values are computed, there could be duplicated partition values. This means the result could contain duplicated partitions too.
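
A minimal sketch of the deduplication (illustrative; the real change wraps `InternalRow` so equality is value-based before grouping in `BatchScanExec`):

```scala
import scala.collection.mutable

// Partition values from KeyGroupedPartitioning may repeat when grouping is
// deferred; keep only the first occurrence of each value while preserving
// order, so no partition group is emitted twice.
def uniquePartitionValues[K](partitionKeys: Seq[K]): Seq[K] = {
  val seen = mutable.LinkedHashSet.empty[K]
  partitionKeys.foreach(seen += _)
  seen.toSeq
}
```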

No, this is a bug fix.

Added a new test case for this scenario.

Closes apache#42324 from sunchao/SPARK-44641.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Chao Sun <sunchao@apple.com>
(cherry picked from commit aa1261d)
Signed-off-by: Chao Sun <sunchao@apple.com>
…SchemaIterator and config parsing of CONNECT_GRPC_ARROW_MAX_BATCH_SIZE

Fixes the limit checking of `maxEstimatedBatchSize` and `maxRecordsPerBatch` to respect the more restrictive limit and fixes the config parsing of `CONNECT_GRPC_ARROW_MAX_BATCH_SIZE` by converting the value to bytes.

Bugfix.
In the arrow writer [code](https://github.com/apache/spark/blob/6161bf44f40f8146ea4c115c788fd4eaeb128769/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L154-L163), the conditions don't match what the documentation says regarding "maxBatchSize and maxRecordsPerBatch, respect whatever is smaller": the code actually respects whichever limit is "larger" (i.e. less restrictive) due to the || operator.

Further, when the `CONNECT_GRPC_ARROW_MAX_BATCH_SIZE` conf is read, the value is not converted to bytes from MiB ([example](https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/SparkConnectPlanExecution.scala#L103)).
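
A minimal sketch of the two corrections (illustrative variable names, not the actual ArrowConverters code):

```scala
// 1. Respect the MORE restrictive limit: close the current batch as soon as
//    either the estimated size or the record count hits its cap.
def batchIsFull(estimatedBytes: Long, maxBytes: Long, records: Long, maxRecords: Long): Boolean =
  estimatedBytes >= maxBytes || records >= maxRecords

// 2. CONNECT_GRPC_ARROW_MAX_BATCH_SIZE is expressed in MiB (per the description
//    above), so convert it to bytes before comparing against estimated sizes.
def maxBatchSizeInBytes(confValueMiB: Long): Long = confValueMiB * 1024L * 1024L
```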

No.

Existing tests.

Closes apache#42321 from vicennial/SPARK-44657.

Authored-by: vicennial <venkata.gudesa@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f9d417f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR aims to document `spark.network.timeoutInterval` configuration.

### Why are the changes needed?

Like `spark.network.timeout`, `spark.network.timeoutInterval` exists since Apache Spark 1.3.x.

https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31ed3ad998995b8/core/src/main/scala/org/apache/spark/internal/config/Network.scala#L48-L52

Since this is a user-facing configuration like the following, we had better document it.
https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31ed3ad998995b8/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala#L91-L93

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual because this is a doc-only change.

Closes apache#42402 from dongjoon-hyun/SPARK-44725.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3af2e77)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…GI from SecurityManager of ApplicationMaster

### What changes were proposed in this pull request?

This PR makes the SecurityManager instance a lazy value.

### Why are the changes needed?

fix the bug in issue [SPARK-44581](https://issues.apache.org/jira/browse/SPARK-44581)

**Bug:**
In Spark 3.2 it throws an org.apache.hadoop.security.AccessControlException, but in Spark 2.4 this hook does not throw an exception.

I rebuilt the hadoop-client-api.jar to add some debug logging before the hadoop shutdown hook is created, and rebuilt the spark-yarn.jar to add some debug logging when creating the Spark shutdown hook manager; here is a screenshot of the log:
![image](https://github.com/apache/spark/assets/62563545/ea338db3-646c-432c-bf16-1f445adc2ad9)

We can see from the screenshot that the ShutdownHookManager is initialized before the ApplicationMaster creates a new UGI.

**Reason**

The main cause is that the ShutdownHook thread is created before we create the UGI in ApplicationMaster.

When we set the config key "hadoop.security.credential.provider.path", the ApplicationMaster tries to get a filesystem when generating SSLOptions; initializing the filesystem spawns a new thread whose UGI is inherited from the current process (yarn).
After this, the ApplicationMaster generates a new UGI (SPARK_USER) and executes the doAs() function.

Here is the chain of the call:
ApplicationMaster.(ApplicationMaster.scala:83) -> org.apache.spark.SecurityManager.(SecurityManager.scala:98) -> org.apache.spark.SSLOptions$.parse(SSLOptions.scala:188) -> org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:2353) -> org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2434) -> org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:82)
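
A minimal sketch of the change (illustrative class name; the actual field is in `ApplicationMaster`):

```scala
import org.apache.spark.{SecurityManager, SparkConf}

class ApplicationMasterSketch(sparkConf: SparkConf) {
  // Defer SecurityManager construction (and the SSLOptions / credential
  // provider lookup it triggers) until after the proper UGI has been set up,
  // by making the field lazy instead of eagerly initialized in the constructor.
  private lazy val securityMgr: SecurityManager = new SecurityManager(sparkConf)
}
```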

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

I didn't add a new unit test for this, but I rebuilt the package and ran a program on my cluster, and it turns out that the user deleting the staging file is now the same as SPARK_USER.

Closes apache#42405 from liangyu-1/SPARK-44581.

Authored-by: 余良 <yul165@chinaunicom.cn>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit e584ed4)
Signed-off-by: Kent Yao <yao@apache.org>
panbingkun and others added 4 commits May 10, 2024 19:54
…dencies.sh` script

### What changes were proposed in this pull request?
The PR aims to delete the `dev/pr-deps` directory after executing `test-dependencies.sh`.

### Why are the changes needed?
We'd better clean up the temporary files generated at the end of the run.
Before:
```
sh dev/test-dependencies.sh
```
<img width="569" alt="image" src="https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456">

After:
```
sh dev/test-dependencies.sh
```
<img width="534" alt="image" src="https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46531 from panbingkun/minor_test-dependencies.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit f699f55)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…rArrayTypeFromFirstElement

This PR fixes a bug that does not respect `spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled` in nested arrays, introduced by apache#36545.

To have a way to restore the original behaviour.

Yes, it fixes the regression when `spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled` is set to `True`.

Unittest added.

No.

Closes apache#46548 from HyukjinKwon/SPARK-48248.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit b2140d0)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Special case escaping for MySQL and fix issues with redundant escaping for ' character.

When pushing down startsWith, endsWith and contains they are converted to LIKE. This requires addition of escape characters for these expressions. Unfortunately, MySQL uses ESCAPE '\\' syntax instead of ESCAPE '\' which would cause errors when trying to push down.
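
A minimal sketch of the escaping involved (illustrative helper, not the actual dialect API):

```scala
// When startsWith/endsWith/contains are compiled to LIKE, wildcard and escape
// characters inside the literal must themselves be escaped.
def escapeForLike(value: String, escapeChar: Char = '\\'): String =
  value.flatMap {
    case c @ ('%' | '_')      => s"$escapeChar$c"
    case c if c == escapeChar => s"$escapeChar$c"
    case c                    => c.toString
  }

// e.g. contains("50%") compiles to ... LIKE '%50\%%' ESCAPE '\'
// MySQL expects the escape clause spelled as ESCAPE '\\', which is the
// dialect-specific tweak this change makes.
```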

Yes

Tests for each existing dialect.

No.

Closes apache#46437 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 47006a4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>