
[SPARK-48273][SQL] Fix late rewrite of PlanWithUnresolvedIdentifier #46580

Closed
nikolamand-db wants to merge 4 commits into apache:master from nikolamand-db:SPARK-48273

Conversation

nikolamand-db
Contributor

What changes were proposed in this pull request?

`PlanWithUnresolvedIdentifier` is rewritten later in analysis, which causes rules like `SubstituteUnresolvedOrdinals` to miss the new plan. This causes the following queries to fail:

```
create temporary view identifier('v1') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1);
--
cache table identifier('t1') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1);
--
create table identifier('t2') as (select my_col from (values (1), (2), (1)
as (my_col)) group by 1);
insert into identifier('t2') select my_col from (values (3) as (my_col)) group by 1;
```

Fix this by explicitly applying rules after plan rewrite.
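Concretely, once the identifier expression is evaluated and the real plan is built, the early analyzer batches are applied to the result explicitly. A minimal sketch of the idea (`earlyBatchExecutor` is an illustrative name for the `RuleExecutor` carrying those batches, not the exact merged code):

```scala
// Sketch: re-run the early analyzer batches on the plan produced by the
// identifier rewrite, so rules such as SubstituteUnresolvedOrdinals that
// have already run do not miss the late-built plan.
case p: PlanWithUnresolvedIdentifier if p.identifierExpr.resolved =>
  val rewritten = p.planBuilder.apply(evalIdentifierExpr(p.identifierExpr))
  earlyBatchExecutor.execute(rewritten) // earlyBatchExecutor: RuleExecutor[LogicalPlan]
```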

Why are the changes needed?

To fix the described bug.

Does this PR introduce any user-facing change?

Yes, it fixes the mentioned problematic queries.

How was this patch tested?

Updated the existing `identifier-clause.sql` golden file.
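(The golden file can be regenerated by re-running `SQLQueryTestSuite` with the `SPARK_GENERATE_GOLDEN_FILES=1` environment variable set.)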

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 14, 2024
@nikolamand-db nikolamand-db changed the title [SPARK-48273] Fix late rewrite of PlanWithUnresolvedIdentifier [SPARK-48273][SQL] Fix late rewrite of PlanWithUnresolvedIdentifier May 14, 2024
```diff
@@ -254,7 +254,7 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
       TypeCoercion.typeCoercionRules
     }
 
-  override def batches: Seq[Batch] = Seq(
+  private val earlyBatches: Seq[Batch] = Seq(
```
Contributor

let's keep it as a `def`. It may not be a big deal, but I'd like to keep the previous behavior as much as we can: we create a new instance of the rule for each analysis execution.
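A self-contained toy illustrating the distinction (illustrative names, not Spark code): a `def` produces fresh instances on every call, while a `val` pins one instance for the analyzer's lifetime.

```scala
// Toy model of the def-vs-val point; RuleLike stands in for an analyzer rule.
object DefVsVal extends App {
  class RuleLike
  // `def`: every analysis execution gets a brand-new rule instance.
  def batchesAsDef: Seq[RuleLike] = Seq(new RuleLike)
  // `val`: one instance is created eagerly and shared from then on.
  val batchesAsVal: Seq[RuleLike] = Seq(new RuleLike)

  assert(batchesAsDef.head ne batchesAsDef.head) // two calls, two instances
  assert(batchesAsVal.head eq batchesAsVal.head) // the same cached instance
}
```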

```diff
 import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_IDENTIFIER
 import org.apache.spark.sql.types.StringType
 
 /**
  * Resolves the identifier expressions and builds the original plans/expressions.
  */
-object ResolveIdentifierClause extends Rule[LogicalPlan] with AliasHelper with EvalHelper {
+class ResolveIdentifierClause(executor: RuleExecutor[LogicalPlan]) extends Rule[LogicalPlan]
```
Contributor

ditto, let's pass in the rules and create the `RuleExecutor` within this rule, as the passed rules may be new instances for each analysis execution
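In sketch form (assuming `RuleExecutor[LogicalPlan]#Batch` as the path-dependent batch type; the merged code may differ in detail):

```scala
// Sketch: take the rules from the Analyzer, which can recreate them for each
// analysis execution, and construct the executor inside the rule itself.
class ResolveIdentifierClause(earlyBatches: Seq[RuleExecutor[LogicalPlan]#Batch])
  extends Rule[LogicalPlan] with AliasHelper with EvalHelper
```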

@cloud-fan
Contributor

I think this fixes https://issues.apache.org/jira/browse/SPARK-46625 as well. Can we add a test to verify?

@nikolamand-db
Contributor Author

> I think this fixes https://issues.apache.org/jira/browse/SPARK-46625 as well. Can we add a test to verify?

Checked locally; it seems these changes don't resolve the issue:

```
spark-sql (default)> explain extended WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)), T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd')) SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');

== Parsed Logical Plan ==
CTE [S, T]
:  :- 'SubqueryAlias S
:  :  +- 'UnresolvedSubqueryColumnAliases [c1, c2]
:  :     +- 'UnresolvedInlineTable [col1, col2], [[1, 2], [2, 3]]
:  +- 'SubqueryAlias T
:     +- 'UnresolvedSubqueryColumnAliases [c1, c2]
:        +- 'UnresolvedInlineTable [col1, col2], [[a, b], [c, d]]
+- 'Project [unresolvedalias(expressionwithunresolvedidentifier(max, expressionwithunresolvedidentifier(c1, <function2>), org.apache.spark.sql.catalyst.parser.AstBuilder$$Lambda$1453/0x0000000401c95138@312ddc51))]
   +- 'PlanWithUnresolvedIdentifier T, org.apache.spark.sql.catalyst.parser.AstBuilder$$Lambda$1420/0x0000000401c5e688@330f3046

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `c1` cannot be resolved. Did you mean one of the following? [`Z`, `s`]. SQLSTATE: 42703; line 1 pos 129;
'WithCTE
:- CTERelationDef 0, false
:  +- SubqueryAlias S
:     +- Project [col1#5 AS c1#9, col2#6 AS c2#10]
:        +- LocalRelation [col1#5, col2#6]
:- CTERelationDef 1, false
:  +- SubqueryAlias T
:     +- Project [col1#7 AS c1#11, col2#8 AS c2#12]
:        +- LocalRelation [col1#7, col2#8]
+- 'Project [unresolvedalias('max('c1))]
   +- SubqueryAlias spark_catalog.default.t
      +- Relation spark_catalog.default.t[Z#13,s#14] parquet
```

`PlanWithUnresolvedIdentifier` gets resolved, but it tries to use table `t` from the catalog instead of the CTE definition.


```scala
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
    _.containsAnyPattern(UNRESOLVED_IDENTIFIER)) {
    case p: PlanWithUnresolvedIdentifier if p.identifierExpr.resolved =>
      p.planBuilder.apply(evalIdentifierExpr(p.identifierExpr))
      val executor = new RuleExecutor[LogicalPlan] {
```
Contributor

We only need to create one `RuleExecutor` instance in this rule.
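One way to do that, sketched from the fragments above (the `asInstanceOf` bridges the path-dependent `Batch` type; the actual follow-up commit may differ):

```scala
// Sketch: hoist the executor out of apply() into a member, so a single
// RuleExecutor is created per rule instance rather than per matched operator.
private val executor = new RuleExecutor[LogicalPlan] {
  override def batches: Seq[Batch] = earlyBatches.asInstanceOf[Seq[Batch]]
}

override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
  _.containsAnyPattern(UNRESOLVED_IDENTIFIER)) {
  case p: PlanWithUnresolvedIdentifier if p.identifierExpr.resolved =>
    executor.execute(p.planBuilder.apply(evalIdentifierExpr(p.identifierExpr)))
}
```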

Contributor Author

Added, please check again.

@cloud-fan
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in 731a2cf May 28, 2024
cloud-fan pushed a commit that referenced this pull request May 28, 2024

Closes #46580 from nikolamand-db/SPARK-48273.

Authored-by: Nikola Mandic <nikola.mandic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 731a2cf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@LuciferYang
Contributor

There are new failing cases in `SQLQueryTestSuite` and `ThriftServerQueryTestSuite` after this one was merged into branch-3.5:

```
[info] - identifier-clause.sql *** FAILED *** (2 seconds, 408 milliseconds)
[info]   "" did not contain "Exception" Exception did not match for query #96
[info]   create table identifier('t2') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1), expected: , but got: java.sql.SQLException
[info]   org.apache.hive.service.cli.HiveSQLException: Error running query: [NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT] org.apache.spark.sql.AnalysisException: [NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT] CREATE Hive TABLE (AS SELECT) is not supported, if you want to enable it, please set "spark.sql.catalogImplementation" to "hive".;
[info]   'CreateTable `spark_catalog`.`default`.`t2`, ErrorIfExists
[info]   +- Aggregate [my_col#20215], [my_col#20215]
[info]      +- SubqueryAlias __auto_generated_subquery_name
[info]         +- SubqueryAlias as
[info]            +- LocalRelation [my_col#20215]
[info]   
[info]   	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
[info]   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
[info]   	at java.security.AccessController.doPrivileged(Native Method)
[info]   	at javax.security.auth.Subject.doAs(Subject.java:422)
[info]   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
[info]   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[info]   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   	at java.lang.Thread.run(Thread.java:750)
[info]   Caused by: org.apache.spark.sql.AnalysisException: [NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT] CREATE Hive TABLE (AS SELECT) is not supported, if you want to enable it, please set "spark.sql.catalogImplementation" to "hive".;
[info]   'CreateTable `spark_catalog`.`default`.`t2`, ErrorIfExists
[info]   +- Aggregate [my_col#20215], [my_col#20215]
[info]      +- SubqueryAlias __auto_generated_subquery_name
[info]         +- SubqueryAlias as
[info]            +- LocalRelation [my_col#20215]
[info] - identifier-clause.sql *** FAILED *** (1 second, 208 milliseconds)
[info]   identifier-clause.sql
[info]   Expected "[]", but got "[org.apache.spark.sql.catalyst.ExtendedAnalysisException
[info]   {
[info]     "errorClass" : "NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT",
[info]     "messageParameters" : {
[info]       "cmd" : "CREATE Hive TABLE (AS SELECT)"
[info]     }
[info]   }]" Result did not match for query #96
[info]   create table identifier('t2') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1) (SQLQueryTestSuite.scala:876)
```

cc @nikolamand-db @cloud-fan

@cloud-fan
Contributor

fixing at #46794

cloud-fan added a commit that referenced this pull request May 29, 2024
…tifier-clause.sql

### What changes were proposed in this pull request?

A followup of #46580. It's better to create non-Hive tables in the tests so that the change is backport-safe, as old branches create Hive tables by default.

### Why are the changes needed?

fix branch-3.5 CI

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46794 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
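A hedged illustration of the kind of test change that sidesteps the Hive default (assuming a `SparkSession` named `spark`; the actual edit in #46794 may differ):

```scala
// With an explicit datasource, CTAS no longer depends on
// spark.sql.catalogImplementation, so branch-3.5 (which otherwise creates a
// Hive table and fails with NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT when
// Hive support is off) passes.
spark.sql(
  """create table identifier('t2') using parquet as
    |  (select my_col from (values (1), (2), (1) as (my_col)) group by 1)""".stripMargin)
```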
cloud-fan added a commit that referenced this pull request May 29, 2024
…tifier-clause.sql

Closes #46794 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit cf47293)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request May 30, 2024
### What changes were proposed in this pull request?

This is a followup of #46580 to update golden files.

### Why are the changes needed?

fix CI

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46800 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>