
[SPARK-48273][SQL] Fix late rewrite of PlanWithUnresolvedIdentifier #46580

Closed
nikolamand-db wants to merge 4 commits into apache:master from nikolamand-db:SPARK-48273

Conversation

nikolamand-db
Contributor

What changes were proposed in this pull request?

`PlanWithUnresolvedIdentifier` is rewritten later in analysis, which causes rules like `SubstituteUnresolvedOrdinals` to miss the new plan. This causes the following queries to fail:

```
create temporary view identifier('v1') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1);
--
cache table identifier('t1') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1);
--
create table identifier('t2') as (select my_col from (values (1), (2), (1)
as (my_col)) group by 1);
insert into identifier('t2') select my_col from (values (3) as (my_col)) group by 1;
```

Fix this by explicitly applying rules after plan rewrite.
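Concretely, once the identifier expression is evaluated and the real plan is built, the early analyzer batches are applied to the result explicitly. A minimal sketch of the idea (`earlyBatchExecutor` is an illustrative name for the `RuleExecutor` carrying those batches, not the exact merged code):

```scala
// Sketch: re-run the early analyzer batches on the plan produced by the
// identifier rewrite, so rules such as SubstituteUnresolvedOrdinals that
// have already run do not miss the late-built plan.
case p: PlanWithUnresolvedIdentifier if p.identifierExpr.resolved =>
  val rewritten = p.planBuilder.apply(evalIdentifierExpr(p.identifierExpr))
  earlyBatchExecutor.execute(rewritten) // earlyBatchExecutor: RuleExecutor[LogicalPlan]
```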

Why are the changes needed?

To fix the described bug.

Does this PR introduce any user-facing change?

Yes, it fixes the mentioned problematic queries.

How was this patch tested?

Updated the existing `identifier-clause.sql` golden file.
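(The golden file can be regenerated by re-running `SQLQueryTestSuite` with the `SPARK_GENERATE_GOLDEN_FILES=1` environment variable set.)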

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 14, 2024
@nikolamand-db nikolamand-db changed the title [SPARK-48273] Fix late rewrite of PlanWithUnresolvedIdentifier [SPARK-48273][SQL] Fix late rewrite of PlanWithUnresolvedIdentifier May 14, 2024
```diff
@@ -254,7 +254,7 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
       TypeCoercion.typeCoercionRules
     }
 
-  override def batches: Seq[Batch] = Seq(
+  private val earlyBatches: Seq[Batch] = Seq(
```
Contributor

let's keep it as a `def`. It may not be a big deal, but I'd like to keep the previous behavior as much as we can: we create a new instance of the rule for each analysis execution.
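A self-contained toy illustrating the distinction (illustrative names, not Spark code): a `def` produces fresh instances on every call, while a `val` pins one instance for the analyzer's lifetime.

```scala
// Toy model of the def-vs-val point; RuleLike stands in for an analyzer rule.
object DefVsVal extends App {
  class RuleLike
  // `def`: every analysis execution gets a brand-new rule instance.
  def batchesAsDef: Seq[RuleLike] = Seq(new RuleLike)
  // `val`: one instance is created eagerly and shared from then on.
  val batchesAsVal: Seq[RuleLike] = Seq(new RuleLike)

  assert(batchesAsDef.head ne batchesAsDef.head) // two calls, two instances
  assert(batchesAsVal.head eq batchesAsVal.head) // the same cached instance
}
```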

```diff
 import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_IDENTIFIER
 import org.apache.spark.sql.types.StringType
 
 /**
  * Resolves the identifier expressions and builds the original plans/expressions.
  */
-object ResolveIdentifierClause extends Rule[LogicalPlan] with AliasHelper with EvalHelper {
+class ResolveIdentifierClause(executor: RuleExecutor[LogicalPlan]) extends Rule[LogicalPlan]
```
Contributor

ditto, let's pass in the rules and create the `RuleExecutor` within this rule, as the passed rules may be new instances for each analysis execution
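In sketch form (assuming `RuleExecutor[LogicalPlan]#Batch` as the path-dependent batch type; the merged code may differ in detail):

```scala
// Sketch: take the rules from the Analyzer, which can recreate them for each
// analysis execution, and construct the executor inside the rule itself.
class ResolveIdentifierClause(earlyBatches: Seq[RuleExecutor[LogicalPlan]#Batch])
  extends Rule[LogicalPlan] with AliasHelper with EvalHelper
```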

@cloud-fan
Contributor

I think this fixes https://issues.apache.org/jira/browse/SPARK-46625 as well. Can we add a test to verify?

@nikolamand-db
Contributor Author

> I think this fixes https://issues.apache.org/jira/browse/SPARK-46625 as well. Can we add a test to verify?

Checked locally; it seems these changes don't resolve the issue:

```
spark-sql (default)> explain extended WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)), T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd')) SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');

== Parsed Logical Plan ==
CTE [S, T]
:  :- 'SubqueryAlias S
:  :  +- 'UnresolvedSubqueryColumnAliases [c1, c2]
:  :     +- 'UnresolvedInlineTable [col1, col2], [[1, 2], [2, 3]]
:  +- 'SubqueryAlias T
:     +- 'UnresolvedSubqueryColumnAliases [c1, c2]
:        +- 'UnresolvedInlineTable [col1, col2], [[a, b], [c, d]]
+- 'Project [unresolvedalias(expressionwithunresolvedidentifier(max, expressionwithunresolvedidentifier(c1, <function2>), org.apache.spark.sql.catalyst.parser.AstBuilder$$Lambda$1453/0x0000000401c95138@312ddc51))]
   +- 'PlanWithUnresolvedIdentifier T, org.apache.spark.sql.catalyst.parser.AstBuilder$$Lambda$1420/0x0000000401c5e688@330f3046

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `c1` cannot be resolved. Did you mean one of the following? [`Z`, `s`]. SQLSTATE: 42703; line 1 pos 129;
'WithCTE
:- CTERelationDef 0, false
:  +- SubqueryAlias S
:     +- Project [col1#5 AS c1#9, col2#6 AS c2#10]
:        +- LocalRelation [col1#5, col2#6]
:- CTERelationDef 1, false
:  +- SubqueryAlias T
:     +- Project [col1#7 AS c1#11, col2#8 AS c2#12]
:        +- LocalRelation [col1#7, col2#8]
+- 'Project [unresolvedalias('max('c1))]
   +- SubqueryAlias spark_catalog.default.t
      +- Relation spark_catalog.default.t[Z#13,s#14] parquet
```

`PlanWithUnresolvedIdentifier` gets resolved, but it tries to use table `t` from the catalog instead of the CTE definition.


```scala
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
    _.containsAnyPattern(UNRESOLVED_IDENTIFIER)) {
    case p: PlanWithUnresolvedIdentifier if p.identifierExpr.resolved =>
      p.planBuilder.apply(evalIdentifierExpr(p.identifierExpr))
      val executor = new RuleExecutor[LogicalPlan] {
```
Contributor

We only need to create one `RuleExecutor` instance in this rule.
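One way to do that, sketched from the fragments above (the `asInstanceOf` bridges the path-dependent `Batch` type; the actual follow-up commit may differ):

```scala
// Sketch: hoist the executor out of apply() into a member, so a single
// RuleExecutor is created per rule instance rather than per matched operator.
private val executor = new RuleExecutor[LogicalPlan] {
  override def batches: Seq[Batch] = earlyBatches.asInstanceOf[Seq[Batch]]
}

override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
  _.containsAnyPattern(UNRESOLVED_IDENTIFIER)) {
  case p: PlanWithUnresolvedIdentifier if p.identifierExpr.resolved =>
    executor.execute(p.planBuilder.apply(evalIdentifierExpr(p.identifierExpr)))
}
```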

Contributor Author

Added, please check again.

@cloud-fan
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in 731a2cf May 28, 2024
cloud-fan pushed a commit that referenced this pull request May 28, 2024

Closes #46580 from nikolamand-db/SPARK-48273.

Authored-by: Nikola Mandic <nikola.mandic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 731a2cf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@LuciferYang
Contributor

There are new failing cases in `SQLQueryTestSuite` and `ThriftServerQueryTestSuite` after this one was merged into branch-3.5:

```
[info] - identifier-clause.sql *** FAILED *** (2 seconds, 408 milliseconds)
[info]   "" did not contain "Exception" Exception did not match for query #96
[info]   create table identifier('t2') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1), expected: , but got: java.sql.SQLException
[info]   org.apache.hive.service.cli.HiveSQLException: Error running query: [NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT] org.apache.spark.sql.AnalysisException: [NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT] CREATE Hive TABLE (AS SELECT) is not supported, if you want to enable it, please set "spark.sql.catalogImplementation" to "hive".;
[info]   'CreateTable `spark_catalog`.`default`.`t2`, ErrorIfExists
[info]   +- Aggregate [my_col#20215], [my_col#20215]
[info]      +- SubqueryAlias __auto_generated_subquery_name
[info]         +- SubqueryAlias as
[info]            +- LocalRelation [my_col#20215]
[info]   
[info]   	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
[info]   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
[info]   	at java.security.AccessController.doPrivileged(Native Method)
[info]   	at javax.security.auth.Subject.doAs(Subject.java:422)
[info]   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
[info]   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[info]   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   	at java.lang.Thread.run(Thread.java:750)
[info]   Caused by: org.apache.spark.sql.AnalysisException: [NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT] CREATE Hive TABLE (AS SELECT) is not supported, if you want to enable it, please set "spark.sql.catalogImplementation" to "hive".;
[info]   'CreateTable `spark_catalog`.`default`.`t2`, ErrorIfExists
[info]   +- Aggregate [my_col#20215], [my_col#20215]
[info]      +- SubqueryAlias __auto_generated_subquery_name
[info]         +- SubqueryAlias as
[info]            +- LocalRelation [my_col#20215]
[info] - identifier-clause.sql *** FAILED *** (1 second, 208 milliseconds)
[info]   identifier-clause.sql
[info]   Expected "[]", but got "[org.apache.spark.sql.catalyst.ExtendedAnalysisException
[info]   {
[info]     "errorClass" : "NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT",
[info]     "messageParameters" : {
[info]       "cmd" : "CREATE Hive TABLE (AS SELECT)"
[info]     }
[info]   }]" Result did not match for query #96
[info]   create table identifier('t2') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1) (SQLQueryTestSuite.scala:876)
```

cc @nikolamand-db @cloud-fan

@cloud-fan
Contributor

fixing at #46794

cloud-fan added a commit that referenced this pull request May 29, 2024
…tifier-clause.sql

### What changes were proposed in this pull request?

A followup of #46580. It's better to create non-Hive tables in the tests so that the change is backport-safe, as old branches create Hive tables by default.

### Why are the changes needed?

fix branch-3.5 CI

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46794 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
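A hedged illustration of the kind of test change that sidesteps the Hive default (assuming a `SparkSession` named `spark`; the actual edit in #46794 may differ):

```scala
// With an explicit datasource, CTAS no longer depends on
// spark.sql.catalogImplementation, so branch-3.5 (which otherwise creates a
// Hive table and fails with NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT when
// Hive support is off) passes.
spark.sql(
  """create table identifier('t2') using parquet as
    |  (select my_col from (values (1), (2), (1) as (my_col)) group by 1)""".stripMargin)
```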
cloud-fan added a commit that referenced this pull request May 29, 2024
…tifier-clause.sql

Closes #46794 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit cf47293)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request May 30, 2024
### What changes were proposed in this pull request?

This is a followup of #46580 to update golden files.

### Why are the changes needed?

fix CI

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46800 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>