Analyzer support correlated subqueries #64050

Open
wants to merge 1 commit into
base: master
from

Conversation

kitaisreal
Collaborator

@kitaisreal kitaisreal commented May 17, 2024

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Analyzer supports correlated subqueries.

@kitaisreal kitaisreal added the can be tested Allows running workflows for external contributors label May 17, 2024
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-feature Pull request with new product feature label May 17, 2024
@robot-clickhouse-ci-1
Contributor

robot-clickhouse-ci-1 commented May 17, 2024

This is an automated comment for commit ce6cc50 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

| Check name | Description | Status |
| --- | --- | --- |
| CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ⏳ pending |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | ⏳ pending |
| Mergeable Check | Checks if all other necessary checks are successful | ❌ failure |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ❌ failure |

Successful checks

| Check name | Description | Status |
| --- | --- | --- |
| Docs check | Builds and tests the documentation | ✅ success |
| PR Check | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |

@alexey-milovidov alexey-milovidov changed the title Analyzer support correlated subqueries Analyzer support correlated subqueries (trivial case of unwrapping SELECT column) May 20, 2024
@kitaisreal kitaisreal changed the title Analyzer support correlated subqueries (trivial case of unwrapping SELECT column) Analyzer support correlated subqueries May 20, 2024
@hanfei1991
Member

May I ask what kinds of correlated queries are supported here? In my mind, decorrelation is like transforming

select * from t1 where 0 < (select count(*) from t2 where t1.id = t2.id)

to

select * from t1 semi join t2 on t1.id = t2.id

but in the tests I only saw columns in the SELECT list referring to other tables, so I'm confused.
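
A minimal sketch of the rewrite described above, using Python's built-in sqlite3 module purely for illustration (this is not ClickHouse and not the code in this PR). SQLite has no SEMI JOIN syntax, so an uncorrelated IN stands in for the semi join; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, name TEXT);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO t2 VALUES (1), (1), (3);
""")

# Correlated form: the subquery references t1.id from the outer query.
correlated = conn.execute(
    "SELECT * FROM t1 "
    "WHERE 0 < (SELECT count(*) FROM t2 WHERE t1.id = t2.id) "
    "ORDER BY id"
).fetchall()

# Decorrelated form with semi-join semantics: each t1 row appears at most once.
semi_join = conn.execute(
    "SELECT * FROM t1 WHERE id IN (SELECT id FROM t2) ORDER BY id"
).fetchall()

# A plain inner join is NOT equivalent: t1 rows repeat once per match in t2.
inner_join = conn.execute(
    "SELECT t1.* FROM t1 JOIN t2 ON t1.id = t2.id ORDER BY t1.id"
).fetchall()

print(correlated)
print(semi_join)
print(inner_join)
```

The first two queries return the same rows, while the inner join repeats the row with id 1 because t2 contains that id twice. That duplication is why the rewrite needs a semi join rather than an ordinary join.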

@hanfei1991 hanfei1991 self-assigned this May 21, 2024
@kitaisreal
Collaborator Author

> May I ask what kinds of correlated queries are supported here? In my mind, decorrelation is like transforming
>
> select * from t1 where 0 < (select count(*) from t2 where t1.id = t2.id)
>
> to
>
> select * from t1 semi join t2 on t1.id = t2.id
>
> but in the tests I only saw columns in the SELECT list referring to other tables, so I'm confused.

Automatic decorrelation is not possible for all types of correlated subqueries, and it can be implemented later as an optimization on top of this pull request.
Within the scope of this pull request, I plan to support correlated scalar subqueries and correlated subqueries as the second argument of IN. I do not plan to support LATERAL JOIN.

@novikd
Member

novikd commented May 21, 2024

> Automatic decorrelation is not possible for all types of correlated subqueries, and can be implemented later as an optimization on top of this pull request.

This would be really hard to do. Implementing correlated subqueries as a separate logical plan step, with decorrelation on top of the query plan, is the standard approach today. The algorithm from the paper "Unnesting Arbitrary Queries" by Thomas Neumann and Alfons Kemper is usually used to support it in DBMSs.

I think implementing it as a separate function is a bad design decision. In the future, it will have to be rewritten completely as a separate query plan step.

Also, I expect it to be very slow because, for each input row, this implementation:

  • analyzes the subquery
  • does full table scan (for tables used in subquery)

I don't think it can be used in real-world production. This feature requires more functional and performance tests.
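
To make the cost concern concrete, here is a rough sketch in plain Python (not ClickHouse code) that counts how many inner rows are read when a correlated `count(*)` subquery is evaluated once per outer row, compared with a decorrelated plan that aggregates the inner table once. The table sizes and the function names `nested_loop` and `decorrelated` are hypothetical.

```python
from collections import Counter

# Made-up data: 100 outer rows, 1000 inner rows.
t1 = [{"id": i} for i in range(100)]
t2 = [{"id": i % 10} for i in range(1000)]


def nested_loop(outer, inner):
    """Evaluate (SELECT count(*) FROM inner WHERE inner.id = outer.id) per outer row."""
    rows_read = 0
    result = []
    for o in outer:
        count = 0
        for i in inner:  # full scan of the inner table for every outer row
            rows_read += 1
            if i["id"] == o["id"]:
                count += 1
        result.append((o["id"], count))
    return result, rows_read


def decorrelated(outer, inner):
    """Aggregate the inner table once, then do a hash lookup per outer row."""
    counts = Counter(i["id"] for i in inner)  # single scan of the inner table
    rows_read = len(inner)
    # Default of 0 keeps outer rows without matches, as count(*) would.
    result = [(o["id"], counts.get(o["id"], 0)) for o in outer]
    return result, rows_read


slow_result, slow_reads = nested_loop(t1, t2)
fast_result, fast_reads = decorrelated(t1, t2)
print(slow_reads, fast_reads)
```

Both strategies produce the same result here, but the nested loop reads the inner table once per outer row, so its cost grows with the product of the two table sizes.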

@novikd novikd added the analyzer Issues and pull-requests related to new analyzer label May 21, 2024
@novikd novikd self-assigned this May 21, 2024
@kitaisreal
Collaborator Author

> > Automatic decorrelation is not possible for all types of correlated subqueries, and can be implemented later as an optimization on top of this pull request.
>
> This would be really hard to do. Implementing correlated subqueries as a separate logical plan step, with decorrelation on top of the query plan, is the standard approach today. The algorithm from the paper "Unnesting Arbitrary Queries" by Thomas Neumann and Alfons Kemper is usually used to support it in DBMSs.
>
> I think implementing it as a separate function is a bad design decision. In the future, it will have to be rewritten completely as a separate query plan step.
>
> Also, I expect it to be very slow because, for each input row, this implementation:
>
>   • analyzes the subquery
>   • does full table scan (for tables used in subquery)
>
> I don't think it can be used in real-world production. This feature requires more functional and performance tests.

It is not possible to decorrelate correlated subqueries in all scenarios, so we will need a fallback to slow execution anyway (on top of the logical query plan or just with a separate function). I am also not sure how decorrelation would work with the IN function, where we expect each subquery to produce a set.

We can push correlated subqueries into the logical query plan around the expression/filter steps that depend on the result of that correlated subquery. I will check whether this is easy.

This feature is required for standard SQL support; it is expected that the initial implementation can be incomplete.

@novikd
Member

novikd commented May 22, 2024

> It is not possible to decorrelate correlated subqueries in all scenarios, so we will need a fallback to slow execution anyway

Actually, the paper I mentioned states that it's possible to decorrelate an arbitrary query, although the result cannot always be expressed in SQL. Still, this is open to discussion. There are two options for what to do when a subquery cannot be decorrelated automatically:

  1. Throw an exception and forbid such queries
  2. Run subquery in a nested loop.

I'm not sure whether we need to be able to run arbitrary correlated subqueries, or whether we can run only those we can decorrelate (as many DBMSs do).

> I am also not sure how decorrelation would work with the IN function, where we expect each subquery to produce a set.

It can be done using a JOIN. Anyway, it cannot be supported with this approach either, because a set is not a scalar.
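
As a hedged illustration of the JOIN idea (plain Python, not ClickHouse code): a query of the form `SELECT * FROM t1 WHERE t1.x IN (SELECT t2.y FROM t2 WHERE t2.id = t1.id)` can be viewed as a semi join on both `id` and the IN value. The column names `x` and `y`, the data, and the function names are hypothetical, and NULL handling is ignored.

```python
# Rows are tuples: t1 holds (id, x), t2 holds (id, y). Made-up data.
t1 = [(1, 10), (2, 20), (3, 30)]
t2 = [(1, 10), (1, 11), (2, 99), (3, 30)]


def correlated_in(outer, inner):
    """Build a fresh set per outer row, as a per-row evaluation would."""
    result = []
    for oid, x in outer:
        per_row_set = {y for iid, y in inner if iid == oid}
        if x in per_row_set:
            result.append((oid, x))
    return result


def semi_join(outer, inner):
    """Decorrelated form: one hash set keyed on (id, y), probed once per outer row."""
    keys = set(inner)
    return [(oid, x) for oid, x in outer if (oid, x) in keys]


print(correlated_in(t1, t2))
print(semi_join(t1, t2))
```

Both functions keep the rows with id 1 and id 3, since only those have a matching `(id, y)` pair in t2.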

> We can push correlated subqueries into the logical query plan around the expression/filter steps that depend on the result of that correlated subquery. I will check whether this is easy.
>
> This feature is required for standard SQL support; it is expected that the initial implementation can be incomplete.

I find the ad hoc implementation in this PR problematic: it's not extensible, and it will be hard to improve correlated subquery support on top of it:

  • A correlated subquery becomes part of the DAG, so decorrelating it will require finding the correlated parts in the DAG. This looks quite hard and tricky.
  • Correlated table subqueries cannot be supported as a function, because they can produce any number of rows.

> This feature is required for standard SQL support; it is expected that the initial implementation can be incomplete.

In conclusion: I don't know if we want to have a nested loop implementation for correlated subqueries, but if we do, it should be implemented at the query plan level, so that we won't have to rewrite the implementation completely when we try to support the general case.

@hanfei1991 hanfei1991 removed their assignment May 22, 2024
@UnamedRus
Contributor

UnamedRus commented May 30, 2024

> Run subquery in a nested loop.

Unrelated, but having a nested loop implementation (with a cache) would be nice in general (it can also be beneficial for joins).
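
A minimal sketch of what a cached nested loop could look like, in plain Python with `functools.lru_cache` standing in for the cache (an assumption for illustration, not a description of any ClickHouse mechanism). The data and the names `count_matches` and `stats` are hypothetical.

```python
from functools import lru_cache

# Made-up data: the outer side repeats only 5 distinct correlation keys.
t2 = tuple(i % 10 for i in range(1000))
outer_ids = [i % 5 for i in range(100)]

stats = {"scans": 0}  # number of times the inner table is actually scanned


@lru_cache(maxsize=None)
def count_matches(key):
    """Correlated subquery body: count inner rows equal to the outer key."""
    stats["scans"] += 1
    return sum(1 for v in t2 if v == key)


# Nested loop over outer rows; repeated keys are answered from the cache.
result = [(k, count_matches(k)) for k in outer_ids]
print(stats["scans"], count_matches.cache_info().hits)
```

With 100 outer rows but only 5 distinct keys, the inner table is scanned 5 times instead of 100. The benefit depends entirely on how often correlation keys repeat.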

Labels
  • analyzer: Issues and pull-requests related to new analyzer
  • can be tested: Allows running workflows for external contributors
  • pr-feature: Pull request with new product feature

5 participants