Analyzer support correlated subqueries #64050

kitaisreal · 2024-05-17T14:10:02Z

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Analyzer supports correlated subqueries.

robot-clickhouse-ci-1 · 2024-05-17T14:11:01Z

This is an automated comment for commit ce6cc50 with a description of existing statuses. It is updated for the latest CI run.

Check name	Description	Status
CI running	A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR	⏳ pending
ClickHouse build check	Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log by grepping for cmake. Use these options and follow the general build process	⏳ pending
Mergeable Check	Checks if all other necessary checks are successful	❌ failure
Style check	Runs a set of checks to keep the code style clean. If some of the tests failed, see the related log from the report	❌ failure

Successful checks

Check name	Description	Status
Docs check	Builds and tests the documentation	✅ success
PR Check	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success

hanfei1991 · 2024-05-20T17:59:50Z

May I ask what kinds of correlated queries are supported here? As I understand it, decorrelation means transforming

select * from t1 where 0 < (select count(*) from t2 where t1.id = t2.id)

to

select * from t1 semi join t2 on t1.id = t2.id

but in the tests I only saw columns in the SELECT list referring to other tables, so I'm confused.
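
To make the rewrite in question concrete, here is a minimal Python sketch that evaluates both query shapes over toy in-memory tables and lets the results be compared. The table contents and the helper names `correlated_filter` and `semi_join` are invented for illustration; nothing here reflects how the PR itself evaluates subqueries.

```python
# Toy rows; the table contents are invented for illustration.
t1 = [{"id": 1}, {"id": 2}, {"id": 3}]
t2 = [{"id": 1}, {"id": 1}, {"id": 3}]


def correlated_filter(outer, inner):
    """select * from t1 where 0 < (select count(*) from t2 where t1.id = t2.id)"""
    result = []
    for row in outer:
        # The subquery is re-evaluated for every outer row.
        count = sum(1 for r in inner if r["id"] == row["id"])
        if 0 < count:
            result.append(row)
    return result


def semi_join(outer, inner):
    """select * from t1 semi join t2 on t1.id = t2.id"""
    keys = {r["id"] for r in inner}  # one pass over t2
    # Each outer row is emitted at most once, however many inner rows match.
    return [row for row in outer if row["id"] in keys]


correlated_result = correlated_filter(t1, t2)
semi_join_result = semi_join(t1, t2)
```

On this sample both forms keep the rows with id 1 and id 3, and the duplicate id 1 in t2 does not duplicate the outer row.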

kitaisreal · 2024-05-21T08:24:00Z

Automatic decorrelation is not possible for all types of correlated subqueries, and it can be implemented later as an optimization on top of this pull request.
Within the scope of this pull request, I plan to support correlated scalar subqueries and correlated subqueries as the second argument of IN. I do not plan to support LATERAL JOIN.

novikd · 2024-05-21T16:39:42Z

> Automatic decorrelation is not possible for all types of correlated subqueries, and it can be implemented later as an optimization on top of this pull request.

This would be really hard to do. Implementing correlated subqueries as a separate logical plan step, with decorrelation on top of the query plan, is standard practice today. The algorithm from the paper "Unnesting Arbitrary Queries" by Thomas Neumann and Alfons Kemper is usually used to support it in DBMSs.

I think implementing it as a separate function is a bad design decision. In the future, it will have to be rewritten completely as a separate query plan step.

Also, I expect it to be very slow because, for each input row, this implementation:

analyzes the subquery
does a full table scan (for the tables used in the subquery)

I don't think it can be used in real-world production. This feature needs more functional and performance tests.
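
The scan-cost concern can be sketched with a toy model. The Python below counts how many inner rows are touched when a scalar subquery such as `(select count(*) from t2 where t2.id = t1.id)` is evaluated once per outer row, compared with a single aggregation pass followed by lookups. It is an illustration under invented data and hypothetical function names, not a measurement of the PR, and it does not model the per-row subquery analysis cost at all.

```python
from collections import Counter


def scalar_subquery_nested_loop(outer_ids, inner_ids):
    """Evaluate the count subquery once per outer row, scanning all of t2 each time."""
    rows_scanned = 0
    results = []
    for outer_id in outer_ids:
        count = 0
        for inner_id in inner_ids:  # full scan of t2 for this outer row
            rows_scanned += 1
            if inner_id == outer_id:
                count += 1
        results.append((outer_id, count))
    return results, rows_scanned


def scalar_subquery_decorrelated(outer_ids, inner_ids):
    """Same result via one aggregation over t2, then a lookup per outer row."""
    counts = Counter(inner_ids)  # single scan of t2
    rows_scanned = len(inner_ids)
    # Counter returns 0 for missing keys, matching count(*) over no rows.
    results = [(outer_id, counts[outer_id]) for outer_id in outer_ids]
    return results, rows_scanned


outer_ids = list(range(100))                 # 100 outer rows
inner_ids = [i % 10 for i in range(1000)]    # 1000 inner rows
slow_results, slow_scanned = scalar_subquery_nested_loop(outer_ids, inner_ids)
fast_results, fast_scanned = scalar_subquery_decorrelated(outer_ids, inner_ids)
```

With 100 outer rows and 1000 inner rows, the nested loop touches 100 × 1000 inner rows while the aggregated form touches 1000, and both produce the same per-row counts.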

kitaisreal · 2024-05-21T19:30:31Z

It is not possible to decorrelate correlated subqueries in all scenarios, so we will need a fallback to slow execution anyway (on top of the logical query plan or just with a separate function). I am also not sure how decorrelation will work with the IN function, where we expect that each subquery can produce a set.

We can push correlated subqueries into the logical query plan around the expression/filter steps that depend on the result of that correlated subquery. I will check whether this is easy.

This feature is required for standard SQL support; it is expected that the initial implementation can be incomplete.

novikd · 2024-05-22T12:16:26Z

> It is not possible to decorrelate correlated subqueries in all scenarios, so we will need a fallback to slow execution anyway

Actually, the paper I mentioned states that it is possible to decorrelate an arbitrary query, but that it is not always possible to express the result in SQL. Still, this point is debatable. There are two options for what to do when it is impossible to automatically decorrelate a subquery:

Throw an exception and forbid such queries.
Run the subquery in a nested loop.

I'm not sure whether we need to be able to run arbitrary correlated subqueries, or whether we can run only those we can decorrelate (as many DBMSs do).

> I am also not sure how decorrelation will work with the IN function, where we expect that each subquery can produce a set.

It can be done using a JOIN. Anyway, it cannot be supported with this approach either, because a set is not a scalar.
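
One possible reading of "it can be done using a JOIN" is that a filter like `t1.x in (select t2.v from t2 where t2.id = t1.id)` becomes a semi join on both the correlation key and the tested value. The Python sketch below checks that reading on toy data. The tables, column names, and helper names are assumptions made for illustration, and NULL handling is deliberately ignored.

```python
# Toy rows; columns x and v are hypothetical names for this example only.
t1 = [{"id": 1, "x": 10}, {"id": 2, "x": 20}, {"id": 3, "x": 30}]
t2 = [
    {"id": 1, "v": 10},
    {"id": 2, "v": 99},
    {"id": 3, "v": 30},
    {"id": 3, "v": 31},
]


def in_correlated(outer, inner):
    """select * from t1 where x in (select v from t2 where t2.id = t1.id)"""
    result = []
    for row in outer:
        # The subquery produces a set (not a scalar) for each outer row.
        produced_set = {r["v"] for r in inner if r["id"] == row["id"]}
        if row["x"] in produced_set:
            result.append(row)
    return result


def in_as_semi_join(outer, inner):
    """select * from t1 semi join t2 on t1.id = t2.id and t1.x = t2.v"""
    keys = {(r["id"], r["v"]) for r in inner}  # one pass over t2
    return [row for row in outer if (row["id"], row["x"]) in keys]


in_result = in_correlated(t1, t2)
join_result = in_as_semi_join(t1, t2)
```

On this sample both forms keep the rows with id 1 and id 3 and drop id 2, whose x value never appears among the matching v values.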

> We can push correlated subqueries into the logical query plan around the expression/filter steps that depend on the result of that correlated subquery. I will check whether this is easy.
>
> This feature is required for standard SQL support; it is expected that the initial implementation can be incomplete.

I find the ad hoc implementation in this PR problematic: it is not extensible, and it will be hard to improve correlated subquery support on top of it:

A correlated subquery becomes part of the DAG, so decorrelating it requires finding the correlated parts in the DAG. This looks quite hard and tricky.
Correlated table subqueries cannot be supported as a function, because they can produce any number of rows.

> This feature is required for standard SQL support; it is expected that the initial implementation can be incomplete.

In conclusion: I don't know whether we want a nested-loop implementation for correlated subqueries, but if we do, it should be implemented at the query plan level, so that we won't have to rewrite the implementation completely when we try to support the general case.

UnamedRus · 2024-05-30T05:24:47Z

> Run subquery in a nested loop.

Unrelated, but having a nested-loop implementation (with a cache) is nice in general (it can also be beneficial for joins).
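
As a rough illustration of the "nested loop with cache" idea, the Python sketch below memoizes the subquery result per distinct value of the correlated column, so the inner scan runs once per distinct key rather than once per outer row. The function name `run_with_cache` and the data are hypothetical, and this is not a description of any existing ClickHouse mechanism.

```python
def run_with_cache(outer_ids, inner_ids):
    """Nested-loop evaluation of a count subquery, cached by correlated key."""
    cache = {}        # correlated key -> subquery result
    executions = 0    # how many times the inner scan actually ran
    results = []
    for outer_id in outer_ids:
        if outer_id not in cache:
            executions += 1
            # (select count(*) from t2 where t2.id = t1.id) for this key
            cache[outer_id] = sum(1 for inner_id in inner_ids if inner_id == outer_id)
        results.append((outer_id, cache[outer_id]))
    return results, executions


cached_results, cached_executions = run_with_cache([1, 1, 2, 1, 2], [1, 2, 2])
```

Here five outer rows share two distinct keys, so the inner scan runs twice. The benefit depends entirely on how often correlated values repeat in the outer input.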

Analyzer support correlated subqueries

ce6cc50

kitaisreal added the can be tested Allows running workflows for external contributors label May 17, 2024

robot-clickhouse-ci-1 added the pr-feature Pull request with new product feature label May 17, 2024

alexey-milovidov changed the title ~~Analyzer support correlated subqueries~~ Analyzer support correlated subqueries (trivial case of unwrapping SELECT column) May 20, 2024

kitaisreal changed the title ~~Analyzer support correlated subqueries (trivial case of unwrapping SELECT column)~~ Analyzer support correlated subqueries May 20, 2024

hanfei1991 self-assigned this May 21, 2024

novikd added the analyzer Issues and pull-requests related to new analyzer label May 21, 2024

novikd self-assigned this May 21, 2024

hanfei1991 removed their assignment May 22, 2024
