Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid counting contributions multiple times when using * wildcard filter #30

Open
Hagellach37 opened this issue Jul 28, 2023 · 1 comment
Labels
blocked can not be worked on right now enhancement New feature or request low-priority

Comments

@Hagellach37
Copy link
Contributor

Hagellach37 commented Jul 28, 2023

contributions could be counted more than once with the current implementation of the /stats endpoint when the wildcard * filter is used.

When using the wildcard filter we could first group values by contribution_id before deriving the sum(buildings) or sum(road_length_delta).

private fun getStatsFromTimeSpan(hashtagHandler: HashtagHandler) = """
SELECT
count(distinct changeset_id) as changesets,
count(distinct user_id) as users,
ifNull(sum(road_length_delta)/1000, 0) as roads,
ifNull(sum(building_edit), 0) as buildings,
count(map_feature_edit) as edits,
max(changeset_timestamp) as latest
FROM "stats"
WHERE
${if (hashtagHandler.isWildCard) "startsWith" else "equals"}(hashtag, ?)
and changeset_timestamp > parseDateTime64BestEffort(?)
and changeset_timestamp < parseDateTime64BestEffort(?);
""".trimIndent()

@Hagellach37 Hagellach37 added the enhancement New feature or request label Aug 4, 2023
@ElJocho
Copy link
Contributor

ElJocho commented Aug 8, 2023

Thought collection on this topic:

  • the correct query would be something like this:
SELECT
    count(distinct changeset_id) as changesets,
    count(distinct user_id) as users,
    ifNull(sum(road_length_delta)/1000, 0) as roads,
    ifNull(sum(building_edit), 0) as buildings,
    count(map_feature_edit) as edits,
    max(changeset_timestamp) as latest
FROM (
	SELECT 
		min(changeset_id) as changeset_id,
		user_id,
		min(road_length_delta) as road_length_delta,
		min(building_edit) as building_edit,
		min(map_feature_edit) as map_feature_edit,
		changeset_timestamp,
		groupArray(hashtag) as hashtags,
		min(country_iso_a3)
	FROM "stats"
	WHERE
	    startsWith(hashtag, 'm') 
            AND changeset_timestamp > parseDateTimeBestEffort('2020-01-01T00:00:00Z') 
	    AND changeset_timestamp < parseDateTimeBestEffort('2022-01-01T00:00:00Z')
	GROUP BY 
		changeset_timestamp, 
		user_id, 
		osm_id
) as grouped_stats
WHERE 
	arrayExists(hashtag -> startsWith(hashtag, 'm'), hashtags)

-> gives ~2/3rd the buildings of our current approach
-> Issue: takes 6 seconds instead of 0.4s

I investigated further and tried to create a aggregated projection like so:

 ALTER TABLE stats ADD PROJECTION combined_hashtag (
	SELECT 
		min(changeset_id) as changeset_id,
		user_id,
		min(road_length_delta) as road_length_delta,
		min(building_edit) as building_edit,
		min(map_feature_edit) as map_feature_edit,
		changeset_timestamp,
		groupArray(hashtag) as hashtags,
		min(country_iso_a3) as country_iso_a3
	GROUP BY 
		changeset_timestamp, 
		user_id,
		osm_id
)
;

Sadly there seems to be no way to access the newly defined column "hashtags" so I cannot query for arrayExists(hashtag -> startsWith(hashtag, 'm'), hashtags), rendering it useless.

Conclusion

  • Currently we have no feasible (sub-1-seconds-execition time) way to re-aggregate our hashtag column and thus cannot guarantee correct values for all wildcard queries.

@ElJocho ElJocho added blocked can not be worked on right now low-priority labels Aug 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked can not be worked on right now enhancement New feature or request low-priority
Projects
None yet
Development

No branches or pull requests

2 participants