
How to ingest data without duplication allowed? #1877

Open
parselife opened this issue Mar 31, 2022 · 4 comments

Comments

@parselife

parselife commented Mar 31, 2022

The documentation says:

Data ID: An identifier for the data represented by this row. We do not impose a requirement that Data IDs are globally unique but they should be unique for the adapter. Therefore, the pairing of Internal Adapter ID and Data ID define a unique identifier for a data element. An example of a data ID for vector data would be the feature ID.

According to that, the Adapter ID and Data ID define a unique identifier, so how can I ingest data without allowing duplication?
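To illustrate my understanding of the documentation, here is a minimal sketch (plain Python, not the GeoWave API; the ID values are just examples):

```python
# The pairing of Internal Adapter ID and Data ID acts as the logical
# identifier for a data element, per the documentation quoted above.
logical_id = ("4", "places.12")  # (internal adapter ID, data ID)

# A store keyed ONLY on this pair could never hold logical duplicates:
store = {}
store[logical_id] = {"geometry": "POINT (1 2)"}
store[logical_id] = {"geometry": "POINT (1 2)"}  # same key: overwritten

assert len(store) == 1  # only one record survives
```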

Now, my index looks like:

```
adapter_id                 data_id
4	......              places.12
4	......              places.12
```

Why did this happen? The values of adapter_id and data_id in these two records are the same.

I want to keep a single record rather than a duplicate. How can I do that?

@parselife
Author

I found the Cassandra table definition:

```
primary key (partition, adapter_id, sort, data_id, vis, nano_time, field_mask, value, num_duplicates)
```

Is there any way to customize this?

@rfecher
Contributor

rfecher commented Apr 1, 2022

Not sure why exactly you'd want to customize that primary key. You can give it the data ID to be unique, and other things like the sort and partition keys come from the index (which, again, you could customize but probably don't want to).

The issue is most likely that you are inserting rows into the index with the same adapter ID and data ID but different sort keys. This would happen, for example, if you were using a spatial index and the rows had different geometries (or, similarly, a temporal index with different dates/times). In those cases you would want to delete the row prior to ingesting.

The num_duplicates identifier that we tack onto the primary key is a hint that we are intentionally storing duplicates. This can happen in rare circumstances, such as when you are storing a time range (consider a track that has a start time and an end time) and that range crosses a periodicity boundary on a temporal index. Because time is unbounded, we place it on the space filling curve by applying a periodicity such as a year (our default, though it can be configured). With a year periodicity, if the track started on Dec. 31 and ended on Jan. 1, for example, we have to insert two rows, one on each side of the boundary, and we track that with the num_duplicates hint.

Hopefully that adds some clarity to your situation. As mentioned, most likely you are inserting a data ID multiple times with different sort keys, such as different geometries within a spatial index, which requires deleting the previous row prior to insertion.

@parselife
Author

Thanks for your reply. Where can I find the sort keys? My situation is that the data written twice is exactly the same.

@rfecher
Contributor

rfecher commented Apr 25, 2022

Do you have a "ROUND_ROBIN" partition strategy on your index (as described in the `add index` help output: https://locationtech.github.io/geowave/latest/userguide.html#help-command)? This partition strategy would, by design, add random partition keys even to identical rows, which would explain the behavior you're seeing.
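A conceptual sketch of why this happens (plain Python, not GeoWave code; here the partition key rotates deterministically, whereas the real strategy may assign it randomly):

```python
import itertools

# A ROUND_ROBIN-style partition strategy assigns the partition key in
# rotation, independent of the row contents, so even byte-identical rows
# land under different physical primary keys and both are kept.
NUM_PARTITIONS = 4
_counter = itertools.count()

def insert(rows, adapter_id, data_id, sort_key):
    partition = next(_counter) % NUM_PARTITIONS  # rotating partition key
    rows.add((partition, adapter_id, sort_key, data_id))

rows = set()
insert(rows, "4", "places.12", "s1")
insert(rows, "4", "places.12", "s1")  # identical logical row
assert len(rows) == 2  # stored twice: only the partition keys differ
```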
