Add fingerprint ingest processor #13724
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
❌ Gradle check result for 50cdb84: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #13724 +/- ##
============================================
+ Coverage 71.42% 71.73% +0.30%
- Complexity 59978 61440 +1462
============================================
Files 4985 5072 +87
Lines 282275 288504 +6229
Branches 40946 41784 +838
============================================
+ Hits 201603 206945 +5342
- Misses 63999 64466 +467
- Partials 16673 17093 +420
❌ Gradle check result for 4a37513: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
modules/ingest-common/src/main/java/org/opensearch/ingest/common/FingerprintProcessor.java
❌ Gradle check result for 7cad2e0: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❕ Gradle check result for 7cad2e0: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Let's say I have the following two documents:
Assuming I set "ignore_missing=true", then will these hash to the same fingerprint? |
modules/ingest-common/src/main/java/org/opensearch/ingest/common/FingerprintProcessor.java
Actually yes, in this case it will generate the same fingerprint for both documents, but the parameter |
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
❌ Gradle check result for cff2bcf: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
❌ Gradle check result for 4447c9c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❕ Gradle check result for a505368: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
I think we can fix the problem. The issue is that the delimiter can exist within the text being delimited, so there is ambiguity. The general technique I've seen to avoid this problem is to prepend the length in front of each string being hashed. Netstring is one method for doing this. There are likely other techniques as well. Edit: here is a random article from the internet discussing this problem: https://crypto.stackexchange.com/questions/55162/best-way-to-hash-two-values-into-one |
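To make the ambiguity concrete, here is a minimal sketch comparing plain `|` joining with a netstring-style length prefix (`<byte length>:<value>,`). The class and method names are hypothetical and this is not the code from this PR; it only illustrates the technique described in the comment above.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper for illustration only; not part of FingerprintProcessor.
final class LengthPrefixedJoin {

    // Plain delimiter joining: |part1|part2|...|
    // Ambiguous when a part itself contains the delimiter.
    static String joinWithDelimiter(List<String> parts) {
        StringBuilder sb = new StringBuilder("|");
        for (String part : parts) {
            sb.append(part).append('|');
        }
        return sb.toString();
    }

    // Netstring-style joining: "<UTF-8 byte length>:<part>," for each part.
    // The length tells a reader exactly where each part ends, so the
    // content of a part can never be mistaken for a separator.
    static String joinWithLengthPrefix(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            int byteLength = part.getBytes(StandardCharsets.UTF_8).length;
            sb.append(byteLength).append(':').append(part).append(',');
        }
        return sb.toString();
    }
}
```

With the inputs `["a|b", "c"]` and `["a", "b|c"]`, `joinWithDelimiter` returns `|a|b|c|` for both, so any hash of that string collides, while `joinWithLengthPrefix` returns `3:a|b,1:c,` and `1:a,3:b|c,` respectively.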
Just kind of a general concern here... getting a consistent hash output over time and across versions is critically important (please correct me if I'm wrong). In other words, given a fixed input document and processor configuration, future versions must continue to generate the exact same fingerprint hash. The 'hash_method' is only one part of it. How we flatten/normalize fields is part of it, and how we concatenate the resulting fields and values into the hash is another part of it. If we find a problem with the normalization or concatenation in the future, then I don't think we can just fix it in place; we'll have to implement a new version for users to opt in to. I'm wondering if we should bake a version into "hash_method", like
Also, do we truly need to give users the flexibility to choose the hash algorithm? Can we start even simpler by having a single
I don't mean to over-engineer things here, but I think this is important to get right given that we're persisting data here and must continue to support it going forward once we release it. |
This is a really good concern. Sadly, I don't have much experience with fingerprinting in general and its flaws. Keeping the options open sounds like a safe bet; maybe a more straightforward way would be to couple it with the OpenSearch version? For example:
I think we have good suggestions to deal with that now. I would say having a choice is forward-thinking. |
I don't like this. I would think that most use cases of this feature would break if the hashing output changed. I think changing it should be opt-in only and never happen automatically.
I like this. It is more meaningful than an arbitrary version number or date that I suggested. |
I am sorry @andrross, I just realized that the example was incomplete; it was supposed to be |
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Thank you, I've changed the code by following the |
Thanks @reta @andrross, I have a few questions. If we couple the version with the OpenSearch version, does that mean the generated hash value will be different in different OpenSearch versions? Secondly, why should we append the version to the |
Hi @reta, we may not have enough time to complete this work before the code freeze date of release 2.15.0, so it can be postponed to the next release. Could you help remove the |
❌ Gradle check result for fb74e64: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Thanks @gaobinlong
Maybe not necessarily different (I think hashes are pretty stable), but e.g. the fingerprint generation (e.g. canonical form,
Yes, the alg + version (e.g.
💯 |
@gaobinlong Only if we need to introduce new versions and only if the user opts into it. If we get it right in the initial implementation, then The basic point is that we need to put some kind of version identifier with the hash method because we have to guarantee that it is stable over time. Using the OpenSearch version that the method was first introduced in is just one way to do that. |
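The opt-in idea discussed above can be sketched as a lookup from a versioned method identifier to a fixed canonicalization. The identifiers `SHA-256@v1` and `SHA-256@v2`, the class name, and both canonical forms are assumptions made purely for illustration; the thread does not settle on this exact format. The point is that an existing identifier never changes behavior, and a fix ships under a new identifier that users must select explicitly.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; identifiers and canonical forms are assumptions.
final class VersionedCanonicalizers {

    // Each identifier pins one canonicalization. Entries are only ever added,
    // never modified, so old fingerprints stay reproducible.
    private static final Map<String, Function<List<String>, String>> BY_ID = new HashMap<>();

    static {
        BY_ID.put("SHA-256@v1", VersionedCanonicalizers::delimited);
        BY_ID.put("SHA-256@v2", VersionedCanonicalizers::lengthPrefixed);
    }

    // Assumed "v1" form: |part1|part2|...|
    private static String delimited(List<String> parts) {
        return "|" + String.join("|", parts) + "|";
    }

    // Assumed "v2" form: "<char length>:<part>," for each part.
    private static String lengthPrefixed(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part.length()).append(':').append(part).append(',');
        }
        return sb.toString();
    }

    // Resolves the configured hash_method to its pinned canonical form.
    static String canonicalize(String hashMethod, List<String> parts) {
        Function<List<String>, String> canonicalizer = BY_ID.get(hashMethod);
        if (canonicalizer == null) {
            throw new IllegalArgumentException("unsupported hash_method [" + hashMethod + "]");
        }
        return canonicalizer.apply(parts);
    }
}
```

A pipeline configured with the older identifier keeps producing the older canonical string after an upgrade; nothing switches automatically.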
Description
Add a new ingest processor named `fingerprint`, which generates a hash value for some specified fields, or for all fields not in a specified exclusion list, and writes the hash value to the `target_field`. The hash value can be used to deduplicate documents within an index and to collapse search results.

The usage of this processor is:
or
The main parameters in this processor are:
1. `fields`: fields in the document used to generate the hash value. Field names and values are concatenated and separated by `|`, like `|field1|value1|field2|value2|`. For nested fields, the field name is flattened, like `|root_field.sub_field1|value1|root_field.sub_field2|value2|`.
2. `include_all_fields`: whether all fields are included to generate the hash value. Only one of `fields` and `include_all_fields` can be set.
3. `exclude_fields`: fields not in this list are used to generate the hash value. Only one of `fields` and `exclude_fields` can be non-empty.
4. `hash_method`: MD5, SHA-1 or SHA-256. SHA-1 is the default hash method.
5. `target_field`: the field to store the hash value.
6. `ignore_missing`: if one of the specified fields is missing, the processor will exit quietly and do nothing.

In addition, if `fields` and `exclude_fields` are both empty or null, it means include all fields: all fields are used to generate the hash value.

Related Issues
#13612

Check List
- New functionality includes testing.
- All tests pass

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
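The flattening and concatenation format from the Description can be sketched roughly as follows. This is not the actual `FingerprintProcessor` code: the class and method names are hypothetical, and sorting the flattened field names (via `TreeMap`) is an assumption made here so that the output does not depend on map iteration order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the format in the Description; not the PR's code.
final class FingerprintSketch {

    // Flattens nested maps into dotted field names, e.g. root_field.sub_field1.
    // TreeMap keeps the names sorted (an assumption for determinism).
    static Map<String, Object> flatten(Map<String, Object> source) {
        Map<String, Object> flat = new TreeMap<>();
        flattenInto("", source, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    private static void flattenInto(String prefix, Map<String, Object> source, Map<String, Object> flat) {
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            String path = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            if (entry.getValue() instanceof Map<?, ?>) {
                flattenInto(path, (Map<String, Object>) entry.getValue(), flat);
            } else {
                flat.put(path, entry.getValue());
            }
        }
    }

    // Builds |field1|value1|field2|value2| from the flattened fields.
    static String concatenate(Map<String, Object> flat) {
        StringBuilder sb = new StringBuilder("|");
        for (Map.Entry<String, Object> entry : flat.entrySet()) {
            sb.append(entry.getKey()).append('|').append(entry.getValue()).append('|');
        }
        return sb.toString();
    }

    // Hex-encoded SHA-1, the default hash_method named in the Description.
    static String sha1Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(input.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }
}
```

For a document with `root_field` containing `sub_field1=value1` and `sub_field2=value2`, `concatenate(flatten(doc))` yields `|root_field.sub_field1|value1|root_field.sub_field2|value2|`, which is then passed to `sha1Hex`.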