
Add fingerprint ingest processor #13724

Open · wants to merge 13 commits into base: main
Conversation

@gaobinlong (Contributor) commented May 17, 2024

Description

Add a new ingest processor named fingerprint, which generates a hash value from the specified fields (or from all fields not in a specified exclude list) and writes the hash value to the target_field. The hash value can be used to deduplicate documents within an index and to collapse search results.

The usage of this processor is:

"processors": [
      {
        "fingerprint": {
          "fields": ["foo", "bar"],
          "target_field": "fingerprint",
          "hash_method": "SHA-256",
          "ignore_missing": true
        }
      }
    ]

or

"processors": [
      {
        "fingerprint": {
          "exclude_fields": ["zoo"],
          "target_field": "fingerprint"
        }
      }
    ]

The main parameters of this processor are:

  1. fields: the fields in the document used to generate the hash value. Field names and values are concatenated and separated by |, like |field1|value1|field2|value2|. For nested fields, the field name is flattened, like |root_field.sub_field1|value1|root_field.sub_field2|value2|.
  2. include_all_fields: whether all fields are used to generate the hash value; only one of fields and include_all_fields can be set.
  3. exclude_fields: all fields except those in this list are used to generate the hash value; at most one of fields and exclude_fields can be non-empty.
  4. hash_method: one of MD5, SHA-1, or SHA-256; SHA-1 is the default.
  5. target_field: the field in which to store the hash value.
  6. ignore_missing: if true and one of the specified fields is missing, the processor exits quietly and does nothing.

In addition, if fields and exclude_fields are both empty or null, all fields are used to generate the hash value.
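The concatenate-then-hash scheme described by these parameters can be sketched in a few lines (an illustrative sketch only, not the processor's actual Java implementation; the flatten and fingerprint helper names are hypothetical):

```python
import hashlib

def flatten(doc, prefix=""):
    """Flatten nested fields into dotted paths, e.g. root_field.sub_field1."""
    out = {}
    for key, value in sorted(doc.items()):
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out

def fingerprint(doc, fields=None, hash_method="SHA-1"):
    """Concatenate |name|value| pairs and hash them with the chosen method."""
    flat = flatten(doc)
    names = fields if fields is not None else sorted(flat)
    concatenated = "|" + "|".join(f"{n}|{flat[n]}" for n in names) + "|"
    algo = {"MD5": "md5", "SHA-1": "sha1", "SHA-256": "sha256"}[hash_method]
    return hashlib.new(algo, concatenated.encode("utf-8")).hexdigest()

doc = {"root_field": {"sub_field1": "value1", "sub_field2": "value2"}}
# Hashes "|root_field.sub_field1|value1|root_field.sub_field2|value2|"
print(fingerprint(doc))
```

The actual processor writes the digest to target_field rather than returning it, and may encode it differently.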

Related Issues

#13612

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • API changes companion pull request created.
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

❌ Gradle check result for 50cdb84: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


✅ Gradle check result for e6b8851: SUCCESS


codecov bot commented May 17, 2024

Codecov Report

Attention: Patch coverage is 85.96491% with 16 lines in your changes missing coverage. Please review.

Project coverage is 71.73%. Comparing base (b15cb0c) to head (a505368).
Report is 368 commits behind head on main.

Current head a505368 differs from pull request most recent head fb74e64

Please upload reports for the commit fb74e64 to get more accurate results.

Files Patch % Lines
...opensearch/ingest/common/FingerprintProcessor.java 86.48% 5 Missing and 10 partials ⚠️
...search/ingest/common/IngestCommonModulePlugin.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13724      +/-   ##
============================================
+ Coverage     71.42%   71.73%   +0.30%     
- Complexity    59978    61440    +1462     
============================================
  Files          4985     5072      +87     
  Lines        282275   288504    +6229     
  Branches      40946    41784     +838     
============================================
+ Hits         201603   206945    +5342     
- Misses        63999    64466     +467     
- Partials      16673    17093     +420     



❌ Gradle check result for 4a37513: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Jun 5, 2024

❌ Gradle check result for 7cad2e0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Jun 5, 2024

❕ Gradle check result for 7cad2e0: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@andrross (Member) commented Jun 5, 2024

field name and value are concatenated and separated by |, like |field1|value1|field2|value2|

Let's say I have the following two documents:

{
  "field1": "value1",
  "field2": "value2"
}
{
  "field1": "value1|field2|value2"
}

Assuming I set "ignore_missing=true", then will these hash to the same fingerprint?

@gaobinlong (Contributor, Author) commented Jun 6, 2024

field name and value are concatenated and separated by |, like |field1|value1|field2|value2|

Let's say I have the following two documents:

{
  "field1": "value1",
  "field2": "value2"
}
{
  "field1": "value1|field2|value2"
}

Assuming I set "ignore_missing=true", then will these hash to the same fingerprint?

If ignore_missing is true and one of the specified fields doesn't exist, the processor exits quietly without modifying the document, so no hash value is generated. ignore_missing is a common parameter whose meaning is the same across all processors.

Actually, yes: in this case both documents get the same fingerprint. But the ignore_missing parameter is controlled by users, so maybe we can just document this behavior and explain the underlying issue to users who set ignore_missing to true?
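The ambiguity being discussed is easy to reproduce with a sketch of the naive "|"-separated concatenation (naive_concat is a hypothetical stand-in for the concatenation step, not the processor's code):

```python
import hashlib

def naive_concat(doc):
    # Naive "|"-separated concatenation, as originally described in the PR.
    return "|" + "|".join(f"{k}|{v}" for k, v in doc.items()) + "|"

doc_a = {"field1": "value1", "field2": "value2"}
doc_b = {"field1": "value1|field2|value2"}

# Both documents produce the identical string "|field1|value1|field2|value2|",
# so they hash to the same fingerprint.
assert naive_concat(doc_a) == naive_concat(doc_b)
assert (hashlib.sha1(naive_concat(doc_a).encode()).hexdigest()
        == hashlib.sha1(naive_concat(doc_b).encode()).hexdigest())
```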

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

github-actions bot commented Jun 6, 2024

❌ Gradle check result for cff2bcf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

github-actions bot commented Jun 6, 2024

❌ Gradle check result for 4447c9c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Jun 6, 2024

❕ Gradle check result for a505368: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@andrross (Member) commented Jun 6, 2024

maybe we can just document this behavior and tell users the underlying issue if they set ignore_missing to true?

I think we can fix the problem. The issue is that the delimiter can exist within the text being delimited, so there is ambiguity. The general technique I've seen to avoid this problem is to prepend the length in front of each string being hashed. Netstring is one method for doing this. There are likely other techniques as well.

Edit: here is a random article from the internet discussing this problem: https://crypto.stackexchange.com/questions/55162/best-way-to-hash-two-values-into-one
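A minimal sketch of the Netstring-style length prefixing suggested above, showing that it disambiguates the two documents from the earlier example (netstring and encode are hypothetical helper names):

```python
def netstring(s: str) -> str:
    # Netstring framing: "<byte length>:<data>," e.g. "hello" -> "5:hello,"
    data = s.encode("utf-8")
    return f"{len(data)}:{s},"

def encode(doc):
    # Length-prefix every field name and value before hashing, so a "|"
    # inside a value can no longer shift the field boundaries.
    return "".join(netstring(k) + netstring(str(v)) for k, v in doc.items())

doc_a = {"field1": "value1", "field2": "value2"}
doc_b = {"field1": "value1|field2|value2"}

# "6:field1,6:value1,6:field2,6:value2," vs "6:field1,20:value1|field2|value2,"
assert encode(doc_a) != encode(doc_b)
```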

@andrross (Member) commented Jun 6, 2024

Just kind of a general concern here...getting a consistent hash output over time and across versions is critically important (please correct me if I'm wrong). In other words, given a fixed input document and processor configuration, future versions must continue to generate the exact same fingerprint hash. The 'hash_method' is only one part of it. How we flatten/normalize fields is part of it, and how we concatenate the resulting fields and values into the hash are another part of it. If we find a problem with the normalization or concatenation in the future, then I don't think we can just fix it in place, but we'll have to implement a new version for users to opt in to. I'm wondering if we should bake a version into "hash_method", like V1_SHA-256 or 2024-06-01_SHA-256 to give us the flexibility to release future versions of the hashing technique using the same underlying hashing algorithm?

Also, do we truly need to give users the flexibility to choose the hash algorithm? Can we start even simpler by having a single hash_method called V1 where the algorithm used is an implementation detail? I know there are tradeoffs between the algorithms, but is there a good one-size-fits-all algorithm that might work? We can always add configurability in the future if it is needed.

I don't mean to over-engineer things here, but I think this is important to get right given that we're persisting data here and must continue to support it going forward once we release it.

@reta (Collaborator) commented Jun 6, 2024

How we flatten/normalize fields is part of it, and how we concatenate the resulting fields and values into the hash are another part of it. If we find a problem with the normalization or concatenation in the future, then I don't think we can just fix it in place, but we'll have to implement a new version for users to opt in to. I'm wondering if we should bake a version into "hash_method", like V1_SHA-256 or 2024-06-01_SHA-256 to give us the flexibility to release future versions of the hashing technique using the same underlying hashing algorithm?

This is a really good concern; I sadly don't have much experience with fingerprinting in general and its flaws. Keeping the options open sounds like a safe bet; maybe a more straightforward way would be to couple it with the OpenSearch version? For example:

  • SHA-256 - whatever the current is, would be stored as "SHA-256@"
  • SHA-256@2.14.0 - use the SHA-256 impl as of OpenSearch 2.14.0

Also, do we truly need to give users the flexibility to choose the hash algorithm?

I think we have some good suggestions to deal with that now; I would say having a choice is forward-thinking.

@andrross (Member) commented Jun 6, 2024

SHA-256 - whatever the current is, would be stored as "SHA-256@"

I don't like this. I would think that most use cases of this feature would break if the hashing output changed. I think changing it should be opt-in only and never happen automatically.

SHA-256@2.14.0

I like this. It is more meaningful than an arbitrary version number or date that I suggested.

@reta (Collaborator) commented Jun 6, 2024

I don't like this. I would think that most use cases of this feature would break if the hashing output changed. I think changing it should be opt-in only and never happen automatically.

I am sorry @andrross, I just realized that the example was incomplete; it was supposed to be SHA-256@<current version>. Thank you for noticing that.

@andrross (Member) commented Jun 6, 2024

I am sorry @andrross, I just realized that the example was incomplete, it supposed to be SHA-256@<current version>, thank you for noticing that.

@reta Ah, ok, that does make more sense, but I would still prefer to always be fully explicit with the version.

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
@gaobinlong (Contributor, Author)

maybe we can just document this behavior and tell users the underlying issue if they set ignore_missing to true?

I think we can fix the problem. The issue is that the delimiter can exist within the text being delimited, so there is ambiguity. The general technique I've seen to avoid this problem is to prepend the length in front of each string being hashed. Netstring is one method for doing this. There are likely other techniques as well.

Edit: here is a random article from the internet discussing this problem: https://crypto.stackexchange.com/questions/55162/best-way-to-hash-two-values-into-one

Thank you, I've changed the code by following the Netstring, please help to take a look!

@gaobinlong (Contributor, Author)

Keeping the options open sounds like a safe bet, may be more straightforward way would be to couple it with OpenSearch version?

Thanks @reta @andrross, I have a few questions. If we couple the version with the OpenSearch version, does that mean the generated hash value will be different in different OpenSearch versions? Secondly, why should we append the version to the hash_method parameter? Does that mean users can choose which version they want to use, so we should handle different versions in a future release? I think the version covers not only the hash algorithm but also how we flatten the object fields and concatenate the field names and values. So how about making the version internal but still exposing it to users, just like the community_id processor, which prepends the version number to the hash string? The generated hash value then looks like 1:pkvHqCL88/tg1k4cPigmZXUtL00=, and if the version number doesn't change, the generated hash value stays consistent across OpenSearch versions.
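The community_id-style version prefix proposed here could look roughly like this (a sketch under the assumption of a SHA-1 digest encoded as base64; FINGERPRINT_VERSION and versioned_fingerprint are hypothetical names):

```python
import base64
import hashlib

# Bumped only when the flattening/concatenation scheme (not just the hash
# algorithm) changes, so existing fingerprints stay reproducible.
FINGERPRINT_VERSION = 1

def versioned_fingerprint(data: bytes) -> str:
    digest = hashlib.sha1(data).digest()
    # Prefix the scheme version, mirroring community_id's "1:<base64 digest>".
    return f"{FINGERPRINT_VERSION}:{base64.b64encode(digest).decode('ascii')}"

print(versioned_fingerprint(b"|foo|bar|"))  # "1:" followed by 28 base64 chars
```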

@gaobinlong (Contributor, Author)

Hi @reta, we may not have enough time to complete this work before the code freeze date for release 2.15.0, so it can be postponed to the next release. Could you help remove the v2.15.0 label? Thank you!


github-actions bot commented Jun 7, 2024

❌ Gradle check result for fb74e64: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@reta added the v2.16.0 label and removed the v2.15.0 label on Jun 7, 2024
@reta (Collaborator) commented Jun 7, 2024

Thanks @gaobinlong

Thanks @reta @andrross, I have few questions, if we couple the version with OpenSearch version, does that mean the generated hash value will be different in different OpenSearch versions?

Maybe not necessarily different (I think hashes are pretty stable), but the fingerprint generation (e.g. the canonical form, | concatenation / flattening, ...) may change because of possible bugs and edge cases we haven't thought of. We could also move to a different (more efficient?) hash function implementation, for example.

Secondly, why should we append the version to the hash_method parameter?

Yes, the algorithm + version (e.g. SHA-256@2.14.0) should be there.

I think the version covers not only the hash algorithm, but also how we flatten the object fields and concatenate the field name and value.

💯

@andrross (Member) commented Jun 7, 2024

if we couple the version with OpenSearch version, does that mean the generated hash value will be different in different OpenSearch versions?

@gaobinlong Only if we need to introduce new versions and only if the user opts into it. If we get it right in the initial implementation, then SHA-256@2.15.0 may be the only version of the SHA-256 fingerprint hashing we ever implement and support. This parameter is essentially saying "this is the fingerprint hashing method that uses SHA-256 and was introduced in OpenSearch 2.15.0". If we never need to change that method to improve performance or fix bugs, then future users on OpenSearch 4.0 may still be creating fingerprint processors with the hash method SHA-256@2.15.0.

The basic point is that we need to put some kind of version identifier with the hash method because we have to guarantee that it is stable over time. Using the OpenSearch version that the method was first introduced in is just one way to do that.
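One way to make the version in hash_method explicit, per this discussion, is to reject a bare algorithm name when parsing the parameter (a hedged sketch; parse_hash_method is a hypothetical name, not part of the PR):

```python
def parse_hash_method(hash_method: str):
    # Split "SHA-256@2.15.0" into (algorithm, version it was introduced in).
    algorithm, sep, version = hash_method.partition("@")
    if not sep or not version:
        # Force users to be explicit, so the output can never silently change.
        raise ValueError(f"hash_method must include a version, got {hash_method!r}")
    return algorithm, version

print(parse_hash_method("SHA-256@2.15.0"))  # ('SHA-256', '2.15.0')
```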

Labels: backport 2.x · ingest-pipeline · v2.16.0 · v3.0.0

3 participants