
Add fingerprint ingest processor #13724

Open · wants to merge 13 commits into base: main
Conversation

@gaobinlong (Contributor) commented May 17, 2024

Description

Add a new ingest processor named fingerprint, which generates a hash value from the specified fields (or from all fields not in a specified exclude list) and writes the hash value to the target_field. The hash value can be used to deduplicate documents within an index and to collapse search results.

The usage of this processor is:

"processors": [
      {
        "fingerprint": {
          "fields": ["foo", "bar"],
          "target_field": "fingerprint",
          "hash_method": "SHA-256",
          "ignore_missing": true
        }
      }
    ]

or

"processors": [
      {
        "fingerprint": {
          "exclude_fields": ["zoo"],
          "target_field": "fingerprint"
        }
      }
    ]

The main parameters of this processor are:

  1. fields: the fields in the document used to generate the hash value. Field names and values are concatenated and separated by |, like |field1|value1|field2|value2|. For nested fields, the field name is flattened, like |root_field.sub_field1|value1|root_field.sub_field2|value2|.
  2. include_all_fields: whether all fields are used to generate the hash value; only one of fields and include_all_fields can be set.
  3. exclude_fields: all fields except those in this list are used to generate the hash value; at most one of fields and exclude_fields can be non-empty.
  4. hash_method: one of MD5, SHA-1, or SHA-256; SHA-1 is the default.
  5. target_field: the field in which to store the hash value.
  6. ignore_missing: if true and one of the specified fields is missing, the processor exits quietly and does nothing.

In addition, if fields and exclude_fields are both empty or null, all fields are used to generate the hash value.
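The concatenate-then-hash scheme described by these parameters can be sketched in a few lines (an illustrative sketch only, not the processor's actual Java implementation; the flatten and fingerprint helper names are hypothetical):

```python
import hashlib

def flatten(doc, prefix=""):
    """Flatten nested fields into dotted paths, e.g. root_field.sub_field1."""
    out = {}
    for key, value in sorted(doc.items()):
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out

def fingerprint(doc, fields=None, hash_method="SHA-1"):
    """Concatenate |name|value| pairs and hash them with the chosen method."""
    flat = flatten(doc)
    names = fields if fields is not None else sorted(flat)
    concatenated = "|" + "|".join(f"{n}|{flat[n]}" for n in names) + "|"
    algo = {"MD5": "md5", "SHA-1": "sha1", "SHA-256": "sha256"}[hash_method]
    return hashlib.new(algo, concatenated.encode("utf-8")).hexdigest()

doc = {"root_field": {"sub_field1": "value1", "sub_field2": "value2"}}
# Hashes "|root_field.sub_field1|value1|root_field.sub_field2|value2|"
print(fingerprint(doc))
```

The actual processor writes the digest to target_field rather than returning it, and may encode it differently.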

Related Issues

#13612

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • API changes companion pull request created.
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

❌ Gradle check result for 50cdb84: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


✅ Gradle check result for e6b8851: SUCCESS


codecov bot commented May 17, 2024

Codecov Report

Attention: Patch coverage is 85.96491% with 16 lines in your changes missing coverage. Please review.

Project coverage is 71.73%. Comparing base (b15cb0c) to head (a505368).
Report is 368 commits behind head on main.

Current head a505368 differs from pull request most recent head fb74e64

Please upload reports for the commit fb74e64 to get more accurate results.

Files Patch % Lines
...opensearch/ingest/common/FingerprintProcessor.java 86.48% 5 Missing and 10 partials ⚠️
...search/ingest/common/IngestCommonModulePlugin.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13724      +/-   ##
============================================
+ Coverage     71.42%   71.73%   +0.30%     
- Complexity    59978    61440    +1462     
============================================
  Files          4985     5072      +87     
  Lines        282275   288504    +6229     
  Branches      40946    41784     +838     
============================================
+ Hits         201603   206945    +5342     
- Misses        63999    64466     +467     
- Partials      16673    17093     +420     



❌ Gradle check result for 4a37513: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Jun 5, 2024

❌ Gradle check result for 7cad2e0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Jun 5, 2024

❕ Gradle check result for 7cad2e0: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@andrross (Member) commented Jun 5, 2024

field name and value are concatenated and separated by |, like |field1|value1|field2|value2|

Let's say I have the following two documents:

{
  "field1": "value1",
  "field2": "value2"
}
{
  "field1": "value1|field2|value2"
}

Assuming I set "ignore_missing=true", then will these hash to the same fingerprint?

@gaobinlong (Contributor, Author) commented Jun 6, 2024

field name and value are concatenated and separated by |, like |field1|value1|field2|value2|

Let's say I have the following two documents:

{
  "field1": "value1",
  "field2": "value2"
}
{
  "field1": "value1|field2|value2"
}

Assuming I set "ignore_missing=true", then will these hash to the same fingerprint?

If ignore_missing is true and one of the specified fields doesn't exist, the processor exits quietly without modifying the document, so no hash value is generated. ignore_missing is a common parameter whose meaning is the same across all processors.

Actually, yes: in this case both documents get the same fingerprint. But the ignore_missing parameter is controlled by users, so maybe we can just document this behavior and explain the underlying issue to users who set ignore_missing to true?
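The ambiguity being discussed is easy to reproduce with a sketch of the naive "|"-separated concatenation (naive_concat is a hypothetical stand-in for the concatenation step, not the processor's code):

```python
import hashlib

def naive_concat(doc):
    # Naive "|"-separated concatenation, as originally described in the PR.
    return "|" + "|".join(f"{k}|{v}" for k, v in doc.items()) + "|"

doc_a = {"field1": "value1", "field2": "value2"}
doc_b = {"field1": "value1|field2|value2"}

# Both documents produce the identical string "|field1|value1|field2|value2|",
# so they hash to the same fingerprint.
assert naive_concat(doc_a) == naive_concat(doc_b)
assert (hashlib.sha1(naive_concat(doc_a).encode()).hexdigest()
        == hashlib.sha1(naive_concat(doc_b).encode()).hexdigest())
```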

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

github-actions bot commented Jun 6, 2024

❌ Gradle check result for cff2bcf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

github-actions bot commented Jun 6, 2024

❌ Gradle check result for 4447c9c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Jun 6, 2024

❕ Gradle check result for a505368: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@andrross (Member) commented Jun 6, 2024

maybe we can just document this behavior and tell users the underlying issue if they set ignore_missing to true?

I think we can fix the problem. The issue is that the delimiter can exist within the text being delimited, so there is ambiguity. The general technique I've seen to avoid this problem is to prepend the length in front of each string being hashed. Netstring is one method for doing this. There are likely other techniques as well.

Edit: here is a random article from the internet discussing this problem: https://crypto.stackexchange.com/questions/55162/best-way-to-hash-two-values-into-one
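A minimal sketch of the Netstring-style length prefixing suggested above, showing that it disambiguates the two documents from the earlier example (netstring and encode are hypothetical helper names):

```python
def netstring(s: str) -> str:
    # Netstring framing: "<byte length>:<data>," e.g. "hello" -> "5:hello,"
    data = s.encode("utf-8")
    return f"{len(data)}:{s},"

def encode(doc):
    # Length-prefix every field name and value before hashing, so a "|"
    # inside a value can no longer shift the field boundaries.
    return "".join(netstring(k) + netstring(str(v)) for k, v in doc.items())

doc_a = {"field1": "value1", "field2": "value2"}
doc_b = {"field1": "value1|field2|value2"}

# "6:field1,6:value1,6:field2,6:value2," vs "6:field1,20:value1|field2|value2,"
assert encode(doc_a) != encode(doc_b)
```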

@andrross (Member) commented Jun 6, 2024

Just kind of a general concern here...getting a consistent hash output over time and across versions is critically important (please correct me if I'm wrong). In other words, given a fixed input document and processor configuration, future versions must continue to generate the exact same fingerprint hash. The 'hash_method' is only one part of it. How we flatten/normalize fields is part of it, and how we concatenate the resulting fields and values into the hash are another part of it. If we find a problem with the normalization or concatenation in the future, then I don't think we can just fix it in place, but we'll have to implement a new version for users to opt in to. I'm wondering if we should bake a version into "hash_method", like V1_SHA-256 or 2024-06-01_SHA-256 to give us the flexibility to release future versions of the hashing technique using the same underlying hashing algorithm?

Also, do we truly need to give users the flexibility to choose the hash algorithm? Can we start even simpler by having a single hash_method called V1 where the algorithm used is an implementation detail? I know there are tradeoffs between the algorithms, but is there a good one-size-fits-all algorithm that might work? We can always add configurability in the future if it is needed.

I don't mean to over-engineer things here, but I think this is important to get right given that we're persisting data here and must continue to support it going forward once we release it.

@reta (Collaborator) commented Jun 6, 2024

How we flatten/normalize fields is part of it, and how we concatenate the resulting fields and values into the hash are another part of it. If we find a problem with the normalization or concatenation in the future, then I don't think we can just fix it in place, but we'll have to implement a new version for users to opt in to. I'm wondering if we should bake a version into "hash_method", like V1_SHA-256 or 2024-06-01_SHA-256 to give us the flexibility to release future versions of the hashing technique using the same underlying hashing algorithm?

This is a really good concern; I sadly don't have much experience with fingerprinting in general and its flaws. Keeping the options open sounds like a safe bet; maybe a more straightforward way would be to couple it with the OpenSearch version? For example:

  • SHA-256 - whatever the current is, would be stored as "SHA-256@"
  • SHA-256@2.14.0 - use the SHA-256 impl as of OpenSearch 2.14.0

Also, do we truly need to give users the flexibility to choose the hash algorithm?

I think we have some good suggestions to deal with that now; I would say having a choice is forward-thinking.

@andrross (Member) commented Jun 6, 2024

SHA-256 - whatever the current is, would be stored as "SHA-256@"

I don't like this. I would think that most use cases of this feature would break if the hashing output changed. I think changing it should be opt-in only and never happen automatically.

SHA-256@2.14.0

I like this. It is more meaningful than an arbitrary version number or date that I suggested.

@reta (Collaborator) commented Jun 6, 2024

I don't like this. I would think that most use cases of this feature would break if the hashing output changed. I think changing it should be opt-in only and never happen automatically.

I am sorry @andrross, I just realized that the example was incomplete; it was supposed to be SHA-256@<current version>. Thank you for noticing that.

@andrross (Member) commented Jun 6, 2024

I am sorry @andrross, I just realized that the example was incomplete, it supposed to be SHA-256@<current version>, thank you for noticing that.

@reta Ah, ok, that does make more sense, but I would still prefer to always be fully explicit with the version.

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
@gaobinlong (Contributor, Author)

maybe we can just document this behavior and tell users the underlying issue if they set ignore_missing to true?

I think we can fix the problem. The issue is that the delimiter can exist within the text being delimited, so there is ambiguity. The general technique I've seen to avoid this problem is to prepend the length in front of each string being hashed. Netstring is one method for doing this. There are likely other techniques as well.

Edit: here is a random article from the internet discussing this problem: https://crypto.stackexchange.com/questions/55162/best-way-to-hash-two-values-into-one

Thank you, I've changed the code by following the Netstring, please help to take a look!

@gaobinlong (Contributor, Author)

Keeping the options open sounds like a safe bet, may be more straightforward way would be to couple it with OpenSearch version?

Thanks @reta @andrross, I have a few questions. If we couple the version with the OpenSearch version, does that mean the generated hash value will be different in different OpenSearch versions? Secondly, why should we append the version to the hash_method parameter? Does that mean users can choose which version they want to use, so we should handle different versions in a future release? I think the version covers not only the hash algorithm but also how we flatten the object fields and concatenate the field names and values. So how about making the version internal but still exposing it to users, just like the community_id processor, which prepends the version number to the hash string? The generated hash value then looks like 1:pkvHqCL88/tg1k4cPigmZXUtL00=, and if the version number doesn't change, the generated hash value stays consistent across OpenSearch versions.
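The community_id-style version prefix proposed here could look roughly like this (a sketch under the assumption of a SHA-1 digest encoded as base64; FINGERPRINT_VERSION and versioned_fingerprint are hypothetical names):

```python
import base64
import hashlib

# Bumped only when the flattening/concatenation scheme (not just the hash
# algorithm) changes, so existing fingerprints stay reproducible.
FINGERPRINT_VERSION = 1

def versioned_fingerprint(data: bytes) -> str:
    digest = hashlib.sha1(data).digest()
    # Prefix the scheme version, mirroring community_id's "1:<base64 digest>".
    return f"{FINGERPRINT_VERSION}:{base64.b64encode(digest).decode('ascii')}"

print(versioned_fingerprint(b"|foo|bar|"))  # "1:" followed by 28 base64 chars
```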

@gaobinlong (Contributor, Author)

Hi @reta, we may not have enough time to complete this work before the code freeze date for release 2.15.0, so it can be postponed to the next release. Could you help remove the v2.15.0 label? Thank you!


github-actions bot commented Jun 7, 2024

❌ Gradle check result for fb74e64: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@reta added the v2.16.0 label and removed the v2.15.0 label on Jun 7, 2024
@reta (Collaborator) commented Jun 7, 2024

Thanks @gaobinlong

Thanks @reta @andrross, I have few questions, if we couple the version with OpenSearch version, does that mean the generated hash value will be different in different OpenSearch versions?

Maybe not necessarily different (I think hashes are pretty stable), but the fingerprint generation (e.g. the canonical form, | concatenation / flattening, ...) may change because of possible bugs and edge cases we haven't thought of. We could also move to a different (more efficient?) hash function implementation, for example.

Secondly, why should we append the version to the hash_method parameter?

Yes, the algorithm + version (e.g. SHA-256@2.14.0) should be there.

I think the version covers not only the hash algorithm, but also how we flatten the object fields and concatenate the field name and value.

💯

@andrross (Member) commented Jun 7, 2024

if we couple the version with OpenSearch version, does that mean the generated hash value will be different in different OpenSearch versions?

@gaobinlong Only if we need to introduce new versions and only if the user opts into it. If we get it right in the initial implementation, then SHA-256@2.15.0 may be the only version of the SHA-256 fingerprint hashing we ever implement and support. This parameter is essentially saying "this is the fingerprint hashing method that uses SHA-256 and was introduced in OpenSearch 2.15.0". If we never need to change that method to improve performance or fix bugs, then future users on OpenSearch 4.0 may still be creating fingerprint processors with the hash method SHA-256@2.15.0.

The basic point is that we need to put some kind of version identifier with the hash method because we have to guarantee that it is stable over time. Using the OpenSearch version that the method was first introduced in is just one way to do that.
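One way to make the version in hash_method explicit, per this discussion, is to reject a bare algorithm name when parsing the parameter (a hedged sketch; parse_hash_method is a hypothetical name, not part of the PR):

```python
def parse_hash_method(hash_method: str):
    # Split "SHA-256@2.15.0" into (algorithm, version it was introduced in).
    algorithm, sep, version = hash_method.partition("@")
    if not sep or not version:
        # Force users to be explicit, so the output can never silently change.
        raise ValueError(f"hash_method must include a version, got {hash_method!r}")
    return algorithm, version

print(parse_hash_method("SHA-256@2.15.0"))  # ('SHA-256', '2.15.0')
```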

Labels: backport 2.x · ingest-pipeline · v2.16.0 · v3.0.0

3 participants