Add ability to use custom hash functions in hashing.hash and Memory
#1232
Codecov Report
@@            Coverage Diff             @@
##           master    #1232      +/-   ##
==========================================
+ Coverage   88.68%   88.82%   +0.14%
==========================================
  Files          47       47
  Lines        7052     7106      +54
==========================================
+ Hits         6254     6312      +58
+ Misses        798      794       -4
Ideally, it would be nice to default to
@tomMoral I saw you reviewed the other PR which attempted this. As can be seen from the linked issue and my basic benchmarks, the use case here is large DataFrame and array arguments.
Overall, being able to easily choose the hash library seems like a nice addition, and I would like to help get this merged.
My main question is about the API design: should we require registering the hash function? I feel that registration would not really be useful: if one instantiates Memory in two modules that can be imported separately, one would need to register the hash function in each of them (or import joblib in a parent module). Also, if two libraries register the same hash, this would raise an error on import.
I think we could simplify the API by replacing hash_name with a hash_func argument that can take a string or a callable. If hash_func is a string, we simply call hashlib.new(hash_func); otherwise we call hash_func(). This way, we don't have to register the hash function. This would simplify using xxhash, for instance, with the following code:
import xxhash
mem = Memory(hash_func=xxhash.xxh3_64)
Also, as we add the possibility of changing the hash function, we could have collisions between hashes from different arguments. This comment by @ogrisel, #343 (comment), proposes a solution: add a tag to the hash and raise a warning if it does not match. Note that to avoid breaking backward compatibility, one would need to default to md5: if no tag is present.
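One way to picture the tagging idea is sketched below. The `<hash_name>:<hexdigest>` layout and the three function names are assumptions made for illustration; the linked comment may describe a different format:

```python
import hashlib
import warnings


def tagged_digest(data, hash_name="md5"):
    """Return '<hash_name>:<hexdigest>' for bytes `data` (hypothetical format)."""
    return "{}:{}".format(hash_name, hashlib.new(hash_name, data).hexdigest())


def split_tag(stored, default="md5"):
    """Split a stored value into (tag, digest); untagged values count as md5."""
    tag, sep, digest = stored.partition(":")
    if not sep:
        # No tag present: assume a legacy md5 digest for backward compatibility.
        return default, stored
    return tag, digest


def matches(stored, data, hash_name="md5"):
    """Compare `data` against a stored value, warning on a tag mismatch."""
    tag, digest = split_tag(stored)
    if tag != hash_name:
        warnings.warn(
            "Stored hash was computed with {!r}, not {!r}".format(tag, hash_name)
        )
        return False
    return hashlib.new(hash_name, data).hexdigest() == digest
```

Under this sketch, digests written before the change carry no tag and are still read as md5, while a digest produced with another algorithm is never silently compared against the wrong one.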
if hash_name not in hashing._HASHES:
    raise ValueError("Valid options for 'hash_name' are {}. "
                     "Got hash_name={!r} instead."
                     .format(hash_name, hash_name))
- .format(hash_name, hash_name))
+ .format(hashing._HASHES, hash_name))
@@ -495,3 +495,23 @@ def test_wrong_hash_name():
     with raises(ValueError, match=msg):
         data = {'foo': 'bar'}
         hash(data, hash_name='invalid')
+
+
+def test_right_regist_hash():
- def test_right_regist_hash():
+ def test_right_register_hash():
try:
    import xxhash
except ImportError:
    xxhash = None
I would probably remove this part. The rationale is that, as this is not the default, users would mainly discover it through the docs, and we could explain there how to register it when they create a Memory object.
Otherwise, this results in an extra library import in all processes, which might be costly (not sure how heavy this library is?).
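If an optional backend were kept at all, one way to avoid paying the import cost in every process is to defer the import until the hasher is actually requested. The sketch below illustrates that idea only; `load_optional_hasher` is a hypothetical helper and not part of joblib:

```python
import hashlib
import importlib


def load_optional_hasher(module_name, attr):
    """Import `module_name` on demand and return `attr` from it, or None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The optional dependency is not installed.
        return None
    return getattr(module, attr, None)


# Nothing is imported until this call; a missing library yields None.
factory = load_optional_hasher("xxhash", "xxh3_64")
if factory is None:
    # Fall back to a hashlib constructor when the optional library is absent.
    factory = hashlib.md5
```

The module-level cost is then limited to `importlib`, and processes that never ask for the optional hasher never import it.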
Hi @judahrand, are you still working on this PR?
This Pull Request is a suggestion to improve #343.
It adds the ability to register and use custom hash functions. Additionally, it exposes the ability to choose which hash function is used to hash arguments in Memory.