Shouldn't a Person email be unique ? #196

Open
wizmer opened this issue Oct 2, 2018 · 5 comments

Comments

@wizmer
Contributor

wizmer commented Oct 2, 2018

Hi,

Due to the async nature of Nexus, I mistakenly created multiple instances of a Person representing the same actual person. That comes from a bad design on my side, but I was wondering whether it would be doable and beneficial to constrain the uniqueness of the email address directly in the schema?

Thanks

@MFSY
Contributor

MFSY commented Oct 2, 2018

Hi @wizmer,
It is certainly beneficial to be able to constrain the uniqueness of a given property (for example email), but this is really hard to do with SHACL and at scale. Unfortunately, I don't think we have that capability right now. But if you are using pyxus to interact with the Nexus API, there is an option that helps make sure an entity is not submitted twice.

The workaround is to add a schema:identifier property to the person instance and set the person's email as its value. The find_by_identifier function can then be used to check, through search, whether a person with a given email value is already registered. You can generalize this with find_by_field if you want to use another field as the identifier.
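As a rough illustration of that check-before-create pattern, here is a minimal sketch against an in-memory stand-in. `InMemoryStore`, its method signatures, `get_or_create_person`, and the lowercase normalization are all assumptions made for this example; the method name only mirrors the function mentioned above and this is not the pyxus API.

```python
import uuid
from typing import Dict, List, Optional


class InMemoryStore:
    """Hypothetical stand-in for a Nexus client; not the pyxus API."""

    def __init__(self) -> None:
        self._instances: List[Dict[str, str]] = []

    def find_by_identifier(self, identifier: str) -> Optional[Dict[str, str]]:
        # Return the first instance whose schema:identifier matches, if any.
        for instance in self._instances:
            if instance["schema:identifier"] == identifier:
                return instance
        return None

    def create(self, payload: Dict[str, str]) -> Dict[str, str]:
        # Store a copy of the payload together with a generated UUID.
        instance = dict(payload, uuid=str(uuid.uuid4()))
        self._instances.append(instance)
        return instance


def get_or_create_person(store: InMemoryStore, email: str, name: str) -> Dict[str, str]:
    # Normalize so " Jane@Example.org" and "jane@example.org" map to one key
    # (the normalization rule is an assumption, adapt it to your data).
    identifier = email.strip().lower()
    existing = store.find_by_identifier(identifier)
    if existing is not None:
        return existing
    return store.create(
        {"schema:identifier": identifier, "schema:email": email.strip(), "schema:name": name}
    )


store = InMemoryStore()
first = get_or_create_person(store, "jane@example.org", "Jane Doe")
second = get_or_create_person(store, " Jane@Example.org", "Jane Doe")
print(first["uuid"] == second["uuid"])  # True
```

This only works as shown because the in-memory lookup sees every write immediately; a search-backed lookup may lag behind recent writes.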

I also suggest using ORCID iDs as person identifiers whenever possible (at least for people working in a research context).

Maybe @olinux can weigh in and help.

@olinux
Collaborator

olinux commented Oct 2, 2018

Hi everyone,
Here are my two cents: since we have (or had) the same problem, we've introduced schema:identifier in all our instances (as a hash). It represents a value that defines an entity, so that different entities can be clearly distinguished regardless of their generated UUID. The availability and strength of these keys depend on your data. So if an email is distinctive enough for your data (e.g. the probability is low enough that an e-mail address is re-used after a person has left the organization and then appears with a different meaning in your data), you can use it as a key that is checked for existence before uploading. Since the indices in Elasticsearch and Blazegraph are updated asynchronously, two uploads in quick succession can still both go through: the existence check is based on either Elasticsearch or Blazegraph and can report an instance as non-existent while the first message of its creation is still in the queue.
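To make the race concrete, here is a small simulation. The choice of SHA-256, the normalization in `identifier_hash`, and the `LaggingIndexStore` class are assumptions for illustration only; the toy store merely models "writes are accepted at once, the search index catches up later".

```python
import hashlib
from typing import Dict, List, Set


def identifier_hash(*parts: str) -> str:
    # Hash the normalized distinguishing values into a stable key.
    normalized = "|".join(p.strip().lower() for p in parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class LaggingIndexStore:
    """Toy model: writes land in the primary store immediately, but the
    search index only sees them after refresh() is called."""

    def __init__(self) -> None:
        self.primary: List[Dict[str, str]] = []
        self.indexed: Set[str] = set()

    def exists(self, identifier: str) -> bool:
        # The existence check reads the (possibly stale) index.
        return identifier in self.indexed

    def create(self, identifier: str) -> None:
        self.primary.append({"schema:identifier": identifier})

    def refresh(self) -> None:
        # Simulates the asynchronous index update catching up.
        self.indexed = {i["schema:identifier"] for i in self.primary}


def upload_if_missing(store: LaggingIndexStore, email: str) -> bool:
    key = identifier_hash(email)
    if store.exists(key):
        return False
    store.create(key)
    return True


store = LaggingIndexStore()
# Two uploads in quick succession, before the index has been refreshed.
created = [upload_if_missing(store, "jane@example.org") for _ in range(2)]
store.refresh()
created.append(upload_if_missing(store, "jane@example.org"))
print(created)             # [True, True, False]
print(len(store.primary))  # 2
```

Both early uploads pass the existence check, so the duplicate is only detectable after the index has caught up.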
If you don't want to introduce complex and time-expensive synchronization mechanisms (such as client-side locks), you will not be able to eliminate that risk as far as we can tell, and you need to introduce cleanup mechanisms. If you're interested in more detail about what we're doing to prevent these things from happening, feel free to ask, and we can give you a short tour.
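One possible shape for such a cleanup pass, sketched under assumptions: instances are plain dictionaries, `created_at` is a hypothetical ISO 8601 UTC timestamp field, and "keep the oldest instance" is an arbitrary policy chosen for the example rather than anything described in this thread.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Instance = Dict[str, str]


def plan_cleanup(instances: List[Instance]) -> List[Tuple[Instance, List[Instance]]]:
    """Group instances by schema:identifier and, for each group that has
    duplicates, return (instance to keep, instances to deprecate)."""
    groups: Dict[str, List[Instance]] = defaultdict(list)
    for instance in instances:
        groups[instance["schema:identifier"]].append(instance)

    plan = []
    for members in groups.values():
        if len(members) > 1:
            # Keep the oldest; ISO 8601 UTC timestamps sort lexicographically.
            ordered = sorted(members, key=lambda m: m["created_at"])
            plan.append((ordered[0], ordered[1:]))
    return plan


instances = [
    {"uuid": "a", "schema:identifier": "jane@example.org", "created_at": "2018-10-02T09:00:01Z"},
    {"uuid": "b", "schema:identifier": "jane@example.org", "created_at": "2018-10-02T09:00:00Z"},
    {"uuid": "c", "schema:identifier": "john@example.org", "created_at": "2018-10-02T09:05:00Z"},
]
for keep, drop in plan_cleanup(instances):
    print(keep["uuid"], [d["uuid"] for d in drop])  # b ['a']
```

A real cleanup would presumably also have to repoint any links that target the deprecated duplicates, which this sketch does not attempt.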

@wizmer
Contributor Author

wizmer commented Oct 2, 2018

Thanks to both of you for the answers,

@MFSY Yes, that's what I was afraid of. After all, SHACL was developed for the Web. How could you guarantee uniqueness at Web scale?

@olinux Well, I am not using Pyxus but @genric's entity-management library. I am in exactly the case you described: the indices were not updated yet, so instances were created multiple times. I am currently implementing a client-side lock that blocks until the indices are updated. I think I have no other choice. Have you already implemented such a thing in Pyxus? If so, I'd be interested in having a look.

@jdcourcol

jdcourcol commented Oct 15, 2018

@MFSY What strategy should we adopt for the BBP Nexus database?
So far I thought the email was the unique identifier. Is that assumption still valid?

@MFSY
Contributor

MFSY commented Oct 16, 2018

Hi @jdcourcol,

I propose we take this discussion offline. I'll organize a call for that to revisit how identifiers are handled with the latest Nexus development.
But I can say that while emails are used as identifiers on some data, we know this is not a final solution. People can change emails, have multiple emails from different institutions and there is a risk of repurposing already used emails.

As you can see from the question and answers above, it is not trivial to enforce entity uniqueness based on the email value or any other property value with Nexus v0.
With Nexus v1, we are in a better position to implement best practices in terms of entity and person identification. For example, we have started to look at ORCID as a person identifier (a person can still have one or many emails). We think it makes sense to use ORCID in a research context.

Of course the devil is in the details and we need to coordinate on this.
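If ORCID iDs do end up being used as keys, it may help to validate them before storing. The sketch below checks the 16-character hyphenated form and its final check character, which ORCID computes with ISO 7064 MOD 11-2; the function names are made up for this example, and accepting only the bare hyphenated form (no URL prefix) is an assumption.

```python
import re

# Four hyphen-separated groups; the last character may be a digit or "X".
ORCID_PATTERN = re.compile(r"(\d{4})-(\d{4})-(\d{4})-(\d{3}[\dX])")


def orcid_check_character(base_digits: str) -> str:
    # ISO 7064 MOD 11-2 over the first 15 digits.
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    match = ORCID_PATTERN.fullmatch(orcid.strip().upper())
    if match is None:
        return False
    characters = "".join(match.groups())
    return orcid_check_character(characters[:15]) == characters[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False
```

A valid checksum only shows that the string is well formed, not that it belongs to the person in question.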


4 participants