
RPC invocation_id conflict #103

Open
songweijia opened this issue Feb 21, 2019 · 11 comments

@songweijia
Contributor

When we construct an RPC message, we use a random number generated by Matt's library (mutils::long_rand()) as the invocation_id, which indexes into the results_map holding the promises for the RPC call. In the objectstore performance benchmark, I found duplicate invocation_ids within ~50K RPC calls, which throws a "std::future already retrieved" exception and then crashes. We need to fix this with non-conflicting invocation_ids. We also need to reclaim used invocation_ids - don't leave one-time garbage there forever!

Code location: send_return send(...)@derecho/remote_invocable.h:101.
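
For context, here is a minimal standalone illustration of the failure mode (not Derecho code; it just assumes results_map behaves like a map from invocation_id to a promise): retrieving the future of a promise that was already handed out throws std::future_error, which matches the "future already retrieved" crash described above.

```cpp
// Minimal sketch (not Derecho code): reusing an invocation_id means reusing
// the std::promise already stored under that key, and retrieving its future
// a second time throws std::future_error (future_already_retrieved).
#include <future>
#include <iostream>
#include <map>

int main() {
    std::map<long, std::promise<int>> results_map;  // stand-in for RemoteInvoker::results_map
    long invocation_id = 42;                        // imagine long_rand() returned this twice

    auto f1 = results_map[invocation_id].get_future();  // first call: OK

    try {
        // second call with the same id hits the same promise
        auto f2 = results_map[invocation_id].get_future();
    } catch(const std::future_error& e) {
        std::cout << "duplicate invocation_id: " << e.what() << std::endl;
    }
}
```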

@songweijia
Contributor Author

I think a simple solution is to have a set holding all the pending invocation_ids. When we create a new invocation_id, we test for duplicates and insert it into the set. Once the RPC is done, we remove it from the set. But this is not lock-free.
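
A rough sketch of this set-based approach, with hypothetical names (InvocationIdPool, acquire, and release are not Derecho APIs); the shared set is exactly why this path needs a mutex:

```cpp
// Sketch of the set-based idea: generate ids until one is not pending,
// then track it until the RPC completes. The shared set requires a mutex,
// so this path is not lock-free.
#include <mutex>
#include <random>
#include <unordered_set>

class InvocationIdPool {
    std::unordered_set<long> pending;
    std::mutex mtx;
    std::mt19937_64 rng{std::random_device{}()};

public:
    long acquire() {
        std::lock_guard<std::mutex> lock(mtx);
        long id;
        do {
            id = static_cast<long>(rng());    // stand-in for mutils::long_rand()
        } while(!pending.insert(id).second);  // retry on duplicate
        return id;
    }

    void release(long id) {                   // called once the RPC is done
        std::lock_guard<std::mutex> lock(mtx);
        pending.erase(id);
    }
};
```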

@songweijia
Contributor Author

Another option is to use a sequencer, which is easier to manage than a set.
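
A minimal sketch of the sequencer idea, assuming a per-RemoteInvoker 64-bit atomic counter is an acceptable id source; it never repeats within the counter's range, needs no duplicate check, and is lock-free:

```cpp
// Sketch of the sequencer alternative: a monotonically increasing atomic
// counter replaces the random id and the pending-id set entirely.
#include <atomic>
#include <cstdint>

class InvocationSequencer {
    std::atomic<std::uint64_t> next{0};

public:
    std::uint64_t get_next_id() {
        return next.fetch_add(1, std::memory_order_relaxed);
    }
};
```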

@etremel changed the title from "RPC invocation_id confliction" to "RPC invocation_id conflict" on Feb 22, 2019
@etremel
Contributor

etremel commented Feb 22, 2019

I agree that the results_map in a RemoteInvoker should be garbage-collected once each invocation has completed; this seems like an oversight. Also, I don't see why invocation IDs need to be random; they could just as easily be sequential, since they don't have to be globally unique (just unique within a RemoteInvoker instance). You might want to check with Matthew about why he designed it that way, though.

@sagarjha
Contributor

After discussions with Weijia, we decided to garbage collect the relevant data structures when all replies are available and the user has destroyed the reply objects. This is not very urgent right now as Weijia's fix to use a sequencer for invocation_ids allows him to dodge the bug.

@mpmilano
Member

All we need is uniqueness, not randomness. So as long as the sequencing is globally unique, you should be fine.

Also, we should tie the lifetime of the entry in that table to the lifetime of the results object we give the programmer. No need for invasive GC when we can track the actual usage.

@songweijia
Contributor Author

Why do we need a globally unique invocation_id? The target of the RPC call should know the caller's node id and the RPC signature. I had thought it would be fine as long as the invocation_id is unique within that domain.

Sagar and I sketched a GC solution like this: in the destructor of class QueryResults, we can test whether the promises are all fulfilled. If they are, we clear the entry; otherwise, we set a tombstone in the corresponding PendingResults entry in the map.
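
A rough sketch of that destructor-driven cleanup, using simplified stand-in types (the real PendingResults and QueryResults classes are templated and need locking, which is omitted here):

```cpp
// Sketch of the destructor-driven GC idea: if every promise was fulfilled,
// drop the map entry now; otherwise leave a tombstone so the reply path can
// finish delivery and clean up the entry later.
#include <map>

struct PendingResults {
    bool all_fulfilled = false;
    bool tombstone = false;   // set when the QueryResults owner has gone away
};

struct QueryResults {
    std::map<long, PendingResults>* results_map;  // owned by the RemoteInvoker
    long invocation_id;

    ~QueryResults() {
        auto it = results_map->find(invocation_id);
        if(it == results_map->end()) return;
        if(it->second.all_fulfilled) {
            results_map->erase(it);        // safe to clear immediately
        } else {
            it->second.tombstone = true;   // reply path erases it later
        }
    }
};
```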

@mpmilano
Member

That makes sense for the sketch. I'm actually super sick right now, so I genuinely can't remember why I think there needs to be uniqueness.

@KenBirman
Contributor

KenBirman commented Feb 23, 2019 via email

@KenBirman
Contributor

KenBirman commented Feb 23, 2019 via email

@mpmilano
Member

Oh that is pretty good actually --- it would certainly be unique enough, and it would entirely bypass the existing RDMC-style buffer management. Since we're overhauling that anyway, it makes sense to me to take advantage of these things!

Re: Weijia's question: as of the last time I had a hand in this, the actual runtime process that ships around results to invocations is totally type-erased; the result datagram is only addressed based on the invocation ID, not based on the receiver types at all. This is why they needed to be globally unique; it would be enough for them to be unique within a single RDMC/top-level group, but so far we've only ever had one of those per process anyway.
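
An illustrative sketch of why the id is the only routing key on the reply path (stand-in code, not the actual type-erased machinery in Derecho):

```cpp
// Illustrative sketch of type-erased reply dispatch: the receive path only
// sees raw bytes plus an invocation id, so the id alone must select the
// completion callback; hence the uniqueness requirement across the group.
#include <cstddef>
#include <functional>
#include <map>

std::map<long, std::function<void(const char*, std::size_t)>> reply_handlers;

void deliver_reply(long invocation_id, const char* buf, std::size_t size) {
    // no type information here: the id is the only routing key
    auto it = reply_handlers.find(invocation_id);
    if(it != reply_handlers.end()) {
        it->second(buf, size);
    }
}
```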

@songweijia
Contributor Author

It turns out my fix had an issue. On receiving a reply message, the p2p message loop sets the returned value in RemoteInvoker::results_map[invocation_id]. At the same time, the application thread may be calling p2p_send() and inserting a new entry into RemoteInvoker::results_map. Since the p2p message loop has given up the lock on the results_map by that point, those two operations happen concurrently. However, std::map, the type of results_map, is not thread-safe: if one thread is inserting or deleting some key K1, retrieving the value of another key K2 may fail with an std::out_of_range exception. That's why the p2p message loop encounters nondeterministic std::out_of_range exceptions.

In commit 6b7c1ac, I gave up on std::map and replaced it with a pre-filled array. I also added a reset() method to the PendingResults class. p2p_send() resets the corresponding PendingResults slot in the array without taking any lock or performing a map insertion/deletion, so the reply processing path in the p2p message loop is no longer affected. However, we now use several hundred KB for the results vector. We may need a better design later.
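
A simplified sketch of the fixed-slot scheme described above (hypothetical names and a hypothetical bound on in-flight calls; the real code in commit 6b7c1ac is templated and more involved):

```cpp
// Sketch of the fixed-slot approach: the array is allocated up front, so
// p2p_send() only reset()s a slot and the reply path indexes into stable
// storage; there are no map insert/erase races.
#include <array>
#include <cstddef>
#include <cstdint>
#include <future>

struct PendingResults {
    std::promise<int> promise;                        // real code is templated on the reply type
    void reset() { promise = std::promise<int>(); }   // recycle the slot for a new call
};

constexpr std::size_t MAX_PENDING = 4096;             // assumed bound on in-flight invocations
std::array<PendingResults, MAX_PENDING> results_vector;

// caller side: claim a slot by sequence number, no insertion into a shared map
std::future<int> start_call(std::uint64_t invocation_id) {
    auto& slot = results_vector[invocation_id % MAX_PENDING];
    slot.reset();
    return slot.promise.get_future();
}

// p2p message loop side: fulfill the promise without touching map structure
void deliver_reply(std::uint64_t invocation_id, int value) {
    results_vector[invocation_id % MAX_PENDING].promise.set_value(value);
}
```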

@songweijia removed the bug label May 7, 2019