Another critical bugfix for API change from libfabric 1.7.0 to libfabric 1.12.1

In libfabric v1.12.1, the verbs provider's fi_cq_open() API does not pick a valid size when the given size is zero. In that case, fi_msg() always returns -FI_EAGAIN, causing an infinite loop in RDMC initialization. The TCP provider is not affected. Instead of letting fi_cq_open() pick a size for us, we set it to a fixed value, 2097152.
songweijia committed Jun 6, 2021
1 parent 626b842 commit 2fdb5f0
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions src/rdmc/lf_helper.cpp
@@ -665,6 +665,11 @@ bool lf_initialize(const std::map<node_id_t, std::pair<ip_addr_t, uint16_t>>& ip
     fail_if_nonzero_retry_on_eagain(
             "fi_domain() failed", CRASH_ON_FAILURE,
             fi_domain, g_ctxt.fabric, g_ctxt.fi, &(g_ctxt.domain), nullptr);
+    /**
+     * libfabric 1.12 does not pick an adequate default value for completion queue size.
+     * We simply set it to a large enough one.
+     */
+    g_ctxt.cq_attr.size = 2097152;
     fail_if_nonzero_retry_on_eagain(
             "failed to initialize tx completion queue", CRASH_ON_FAILURE,
             fi_cq_open, g_ctxt.domain, &(g_ctxt.cq_attr), &(g_ctxt.cq), nullptr);
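
The change above assigns the completion queue size unconditionally. As a rough illustration only, the sketch below shows a variant that substitutes the same fallback value only when no size was requested. It does not use libfabric: `cq_attr_sketch`, `effective_cq_size`, `make_cq_attr`, and `kFallbackCqSize` are hypothetical names invented for this example, and `cq_attr_sketch` merely stands in for the `size` field of the real attribute structure.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the single attribute field this commit touches.
// It is NOT the libfabric structure; it only models "size == 0 means unset".
struct cq_attr_sketch {
    std::size_t size = 0;
};

// The value hard-coded by the commit.
constexpr std::size_t kFallbackCqSize = 2097152;

// Hypothetical helper: keep an explicit request, otherwise use the fallback,
// so the provider is never asked to choose a size on its own.
constexpr std::size_t effective_cq_size(std::size_t requested,
                                        std::size_t fallback = kFallbackCqSize) {
    return requested == 0 ? fallback : requested;
}

// Compile-time checks of the two cases.
static_assert(effective_cq_size(0) == kFallbackCqSize, "zero falls back");
static_assert(effective_cq_size(1024) == 1024, "explicit size is kept");

// Builds an attribute object whose size is guaranteed to be nonzero.
inline cq_attr_sketch make_cq_attr(std::size_t requested) {
    cq_attr_sketch attr;
    attr.size = effective_cq_size(requested);
    assert(attr.size != 0);  // postcondition: never hand a zero size onward
    return attr;
}
```

Under these assumptions, `make_cq_attr(0).size` yields 2097152 and `make_cq_attr(1024).size` yields 1024. The committed code is simpler and always overrides the value; the conditional form is only one possible design if a caller-provided size should be preserved.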