10 of 10 LNX series - Introduce the first version of the LINKx provider #10034
Open
amirshehataornl wants to merge 54 commits into ofiwg:main from amirshehataornl:09_lnx_linkx_provider
+6,387 −314
Conversation
amirshehataornl force-pushed the 09_lnx_linkx_provider branch 5 times, most recently from b6242ba to e6333d5 on May 11, 2024 06:10
When checking fabric attributes with ofi_check_fabric_attr(), make sure to consider provider exclusion. When checking whether a provider name is given, only consider names which are not excluded using the '^' character. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
It is not efficient to do a reverse lookup on the AV table when a message is received. Some providers do not store the fi_addr_t associated with the peer in the header passed on the wire, and it is not practical to require providers to add it to the wire header, as that would break backwards compatibility. To handle this case, an address matching callback is added to the peer_srx.peer_ops structure. This allows the provider receiving the message to register an address matching callback, which the owner provider calls to match an fi_addr_t with the provider-specific address in the received message. The callback lets the receiving provider do an O(1) index into the AV table to look up the peer's address and compare it with the source address in the received message. As part of this change, provider-specific address information needs to be passed to the owner provider, which the owner gives back to the receiving provider when it attempts address matching. Update the SHM and LINKx providers to conform with the API changes. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add a new structure, fi_peer_match, to collect the parameters which need to be passed to the get_msg and get_tag functions, and update the util_get_tag() and util_get_msg() function callbacks. Without this change, the mismatched callback signatures compile with only a warning, causing memory corruption when the callbacks are called. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
amirshehataornl force-pushed the 09_lnx_linkx_provider branch from e6333d5 to f775ecb on May 16, 2024 13:22
Add a memory registration callback to fi_ops_srx_peer. This allows core providers to expose a memory registration callback which the parent or peer provider can use to register memory on the receive path. For example, the CXI provider registers memory with the NIC on the receive path. When using the peer infrastructure this cannot happen, because we do not know which provider will perform the receive operation. But if the source NID is specified then we do know, and can therefore perform the receive buffer registration at the top of the receive path. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the FI_PEER capability bit. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The parent provider should be able to get access to the peer provider callbacks. Added the srx block in the fid.context so we can retrieve it later on. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the FI_PEER capability bit to the SHM fi_infos. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the FI_PEER capability bit to the CXI provider fi_info. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
On cq_open, check FI_PEER_IMPORT; if set, set all internal cq operations to enosys, with the exception of the read callback. The read callback is overloaded to operate as a progress callback function: invoking it will progress the endpoints linked to this CQ. Keep track of the fid_peer_cq structure passed in. If the FI_PEER_IMPORT flag is set, then set the callbacks in the cxip_cq structure which handle writing to the peer_cq; otherwise set them to the ones which write to the util_cq. A provider needs to call a different set of functions to insert completion events into an imported CQ vs an internal CQ. This set of callback definitions standardizes a way to assign a different function to a CQ object, which can then be called to insert into the CQ. For example:

    struct prov_cq {
        struct util_cq *util_cq;
        struct fid_peer_cq *peer_cq;
        ofi_peer_cq_cb cq_cb;
    };

When a provider opens a CQ it can:

    if (attr->flags & FI_PEER_IMPORT)
        prov_cq->cq_cb.cq_comp = prov_peer_cq_comp;
    else
        prov_cq->cq_cb.cq_comp = prov_cq_comp;

Collect the peer CQ callbacks in one structure for use in CXI. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Restructure the code to allow for posting on the owner provider's shared receive queues. Do not do a reverse lookup on the AV table to get the fi_addr_t, instead register an address matching callback with the owner. The owner can then call the address matching callback to match an fi_addr_t to the source address in the message received. This is more efficient as the peer lookup can be an O(1) operation; AV[fi_addr_t]. The peer's CXI address can be compared with the CXI address in the message received. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Upstream has a different method of registering SRX. There is a limitation where the SRX is only returned back in the upcall in get_tag/get_msg, which prevents the parent provider from doing anything else with the peer callbacks. This presents a problem because we added a callback to register memory on the receive path. This patch updates the CXI provider accordingly. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add a memory registration callback to allow the parent provider, if one exists, to register receive buffers up front instead of waiting until the data arrives. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Use the following format to describe provider linking via LINKx: provider name: "<prov1>+<prov2>:linkx". Add infrastructure to support this format. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the initial implementation of the fi_link() API. In this patch it handles only one provider; its behavior will be exactly like the fi_fabric() API. If there are multiple providers in the passed-in list, then call ofi_create_link(), which ends up linking all providers in the list. Currently this operation is not supported. Add the initial LINKx implementation. Although LINKx is a provider, it will not be returned as part of the matching list in the fi_getinfo() call. It behaves as if it's part of the libfabric infrastructure, and therefore does not implement the <>_getinfo() function; if this function is called, it returns EOPNOTSUPP. Similarly, the fi_fabric() call will not be callable and will return EOPNOTSUPP. Instead, ofi_create_link() will initialize a virtual fabric and return that to the caller as part of the fi_link() API. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the ability to iterate over all provider endpoints passed into the fi_link() API and initialize the fabric for each. Store the fid_fabric in a local linked list. The core provider endpoint fid_fabric will be used whenever redirecting function calls to the core provider. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
When fi_domain() is called on the LINKx provider, go through all the core providers which are being linked and initialize their associated domains. Similarly, when closing the domain, go over all the core providers' domains and close them. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Initialize the endpoints for all the core providers and store them in the ep_fid in the local endpoint table. When closing an endpoint, close all the core provider endpoints. Lay the groundwork for the endpoint operations. Most of these operations will index into the LINKx peer table. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Initialize the LINKx peer table. The peer table uses the OFI address vector infrastructure. The peer table will contain an array of peer entries. The size of the array is determined by the fi_av_attr passed into the fi_av_open() API: fi_av_attr.count. Each peer entry will indicate whether the peer is local or remote. It will also have a list of provider endpoint information: struct local_prov_entry *eps. This structure contains all the FIDs which are required for communication. For on-node peers, it'll have only 1 entry, which would be the SHM provider endpoint. For remote peers it'll have one or more entries to the providers which can be used for remote communication. When the peer is indexed to during traffic operations, then the appropriate FID for the operation can be dereferenced from the local_prov_entry field and used in the call to the core provider. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Define the LINKx address format. A LINKx address is an encapsulation of multiple core provider endpoints. The address format allows specification of multiple providers, and under each provider multiple addresses. This will allow LINKx to return a bond of addresses which could in theory span multiple providers. When the application calls av_insert() create a peer entry (or update an existing one). The peer table is referenced by the fid_av. If the peer is local and we're managing the shm provider, then mark this peer as only reachable via shared memory, as shared memory will always be more efficient than all other methods of communication. If the peer is remote, then we can have multiple addresses provided for us. The addresses are divided by provider. Go through the address provider list and insert these addresses into the local endpoint for the same provider. For example if a node has 2 TCP interfaces, we'll be provided two TCP addresses, one for each interface. If the local node also has 2 TCP interfaces, they'll be presented as two local provider endpoints, each with its own Address Vector. In order to be able to reach the peer's interfaces from either of our local interfaces, we need to insert the remote addresses into both of the local Address Vector tables for both provider endpoints. When the application calls av_remove() remove the peer from the peer table and all associated core provider addresses from the core providers. When closing the AV, remove the peer table and clean up all the memory. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The LINKx getname() method collects all the addresses of the different providers it manages. LINKx can manage multiple providers, and each provider can manage one or more endpoints. It builds an lnx_addresses structure big enough to fit all the addresses. If the buffer provided is smaller than that, it sets the addresslen parameter to the needed size and returns FI_ETOOSMALL. This address defines all the different ways this process can be reached. For example, if a node has 4 NICs, the address can contain the addresses of all 4 NICs the process is listening on. The application should then allocate a large enough buffer and call back into the LINKx provider using fi_getname() to get the addresses. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
When an application requests an Address Vector to be bound to an Endpoint, LINKx looks up the peer_table referenced by the AV fid and adds a pointer to the peer_table in the LINKx endpoint structure. Since each core provider maintains its own address vector, call fi_ep_bind() on each of the core provider endpoints. Later on, when calling a traffic operation function, for example fi_tsend(), we can index into the peer table by the fi_addr_t passed into the function call. From there we can decide which local endpoint to use to send to that peer. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
For send/recv operation we currently make an assumption that there are only two addresses available, a shared memory address if the peer is local, or another address if the peer is remote. This will need to be updated to handle Multi-Rail where there could be multiple addresses we can reach the peer on. For recv operations, it currently doesn't support handling FI_ADDR_UNSPEC, as that will require shared receive queues. Shared receive queues are going to be needed for handling multi-rail, as you can receive messages on any of the rails. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
When FI_ENABLE is requested, ensure that the completion queues and the peer tables are created and ready. Enable all the core endpoints. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
On cq_open() LINKx creates its own CQ. This is the CQ which the application will query. It then exports this CQ for use by all the core provider endpoints, using the fid_peer_cq structure to share its own CQ. LINKx provides its write/writeerr callbacks for use by the core provider endpoints. These functions are thin wrappers around their ofi_util.h counterparts; the ofi_util.h cq functions already implement locking and writing onto the CQ provided. Handle binding the LINKx endpoint to the LINKx cq, and perform the CQ bind on all the core endpoints being managed by LINKx. Progress invocation kicks all the core providers to progress their work through the peer_cq progress callback assigned by the core provider endpoints. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add support for the fi_mr_regattr() API in the LINKx provider. Upon handling the fi_mr_regattr() call, LINKx will check the domain's mr_mode. If there are no flags set, or if FI_MR_ENDPOINT is set, then don't register the memory; otherwise call the core provider's memory registration. This adheres to the behavior outlined here: https://ofiwg.github.io/libfabric/v1.15.0/man/fi_mr.3.html LINKx needs to determine which domain to register the memory against. Since it handles multiple domains, the only way it can find out the domain is by being provided a destination or source address. It can then use the address to predetermine the local domain it'll use for the data operation and register the memory against it. Therefore, an fi_addr_t field has been added to the fi_mr_attr structure to allow LINKx to perform the above logic. If the address is not specified, then LINKx will need to register the memory with all the domains it has, because it doesn't know which domain will end up handling the data operation. TODO: I'm unsure if registering the same piece of memory with many domains is going to cause a problem. The fid_mr obtained from the fi_mr_regattr() call to the core provider, as well as the local endpoint which will be used for the pending data operation and the core provider peer address, will be stored in the LINKx memory descriptor, which will be returned in the fid_mr structure in the fi_mr_regattr() result. The application will then send this information to the LINKx provider in the data operation. The LINKx provider can then use it to determine the local endpoint, peer address and fid_mr to forward to the core provider. In the absence of the descriptor, LINKx will go through the standard procedure to determine the local endpoint and the peer address. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
When opening the AV for each core provider, fetch the correct attribute structure to pass to the fi_av_open() API. The one passed to the LINKx provider may contain the count of the entire set of processes expected for the lifetime of libfabric. In the case of OpenMPI this can be the size of the entire world, which is not necessarily what we would like to pass to the underlying provider. Some providers (e.g. SHM) have a limit on the size of the AV table they may create, and we don't want to exceed that. Other providers (e.g. CXI) use the count as the initial size of the AV table and then grow the AV table by that amount when there are no more entries available. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Modify the access to the fi_info cache to match against domain name or fabric name. When we are setting the fi_fabric() we can grab the fi_info which matches the fabric name. All domains which are part of the same fabric have the same fabric information. When we are setting the fi_domain(), we need to grab the fi_info which matches the domain name. Each domain represents a different interface, and we need to ensure that the domain requested by the application is honored. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
If the application didn't do any memory registration, the SHM provider will not be able to accurately select the protocol to use. The SHM provider relies on memory registration to determine the type of memory, device vs system. In contrast, other providers use the ofi_get_hmem_iface() API to determine the iface based on the buffer pointer. To work around that, if the application didn't register the memory for the SHM provider, then LINKx will find out the buffer type, device vs system, and do its own memory registration per send/recv API call. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Instead of each core provider having their own receive queues, which makes it impossible to order messages and prevent buffer overwrite, have LINKx provider export its receive queue for use by the core providers. When LINKx sets up the core provider's domains it provides the fi_peer_srx block to the core provider. This block contains a set of callbacks which the core provider can call to interface with LINKx queues. The core provider also initializes the peer portion of this block to point to a set of callbacks which LINKx can use to trigger the core provider to receive messages, or discard them. LINKx is responsible for maintaining the order of the queues and protecting them. Signed-off-by: Amir Shehata <shehataa@ornl.org>
Some providers expose hardware capabilities, such as HW tag matching. If LINKx uses shared receive queues, this capability will need to be turned off: with HW tag matching on, the provider has no chance to query the shared receive queue before writing into the buffer, which would cause buffer overwrite problems. Shared receive queues are needed in two cases: 1. handling receives from any source; 2. Multi-Rail (not implemented yet). However, it is recognized that some applications might prefer to use hardware tag matching for various reasons, including reducing CPU usage. To allow for that, this patch adds an environment variable, OFI_LINKX_SRQ_SUPPORT, to let the user turn off SRQ support. By default SRQ support is on unless the user explicitly turns it off. When it is turned off, receiving from FI_ADDR_UNSPEC will not be supported and will return -FI_ENOSYS. Similarly, any Multi-Rail capability added in future patches will also be turned off. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The memory registration flow produces worse performance at smaller message sizes. Avoid it if the buffer type is FI_HMEM_SYSTEM, as the shm provider will assume the buffer is host memory if no memory registration occurs. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Ensure the tag is being stored in the rx_entry to be used for later comparison. Add some extra trace which would come in handy when debugging. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Linked providers should have the same ep type. In this case, always use the SHM ep type as the basis for other providers you're linking against. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Instead of using global variables to describe which providers are linked, put the link at the fabric level. This way multiple fabrics can exist with multiple links; otherwise, multiple fabrics would trample over each other. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add support for memory registration, RMA and atomics. However, while working on this it has become clear that RMA support requires a LINKx redesign to properly support memory registration. The primary reason is that when memory is registered, the key is returned to the application. The application can exchange these keys with other processes, which can then use them for RMA. However, if you're registering with multiple core providers, which key do you give to the application? Some peer-like mechanism needs to be designed for this scenario. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Reworked the memory registration structures. There are still a few problems. We need to understand when exactly to do memory registration. Currently the OMPI MTL avoids doing any registration with CXI, but the BTL does. There isn't a good way to distinguish between the two paths, so more thought needs to be put into how we do memory registration at the LINKx level. For one-sided operations, memory registration works as follows: P1 registers a piece of memory and gets back a key. P2 does the same. P1 and P2 exchange these keys. When P1 wants to do an RMA from P2's memory, it passes P2's key to P2 in the RMA operation. P2 then uses the key to find the memory and complete the RMA operation. This presents a problem for LINKx, because LINKx abstracts multiple core providers and hands its own key back to the application. However, when the LINKx key is passed to the core provider, it makes no sense to it. What we need is a way for the core provider to ask LINKx to extract its key given the LINKx key. To do that we will need to implement a separate peer interface to enable this type of key lookup. For now, hack LINKx to always expect exactly two providers. For one-sided this is sufficient for running on Frontier, since we'll only ever link SHM and CXI. The above solution will need to be implemented for the generic case. Don't calloc() on the memory registration path; instead create a buffer pool and allocate out of that when doing memory registration. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
For-loops add a lot of latency; we need to find a better design. MR registration can happen on the fast path, and a lot of looping adds delays. Need to redesign this pattern. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Instead of using the freestack data structure for the receive requests, use a buffer pool and allocate off that. The maximum size of the buffer pool is UINT16_MAX. Make sure to clean up the recv buffer pools. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
In match common there were two nested loops:

    for (i = 0; i < LNX_MAX_LOCAL_EPS; i++) {
        lpp = peer->lp_provs[i];
        if (!lpp)
            continue;
        if (lpp->lpp_prov == lp) {
            for (j = 0; j < LNX_MAX_LOCAL_EPS; j++) {
                /* Match addr */
            }
        }
    }

With LNX_MAX_LOCAL_EPS = 16, this results in 16 + 2*16 loop iterations each time we traverse a single entry in the recv or unexpected queue, increasing latency. After removing the loops the latency is now comparable; however, this might not be the only issue. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Use a linked list for the provider endpoint list. This way we don't have a mandatory maximum we need to use while iterating over the endpoints in the fast path. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Make lpp_map a linked list instead of an array so we avoid having to iterate over all the entries. Most of the time there will be 1 map in the list. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Make lp_provs a linked list. With an array you have to iterate the entire array to find what you're looking for; a linked list better suits the type of search and access needed for the providers. Also keep a separate field for the shm provider for ease of access. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Iterate over all the linked provider CQs and kick each one to progress. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
lnx_match_common() is called to match a peer entry to a message or a receive request. This function needs to be as efficient as possible. However, there are competing requirements. To support Multi-Rail we need to be able to handle matching against multiple different potential addresses for a single peer. That adds looping which introduces latencies. This is an attempt to optimize for the most used scenario, where we link one provider with SHM. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
For local peers we only need to check the shared memory provider. Bypass any looping that we might otherwise do in the message matching path. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Introduce FI_LINKX_DISABLE_SHM to disable SHM. It defaults to 0, i.e. use SHM. If set to 1, all peers are treated as remote. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
On the receive path, if we know who we are receiving from and the provider which will be used provides a registration callback, make sure to call it. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
A change in libfabric core prints needs an update in LINKx. Made some other superficial cleanups. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The FI_LINKX_PROV_LINKS environment variable is used to specify which providers LINKx will link together. Format: <prov 1>+<prov 2>+...+<prov N>, e.g. shm+cxi. LINKx then generates the permutations of all the links that can be instantiated and returns them to the application. The application can then select the link it wants to use. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Upstream has a different method of registering SRX. There is a limitation where the SRX is only returned back in the upcall in get_tag/get_msg, which prevents the parent provider from doing anything else with the peer callbacks. This presents a problem because we added a callback to register memory on the receive path. This patch updates the LINKx code accordingly. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the LINKx provider man page. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
amirshehataornl force-pushed the 09_lnx_linkx_provider branch from f775ecb to d4985bb on May 16, 2024 13:40
amirshehataornl changed the title from "09 lnx linkx provider" to "09 of 09 LNX series - Introduce the first version of the LINKx provider" on May 16, 2024
amirshehataornl changed the title from "09 of 09 LNX series - Introduce the first version of the LINKx provider" to "10 of 10 LNX series - Introduce the first version of the LINKx provider" on May 16, 2024
the new LINKx provider.
A few notes:
Signed-off-by: Amir Shehata shehataa@ornl.gov