10 of 10 LNX series - Introduce the first version of the LINKx provider #10034

Open · wants to merge 54 commits into base: main

Conversation

@amirshehataornl (Contributor) commented May 9, 2024

This PR introduces the new LINKx provider.

A few notes:

  1. I pushed a series of PRs which are all in sequence, titled 01-09.
  2. The LINKx provider doesn't use all the utility functions that were added while development of the provider was ongoing. The primary reason at the moment is that LINKx doesn't break up the receive queues on a per-peer basis. I investigated that and didn't find any performance advantage.
  3. The goal here is to try and land this series without extensive alteration to the current code, aside from bug fixes. The idea is to keep the code as close as possible to what's currently in production on Frontier. We can advance the provider with more features and updates as we move forward.
  4. This version of the LINKx provider only supports RMA and tagged operations. The other operations are planned after the first version of the provider lands.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>

@amirshehataornl amirshehataornl force-pushed the 09_lnx_linkx_provider branch 5 times, most recently from b6242ba to e6333d5 on May 11, 2024 06:10
When checking fabric attributes with ofi_check_fabric_attr() make sure to
consider provider exclusion.

When checking to see if a provider name is given, only consider ones which
are not excluded using the '^' character.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
It is not efficient to do a reverse lookup on the AV table when a message
is received. Some providers do not store the fi_addr_t associated with the
peer in the header passed on the wire, and it is not practical to require
providers to add that to the wire header, as it would break backwards
compatibility.

In order to handle this case, an address matching callback is added to the
peer_srx.peer_ops structure. This allows the provider receiving the
message to register an address matching callback. This callback is called
by the owner provider to match an fi_addr_t with the provider-specific
address in the message received.

The callback allows the receiving provider to do an O(1) index into the AV
table to look up the address of the peer, and then compare that with the
source address in the received message.
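
As a rough sketch of what such a callback could look like (all names here
are illustrative, not the PR's actual definitions):

	/* Hypothetical matching callback: compare the known address at
	 * AV[addr] against the source address carried in the message. */
	static int xxx_srx_addr_match(struct fid_peer_srx *srx, fi_addr_t addr,
				      void *msg_src_addr)
	{
		struct xxx_av *av = xxx_srx_to_av(srx);		/* assumed helper */
		struct xxx_addr *known = &av->table[addr];	/* O(1) index */

		return !memcmp(known, msg_src_addr, sizeof(*known));
	}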

As part of this change, provider-specific address information needs to be
passed to the owner provider, which the owner will give back to the
receiving provider when it attempts to do address matching.

Update the SHM and LINKx providers to conform to the API changes.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add a new structure fi_peer_match to collect the parameters which need
to be passed to the get_msg and get_tag functions.
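
As a rough illustration of the idea, the structure would collect something
like the following (field names are guesses based on this description, not
the actual definition):

	struct fi_peer_match {
		fi_addr_t addr;		/* source address to match on */
		size_t size;		/* size of the received message */
		uint64_t tag;		/* tag, for the get_tag path */
	};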

Update the util_get_tag() and util_get_msg() function callbacks. Without
this change, the mismatched callback signatures produce only a compiler
warning rather than a failure, and cause memory corruption when the
callbacks are called.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
amirshehataornl and others added 21 commits May 16, 2024 09:37
Add a memory registration callback to the fi_ops_srx_peer. This allows
core providers to expose a memory registration callback which the parent
or peer provider can use to register memory on the receive path.

For example the CXI provider registers memory with the NIC on the receive
path. When using the peer infrastructure this cannot happen, because we
do not know which provider will perform the receive operation. But if
the source NID is specified then we do know, and we can therefore
perform the receive buffer registration at the top of the receive path.
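
A sketch of how the owner side might use such a callback on the receive
path; the mr_reg member and its signature are illustrative assumptions:

	/* Hypothetical: when the source is known up front, let the core
	 * provider register the receive buffer before the data arrives. */
	if (src_addr != FI_ADDR_UNSPEC && srx->peer_ops->mr_reg)
		ret = srx->peer_ops->mr_reg(srx, iov, iov_count, &desc);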

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add FI_PEER capability bit

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The parent provider should be able to get access to the peer provider
callbacks. Added the srx block in the fid.context so we can retrieve it
later on.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the FI_PEER capability bit to the SHM fi_infos

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the FI_PEER capability bit to the CXI provider fi_info

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
On cq_open, check the FI_PEER_IMPORT flag; if it is set, set all internal
CQ operations to return -FI_ENOSYS, with the exception of the read
callback.

The read callback is overloaded to operate as a progress callback
function. Invoking the read callback will progress the endpoints linked to
this CQ.

Keep track of the fid_peer_cq structure passed in.

If the FI_PEER_IMPORT flag is set, then set the callbacks in the cxip_cq
structure to the ones which handle writing to the peer_cq, otherwise set
them to the ones which write to the util_cq.

A provider needs to call a different set of functions to insert
completion events into an imported CQ vs an internal CQ.

This set of callback definitions standardizes a way to assign a different
function to a CQ object, which can then be called to insert into the CQ.

For example:

	struct prov_cq {
		struct util_cq *util_cq;	/* internal CQ */
		struct fid_peer_cq *peer_cq;	/* imported peer CQ */
		ofi_peer_cq_cb cq_cb;		/* insertion callbacks */
	};

When a provider opens a CQ it can:

	if (attr->flags & FI_PEER_IMPORT) {
		/* insert completions into the imported peer CQ */
		prov_cq->cq_cb.cq_comp = prov_peer_cq_comp;
	} else {
		/* insert completions into the provider's own CQ */
		prov_cq->cq_cb.cq_comp = prov_cq_comp;
	}

Collect the peer CQ callbacks in one structure for use in CXI.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Restructure the code to allow for posting on the owner provider's shared
receive queues.

Do not do a reverse lookup on the AV table to get the fi_addr_t, instead
register an address matching callback with the owner. The owner can then
call the address matching callback to match an fi_addr_t to the source
address in the message received.

This is more efficient as the peer lookup can be an O(1) operation;
AV[fi_addr_t]. The peer's CXI address can be compared with the CXI address
in the message received.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Upstream has a different method of registering SRX. There is a limitation
where the SRX is only returned back in the upcall in get_tag/get_msg. But
that prevents the parent provider from doing anything else with the
peer callbacks. This presents a problem because we added a callback to
register memory on the receive path.

This patch updates the CXI provider accordingly.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add a memory registration callback to allow the parent provider, if one
exists, to register receive buffers instead of waiting until the data
arrives before the receive buffers can be registered.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Use the following format to describe provider linking via LINKx:
provider name: "<prov1>+<prov2>:linkx". Add infrastructure to support
this format.
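
For illustration, the string can be split with standard C string handling
(a standalone sketch, not the PR's actual parser):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		char *spec = strdup("shm+cxi:linkx");
		char *provs = strtok(spec, ":");	/* "shm+cxi" */

		/* walk the '+'-separated provider list */
		for (char *p = strtok(provs, "+"); p; p = strtok(NULL, "+"))
			printf("link member: %s\n", p);

		free(spec);
		return 0;
	}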

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add an initial implementation of the fi_link() API.
In this patch it handles only one provider. Its behavior
will be exactly like the fi_fabric() API.

If there are multiple providers in the passed-in list, then
call ofi_create_link(), which ends up linking all providers
in the list. Currently this operation is not supported.

Add the initial LINKx implementation.

Although LINKx is a provider, it will not be returned as
part of the matching list in the fi_getinfo() call. It will
behave like it's part of the libfabric infrastructure. Therefore, it will
not implement the <>_getinfo() function. If this function is called, it
will return EOPNOTSUPP.

Similarly, fi_fabric() will not be callable; it will return EOPNOTSUPP.

Instead, ofi_create_link() will initialize a virtual fabric
and return that to the caller as part of the fi_link() API.
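
Since the behavior mirrors fi_fabric(), the call shape is presumably along
these lines (a sketch; the exact fi_link() signature is defined by this
series):

	/* Hypothetical call shape, mirroring
	 * fi_fabric(attr, &fabric, context). */
	struct fid_fabric *fabric;
	int ret;

	ret = fi_link(attr, &fabric, NULL);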

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the ability to iterate over all provider endpoints passed into
the fi_link() API and initialize the fabric for each. Store the
fid_fabric in a local linked list.

The core provider endpoint fid_fabric will be used whenever redirecting
function calls to the core provider.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
When fi_domain() is called on the LINKx provider, go through
all the core providers which are being linked and initialize their
associated domains.

Similarly, when closing the domain, go over all the core providers'
domains and close them.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Initialize the endpoints for all the core providers and store
them in the ep_fid in the local endpoint table.

When closing an endpoint, close all the core providers endpoints.

Lay the groundwork for the endpoint operations. Most of these
operations will index into the LINKx peer table.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Initialize the LINKx peer table. The peer table uses the OFI
address vector infrastructure.

The peer table will contain an array of peer entries. The size
of the array is determined by the fi_av_attr passed into the fi_av_open()
API: fi_av_attr.count.

Each peer entry will indicate whether the peer is local or remote. It will
also have a list of provider endpoint information:
struct local_prov_entry *eps. This structure contains all the FIDs which are
required for communication.

For on-node peers, it'll have only one entry, which would be the SHM
provider endpoint. For remote peers it'll have one or more entries for the
providers which can be used for remote communication.

When the peer is indexed during traffic operations, the appropriate
FID for the operation can be dereferenced from the local_prov_entry
field and used in the call to the core provider.
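
Putting the above together, a peer entry can be pictured roughly like this
(an illustrative layout, not the PR's exact struct):

	struct lnx_peer_entry {
		bool local;			/* on-node vs. remote peer */
		struct local_prov_entry *eps;	/* provider endpoint info,
						 * holding the FIDs needed
						 * for communication */
	};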

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Define the LINKx address format.

A LINKx address is an encapsulation of multiple core provider endpoints.
The address format allows specification of multiple providers, and under
each provider multiple addresses.

This will allow LINKx to return a bond of addresses which could in theory
span multiple providers.

When the application calls av_insert(), create a peer entry (or update an
existing one). The peer table is referenced by the fid_av.

If the peer is local and we're managing the shm provider, then mark this
peer as only reachable via shared memory, as shared memory will always be
more efficient than all other methods of communication.

If the peer is remote, then we can have multiple addresses provided for
us. The addresses are divided by provider. Go through the address provider
list and insert these addresses into the local endpoint for the same
provider. For example if a node has 2 TCP interfaces, we'll be provided
two TCP addresses, one for each interface. If the local node also has 2
TCP interfaces, they'll be presented as two local provider endpoints, each with
its own Address Vector. In order to be able to reach the peer's interfaces
from either of our local interfaces, we need to insert the remote
addresses into both of the local Address Vector tables for both provider
endpoints.

When the application calls av_remove(), remove the peer from the peer table
and all associated core provider addresses from the core providers.

When closing the AV, remove the peer table and clean up all the memory.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The LINKx getname() method collects all the addresses of the different
providers it manages. LINKx can manage multiple providers and each
provider can manage one or more endpoints.

It builds an lnx_addresses structure big enough to fit all the addresses.
If the buffer provided is smaller than that, it sets the addrlen
parameter to the needed size and returns -FI_ETOOSMALL.

This address defines all the different ways this process can be reached.
For example if a node has 4 NICs, the address can contain the address for
all 4 NICs the process is listening on.

The application should then allocate a large enough buffer and call back
into the LINKx provider using fi_getname() to get the addresses.
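
This follows the usual libfabric two-call pattern, for example:

	size_t addrlen = 0;
	void *addr = NULL;
	int ret;

	/* first call is expected to fail with -FI_ETOOSMALL and set addrlen */
	if (fi_getname(&ep->fid, NULL, &addrlen) == -FI_ETOOSMALL) {
		addr = malloc(addrlen);
		if (addr)
			ret = fi_getname(&ep->fid, addr, &addrlen);
	}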

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
When an application requests an Address Vector to be bound to an
Endpoint, LINKx looks up the peer_table referenced by the AV fid and adds
a pointer to the peer_table in the LINKx endpoint structure.

Since each core provider maintains its own address vector, call
fi_ep_bind() on each of the core provider endpoints.

Later on, when calling a traffic operation function, for example
fi_tsend(), we can index into the peer table by the fi_addr_t passed into
the function call. From there we can decide which local endpoint
to use to send to that peer.
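
A sketch of that send path, with illustrative struct and helper names:

	static ssize_t lnx_tsend(struct fid_ep *ep, const void *buf, size_t len,
				 void *desc, fi_addr_t dest_addr, uint64_t tag,
				 void *context)
	{
		struct lnx_ep *lep = container_of(ep, struct lnx_ep, ep);
		/* O(1) index into the peer table bound via fi_ep_bind() */
		struct lnx_peer *peer = &lep->peer_table[dest_addr];
		/* assumed helper: SHM for local peers, otherwise a remote ep */
		struct local_prov_entry *lpe = lnx_select_ep(peer);

		return fi_tsend(lpe->ep, buf, len, desc, lpe->peer_addr, tag,
				context);
	}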

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
For send/recv operations we currently assume that there are only two
addresses available: a shared memory address if the peer is local, or
another address if the peer is remote.

This will need to be updated to handle Multi-Rail where there could be
multiple addresses we can reach the peer on.

Recv operations currently don't support FI_ADDR_UNSPEC,
as that will require shared receive queues.

Shared receive queues are going to be needed for handling multi-rail, as
you can receive messages on any of the rails.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
When FI_ENABLE is requested, ensure that the completion queues and the
peer tables are created and ready.

Enable all the core endpoints.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
On cq_open() LINKx will create its own CQ. This is the CQ which the
application will query. It will then export this CQ for use by all the
core provider endpoints. It uses the fid_peer_cq structure to share its
own CQ.

LINKx will provide its write/writeerr callbacks for use by the core
provider endpoints. These functions are thin wrappers around their
ofi_util.h counterparts. The ofi_util.h CQ functions already implement
locking and writing onto the CQ provided.
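
A minimal sketch of such a wrapper, assuming the ofi_util.h helpers (the
exact names in this series may differ):

	static ssize_t lnx_peer_cq_write(struct fid_peer_cq *cq, void *context,
					 uint64_t flags, size_t len, void *buf,
					 uint64_t data, uint64_t tag,
					 fi_addr_t src)
	{
		struct lnx_cq *lnx_cq = container_of(cq, struct lnx_cq, peer_cq);

		/* ofi_cq_write() already implements locking and insertion */
		return ofi_cq_write(lnx_cq->util_cq, context, flags, len, buf,
				    data, tag);
	}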

Handle binding the LINKx endpoint to the LINKx cq, and perform the CQ bind
on all the core endpoints being managed by LINKx.

Progress invocation will kick all the core providers to progress their
work through the peer_cq progress callback assigned by the core
provider endpoints.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add support for the fi_mr_regattr() API in the LINKx provider.

Upon handling an fi_mr_regattr() call, LINKx will check the domain's
mr_mode. If there are no flags set or if FI_MR_ENDPOINT is set, then don't
register the memory. Otherwise call the core's memory registration. This
is to adhere to the behavior outlined here:

https://ofiwg.github.io/libfabric/v1.15.0/man/fi_mr.3.html

LINKx needs to determine which domain to register the memory against.
Since it handles multiple domains, the only way it can find out the
domain is by being provided a destination or source address. It can then
use the address to predetermine the local domain it'll use for the data
operation and register the memory against it.

Therefore, an fi_addr_t field has been added to the fi_mr_attr structure to
allow LINKx to perform the above logic.
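
For illustration, a registration with a known destination might look like
this ('addr' is a placeholder for the new field's name):

	struct fid_mr *mr;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct fi_mr_attr attr = {
		.mr_iov = &iov,
		.iov_count = 1,
		.access = FI_REMOTE_READ | FI_REMOTE_WRITE,
	};
	int ret;

	attr.addr = dest_addr;	/* hypothetical: lets LINKx pick the domain */
	ret = fi_mr_regattr(domain, &attr, 0, &mr);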

If the address is not specified, then LINKx will need to register the
memory with all the domains it has, because it doesn't know which domain
will end up handling the data operation.

TODO: I'm unsure if registering the same piece of memory with many domains
is going to cause a problem.

The fid_mr obtained from the fi_mr_regattr() call to the core provider,
as well as the local endpoint which will be used for the pending data
operation and the core provider peer address, will be stored in the LINKx
memory descriptor, which will be returned in the fid_mr structure in the
fi_mr_regattr() result.

The application will then send this information to the LINKx provider in
the data operation. The LINKx provider can then use this information to
determine the local endpoint, peer address and fid_mr to forward to the
core provider.

In the absence of the descriptor, LINKx will go through the standard
procedure to determine the local endpoint and the peer address.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
When opening the AV for each core provider, fetch the correct attribute
structure to pass to the fi_av_open() API. The one passed to the LINKx
provider may contain the count of the entire set of processes which are
expected for the lifetime of libfabric. In case of OpenMPI this can be the
size of the entire world, which is not necessarily what we would like to
pass to the underlying provider. Some providers (e.g. SHM) have a limit on
the size of the AV table they may create, and we don't want to exceed
that. Other providers (e.g. CXI) use the count as the initial size of the AV
table and then increase the AV table size by that amount when there are no
more entries available in the AV table.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Modify the access to the fi_info cache to match against domain name or
fabric name.

When fi_fabric() is called, we can grab the fi_info which matches
the fabric name. All domains which are part of the same fabric have the
same fabric information.

When fi_domain() is called, we need to grab the fi_info which
matches the domain name. Each domain represents a different interface, and
we need to ensure that the domain requested by the application is honored.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
If the application didn't do any memory registration, the SHM provider
will not be able to accurately select the protocol to use. The SHM
provider relies on the memory registration to determine the type of
memory, device vs. system. In contrast, other providers use the
ofi_get_hmem_iface() API to determine the iface based on the buffer
pointer.

In order to work around that, if the application didn't register the
memory for the SHM provider, then LINKx will find out the buffer type,
device vs. system, and do its own memory registration per send/recv API
call.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Instead of each core provider having its own receive queues, which makes
it impossible to order messages and prevent buffer overwrites, have the
LINKx provider export its receive queue for use by the core providers.

When LINKx sets up the core provider's domains it provides the fi_peer_srx
block to the core provider. This block contains a set of callbacks which
the core provider can call to interface with LINKx queues. The core
provider also initializes the peer portion of this block to point to a set
of callbacks which LINKx can use to trigger the core provider to receive
messages, or discard them.

LINKx is responsible for maintaining the order of the queues and
protecting them.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Some providers expose hardware capabilities, such as hardware tag
matching. If LINKx uses shared receive queues, this capability will need
to be turned off. If hardware tag matching is on, there will be no chance
for the provider to query the shared receive queue before writing into the
buffer. This would cause buffer overwrite problems.

Shared receive queues are needed in two cases:
1. Handling receives from any source
2. Multi-Rail (not implemented yet)

However, it is recognized that some applications might prefer to use
hardware tag matching for various reasons, including reducing CPU
usage.

To allow for that, this patch adds an environment variable,
OFI_LINKX_SRQ_SUPPORT, which lets the user turn off SRQ support. By
default SRQ support is on unless the user explicitly turns it off.

When it is turned off, receiving from FI_ADDR_UNSPEC will not be supported
and will return -FI_ENOSYS. Similarly, any Multi-Rail capability
added in future patches will also be turned off.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The memory registration flow produces worse performance at smaller message
sizes. Avoid it if the buffer type is FI_HMEM_SYSTEM, as the SHM provider
will assume the buffer is host memory if no memory registration occurs.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Ensure the tag is being stored in the rx_entry to be used for later
comparison.

Add some extra tracing which will come in handy when debugging.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Linked providers should have the same ep type. In this case, always use
the SHM ep type as the basis for other providers you're linking against.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Instead of using global variables to describe which providers are linked,
put the link at the fabric level. This way multiple fabrics can exist with
multiple links. Otherwise, if multiple fabrics are created they will
trample over each other.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add support for memory registration, RMA and atomics.
However, while working on this it has become clear that RMA support
requires a LINKx redesign to properly support memory registration. The
primary reason for this is that when memory is registered, a key is
returned to the application. The application can exchange these keys with
other processes, which can then use them for RMA. However, if you're
registering with multiple core providers, which key do you give to the
application?

Some peer-like mechanism needs to be designed for this scenario.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Reworked the memory registration structures. There are still a few
problems. We need to understand when exactly to do memory registration.
Currently the OMPI MTL avoids doing any registration with CXI, but the BTL
does register. There isn't a good way to figure out the difference between
the two paths. Some more thought needs to be put into how we do memory
registration at the LINKx level.

For one-sided:

The way memory registration works is P1 would register a piece of memory
and get back a key. P2 would do the same. P1 and P2 would exchange these
keys. When P1 wants to do an RMA from P2's memory, it passes P2's key to
P2 in the RMA operation. P2 would then use the key to find the memory and
complete the RMA operation.

This represents a problem with LINKx, because LINKx essentially abstracts
multiple core providers, and provides its own key back to the application.
However, when the LINKx key is passed to the core provider, it makes no
sense to it.

What we need is a way for the core provider to ask LINKx to extract its
key given the LINKx key. To do that we will need to implement a separate
peer interface to enable this type of key lookup.

For now, hack LINKx to always expect exactly two providers. For one-sided
operations this would be sufficient for running on Frontier, since we'll
only ever link SHM and CXI.

However, the above solution will need to be implemented for the generic
case.

Don't calloc() on the memory registration path. Instead, create a buffer
pool and allocate from it when doing memory registration.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
For loops add a lot of latency; we need to find a better design. MR
registration can happen on the fast path, and a lot of looping
adds delays. This pattern needs to be redesigned.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Instead of using the freestack data structure for the receive requests, use
a buffer pool and allocate from it. The maximum size of the buffer pool is
UINT16_MAX.
Make sure to clean up the receive buffer pools.
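
A sketch using the libfabric util buffer pool (lnx_rx_req is a placeholder
name; the alignment and chunk size here are arbitrary):

	struct ofi_bufpool *rx_pool;
	int ret;

	/* cap the pool at UINT16_MAX entries, growing 64 at a time */
	ret = ofi_bufpool_create(&rx_pool, sizeof(struct lnx_rx_req),
				 16, UINT16_MAX, 64, 0);
	if (!ret) {
		struct lnx_rx_req *req = ofi_buf_alloc(rx_pool);

		/* ... use req for the receive request ... */
		ofi_buf_free(req);
		ofi_bufpool_destroy(rx_pool);	/* clean up on close */
	}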

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
In lnx_match_common() there were two nested loops:

for (i = 0; i < LNX_MAX_LOCAL_EPS; i++) {
	lpp = peer->lp_provs[i];	/* scan every provider slot */
	if (!lpp)
		continue;
	if (lpp->lpp_prov == lp) {
		/* scan every endpoint slot for an address match */
		for (j = 0; j < LNX_MAX_LOCAL_EPS; j++) {
			/* match addr */
		}
	}
}

With LNX_MAX_LOCAL_EPS = 16, this would result in 16 + 2*16 iterations
each time we traverse a single entry in the recv or unexpected queue,
resulting in increased latency.

After removing the loops the latency is now comparable; however, this
might not be the only issue.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Use a linked list for the provider endpoint list. This way we don't have a
mandatory maximum we need to use while iterating over the endpoints in the
fast path.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Make lpp_map a linked list instead of an array so we avoid having to
iterate over all the entries. Most of the time there will be one map in
the list.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Make lp_provs a linked list. An array means you have to iterate the entire
array to find what you're looking for. A linked list is better for the
type of search and access needed for the providers.

Also keep a separate field for the shared memory provider for ease of access.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Iterate over all the linked provider CQs and kick each one to progress.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
lnx_match_common() is called to match a peer entry to a message or a
receive request. This function needs to be as efficient as possible.
However, there are competing requirements. To support Multi-Rail we need
to be able to handle matching against multiple different potential
addresses for a single peer. That adds looping, which introduces latency.
This is an attempt to optimize for the most common scenario, where we link
one provider with SHM.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
For local peers we only need to check the shared memory provider. Bypass
any looping that we might otherwise do in the message matching path.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Introduce FI_LINKX_DISABLE_SHM to disable SHM. It defaults to 0, i.e. use
SHM. If set to 1, all peers are treated as remote.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
On the receive path, if we know who we are receiving from and the
provider which will be used provides a registration callback, make sure
to call it.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
A change in the libfabric core print functions requires an update in
LINKx. Made some other superficial cleanups.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The FI_LINKX_PROV_LINKS environment variable is used to specify which
providers LINKx will link together. The format is
<prov 1>+<prov 2>+...+<prov N>, e.g. shm+cxi.
LINKx then generates the permutations of all the links that can be
instantiated and returns them to the application. The application can then
select the link it wants to use.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Upstream has a different method of registering SRX. There is a limitation
where the SRX is only returned back in the upcall in get_tag/get_msg. But
that prevents the parent provider from doing anything else with the
peer callbacks. This presents a problem because we added a callback to
register memory on the receive path.

This patch updates the LINKx code accordingly.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Add the LINKx provider man page.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
@amirshehataornl amirshehataornl changed the title 09 lnx linkx provider 09 of 09 LNX series - Introduce the first version of the LINKx provider May 16, 2024
@amirshehataornl amirshehataornl changed the title 09 of 09 LNX series - Introduce the first version of the LINKx provider 10 of 10 LNX series - Introduce the first version of the LINKx provider May 16, 2024