Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libfabric version2: Initial set of proposed changes #9384

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

shefty
Copy link
Member

@shefty shefty commented Sep 29, 2023

For discussion and review only. CANNOT BE MERGED!!!

@shefty
Copy link
Member Author

shefty commented Oct 4, 2023

Update: adjusting patches to start working through CI issues (many of which are related to building on other platforms and missing updates to the CI scripts)

@shefty
Copy link
Member Author

shefty commented Oct 5, 2023

@zachdworkin - earlier Intel CI failed on socket provider testing, which makes sense, since I removed that provider. Is there some way I can disable sockets from being run in the contrib/intel scripts, or would this require updates to the CI system? (I'm NOT asking to disable anything in the CI system now -- that wouldn't be until we're closer to accepting these sort of changes).

@zachdworkin
Copy link
Contributor

@zachdworkin - earlier Intel CI failed on socket provider testing, which makes sense, since I removed that provider. Is there some way I can disable sockets from being run in the contrib/intel scripts, or would this require updates to the CI system? (I'm NOT asking to disable anything in the CI system now -- that wouldn't be until we're closer to accepting these sort of changes).

The only way I know that you can do this is with the replay function and either comment out that stage or just delete it from your replay. Otherwise I will need to remove it from the pipeline with a different PR

@shefty shefty force-pushed the version2 branch 2 times, most recently from 8a2af9d to 9c4caa2 Compare October 5, 2023 20:53
@shefty
Copy link
Member Author

shefty commented Oct 5, 2023

mpich references FI_NAME_MAX, so removing that from the public header breaks that build. We either need to add it back in and document what it actually means, leave it purely for compatibility, or update mpich to use an internal definition.

@shefty shefty force-pushed the version2 branch 2 times, most recently from 5dabf05 to 412a65a Compare October 5, 2023 22:48
Provider only supported by 1.x series

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Provider only supported by v1.x series

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
GNI is only supported by the v1.x series

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
NetworkDirect support is supported by verbs provider.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
@shefty shefty force-pushed the version2 branch 6 times, most recently from ffd3a52 to dc7faf6 Compare October 11, 2023 21:58
Flag is only used between rxm and verbs.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
The value is constrained to 32-bit int flags, not u64 flags.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Remove support for asynchronous insertions and AV_MAP.  The
format of the fi_addr_t value will either be indexed based in
the standard case or provider defined in more advanced use cases,
based on the AV configuration (such as using auth_keys).

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Remove FI_BUFFERED_RECV as an exported API option.  Since it's
currently used internally between mrail and rxm, make it an
internal only option.  It has a limited use case for multirail
over rxm over connected endpoints where shared receive queues are
not available.  With shared receive queues, the feature wouldn't
be needed, as mrail could own the buffers outright.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Recommend that applications and providers use FI_THREAD_COMPLETION
as the preferred threading model for lockless operation when using
scalable endpoints.  This helps align application design with the
provider implementation.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Remove overly complicated threading models and focus on specific
models to allow better alignment between application designs and
provider implementation.

Use FI_THREAD_DOMAIN as the preferred lockless threading model
for standard endpoints.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Combine data and control progress into one progress option.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Completions are always unordered.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Field was deprecated and only serves as a placeholder for
compatility.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Support for wait sets adds significant complexity to the provider
implementation and is basically an abstraction around the OS
constructs for poll/epoll (on Linux).  Remove the feature from the
API, but keep the internal implementation for now.  This allows
providers to move away from wait set support.

Note that blocking support and support for native wait objects
(e.g. epoll fd's) are still supported by the API.  Only the wait
set abstraction is removed, which allows providers control over
creating wait objects.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Poll sets are a simple iterator around progressing multiple objects.
Remove from the API to reduce provider complexity.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
There's never been an implementation.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Remove FI_MR_BASIC/SCALABLE/UNSPEC.  These were deprecated in
version 1.5.  Remove FI_LOCAL_MR, which was an earlier version of
FI_MR_LOCAL and also deprecated.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Feature is not implemented natively by providers.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Remove FI_ORDER_NONE (flag that's 0) and FI_ORDER_STRICT (which isn't a
flag, and covers only a portion of the valid flags).

Remove FI_ORDER_DATA, which will not be supported by version 2 in
order to allow for greater provider optimization.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Remove the option of directing transmit and receive completions
to separate CQs for the same endpoint.  The option adds complexity
at the provider and application levels.  This is largely the
result of needing SW based protocols for certain operations, such
as tag matching.  This either makes it necessary for the app to
drive progress across multiple CQs, or the provider emulates the
application's CQs in SW.

This change updates the man page only.  Provider developers are
left to update their code bases separately.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Disallow users hand-crafting or hand-copying their own fi_info structs.
This allows the library to allocate hidden fields for internal use.
Plus, there's no need to apps to do this, given that the API call is
way easier to use.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Add a new call that takes fi_info as input, which provides
consistency with the domain and endpoint open calls.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Subsystem filtering isn't useful (or likely ever used).  Remove
the enum fi_log_subsys.  Instead convert the API to accept int
flags.  This maintains API compatibility.  Update all current
FI_LOG_xxx subsys values to 0.  This avoids needing to update
the providers to the new API, forcing them to pass in 0 for
the flags.

No actual flag values are defined.  Those become a placeholder
for future options.

The logging checks are simplified by this change.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Introduce the concept of peer groups.  A peer group is a set of
peers that are communicating together for some specific set of tasks.
Peer groups provide a lower-level mapping of HPC and AI communicators.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Allow specifying precise tag formatting options.  The mem_tag_format
takes as input a set of bit fields.  In practice, this ends up being
unusable to implement, resulting in the entire tag simply being
masked with ignore bits.

When the mem_tag_format value only has the lower bits set (< 256),
interpret the format as specific options.  Two new options are
defined, one aligned with MPI and the other with CCLs.  This
information can be used by providers to optimize for the separate
use cases.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
@@ -222,7 +222,7 @@ fi_sendmsg.

*FI_CLAIM*
: Applies to posted receive operations for endpoints configured
for FI_BUFFERED_RECV or FI_VARIABLE_MSG. This flag is used to
for FI_BUFFERED_RECV. This flag is used to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the FI_VARIABLE_MSG flag, the FI_BUFFERED_RECV flag was also removed in this pull request. Therefore, shouldn't the FI_CLAIM and FI_DISCARD flags also be removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FI_CLAIM and FI_DISCARD can also be used with FI_PEEK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mblocksome and @j-xiong are both correct. However, I agree with Mike. This change is to the fi_msg man page, and FI_PEEK does not apply there. FI_CLAIM and FI_DISCARD can be removed as flags for the fi_msg calls. They remain for fi_tagged.

@@ -234,7 +234,7 @@ fi_sendmsg.

*FI_DISCARD*
: Applies to posted receive operations for endpoints configured
for FI_BUFFERED_RECV or FI_VARIABLE_MSG. This flag is used to
for FI_BUFFERED_RECV. This flag is used to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the FI_VARIABLE_MSG flag, the FI_BUFFERED_RECV flag was also removed in this pull request. Therefore, shouldn't the FI_CLAIM and FI_DISCARD flags also be removed?

@@ -319,11 +319,6 @@ The following flags may be used with fi_trecvmsg.
fi_context structure used for an FI_PEEK + FI_CLAIM operation must be used

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove text about fi_recv_context since buffered receives have been removed?

enum {
FI_TAG_BITS,
FI_TAG_HPC,
FI_TAG_AI,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above should revert back to FI_TAG_MPI and FI_TAG_NCCL, respectively, as the tag formats specifically target the tag definitions for that middleware.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FI_TAG_CCL to be more implementation neutral.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants