P2642, P2630, P2897: Express compile-time extent with nonzero offset #443

mhoemmen · 2024-02-06T00:35:42Z

Suppose that I want to vectorize a 1-D array copy operation using mdspan and aligned_accessor (P2897). I have a copy_8_floats function that expresses a hardware instruction for copying a contiguous array of 8 floats, where the start of the array is aligned to 8 * sizeof(float) (32) bytes.

template<class ElementType, size_t num_elements, size_t byte_alignment>
using aligned_array_view = mdspan<ElementType,
  extents<int, num_elements>, layout_right,
  aligned_accessor<ElementType, byte_alignment>>;

void
copy_8_floats(aligned_array_view<float const, 8, 32> source,
  aligned_array_view<float, 8, 32> destination)
{
  // hardware instruction for aligned copy goes here
}

The natural next step would be to use copy_8_floats to implement copying 1-D float arrays by the usual "strip-mining" approach.

template<class ElementType>
using array_view = mdspan<ElementType,
  extents<int, dynamic_extent>, layout_right,
  default_accessor<ElementType>>;

template<size_t Extent>
void copy(array_view<float const> source, array_view<float> destination)
{
  constexpr int vector_length = 8;

  // Handle possibly unaligned prefix of less than vector_length elements. 
  auto aligned_starting_index = [](auto* ptr) {
    constexpr auto v = static_cast<unsigned>(vector_length);
    auto ptr_value = std::bit_cast<uintptr_t>(ptr_value);
    auto remainder = ptr_value % v;
    return static_cast<int>(ptr_value + (v - remainder) % v);
  };
  int src_upper = aligned_starting_index(source.data());
  int dst_upper = aligned_starting_index(destination.data());
  int src_lower = 0;
  int dst_lower = 0;
  for (; src_lower < src_upper && dst_lower < dst_upper; ++src_lower, ++dst_lower) {
    destination[dst_lower] = source[src_lower];
  }

  // Strip-mine the aligned vector_length segments of the array.
  const int src_upper = (source.size() / vector_length) * vector_length;
  const int dst_upper = (destination.size() / vector_length) * vector_length;
  for (; src_lower < src_upper && dst_lower < dst_upper;
    src_lower += vector_length, dst_lower += vector_length)
  {
    constexpr auto one = std::integral_constant<int, 1>{};
    constexpr auto vec_len = std::integral_constant<int, vector_length>{};

    // Using strided_slice ensures that the extent is fully known at compile time.
    constexpr auto vector_slice = strided_slice{.offset=dst_lower, .extent=vector_length, .stride=one};
    // PROBLEM: Current wording from P2630 and P2642 makes this always layout_stride.
    aligned_array_view source_slice = submdspan(source, vector_slice);
    aligned_array_view destination_slice = submdspan(destination, vector_slice);

    copy_8_floats(source_slice, destination_slice);
  }

  // Handle suffix of less than vector_length elements.
  for (; src_lower < source.size() && dst_lower < destination.size(); ++src_lower, ++dst_lower) {
    destination[dst_lower] = source[src_lower];
  }
}

One must use strided_slice so that the resulting extent is fully known at compile time. Using index-pair-like (e.g., tuple) would result in a run-time extent, because both elements of the pair would have run-time components.

There are two problems with the above code.

Current wording from P2630 and P2642 makes the above submdspan results always have layout_stride.
Accessor .offset only takes a size_t offset. Thus, it discards compile-time information, like knowing that the offset is divisible by some overalignment factor.

I suggest the following strategy to fix these issues.

We can fix (1) in P2642 for layout_left_padded and layout_right_padded.
We can fix (1) for layout_left and layout_right with a separate paper, because C++26 is still a working draft.
Fixing (2) is harder and would complicate the accessor policy, so we might want to consider not doing that.

Regarding fixing (2), ([mdspan.submdspan.submdspan] paragraph 6](https://eel.is/c++draft/views.multidim#mdspan.submdspan.submdspan-6) says that submdspan(src, slices...) has effects equivalent to the following.

auto sub_map_offset = submdspan_mapping(src.mapping(), slices...);
return mdspan(src.accessor().offset(src.data(), sub_map_offset.offset),
              sub_map_offset.mapping,
              AccessorPolicy::offset_policy(src.accessor()));

The problem is AccessorPolicy::offset_policy(src.accessor()). The offset_policy is just a type, and offset returns offset_policy::data_handle_type. If we want an offset with suitable compile-time alignment to have a different accessor type, then we would need two changes. First, offset would need an overload taking a type that expresses the product of a compile-time integer (of suitable alignment) and a run-time integer. Second, we would need some way to get the accessor type corresponding to that offset overload -- say, a new templated type alias whose template parameter is the type of the corresponding offset overload's parameter. Given that offset is already part of the accessor, this would make more sense than introducing a customization point just for the offset accessor policy.

The work-around for a known accessor type is to replace submdspan and handle the accessor conversion by hand. The source accessor can still have an offset overload that takes the offset as a type other than size_t.

auto sub_map_offset = submdspan_mapping(src.mapping(), slices...);
return mdspan(
  src.accessor().offset(src.data(), sub_map_offset.offset),
  sub_map_offset.mapping,
  DesiredAccessor(src.accessor()));

Given that this work-around is easy to do, should only be needed for very special cases, and avoids a big design change to the accessor policy requirements, it should suffice.

The text was updated successfully, but these errors were encountered:

mhoemmen · 2024-02-26T22:59:54Z

Suggested fix: Write a separate follow-on paper that addresses layout_left, layout_right, layout_left_padded, and layout_right_padded all at once.

For the padded layouts, if the leftmost resp. rightmost extent gets strided_slice and its .extent is a compile-time value, then preserve the layout (taking care to preserve the stride, which might entail making it a run-time value -- check all cases here).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

P2642, P2630, P2897: Express compile-time extent with nonzero offset #443

P2642, P2630, P2897: Express compile-time extent with nonzero offset #443

mhoemmen commented Feb 6, 2024 •

edited

mhoemmen commented Feb 26, 2024

P2642, P2630, P2897: Express compile-time extent with nonzero offset #443

P2642, P2630, P2897: Express compile-time extent with nonzero offset #443

Comments

mhoemmen commented Feb 6, 2024 • edited

mhoemmen commented Feb 26, 2024

mhoemmen commented Feb 6, 2024 •

edited