Bad performance of views/slicing #2734

Open
razorx89 opened this issue Oct 18, 2023 · 2 comments

razorx89 commented Oct 18, 2023

Hi,

I am still getting used to the library, but I was able to isolate an unexpected performance hit. I want to update just a subregion of a pre-allocated 1D tensor. Is there a better pattern to achieve the same result?

#include <chrono>
#include <iostream>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>

double mean_milliseconds_from_total(std::chrono::nanoseconds total,
                                    size_t num_repeats) {
  std::chrono::duration<double, std::milli> total_ms = total;
  return total_ms.count() / (double)num_repeats;
}

int main() {
  size_t num_repeats = 100;
  xt::xtensor<double, 1> a = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> b = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> c = xt::zeros<double>({10000000});

  // case 1: full tensor
  auto started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = a + b;
  auto finished = std::chrono::high_resolution_clock::now();
  std::cout << "elapsed time: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2: view of tensor with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::all()) = xt::view(a + b, xt::all());
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "elapsed time: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3: view of tensor with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::range(0, c.size())) =
        xt::view(a + b, xt::range(0, c.size()));
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "elapsed time: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;
  return 0;
}

Result:

elapsed time: 8.00238ms
elapsed time: 31.0913ms
elapsed time: 30.9484ms

I understand that introducing views can incur some overhead, but for what is essentially the same task (same memory layout, contiguous memory, same range, step size of one), this is quite a big hit. Is this expected behavior, or am I doing something wrong?

Thanks.

Versions:

  • xtl v0.7.5
  • xtensor v0.24.7
  • Apple clang version 14.0.3 (clang-1403.0.22.14.1)
razorx89 commented Oct 18, 2023

In a little more detail:

#include <chrono>
#include <iostream>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>

double mean_milliseconds_from_total(std::chrono::nanoseconds total,
                                    size_t num_repeats) {
  std::chrono::duration<double, std::milli> total_ms = total;
  return total_ms.count() / (double)num_repeats;
}

int main() {
  size_t num_repeats = 100;
  xt::xtensor<double, 1> a = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> b = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> c = xt::zeros<double>({10000000});

  // case 1: full tensor
  auto started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = a + b;
  auto finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 1:  "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2a: view of tensor and expression with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::all()) = xt::view(a + b, xt::all());
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 2a: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2b: view of only tensor with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::all()) = a + b;
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 2b: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2c: view of only expression with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = xt::view(a + b, xt::all());
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 2c: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3a: view of tensor and expression with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::range(0, c.size())) =
        xt::view(a + b, xt::range(0, c.size()));
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 3a: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3b: view of only tensor with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::range(0, c.size())) = a + b;
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 3b: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3c: view of only expression with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = xt::view(a + b, xt::range(0, c.size()));
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 3c: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;
  return 0;
}

Result:

case 1:  7.82969ms
case 2a: 31.7327ms
case 2b: 9.01278ms
case 2c: 29.9268ms
case 3a: 31.1293ms
case 3b: 8.89544ms
case 3c: 29.6933ms

Accessing the expression through a view seems to be the expensive part.

spectre-ns (Contributor) commented
This is actually a compiler issue. I posted this in the gitter channel as well. If Intel's compiler can make sense of the view objects, then the other implementations should be able to as well. If you're using MSVC, Clang, or GCC, that doesn't help much, but it's good to know the optimizations exist.

In MSVC with /O2 /Ob2 /arch:avx512:
USING XSIMD
SIMD SIZE: 8

case 1: 38.3063ms
case 2a: 114.402ms
case 2b: 35.6107ms
case 2c: 92.7817ms
case 3a: 106.946ms
case 3b: 35.6352ms
case 3c: 102.884ms

With Intel 2023 DPC++/C++ Optimizing Compiler using equivalent flags:
USING XSIMD
SIMD SIZE: 8

case 1: 38.7353ms
case 2a: 45.1576ms
case 2b: 40.1831ms
case 2c: 30.8375ms
case 3a: 35.3932ms
case 3b: 37.2431ms
case 3c: 27.0555ms
