Bad performance of views/slicing #2734

Open
razorx89 opened this issue Oct 18, 2023 · 2 comments

razorx89 commented Oct 18, 2023

Hi,

I am still getting used to the library, but I was able to isolate an unexpected performance hit. I want to update just a subregion of a pre-allocated 1D tensor. Is there a better pattern to achieve the same result?

#include <chrono>
#include <iostream>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>

double mean_milliseconds_from_total(std::chrono::nanoseconds total,
                                    size_t num_repeats) {
  std::chrono::duration<double, std::milli> total_ms = total;
  return total_ms.count() / (double)num_repeats;
}

int main() {
  size_t num_repeats = 100;
  xt::xtensor<double, 1> a = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> b = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> c = xt::zeros<double>({10000000});

  // case 1: full tensor
  auto started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = a + b;
  auto finished = std::chrono::high_resolution_clock::now();
  std::cout << "elapsed time: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2: view of tensor with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::all()) = xt::view(a + b, xt::all());
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "elapsed time: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3: view of tensor with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::range(0, c.size())) =
        xt::view(a + b, xt::range(0, c.size()));
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "elapsed time: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;
  return 0;
}

Result:

elapsed time: 8.00238ms
elapsed time: 31.0913ms
elapsed time: 30.9484ms

I understand that introducing views can incur some overhead, but for what is essentially the same task (same memory layout, contiguous memory, same range, step size of one), this is quite a big hit. Is this expected behavior, or am I doing something wrong?

Thanks.

Versions:

  • xtl v0.7.5
  • xtensor v0.24.7
  • Apple clang version 14.0.3 (clang-1403.0.22.14.1)
razorx89 commented Oct 18, 2023

In a little more detail:

#include <chrono>
#include <iostream>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>

double mean_milliseconds_from_total(std::chrono::nanoseconds total,
                                    size_t num_repeats) {
  std::chrono::duration<double, std::milli> total_ms = total;
  return total_ms.count() / (double)num_repeats;
}

int main() {
  size_t num_repeats = 100;
  xt::xtensor<double, 1> a = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> b = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> c = xt::zeros<double>({10000000});

  // case 1: full tensor
  auto started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = a + b;
  auto finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 1:  "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2a: view of tensor and expression with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::all()) = xt::view(a + b, xt::all());
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 2a: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2b: view of only tensor with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::all()) = a + b;
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 2b: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2c: view of only expression with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = xt::view(a + b, xt::all());
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 2c: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3a: view of tensor and expression with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::range(0, c.size())) =
        xt::view(a + b, xt::range(0, c.size()));
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 3a: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3b: view of only tensor with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::range(0, c.size())) = a + b;
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 3b: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3c: view of only expression with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = xt::view(a + b, xt::range(0, c.size()));
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 3c: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;
  return 0;
}

Result:

case 1:  7.82969ms
case 2a: 31.7327ms
case 2b: 9.01278ms
case 2c: 29.9268ms
case 3a: 31.1293ms
case 3b: 8.89544ms
case 3c: 29.6933ms

Accessing the expression through a view seems to be the expensive part.

spectre-ns (Contributor) commented
This is actually a compiler issue. I posted this in the gitter channel as well. If Intel's compiler can make sense of the view objects, then the other implementations should be able to as well. If you're using MSVC, Clang, or GCC, that doesn't help much, but it's good to know the optimizations exist.

In MSVC with /O2 /Ob2 /arch:avx512:
USING XSIMD
SIMD SIZE: 8

case 1: 38.3063ms
case 2a: 114.402ms
case 2b: 35.6107ms
case 2c: 92.7817ms
case 3a: 106.946ms
case 3b: 35.6352ms
case 3c: 102.884ms

With Intel 2023 DPC++/C++ Optimizing Compiler using equivalent flags:
USING XSIMD
SIMD SIZE: 8

case 1: 38.7353ms
case 2a: 45.1576ms
case 2b: 40.1831ms
case 2c: 30.8375ms
case 3a: 35.3932ms
case 3b: 37.2431ms
case 3c: 27.0555ms
