
Tensor View Operations Slower Than Manual Looping #2776

Open
SteveMacenski opened this issue Mar 15, 2024 · 0 comments

Hi,

First off, thanks for all the great work on xtensor; it's made it possible to build state-of-the-art Model Predictive Control algorithms in ROS' Nav2 project using only the CPU, matching or in some cases beating GPU-enabled versions.

Since March 1st I've been on a kick of really narrowing in on our uses of xtensor, examining every operation to squeeze the last few bits of performance we can out of the system. I've found a couple of interesting results that I wanted to ask the maintainers about, since they seem very counterintuitive.

Manual Copy Loop 8-10x Faster Than View Assignment

There's a point where we have a method that assigns a set of controls to absolute velocities as a pass-through on the system dynamics. If that doesn't grok, I'm basically just copying one tensor into another with a one-index offset, shown below.

    xt::noalias(xt::view(state.vx, xt::all(), xt::range(1, _))) =
      xt::view(state.cvx, xt::all(), xt::range(0, -1));

    xt::noalias(xt::view(state.wz, xt::all(), xt::range(1, _))) =
      xt::view(state.cwz, xt::all(), xt::range(0, -1));

    if (isHolonomic()) {
      xt::noalias(xt::view(state.vy, xt::all(), xt::range(1, _))) =
        xt::view(state.cvy, xt::all(), xt::range(0, -1));
    }

This operation on tensors of shape {2000, 56} takes about 0.8-1.3ms per iteration. I thought to myself, "that's weird", so I wrote a quick loop doing the same thing, and it takes only 0.15-0.2ms.

    const bool is_holo = isHolonomic();
    for (unsigned int i = 0; i != state.vx.shape(0); i++) {
      for (unsigned int j = 1; j != state.vx.shape(1); j++) {
        state.vx(i, j) = state.cvx(i, j - 1);
        state.wz(i, j) = state.cwz(i, j - 1);
        if (is_holo) {
          state.vy(i, j) = state.cvy(i, j - 1);
        }
      }
    }

I'm not sure what to make of this except that I feel like I must be missing some subtle detail.
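One hypothesis for the gap: for a row-major tensor, the slice being copied is contiguous within each row, so a hand-written loop can be lowered to a bulk per-row copy (memmove or an auto-vectorized loop), whereas the lazy view assignment may traverse element by element through generic steppers. A minimal plain-C++ sketch of the same shifted copy over a flat row-major buffer (the shapes and names here are illustrative, not taken from the Nav2 code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy src[:, 0:cols-1] into dst[:, 1:cols] for row-major {rows, cols}
// buffers. Each row's source and destination spans are contiguous, so a
// single memcpy (or an auto-vectorized loop) handles the whole row.
void shifted_copy(const std::vector<float>& src, std::vector<float>& dst,
                  std::size_t rows, std::size_t cols) {
  for (std::size_t i = 0; i < rows; ++i) {
    // dst(i, 1..cols-1) = src(i, 0..cols-2)
    std::memcpy(&dst[i * cols + 1], &src[i * cols],
                (cols - 1) * sizeof(float));
  }
}
```

At {2000, 56}, each row copy is 55 contiguous floats, which is exactly the access pattern a generic stepper-based traversal can fail to exploit.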

An aside

I've also been seeing interesting results with xt::cumsum, xt::atan2, xt::cos, and xt::sin, where the equivalent loop (or vectorize()) is approximately as fast as the xtensor function. I don't know whether you expect these to be significantly faster (I would have thought so, with SIMD), but I was surprised to find them as slow as they are. About ~1.5ms of our total ~6ms control loop is spent just computing xt::cumsum over three {2000, 56} tensors. I would have liked to use it in more places, but its overhead made that impractical. Example:

  xt::noalias(trajectories.x) = state.pose.pose.position.x +
    xt::cumsum(dx * settings_.model_dt, 1);

...

  auto yaws_between_points = xt::atan2(
    goal_y - data.trajectories.y,
    goal_x - data.trajectories.x);
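For reference, the loop version of the cumulative sum I compared against looks roughly like the sketch below (plain C++ over a flat row-major buffer, written without xtensor so it stands alone; names are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place cumulative sum along axis 1 of a row-major {rows, cols} buffer,
// computing the same reduction as xt::cumsum(t, 1). Each row is a serial
// prefix scan; rows are independent and the accesses are fully contiguous.
void cumsum_axis1(std::vector<double>& t, std::size_t rows, std::size_t cols) {
  for (std::size_t i = 0; i < rows; ++i) {
    double* row = &t[i * cols];
    for (std::size_t j = 1; j < cols; ++j) {
      row[j] += row[j - 1];
    }
  }
}
```

Worth noting: a prefix sum has a loop-carried dependency along the scan axis, so there is little for SIMD to exploit within a row. That may be part of why xt::cumsum is not dramatically faster than a naive loop, whatever else is going on.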