Improve the performance of convolve (and correlate)

The existing implementation of `convolve` (which is also used for `correlate`) was slow because the inner loop of the convolution used regular tensor element access, which is very expensive. The fix is to ensure that the input tensors are contiguous (cloning them if necessary) and then to use `unsafe_raw_buf` for all tensor element accesses, which is safe because all the indexes are known to be within the tensor boundaries. On my system this change makes a large convolution go from ~1.5 seconds in release mode and ~0.25 seconds in danger mode down to ~65 msecs in both modes (i.e. a x23 reduction in release mode and a x3.8 reduction in danger mode)!
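As a rough illustration of why the unchecked access is safe, here is a minimal stand-alone sketch of the same loop structure. It is not the Arraymancer code: `convolveSketch` is a hypothetical name, plain `seq[float]` values stand in for contiguous rank-1 tensors, `ptr UncheckedArray` views stand in for `unsafe_raw_buf`, and the `offset` used in the real implementation is assumed to be 0 (a full convolution).

```nim
# Hypothetical sketch: plain seqs stand in for contiguous rank-1 tensors.
proc convolveSketch(f, g: seq[float]): seq[float] =
  ## Full 1-D convolution: result[n] is the sum over m of f[m] * g[n - m].
  if f.len == 0 or g.len == 0:
    return @[]
  result = newSeq[float](f.len + g.len - 1)  # zero-initialized
  # Raw, unchecked views of the data (assumed analogue of `unsafe_raw_buf`).
  let fBuf = cast[ptr UncheckedArray[float]](unsafeAddr f[0])
  let gBuf = cast[ptr UncheckedArray[float]](unsafeAddr g[0])
  let rBuf = cast[ptr UncheckedArray[float]](addr result[0])
  for n in 0 ..< result.len:
    # These bounds guarantee 0 <= m < f.len and 0 <= n - m < g.len,
    # so every access below stays inside its buffer without a bounds check.
    for m in max(0, n - g.len + 1) .. min(f.len - 1, n):
      rBuf[n] += fBuf[m] * gBuf[n - m]

when isMainModule:
  echo convolveSketch(@[1.0, 2.0, 3.0], @[0.0, 1.0, 0.5])
```

The point of the sketch is that the inner loop bounds themselves act as the bounds check, so repeating the check on every element access only adds overhead.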
mratsim · May 1, 2024 · 8474811
1 parent 1448698
Showing 1 changed file with 14 additions and 1 deletion.
diff --git a/src/arraymancer/tensor/math_functions.nim b/src/arraymancer/tensor/math_functions.nim
@@ -294,12 +294,25 @@ proc convolveImpl[T: SomeNumber | Complex32 | Complex64](
   # Initialize the result tensor
   result = zeros[T](len_result)
 
+  # Ensure that the input tensors are contiguous so that they can be accessed
+  # efficiently using `unsafe_raw_buf` in the inner loop
+  let f = if f.isContiguous(): f else: f.clone()
+  let g = if g.isContiguous(): g else: g.clone()
+
   # And perform the convolution
   omp_parallel_blocks(block_offset, block_size, len_result):
     for n in block_offset ..< block_offset + block_size:
       let shift = n + offset
       for m in max(0, shift - g.size + 1) .. min(f.size - 1, shift):
-        result[n] += f[m] * g[shift - m]
+        # We want to do the following operation:
+        # result[n] += f[m] * g[shift - m]
+        # In order to do it efficiently, we want to avoid all the overhead of
+        # using regular `[]` access. Since we know that we are working with
+        # contiguous, rank-1 tensors, that `n`, `m` and `shift-m` are within
+        # the boundaries of `result`, `f` and `g` (respectively) and since
+        # `T is KnownSupportsCopyMem`, it is safe (and way more efficient) to
+        # use `unsafe_raw_buf` to access the actual tensor elements here:
+        result.unsafe_raw_buf[n] += f.unsafe_raw_buf[m] * g.unsafe_raw_buf[shift - m]
 
 proc convolve*[T: SomeNumber | Complex32 | Complex64](
     t1, t2: Tensor[T],