cp.unique without sorting #8307

Open
essoca opened this issue Apr 24, 2024 · 4 comments

Comments

@essoca

essoca commented Apr 24, 2024

Description

There are use cases where one wants to find the unique rows of a 2D array without sorting the output. This is already recognized by pandas' implementation of unique, which is claimed to be significantly faster than NumPy's for long enough sequences.
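The semantic difference is easy to see on a toy input. A minimal sketch in pure NumPy: a hash-table unique (such as pandas') returns values in order of first appearance, which can also be emulated on top of np.unique by sorting the first-occurrence indices instead of the values:

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1])

# NumPy's unique always sorts its output:
print(np.unique(a))            # [1 2 3]

# First-appearance order (what a hash-table unique returns), emulated
# by sorting the first-occurrence indices rather than the values:
_, first_idx = np.unique(a, return_index=True)
print(a[np.sort(first_idx)])   # [3 1 2]
```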

I wanted to measure the difference by implementing such functionality myself, keeping NumPy's API. So I tried:

import numba
import numpy as np

@numba.njit('Tuple((int64[::1], int64[::1], int32[::1]))(int64[::1])')
def _unique1d_with_no_sort_cpu_kernel(arr):
    uniques = dict()
    counts = dict()
    INF64 = np.iinfo(np.int64).max
    ind = np.ones(arr.size, dtype=np.int64) * INF64
    inv = np.empty(arr.size, dtype=np.int64)
    count_unique = -1
    for i, row in enumerate(arr):
        if row not in uniques:
            count_unique += 1
            uniques[row] = count_unique
            counts[row] = 1
            ind[count_unique] = i
        else:
            counts[row] += 1
        inv[i] = uniques[row]
    counts = np.asarray(list(counts.values()), dtype=np.int32)
    return ind[ind != INF64], inv, counts

and then wrapped it with

def unique_no_sort_cpu_axis0(arr):
    arr_hashed = np.asarray([hash(row.tobytes()) for row in arr])
    return _unique1d_with_no_sort_cpu_kernel(arr_hashed)

A test case:

arr = np.random.randint(0, 2, (1000000, 10))
a1, b1, c1 = unique_no_sort_cpu_axis0(arr)
_, a2, b2, c2 = np.unique(
  arr,
  return_index=True,
  return_inverse=True,
  return_counts=True,
  axis=0
)

Asserting equality:

>>> all([np.all(a1==a2), np.all(b1==b2), np.all(c1==c2)])
True

Compare performance:

>>> %timeit unique_no_sort_cpu_axis0(arr)
273 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit np.unique(arr, return_index=True, return_inverse=True, return_counts=True, axis=0)
2.27 s ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This implementation is thus ~8x faster than NumPy's.

Since finding unique rows with CuPy is also of interest (see e.g. the latest post requesting such a feature, which is, by the way, already possible by sorting), I am very interested in the possibility of a faster implementation without sorting. Could you please consider it?

Many thanks in advance.

Additional Information

No response

@essoca essoca added the cat:feature New features/APIs label Apr 24, 2024
@kmaehashi
Member

Hi @essoca! CuPy's primary aim is to provide NumPy/SciPy-compatible APIs, and unique without sorting is currently considered functionality that end users would implement on top of CuPy. We will continue to watch the community's needs to see whether it is worth supporting exceptions like this.

As for the algorithm proposed, my impression is that it would be difficult to port it naively to CUDA and have it run efficiently.
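(For reference, one such user-side implementation on top of the existing API: the sorted results of unique(..., axis=0) can be permuted back into first-appearance order. A sketch in NumPy with a hypothetical helper name; cp.unique accepts the same arguments, though this still pays for the sort plus an extra argsort rather than avoiding sorting:)

```python
import numpy as np

def unique_rows_first_appearance(arr):
    """Unique rows in first-appearance order, built on np.unique
    (hypothetical helper; cp.unique takes the same arguments)."""
    _, idx, inv, cnt = np.unique(arr, axis=0, return_index=True,
                                 return_inverse=True, return_counts=True)
    inv = inv.ravel()                    # inverse shape varies across NumPy versions
    order = np.argsort(idx)              # sorted order -> first-appearance order
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)  # inverse permutation of `order`
    return idx[order], rank[inv], cnt[order]
```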

@leofang
Member

leofang commented Apr 29, 2024

Just FYI, the array API standard does not mandate a sort order, so the output could be unsorted:
https://data-apis.org/array-api/latest/API_specification/set_functions.html
Though in NumPy 2.0 it is still sorted:
https://numpy.org/devdocs/reference/routines.set.html
since all the array-API-compliant routines are implemented as shortcuts to numpy.unique() with different arguments.
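For completeness, this can be checked directly with the NumPy 2.0 set functions such as np.unique_counts (guarded with hasattr so the snippet also runs on NumPy < 2.0):

```python
import numpy as np

a = np.array([2, 2, 1])
if hasattr(np, "unique_counts"):
    # NumPy 2.0 array-API set function: a thin wrapper over np.unique,
    # hence the sorted output.
    res = np.unique_counts(a)
    values, counts = res.values, res.counts
else:
    values, counts = np.unique(a, return_counts=True)
print(values, counts)   # [1 2] [1 2]
```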

@essoca
Author

essoca commented Apr 29, 2024

Thanks for the info @leofang. What I don't get is: if the array-API-compliant routines are implemented as shortcuts to numpy.unique(), why not choose the most efficient unique() implementation by default? As far as I know, there are two well-known algorithms: 1) the O(N) hash-table approach shown above, and 2) the O(N log N) sort-and-diff approach used by NumPy/CuPy.

As to your impression @kmaehashi, I believe that the O(N) nature of finding uniques makes it "natural" for the CUDA computing model. In fact, below is a (deceptively simple) algorithm implementing the above logic for a simple, yet general enough, case: $N\times M$ binary arrays with $M \le \log_2(N)-1$.

import numba
import numba.cuda

@numba.cuda.jit('int64(int64[:])', device=True)
def _bin2dec(row):
    dec = 0
    for j in range(row.size):
        dec += row[j] * (2**j)
    return dec

@numba.guvectorize('int64[:,:], int64[:]', '(n,m)->(n)', target='cuda')
def _hash2d(arr, out):
    for i in range(arr.shape[0]):
        out[i] = _bin2dec(arr[i][::-1])

@numba.cuda.jit('int64[::1], int64[::1], int64[::1], int32[::1]')
def _unique1d_cuda_kernel(rows_hashed, ind, inv, count):
    i = numba.cuda.grid(1)
    if i < rows_hashed.size:
        row_idx = rows_hashed[i] % rows_hashed.size
        ind[row_idx] = i
        inv[i] = row_idx
        numba.cuda.atomic.add(count, row_idx, 1)

A test for this:

import cupy as cp

N, M = 10, 3
arr = cp.random.randint(0, 2, (N, M))
rows_hashed = cp.empty(arr.shape[0], dtype=cp.int64)
_hash2d(arr, rows_hashed)

ind = cp.ones(arr.shape[0], dtype=cp.int64) * N
inv = cp.empty(arr.shape[0], dtype=cp.int64)
count = cp.zeros(arr.shape[0], dtype=cp.int32)
_unique1d_cuda_kernel[1, N](rows_hashed, ind, inv, count)
ind = ind[ind != N]
count = count[count != 0]

uniques2, ind2, inv2, count2 = cp.unique(
    arr,
    return_index=True,
    return_inverse=True,
    return_counts=True,
    axis=0
)

uniques = arr[ind]

>>> cp.all(uniques==uniques2).item()
True

The complications in generalizing this are dealing with hash collisions when $M\ge\log_2(N)$, and of course keeping M below the integer-overflow boundary for the hashed values.

In the most general case, the problem boils down to having dicts in the CUDA target.
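A common building block for that is an open-addressing hash table with linear probing: insertion reduces to a compare-and-swap loop on the key slot, which maps directly onto CUDA's atomicCAS (this is essentially what libraries like NVIDIA's cuCollections provide). A serial sketch of the probing logic, with hypothetical names, assuming non-negative keys:

```python
import numpy as np

EMPTY = -1  # sentinel for an unused slot (keys assumed non-negative)

def assign_ids(keys, table_size):
    """Map each key to a dense id via linear probing (hypothetical sketch).
    In a CUDA kernel, claiming a slot (`table_keys[slot] = k`) would be an
    atomicCAS loop and `next_id` an atomically incremented counter."""
    table_keys = np.full(table_size, EMPTY, dtype=np.int64)
    table_vals = np.empty(table_size, dtype=np.int64)
    ids = np.empty(len(keys), dtype=np.int64)
    next_id = 0
    for i, k in enumerate(keys):
        slot = k % table_size
        while True:
            if table_keys[slot] == EMPTY:    # free slot: claim it
                table_keys[slot] = k
                table_vals[slot] = next_id
                next_id += 1
                break
            if table_keys[slot] == k:        # key already present
                break
            slot = (slot + 1) % table_size   # collision: probe next slot
        ids[i] = table_vals[slot]
    return ids
```

With such ids in hand, ind, inv, and counts would follow exactly as in the kernel above, without restricting M.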

@leofang
Member

leofang commented May 20, 2024

Thanks for the info @leofang. What I don't get is that, if array-API compliant routines are implemented as shortcuts to numpy.unique(), then why not choosing the most efficient unique() implementation by default?

Sorry, I dropped the ball. The rationale is that we (the array API standardization committee) did not want to dictate anything that could impede alternative implementations. NumPy has returned sorted results for years (and CuPy followed), and we (NumPy/CuPy) would not want to break backward compatibility; on the other hand, as you showed, there are needs for other libraries to do it differently.
