cp.unique without sorting #8307

Open
essoca opened this issue Apr 24, 2024 · 4 comments

Comments

@essoca

essoca commented Apr 24, 2024

Description

There are use cases where one wants to find the unique rows of a 2D array without sorting the output. This is already recognized by pandas' implementation of unique, which is claimed to be significantly faster than NumPy's for long enough sequences.
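The semantic difference is easy to see on a toy input. A minimal sketch in pure NumPy: a hash-table unique (such as pandas') returns values in order of first appearance, which can also be emulated on top of np.unique by sorting the first-occurrence indices instead of the values:

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1])

# NumPy's unique always sorts its output:
print(np.unique(a))            # [1 2 3]

# First-appearance order (what a hash-table unique returns), emulated
# by sorting the first-occurrence indices rather than the values:
_, first_idx = np.unique(a, return_index=True)
print(a[np.sort(first_idx)])   # [3 1 2]
```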

I wanted to measure the difference by implementing such functionality myself, keeping NumPy's API. So I tried:

import numba
import numpy as np

@numba.njit('Tuple((int64[::1], int64[::1], int32[::1]))(int64[::1])')
def _unique1d_with_no_sort_cpu_kernel(arr):
    uniques = dict()
    counts = dict()
    INF64 = np.iinfo(np.int64).max
    ind = np.ones(arr.size, dtype=np.int64) * INF64
    inv = np.empty(arr.size, dtype=np.int64)
    count_unique = -1
    for i, row in enumerate(arr):
        if row not in uniques:
            count_unique += 1
            uniques[row] = count_unique
            counts[row] = 1
            ind[count_unique] = i
        else:
            counts[row] += 1
        inv[i] = uniques[row]
    counts = np.asarray(list(counts.values()), dtype=np.int32)
    return ind[ind != INF64], inv, counts

and then wrapped it with

def unique_no_sort_cpu_axis0(arr):
    arr_hashed = np.asarray([hash(row.tobytes()) for row in arr])
    return _unique1d_with_no_sort_cpu_kernel(arr_hashed)

A test case:

arr = np.random.randint(0, 2, (1000000, 10))
a1, b1, c1 = unique_no_sort_cpu_axis0(arr)
_, a2, b2, c2 = np.unique(
  arr,
  return_index=True,
  return_inverse=True,
  return_counts=True,
  axis=0
)

Asserting equality:

>>> all([np.all(a1==a2), np.all(b1==b2), np.all(c1==c2)])
True

Compare performance:

>>> %timeit unique_no_sort_cpu_axis0(arr)
273 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit np.unique(arr, return_index=True, return_inverse=True, return_counts=True, axis=0)
2.27 s ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This implementation is thus ~8x faster than NumPy's.

Since finding unique rows with CuPy is also of interest (see e.g. the latest post requesting such a feature, which is, by the way, already possible by sorting), I am very interested in the possibility of a faster implementation without sorting. Could you please consider it?

Many thanks in advance.

Additional Information

No response

@essoca essoca added the cat:feature New features/APIs label Apr 24, 2024
@kmaehashi
Member

Hi @essoca! CuPy's primary aim is to provide NumPy/SciPy-compatible APIs, and unique without sorting is currently considered functionality that end users would implement on top of CuPy. We will continue to watch the community's needs to see whether it is worth supporting exceptions like this.

As for the algorithm proposed, my impression is that it would be difficult to port it naively to CUDA and have it run efficiently.
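(For reference, one such user-side implementation on top of the existing API: the sorted results of unique(..., axis=0) can be permuted back into first-appearance order. A sketch in NumPy with a hypothetical helper name; cp.unique accepts the same arguments, though this still pays for the sort plus an extra argsort rather than avoiding sorting:)

```python
import numpy as np

def unique_rows_first_appearance(arr):
    """Unique rows in first-appearance order, built on np.unique
    (hypothetical helper; cp.unique takes the same arguments)."""
    _, idx, inv, cnt = np.unique(arr, axis=0, return_index=True,
                                 return_inverse=True, return_counts=True)
    inv = inv.ravel()                    # inverse shape varies across NumPy versions
    order = np.argsort(idx)              # sorted order -> first-appearance order
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)  # inverse permutation of `order`
    return idx[order], rank[inv], cnt[order]
```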

@leofang
Member

leofang commented Apr 29, 2024

Just FYI, the array API standard does not mandate a sort order, so the output could be unsorted:
https://data-apis.org/array-api/latest/API_specification/set_functions.html
Though in NumPy 2.0 it is still sorted:
https://numpy.org/devdocs/reference/routines.set.html
since all the array-API-compliant routines are implemented as shortcuts to numpy.unique() with different arguments.
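For completeness, this can be checked directly with the NumPy 2.0 set functions such as np.unique_counts (guarded with hasattr so the snippet also runs on NumPy < 2.0):

```python
import numpy as np

a = np.array([2, 2, 1])
if hasattr(np, "unique_counts"):
    # NumPy 2.0 array-API set function: a thin wrapper over np.unique,
    # hence the sorted output.
    res = np.unique_counts(a)
    values, counts = res.values, res.counts
else:
    values, counts = np.unique(a, return_counts=True)
print(values, counts)   # [1 2] [1 2]
```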

@essoca
Author

essoca commented Apr 29, 2024

Thanks for the info @leofang. What I don't get is: if the array-API-compliant routines are implemented as shortcuts to numpy.unique(), why not choose the most efficient unique() implementation by default? As far as I know, there are two well-known algorithms: 1) the O(N) hash-table approach shown above, and 2) the O(N log N) sort-and-diff approach used by NumPy/CuPy.

As to your impression @kmaehashi, I believe that the O(N) nature of finding uniques makes it "natural" for the CUDA computing model. In fact, below is a (deceptively simple) algorithm implementing the above logic for a simple, yet general enough, case: $N\times M$ binary arrays with $M \le \log_2(N)-1$.

import numba
import numba.cuda

@numba.cuda.jit('int64(int64[:])', device=True)
def _bin2dec(row):
    dec = 0
    for j in range(row.size):
        dec += row[j] * (2**j)
    return dec

@numba.guvectorize('int64[:,:], int64[:]', '(n,m)->(n)', target='cuda')
def _hash2d(arr, out):
    for i in range(arr.shape[0]):
        out[i] = _bin2dec(arr[i][::-1])

@numba.cuda.jit('int64[::1], int64[::1], int64[::1], int32[::1]')
def _unique1d_cuda_kernel(rows_hashed, ind, inv, count):
    i = numba.cuda.grid(1)
    if i < rows_hashed.size:
        row_idx = rows_hashed[i] % rows_hashed.size
        ind[row_idx] = i
        inv[i] = row_idx
        numba.cuda.atomic.add(count, row_idx, 1)

A test for this:

import cupy as cp

N, M = 10, 3
arr = cp.random.randint(0, 2, (N, M))
rows_hashed = cp.empty(arr.shape[0], dtype=cp.int64)
_hash2d(arr, rows_hashed)

ind = cp.ones(arr.shape[0], dtype=cp.int64) * N
inv = cp.empty(arr.shape[0], dtype=cp.int64)
count = cp.zeros(arr.shape[0], dtype=cp.int32)
_unique1d_cuda_kernel[1, N](rows_hashed, ind, inv, count)
ind = ind[ind != N]
count = count[count != 0]

uniques2, ind2, inv2, count2 = cp.unique(
    arr,
    return_index=True,
    return_inverse=True,
    return_counts=True,
    axis=0
)

uniques = arr[ind]

>>> cp.all(uniques==uniques2).item()
True

The complications in generalizing this are dealing with hash collisions when $M\ge\log_2(N)$, and of course keeping M below the integer-overflow boundary for the hashed values.

In the most general case, the problem boils down to having dicts in the CUDA target.
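A common building block for that is an open-addressing hash table with linear probing: insertion reduces to a compare-and-swap loop on the key slot, which maps directly onto CUDA's atomicCAS (this is essentially what libraries like NVIDIA's cuCollections provide). A serial sketch of the probing logic, with hypothetical names, assuming non-negative keys:

```python
import numpy as np

EMPTY = -1  # sentinel for an unused slot (keys assumed non-negative)

def assign_ids(keys, table_size):
    """Map each key to a dense id via linear probing (hypothetical sketch).
    In a CUDA kernel, claiming a slot (`table_keys[slot] = k`) would be an
    atomicCAS loop and `next_id` an atomically incremented counter."""
    table_keys = np.full(table_size, EMPTY, dtype=np.int64)
    table_vals = np.empty(table_size, dtype=np.int64)
    ids = np.empty(len(keys), dtype=np.int64)
    next_id = 0
    for i, k in enumerate(keys):
        slot = k % table_size
        while True:
            if table_keys[slot] == EMPTY:    # free slot: claim it
                table_keys[slot] = k
                table_vals[slot] = next_id
                next_id += 1
                break
            if table_keys[slot] == k:        # key already present
                break
            slot = (slot + 1) % table_size   # collision: probe next slot
        ids[i] = table_vals[slot]
    return ids
```

With such ids in hand, ind, inv, and counts would follow exactly as in the kernel above, without restricting M.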

@leofang
Member

leofang commented May 20, 2024

Thanks for the info @leofang. What I don't get is that, if array-API compliant routines are implemented as shortcuts to numpy.unique(), then why not choosing the most efficient unique() implementation by default?

Sorry, I dropped the ball. The rationale is that we (the array API standardization committee) did not want to dictate anything that could impede alternative implementations. NumPy has returned sorted results for years (and CuPy followed), and we (NumPy/CuPy) would not want to break backward compatibility; on the other hand, as you showed, there are needs for other libraries to do it differently.
