I am wondering if dask or pandas has native or built-in support for aggregate functions that run against vector data. Namely, text/image embeddings are stored in a column of a csv/parquet file, and I'd like to run various aggregate functions such as mean, max and so on. All these operations are element-wise: for example, the mean aggregates the values at each index across rows and returns an array of the same length. What's more, I'd like to run K-Nearest-Neighbor search as well.
If not natively supported, how can these operations be achieved in a performance-efficient way?
example code:
import dask.dataframe as dd
import pandas as pd
import numpy as np
# Sample DataFrame with arrays in one of the columns
data = {
'category': ['A', 'A', 'B', 'B'],
'values': [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9]), np.array([10, 11, 12])],
'scalar': [1, 2, 3, 4]
}
pdf = pd.DataFrame(data)
# Convert the Pandas DataFrame to a Dask DataFrame
ddf = dd.from_pandas(pdf, npartitions=2)
result = ddf.groupby('category')['values'].mean().compute()
print(result)
Expected output
category
A [2.5, 3.5, 4.5]
B [8.5, 9.5, 10.5]
Name: values, dtype: object
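If native support turns out to be missing, one workaround is to do the stacking yourself with NumPy: pile each group's arrays into a 2-D matrix and reduce along axis 0 for the element-wise aggregates, and compute distances against the same stacked matrix for a brute-force nearest-neighbor lookup. The sketch below assumes plain pandas (the helper name `elementwise_group_mean` is made up for illustration; in dask the same lambda could go through `groupby(...).apply` with an explicit `meta`):

```python
import numpy as np
import pandas as pd

# Hypothetical helper: element-wise mean of an array-valued column per group.
# Stack each group's arrays into a 2-D matrix, then reduce along axis 0.
def elementwise_group_mean(df, by, col):
    return df.groupby(by)[col].apply(lambda s: np.stack(s.tolist()).mean(axis=0))

pdf = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'values': [np.array([1, 2, 3]), np.array([4, 5, 6]),
               np.array([7, 8, 9]), np.array([10, 11, 12])],
})
result = elementwise_group_mean(pdf, 'category', 'values')
print(result)

# Brute-force K-nearest-neighbor search over the stacked embeddings.
# For large data a dedicated index (e.g. faiss or scikit-learn's
# NearestNeighbors) would be the performance-efficient choice.
emb = np.stack(pdf['values'].tolist()).astype(float)
query = np.array([1.0, 2.0, 3.0])
dists = np.linalg.norm(emb - query, axis=1)  # Euclidean distance to each row
nearest = np.argsort(dists)[:2]              # indices of the 2 closest rows
print(nearest)
```

Replacing `.mean(axis=0)` with `.max(axis=0)` (or any other axis-0 reduction) gives the other element-wise aggregates the same way.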