parse_gctx takes too much memory when requesting specific columns and rows #63

Open
Dom303 opened this issue Jan 15, 2020 · 2 comments
Dom303 commented Jan 15, 2020

When loading a very large GCTX file (~24 GB) on my laptop with 16 GB of RAM using the function cmapPy.pandasGEXpress.parse.parse, I run out of memory with the following error:

Unable to allocate array with shape (473647,) and data type

If I use cidx to select a small number of columns, the error goes away.
However, when I request specific columns and specific rows using both cidx and ridx, the same allocation error occurs. This suggests that the row filtering is applied first, followed by the column filtering. That behaviour is a problem for very large CMap files, where both filters should be applied simultaneously to avoid running out of RAM.
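Until the two filters are combined, one workaround is to request only a small batch of columns at a time and filter rows immediately after each load, so peak memory stays near one batch rather than the whole file. A minimal numpy sketch of the idea (the toy array stands in for the GCTX file, and each column slice stands in for one `parse(..., cidx=...)` call; the indices and batch size are illustrative):

```python
import numpy as np

# Toy stand-in for the on-disk matrix (the real GCTX file is ~24 GB).
full = np.arange(100 * 50, dtype=float).reshape(100, 50)

ridx = [3, 7, 9]                 # rows we actually want
cidx = list(range(0, 50, 5))     # columns we actually want

# Load a few columns at a time, filter rows immediately, then
# concatenate the row-filtered batches. Peak memory is roughly one
# batch of full-height columns, not the whole matrix.
batch = 4
parts = []
for start in range(0, len(cidx), batch):
    cols = cidx[start:start + batch]
    # full[:, cols] stands in for parse("file.gctx", cidx=cols)
    parts.append(full[:, cols][ridx, :])

subset = np.concatenate(parts, axis=1)
assert subset.shape == (len(ridx), len(cidx))
```

Because batches are taken in cidx order and rows are filtered identically in each batch, the result matches a single combined row/column selection.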

The problem comes from pandasGEXpress.parse_metadata_df, at the line curr_dset.read_direct(temp_array).
As called there, read_direct reads the entire dataset into temp_array with no way to filter rows or columns first.
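For comparison, here is a small numpy illustration of the memory shape of the two strategies (a plain array stands in for the HDF5 dataset, and the sizes are toy values): copying everything and filtering afterwards allocates a buffer the size of the whole dataset, while indexing rows and columns together materializes only the requested block.

```python
import numpy as np

# Stand-in for the HDF5 dataset behind read_direct.
dset = np.arange(6 * 8, dtype=float).reshape(6, 8)

ridx = [1, 4]
cidx = [0, 2, 5]

# What the current code effectively does: allocate a buffer for the
# whole dataset, copy everything, then filter afterwards.
temp_array = np.empty(dset.shape, dtype=dset.dtype)
np.copyto(temp_array, dset)          # the full allocation happens here
filtered_late = temp_array[np.ix_(ridx, cidx)]

# What combined filtering could do: materialize only the block asked for.
filtered_early = dset[np.ix_(ridx, cidx)]

assert np.array_equal(filtered_late, filtered_early)
```

Both paths return the same values; only the size of the intermediate allocation differs, which is exactly what matters for a 24 GB file on a 16 GB machine.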

@dhamelse

I am also experiencing this issue.

@mark-liddell

This is also causing issues for me.
