FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466

phofl · 2023-08-09T05:31:08Z

This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.

The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html

If you would like to filter this warning without installing pyarrow at this time, please view this comment: #54466 (comment)
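
For readers who only need the filtering part, a minimal sketch follows. It assumes pandas emits the notice as a `DeprecationWarning` whose message begins with "Pyarrow will become a required dependency"; check this against what your pandas version actually prints. The helper name `silence_pyarrow_warning` is made up for illustration, and the linked comment remains the reference recipe.

```python
import warnings


def silence_pyarrow_warning() -> None:
    """Ignore the pandas notice about the upcoming PyArrow requirement.

    Assumption: the notice is a DeprecationWarning whose text starts with
    "Pyarrow will become a required dependency" (optionally preceded by
    whitespace). Adjust the pattern if your pandas version words it differently.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"\s*Pyarrow will become a required dependency",
        category=DeprecationWarning,
    )


# Usage: install the filter before pandas is imported for the first time.
silence_pyarrow_warning()
# import pandas as pd
```

The filter matches on the start of the message only, so unrelated deprecation warnings are still shown.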

mynewestgitaccount · 2023-08-11T05:29:06Z

Something that hasn't received enough attention/discussion, at least in my mind, is this piece of the Drawbacks section of the PDEP (bolding added by me):

Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow using pip from wheels, numpy and pandas requires about 70MB, and including PyArrow requires an additional 120MB. An increase of installation size would have negative implication using pandas in space-constrained development or deployment environments such as AWS Lambda.

I honestly don't understand how mandating a 170% increase in the effective size of a pandas installation (70MB to 190MB, from the numbers in the quoted text) can be considered okay.

For that kind of increase, I would expect/want the tradeoff to be major improvements across the board. Instead, this change comes with limited benefit but massive bloat for anyone who doesn't need the features PyArrow enables, e.g. for those who don't have issues with the current functionality of pandas.

rebecca-palmer · 2023-08-14T07:09:04Z

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)

I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).

mroeschke · 2023-08-16T20:42:17Z

For that kind of increase, I would expect/want the tradeoff to be major improvements across the board.

Yeah unfortunately this is where the subjective tradeoff comes into effect. pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively. The hope with pyarrow is that the tradeoff improves the current functionality for common "object" types in pandas such as text, binary, decimal, and nested data.

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible.

AFAIK most pydata projects don't actually publish/manage Linux system packages for their respective libraries. Do you know how these are packaged today?

mynewestgitaccount · 2023-08-16T21:18:43Z

pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively.

The pytz and dateutil wheels are only ~500kb. Drawing a comparison between them and PyArrow seems like a stretch, to put it lightly.

rebecca-palmer · 2023-08-16T21:26:21Z

Do you know how these are packaged today?

By whoever offers to do it, currently me for pandas. Of the pydata projects, Debian currently has pydata-sphinx-theme, sparse, patsy, xarray and numexpr.

An old discussion thread (anyone can post there, but be warned that doing so will expose your non-spam-protected email address) suggests that there is existing work on a pyarrow Debian package, but I don't yet know whether it ever got far enough to work.

rebecca-palmer · 2023-08-18T07:53:15Z

I do intend to investigate this further at some point - I haven't done so yet because Debian updated numexpr to 2.8.5, breaking pandas (#54449 / #54546), and fixing that is currently more urgent.

jjerphan · 2023-08-18T20:22:30Z

Hi,

Thanks for welcoming feedback from the community.

While I respect your decision, I am afraid that making pyarrow a required dependency will come with costly consequences for users and downstream libraries' developers and maintainers for two reasons:

installing pyarrow after pandas in a fresh conda environment increases its size from approximately 100MiB to approximately 500 MiB.

Packages size

libgoogle-cloud-2.12.0-h840a212_1 :                 46106632 bytes,
python-3.11.4-hab00c5b_0_cpython :                  30679695 bytes,
libarrow-12.0.1-h10ac928_8_cpu :                    27696900 bytes,
ucx-1.14.1-h4a2ce2d_3 :                             15692979 bytes,
pandas-2.0.3-py311h320fe9a_1 :                      14711359 bytes,
numpy-1.25.2-py311h64a7726_0 :                      8139293 bytes,
libgrpc-1.56.2-h3905398_1 :                         6331805 bytes,
libopenblas-0.3.23-pthreads_h80387f5_0 :            5406072 bytes,
aws-sdk-cpp-1.10.57-h85b1a90_19 :                   4055495 bytes,
pyarrow-12.0.1-py311h39c9aba_8_cpu :                3989550 bytes,
libstdcxx-ng-13.1.0-hfd8a6a1_0 :                    3847887 bytes,
rdma-core-28.9-h59595ed_1 :                         3735644 bytes,
libthrift-0.18.1-h8fd135c_2 :                       3584078 bytes,
tk-8.6.12-h27826a3_0 :                              3456292 bytes,
openssl-3.1.2-hd590300_0 :                          2646546 bytes,
libprotobuf-4.23.3-hd1fb520_0 :                     2506133 bytes,
libgfortran5-13.1.0-h15d22d2_0 :                    1437388 bytes,
pip-23.2.1-pyhd8ed1ab_0 :                           1386212 bytes,
krb5-1.21.2-h659d440_0 :                            1371181 bytes,
libabseil-20230125.3-cxx17_h59595ed_0 :             1240376 bytes,
orc-1.9.0-h385abfd_1 :                              1020883 bytes,
ncurses-6.4-hcb278e6_0 :                            880967 bytes,
pygments-2.16.1-pyhd8ed1ab_0 :                      853439 bytes,
jedi-0.19.0-pyhd8ed1ab_0 :                          844518 bytes,
libsqlite-3.42.0-h2797004_0 :                       828910 bytes,
libgcc-ng-13.1.0-he5830b7_0 :                       776294 bytes,
ld_impl_linux-64-2.40-h41732ed_0 :                  704696 bytes,
libnghttp2-1.52.0-h61bc06f_0 :                      622366 bytes,
ipython-8.14.0-pyh41d4057_0 :                       583448 bytes,
bzip2-1.0.8-h7f98852_4 :                            495686 bytes,
setuptools-68.1.2-pyhd8ed1ab_0 :                    462324 bytes,
zstd-1.5.2-hfc55251_7 :                             431126 bytes,
libevent-2.1.12-hf998b51_1 :                        427426 bytes,
libgomp-13.1.0-he5830b7_0 :                         419184 bytes,
xz-5.2.6-h166bdaf_0 :                               418368 bytes,
libcurl-8.2.1-hca28451_0 :                          372511 bytes,
s2n-1.3.48-h06160fa_0 :                             369441 bytes,
aws-crt-cpp-0.21.0-hb942446_5 :                     320415 bytes,
readline-8.2-h8228510_1 :                           281456 bytes,
libssh2-1.11.0-h0841786_0 :                         271133 bytes,
prompt-toolkit-3.0.39-pyha770c72_0 :                269068 bytes,
libbrotlienc-1.0.9-h166bdaf_9 :                     265202 bytes,
python-dateutil-2.8.2-pyhd8ed1ab_0 :                245987 bytes,
re2-2023.03.02-h8c504da_0 :                         201211 bytes,
aws-c-common-0.9.0-hd590300_0 :                     197608 bytes,
aws-c-http-0.7.11-h00aa349_4 :                      194366 bytes,
pytz-2023.3-pyhd8ed1ab_0 :                          186506 bytes,
aws-c-mqtt-0.9.3-hb447be9_1 :                       162493 bytes,
aws-c-io-0.13.32-h4a1a131_0 :                       154523 bytes,
ca-certificates-2023.7.22-hbcca054_0 :              149515 bytes,
lz4-c-1.9.4-hcb278e6_0 :                            143402 bytes,
python-tzdata-2023.3-pyhd8ed1ab_0 :                 143131 bytes,
libedit-3.1.20191231-he28a2e2_2 :                   123878 bytes,
keyutils-1.6.1-h166bdaf_0 :                         117831 bytes,
tzdata-2023c-h71feb2d_0 :                           117580 bytes,
gflags-2.2.2-he1b5a44_1004 :                        116549 bytes,
glog-0.6.0-h6f12383_0 :                             114321 bytes,
c-ares-1.19.1-hd590300_0 :                          113362 bytes,
libev-4.33-h516909a_1 :                             106190 bytes,
aws-c-auth-0.7.3-h28f7589_1 :                       101677 bytes,
libutf8proc-2.8.0-h166bdaf_0 :                      101070 bytes,
traitlets-5.9.0-pyhd8ed1ab_0 :                      98443 bytes,
aws-c-s3-0.3.14-hf3aad02_1 :                        86553 bytes,
libexpat-2.5.0-hcb278e6_1 :                         77980 bytes,
libbrotlicommon-1.0.9-h166bdaf_9 :                  71065 bytes,
parso-0.8.3-pyhd8ed1ab_0 :                          71048 bytes,
libzlib-1.2.13-hd590300_5 :                         61588 bytes,
libffi-3.4.2-h7f98852_5 :                           58292 bytes,
wheel-0.41.1-pyhd8ed1ab_0 :                         57374 bytes,
aws-c-event-stream-0.3.1-h2e3709c_4 :               54050 bytes,
aws-c-sdkutils-0.1.12-h4d4d85c_1 :                  53123 bytes,
aws-c-cal-0.6.1-hc309b26_1 :                        50923 bytes,
aws-checksums-0.1.17-h4d4d85c_1 :                   50001 bytes,
pexpect-4.8.0-pyh1a96a4e_2 :                        48780 bytes,
libnuma-2.0.16-h0b41bf4_1 :                         41107 bytes,
snappy-1.1.10-h9fff704_0 :                          38865 bytes,
typing_extensions-4.7.1-pyha770c72_0 :              36321 bytes,
libuuid-2.38.1-h0b41bf4_0 :                         33601 bytes,
libbrotlidec-1.0.9-h166bdaf_9 :                     32567 bytes,
libnsl-2.0.0-h7f98852_0 :                           31236 bytes,
wcwidth-0.2.6-pyhd8ed1ab_0 :                        29133 bytes,
asttokens-2.2.1-pyhd8ed1ab_0 :                      27831 bytes,
stack_data-0.6.2-pyhd8ed1ab_0 :                     26205 bytes,
executing-1.2.0-pyhd8ed1ab_0 :                      25013 bytes,
_openmp_mutex-4.5-2_gnu :                           23621 bytes,
libgfortran-ng-13.1.0-h69a702a_0 :                  23182 bytes,
libcrc32c-1.1.2-h9c3ff4c_0 :                        20440 bytes,
aws-c-compression-0.2.17-h4d4d85c_2 :               19105 bytes,
ptyprocess-0.7.0-pyhd3deb0d_0 :                     16546 bytes,
pure_eval-0.2.2-pyhd8ed1ab_0 :                      14551 bytes,
libblas-3.9.0-17_linux64_openblas :                 14473 bytes,
liblapack-3.9.0-17_linux64_openblas :               14408 bytes,
libcblas-3.9.0-17_linux64_openblas :                14401 bytes,
six-1.16.0-pyh6c4a22f_0 :                           14259 bytes,
backcall-0.2.0-pyh9f0ad1d_0 :                       13705 bytes,
matplotlib-inline-0.1.6-pyhd8ed1ab_0 :              12273 bytes,
decorator-5.1.1-pyhd8ed1ab_0 :                      12072 bytes,
backports.functools_lru_cache-1.6.5-pyhd8ed1ab_0 :  11519 bytes,
pickleshare-0.7.5-py_1003 :                         9332 bytes,
prompt_toolkit-3.0.39-hd8ed1ab_0 :                  6731 bytes,
backports-1.0-pyhd8ed1ab_3 :                        5950 bytes,
python_abi-3.11-3_cp311 :                           5682 bytes,
_libgcc_mutex-0.1-conda_forge :                     2562 bytes,

pyarrow also depends on libarrow, which itself depends on several notable C and C++ libraries. This constrains the installation of other packages whose dependencies might be incompatible with libarrow's, making pandas potentially unusable in some contexts.

Have you considered those two observations as drawbacks before taking the decision?
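
To put a number on a listing like the one above, the sizes can be summed with a few lines of Python. This is only a sketch: `LISTING` holds just the three largest entries copied from the list, and `total_bytes` is a hypothetical helper, not part of conda.

```python
import re

# Three entries copied from the listing above; paste the full list to get the real total.
LISTING = """\
libgoogle-cloud-2.12.0-h840a212_1 :                 46106632 bytes,
python-3.11.4-hab00c5b_0_cpython :                  30679695 bytes,
libarrow-12.0.1-h10ac928_8_cpu :                    27696900 bytes,
"""

# One entry per line: "<package> : <size> bytes,"
LINE = re.compile(r"^(?P<name>\S+)\s*:\s*(?P<size>\d+) bytes,?$")


def total_bytes(listing: str) -> int:
    """Sum the byte counts of every 'name : N bytes,' line."""
    total = 0
    for raw in listing.splitlines():
        match = LINE.match(raw.strip())
        if match:
            total += int(match.group("size"))
    return total


size = total_bytes(LISTING)
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

Lines that do not match the pattern are skipped, so the full listing can be pasted in unchanged.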

lithomas1 · 2023-08-18T20:32:05Z

Hi,

Thanks for welcoming feedback from the community.

While I respect your decision, I am afraid that making pyarrow a required dependency will come with costly consequences for users and downstream libraries' developers and maintainers for two reasons:

installing pyarrow after pandas in a fresh conda environment increases its size from approximately 100MiB to approximately 500 MiB.

Packages size

pyarrow also depends on libarrow, which itself depends on several notable C and C++ libraries. This constrains the installation of other packages whose dependencies might be incompatible with libarrow's, making pandas potentially unusable in some contexts.

Have you considered those two observations as drawbacks before taking the decision?

This is discussed a bit in https://github.com/pandas-dev/pandas/pull/52711/files#diff-3fc3ce7b7d119c90be473d5d03d08d221571c67b4f3a9473c2363342328535b2R179-R193 (for pip only I guess).

While currently the build size for pyarrow is pretty large, it doesn't "have" to be that big. I think by pandas 3.0 (when pyarrow will actually become required), at least some components will be spun out/made optional/something like that (I heard that the arrow people were talking about this).

(cc @jorisvandenbossche for more info on this)

I'm not an Arrow dev myself, but if this is something that just needs someone to look at it, I'm happy to put some time in to help give Arrow a nudge in the right direction.

Finally, for clarity purposes, is the reason for concern also AWS lambda/pyodide/Alpine, or something else?

(IMO, outside of stuff like lambda funcs, pyarrow isn't too egregious in terms of package size compared to pytorch/tensorflow, but it's definitely something that can be improved)

jjerphan · 2023-08-18T20:49:13Z

If libarrow is slimmed down by having non-essential Arrow features be extracted into other libraries which could be optional dependencies, I think most people's concerns would be addressed.

Edit: See conda-forge/arrow-cpp-feedstock#1035

DerThorsten · 2023-08-22T07:16:22Z

Hi,

Thanks for welcoming feedback from the community.
For wasm builds of python / python-packages (ie pyodide / emscripten-forge) package size really matters since these packages have to be downloaded from within the browser. Once a package is too big, usability suffers drastically.

With pyarrow as a required dependency, pandas is less usable from python in the browser.

surfaceowl · 2023-08-30T15:36:08Z

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)

I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).

There is another way - use virtual environments in user space instead of system python. The Python Software Foundation recommends users create virtual environments; and Debian/Ubuntu want users to leave the system python untouched to avoid breaking system python.

Perhaps Pandas could add some warnings or error messages on install to steer people to virtualenv. This approach might avoid or at least defer the work of adding pyarrow to APT as well as the risk of users breaking system python. Also, when I'm building projects I might want a much later version of pandas/pyarrow than would ever ship on Debian given the release strategy/timing delay.

On the other hand, arrow backend has significant advantages and with the rise of other important packages in the data space that also use pyarrow (polars, dask, modin), perhaps there is sufficient reason to add pyarrow to APT sources.

A good summary that might be worth checking out is Externally managed environments. The original PEP 668 is found here.
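
As a rough illustration of the virtual-environment route, the standard-library `venv` module can create one from Python itself. The helper name `make_env` and the directory name are placeholders; `with_pip=False` is chosen here only to keep the sketch independent of `ensurepip`, which some distributions package separately.

```python
import tempfile
import venv
from pathlib import Path


def make_env(target: Path) -> Path:
    """Create an isolated environment at `target` and return its config file."""
    # with_pip=False keeps this independent of ensurepip; pass with_pip=True
    # if you want pip bootstrapped into the new environment.
    venv.create(target, with_pip=False)
    return target / "pyvenv.cfg"


# Example run in a throwaway directory so nothing system-wide is touched.
with tempfile.TemporaryDirectory() as tmp:
    cfg = make_env(Path(tmp) / "pandas-env")
    print(cfg.exists())
```

Packages installed into such an environment live under its own directory, so the system python managed by APT is left alone.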

stonebig · 2023-08-30T18:29:28Z

I think it's the right path for performance in WASM.

mlkui · 2023-08-31T10:24:53Z

This is a good idea!
But I think two other important features should also be implemented besides strings:

Zero-copy for multi-index DataFrames. Currently, a multi-index DataFrame cannot be converted from an Arrow table with zero copy (zero_copy_only=True), which is a bigger problem for large DataFrames. You can reset_index() the DataFrame, convert it to an Arrow table, and convert the Arrow table back to a DataFrame with zero copy, but you must then call set_index() on the DataFrame to get the multi-index back, and at that point a copy happens.
Zero-copy for pandas.concat. Arrow table concatenation can be zero-copy, but when concatenating two zero-copy DataFrames (converted from Arrow tables), a copy happens even when pandas CoW is turned on. Also, concatenating two Arrow tables and then converting the result to a DataFrame with zero_copy_only=True is currently not allowed, because the number of chunks is greater than 1.

phofl · 2023-08-31T21:57:42Z

@mlkui

Regarding concat: This should already be zero copy:

df = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")
df2 = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")

x = pd.concat([df, df2])

This creates a new dataframe that has 2 pyarrow chunks.

Can you open a separate issue if this is not what you are looking for?

mlkui · 2023-09-01T03:25:57Z

@phofl
Thanks for your reply. But your example may be too simple. Please see the following code (pandas 2.0.3 with pyarrow 12.0, and pandas 2.1.0 with pyarrow 13.0):

import pandas as pd
import pyarrow as pa

# Note: pa.total_allocated_bytes() reports Arrow's default memory pool, not process RSS.
with pa.memory_map("d:\\1.arrow", 'r') as source1, pa.memory_map("d:\\2.arrow", 'r') as source2, pa.memory_map("d:\\3.arrow", 'r') as source3, pa.memory_map("d:\\4.arrow", 'r') as source4:

    c1 = pa.ipc.RecordBatchFileReader(source1).read_all().column("p")
    c2 = pa.ipc.RecordBatchFileReader(source2).read_all().column("v")
    c3 = pa.ipc.RecordBatchFileReader(source3).read_all().column("p")
    c4 = pa.ipc.RecordBatchFileReader(source4).read_all().column("v")
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

    s1 = c1.to_pandas(zero_copy_only=True)
    s2 = c2.to_pandas(zero_copy_only=True)
    s3 = c3.to_pandas(zero_copy_only=True)
    s4 = c4.to_pandas(zero_copy_only=True)
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

    dfs = {"p": s1, "v": s2}
    df1 = pd.concat(dfs, axis=1, copy=False)                            # zero-copy
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

    dfs2 = {"p": s3, "v": s4}
    df2 = pd.concat(dfs2, axis=1, copy=False)                           # zero-copy
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

    # NOT zero-copy
    result_df = pd.concat([df1, df2], axis=0, copy=False)

with pa.memory_map("z1.arrow", 'r') as source1, pa.memory_map("z2.arrow", 'r') as source2:

    table1 = pa.ipc.RecordBatchFileReader(source1).read_all()
    table2 = pa.ipc.RecordBatchFileReader(source2).read_all()
    combined_table = pa.concat_tables([table1, table2])
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))        # zero-copy

    df1 = table1.to_pandas(zero_copy_only=True)
    df2 = table2.to_pandas(zero_copy_only=True)
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))        # zero-copy

    # Use pandas to concat two zero-copy dataframes
    # But a copy happens
    result_df = pd.concat([df1, df2], axis=0, copy=False)

    # Try to convert the arrow table to pandas directly
    # This raises an exception because the chunk number is 2
    df3 = combined_table.to_pandas(zero_copy_only=True)

    # Combining chunks into one will cause a copy
    combined_table = combined_table.combine_chunks()

0x26res · 2023-09-03T19:06:28Z

Beside the build size, there is a portability issue with pyarrow.

pyarrow does not provide wheels for as many environments as numpy.

For environments where pyarrow does not provide wheels, pyarrow has to be installed from source which is not simple.

flying-sheep · 2023-10-10T07:09:39Z

If this happens, would dtype='string' and dtype='string[pyarrow]' be merged into one implementation?

We’re currently thinking about coercing strings in our library, but hesitating because of the unclear future here.

EwoutH · 2023-10-26T21:50:03Z

pyarrow does not provide wheels for as many environments as numpy.

The fact that they still don’t have Python 3.12 wheels up is worrisome.

h-vetinari · 2023-11-01T09:31:20Z

The fact that they still don’t have Python 3.12 wheels up is worrisome.

Arrow is a beast to build, and even harder to fit into a wheel properly (so you get fewer features, and things like using the slimmed-down libarrow will be harder to pull off).

Conda-forge builds for py312 have been available for a month already though, and are ready in principle to ship pyarrow with a minimal libarrow. That still needs some usability improvements, but it's getting there.

musicinmybrain · 2023-11-03T21:12:58Z

Without weighing in on whether this is a good idea or a bad one, Fedora Linux already has a libarrow package that provides python3-pyarrow, so I think this shouldn’t be a real problem for us from a packaging perspective.

I’m not saying that Pandas is easy to keep packaged, up to date, and coordinated with its dependencies and reverse dependencies! Just that a hard dependency on PyArrow wouldn’t necessarily make the situation worse for us.

ZupoLlask · 2023-11-30T10:01:46Z

@h-vetinari Almost there? :-)

raulcd · 2023-11-30T10:12:59Z

@h-vetinari Almost there? :-)

There is still a lot of work to be done on the wheels side but for conda after the work we did to divide the CPP library, I created this PR which is currently under discussion in order to provide both a pyarrow-base that only depends on libarrow and libparquet and pyarrow which would pull all the Arrow CPP dependencies. Both have been built with support for everything so depending on pyarrow-base and libarrow-dataset would allow the use of pyarrow.dataset, etc.

chris-vecchio · 2023-12-08T17:15:52Z

Thanks for requesting feedback. I'm not well versed in the technicalities, but I strongly prefer not to require pyarrow as a dependency. It's better imo to let users choose to use PyArrow if they desire. I prefer to use the default NumPy object type or pandas' StringDtype without the added complexity of PyArrow.

Soft-Buddy · 2024-02-26T18:20:50Z

I don't consider this a good decision; it will mean a huge increase in installation size :(

miraculixx · 2024-02-28T07:54:25Z

@dwgillies #54466 (comment)

Does anyone have a procedure for installing pyarrow in cygwin? Note: straightforward installation does not work.

That's a great question - many companies rely on Python + Pandas running in cygwin, mingw (through git-bash) and Msys in their Windows work PCs. It is often the best way to have a useful Python dev env in a corporate environment.

Will Pandas+PyArrow be supported in these environments? If not there is a high risk of lots of outdated installations bc these environments are rather sticky once deployed, and there is no easy way to upgrade to Linux or WSL.

Soft-Buddy · 2024-02-28T13:27:51Z

@dwgillies #54466 (comment)

Does anyone have a procedure for installing pyarrow in cygwin? Note: straightforward installation does not work.

That's a great question - many companies rely on Python + Pandas running in cygwin, mingw (through git-bash) and Msys in their Windows work PCs. It is often the best way to have a useful Python dev env in a corporate environment.

Will Pandas+PyArrow be supported in these environments? If not there is a high risk of lots of outdated installations bc these environments are rather sticky once deployed, and there is no easy way to upgrade to Linux or WSL.

Our issues kinda match buddy. I use pandas in my android app which ships a cross compiled copy of python and of pandas compiled using crossenv. PyArrow's installation doesn't work there either... And triggers some weird errors


jkmackie · 2024-03-02T21:46:17Z

My general concern with the mandatory PyArrow dependency is chasing competing standards and dependency issues like bugs.

Kindly recall PDEP 10 lists three key benefits of pyarrow: (1) better pyarrow string memory/speed; (2) nested datatypes; and (3) interoperability.

PDEP Point 1 - Pandas 2.2.0 Performance
To make this less abstract, below are Pandas performance stats based on the 1brc challenge of aggregating 1 billion rows of city temperatures as fast as possible.

1brc INPUT - 1 billion rows
2 columns: city and temperature

OUTPUT - Temp mean/min/max by city

Metrics

Memory
These metrics use the default DataFrame format: city is 'object' and temperature is 'float64'.

Turns out the city column 'object' format hogs 🐷 90% of the 'deep' memory usage ⵜ. This is indeed an issue! The last 10% of memory is temperatures. Downcasting to 'float32' halves memory for the temperature column.

ⵜ Memory footnote: There's a mismatch between dataframe 'deep' memory usage (69GB) and the PC RAM increase I saw in Task Manager (about 23-24 GB) during pd.read_parquet(). My system memory is 64GB. Hard to believe 2GB memory compression accounts for the discrepancy.

Speed
Reading from parquet is 2.5 times as fast as reading from csv and takes one-fifth the space (snappy compression). Mean/min/max aggregation time was reasonable at under one minute.

PDEP Point 2 - Nesting
The PDEP 10 nested datatype example saves [{'a': 1, 'b': 2}, {'a': 2, 'b': 99}] to a Series rather than a DataFrame. The pyarrow benefit is saving an unknown nested structure as speed/memory efficient strings.

The existing alternative is to use pd.json_normalize() or pd.DataFrame() to load the example into a DataFrame with a column for each key. Foreknowledge of the format is required. Then downcast numeric columns with pd.to_numeric(df[mycol], downcast=<'integer', 'signed', 'unsigned', or 'float'>).
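
The alternative described here might look as follows. This is a sketch on the PDEP's two-record example only; it assumes the keys are known in advance to be flat and integer-valued.

```python
import pandas as pd

# The nested example from PDEP 10, as quoted above.
records = [{"a": 1, "b": 2}, {"a": 2, "b": 99}]

# Flatten into one column per key.
df = pd.json_normalize(records)

# Downcast each numeric column to the smallest integer dtype that fits.
for col in df.columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")

print(df.dtypes)
```

For deeper nesting, `pd.json_normalize` flattens inner keys into dotted column names, which is where the foreknowledge of the format comes in.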

PDEP Point 3 - Interoperability
What about potential PyArrow C++ binding issues? Is this straightforward to debug and fix?

TAKEAWAYS
Pandas stock performance is good. 😎 With foreknowledge of the nested format, data can be flattened into a DataFrame (with a column for each key). Numbers are downcast one column at a time.

The standout issue to me is the dtype: object. Why not build a solution in Pandas or NumPy?

hagenw · 2024-03-04T07:50:03Z

BTW, reading in a CSV file or parquet file is still faster by a factor of 5 for me when I do the reading with pyarrow and then convert to a pandas.DataFrame (but yes, using pyarrow as datatype for string is then faster than using object), compared to reading directly with pandas using pyarrow as engine.

jkmackie · 2024-03-04T16:26:53Z

BTW, reading in a CSV file or parquet file is still faster by a factor of 5 for me when I do the reading with pyarrow and then convert to a pandas.DataFrame (but yes, using pyarrow as datatype for string is then faster than using object), compared to reading directly with pandas using pyarrow as engine.

@hagenw Would you kindly explain the below result? Looks like parquet uses a lot more peak RAM.

Windows users: In general, what is the large discrepancy between DataFrame memory shown by df.info(memory_usage = 'deep') versus Windows Task Manager (below is a Task Manager memory metric pic)? What is the right 'real-world' memory metric?

hagenw · 2024-03-04T18:33:39Z

I measured peak memory consumption with memray, but I'm not completely sure if I did it correctly.
I have some updated results in the dev branch (https://github.com/audeering/audb/tree/dev/benchmarks), there we see the following

So it seems to be more equal. The code I used to measure memory consumption is available at https://github.com/audeering/audb/blob/44de33f0fea1f4d003882d674dc696a8f0cfe95d/benchmarks/benchmark-dependencies-save-and-load.py. That uses memray and writes the results to binary files that you need to inspect afterwards to extract the result.


ebuchlin · 2024-03-11T20:35:37Z

There have been / are some efforts to reduce the size of pandas (#30741); these efforts should not be wasted by a dependency which could perhaps remain optional (although I have no idea whether this is feasible). +120MB multiplied by the number of installs/environments/images/CI runs is not so small. It takes more time to download and install, more network usage, more storage... It's neither green, nor inclusive for situations/people/institutes/countries where resources are not as easily available as where these decisions are taken.

WillAyd · 2024-05-09T12:49:57Z

@susmitpy @dwgillies @admajaus pinging as the people that I think mentioned lambda in this thread.

AWS already has a tool called "AWS SDK for pandas" which itself requires pyarrow. There might be confusion on how AWS counts size limits (see aws/aws-sdk-pandas#2761) but looks like it is definitely possible to run pandas + pyarrow in lambda.

Does this cover the concern for that platform?

susmitpy · 2024-05-09T13:25:40Z

@WillAyd

More often than not we need more than one library in an aws lambda function. There is a hard set limit of 250 MB. With pandas increasing from 70 MB to 190 MB (according to one of the posts above) that leaves only 60 MB for other libraries.
Pandas, being so helpful, powerful and convenient, is always the go-to choice for dealing with data; however, it becoming the reason that "along with pandas you cannot use more than 1-2 libraries" will be a big issue.

cc: @dwgillies @admajaus

WillAyd · 2024-05-09T13:36:43Z

Have you tried the layer in the link above? It is not going to be a 120 MB increase because AWS is not building a pyarrow wheel with all of the same options - looks like they remove Gandiva and Flight support

susmitpy · 2024-05-09T13:52:32Z

@WillAyd
Just tried it.

179 MB is the layer's size.

WillAyd · 2024-05-09T13:57:48Z

Very helpful thanks. And the size of your current pandas + numpy + botocore + fastparquet images are significantly smaller than that?

susmitpy · 2024-05-09T14:03:58Z

I don't think that's a proper comparison, as AWS Data Wrangler also supports reading parquet files, for which I currently resort to fastparquet because of its smaller size.


susmitpy · 2024-05-09T14:08:21Z

Also, to read files from S3 directly, without first downloading them and then loading them, s3fs is required, which I guess won't be needed when using the AWS SDK (not sure though).

WillAyd · 2024-05-09T14:38:12Z

Yeah, ultimately what I'm trying to gauge is how big of a difference it is. I don't have access to any Lambda environments, but locally, if I install your stack of pandas + numpy + fastparquet + botocore, I get the following installation sizes in my site-packages folder:

75M	pandas
39M	numpy
37M	numpy.libs
25M	botocore
16M	pip
7.9M	fastparquet

Adding up to almost 200 MB just from those packages alone.
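
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch. It assumes the package directories sit directly under a single site-packages folder; `dir_size_bytes` and `package_sizes` are hypothetical helper names, not part of any library. It sums apparent file sizes, so the results may differ somewhat from `du`, which reports disk blocks. The last lines simply add up the figures quoted above.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def package_sizes(site_packages, names):
    """Map each existing package directory to its size in MB (1 MB = 1024**2 bytes)."""
    return {
        name: dir_size_bytes(os.path.join(site_packages, name)) / 1024**2
        for name in names
        if os.path.isdir(os.path.join(site_packages, name))
    }


# Sizes quoted in the listing above, in MB.
quoted = {
    "pandas": 75,
    "numpy": 39,
    "numpy.libs": 37,
    "botocore": 25,
    "pip": 16,
    "fastparquet": 7.9,
}
total_quoted = sum(quoted.values())
print(f"quoted total: {total_quoted:.1f} MB")
```

To run it against a real environment, pass the site-packages path (for example the first entry of `site.getsitepackages()`) and the list of package directory names to `package_sizes`.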

If AWS is already distributing an image with pyarrow that is smaller than this, then I'm unsure about the apprehension about this proposal on account of Lambda environments. Is there a significant use case where users cannot use the already distributed AWS environment that includes pandas + pyarrow? And if so, why should that be something that holds pandas developers back from requiring pyarrow?

h-vetinari · 2024-05-09T20:40:45Z

As of a few hours ago, there's a pyarrow-core on conda-forge (only for the latest v16), which should substantially cut down on the footprint.

The split of the cloud provider bindings out of core hasn't happened yet, but will further reduce the footprint once it happens.

MarcoGorelli · 2024-05-13T16:53:06Z

I honestly don't understand how mandating a 170% increase in the effective size of a pandas installation (70MB to 190MB, from the numbers in the quoted text) can be considered okay.

I think the PDEP text wasn't precise here - pandas and numpy each require about 70MB (in fact, a bit more now, I just checked). So the percentage of the increase is more like 82% - not 170%. Still quite a lot, I don't mean to minimise it, but a lot less than has been stated here.
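
The two readings of the PDEP numbers can be compared with a quick calculation. This is only a sketch: `percent_increase` is a hypothetical helper, the 120 MB figure is the one quoted from the PDEP, and the "implied base" is an inference from the 82% figure, not a measured size.

```python
def percent_increase(base_mb, added_mb):
    """Size increase as a percentage of the starting size."""
    return 100 * added_mb / base_mb


ADDED_PYARROW_MB = 120  # additional size quoted in the PDEP text

# Reading 1: pandas and numpy together take 70 MB.
combined_70 = percent_increase(70, ADDED_PYARROW_MB)

# Reading 2: pandas and numpy take about 70 MB each, so 140 MB together.
each_70 = percent_increase(140, ADDED_PYARROW_MB)

# Combined pandas + numpy size implied by an 82% increase (inferred, not measured).
implied_base_mb = ADDED_PYARROW_MB / 0.82

print(f"{combined_70:.0f}% {each_70:.0f}% {implied_base_mb:.0f} MB")
# prints "171% 86% 146 MB"
```

With exactly 70 MB each the increase comes out near 86%; a figure of 82% corresponds to a combined base of roughly 146 MB, which would fit pandas and numpy each being a bit over 70 MB.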

It's good to see that on the conda-forge side, things have become smaller. For the PyPI package, however, my understanding is that this is unlikely to happen any time soon

Have you tried the layer in the link above

I just tried this, and indeed, it works - pandas 2.2.2 and pyarrow 14.0.1 are included. I don't think it's as flexible as being able to install whichever versions you want, but it does seem like there is a workable way to use pandas in Lambda

lithomas1 pinned this issue Aug 9, 2023

lithomas1 added the Community and Arrow labels Aug 9, 2023

jjerphan mentioned this issue Aug 18, 2023: Make pyarrow a required dependency #52509 (Closed)

lukemanley unpinned this issue Sep 6, 2023

lukemanley pinned this issue Sep 13, 2023

ivirshup mentioned this issue Oct 9, 2023: (Semi-)automatic conversion of nullable columns to the appropriate pandas arrays scverse/anndata#1068 (Open)

Josue-B-Navarrete mentioned this issue Mar 5, 2024: Complete error rhysnewell/aviary#195 (Closed)

ShawnGeorge03 added a commit to theDS3/Datathon-Leaderboard that referenced this issue Mar 7, 2024: "chore: ➕ installed pyarrow as pandas depenceny" (7dab886). Commit message body: `pyarrow` will be a future dependency for pandas: pandas-dev/pandas#54466

agriyakhetarpal mentioned this issue Mar 18, 2024: ENH: out-of-tree Pyodide builds in CI for pandas #57891 (Closed)

kaiohp mentioned this issue Mar 18, 2024: Fix pyarrow warning lvgalvao/DataProjectStarterKit#6 (Open)

agriyakhetarpal mentioned this issue Mar 20, 2024: Investigate alternatives to xarray to handle ProcessedVariable computations pybamm-team/PyBaMM#3913 (Open)

mroeschke mentioned this issue Apr 12, 2024: RLS: 3.0 #57064 (Open)

jorisvandenbossche mentioned this issue May 7, 2024: ENH: Add export to GeoArrow geopandas/geopandas#3219 (Draft)

davetapley mentioned this issue May 7, 2024: pyarrow pyinstaller/pyinstaller-hooks-contrib#739 (Closed)

hagenw mentioned this issue May 10, 2024: Depend on a smaller pyarrow package audeering/audb#400 (Open)
