FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466
Comments
Something that hasn't received enough attention/discussion, at least in my mind, is this piece of the Drawbacks section of the PDEP (bolding added by me):
I honestly don't understand how mandating a 170% increase in the effective size of a pandas installation (70MB to 190MB, from the numbers in the quoted text) can be considered okay. For that kind of increase, I would expect/want the tradeoff to be major improvements across the board. Instead, this change comes with limited benefit but massive bloat for anyone who doesn't need the features PyArrow enables, e.g. for those who don't have issues with the current functionality of pandas. |
Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.) I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy). |
Yeah unfortunately this is where the subjective tradeoff comes into effect. pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively. The hope with pyarrow is that the tradeoff improves the current functionality for common "object" types in pandas such as text, binary, decimal, and nested data.
AFAIK most pydata projects don't actually publish/manage Linux system packages for their respective libraries. Do you know how these are packaged today? |
The pytz and dateutil wheels are only ~500kb. Drawing a comparison between them and PyArrow seems like a stretch, to put it lightly. |
By whoever offers to do it, currently me for pandas. Of the pydata projects, Debian currently has pydata-sphinx-theme, sparse, patsy, xarray and numexpr. An old discussion thread (anyone can post there, but be warned that doing so will expose your non-spam-protected email address) suggests that there is existing work on a pyarrow Debian package, but I don't yet know whether it ever got far enough to work. |
Hi, Thanks for welcoming feedback from the community. While I respect your decision, I am afraid that making
Packages size
Have you considered those two observations as drawbacks before taking the decision? |
This is discussed a bit in https://github.com/pandas-dev/pandas/pull/52711/files#diff-3fc3ce7b7d119c90be473d5d03d08d221571c67b4f3a9473c2363342328535b2R179-R193. While currently the build size for pyarrow is pretty large, it doesn't "have" to be that big. I think by pandas 3.0 (cc @jorisvandenbossche for more info on this) I'm not an Arrow dev myself, but if it is something that just needs someone to look at, I'm happy to put some time in to help give Arrow a nudge in the right direction. Finally, for clarity purposes, is the reason for concern also AWS lambda/pyodide/Alpine, or something else? (IMO, outside of stuff like lambda funcs, pyarrow isn't too egregious in terms of package size compared to pytorch/tensorflow, but it's definitely something that can be improved) |
If Edit: See conda-forge/arrow-cpp-feedstock#1035 |
Hi, Thanks for welcoming feedback from the community. With |
There is another way: use virtual environments in user space instead of the system Python. The Python Software Foundation recommends that users create virtual environments, and Debian/Ubuntu want users to leave the system Python untouched to avoid breaking it. Perhaps pandas could add some warnings or error messages on install to steer people to virtualenv. This approach might avoid, or at least defer, the work of adding pyarrow to APT, as well as the risk of users breaking the system Python. Also, when I'm building projects I might want a much later version of pandas/pyarrow than would ever ship on Debian, given the release strategy/timing delay. On the other hand, the Arrow backend has significant advantages, and with the rise of other important packages in the data space that also use pyarrow (polars, dask, modin), perhaps there is sufficient reason to add pyarrow to APT sources. A good summary that might be worth checking out is Externally managed environments. The original PEP 668 is found here. |
I think it's the right path for performance in WASM. |
This is a good idea!
|
Regarding concat: This should already be zero copy:
This creates a new dataframe that has 2 pyarrow chunks. Can you open a separate issue if this is not what you are looking for? |
@phofl
|
If this happens, would We’re currently thinking about coercing strings in our library, but hesitating because of the unclear future here. |
Arrow is a beast to build, and even harder to fit into a wheel properly (so you get fewer features, and things like using the slimmed-down libarrow will be harder to pull off). Conda-forge builds for py312 have been available for a month already though, and are ready in principle to ship pyarrow with a minimal libarrow. That still needs some usability improvements, but it's getting there. |
Without weighing in on whether this is a good idea or a bad one, Fedora Linux already has a I’m not saying that Pandas is easy to keep packaged, up to date, and coordinated with its dependencies and reverse dependencies! Just that a hard dependency on PyArrow wouldn’t necessarily make the situation worse for us. |
@h-vetinari Almost there? :-) |
There is still a lot of work to be done on the wheels side but for conda after the work we did to divide the CPP library, I created this PR which is currently under discussion in order to provide both a |
Thanks for requesting feedback. I'm not well versed on the technicalities, but I strongly prefer to not require pyarrow as a dependency. It's better imo to let users choose to use PyArrow if they desire. I prefer to use the default NumPy object type or pandas' StringDType without the added complexity of PyArrow. |
I don't consider this a good decision, a huge increment in the installation size will be there :( |
That's a great question - many companies rely on Python + Pandas running in cygwin, mingw (through git-bash) and Msys in their Windows work PCs. It is often the best way to have a useful Python dev env in a corporate environment. Will Pandas+PyArrow be supported in these environments? If not there is a high risk of lots of outdated installations bc these environments are rather sticky once deployed, and there is no easy way to upgrade to Linux or WSL. |
Our issues kinda match buddy. I use pandas in my android app which ships a cross compiled copy of python and of pandas compiled using crossenv. PyArrow's installation doesn't work there either... And triggers some weird errors |
**warning:** Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 ```py import pandas as pd ``` Signed-off-by: Avelino <31996+avelino@users.noreply.github.com>
My general concern with the mandatory PyArrow dependency is chasing competing standards and dependency issues like bugs. Kindly recall PDEP 10 lists three key benefits of pyarrow: (1) better pyarrow string memory/speed; (2) nested datatypes; and (3) interoperability.
PDEP Point 1 - Pandas 2.2.0 Performance
1brc INPUT - 1 billion rows OUTPUT - Temp mean/min/max by city
Memory
Turns out the city column 'object' format hogs 🐷 90% of the 'deep' memory usage ⵜ. This is indeed an issue! The last 10% of memory is temperatures. Downcasting to 'float32' halves memory for the temperature column.
ⵜ Memory Footnote:
Speed
PDEP Point 2 - Nesting
The existing alternative is use
PDEP Point 3 - Interoperability
TAKEAWAYS
The standout issue to me is the |
BTW, reading in a CSV file or parquet file is still faster by a factor of 5 for me when I do the reading with |
@hagenw Would you kindly explain the below result? Looks like parquet uses a lot more peak RAM. Windows users: In general, what is the large discrepancy between DataFrame memory shown by |
I measured peak memory consumption with So it seems to be more equal. The code I used to measure memory consumption is available at https://github.com/audeering/audb/blob/44de33f0fea1f4d003882d674dc696a8f0cfe95d/benchmarks/benchmark-dependencies-save-and-load.py. That uses |
There have been / are some efforts to reduce the size of pandas (#30741), these efforts should not be wasted by a dependency which could perhaps remain optional (although I have no idea whether this is feasible). +120MB multiplied by the number of installs/environments/images/CI runs is not so small. It takes more time to download and install, more network usage, more storage... It's neither green, nor inclusive for situations/people/institutes/countries where resources are not as easily available as where these decisions are taken. |
@susmitpy @dwgillies @admajaus pinging as the people that I think mentioned lambda in this thread. AWS already has a tool called "AWS SDK for pandas" which itself requires pyarrow. There might be confusion on how AWS counts size limits (see aws/aws-sdk-pandas#2761) but looks like it is definitely possible to run pandas + pyarrow in lambda. Does this cover the concern for that platform? |
More often than not we need more than one library in an aws lambda function. There is a hard set limit of 250 MB. With pandas increasing from 70 MB to 190 MB (according to one of the posts above) that leaves only 60 MB for other libraries. cc: @dwgillies @admajaus |
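The headroom arithmetic in the comment above can be written out as a small sketch. The 250 MB limit and the 70 MB / 190 MB figures are the ones quoted in this thread, not independently measured; the function name `remaining_budget_mb` is made up for illustration.

```python
def remaining_budget_mb(limit_mb: float, used_mb: float) -> float:
    """Return the space left under a size limit (negative means over budget)."""
    return limit_mb - used_mb


# Figures as quoted in this thread (assumptions, not measurements).
LAMBDA_LIMIT_MB = 250
PANDAS_BEFORE_MB = 70
PANDAS_WITH_PYARROW_MB = 190

print(remaining_budget_mb(LAMBDA_LIMIT_MB, PANDAS_BEFORE_MB))        # → 180
print(remaining_budget_mb(LAMBDA_LIMIT_MB, PANDAS_WITH_PYARROW_MB))  # → 60
```

Under those quoted numbers, the room left for other libraries drops from 180 MB to 60 MB.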
Have you tried the layer in the link above? It is not going to be a 120 MB increase because AWS is not building a pyarrow wheel with all of the same options - looks like they remove Gandiva and Flight support |
@WillAyd 179 MB is the layer's size. |
Very helpful, thanks. And are your current pandas + numpy + botocore + fastparquet images significantly smaller than that? |
I don't think that's a proper comparison, as AWS Data Wrangler also has support for reading parquet files, for which I currently resort to fastparquet because of its smaller size.
|
Also to fetch files from S3 while avoiding downloading file and then loading, s3fs is required which I guess won't be required when using AWS sdk (not sure though). |
Yeah, ultimately what I'm trying to gauge is how big of a difference it is. I don't have access to any lambda environments, but locally, if I install your stack of pandas + numpy + fastparquet + botocore, I get the following installation sizes in my site-packages folder:
75M pandas
39M numpy
37M numpy.libs
25M botocore
16M pip
7.9M fastparquet
Adding up to almost 200 MB just from those packages alone. If AWS is already distributing an image with pyarrow that is smaller than this, then I'm unsure about the apprehension to this proposal on account of lambda environments. Is there a significant use case why users cannot use the already distributed AWS environment that includes pandas + pyarrow, and if so, why should that be something that holds pandas developers back from requiring pyarrow? |
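For anyone who wants to re-check the "almost 200 MB" figure, here is a minimal sketch that sums the sizes listed in the comment above. The numbers are copied from that listing; the dictionary name `site_packages_mb` is only illustrative.

```python
# Sizes in MB, copied from the site-packages listing in the comment above.
site_packages_mb = {
    "pandas": 75,
    "numpy": 39,
    "numpy.libs": 37,
    "botocore": 25,
    "pip": 16,
    "fastparquet": 7.9,
}

total_mb = sum(site_packages_mb.values())

# Print the packages from largest to smallest, then the total.
for name, size in sorted(site_packages_mb.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size:>6.1f} MB  {name}")
print(f"{total_mb:>6.1f} MB  total")
```

The total comes to 199.9 MB, which matches the "almost 200 MB" stated above.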
As of a few hours ago, there's a The split of the cloud provider bindings out of core hasn't happened yet, but will further reduce the footprint once it happens. |
I think the PDEP text wasn't precise here: pandas and numpy each require about 70MB (in fact, a bit more now, I just checked). So the percentage of the increase is more like 82%, not 170%. Still quite a lot, I don't mean to minimise it, but a lot less than has been stated here. It's good to see that on the conda-forge side, things have become smaller. For the PyPI package, however, my understanding is that this is unlikely to happen any time soon
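To make the two percentages easier to compare, here is a small sketch of the calculation. The 70 MB and 190 MB figures come from the PDEP text quoted earlier in the thread. The 146 MB baseline is an assumption chosen only to illustrate "a bit more" than 70 MB each for pandas and numpy; it is not a number given by the commenter.

```python
def percent_increase(baseline_mb: float, added_mb: float) -> float:
    """Percentage growth when added_mb is installed on top of baseline_mb."""
    return 100.0 * added_mb / baseline_mb


ADDED_MB = 190 - 70  # the 120 MB increase implied by the PDEP numbers

# Baseline = pandas alone, as the PDEP text reads.
print(round(percent_increase(70, ADDED_MB)))   # → 171

# Baseline = pandas + numpy at about 70 MB each.
print(round(percent_increase(140, ADDED_MB)))  # → 86

# Baseline = pandas + numpy at slightly more than 70 MB each (assumed 146 MB).
print(round(percent_increase(146, ADDED_MB)))  # → 82
```

The difference between the "170%" and "82%" claims is therefore almost entirely a question of whether numpy is counted in the baseline.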
I just tried this, and indeed, it works - pandas 2.2.2 and pyarrow 14.0.1 are included. I don't think it's as flexible as being able to install whichever versions you want, but it does seem like there is a workable way to use pandas in Lambda |
This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.
The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
If you would like to filter this warning without installing pyarrow at this time, please view this comment: #54466 (comment)
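As a rough sketch of what filtering the warning can look like with the standard `warnings` module: the filter has to be installed before pandas is imported. This assumes the warning is raised as a `DeprecationWarning` and that its text contains the phrase quoted in this thread; the helper name `silence_pyarrow_warning` is hypothetical, and the linked comment remains the authoritative suggestion.

```python
import warnings

# Assumption: the message contains this phrase (taken from the warning text
# quoted in this thread). "(?s).*" lets the pattern match even if the message
# starts with a newline, because filterwarnings matches from the message start.
PYARROW_WARNING_PATTERN = r"(?s).*Pyarrow will become a required dependency of pandas"


def silence_pyarrow_warning() -> None:
    """Install an 'ignore' filter for the pyarrow dependency warning."""
    warnings.filterwarnings(
        "ignore",
        message=PYARROW_WARNING_PATTERN,
        category=DeprecationWarning,  # assumed category
    )


silence_pyarrow_warning()
# import pandas as pd  # import pandas only after the filter is in place
```

Matching on the message text, rather than ignoring every `DeprecationWarning`, keeps other deprecation notices visible.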