Error raised with pandas data frame #34

Open
Federico2111 opened this issue Feb 14, 2022 · 2 comments

@Federico2111

Hello,

When the input data is a pandas data frame, an error is raised:

      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scikits/bootstrap/bootstrap.py", line 179, in ci
        lengths = [x.shape[0] for x in tdata]
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scikits/bootstrap/bootstrap.py", line 179, in <listcomp>
        lengths = [x.shape[0] for x in tdata]
    IndexError: tuple index out of range

A comment in the source code explains why:

    # Ensure that the data is actually an array. This isn't nice to pandas,
    # but pandas seems much much slower and the indices become a problem.
    if multi and isinstance(data, Iterable):
        tdata: "Tuple[NDArrayAny, ...]" = tuple(np.array(x) for x in data)
        lengths = [x.shape[0] for x in tdata]

Any suggestion?

@cgevans
Owner

cgevans commented Mar 7, 2022

This doesn't appear to be caused simply by passing a dataframe. For example, the following works:

    import pandas as pd
    import numpy as np
    import scikits.bootstrap as boot
    boot.ci(pd.DataFrame(np.random.randn(100)))
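One plausible reading of the difference, offered as a sketch rather than a confirmed diagnosis: the call above leaves `multi` unset, so the quoted `if multi and ...` branch is presumably never reached. If the failing call passed a DataFrame with `multi` set (the issue does not show that call, so this is an assumption), the quoted generator would iterate over the DataFrame's column labels rather than its columns, and each label becomes a zero-dimensional array. The snippet reproduces only that one expression with NumPy and pandas; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data; the column names are illustrative only.
df = pd.DataFrame({"subject": [1, 2, 3], "score": [0.4, 1.2, 0.9]})

# Iterating a DataFrame yields its column labels, not the column data.
labels = list(df)

# Same expression as the quoted line: tuple(np.array(x) for x in data)
tdata = tuple(np.array(x) for x in df)
shapes = [x.shape for x in tdata]  # each label is a 0-d array, shape ()

try:
    lengths = [x.shape[0] for x in tdata]
    error = None
except IndexError as exc:
    # Indexing the empty shape tuple fails, as in the reported traceback.
    error = str(exc)

# Passing the columns explicitly gives 1-D arrays with a usable shape[0].
columns = tuple(df[name].to_numpy() for name in df.columns)
col_lengths = [x.shape[0] for x in columns]
```

If this reading is right, handing the bootstrap a tuple of column arrays instead of the DataFrame itself would avoid the error.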

@Federico2111
Author

I am trying to bootstrap the eta squared effect size, computed with these pingouin ANOVA functions:
https://pingouin-stats.org/generated/pingouin.anova.html
https://pingouin-stats.org/generated/pingouin.rm_anova.html
Both functions require a pandas data frame as input, structured as described in their documentation. The pingouin example data sets mentioned there show exactly how the data frames have to be laid out.

If I pass these pandas data frames to your bootstrap as the data, with one of these ANOVA functions as the statistic function, the error I shared is raised.

I solved the problem as follows. I wrote a function that takes the raw data sets as input rather than a pandas data frame. Inside the function, the inputs are assembled into a pandas data frame, which is passed to the ANOVA function to calculate eta squared, and my function returns that value.
I then ran your bootstrap on the raw data sets with my function as the statistic function. This avoids passing a pandas data frame to your bootstrap, which is what raises the error. The approach works correctly and I get the bootstrap confidence interval around eta squared.

The important point is that, when using your bootstrap, "multi" needs to be set to "paired". With multi="paired", the input arrays are resampled together, so the values at a given index in each array stay linked. This is necessary to rebuild a correct pandas data frame inside my function, since the ANOVA function needs the arrays to be related index to index (participant number (subject), measured value (dependent variable), between/within factor). With multi="independent" this link is lost: the arrays are resampled separately and have unequal lengths, so a correct data frame cannot be rebuilt, and the unequal-length arrays passed to the ANOVA function also raise an error.
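A minimal sketch of the pattern described above, under stated assumptions. The statistic function takes raw, index-aligned arrays and rebuilds the data frame internally. To keep the example self-contained, eta squared is computed by hand for a one-way between-subjects design instead of calling pingouin, and a hand-rolled paired percentile bootstrap stands in for scikits.bootstrap; `eta_squared`, `paired_resample`, and the simulated data are all hypothetical names and values, not part of either library.

```python
import numpy as np
import pandas as pd


def eta_squared(values, groups):
    """One-way eta squared (SS_between / SS_total) from raw, aligned arrays.

    Hypothetical stand-in for reading the effect size off an ANOVA table.
    """
    # Rebuild the long-format data frame inside the statistic function.
    df = pd.DataFrame({"dv": np.asarray(values, dtype=float),
                       "group": np.asarray(groups)})
    grand_mean = df["dv"].mean()
    ss_total = ((df["dv"] - grand_mean) ** 2).sum()
    by_group = df.groupby("group")["dv"].agg(["mean", "size"])
    ss_between = (by_group["size"] * (by_group["mean"] - grand_mean) ** 2).sum()
    return ss_between / ss_total


def paired_resample(arrays, rng):
    """Resample every array with one shared index vector, keeping rows aligned."""
    n = len(arrays[0])
    idx = rng.integers(0, n, size=n)
    return tuple(np.asarray(a)[idx] for a in arrays)


# Simulated data: 3 groups of 20 subjects with shifted means (illustrative only).
rng = np.random.default_rng(0)
groups = np.repeat(["a", "b", "c"], 20)
values = rng.normal(size=60) + np.repeat([0.0, 0.5, 1.0], 20)

observed = eta_squared(values, groups)

# Plain percentile interval; not the method scikits.bootstrap uses by default.
replicates = np.array([eta_squared(*paired_resample((values, groups), rng))
                       for _ in range(500)])
low, high = np.percentile(replicates, [2.5, 97.5])
```

With scikits.bootstrap itself, the corresponding call would presumably pass the tuple of raw arrays as the data, `eta_squared` as the statistic function, and multi="paired", as described above. The shared index vector in `paired_resample` is what keeps each subject's value attached to its group label; resampling each array separately would break that link.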
