BIEN_trait_mean performance #16

Open
achmurzy opened this issue May 21, 2020 · 1 comment

Comments

Contributor

achmurzy commented May 21, 2020

I'm trying to pull as many trait means as possible for the following list of species:
names.txt

Calling BIEN_trait_mean(vector_of_species_names, vector_of_traits) with vectors usually crashes my R console. I'm not sure if the problem is on the backend, but returning the list of trait IDs by default could be part of the issue. Maybe we could add a flag that makes the list of trait IDs optional? It greatly increases the size of the data frame that gets returned.

So what I'm doing now is querying means one by one:

    library(BIEN)
    traits <- NULL
    for (species in species_list) {
      for (trait in trait_list) {
        new_trait <- BIEN_trait_mean(species, trait)  # one species, one trait
        traits <- rbind(traits, new_trait)            # append to the results
      }
    }

This isn't the 'R' way of doing it, but it works quickly, whereas vectorizing a list of 20 species crashes my console.
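
For anyone who wants the one-at-a-time approach without growing a data frame inside the loop, here is a minimal sketch. `collect_trait_means` and its `query_fn` argument are hypothetical names, not part of BIEN; in real use `query_fn` would be `BIEN::BIEN_trait_mean`. The sketch assumes each call returns a data frame with the same columns, and it silently skips any call that raises an error, which may or may not be what you want.

```r
# Hypothetical helper, not part of the BIEN package.
# query_fn is any function(species, trait) that returns a data frame.
collect_trait_means <- function(species_list, trait_list, query_fn) {
  # Build every species/trait combination up front
  combos <- expand.grid(species = species_list, trait = trait_list,
                        stringsAsFactors = FALSE)
  # Query one pair at a time; a failed query becomes NULL instead of aborting
  results <- Map(function(sp, tr) {
    tryCatch(query_fn(sp, tr), error = function(e) NULL)
  }, combos$species, combos$trait)
  # Drop the failures and bind all rows in a single step
  results <- Filter(Negate(is.null), unname(results))
  do.call(rbind, results)
}
```

Usage would look like `collect_trait_means(species_list, trait_list, BIEN::BIEN_trait_mean)`. Binding once at the end avoids copying the accumulated data frame on every iteration.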

achmurzy changed the title from "BIEN_trait_mean performance and accuracy" to "BIEN_trait_mean performance" on May 21, 2020
Contributor Author

achmurzy commented May 23, 2020

Okay, after playing with this further, I was able to determine the following:
- The trait IDs aren't the problem, at least I don't think so.
- Rather, I didn't realize that BIEN_trait_mean is only intended to return one trait at a time. To pull everything, I had been passing in a vector of traits like so:
    trait_list <- BIEN_trait_list()
    BIEN_trait_mean(species, trait_list)
  This returns the warning:
    In if (!trait %in% traits_available$trait_name) { :
      the condition has length > 1 and only the first element will be used
  The returned traits then all have the same value:
    1 Pentaclethra macrophylla 15.7878787878788 flower color
    2 Pentaclethra macrophylla 15.7878787878788 flower pollination syndrome cm
    3 Pentaclethra macrophylla 15.7878787878788 fruit type
    4 Pentaclethra macrophylla 15.7878787878788 inflorescence length cm
      level_used sample_size
    1     Family         533
    2     Family         533
    3     Family         533
    4     Family         533

I think it will be common for people to want to pull every trait and to call the function as I did above. Right now you have to write a for-loop to do it one at a time (which works great and is pretty fast). However, it might be better to prevent putting multiple traits into BIEN_trait_mean, or make sure it supports vectorized trait lists.
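
To illustrate the first option (rejecting multiple traits up front), here is a rough sketch of a guard. `trait_mean_checked` and `query_fn` are hypothetical names and this is not how BIEN is implemented; it only shows the kind of check that would turn the confusing warning into a clear error. Note that since R 4.2.0 an `if` condition of length greater than one is itself an error rather than a warning, so the original call would fail outright on newer R versions, though with a less helpful message.

```r
# Hypothetical wrapper, not part of the BIEN package.
# query_fn is any function(species, trait), e.g. BIEN::BIEN_trait_mean.
trait_mean_checked <- function(species, trait, query_fn) {
  # Refuse anything that is not exactly one trait name
  if (!is.character(trait) || length(trait) != 1L) {
    stop("`trait` must be a single character string; ",
         "loop over traits to query more than one.", call. = FALSE)
  }
  query_fn(species, trait)
}
```

With this check, passing a vector (or the data frame returned by BIEN_trait_list()) stops immediately instead of returning rows that all share one value.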

- Finally, querying DBH also tends to be extremely slow, as you suggested, and I think you're right that this is what crashes the console. In particular, calculating mean DBH at the Family level could be drawing many thousands of records without being very informative. The trait 'whole plant height' seems to behave the same way. The R process gets 'Killed', probably because the SQL query returns far too many records. Maybe DBH data should only be available through the stem.R module? These traits take more than 15 minutes to query and then eventually crash the console, so higher-density measurements may need some special treatment. The other traits return values in less than 30 seconds.
