data columns not returned as numeric #104

Open
dfolch opened this issue Apr 8, 2020 · 13 comments · May be fixed by #115

Comments

@dfolch
Contributor

dfolch commented Apr 8, 2020

Previous versions of cenpy returned data columns as numeric values. Running that older code today returns columns with object dtype. This is a feature request to go back to the previous behavior.

In [6]: api_conn = cen.remote.APIConnection('ACSDT5Y2018')

In [7]: data = api_conn.query(['B01003_001E'], geo_unit='tract', geo_filter={'state':'04', 'county':'005'})

In [8]: data.B01003_001E.dtype
Out[8]: dtype('O')

@ljwolf
Member

ljwolf commented Apr 8, 2020

Yeah... IIRC this was because of a change in pandas. They removed DataFrame.convert_objects(), and DataFrame.infer_objects() behaves differently. Happy to use something like the _coerce function over in the products API, or to revisit the infer_objects approach.

Should be a very simple change!

@rluedde
Contributor

rluedde commented Jul 11, 2020

>>> api_conn = cenpy.remote.APIConnection('ACSDT5Y2018')
>>> data = api_conn.query(['B01003_001E'], geo_unit='tract', geo_filter={'state':'04', 'county':'005'})
>>> data.B01003_001E.infer_objects().dtype
dtype('O')
>>> data.B01003_001E.convert_dtypes().dtype
StringDtype
>>> data.B01003_001E.astype(int).dtype
dtype('int64')

Why does neither of these functions (infer_objects() or convert_dtypes()) return a Series of a numeric type, while astype() does?

@ljwolf, it's looking to me like _coerce is the way to go.

Also, @dfolch, how do you get colored syntax highlighting in your markdown? Is it because you copied from a notebook?
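
A minimal sketch of the behavior asked about above, assuming the column holds Python strings (the sample values are made up). `infer_objects()` is a soft conversion: it re-types object columns whose values are already numbers but does not parse strings, and `convert_dtypes()` picks a dtype for the values as they are, which for strings is a string dtype. `astype(int)` and `pd.to_numeric()` parse each string instead.

```python
import pandas as pd

# Made-up stand-in for a queried column: numbers delivered as strings.
s = pd.Series(["4521", "3310", "2875"], dtype=object, name="B01003_001E")

# Soft conversion: strings are left as strings, so the result is not numeric.
inferred = s.infer_objects()

# Parsing conversions: each string is read as a number.
parsed = pd.to_numeric(s)
cast = s.astype(int)

print(pd.api.types.is_numeric_dtype(inferred))
print(parsed.dtype, cast.dtype)
```

`pd.to_numeric()` also accepts `errors="coerce"`, which turns unparseable entries into NaN rather than raising.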

@rluedde
Contributor

rluedde commented Jul 11, 2020

Is there ever a case where you wouldn't want data columns to be of integer type? Of course, you never want the geography columns to be of a numeric type.

@ljwolf
Member

ljwolf commented Jul 11, 2020

Yes, FIPS codes for geographic identifiers ought to be kept as strings.
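
A small illustration of why, using made-up identifiers in the same shape as the geo_filter in this issue: casting a FIPS code to an integer drops its leading zeros, so the pieces no longer concatenate into a valid identifier.

```python
# Made-up identifiers in the same shape as the geo_filter used in this issue.
state_fips = "04"
county_fips = "005"

# Kept as strings, the parts concatenate into a five-character county identifier.
geoid = state_fips + county_fips

# Cast to int and back, the leading zeros are gone and the result is ambiguous.
mangled = str(int(state_fips)) + str(int(county_fips))

print(geoid)
print(mangled)
```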

@rluedde
Contributor

rluedde commented Jul 11, 2020

@ljwolf, how does one get _coerce() from products.py to be used in remote.py's APIConnection class?

In remote.py, I've tried:

  • from .products import _coerce leads to ImportError: cannot import name 'APIConnection' from partially initialized module 'cenpy.remote' (most likely due to a circular import)
  • from products import _coerce leads to ModuleNotFoundError: No module named 'products'
  • from . import products as prod leads to ImportError: cannot import name 'APIConnection' from partially initialized module 'cenpy.remote' (most likely due to a circular import)

@dfolch
Contributor Author

dfolch commented Jul 12, 2020

Could we move _coerce out of products.py into tools.py, or possibly create a new utils.py file? If this is the route, then maybe move everything after line 886 out of products.py.

@rluedde
Contributor

rluedde commented Jul 13, 2020

@dfolch I'm pretty sure this takes care of the import issues. Should all of the functions that are now in utilities.py still be private?

In products.py, can the import be from utilities import *? As far as I understand it, the namespace wouldn't change from how it currently is. Or is it better to be more explicit and say where certain utilities.py functions came from (in products.py and remote.py)?

@rluedde
Contributor

rluedde commented Jul 13, 2020

Do you want all-or-nothing conversion of the data columns to integers, or do you want to convert every column that can be converted?

@ljwolf
Member

ljwolf commented Jul 13, 2020

I made _coerce private because it lived in products.py, and if you from cenpy import products, I wanted that to be very clean.

If coerce gets moved to utilities, then it's ok to become coerce, but when it's imported in products.py make sure to use from .utilities import coerce as _coerce.
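
A rough sketch of that layout, using a throwaway package written to a temporary directory. The module names mirror the thread, but the package name `demo_pkg`, the `parse` method, and the body of `coerce` are hypothetical and are not cenpy's code. Because both `remote` and `products` import from `utilities`, and `utilities` imports neither, the circular import goes away; the `as _coerce` alias keeps the public name out of the `products` namespace.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical module sources; only the import structure matters here.
FILES = {
    "__init__.py": "",
    "utilities.py": """
        def coerce(value, cast=float):
            try:
                return cast(value)
            except (TypeError, ValueError):
                return value
    """,
    "remote.py": """
        from .utilities import coerce

        class APIConnection:
            def parse(self, raw):
                return coerce(raw, int)
    """,
    "products.py": """
        from .remote import APIConnection
        from .utilities import coerce as _coerce
    """,
}


def build_demo_package(root, name="demo_pkg"):
    """Write the toy package under root and return its import name."""
    pkg = Path(root) / name
    pkg.mkdir()
    for filename, source in FILES.items():
        (pkg / filename).write_text(textwrap.dedent(source))
    return name


with tempfile.TemporaryDirectory() as tmp:
    pkg_name = build_demo_package(tmp)
    sys.path.insert(0, tmp)
    try:
        # Both imports succeed because utilities depends on neither module.
        remote = importlib.import_module(f"{pkg_name}.remote")
        products = importlib.import_module(f"{pkg_name}.products")
        parsed = remote.APIConnection().parse("42")
        has_private_alias = hasattr(products, "_coerce")
        has_public_name = hasattr(products, "coerce")
    finally:
        sys.path.remove(tmp)

print(parsed, has_private_alias, has_public_name)
```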

@rluedde
Contributor

rluedde commented Jul 13, 2020

@ljwolf I'm sorry, I didn't see your comment until after I made the PR. I will work on this.

What about the rest of the functions in utilities? Should they be made private as well? I'm not sure if I should be making comments here or on the PR.

@JoeGermuska
Contributor

Some data columns should be float, not int: most anything that has 'median', 'average' or 'rate' in the table name. I've used predicateType from the variables DataFrame to do conversions (although I've found at least one case where the Census API returns an incorrect value).

This gives some sense of the range of valid float values in the data and also flushes out the NaN where they creep in.
api_conn.variables[(api_conn.variables['predicateType'] != 'int') & (api_conn.variables['group'] != 'N/A')]

@JoeGermuska
Contributor

I've also since realized that the real problem is with the Census API, which returns numbers as quoted strings. JSON numbers shouldn't be quoted. See (and upvote) uscensusbureau/api#5
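
To make the point concrete, here is a short sketch with a made-up payload in the header-row-plus-data-rows shape the API uses. A quoted value parses as a Python str, while an unquoted JSON number parses as an int, so the quoting alone is what leaves the columns as strings.

```python
import json

# Made-up payload: a header row followed by one data row, every value quoted.
payload = '[["B01003_001E","state","county","tract"],["4521","04","005","000100"]]'

rows = json.loads(payload)
header, first_row = rows[0], rows[1]
quoted_value = first_row[0]

# The same number without quotes is parsed as an int by the JSON decoder.
unquoted_value = json.loads("[4521]")[0]

print(type(quoted_value).__name__, type(unquoted_value).__name__)
```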

@ronnie-llamado
Member

ronnie-llamado commented May 4, 2021

@JoeGermuska Would you still recommend using the predicateType to cast variables? It's an adaptive solution that caters to the Census API instead of casting everything to one type. This is of course assuming the predicateType provided is the correct value.

Here's a quick solution doing just that (staged inside cenpy.remote.APIConnection):

# df: data recently pulled inside class APIConnection (placeholder)

# eval() turns each predicateType label ('int', 'float', 'str') into the Python type
type_dict = {
    k: eval(self.variables.predicateType.loc[k.upper()])
    for k in df.columns
}
df = df.astype(type_dict, errors='ignore')

Note: This would also require some cleaning of the predicateType values. Two things would need to be addressed in the variables property:

  1. Convert string to str
  2. Convert np.nan to str
df.predicateType = df.predicateType.replace(['string', np.nan], 'str')
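
One possible variant of the snippet above, offered as a sketch and not as the code proposed in this thread. It swaps two things: a plain check on the label replaces `eval()`, and `pd.to_numeric(errors="coerce")` replaces `astype(..., errors='ignore')`, so unparseable entries become NaN. The function name `cast_by_predicate_type`, the toy variables table, and the labels in it are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def cast_by_predicate_type(df, variables):
    """Cast columns labelled 'int' or 'float'; leave every other column untouched."""
    out = df.copy()
    for col in out.columns:
        # Series.get returns None when the variable is not listed at all.
        label = variables["predicateType"].get(col.upper())
        if isinstance(label, str) and label in ("int", "float"):
            numeric = pd.to_numeric(out[col], errors="coerce")
            out[col] = numeric.astype("float64") if label == "float" else numeric
    return out


# Toy stand-in for the variables table; the labels are assumed, not fetched.
variables = pd.DataFrame(
    {"predicateType": ["int", "float", "string", np.nan]},
    index=["B01003_001E", "B01002_001E", "NAME", "STATE"],
)

# Toy stand-in for a query result: every value arrives as a string.
df = pd.DataFrame(
    {
        "B01003_001E": ["4521", "3310"],
        "B01002_001E": ["38.4", "41.0"],
        "NAME": ["Census Tract 1", "Census Tract 2"],
        "state": ["04", "04"],
    }
)

result = cast_by_predicate_type(df, variables)
print(result.dtypes)
```

Because labels other than 'int' and 'float' are skipped, the 'string' and NaN cases listed above need no separate cleanup in this variant, and the geography column keeps its leading zero.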
