Use monetary adjustment to try a better accuracy in meal outlier classifier #514

Open
cuducos opened this issue Jan 14, 2020 · 2 comments

@cuducos
Collaborator

cuducos commented Jan 14, 2020

What is the problem?

Maybe we can get better accuracy in the meal outlier classifier (as far as I can remember, the only one in which the reimbursement value is relevant) by adjusting prices over time for inflation.

How can this be addressed?

There's a package that can easily do that (using IPCA at this point). The adjustment would probably happen in the fit or transform stage of the classifier (sorry, scikit-learn, I never remember the difference between these two).

Something like the following would probably do the adjustment:

    df["adjusted_value"] = df.apply(lambda row: ipca.adjust(row["expense_date"], row["total_value"]), axis=1)

Then we can compare the results to see whether accuracy improves, as my hypothesis suggests.
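To make the idea concrete, here is a minimal sketch of what a row-wise adjustment does, using only the standard library. It is not the package's implementation: `PLACEHOLDER_INDEX`, `adjust`, and the sample rows are hypothetical names and made-up numbers (not real IPCA figures), chosen only to show that each value is scaled by the ratio between the index at the target date and the index at the expense date.

```python
from datetime import date
from decimal import Decimal

# Hypothetical price index levels keyed by (year, month).
# These are placeholders for illustration, NOT real IPCA figures.
PLACEHOLDER_INDEX = {
    (2015, 1): Decimal("100"),
    (2017, 1): Decimal("110"),
    (2020, 1): Decimal("125"),
}


def adjust(expense_date, value, target=(2020, 1)):
    """Scale `value` by index(target) / index(month of expense_date)."""
    key = (expense_date.year, expense_date.month)
    return Decimal(value) * PLACEHOLDER_INDEX[target] / PLACEHOLDER_INDEX[key]


# Hypothetical reimbursements with the same nominal value on different dates.
rows = [
    {"expense_date": date(2015, 1, 10), "total_value": "80.00"},
    {"expense_date": date(2020, 1, 5), "total_value": "80.00"},
]

# Same shape as the row-wise apply: one adjusted value per row.
for row in rows:
    row["adjusted_value"] = adjust(row["expense_date"], row["total_value"])
```

With these placeholder numbers, the two meals have the same nominal price but different adjusted prices, which is the kind of difference the classifier might otherwise miss.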

Who could help with this issue?

Anyone interested in doing some exploratory work with data and, maybe, contributing to https://github.com/okfn-brasil/notebooks ; )

@willianpaixao
Contributor

Hi @cuducos,
honest question: how far back in the past are we analyzing data? Would inflation make a noticeable difference over a span of less than five years?

And yes, a library like the one you mentioned would give the most accurate correction, but a simple table defining accumulated inflation per quarter (or even per semester) would already improve accuracy. Maybe it's a good first test.

@cuducos
Collaborator Author

cuducos commented Apr 24, 2020

> how far back in the past are we analyzing data?

Data goes back to 2009.

> Would inflation make a noticeable difference over a span of less than five years?

Over five years the impact was 26% (according to IPCA), 34% (according to IGPM) or 56% (according to SELIC). I'm not sure which one best serves this purpose, but given the nature of the expenses, I would guess IPCA (though it is still merely a guess).

In [1]: from calculadora_do_cidadao import Ipca, Igpm, Selic

In [2]: from datetime import datetime

In [3]: from datetime import timedelta

In [4]: for Adapter in (Ipca, Igpm, Selic):
   ...:     adapter = Adapter()
   ...:     diff = adapter.adjust((datetime.now() - timedelta(days=365 * 5)).date())
   ...:     print(diff)
   ...: 
1.259894139013801502406252724
1.340223541188715175178488001
1.558435971207303521700667897
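For reference, the percentages quoted above follow directly from the three factors printed in the session: subtract 1 and multiply by 100. A small sketch of that arithmetic, where the dictionary keys are just labels I picked for the three adapters:

```python
from decimal import Decimal

# Factors copied from the session output above (Ipca, Igpm, Selic, in order).
factors = {
    "IPCA": Decimal("1.259894139013801502406252724"),
    "IGPM": Decimal("1.340223541188715175178488001"),
    "SELIC": Decimal("1.558435971207303521700667897"),
}

# A factor of 1.26 means prices rose by about 26% over the period.
percentages = {name: round((factor - 1) * 100) for name, factor in factors.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```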

> And yes, a library like the one you mentioned would give the most accurate correction, but a simple table defining accumulated inflation per quarter (or even per semester) would already improve accuracy.

IMHO, given that this library already exists (see the example above), building this table would mean more work than automating the whole thing. The library lets you use the date already present in our dataset, while working with quarters or semesters would require an extra conversion from date to quarter/semester.
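For comparison, a rough sketch of what the table-based alternative would involve, assuming a hand-maintained table keyed by (year, quarter). The names `quarter_key`, `PLACEHOLDER_FACTORS`, and `adjust_by_quarter` are hypothetical, and the factors are placeholders, not real accumulated inflation:

```python
from datetime import date
from decimal import Decimal


def quarter_key(d):
    """Map a date to a (year, quarter) pair, with quarters numbered 1 to 4."""
    return d.year, (d.month - 1) // 3 + 1


# Hand-maintained accumulated factors per quarter.
# Placeholder values for illustration, NOT real inflation data.
PLACEHOLDER_FACTORS = {
    (2019, 4): Decimal("1.05"),
    (2020, 1): Decimal("1.00"),
}


def adjust_by_quarter(expense_date, value):
    """Look up the factor for the expense's quarter and scale the value."""
    return Decimal(value) * PLACEHOLDER_FACTORS[quarter_key(expense_date)]


print(adjust_by_quarter(date(2019, 11, 3), "100"))
```

The conversion itself is a one-liner, but the table has to be filled in and kept up to date for every quarter since 2009, which is the extra work mentioned above.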
