New classifier for alcoholic beverages #511

Open
cuducos opened this issue Jan 3, 2020 · 0 comments
Collaborator

cuducos commented Jan 3, 2020

What is the problem?

Until recently, digitized receipts were not machine-readable, and we could not afford to properly run OCR on all the images we had (although we tried). However, a couple of months ago the Chamber of Deputies started to offer electronic receipts.

Since we know their URLs (thanks @giovanisleite for #501) and they are structured HTML documents (that is to say, machine-readable), we can now try a classifier that identifies alcoholic beverages in the reimbursements (which are not allowed).

We just need to take extra care to check whether the full amount of the electronic receipt was actually reimbursed (sometimes the Chamber of Deputies cuts alcoholic beverages from the reimbursements without adding any remark).
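A minimal sketch of that check, assuming we already have the receipt total, the sum of the items flagged as alcoholic, and the amount actually reimbursed. The function and parameter names are hypothetical and do not come from any existing Serenata code:

```python
from decimal import Decimal


def alcohol_possibly_reimbursed(receipt_total, alcohol_total, reimbursed):
    """Return True when the reimbursed amount is larger than the cost of the
    non-alcoholic items, meaning some alcohol may have been paid for.

    All three arguments are hypothetical inputs (Decimal values in BRL).
    """
    non_alcoholic_total = receipt_total - alcohol_total
    return reimbursed > non_alcoholic_total


# The Chamber cut the beverage from the reimbursement: nothing to flag.
print(alcohol_possibly_reimbursed(Decimal("100.00"), Decimal("20.00"), Decimal("80.00")))
# The full receipt was reimbursed: worth flagging for review.
print(alcohol_possibly_reimbursed(Decimal("100.00"), Decimal("20.00"), Decimal("100.00")))
```

This only flags candidates for review. A partial reimbursement can happen for other reasons, so a flagged receipt is a suspicion rather than proof.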

How can this be addressed?

I think the classifier should:

  1. get the contents of the available electronic receipts
  2. parse them
  3. test them against a dictionary of possible names for alcoholic beverages (@brunopazzim drafted one in the early days of Serenata)
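Steps 2 and 3 above could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions: the term list is a made-up sample (not @brunopazzim's dictionary), the function names are hypothetical, and since the actual HTML layout of the receipts is not described here, the sketch simply scans every text node. Fetching the receipt (step 1) is left out.

```python
import re
import unicodedata
from html.parser import HTMLParser

# Hypothetical sample terms; the real dictionary would replace this set.
ALCOHOL_TERMS = {"cerveja", "chopp", "vinho", "caipirinha", "whisky", "vodka"}


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def normalize(text):
    """Lowercase and strip accents, so 'Cachaça' becomes 'cachaca'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def find_alcohol_terms(html, terms=ALCOHOL_TERMS):
    """Return the sorted dictionary terms that appear as whole words in the HTML."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = set()
    for chunk in parser.chunks:
        words.update(re.findall(r"[a-z]+", normalize(chunk)))
    return sorted(words & {normalize(term) for term in terms})


sample = "<table><tr><td>CERVEJA LONG NECK</td><td>9,90</td></tr><tr><td>Água</td></tr></table>"
print(find_alcohol_terms(sample))  # ['cerveja']
```

Matching whole words rather than substrings avoids some false positives, but brand names and abbreviations on real receipts would likely need a richer dictionary or fuzzy matching, which is something the exploratory notebook could evaluate.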

Of course, we might first go to an exploratory notebook at github.com/okfn-brasil/notebooks to test whether the results are worth it!

Who could help with this issue?

Anyone 💜
