
Rosie stops mid-classification due to MemoryError on a 32 GB RAM machine #560

Open
ogecece opened this issue Aug 23, 2021 · 1 comment

ogecece (Member) commented Aug 23, 2021

Rosie stopped tweeting a while back, and this issue is the reason.

Over the last few weeks, @andreformento diagnosed this locally and we tested it in the production infrastructure.

Here's the full traceback from running python3 rosie.py run chamber_of_deputies on a standard Digital Ocean Droplet with 8 vCPUs and 32 GB of RAM:

2021-08-23 22:32:46,878 - rosie.chamber_of_deputies.adapter - INFO - Updating companies
Downloading 2016-09-03-companies.xz: 100%|████████████████████████████████████████████████████████████████████████████| 4.84M/4.84M [00:00<00:00, 34.5Mb/s]
2021-08-23 22:32:47,051 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2009
2021-08-23 22:33:05,802 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2010
2021-08-23 22:33:27,758 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2011
2021-08-23 22:33:52,820 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2012
2021-08-23 22:34:14,875 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2013
2021-08-23 22:34:39,627 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2014
2021-08-23 22:35:00,156 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2015
2021-08-23 22:35:24,343 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2016
2021-08-23 22:35:47,603 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2017
2021-08-23 22:36:10,159 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2018
2021-08-23 22:36:29,338 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2019
2021-08-23 22:36:47,928 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2020
2021-08-23 22:36:58,705 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2021
2021-08-23 22:37:07,120 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2018.csv
2021-08-23 22:37:08,965 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2014.csv
2021-08-23 22:37:11,514 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2010.csv
2021-08-23 22:37:14,283 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2020.csv
2021-08-23 22:37:16,251 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2012.csv
2021-08-23 22:37:19,527 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2013.csv
2021-08-23 22:37:23,982 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2011.csv
2021-08-23 22:37:29,628 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2009.csv
2021-08-23 22:37:33,911 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2015.csv
2021-08-23 22:37:39,087 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2021.csv
2021-08-23 22:37:43,265 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2017.csv
2021-08-23 22:37:50,403 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2019.csv
2021-08-23 22:37:57,065 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2016.csv
2021-08-23 22:38:03,934 - rosie.chamber_of_deputies.adapter - INFO - Loading companies
2021-08-23 22:38:22,833 - rosie.chamber_of_deputies.adapter - INFO - Categorizing reimbursements
2021-08-23 22:38:24,119 - rosie.chamber_of_deputies.adapter - INFO - Coercing issue_date column to date data type
2021-08-23 22:38:25,018 - rosie.chamber_of_deputies.adapter - INFO - Coercing situation_date column to date data type
2021-08-23 22:38:39,961 - rosie.chamber_of_deputies.adapter - INFO - Renaming columns to Serenata de Amor standard
2021-08-23 22:38:39,962 - rosie.chamber_of_deputies.adapter - INFO - Dataset ready! Rosie starts her analysis now :)
2021-08-23 22:39:10,942 - rosie.core - INFO - Running classifier 1 of 6: meal_price_outlier
2021-08-23 22:40:08,740 - rosie.core - INFO - Running classifier 2 of 6: over_monthly_subquota_limit
2021-08-23 22:44:21,321 - rosie.core - INFO - Running classifier 3 of 6: suspicious_traveled_speed_day
Traceback (most recent call last):
  File "rosie.py", line 64, in <module>
    main()
  File "rosie.py", line 60, in main
    run(module, arguments['--output'])
  File "rosie.py", line 34, in run
    module.main(directory)
  File "/opt/serenata-de-amor/rosie/rosie/chamber_of_deputies/__init__.py", line 9, in main
    core()
  File "/opt/serenata-de-amor/rosie/rosie/core/__init__.py", line 45, in __call__
    self.predict(model, name)
  File "/opt/serenata-de-amor/rosie/rosie/core/__init__.py", line 73, in predict
    prediction = model.predict(self.dataset)
  File "/opt/serenata-de-amor/rosie/rosie/chamber_of_deputies/classifiers/traveled_speeds_classifier.py", line 70, in predict
    is_outlier = self.__applicable_rows(_X) & \
  File "/opt/serenata-de-amor/rosie/rosie/chamber_of_deputies/classifiers/traveled_speeds_classifier.py", line 100, in __applicable_rows
    X[['latitude', 'longitude']].notnull().all(axis=1)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 2918, in __getitem__
    data = self._take_with_is_copy(indexer, axis=1)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3363, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3348, in take
    self._consolidate_inplace()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5216, in _consolidate_inplace
    self._protect_consolidate(f)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5205, in _protect_consolidate
    result = f()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5214, in f
    self._mgr = self._mgr.consolidate()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 983, in consolidate
    bm._consolidate_inplace()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 988, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1909, in _consolidate
    list(group_blocks), dtype=dtype, can_consolidate=_can_consolidate
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1934, in _merge_blocks
    new_values = new_values[argsort]
MemoryError

Because of this, the pipeline doesn't go any further and Jarbas isn't updated, so no new data is registered to be tweeted.

A PR (#561) has been opened as a temporary fix, but any help on how we could reduce memory consumption would be appreciated.
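
For reference, one direction we could explore (an untested sketch, not what #561 does) is shrinking the DataFrame before the classifiers run, by downcasting numeric columns and converting repetitive text columns to pandas categoricals. The heuristics below are assumptions about the data, not Rosie's actual schema:

    import pandas as pd


    def downcast_dataframe(df: pd.DataFrame) -> pd.DataFrame:
        """Reduce memory by downcasting numbers and categorizing repetitive strings."""
        # Shrink float64 -> float32 and int64 -> the smallest integer type that fits.
        for column in df.select_dtypes(include="float").columns:
            df[column] = pd.to_numeric(df[column], downcast="float")
        for column in df.select_dtypes(include="integer").columns:
            df[column] = pd.to_numeric(df[column], downcast="integer")

        # Convert object columns to category only when they repeat a small set of values.
        for column in df.select_dtypes(include="object").columns:
            if df[column].nunique(dropna=True) < 0.5 * len(df):
                df[column] = df[column].astype("category")
        return df

Categorical dtypes in particular tend to cut memory a lot for columns that repeat a small vocabulary of values, which might be enough headroom to keep pandas' block consolidation (where the traceback shows the failure) from blowing up.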

Kudos @andreformento!

@andreformento commented:

I created PR #562 to help run the analysis using only the last few years 👀
I know it's not an optimization, but it makes it possible to run with fewer resources.
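
For illustration only, a filter along these lines (assuming the reimbursements DataFrame has a year column; the actual code in #562 may differ) could look like:

    from datetime import date

    import pandas as pd


    def keep_recent_years(reimbursements: pd.DataFrame, years_back: int = 2) -> pd.DataFrame:
        # Keep the current year plus the previous `years_back` years.
        cutoff = date.today().year - years_back
        return reimbursements[reimbursements["year"] >= cutoff]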
