
Rebuilder: Too many locations already known for chunks with duplicate positions #2120

vdombrovski opened this issue on Jun 8, 2022

ISSUE TYPE
  • Bug Report
COMPONENT NAME

oio-blob-rebuilder

SDS VERSION
5.12.0
CONFIGURATION
Default
OS / ENVIRONMENT
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
SUMMARY

When locating some objects, we can see that some of them have two chunks at the same position:

+------+--------------------------------------------------------------------------------------------+----------------+----------------------------------+
| Pos  | Id                                                                                         | Metachunk size | Metachunk hash                   |
+------+--------------------------------------------------------------------------------------------+----------------+----------------------------------+
| 0.1  | http://100.121.97.21:6217/93746C56170BEEFCF1997B4BDED97292421B06EBD505904F7E766DB3A75EF59A |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.3  | http://100.121.97.21:6214/C23EA94874496257FBB275D6C07748D924EB66B1CA0576719C61FAFB33FDB82D |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.4  | http://100.121.98.21:6231/8913A34B162935184E4703ADAE00975E02C0EC5449576C99352A6EE7EFC207B4 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.5  | http://100.121.98.21:6209/2A3D109702AAFE2B98CD77875560F46D643F5B7E18FEF9D8D91B58AD74F43DFE |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.6  | http://100.121.98.21:6238/6102C3FD92B1870B5906FC972133CFDE1A2538F1D7E8CAC0D81D2A0EED23F00D |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.7  | http://100.121.98.22:6255/E65DE5953157382C228658A44090D4341574C45E25E68534AA6E75054D9E6DC3 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.8  | http://100.121.99.22:6223/EB76BF861097C37EDF0F27A57A070145F55AFD2370F1E9BBDE1EAAD013EBB6C8 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.9  | http://100.121.99.22:6253/7DAFDF361F054D4C5568855BD2B655149A60629B0771F31BC04424AF9EF4CFD2 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.10 | http://100.121.99.22:6231/BF8AD402FB85CE54101AF47563180B43DF67A9CA29F13242420EEA95101A9834 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.11 | http://100.121.99.22:6248/341F067F78A37A5761A5C573E5FE5D0ED8E85ABA5E8B39CC06E91833F13BD903 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.2  | http://100.121.97.22:6219/CE53AD9988731885FA41D9BD7598F407D3908701E653F10F3FFEE21769513600 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.2  | http://100.121.97.22:6212/83E857FCDD8F91937C505D8B9B5E3F8AC8C08B4DF4DF98BCD01FCBE815167D11 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.0  | http://100.121.97.22:6204/D78798C6491011730588257394F096DB5782FAE9D33BEAB0F5C1819F898F9AC9 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
+------+--------------------------------------------------------------------------------------------+----------------+----------------------------------+

From #1909, it looks like this is "normal" behavior, as the oioproxy does not check the integrity of the JSON passed to it at all when performing a content/create call.
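
A uniqueness check on the positions would be enough to reject such a payload. As an illustration only (a minimal Python sketch of the idea; the actual validation would belong in the oioproxy and is not shown here):

def has_duplicate_positions(chunks):
    """Return True if any two chunks in the given list share the same 'pos'."""
    positions = [chunk["pos"] for chunk in chunks]
    return len(positions) != len(set(positions))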

However, when rebuilding, we get the following error:

2022-06-08 14:41:53.336 194635 7FD2E20B9370 log ERROR ERROR while rebuilding chunk OPENIO|A5B2970E8FBD895C486A853C2D3848AAF2835548AA1BEF2B251181FB07213EEE|DBB63ACFB7DA0500F2420539E536E998|6E561DA710A5027CA104712707BCEF7398A0C41939744791852B992D0A29AAF5: No spare chunk: found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy

This is understandable, as the duplicate chunks count towards the known locations when selecting a spare: the locate output above shows 13 chunks for a storage policy capped at 12 locations, so even with the chunk being rebuilt excluded there are already 12 known locations, and 0 services can match the criteria.

STEPS TO REPRODUCE

Create an object with a duplicate chunk position. One way is to forge a chunk list that contains a duplicate position and send that JSON to the oioproxy in a content/create call. Just make sure that the object locate command returns an output like the one described above.
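
For illustration, this is the shape of chunk list that triggers the problem. The field names (pos, url, size, hash) follow the chunk representation reported by object locate and used in oio/content/ec.py; the exact payload and headers expected by the content/create call are deployment-specific and not shown here:

import json

# Placeholder URLs/hashes taken from the locate output above, truncated for brevity.
chunks = [
    {"pos": "0.0", "url": "http://100.121.97.22:6204/D78798C6...", "size": 400089, "hash": "E972767A..."},
    {"pos": "0.1", "url": "http://100.121.97.21:6217/93746C56...", "size": 400089, "hash": "E972767A..."},
    {"pos": "0.2", "url": "http://100.121.97.22:6219/CE53AD99...", "size": 400089, "hash": "E972767A..."},
    # The same position appears twice: this is the forged duplicate.
    {"pos": "0.2", "url": "http://100.121.97.22:6212/83E857FC...", "size": 400089, "hash": "E972767A..."},
    # ... positions 0.3 through 0.11 as usual ...
]
print(json.dumps(chunks, indent=2))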

# Target the chunk to be rebuilt:
echo "A5B2970E8FBD895C486A853C2D3848AAF2835548AA1BEF2B251181FB07213EEE|DBB63ACFB7DA0500F2420539E536E998|6E561DA710A5027CA104712707BCEF7398A0C41939744791852B992D0A29AAF5" > /tmp/to_rebuild
oio-blob-rebuilder OPENIO --input-file /tmp/to_rebuild
EXPECTED RESULTS

Successful rebuilding of the chunk

ACTUAL RESULTS

2022-06-08 14:41:02.204 194586 7F5FDB2BE4B0 log INFO Failed to find spare chunk (attempt 1/3): found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy (HTTP 400) (STATUS 400)
2022-06-08 14:41:02.206 194586 7F5FDB2BE4B0 log INFO Failed to find spare chunk (attempt 2/3): found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy (HTTP 400) (STATUS 400)
2022-06-08 14:41:02.208 194586 7F5FDB2BE4B0 log INFO Failed to find spare chunk (attempt 3/3): found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy (HTTP 400) (STATUS 400)
2022-06-08 14:41:02.208 194586 7F5FDE38C370 log ERROR ERROR while rebuilding chunk OPENIO|A5B2970E8FBD895C486A853C2D3848AAF2835548AA1BEF2B251181FB07213EEE|DBB63ACFB7DA0500F2420539E536E998|6E561DA710A5027CA104712707BCEF7398A0C41939744791852B992D0A29AAF5: No spare chunk: found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy

A partial fix would consist of removing duplicate chunk positions before feeding the list to the _get_spare_chunk function (this needs to be done for both EC and replication); a standalone version of that filtering is sketched after the diff:

--- /usr/lib/python2.7/dist-packages/oio/content/ec_old.py      2022-06-08 14:57:10.918012798 +0000
+++ /usr/lib/python2.7/dist-packages/oio/content/ec.py  2022-06-08 14:56:40.222204985 +0000
@@ -58,9 +58,12 @@
         # Find a spare chunk address
         broken_list = list()

+        used = set()
+        candidates = [c for c in chunks.all() if c.pos not in used and c.pos != current_chunk.pos and (used.add(c.pos) or True)]
+
         if not allow_same_rawx and chunk_id is not None:
             broken_list.append(current_chunk)
-        spare_url, _quals = self._get_spare_chunk(chunks.all(), broken_list)
+        spare_url, _quals = self._get_spare_chunk(candidates, broken_list)
         new_chunk = Chunk({'pos': current_chunk.pos, 'url': spare_url[0]})

         # Regenerate the lost chunk's data, from existing chunks
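
The same filtering can be written as a small standalone helper, which may be easier to reuse for both EC and replication (a sketch only, mirroring the one-liner above and assuming chunk objects expose .pos as in oio/content/ec.py):

def unique_position_candidates(chunks, current_chunk):
    """Keep a single chunk per position, skipping the position being rebuilt."""
    seen = set()
    candidates = []
    for chunk in chunks:
        if chunk.pos == current_chunk.pos or chunk.pos in seen:
            continue
        seen.add(chunk.pos)
        candidates.append(chunk)
    return candidates

# Then, where the spare chunk is requested:
# spare_url, _quals = self._get_spare_chunk(
#     unique_position_candidates(chunks.all(), current_chunk), broken_list)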

Be warned, however, that this sometimes generates errors such as:

2022-06-08 14:44:52.500 194742 7FE3C2B4F370 log ERROR ERROR while rebuilding chunk OPENIO|43111ECD2732E20A123114C34E1F8E740ADA5D8256CF27A3AB77213FFFFEB678|B875B33177DA05000CB194A6F73A6CA7|01E9E28CBD0BB4E0DA1A408F3AD246D44CECD1915F6D52D496370BF23874B029: pyeclib_c_reconstruct ERROR: Insufficient number of fragments. Please inspect syslog for liberasurecode error report.

The outcome depends on which chunks get selected for the rebuild. We think that duplicate chunk positions also lead to the same fragment being uploaded at two different positions, which is in itself another issue.
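
One way to verify that hypothesis is to download the two chunks sharing position 0.2 and compare their payloads; a rough sketch, assuming the chunk URLs reported by object locate can be fetched with a plain HTTP GET from the rawx services:

import hashlib
import requests

# The two chunk URLs reported at position 0.2 in the locate output above.
urls = [
    "http://100.121.97.22:6219/CE53AD9988731885FA41D9BD7598F407D3908701E653F10F3FFEE21769513600",
    "http://100.121.97.22:6212/83E857FCDD8F91937C505D8B9B5E3F8AC8C08B4DF4DF98BCD01FCBE815167D11",
]
digests = []
for url in urls:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    digests.append(hashlib.md5(resp.content).hexdigest())

# Identical digests would confirm that the same fragment was stored twice,
# which would explain the pyeclib_c_reconstruct failure above.
print(digests[0] == digests[1], digests)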
