Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guest import job rarely loops at 100% #1677

Open
pboguslawski opened this issue Jun 7, 2022 · 1 comment
Open

Guest import job rarely loops at 100% #1677

pboguslawski opened this issue Jun 7, 2022 · 1 comment

Comments

@pboguslawski
Copy link
Contributor

pboguslawski commented Jun 7, 2022

Sometimes (rarely) during vm import the job loops at the 100% (same message with 100% is repeated every 1s and job doesn't stop execution), "hard" kill is required to stop such job:

root@myganeti:~# gnt-backup import -n myganeti2.mydomain:myganeti.mydomain --src-node=myganeti.mydomain --src-dir=/var/backups/ganeti/myvm.mydomain --no-ip-check --no-name-check --net 0:link=net1,mac=generate myvm.mydomain
Tue Jun  7 08:42:44 2022 * creating instance disks...
Tue Jun  7 08:43:02 2022 adding instance myvm.mydomain to cluster config
Tue Jun  7 08:43:02 2022 adding disks to cluster config
Tue Jun  7 08:43:02 2022 * wiping instance disks...
Tue Jun  7 08:43:04 2022  - INFO: * Wiping disk 0
Tue Jun  7 08:43:07 2022  - INFO:  - done: 10.0% ETA: 34s
Tue Jun  7 08:43:37 2022  - INFO: Waiting for instance myvm.mydomain to sync disks
Tue Jun  7 08:43:37 2022  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Tue Jun  7 08:43:38 2022  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Tue Jun  7 08:43:39 2022  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Tue Jun  7 08:43:39 2022  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Tue Jun  7 08:43:40 2022  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
^CAborted. Note that if the operation created any jobs, they might have been submitted ad will continue to run in the background.

root@myganeti:~# gnt-job list
   ID Status  Summary
[...]
48767 running INSTANCE_CREATE(myvm.mydomain)
[...]

root@myganeti:~# gnt-job info 48767 
- Job ID: 48767
  Status: running
  Received:         2022-06-07 08:42:41.987983
  Processing start: 2022-06-07 08:42:42.473976 (delta 0.485993s)
  Processing end:   unknown (None)
  Total processing time: N/A
  Opcodes: 
    - Opcode: OP_INSTANCE_CREATE
      Status: running
      Processing start: 2022-06-07 08:42:42.473976
      Execution start:  2022-06-07 08:42:42.615376
      Processing end:   N/A
      Input fields: 
        OP_ID: OP_INSTANCE_CREATE
[...]
      Execution log: 
        - Time: 2022-06-07 08:42:44.449925
          Content: [1, message, * creating instance disks...]
        - Time: 2022-06-07 08:43:02.707168
          Content: [2, message, adding instance myvm.mydomain to cluster config]
        - Time: 2022-06-07 08:43:02.761817
          Content: [3, message, adding disks to cluster config]
        - Time: 2022-06-07 08:43:02.882746
          Content: [4, message, * wiping instance disks...]
        - Time: 2022-06-07 08:43:04.002804
          Content: [5, message,  - INFO: * Wiping disk 0]
        - Time: 2022-06-07 08:43:07.761615
          Content: [6, message,  - INFO:  - done: 10.0% ETA: 34s]
        - Time: 2022-06-07 08:43:37.166534
          Content: [7, message,  - INFO: Waiting for instance myvm.mydomain to sync disks]
        - Time: 2022-06-07 08:43:37.845352
          Content: [8, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:38.516010
          Content: [9, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:39.158333
          Content: [10, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:39.827776
          Content: [11, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:40.539092
          Content: [12, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:41.224098
          Content: [13, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:41.972286
          Content: [14, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:42.771993
          Content: [15, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:43.603526
          Content: [16, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:44.492215
          Content: [17, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:45.247506
          Content: [18, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:46.059068
          Content: [19, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:46.827857
          Content: [20, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:47.630069
          Content: [21, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:48.439567
          Content: [22, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:49.207742
          Content: [23, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:50.124583
          Content: [24, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:50.947226
          Content: [25, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:51.878631
          Content: [26, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:52.695489
          Content: [27, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:53.550960
          Content: [28, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:54.475304
          Content: [29, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:55.336191
          Content: [30, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:56.172383

root@myganeti:~# gnt-job cancel 48767

root@myganeti:~# gnt-job list
   ID Status  Summary
[...]
48767 running INSTANCE_CREATE(myvm.mydomain)
[...]

root@myganeti:~# gnt-job cancel --kill --yes-do-it 48767
SIGKILL send to the process

root@myganeti:~# gnt-job list
   ID Status  Summary
[...]
48767 error   INSTANCE_CREATE(myvm.mydomain)
[...]

root@myganeti:~# gnt-job info 48767 
- Job ID: 48767
  Status: error
  Received:         2022-06-07 08:42:41.987983
  Processing start: 2022-06-07 08:42:42.473976 (delta 0.485993s)
  Processing end:   2022-06-07 08:49:06.708302 (delta 384.234326s)
  Total processing time: 384.720319 seconds
  Opcodes: 
    - Opcode: OP_INSTANCE_CREATE
      Status: error
      Processing start: 2022-06-07 08:42:42.473976
      Execution start:  2022-06-07 08:42:42.615376
      Processing end:   2022-06-07 08:49:06.708302
      Input fields: 
        OP_ID: OP_INSTANCE_CREATE
[...]
      Execution log: 
        - Time: 2022-06-07 08:42:44.449925
          Content: [1, message, * creating instance disks...]
        - Time: 2022-06-07 08:43:02.707168
          Content: [2, message, adding instance myvm.mydomain to cluster config]
        - Time: 2022-06-07 08:43:02.761817
          Content: [3, message, adding disks to cluster config]
        - Time: 2022-06-07 08:43:02.882746
          Content: [4, message, * wiping instance disks...]
        - Time: 2022-06-07 08:43:04.002804
          Content: [5, message,  - INFO: * Wiping disk 0]
        - Time: 2022-06-07 08:43:07.761615
          Content: [6, message,  - INFO:  - done: 10.0% ETA: 34s]
        - Time: 2022-06-07 08:43:37.166534
          Content: [7, message,  - INFO: Waiting for instance myvm.mydomain to sync disks]
        - Time: 2022-06-07 08:43:37.845352
          Content: [8, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:38.516010
          Content: [9, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:39.158333
          Content: [10, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
        - Time: 2022-06-07 08:43:39.827776
          Content: [11, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]
[...]
        - Time: 2022-06-07 08:49:06.396917
          Content: [457, message,  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)]

Ganeti from Debian 11 main:

# dpkg -s ganeti | grep Version
Version: 3.0.1-2

Same problem was observed in Ganeti from Debian 9 main.

@saschalucas
Copy link
Member

This looks like a known problem with DRBD, which Ganeti does not handle ATM.

I remember having this type of error, too. Looking at the instance primary/secondary node in /proc/drbd, there should be a DRBD resource (the one belonging to the instance), which is 100% in sync, but stalled. Could you provide relevant content next time the problem occurs?

Very very rarely this happened to running instances. As a workaround I disconnect/connect the drbd resource via DRBDSETUP(8).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants