
Spice errors on offline guest backups #1661

Open
pboguslawski opened this issue Mar 4, 2022 · 3 comments

@pboguslawski
Contributor

We've noticed offline backup problems with one guest on a Ganeti host. The problem does not occur on every backup, and it affects only this one guest (old server hardware, so it may be some kind of race related to processing speed).

Ganeti 2 from standard Debian 9 packages:

root@myhost:~# gnt-job info 294914
- Job ID: 294914
  Status: error
  Received:         2021-03-13 20:30:03.551883
  Processing start: 2021-03-13 20:30:04.134755 (delta 0.582872s)
  Processing end:   2021-03-13 20:31:19.675262 (delta 75.540507s)
  Total processing time: 76.123379 seconds
  Opcodes:
    - Opcode: OP_BACKUP_EXPORT
      Status: error
      Processing start: 2021-03-13 20:30:04.134755
      Execution start:  2021-03-13 20:30:04.164027
      Processing end:   2021-03-13 20:31:19.675238
      Input fields:
        OP_ID: OP_BACKUP_EXPORT
        comment: None
        compress: none
        debug_level: 0
        depends: None
        dry_run: False
        ignore_remove_failures: False
        instance_name: myguest.mydomain
        instance_uuid: 8aad35ac-1234-4239-9897-eb6c20340ac8
        long_sleep: False
        mode: local
        priority: 0
        reason:
          - - gnt:client:gnt-backup
            - export
            - 1615663803549786112
          - - gnt:opcode:backup_export
            - job=294914;index=0
            - 1615663803551883000
          - - gnt:daemon:masterd:pickup:op_backup_export
            - job=294914;index=0
            - 1615663804128281088
        remove_instance: False
        shutdown: True
        shutdown_timeout: 120
        target_node: myhost.mydomain
        target_node_uuid: a92aafa8-0938-439e-9312-1a2b1ffc7e88
        zero_free_space: False
      Result:
        - OpExecError
        - - Could not start instance: Hypervisor error: Failed to start instance myguest.mydomain: exited with exit code 1 ((/sbin/start-stop-daemon:17102): Spice-Warning **: reds.c:2524:reds_init_socket: reds_init_socket: binding socket to 127.0.0.1:11004 failed
qemu-system-x86_64: failed to initialize spice server
)
      Execution log:
        - Time: 2021-03-13 20:30:04.313848
          Content: [1, message, Shutting down instance myguest.mydomain]
        - Time: 2021-03-13 20:31:14.837206
          Content: [2, message, Creating a snapshot of disk/0 on node myhost.mydomain]
        - Time: 2021-03-13 20:31:16.070407
          Content: [3, message, Creating a snapshot of disk/1 on node myhost.mydomain]
        - Time: 2021-03-13 20:31:16.827548
          Content: [4, message, Starting instance myguest.mydomain]
        - Time: 2021-03-13 20:31:18.075947
          Content: [5, message, Failed to start instance: Hypervisor error: Failed to start instance myguest.mydomain: exited with exit code 1 ((/sbin/start-stop-daemon:17102): Spice-Warning **: reds.c:2524:reds_init_socket: reds_init_socket: binding socket to 127.0.0.1:11004 failed\nqemu-system-x86_64: failed to initialize spice server\n)]
        - Time: 2021-03-13 20:31:18.518172
          Content: [6, message, Removing snapshot of disk/0 on node myhost.mydomain]
        - Time: 2021-03-13 20:31:19.128266
          Content: [7, message, Removing snapshot of disk/1 on node myhost.mydomain]

Ganeti 3 from standard Debian 11 packages:

root@myhost:~# gnt-job info 2753
- Job ID: 2753
  Status: error
  Received:         2022-01-25 20:30:02.371484
  Processing start: 2022-01-25 20:30:02.842194 (delta 0.470710s)
  Processing end:   2022-01-25 20:30:39.537554 (delta 36.695360s)
  Total processing time: 37.166070 seconds
  Opcodes:
    - Opcode: OP_BACKUP_EXPORT
      Status: error
      Processing start: 2022-01-25 20:30:02.842194
      Execution start:  2022-01-25 20:30:02.913307
      Processing end:   2022-01-25 20:30:39.537524
      Input fields:
        OP_ID: OP_BACKUP_EXPORT
        comment: None
        compress: none
        debug_level: 0
        depends: None
        dry_run: False
        ignore_remove_failures: False
        instance_name: myguest.mydomain
        instance_uuid: 8aad35ac-1234-4239-9897-eb6c20340ac8
        long_sleep: False
        mode: local
        priority: 0
        reason:
          - - gnt:client:gnt-backup
            - export
            - 1643139002369189376
          - - gnt:opcode:backup_export
            - job=2753;index=0
            - 1643139002371484000
        remove_instance: False
        shutdown: True
        shutdown_timeout: 120
        target_node: myhost.mydomain
        target_node_uuid: a92aafa8-0938-439e-9312-1a2b1ffc7e88
        zero_free_space: False
      Result:
        - OpExecError
        - - Could not start instance: Hypervisor error: Failed to start instance myguest.mydomain: exited with exit code 1 ()
      Execution log:
        - Time: 2022-01-25 20:30:03.045304
          Content: [1, message, Shutting down instance myguest.mydomain]
        - Time: 2022-01-25 20:30:25.366033
          Content: [2, message, Creating a snapshot of disk/0 on node myhost.mydomain]
        - Time: 2022-01-25 20:30:28.663659
          Content: [3, message, Creating a snapshot of disk/1 on node myhost.mydomain]
        - Time: 2022-01-25 20:30:31.835476
          Content: [4, message, Starting instance myguest.mydomain]
        - Time: 2022-01-25 20:30:34.176965
          Content: [5, message, Failed to start instance: Hypervisor error: Failed to start instance myguest.mydomain: exited with exit code 1 ()]
        - Time: 2022-01-25 20:30:35.489601
          Content: [6, message, Removing snapshot of disk/0 on node myhost.mydomain]
        - Time: 2022-01-25 20:30:37.507101
          Content: [7, message, Removing snapshot of disk/1 on node myhost.mydomain]

It seems the qemu process cannot bind to the SPICE port, which remains in use for some time after guest shutdown. Attached please find a patch that resolves this issue for us on Debian 11 (a 10 s sleep added after VM shutdown, before the instance is started again during an offline backup).

backup.py-IB1107521.patch.gz
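The attached patch itself is not reproduced here, but the idea can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the actual backup.py change:

```python
import time

def offline_backup_restart(shutdown_instance, start_instance, delay=10):
    """Shut an instance down, wait, then start it again.

    A fixed delay between shutdown and start gives the old qemu process
    time to release the SPICE socket; without it the new qemu may fail
    with "binding socket ... failed".
    """
    shutdown_instance()
    time.sleep(delay)  # let the old qemu release the SPICE port
    start_instance()

# Usage with stand-in callables (the real code would call the
# hypervisor's shutdown/start paths here):
events = []
offline_backup_restart(lambda: events.append("down"),
                       lambda: events.append("up"),
                       delay=0.01)
```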

BTW: it seems the error message content is not present in the job info in Ganeti 3 (it was available in Ganeti 2, as shown above). Probably another issue?

@rbott
Member

rbott commented Mar 5, 2022

Hi @pboguslawski, that sounds interesting. Can you verify that the port is still in listening mode, and whether there is still a process listed along with it while the error occurs? E.g. with netstat -tlpen (or ss -tlpn).

@pboguslawski
Contributor Author

pboguslawski commented Mar 12, 2022

Can you verify that the port is still in listening mode and is there still a process listed along with it while the error occurs?

Below is the log of a failed backup job (standard Ganeti on Debian 11, without our patch) with netstat -tlpen output for the VM's SPICE port interleaved (lines prefixed with //; netstat output was captured every second):

root@myhost:~# gnt-job info 15947
- Job ID: 15947
  Status: error
  Received:         2022-03-12 20:30:02.332224
  Processing start: 2022-03-12 20:30:02.837210 (delta 0.504986s)
  Processing end:   2022-03-12 20:30:46.003610 (delta 43.166400s)
  Total processing time: 43.671386 seconds
  Opcodes:
    - Opcode: OP_BACKUP_EXPORT
      Status: error
      Processing start: 2022-03-12 20:30:02.837210
      Execution start:  2022-03-12 20:30:02.852788
      Processing end:   2022-03-12 20:30:46.003581
      Input fields:
        OP_ID: OP_BACKUP_EXPORT
        comment: None
        compress: none
        debug_level: 0
        depends: None
        dry_run: False
        ignore_remove_failures: False
        instance_name: myguest.mydomain
        instance_uuid: 8aad35ac-1234-4239-9897-eb6c20340ac8
        long_sleep: False
        mode: local
        priority: 0
        reason:
          - - gnt:client:gnt-backup
            - export
            - 1647113402330256128
          - - gnt:opcode:backup_export
            - job=15947;index=0
            - 1647113402332224000
        remove_instance: False
        shutdown: True
        shutdown_timeout: 120
        target_node: myhost.mydomain
        target_node_uuid: a92aafa8-0938-439e-9312-1a2b1ffc7e88
        zero_free_space: False
      Result:
        - OpExecError
        - - Could not start instance: Hypervisor error: Failed to start instance myguest.mydomain: exited with exit code 1 ()
      Execution log:
// 20:30:30: tcp        0      0 127.0.0.1:11005         0.0.0.0:*               LISTEN      0          54320740   74101/kvm
        - Time: 2022-03-12 20:30:02.970722
          Content: [1, message, Shutting down instance myguest.mydomain]
// 20:30:31: tcp        0      0 127.0.0.1:11005         0.0.0.0:*               LISTEN      0          54320740   -
        - Time: 2022-03-12 20:30:33.305622
          Content: [2, message, Creating a snapshot of disk/0 on node myhost.mydomain]
        - Time: 2022-03-12 20:30:36.507728
          Content: [3, message, Creating a snapshot of disk/1 on node myhost.mydomain]
        - Time: 2022-03-12 20:30:39.583500
          Content: [4, message, Starting instance myguest.mydomain]
// 20:30:40: tcp        0      0 127.0.0.1:11005         0.0.0.0:*               LISTEN      0          54320740   -
        - Time: 2022-03-12 20:30:41.471169
          Content: [5, message, Failed to start instance: Hypervisor error: Failed to start instance myguest.mydomain: exited with exit code 1 ()]
        - Time: 2022-03-12 20:30:42.675776
          Content: [6, message, Removing snapshot of disk/0 on node myhost.mydomain]
        - Time: 2022-03-12 20:30:44.463784
          Content: [7, message, Removing snapshot of disk/1 on node myhost.mydomain]
// 20:30:52: tcp        0      0 127.0.0.1:11005         0.0.0.0:*               LISTEN      0          54320740   -
// 20:30:53: no row with spice port 11005 in netstat output

Same result in the ss -tlpn output.
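For reference, the per-second capture can be done with a loop along these lines (a sketch; 11005 is this instance's SPICE port from the logs above, and the iteration count is bounded here only for illustration):

```shell
# Sample the listening state of a SPICE port once per second.
PORT=11005
for i in 1 2 3; do
    printf '%s: ' "$(date +%H:%M:%S)"
    # Print the matching listener row, or a note when the port is gone.
    ss -tlpn "sport = :$PORT" 2>/dev/null | tail -n +2 | grep . \
        || echo "no row with spice port $PORT"
    sleep 1
done
```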

It seems the SPICE port is still being released when the VM is started after the snapshot, and this causes the problem.
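Instead of a fixed sleep, an alternative (a hypothetical sketch, not part of the attached patch) would be to poll until the port can actually be bound before starting the VM:

```python
import socket
import time

def wait_for_port_free(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port can be bound, False if the timeout
    expires first. Polling avoids guessing how long the old qemu
    process needs to release its SPICE socket."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
                return True           # port is free again
            except OSError:
                time.sleep(interval)  # still held by the old process
    return False
```

A check like wait_for_port_free("127.0.0.1", 11005) before the start step would make the delay adaptive rather than fixed.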

@pboguslawski
Contributor Author

The 10 s sleep added after VM shutdown was not enough; sometimes the error still occurred. The sleep was increased to 15 s and no such errors have appeared for a few weeks now.

Looks like something connected with socket shutdown, as described in:

https://unix.stackexchange.com/questions/17218/how-long-is-a-tcp-local-socket-address-that-has-been-bound-unavailable-after-clo
https://stackoverflow.com/questions/3229860/what-is-the-meaning-of-so-reuseaddr-setsockopt-option-linux/3233022#3233022
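The TIME_WAIT behaviour those links describe can be reproduced in isolation (a standalone illustration, unrelated to the Ganeti code). The side that closes first keeps the port occupied in TIME_WAIT for a while, and only a new socket with SO_REUSEADDR can rebind it immediately:

```python
import socket
import time

# Create a listener on an ephemeral port, complete one connection,
# then close everything; the server side ends up in TIME_WAIT.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
port = srv.getsockname()[1]
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
conn, _ = srv.accept()
conn.close()      # server closes first -> its side enters TIME_WAIT
cli.close()
srv.close()
time.sleep(0.2)   # let the FIN handshake finish on loopback

# Rebinding without SO_REUSEADDR typically fails on Linux while the
# old socket lingers in TIME_WAIT...
plain = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    plain.bind(("127.0.0.1", port))
    rebind_failed = False
except OSError:
    rebind_failed = True
plain.close()

# ...but succeeds when SO_REUSEADDR is set on the new socket.
reuse = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
reuse.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
reuse.bind(("127.0.0.1", port))
reuse_ok = True
reuse.close()
```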
