move-instance difficult to use and ultimately fails #1696

anarcat · 2023-03-10T16:47:45Z

I'm trying to migrate between two Ganeti clusters. I have found with great anticipation the move-instance command, but I'm having a hard time making it work.

At first, it would just crash with a backtrace in Debian bullseye:

TypeError: '>' not supported between instances of 'NoneType' and 'int'

That's due to this code:

ganeti/tools/move-instance

Line 941 in 114e59f

if not options.iallocator and (options.opportunistic_tries > 0):

If I pass --opportunistic-tries=1 it tells me:

move-instance: error: Opportunistic instance creation can only be used with an iallocator

So, basically, right now, you must use:

move-instance --opportunistic-tries=1 --iallocator=hail

According to @apoikos (on IRC), the TypeError is a python2-to-3 leftover...
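For illustration, the crash comes from comparing an unset option (None) with an integer, which Python 2 tolerated and Python 3 rejects. Below is a minimal sketch of a None-safe version of that check. It is not necessarily the change that ended up in #1697; `check_allocator_options` and the `Namespace` objects standing in for the parsed options are hypothetical names used only here.

```python
from argparse import Namespace


def check_allocator_options(options):
    """Return an error message, or None if the option combination is valid.

    Hypothetical stand-in for the check quoted above from move-instance.
    """
    # In Python 3, `None > 0` raises TypeError, so treat an unset option as 0.
    tries = options.opportunistic_tries or 0
    if not options.iallocator and tries > 0:
        return ("Opportunistic instance creation can only be used with an "
                "iallocator")
    return None


# No --opportunistic-tries and no --iallocator: no crash, no error.
print(check_allocator_options(Namespace(iallocator=None,
                                        opportunistic_tries=None)))
```

With this shape, leaving out both options is accepted, while passing --opportunistic-tries without an iallocator still produces the error message quoted above.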

The next problem I had with move-instance was a ganeti.rapi.client.Error: Password not specified, but that was me failing at setting up the RAPI users. I also got ganeti.rapi.client.GanetiApiError: 401 Unauthorized: No permission -- see authorization schemes on the destination cluster. Maybe the documentation could be improved to point the operator the right way ("check your RAPI users again"). Having a way to test the users out of band (say with curl) would also be useful here.

Then I had another error which was pretty opaque:

ganeti.errors.OpPrereqError: ("Invalid handshake: Hash didn't match, clusters don't share the same domain secret", 'wrong_input')

So that might seem obvious but I did copy the secret over and ran:

gnt-cluster renew-crypto --cluster-domain-secret=cluster-domain-secret

So it seems the bug there is that the --cluster-domain-secret= argument actually fails to replace the secret on the cluster. I had to manually copy the cluster-domain-secret file in /var/lib/ganeti and restart the server for that to work.

But what completely blocked me is this:

ganeti.errors.OpPrereqError: ('If network is given, no mode or link is allowed to be passed', 'wrong_input')

It looks like the source node is encoding NIC information in the backup and the target node is somewhat unhappy with it. I'm not sure how to debug this: I'm lost in the stack between the client and server method definitions and I don't actually understand what's going on so much.

Did anyone get that thing to work at all? What am I doing wrong?

Should I open separate issues for those things?

anarcat · 2023-03-10T17:20:00Z

oh, and for what it's worth, I have a manual procedure for moving VMs around with export/import documented here:

https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ganeti/#migrating-a-vm-between-clusters

it's basically, on the source node:

gnt-backup export -n chi-node-01.torproject.org test-01.torproject.org

and on the target node:

rsync -ASHaxX --info=progress2 root@chi-node-01.torproject.org:/var/lib/ganeti/export/test-01.torproject.org/ /var/lib/ganeti/export/test-01.torproject.org/
gnt-backup import -n dal-node-01:dal-node-02 --src-node=dal-node-01 --src-dir=/var/lib/ganeti/export/test-01.torproject.org --no-ip-check --no-name-check --net 0:ip=pool,network=gnt-dal-01 -t drbd --no-wait-for-sync test-01.torproject.net

... so that works, but it's still a manual process, with multiple steps and everything is a little error-prone, with multiple long-running processes punctuated by manual "copy-paste" things which is not ideal for large clusters. (And yes, I could also have my own automation to speed this up myself, but then i'd be rewriting move-instance, wouldn't i? :)

rbott · 2023-03-12T15:00:26Z

Hi @anarcat,

we have used the tool a lot in all of our Ganeti 2.16 -> 3.0 Upgrades (basically we set up new small clusters with fresh hardware and Ganeti 3.0, moved some instances, re-purposed/re-installed older nodes where possible and added them to the new cluster(s), moved some more instances etc.). We also stumbled upon some bugs and/or missing features which are fixed in master here, here and here. You can use the script standalone without the rest of the tree directly from master to carry out the migrations.

We ended up with the following pre-setup:

- put the cluster-domain-secret manually in /var/lib/ganeti/cluster-domain-secret and use gnt-cluster redist-conf to redistribute it on both source and destination clusters
- if the source system is even Ganeti 2.15 (which also worked fine) on Debian Jessie make sure that socat is installed from the base distribution, not from the backports repository (Ganeti 2.16 contains a fix for that so we did not face that issue on newer machines)
- make sure source Ganeti nodes are able to establish TCP connections on ports > 1024 to destination Ganeti nodes (inter-cluster-migration will always use the primary network of the cluster, not a secondary/alternate network which may be configured for e.g. DRBD stuff)

We then used an Ansible playbook to carry out the actual instance migrations, but that happened mainly due to easier integration into other internal workflows which are not relevant here. It did the following:

- pre-flight check if all relevant systems, APIs, ports etc are reachable
- the export/import scripts from the debootstrap OS provider are broken/not usable in our scenario ("partition style" full disk images), so we replace them on-the-fly with the ones from gnt-noop which simply use dd to do a bitwise transfer of all disks
- call the migration script with the relevant parameters
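The pre-flight step above could be sketched roughly like this. It is only an illustration, not the playbook itself: `port_reachable` and `preflight` are hypothetical helper names, the host name is a placeholder, and 5080 is assumed to be the RAPI port in use. The transfer ports above 1024 mentioned earlier are not covered, so this only helps for ports known in advance.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def preflight(targets):
    """Map each (host, port) pair to whether it accepted a connection."""
    return {(host, port): port_reachable(host, port) for host, port in targets}


if __name__ == "__main__":
    # Placeholder destination node and assumed RAPI port.
    for target, ok in preflight([("dest-node-01.example.org", 5080)]).items():
        print(target, "reachable" if ok else "NOT reachable")
```

Running this from every source node, not just the master, matters because the data transfer originates from the node holding the disks.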

Our instance scenario:

- KVM hypervisor, DRBD storage
- 1-2 VirtIO disk(s), 1-2 VirtIO NIC(s) (bridged mode with one bridge per VLAN on source node, bridged mode with tagged VLANs on destination node)
- parameters are mostly set on cluster level (e.g. cpu_type, vnc/spice) and might change during migration to the new cluster default
- because of the change in the instance network configuration (see above) we alter the network parameters (e.g. link=br604 turns into link=gnt-bridge,vlan=604) and also statically set the MAC address the interface had on the source cluster (otherwise a new one will be assigned during instance creation on the destination cluster) - this can be done by adding something like --net 0:link=gnt-bridge,vlan=604,mac=aa:bb:cc:dd:ee:ff to the command line of move-instance

I think I remember that while looking at the code of move-instance we found out that it actually does make some assumptions on the instance configuration which might lead to the NIC-configuration-related-error you stated above.

The above is far from perfect and still yields some issues that should actually be fixed upstream. But it worked well for us with several hundred instances so far. Hope that helps a bit :-)

rbott · 2023-03-12T15:09:33Z

We have also used the same approach to move instances between Ganeti 3.0 clusters. However, due to an issue with more recent socat versions, this needs a manual change of the export/import code on the source node :-(

More information can be found in this issue

rbott · 2023-03-12T15:19:27Z

Oh and I would definitely say (to add something more useful to this issue): I would suggest to a) extend the documentation with more guidance/example commands/pitfalls and b) of course fix the open/known issues, e.g. the setting of the shared secret which clearly is a bug.

I might find some time in the next few days to extend the documentation.

Any move-instance call without --opportunistic-tries currently crashes with this:

    File "/usr/lib/ganeti/tools/move-instance", line 1132, in <module>
      main()
    File "/usr/share/ganeti/3.0/ganeti/rapi/client.py", line 274, in wrapper
      return fn(*args, **kwargs)
    File "/usr/lib/ganeti/tools/move-instance", line 1069, in main
      CheckOptions(parser, options, args)
    File "/usr/lib/ganeti/tools/move-instance", line 1007, in CheckOptions
      _CheckAllocatorOptions(parser, options)
    File "/usr/lib/ganeti/tools/move-instance", line 941, in _CheckAllocatorOptions
      if not options.iallocator and (options.opportunistic_tries > 0):
    TypeError: '>' not supported between instances of 'NoneType' and 'int'

It is said this was a problem related to the 2to3 migration, but I haven't investigated fully.

See: ganeti#1696

anarcat · 2023-03-13T20:48:37Z

wow, that's all extremely useful! that --keep-instance flag is invaluable, I didn't even realize the move-instance script was trashing instances on the source cluster, ouch! I guess it makes sense because of the "move" semantic, but still, dang...

> the export/import scripts from the debootstrap OS provider are broken/not usable in our scenario ("partition style" full disk images), so we replace them on-the-fly with the ones from gnt-noop which simply use dd to do a bitwise transfer of all disks

amazing, I converged on the exact same thing, probably because of the exact same bug, see ganeti/instance-debootstrap#18

> because of the change in the instance network configuration (see above) we alter the network parameters (e.g. link=br604 turns into link=gnt-bridge,vlan=604) and also statically set the MAC address the interface had on the source cluster (otherwise a new one will be assigned during instance creation on the destination cluster) - this can be done by adding something like --net 0:link=gnt-bridge,vlan=604,mac=aa:bb:cc:dd:ee:ff to the command line of move-instance

so basically I need to actually allocate a MAC address for each VM I move? ouch?

I was hoping i could just batch-move instances here to quickly evacuate a cluster, individually mapping MAC addresses doesn't sound like a fun time...

> I think I remember that while looking at the code of move-instance we found out that it actually does make some assumptions on the instance configuration which might lead to the NIC-configuration-related-error you stated above.

okay, that definitely sounds familiar. what's strange with this problem is that the problem occurs whether I pass a --net argument or not. it seems like there's a builtin default somewhere that conflicts with another default... without a --net option, i end up with the following nic configuration in the remote create job:

        nics: 
          - ip: 38.229.82.23
            link: br0
            mac: 06:66:38:c4:0c:23
            mode: bridged
            network: 097c2565-dab9-4a29-9519-b987718ed812
            vlan:

what's interesting is that the ip there is actually the one from the source cluster. in a sense, it's obviously incorrect as it does, indeed, have both an IP and a network field, as described, but it's not supplied by the operator. i wonder wth is going on here...

In ganeti#1696, I ended up in a situation where the instance I'm moving had both "network" and "link" parameters. That seems like a natural configuration in our cluster, but something move-instance really doesn't like, as it's passing this verbatim to create-instance which naturally freaks out on the contradictory arguments. Instead of just crashing, detect empty parameters (`not v` is a shortcut here) and null them out. This will result in those parameters being passed as empty to create-instance, or to be more accurate, not be passed at all. I've been able to fix the crash in ganeti#1696 by passing `--net 0:ip=pool,network=gnt-dal-01,mode=,link=` to move-instance with this patch. I'm not sure it's the right approach though. I'd much rather *not* have to pass `--net` at all, as I actually want to move multiple instances across clusters, and that seems incompatible with `--net` for some reason I cannot currently fathom (but which is possibly related to this problem).

anarcat · 2023-03-13T20:54:56Z

@rbott

> I think I remember that while looking at the code of move-instance we found out that it actually does make some assumptions on the instance configuration which might lead to the NIC-configuration-related-error you stated above.

i'd really love to hear where you found that code, because what I found was pretty generic, copying data around. i've made #1698 which seems to work as a stopgap measure here.

i do wonder if the right place to do this might not better be somewhere in here:

ganeti/tools/move-instance

Lines 590 to 604 in 114e59f

    
    def _GetNics(instance, override_nics):
      try:
        nics = [{
          constants.INIC_IP: ip,
          constants.INIC_MAC: mac,
          constants.INIC_MODE: mode,
          constants.INIC_LINK: link,
          constants.INIC_VLAN: vlan,
          constants.INIC_NETWORK: network,
          constants.INIC_NAME: nic_name
          } for nic_name, _, ip, mac, mode, link, vlan, network, _
            in instance["nics"]]
      except ValueError:
        raise Error("Received NIC information does not match expected format; "
                    "Do the versions of this tool and the source cluster match?")

i just can't figure out what to do with this stuff... it seems like it makes sense to inherit it, but we're actually creating garbage here because it's where we create that dict which has both network and mode for example...

at least failing here would fail early and facilitate debugging? not sure what the best way forward is here either.
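To make the question concrete, here is a minimal sketch of one possible policy for that dict: drop unset parameters, and drop mode and link whenever a network is given. This is not what #1698 does and not existing move-instance code; `normalize_nic` is a hypothetical helper, and the plain string keys are assumed to match the values of the `constants.INIC_*` names.

```python
def normalize_nic(nic):
    """Return a copy of a NIC dict that avoids the network/mode/link conflict.

    Hypothetical policy: a configured network wins over explicit mode/link.
    """
    # Drop parameters that are unset (None or empty string).
    cleaned = {key: value for key, value in nic.items() if value}
    if "network" in cleaned:
        # The destination rejects mode or link when a network is given.
        for key in ("mode", "link"):
            cleaned.pop(key, None)
    return cleaned


# The NIC configuration from the remote create job quoted earlier.
nic = {
    "ip": "38.229.82.23",
    "link": "br0",
    "mac": "06:66:38:c4:0c:23",
    "mode": "bridged",
    "network": "097c2565-dab9-4a29-9519-b987718ed812",
    "vlan": None,
}
print(normalize_nic(nic))
```

Whether silently preferring the network is right is debatable: it hides a source configuration the destination considers contradictory, whereas raising an error at this point would fail early as suggested above.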

anarcat · 2023-03-13T20:58:39Z

so i have two more PRs here, #1698 and #1697 which fix the problems i've encountered so far. i'm at this error now:

2023-03-13 20:56:10,146: Move1 INFO [Mon Mar 13 20:56:10 2023]  - WARNING: export 'export-disk1-2023-03-13_20_55_58-_1zmyfcu' on chi-node-08.torproject.org failed: Exited with status 1
2023-03-13 20:56:10,146: Move1 INFO [Mon Mar 13 20:56:10 2023] Disk 1 failed to send data: Exited with status 1 (recent output: dd: 0 bytes copied, 0.998604 s, 0.0 kB/s\ndd: 0 bytes copied, 6.00403 s, 0.0 kB/s\nsocat: E SSL_connect(): Connection refused)

i think this could be related to:

> make sure source Ganeti nodes are able to establish TCP connections on ports > 1024 to destination Ganeti nodes (inter-cluster-migration will always use the primary network of the cluster, not a secondary/alternate network which may be configured for e.g. DRBD stuff)

i have punched holes in the primary nodes, but not all nodes, so this might be what's crashing this...

and then I guess i'll catch up with your #1681... how did you actually work around that one?

rbott · 2023-03-14T11:36:34Z

> > because of the change in the instance network configuration (see above) we alter the network parameters (e.g. link=br604 turns into link=gnt-bridge,vlan=604) and also statically set the MAC address the interface had on the source cluster (otherwise a new one will be assigned during instance creation on the destination cluster) - this can be done by adding something like --net 0:link=gnt-bridge,vlan=604,mac=aa:bb:cc:dd:ee:ff to the command line of move-instance
>
> so basically I need to actually allocate a MAC address for each VM I move? ouch?

Well yes and no. If you provide a --net parameter and leave out mac, it will default to the value generate, which causes the destination Ganeti cluster to roll the dice and generate a new MAC address. If that does not cause any problems for you, you can completely ignore this. But if it does cause problems (DHCP reservations, older systems with autogenerated udev rules for ethX names etc.) you might want to retain the original MAC address. In our case we simply ask RAPI on the source cluster for the current MAC address(es) of the instance and pass them to the --net parameter of the move-instance command. In our Ansible playbook it is a simple extra task. But YMMV, it might not even be required to retain the MAC address(es) :-)
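As a rough sketch of that extra task: given the MAC addresses already fetched from the source cluster, build one --net override per NIC. `net_overrides` is a hypothetical helper, not part of move-instance. It assumes the addresses arrive as a list in NIC order (for example from the `nic.macs` field of the RAPI instance data, a field name assumed here) and that --net is given once per NIC.

```python
def net_overrides(macs, link="gnt-bridge", vlan=None):
    """Build `--net` arguments that pin each NIC to its source MAC address."""
    args = []
    for index, mac in enumerate(macs):
        params = [f"link={link}"]
        if vlan is not None:
            params.append(f"vlan={vlan}")
        params.append(f"mac={mac}")
        # One --net option per NIC, keyed by the NIC index.
        args += ["--net", f"{index}:" + ",".join(params)]
    return args


# Same values as the example earlier in the thread.
print(net_overrides(["aa:bb:cc:dd:ee:ff"], vlan=604))
```

The resulting list can be appended to the move-instance command line. If renumbered MACs are acceptable, the `mac=` part can simply be left out.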

> > I think I remember that while looking at the code of move-instance we found out that it actually does make some assumptions on the instance configuration which might lead to the NIC-configuration-related-error you stated above.
>
> i'd really love to hear where you found that code, because what I found was pretty generic, copying data around. i've made #1698 which seems to work as a stopgap measure here.

I probably should have looked at the code again before posting assumptions, sorry for that :-) But I think you have found the right spot and #1698 (along with @apoikos's annotation/review) should do the trick and solve that issue.

> and then I guess i'll catch up with your #1681... how did you actually work around that one?

Well, we took the short (and ugly) route and "hot-patched" this file on the sending node(s):

ganeti/lib/impexpd/__init__.py

Line 91 in 114e59f

SOCAT_OPENSSL_OPTS = ["verify=1", "cipher=%s" % constants.OPENSSL_CIPHERS]

...to state verify=0. As we mainly used move-instance to migrate from older clusters to 3.0 clusters, we rarely ran into this problem (mostly cases where we actually moved an instance to the wrong destination cluster and had to move it between two 3.0 clusters afterwards). But nevertheless it is actually broken for everyone right now using 3.0 and it needs a proper solution.

anarcat · 2023-03-14T13:55:51Z

On 2023-03-14 04:36:44, Rudolph Bott wrote:

> > > because of the change in the instance network configuration (see above) we alter the network parameters (e.g. `link=br604` turns into `link=gnt-bridge,vlan=604`) and also statically set the MAC address the interface had on the source cluster (otherwise a new one will be assigned during instance creation on the destination cluster) - this can be done by adding something like `--net 0:link=gnt-bridge,vlan=604,mac=aa:bb:cc:dd:ee:ff` to the command line of `move-instance`
> >
> > so basically I need to actually allocate a MAC address for each VM I move? ouch?
>
> Well yes and no. If you provide a `--net` parameter and leave out `mac`, it will default to the value `generate`, which causes the destination Ganeti cluster to roll the dice and generate a new MAC address. If that does not cause any problems for you, you can completely ignore this. But if it _does_ cause problems (DHCP reservations, older systems with autogenerated udev rules for ethX names etc.) you might want to retain the original MAC address. In our case we simply ask RAPI on the source cluster for the current MAC address(es) of the instance and pass them to the `--net` parameter of the `move-instance` command. In our Ansible playbook it is a simple extra task. But YMMV, it might not even be required to retain the MAC address(es) :-)

What I meant is that to override the error, I need to pass a `mac=` setting somehow. But yeah, I think it's okay if our MACs get renumbered. We do have a per-cluster MAC prefix anyway, so it would be odd for those VMs to be different. [...]

> > and _then_ I guess i'll catch up with your #1681... how did you actually work around that one?
>
> Well, we took the short (and ugly) route and "hot-patched" this file on the sending node(s): https://github.com/ganeti/ganeti/blob/114e59fcc9d4a7c82618569f5d6b7389a0f80123/lib/impexpd/__init__.py#L91 ...to state `verify=0`. As we mainly used `move-instance` to migrate from older clusters *to* 3.0 clusters, we rarely ran into this problem (mostly cases where we actually moved an instance to the wrong destination cluster and had to move it between two 3.0 clusters afterwards). But nevertheless it is actually broken for everyone right now using 3.0 and it needs a proper solution.

While we're talking about monkeypatching stuff here, I wonder if there's a cleaner way to bypass this than just disabling verification. In our case, this is flying over an untrusted network, so I actually really don't want to disable verification. I think. Maybe there's a way to hardcode the CA or something?

anarcat · 2023-03-15T19:27:55Z

okay, so i think this ticket can remain for documentation, i filed #1697 for the python 3 stuff, #1698 for the NIC stuff (which could also be improved) and #1699 for the commonname stuff.

what would remain here is documenting the heck out of all this.

anarcat mentioned this issue Mar 13, 2023

move-instance: fix crash in Python 3 #1697

Merged

anarcat mentioned this issue Mar 13, 2023

move-instance: allow users to empty out NIC parameters #1698

Merged
