move-instance difficult to use and ultimately fails #1696
Comments
oh, and for what it's worth, I have a manual procedure for moving VMs around with export/import documented here: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ganeti/#migrating-a-vm-between-clusters it's basically, on the source node:
and on the target node:
... so that works, but it's still a manual process with multiple steps, and it's all a little error-prone: several long-running processes punctuated by manual "copy-paste" steps, which is not ideal for large clusters. (And yes, I could also write my own automation to speed this up, but then i'd be rewriting move-instance, wouldn't i? :)
Hi @anarcat, we have used the tool a lot in all of our Ganeti 2.16 -> 3.0 Upgrades (basically we set up new small clusters with fresh hardware and Ganeti 3.0, moved some instances, re-purposed/re-installed older nodes where possible and added them to the new cluster(s), moved some more instances etc.). We also stumbled upon some bugs and/or missing features which are fixed in master here, here and here. You can use the script standalone without the rest of the tree directly from master to carry out the migrations. We ended up with the following pre-setup:
We then used an Ansible playbook to carry out the actual instance migrations, but that happened mainly because it integrated more easily into other internal workflows, which are not relevant here. It did the following:
Our instance scenario:
I think I remember that while looking at the code of
The above is far from perfect and still yields some issues that should actually be fixed upstream. But it worked well for us with several hundred instances so far. Hope that helps a bit :-)
We have also used the same approach to move instances between Ganeti 3.0 clusters. However, due to an issue with more recent
More information can be found in this issue
Oh, and to add something more useful to this issue: I would suggest a) extending the documentation with more guidance, example commands and pitfalls, and b) of course fixing the open/known issues, e.g. the setting of the shared secret, which clearly is a bug. I might find some time in the next few days to extend the documentation.
Any move-instance call without --opportunistic-tries currently crashes with this:
      File "/usr/lib/ganeti/tools/move-instance", line 1132, in <module>
        main()
      File "/usr/share/ganeti/3.0/ganeti/rapi/client.py", line 274, in wrapper
        return fn(*args, **kwargs)
      File "/usr/lib/ganeti/tools/move-instance", line 1069, in main
        CheckOptions(parser, options, args)
      File "/usr/lib/ganeti/tools/move-instance", line 1007, in CheckOptions
        _CheckAllocatorOptions(parser, options)
      File "/usr/lib/ganeti/tools/move-instance", line 941, in _CheckAllocatorOptions
        if not options.iallocator and (options.opportunistic_tries > 0):
    TypeError: '>' not supported between instances of 'NoneType' and 'int'
It is said this was a problem related to the 2to3 migration, but I haven't investigated fully. See: ganeti#1696
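For anyone following along, here is a minimal sketch of why that comparison fails under Python 3 and one way it could be made safe. It assumes an optparse-style parser where `--opportunistic-tries` has no default (so it parses to `None`); `build_parser`, `wants_opportunistic`, `check_allocator_options` and the error text are hypothetical names for illustration, not the actual move-instance code or the upstream fix.

```python
import optparse


def build_parser():
    """Hypothetical stand-in for the relevant move-instance options."""
    parser = optparse.OptionParser()
    parser.add_option("--iallocator", dest="iallocator")
    # No default: the attribute is None when the flag is not given.
    parser.add_option("--opportunistic-tries", dest="opportunistic_tries",
                      type="int")
    return parser


def wants_opportunistic(options):
    """None-safe version of `options.opportunistic_tries > 0`."""
    tries = options.opportunistic_tries
    return tries is not None and tries > 0


def check_allocator_options(options):
    """Return an error message (illustrative wording) or None if fine."""
    if not options.iallocator and wants_opportunistic(options):
        return "opportunistic tries need an iallocator"
    return None


if __name__ == "__main__":
    opts, _ = build_parser().parse_args([])
    try:
        # This is the shape of the comparison from the traceback.
        opts.opportunistic_tries > 0
    except TypeError as err:
        print("unpatched check fails:", err)
    print("patched check:", check_allocator_options(opts))
```

Python 2 allowed ordering comparisons between `None` and an integer, which is why the original check only started crashing after the port.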
wow, that's all extremely useful! that
amazing, I converged over the exact same thing, probably because of the exact same bug, see ganeti/instance-debootstrap#18
so basically I need to actually allocate a MAC address for each VM I move? ouch? I was hoping i could just batch-move instances here to quickly evacuate a cluster, individually mapping MAC addresses doesn't sound like a fun time...
okay, that definitely sounds familiar. what's strange with this problem is that the problem occurs whether I pass a
what's interesting there is that the
In ganeti#1696, I ended up in a situation where the instance I'm moving had both "network" and "link" parameters. That seems like a natural configuration in our cluster, but something move-instance really doesn't like, as it's passing this verbatim to create-instance which naturally freaks out on the contradictory arguments. Instead of just crashing, detect empty parameters (`not v` is a shortcut here) and null them out. This will result in those parameters being passed as empty to create-instance, or to be more accurate, not be passed at all. I've been able to fix the crash in ganeti#1696 by passing `--net 0:ip=pool,network=gnt-dal-01,mode=,link=` to move-instance with this patch. I'm not sure it's the right approach though. I'd much rather *not* have to pass `--net` at all, as I actually want to move multiple instances across clusters, and that seems incompatible with `--net` for some reason I cannot currently fathom (but which is possibly related to this problem).
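To illustrate the idea in that commit message, here is a rough sketch of nulling out empty NIC parameters. It is not the actual patch: `parse_net_option`, `null_out_empty` and `drop_unset` are hypothetical helpers, and the `--net` parsing is simplified to `index:key=value,...` for the example.

```python
def parse_net_option(value):
    """Split e.g. '0:ip=pool,mode=,link=' into (0, {'ip': 'pool', ...})."""
    index, _, rest = value.partition(":")
    params = {}
    for item in rest.split(","):
        if item:
            key, _, val = item.partition("=")
            params[key] = val
    return int(index), params


def null_out_empty(nic_params):
    """Replace empty values with None; `not v` is the shortcut used here.

    Note that `not v` also matches 0 and False, which is fine for the
    string-valued parameters in this sketch.
    """
    return {k: (None if not v else v) for k, v in nic_params.items()}


def drop_unset(nic_params):
    """Leave out parameters that were nulled, so they are not passed at all."""
    return {k: v for k, v in nic_params.items() if v is not None}


if __name__ == "__main__":
    idx, nic = parse_net_option("0:ip=pool,network=gnt-dal-01,mode=,link=")
    print(idx, drop_unset(null_out_empty(nic)))
```

With the `--net` value quoted above, only `ip` and `network` survive, so the contradictory `mode` and `link` values never reach instance creation.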
i'd really love to hear where you found that code, because what I found was pretty generic, copying data around. i've made #1698 which seems to work as a stopgap measure here. i do wonder if the right place to do this might not better be somewhere in here: Lines 590 to 604 in 114e59f
i just can't figure out what to do with this stuff... it seems like it makes sense to inherit it, but we're actually creating garbage here because it's where we create that dict which has both. at least failing here would fail early and facilitate debugging? not sure what the best way forward is here either.
so i have two more PRs here, #1698 and #1697 which fix the problems i've encountered so far. i'm at this error now:
i think this could be related to:
i have punched holes in the primary nodes, but not all nodes, so this might be what's crashing this... and then I guess i'll catch up with your #1681... how did you actually work around that one?
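Well yes and no. If you provide a `--net` parameter and leave out `mac`, it will default to the value of `generate`, which will cause the destination Ganeti cluster to roll the dice and generate a new MAC address. If that does not cause any problems for you, you can completely ignore this. But if it _does_ cause problems (DHCP reservations, older systems with autogenerated udev rules for ethX names etc.) you might want to retain the original MAC address. In our case we simply ask RAPI on the source cluster for the current MAC address(es) of the instance and pass them to the `--net` parameter of the `move-instance` command. In the case of our Ansible playbook it is a simple extra task. But YMMV, it might not even be required to retain the MAC address(es) :-)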
I probably should have looked at the code again before posting assumptions, sorry for that :-) But I think you have found the right spot and #1698 (along with @apoikos annotation/review) should do the trick and solve that issue.
Well, we took the short (and ugly) route and "hot-patched" this file on the sending node(s): ganeti/lib/impexpd/__init__.py Line 91 in 114e59f
...to state `verify=0`. As we mainly used `move-instance` to migrate from older clusters *to* 3.0 clusters, we rarely ran into this problem (mostly cases where we actually moved an instance to the wrong destination cluster and had to move it between two 3.0 clusters afterwards). But nevertheless it is actually broken for everyone right now using 3.0 and it needs a proper solution.
On 2023-03-14 04:36:44, Rudolph Bott wrote:
> > because of the change in the instance network configuration (see above) we alter the network parameters (e.g. `link=br604` turns into `link=gnt-bridge,vlan=604`) and also statically set the MAC address the interface had on the source cluster (otherwise a new one will be assigned during instance creation on the destination cluster) - this can be done by adding something like `--net 0:link=gnt-bridge,vlan=604,mac=aa:bb:cc:dd:ee:ff` to the command-line of `move-instance`
>
> so basically I need to actually allocate a MAC address for each VM I move? ouch?
Well yes and no. If you provide a `--net` parameter and leave out `mac`, it will default to the value of `generate`, which will cause the destination Ganeti cluster to roll the dice and generate a new MAC address. If that does not cause any problems for you, you can completely ignore this. But if it _does_ cause problems (DHCP reservations, older systems with autogenerated udev rules for ethX names etc.) you might want to retain the original MAC address. In our case we simply ask RAPI on the source cluster for the current MAC address(es) of the instance and pass them to the `--net` parameter of the `move-instance` command. In the case of our Ansible playbook it is a simple extra task. But YMMV, it might not even be required to retain the MAC address(es) :-)
What I meant is that to override the error, I need to pass a `mac=`
setting somehow. But yeah, I think it's okay if our MACs get
renumbered. We do have a per-cluster MAC prefix anyway, so it would be
odd for those VMs to be different.
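For the case where the MACs do need to be retained, the per-NIC `--net` arguments described above could be assembled along these lines. This is only a sketch: `build_net_option` and `net_args_preserving_macs` are hypothetical helpers, and the list of MAC addresses is assumed to have been fetched from the source cluster beforehand (e.g. via RAPI); the fetching itself is not shown.

```python
def build_net_option(index, **params):
    """Format one --net value, e.g. '0:link=gnt-bridge,vlan=604,mac=...'.

    Parameters set to None are left out entirely.
    """
    parts = ["%s=%s" % (key, val) for key, val in params.items()
             if val is not None]
    return "%d:%s" % (index, ",".join(parts))


def net_args_preserving_macs(macs, link, vlan=None):
    """Build the --net arguments for move-instance, one pair per NIC.

    `macs` is assumed to be the ordered list of MAC addresses the
    instance had on the source cluster.
    """
    args = []
    for index, mac in enumerate(macs):
        args += ["--net",
                 build_net_option(index, link=link, vlan=vlan, mac=mac)]
    return args


if __name__ == "__main__":
    print(net_args_preserving_macs(["aa:bb:cc:dd:ee:ff"],
                                   link="gnt-bridge", vlan=604))
```

Leaving `mac` out of the parameters gives the "renumber on the destination" behaviour discussed above, so the same helper covers both cases.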
[...]
> and _then_ I guess i'll catch up with your #1681... how did you actually work around that one?
Well, we took the short (and ugly) route and "hot-patched" this file on the sending node(s):
https://github.com/ganeti/ganeti/blob/114e59fcc9d4a7c82618569f5d6b7389a0f80123/lib/impexpd/__init__.py#L91
...to state `verify=0`. As we mainly used `move-instance` to migrate from older clusters *to* 3.0 clusters, we rarely ran into this problem (mostly cases where we actually moved an instance to the wrong destination cluster and had to move it between two 3.0 clusters afterwards). But nevertheless it is actually broken for everyone right now using 3.0 and it needs a proper solution.
While we're talking about monkeypatching stuff here, I wonder if there's
a cleaner way to bypass this than just disabling verification. In our
case, this is flying over an untrusted network, so I actually really
don't want to disable verification. I think. Maybe there's a way to
hardcode the CA or something?
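As a very rough sketch of what "keep verification but pin the CA" might look like: socat's OPENSSL addresses accept `verify=`, `cafile=` and `commonname=` options, so the option list could in principle stay at `verify=1`. Everything else here is an assumption: the helper names, the file path and the hostnames are hypothetical, and I have not checked whether impexpd can be patched this way or whether it would actually resolve #1681.

```python
def socat_openssl_opts(ca_file, expected_cn=None):
    """Build socat OPENSSL options that keep peer verification enabled.

    `ca_file` is a CA bundle to trust; `expected_cn` optionally overrides
    the name the peer certificate is checked against.
    """
    opts = ["verify=1", "cafile=%s" % ca_file]
    if expected_cn:
        opts.append("commonname=%s" % expected_cn)
    return opts


def socat_connect_address(host, port, opts):
    """Format a socat OPENSSL connect address with the given options."""
    return "OPENSSL:%s:%d,%s" % (host, port, ",".join(opts))


if __name__ == "__main__":
    # Hypothetical path and names, for illustration only.
    opts = socat_openssl_opts("/path/to/remote-cluster-ca.pem",
                              expected_cn="ganeti.example.net")
    print(socat_connect_address("node1.example.net", 12345, opts))
```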
I'm trying to migrate instances between two Ganeti clusters. I discovered the `move-instance` command with great anticipation, but I'm having a hard time making it work.
At first, it would just crash with a backtrace in Debian bullseye:
That's due to this code:
ganeti/tools/move-instance, line 941 in 114e59f
If I pass `--opportunistic-tries=1`, it tells me:
So, basically, right now, you must use:
According to @apoikos (on IRC), the `TypeError` is a python2-to-3 leftover...
The next problem I had with `move-instance` was a `ganeti.rapi.client.Error: Password not specified`, but that was me failing at setting up the RAPI users. I also got `ganeti.rapi.client.GanetiApiError: 401 Unauthorized: No permission -- see authorization schemes` on the destination cluster. Maybe the docs could be improved to lead the operator the right way ("check your RAPI users again"). Having a way to test the users out of band (say with `curl`) would also be useful here.
Then I had another error which was pretty opaque:
So that might seem obvious but I did copy the secret over and ran:
So it seems the bug there is that the `--cluster-domain-secret=` argument actually fails to replace the secret on the cluster. I had to manually copy the `cluster-domain-secret` file in `/var/lib/ganeti` and restart the server for that to work.
But what completely blocked me is this:
It looks like the source node is encoding NIC information in the backup and the target node is somewhat unhappy with it. I'm not sure how to debug this: I'm lost in the stack between the client and server method definitions and I don't actually understand what's going on so much.
Did anyone get that thing to work at all? What am I doing wrong?
Should I open separate issues for those things?