Inter-Cluster Instance Transfer fail due to socat TLS verification #1681

Open
rbott opened this issue Aug 21, 2022 · 13 comments · May be fixed by #1699

Comments

@rbott
Member

rbott commented Aug 21, 2022

We have been successfully using the move-instance script to move instances from older clusters (e.g. based on Debian Stretch) to newer clusters (based on Debian Bullseye / Ganeti 3.0.2). However, we cannot move instances between Debian Bullseye servers.

This happens because socat is configured to verify the TLS certificate presented by the destination node:

SOCAT_OPENSSL_OPTS = ["verify=1", "cipher=%s" % constants.OPENSSL_CIPHERS]

However, with recent socat versions, verification also includes matching the hostname against the certificate's CN/SAN entries. For the connection, the destination node's IP address is used, but the cluster certificate always contains ganeti.example.com. This is hardcoded in the constants:

x509CertCn = "ganeti.example.com"
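
To make the failure mode concrete, here is a minimal sketch of the kind of name check described above. It is not socat's actual implementation: `name_matches` is a hypothetical helper, and its wildcard rule (a single leading `*.` label) is a simplifying assumption for illustration only.

```python
def name_matches(expected, common_name, san_dns_names=()):
    """Return True if `expected` matches the CN or any DNS SAN entry.

    Simplified, hypothetical rule: exact case-insensitive match, or a
    certificate name of the form "*.example.com" matching exactly one
    extra leading label.
    """
    expected = expected.lower()
    for candidate in (common_name, *san_dns_names):
        if not candidate:
            continue
        candidate = candidate.lower()
        if candidate == expected:
            return True
        if candidate.startswith("*."):
            # "*.example.com" matches "node1.example.com" but not "example.com"
            head, _, tail = expected.partition(".")
            if head and tail == candidate[2:]:
                return True
    return False


# The expected name and the name in the certificate differ: no match.
print(name_matches("ganeti.example.com", "node1.cluster.example.org"))  # False
# Both sides agree on the name: match.
print(name_matches("ganeti.example.com", "ganeti.example.com"))  # True
```

Whenever the name the client expects and the names in the peer certificate disagree, a check of this kind fails even though the certificate chain itself is valid.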

I see multiple solutions to this problem:

  • we can supply each node with a valid certificate (which contains the hostname and all primary/secondary IP addresses) and use that for the import socket server
  • we can make the verify switch configurable and leave it up to the user (with verify=1 being always broken)
  • set verify=0 to at least allow people to migrate instances again
  • something else

What would you suggest?

@anarcat
Contributor

anarcat commented Mar 13, 2023

we can supply each node with a valid certificate (which contains the hostname and all primary/secondary IP addresses) and use that for the import socket server

this would seem like the proper course of action, and i can confirm that such a junk certificate is also used in our cluster configuration here.

@anarcat
Contributor

anarcat commented Mar 14, 2023

x509CertCn = "ganeti.example.com"

just for the record, i couldn't find a trace of that variable anywhere in the code but eventually figured out the Haskell constants are kind of transpiled into Python code and it's X509_CERT_CN there.

grepping around for this, i found also this bit:

https://github.com/ganeti/ganeti/blob/114e59fcc9d4a7c82618569f5d6b7389a0f80123/lib/impexpd/__init__.py#L225-L229C64

which means this should already be working, thanks to 7bb0351 (impexpd: fix certificate verification with new socat versions, 2017-12-20)....

what error message were you actually getting? right now I'm getting:

2023-03-14 19:54:03,139: DestForMove1 INFO [Tue Mar 14 19:54:03 2023] Disk 2 failed to receive data: Exited with status 1 (recent output: socat: W ioctl(9, IOCTL_VM_SOCKETS_GET_LOCAL_CID, ...): Inappropriate ioctl for device\n0+0 records in\n0+0 records out\n0 bytes copied, 12.6247 s, 0.0 kB/s)

... but i'm not sure it's related?

@anarcat
Contributor

anarcat commented Mar 14, 2023

@rbott
Member Author

rbott commented Mar 14, 2023

just for the record, i couldn't find a trace of that variable anywhere in the code but eventually figured out the Haskell constants are kind of transpiled into Python code and it's X509_CERT_CN there.

Yeah, the Haskell and Python worlds share the same constants, but the Python constants file (_constants.py) is generated at build time from the Constants.hs Haskell file. This sometimes makes debugging the source code harder, but helps in the long run :-)
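
As a rough illustration of how the two names relate: the sketch below maps a camelCase Haskell-style name to an upper-case Python-style constant name. This is an assumption about the naming pattern only, not the actual generator used in the Ganeti build, and `camel_to_constant` is a hypothetical helper.

```python
import re


def camel_to_constant(name):
    """Convert e.g. "x509CertCn" to "X509_CERT_CN".

    An underscore is inserted wherever a lower-case letter or digit is
    directly followed by an upper-case letter.
    """
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).upper()


print(camel_to_constant("x509CertCn"))  # X509_CERT_CN
```

Knowing the pattern helps when grepping: a constant that cannot be found in the Python sources may only exist under its camelCase name in the Haskell file.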

Anyways, instead of using X509_CERT_CN for a shared cluster-wide certificate, each node could/should have its own certificate (which holds the node name, the cluster name as subject alternative names, and the related IP addresses). OR one certificate which holds all names and IP addresses. In any case that would be a rather big change to the Ganeti configuration/data on disk.

grepping around for this, i found also this bit:

https://github.com/ganeti/ganeti/blob/114e59fcc9d4a7c82618569f5d6b7389a0f80123/lib/impexpd/__init__.py#L225-L229C64

which means this should already be working, thanks to 7bb0351 (impexpd: fix certificate verification with new socat versions, 2017-12-20)....

The socat manpage states:

NOTE: Up to version 1.7.2.4 the server certificate was only checked for validity against the system certificate store or cafile or capath, but not for match with the server’s name or its IP address. Since version 1.7.3.0 socat checks the peer certificate for match with the <host> parameter or the value of the openssl-commonname option. Socat tries to match it against the certificates subject commonName, and the certificates extension subjectAltName DNS names. Wildcards in the certificate are supported.
Option groups: FD,SOCKET,IP4,IP6,TCP,OPENSSL,RETRY
Useful options: min-proto-version, cipher, verify, commonname, cafile, capath, certificate, key, compress, bind, pf, connect-timeout, sourceport, retry
See also: OPENSSL-LISTEN, TCP

I think I should do some more socat testing to qualify this issue properly. You might be on to something here :-)

what error message were you actually getting? right now I'm getting:

2023-03-14 19:54:03,139: DestForMove1 INFO [Tue Mar 14 19:54:03 2023] Disk 2 failed to receive data: Exited with status 1 (recent output: socat: W ioctl(9, IOCTL_VM_SOCKETS_GET_LOCAL_CID, ...): Inappropriate ioctl for device\n0+0 records in\n0+0 records out\n0 bytes copied, 12.6247 s, 0.0 kB/s)

The error we received was pretty clear:

Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

I think the Disk 2 failed to receive data[...] error must have its cause somewhere else. The problem only hit us with Ganeti 3.0 on Debian Bullseye. However, it seems to be related to Debian shipping socat 1.7.4.1 with Bullseye, which handles verify=1 differently than older versions, and to have nothing to do with Ganeti itself.
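
A small sketch of the version threshold from the manpage quoted above, assuming plain dotted version strings. `parse_socat_version` and `checks_peer_name` are hypothetical helpers; distribution suffixes such as a Debian revision are not handled.

```python
def parse_socat_version(text):
    """Turn a dotted version such as "1.7.4.1" into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def checks_peer_name(version_text):
    """True if this socat version matches the peer name on verify=1.

    The 1.7.3.0 threshold is taken from the socat manpage note.
    """
    return parse_socat_version(version_text) >= (1, 7, 3, 0)


print(checks_peer_name("1.7.2.4"))  # False
print(checks_peer_name("1.7.4.1"))  # True
```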

@anarcat
Contributor

anarcat commented Mar 15, 2023

Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

oh, i did get that too, actually.

the above ganeti/launchpad bugs work around the issue by completely downgrading socat, interestingly.

I think I should do some more socat testing to qualify this issue properly. You might be on to something here :-)

i've been tearing my hair out trying to get the impexpd to show the actual damn socat command it's running, i'm down to doing hot patches on the live code right now to include traces, and seriously considering a BPF trace to just show executed programs cluster-wide as well. arghl.

did you manage to get a sample of what the daemon actually executes on your end?

@anarcat
Contributor

anarcat commented Mar 15, 2023

did you manage to get a sample of what the daemon actually executes on your end?

i managed to do an execsnoop and catch this:

socat            14118  14114    0 /usr/bin/socat -ls -d -d -b1048576 -u stdin OPENSSL:204.8.99.102:38547,connect-timeout=20,retry=10,intervall=1,keepalive,keepidle=60,keepintvl=10,keepcnt=5,verify=1,cipher

... so it seems execsnoop gets only a truncated version of the args, aarghl...
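
One alternative to tracing, sketched here under the assumption that the process is still running and that this is Linux, is to read the untruncated argument list from /proc. `split_cmdline` and `read_cmdline` are hypothetical helper names; empty arguments are dropped by this simplified split.

```python
def split_cmdline(raw):
    """Split the NUL-separated contents of /proc/<pid>/cmdline."""
    return [arg.decode("utf-8", "replace") for arg in raw.split(b"\0") if arg]


def read_cmdline(pid):
    """Return the argument list of a running process (Linux only)."""
    with open("/proc/%d/cmdline" % pid, "rb") as handle:
        return split_cmdline(handle.read())


# Example on canned data, so that no real process is needed
print(split_cmdline(b"/usr/bin/socat\0-ls\0-d\0-d\0"))
```

This only helps for processes that live long enough to be inspected; for short-lived commands a tracer is still the more reliable tool.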

@anarcat
Contributor

anarcat commented Mar 15, 2023

i managed to extract the full commandline with bpftrace:

592586     68380 /usr/bin/socat -ls -d -d -b1048576 -u stdin OPENSSL:204.8.99.102:43419,connect-timeout=20,retry=10,intervall=1,keepalive,keepidle=60,keepintvl=10,keepcnt=5,verify=1,cipher=HIGH:-DES:-3DES:-EXPORT:-DH,compress=none,key=/var/run/ganeti/crypto/x509-2023-03-15_16_54_20-q4e8scoz/key,cert=/var/run/ganeti/crypto/x509-2023-03-15_16_54_20-q4e8scoz/cert,cafile=/var/run/ganeti/import-export/export-disk2-2023-03-15_17_03_50-3p2a6wa7/ca,pf=ipv4,openssl-commonname=ganeti.example.com
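
To inspect a command line like the one above, the address can be split into its options. This is a minimal sketch: `parse_socat_address` is a hypothetical helper that assumes an IPv4 host and no quoted or escaped commas, and the address string below is an abridged copy of the traced one.

```python
def parse_socat_address(address):
    """Split "TYPE:host:port,opt=val,flag" into its parts.

    Hypothetical helper for inspection only; it does not handle quoting,
    escaped commas or IPv6 hosts.
    """
    spec, *raw_options = address.split(",")
    addr_type, host, port = spec.split(":")
    options = {}
    for item in raw_options:
        key, sep, value = item.partition("=")
        # Flags without "=" (e.g. "keepalive") are stored as True
        options[key] = value if sep else True
    return addr_type, host, int(port), options


address = (
    "OPENSSL:204.8.99.102:43419,connect-timeout=20,retry=10,keepalive,"
    "verify=1,cipher=HIGH:-DES:-3DES:-EXPORT:-DH,pf=ipv4,"
    "openssl-commonname=ganeti.example.com"
)
addr_type, host, port, options = parse_socat_address(address)
print(host, options["verify"], options["openssl-commonname"])
# 204.8.99.102 1 ganeti.example.com
```

The two values that matter for this issue are then easy to see side by side: the connection goes to an IP address, while the name checked against the peer certificate is ganeti.example.com.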

@anarcat
Contributor

anarcat commented Mar 15, 2023

okay, so i did this test.

  1. generate a self-signed certificate: openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -nodes -subj '/CN=ganeti.example.com' -days 1
  2. start a socat server: socat -ls -d -d STDIO OPENSSL-LISTEN:8443,reuseaddr,forever,key=key.pem,cert=cert.pem,cafile=cert.pem
  3. connect to it with a client: socat -ls -d -d STDIO OPENSSL:localhost:8443,openssl-commonname=ganeti.example.com,verify=1,key=key.pem,cert=cert.pem,cafile=cert.pem

.. and... that works! (Note that I've also tried with all the extra arguments on the client I found in the execsnoop above, it still works.)

So it certainly seems like we're actually creating a bad certificate that does not match ganeti.example.com.

now if i try commonname=ganeti.example.net (note the .net instead of .com) I get this error:

2023/03/15 14:07:57 socat[263499] E certificate is valid but its commonName does not match hostname "ganeti.example.net"

which is the error we're getting in this issue. So it seems like the certificate generated on the import side does not use the ganeti.example.com domain!

@anarcat
Contributor

anarcat commented Mar 15, 2023

well shit:

root@chi-node-08:~# certtool -i < /run/ganeti/crypto/x509-2023-03-15_18_19_00-scnvnih3/cert | grep Subject:
        Subject: CN=chi-node-08.torproject.org

that's doing a move-instance, while the backup is being exported, before the socat, and on the source side... but it sure looks like the cert is being generated with the node name and NOT the ganeti.example.com thing!

now the trick here is that only the IP address (!? WHY?) is passed down to the API call. I traced the import-export stuff all the way up to noded.NodeRequestHandler.perspective_export_start:

ganeti/lib/server/noded.py

Lines 1247 to 1260 in 114e59f

def perspective_export_start(params):
  """Starts an export daemon.

  """
  (opts_s, host, port, instance, component, (source, source_args)) = params

  opts = objects.ImportExportOptions.FromDict(opts_s)

  return backend.StartImportExportDaemon(constants.IEM_EXPORT, opts,
                                         host, port,
                                         objects.Instance.FromDict(instance),
                                         component, source,
                                         _DecodeImportExportIO(source,
                                                               source_args))

but worse than this, it seems we enforce the host to be an IP address in the import/export daemon:

if options.host is not None and not netutils.IPAddress.IsValid(options.host):
  try:
    options.host = netutils.Hostname.GetNormalizedName(options.host)
  except errors.OpPrereqError as err:
    parser.error("Invalid hostname '%s': %s" % (options.host, err))

so it's going to be pretty hard to fix that without some magic hackery (e.g. doing a reverse DNS lookup, ouch?).

in any case, this is starting to look pretty promising...

anarcat added a commit to anarcat/ganeti that referenced this issue Mar 15, 2023
In 7bb0351 (impexpd: fix certificate verification with new socat
versions, 2017-12-20), a hostname verification was introduced to fix
socat's new (and proper) behavior of actually checking the remote
hostname during OpenSSL-protected transfers.

The problem, however, is that the hostname used was the default
`X509_CERT_CN` constant (`x509CertCn` in Haskell) which is hardcoded
to `ganeti.example.com`. In a real-world deployment, it seems like the
remote CommonName (CN) of the certificate used by the export daemon is
actually the target node name.

In my case, it meant I was getting the following error from socat
during transfers:

    Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

At first I thought socat might be doing us some trouble, but no: socat
works properly. An example is this:

 1. generate a self-signed certificate:

        openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -nodes -subj '/CN=ganeti.example.com' -days 1

 2. start a socat server:

        socat -ls -d -d  STDIO OPENSSL-LISTEN:8443,reuseaddr,forever,key=key.pem,cert=cert.pem,cafile=cert.pem

 3. connect to it with a client:

        socat -ls -d -d   STDIO  OPENSSL:localhost:8443,openssl-commonname=ganeti.example.com,verify=1,key=key.pem,cert=cert.pem,cafile=cert.pem

... which actually works, which means the `openssl-commonname`
argument actually works, and works properly. If it's changed, for
example, to `ganeti.example.net`, the above fails with the
aforementioned error message.

We fix this by doing a reverse name resolution on the provided IP
address. Now, we don't *assume* it's an IP address: this code kicks in
only if the impexpd is passed an actual IP address, but in my
experience it seems to always be the case (which is probably a
separate problem to fix).

This is rather brittle and assumes DNS will not lie, which is quite a
stretch. In our environment, however, we have end-to-end DNSSEC so we
can trust the DNS. And this beats hardcoding verify=0, which is the
other workaround that can be done to fix this issue.

Closes: ganeti#1681
@anarcat
Contributor

anarcat commented Mar 15, 2023

okay, I got this to work! i had to do some pretty nasty stuff like resolving the IP address given to impexpd as I can't figure out why or where the IP is passed instead of the hostname. but it works for me, and i figured it was worth sharing. phew!
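
The general idea of resolving the IP address back to a name can be sketched as follows. This is not the code from the referenced commit: `expected_peer_name` is a hypothetical helper, and the injectable `resolver` parameter is an assumption added so the example can run without real DNS.

```python
import ipaddress
import socket


def expected_peer_name(host, resolver=socket.gethostbyaddr):
    """Pick the name to pass to socat's openssl-commonname option.

    If `host` is an IP address, try a reverse lookup; otherwise use it
    as given. Returns None when the lookup fails so the caller can decide
    what to fall back to.
    """
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return host  # already a hostname
    try:
        name, _aliases, _addresses = resolver(host)
    except OSError:
        return None
    return name


# Example with a stand-in resolver so no real DNS query is made
fake = lambda ip: ("node1.example.org", [], [ip])
print(expected_peer_name("192.0.2.10", resolver=fake))  # node1.example.org
```

As the commit message notes, this approach is only as trustworthy as the DNS answering the reverse lookup.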

@rbott
Member Author

rbott commented Mar 17, 2023

Wow, good work - thank you! I was actually on vacation during the last week and not able to do any testing myself (or respond earlier). I will take a look at your PR now!

//Edit: OK, I'll respond here, to not mess up the discussion flow :-)

The temporary certificate with the instance's primary node name in it seems to be requested here:

result = self.rpc.call_x509_cert_create(self.instance.primary_node,

We could also try to work around this issue and have the certificate created for the IP address instead of the name. openssl-commonname also works perfectly with an IP address if the certificate's subject contains one :-) That way we would not need to rely on DNS reverse resolution. We could also have the code put both the IP address and the hostname in the certificate as SAN/subject alternative names to not break other usages (although I do not think there are any right now). However, this is a bit more complicated than I expected.

The "heavy lifting" (creating a certificate) is done by this function:

ganeti/lib/utils/x509.py

Lines 254 to 283 in 114e59f

def GenerateSelfSignedX509Cert(common_name, validity, serial_no):
  """Generates a self-signed X509 certificate.

  @type common_name: string
  @param common_name: commonName value
  @type validity: int
  @param validity: Validity for certificate in seconds
  @return: a tuple of strings containing the PEM-encoded private key and
           certificate

  """
  # Create private and public key
  key = OpenSSL.crypto.PKey()
  key.generate_key(OpenSSL.crypto.TYPE_RSA, constants.RSA_KEY_BITS)

  # Create self-signed certificate
  cert = OpenSSL.crypto.X509()
  if common_name:
    cert.get_subject().CN = common_name
  cert.set_serial_number(serial_no)
  cert.gmtime_adj_notBefore(0)
  cert.gmtime_adj_notAfter(validity)
  cert.set_issuer(cert.get_subject())
  cert.set_pubkey(key)
  cert.sign(key, constants.X509_CERT_SIGN_DIGEST)

  key_pem = OpenSSL.crypto.dump_privatekey(OpenSSL.crypto.FILETYPE_PEM, key)
  cert_pem = OpenSSL.crypto.dump_certificate(OpenSSL.crypto.FILETYPE_PEM, cert)

  return (key_pem, cert_pem)

It is called by the following backend function, which in turn is part of the noded RPC which is running on each node. Basically the Ganeti master asks the node to create a certificate with its name in it (it always passes netutils.Hostname.GetSysName() as the cert's common name).

ganeti/lib/backend.py

Lines 5102 to 5132 in 114e59f

def CreateX509Certificate(validity, cryptodir=pathutils.CRYPTO_KEYS_DIR):
  """Creates a new X509 certificate for SSL/TLS.

  @type validity: int
  @param validity: Validity in seconds
  @rtype: tuple; (string, string)
  @return: Certificate name and public part

  """
  serial_no = int(time.time())
  (key_pem, cert_pem) = \
    utils.GenerateSelfSignedX509Cert(netutils.Hostname.GetSysName(),
                                     min(validity, _MAX_SSL_CERT_VALIDITY),
                                     serial_no)

  cert_dir = tempfile.mkdtemp(dir=cryptodir,
                              prefix="x509-%s-" % utils.TimestampForFilename())
  try:
    name = os.path.basename(cert_dir)
    assert len(name) > 5

    (_, key_file, cert_file) = _GetX509Filenames(cryptodir, name)

    utils.WriteFile(key_file, mode=0o400, data=key_pem)
    utils.WriteFile(cert_file, mode=0o400, data=cert_pem)

    # Never return private key as it shouldn't leave the node
    return (name, cert_pem)
  except Exception:
    shutil.rmtree(cert_dir, ignore_errors=True)
    raise

Simply extending GenerateSelfSignedX509Cert() to accept a list of names/IPs (and add them as subject alternate names) is not possible, as the crypto library used does not support SAN certificates. It refers to the cryptography library as a higher level replacement that should be used instead of OpenSSL/crypto directly.

We could also modify CreateX509Certificate to use the node's IP address instead of netutils.Hostname.GetSysName() - this should be possible by using netutils.Hostname.GetIP(netutils.Hostname.GetSysName()) instead. However, netutils.Hostname.GetIP() also relies on DNS resolution eventually (although it is a forward lookup instead of a reverse lookup). All in all, this doesn't get us much further than your approach I guess.

So we are down to "replace the crypto/x509 library with possible side-effects" or "still rely on DNS, just at a different stage".

What would you say @anarcat?

@anarcat
Contributor

anarcat commented Mar 20, 2023

So we are down to "replace the crypto/x509 library with possible side-effects" or "still rely on DNS, just at a different stage".

i'm not sure i can follow you all the way down that rabbit hole, @rbott, but it seems to make sense. i think that, short term, the latter seems to be sane, but the former is probably a requirement in the long term anyway...

that said, i'd personally like to see the impexpd get the hostname instead of the IP address; i don't get why we pass it the IP address (or where!). it's what i identify as the root cause of the problem. i was just too tired to walk back up the stack (and i was confused by the API layer i couldn't walk up from) to figure out how to fix that....

but i think passing the hostname instead of the IP address would fix the cert issue neatly without having to change anything in the cert generation. i think we could even revert the openssl-commonname stuff since we'd be using the real hostname with a "real" cert...

is that something we could consider here?

also note that I'll be using my three patches in production tomorrow, i am not sure i will have much more time to fight this one problem, as we're already late in this project and i'm living a bit on borrowed time here. :) but if you have patches to test, that's something i could probably do...

thanks!

@rbott
Member Author

rbott commented Mar 21, 2023

So we are down to "replace the crypto/x509 library with possible side-effects" or "still rely on DNS, just at a different stage".

Actually I was a bit wrong here. It is possible with the current OpenSSL implementation to create a certificate with additional SAN entries. I will look into that (having both the IP and name in the cert surely won't do any harm here).
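
A sketch of what the SAN entries for such a certificate could look like, using the `DNS:` / `IP:` prefixes of OpenSSL's subjectAltName syntax. `build_san_value` is a hypothetical helper and the names and addresses are placeholders; attaching the resulting value to the certificate as an extension depends on the library version in use and is deliberately not shown here.

```python
import ipaddress


def build_san_value(names):
    """Build an OpenSSL-style subjectAltName string from names and IPs."""
    entries = []
    for name in names:
        try:
            ipaddress.ip_address(name)
        except ValueError:
            entries.append("DNS:%s" % name)  # not an IP, treat as DNS name
        else:
            entries.append("IP:%s" % name)
    return ",".join(entries)


print(build_san_value(["node1.example.org", "192.0.2.10"]))
# DNS:node1.example.org,IP:192.0.2.10
```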

that said, i'd personally like to see the impexpd get the hostname instead of the IP address; i don't get why we pass it the IP address (or where!).

You are right, I totally forgot to check that path as well. I will look into this and post my findings!

also note that I'll be using my three patches in production tomorrow, i am not sure i will have much more time to fight this one problem, as we're already late in this project and i'm living a bit on borrowed time here. :) but if you have patches to test, that's something i could probably do...

I am quite busy right now and will try to allocate some time for this issue soon. Along with updating the move-instance documentation :-)

anarcat added a commit to anarcat/ganeti that referenced this issue May 25, 2023
anarcat added a commit to anarcat/ganeti that referenced this issue May 29, 2023
anarcat added a commit to anarcat/ganeti that referenced this issue May 29, 2023
anarcat added a commit to anarcat/ganeti that referenced this issue May 29, 2023