Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

impexpd: verify remote socket against actual host name #1699

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

anarcat
Copy link
Contributor

@anarcat anarcat commented Mar 15, 2023

In 7bb0351 (impexpd: fix certificate verification with new socat versions, 2017-12-20), a hostname verification was introduced to fix socat's new (and proper) behavior of actually checking the remote hostname during OpenSSL-protected transfers.

The problem, however, is that the hostname used was the default X509_CERT_CN constant (x509CertCn in Haskell) which is hardcoded to ganeti.example.com. In a real-world deployment, it seems like the remote CommonName (CN) of the certificate used by the export daemon is actually the target node name.

In my case, it meant I was getting the following error from socat during transfers:

Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

At first I thought socat might be doing us some trouble, but no: socat works properly. An example is this:

  1. generate a self-signed certificate:

    openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -nodes -subj '/CN=ganeti.example.com' -days 1
    
  2. start a socat server:

    socat -ls -d -d  STDIO OPENSSL-LISTEN:8443,reuseaddr,forever,key=key.pem,cert=cert.pem,cafile=cert.pem
    
  3. connect to it with a client:

    socat -ls -d -d   STDIO  OPENSSL:localhost:8443,openssl-commonname=ganeti.example.com,verify=1,key=key.pem,cert=cert.pem,cafile=cert.pem
    

... which actually works, which means the openssl-commonname argument actually works, and works properly. If it's changed, for example, to ganeti.example.net, the above fails with the aforementioned error message.

We fix this by doing a reverse name resolution on the provided IP address. Now, we don't assume it's an IP address: this code kicks in only if the impexpd is passed an actual IP address, but in my experience it seems to always be the case (which is probably a separate problem to fix).

This is rather brittle and assumes DNS will not lie, which is quite a stretch. In our environment, however, we have end-to-end DNSSEC so we can trust the DNS. And this beats hardcoding verify=0, which is the other workaround that can be done to fix this issue.

Closes: #1681

lib/impexpd/__init__.py Fixed Show fixed Hide fixed
lib/impexpd/__init__.py Fixed Show fixed Hide fixed
lib/impexpd/__init__.py Fixed Show fixed Hide fixed
@rbott
Copy link
Member

rbott commented May 25, 2023

Hi @anarcat,

there is an overlong line which currently blocks the tests from passing:

Longest line in lib/impexpd/__init__.py is longer than 80 characters

@anarcat
Copy link
Contributor Author

anarcat commented May 25, 2023

hi!

there is an overlong line which currently blocks the tests from passing:

i believe i have fixed this particular warning, but i must say my local linters scream at Ganeti all the time, so I can't tell if I really fixed everything here. Emacs's Flycheck is flagging 161 warnings in that particular file here... those are:

    139 E111 indentation is not a multiple of 4 (lsp)
     17 E114 indentation is not a multiple of 4 (comment) (lsp)
      1 E129 visually indented line with same indent as next logical line (lsp)
      3 E501 line too long (80 > 79 characters) (lsp)
      1 F401 'time' imported but unused (lsp)
      1 Cyclomatic complexity too high: 16 (threshold 15) (lsp)

i do not believe the three E501 errors above are mine, so let's hope for the best?

also, do you know how to silence those CodeQL errors? they seem rather harmless as the IP addresses in question are server-side so not really PII, same for the (public) TLS cert...

lib/impexpd/__init__.py Dismissed Show dismissed Hide dismissed
lib/impexpd/__init__.py Dismissed Show dismissed Hide dismissed
lib/impexpd/__init__.py Dismissed Show dismissed Hide dismissed
@rbott
Copy link
Member

rbott commented May 26, 2023

There is actually a Python Test failing now:

OK
Running ganeti.impexpd_unittest.py
.F.s.E..
======================================================================
ERROR: testOptionLengthError (__main__.TestCommandBuilder)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/gntbuild.tFSgkssF/test/py/ganeti.impexpd_unittest.py", line 204, in testOptionLengthError
    self.assertRaises(errors.GenericError, builder.GetCommand)
  File "/usr/lib/python3.9/unittest/case.py", line 733, in assertRaises
    return context.handle('assertRaises', args, kwargs)
  File "/usr/lib/python3.9/unittest/case.py", line 201, in handle
    callable_obj(*args, **kwargs)
  File "/tmp/gntbuild.tFSgkssF/ganeti/impexpd/__init__.py", line 400, in GetCommand
    transport_cmd = self._GetTransportCommand()
  File "/tmp/gntbuild.tFSgkssF/ganeti/impexpd/__init__.py", line 349, in _GetTransportCommand
    (utils.ShellQuoteArgs(self._GetSocatCommand()),
  File "/tmp/gntbuild.tFSgkssF/ganeti/impexpd/__init__.py", line [234](https://github.com/ganeti/ganeti/actions/runs/5081651671/jobs/9130312541?pr=1699#step:8:235), in _GetSocatCommand
    if netutils.IPAddress.IsValid(host):
  File "/tmp/gntbuild.tFSgkssF/ganeti/netutils.py", line 404, in IsValid
    family = cls.GetAddressFamily(address)
  File "/tmp/gntbuild.tFSgkssF/ganeti/netutils.py", line 508, in GetAddressFamily
    return IP4Address(address).family
  File "/tmp/gntbuild.tFSgkssF/ganeti/netutils.py", line 619, in __init__
    if not self.IsValid(address):
  File "/tmp/gntbuild.tFSgkssF/ganeti/netutils.py", line 411, in IsValid
    socket.inet_pton(family, address)
TypeError: inet_pton() argument 2 must be str, not None

You should be able to reproduce this by running make py-tests. I always use the docker/podman + ganeti CI containers for this (e.g. run docker run -it --rm -v $PWD:/usr/src/ganeti ganeti/ci:bullseye-py3 inside the ganeti source folder).

Copy link
Contributor

@atta atta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to remove the security warning, cause by printing the IP and CN we can echo the commands to pull them if needed, otherwise we just state that it dose not match

@anarcat
Copy link
Contributor Author

anarcat commented May 29, 2023

TypeError: inet_pton() argument 2 must be str, not None

ah, well I'll treat a None IP address as invalid then... pushed a new version of this...

i couldn't figure out how to run this container in docker, for some reason, podman is failing to pull docker images right now:

anarcat@angela:ganeti$ podman pull docker.io/library/ganeti/ci
Trying to pull docker.io/library/ganeti/ci:latest...
Error: initializing source docker://library/ganeti/ci:latest: Requesting bearer token: invalid status code from registry 400 (Bad Request)
anarcat@angela:ganeti[125]$ 

go figure.

@anarcat
Copy link
Contributor Author

anarcat commented May 29, 2023

in order to remove the security warning, cause by printing the IP and CN we can echo the commands to pull them if needed, otherwise we just state that it dose not match

i'm not sure what you mean here. why can't we just put those IPs in there? it would seem tremendously useful for debugging, and in fact it's why I explicitly added those...

@anarcat
Copy link
Contributor Author

anarcat commented May 29, 2023

also, i think that codeQL is just giving us a false positive here: x509_cert_cn is just a name, it's not the actual cert, it's the hostname...

@anarcat
Copy link
Contributor Author

anarcat commented May 29, 2023

ugh, of course the PR still fails because we still need to resolve the IP, duh.

this makes no sense to me, how can this be None? that would mean the CommandBuilder gets setup without a host option? how did that even work ever?

In 7bb0351 (impexpd: fix certificate verification with new socat
versions, 2017-12-20), a hostname verification was introduced to fix
socat's new (and proper) behavior of actually checking the remote
hostname during OpenSSL-protected transfers.

The problem, however, is that the hostname used was the default
`X509_CERT_CN` constant (`x509CertCn` in Haskell) which is hardcoded
to `ganeti.example.com`. In a real-world deployment, it seems like the
remote CommonName (CN) of the certificate used by the export daemon is
actually the target node name.

In my case, it meant I was getting the following error from socat
during transfers:

    Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

At first I thought socat might be doing us some trouble, but no: socat
works properly. An example is this:

 1. generate a self-signed certificate:

        openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -nodes -subj '/CN=ganeti.example.com' -days 1

 2. start a socat server:

        socat -ls -d -d  STDIO OPENSSL-LISTEN:8443,reuseaddr,forever,key=key.pem,cert=cert.pem,cafile=cert.pem

 3. connect to it with a client:

        socat -ls -d -d   STDIO  OPENSSL:localhost:8443,openssl-commonname=ganeti.example.com,verify=1,key=key.pem,cert=cert.pem,cafile=cert.pem

... which actually works, which means the `openssl-commonname`
argument actually works, and works properly. If it's changed, for
example, to `ganeti.example.net`, the above fails with the
aforementioned error message.

We fix this by doing a reverse name resolution on the provided IP
address. Now, we don't *assume* it's an IP address: this code kicks in
only if the impexpd is passed an actual IP address, but in my
experience it seems to always be the case (which is probably a
separate problem to fix).

This is rather brittle and assumes DNS will not lie, which is quite a
stretch. In our environment, however, we have end-to-end DNSSEC so we
can trust the DNS. And this beats hardcoding verify=0, which is the
other workaround that can be done to fix this issue.

Closes: ganeti#1681
@anarcat
Copy link
Contributor Author

anarcat commented May 29, 2023

i tried to fix the None host in https://github.com/ganeti/ganeti/compare/3d98c297ec7aa6a65afffd5773c3800ea43f79e6..a270666546c73ac1c03989c23969f7782395f194

... it still doesn't work, would love some help on this.

@rbott
Copy link
Member

rbott commented Jun 12, 2023

I spent some time on this and found something with regards to

  def testOptionLengthError(self):
    testopts = [
      CmdBuilderConfig(bind="0.0.0.0" + ("A" * impexpd.SOCAT_OPTION_MAXLEN),
                       port=1234, ca="/tmp/ca"),
      CmdBuilderConfig(host="localhost", port=1234,
                       ca="/tmp/ca" + ("B" * impexpd.SOCAT_OPTION_MAXLEN)),
      CmdBuilderConfig(host="localhost", port=1234,
                       key="/tmp/key" + ("B" * impexpd.SOCAT_OPTION_MAXLEN)),
      ]

I found the error regarding the testOptionLengthError() test: this one is trying to validate that the CommandBuilder class catches overlong socat command lines. It does so by calling it three times with fake parameters. The first one actually uses "0.0.0.0AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA[...]" as value to the bind parameter. This now finds its way to the netutils validate/lookup code and causes errors.

Not tampering with the bind variable (e.g. remove the first element off the list) 'fixes' this test.

Regarding the failing TestCommandBuilder test I am pretty sure we need to change these lines:

https://github.com/anarcat/ganeti/blob/61b23cc52b3080d2a5e52d590818f5ba8a425270/test/py/ganeti.impexpd_unittest.py#L150-L151

...to actually check for the hostname instead of the ganeti.example.com constant:

self.assertTrue("openssl-commonname=%s" % host in ssl_addr)

However, that still fails. Looking at the three test cases ["localhost", "198.51.100.4", "192.0.2.99"] I would assume the following:

  • "localhost" would lead to openssl-commonname=localhost
  • "198.51.100.4" and "192.0.2.99" are a bit unclear - they either resolve to something in the test environment or the code reverts to placing ganeti.example.com in the commandline

The above assumption is backed by the fact that removing both IP addresses from the list lets the test pass.

For a more complete test we need to work with a list of dictionaries instead of a plain list, e.g. something like:

testcases = [
  { 'target': 'localhost', 'expected_hostname': 'localhost' },
  { 'target': '198.51.100.4', 'expected_hostname': constants.X509_CERT_CN },
  { 'target': '192.0.2.99', 'expected_hostname': constants.X509_CERT_CN }
]
for testcase in testcases:
  ...

However, this would still rely on those ip addresses not resolving to anything in the test environment (and it also does not cover the case where the IP actually is supposed to resolve to test all code paths). I'd say the only way to fix this would be to mock away the DNS resolving part so that the tests do not depend on anything uncontrollable. Other tests make use of the Python Mock module which should be able to fake the results of the gethostbyaddr() calls.

Hope that helps a bit, I currently lack the time to come up with an implementation of the above myself, sorry :-(

@anarcat
Copy link
Contributor Author

anarcat commented Jun 13, 2023

Hope that helps a bit, I currently lack the time to come up with an implementation of the above myself, sorry :-(

i also feel a bit out of my depth here but:

However, this would still rely on those ip addresses not resolving to anything in the test environment (and it also does not cover the case where the IP actually is supposed to resolve to test all code paths).

i think the proper way to do this would be to use some reserved IP space. This is probably what 192.0.2.99 is designed to be: it's part of TEST-NET-1 (rfc5737). Same for 198.51.100.4 (TEST-NET-2). If a test environment correctly resolves those, it's doing it wrong.

localhost could be one that is supposed to resolve as well, and i suspect it's why it's there.

I'd say the only way to fix this would be to mock away the DNS resolving part so that the tests do not depend on anything uncontrollable. Other tests make use of the Python Mock module which should be able to fake the results of the gethostbyaddr() calls.

It does sound like mocking would be the better approach here, but i'm not sure how to implement that tooling...

@rbott
Copy link
Member

rbott commented Jun 18, 2023

Hi @anarcat,

I finally found some time to look into this and created a patch against the current master based on your PR with working tests. The test code now mocks the return value of socket.gethostbyaddr() and tests the command construction both with a hostname and an IP address as parameter. What do you think of this?

diff --git a/lib/impexpd/__init__.py b/lib/impexpd/__init__.py
index a643aac16..b6fcaee99 100644
--- a/lib/impexpd/__init__.py
+++ b/lib/impexpd/__init__.py
@@ -225,8 +225,30 @@ class CommandBuilder(object):
       # For socat versions >= 1.7.3, we need to also specify
       # openssl-commonname, otherwise server certificate verification will
       # fail.
+      x509_cert_cn = host
+      # we were previously hardcoding constants.X509_CERT_CN here, but
+      # that's always ganeti.example.com while the other end generates
+      # an actual cert that matches the hostname. unfortunately here
+      # we are typically given an IP address, so let's try to find a
+      # real hostname for the certificate check
+      if netutils.IPAddress.IsValid(host):
+        # this looks like an IP address, override based on the reverse
+        # DNS
+        try:
+          x509_cert_cn, _, _ = socket.gethostbyaddr(host)
+          logging.warning("overriden IP address %s to reverse hostname %s",
+                          host,
+                          x509_cert_cn)
+        except OSError as e:
+          logging.error(
+            "failed to resolve IP address %s, reverting to default %s: %s",
+            host,
+            constants.X509_CERT_CN,
+            e,
+          )
+          x509_cert_cn = constants.X509_CERT_CN
       if self._GetSocatVersion() >= (1, 7, 3):
-        addr2 += ["openssl-commonname=%s" % constants.X509_CERT_CN]
+        addr2 += ["openssl-commonname=%s" % x509_cert_cn]
 
     else:
       raise errors.GenericError("Invalid mode '%s'" % self._mode)
diff --git a/test/py/ganeti.impexpd_unittest.py b/test/py/ganeti.impexpd_unittest.py
index 7e54f09cf..62d6849d7 100755
--- a/test/py/ganeti.impexpd_unittest.py
+++ b/test/py/ganeti.impexpd_unittest.py
@@ -44,7 +44,7 @@ from ganeti import errors
 from ganeti import impexpd
 
 import testutils
-
+import unittest.mock
 
 class CmdBuilderConfig(objects.ConfigObject):
   __slots__ = [
@@ -108,13 +108,20 @@ class TestCommandBuilder(unittest.TestCase):
           else:
             self.assertFalse(magic_cmd)
 
-        for host in ["localhost", "198.51.100.4", "192.0.2.99"]:
+        testcases = [
+          { "target": "localhost", "expected_hostname": "localhost" },
+          { "target": "198.51.100.4", "expected_hostname": "my.ganeti.org" },
+        ]
+
+        for host in testcases:
+          socket_patcher = unittest.mock.patch("socket.gethostbyaddr", return_value=(host["expected_hostname"], [], []))
+          socket_patcher.start()
           for port in [0, 1, 1234, 7856, 45452]:
             for cmd_prefix in [None, "PrefixCommandGoesHere|",
                                "dd if=/dev/hda bs=1048576 |"]:
               for cmd_suffix in [None, "< /some/file/name",
                                  "| dd of=/dev/null"]:
-                opts = CmdBuilderConfig(host=host, port=port, compress=compress,
+                opts = CmdBuilderConfig(host=host["target"], port=port, compress=compress,
                                         cmd_prefix=cmd_prefix,
                                         cmd_suffix=cmd_suffix)
 
@@ -141,15 +148,16 @@ class TestCommandBuilder(unittest.TestCase):
                   self.assertTrue(("OPENSSL-LISTEN:%s" % port) in ssl_addr)
                 elif mode == constants.IEM_EXPORT:
                   ssl_addr = socat_cmd[-1].split(",")
-                  self.assertTrue(("OPENSSL:%s:%s" % (host, port)) in ssl_addr)
+                  self.assertTrue(("OPENSSL:%s:%s" % (host["target"], port)) in ssl_addr)
                   if impexpd.CommandBuilder._GetSocatVersion() >= (1, 7, 3):
                     self.assertTrue("openssl-commonname=%s" %
-                                 constants.X509_CERT_CN in ssl_addr)
+                                 host["expected_hostname"] in ssl_addr)
                   else:
                     self.assertTrue("openssl-commonname=%s" %
                                  constants.X509_CERT_CN not in ssl_addr)
 
                 self.assertTrue("verify=1" in ssl_addr)
+          socket_patcher.stop()
 
   @testutils.RequiresIPv6()
   def testIPv6(self):
@@ -190,8 +198,6 @@ class TestCommandBuilder(unittest.TestCase):
 
   def testOptionLengthError(self):
     testopts = [
-      CmdBuilderConfig(bind="0.0.0.0" + ("A" * impexpd.SOCAT_OPTION_MAXLEN),
-                       port=1234, ca="/tmp/ca"),
       CmdBuilderConfig(host="localhost", port=1234,
                        ca="/tmp/ca" + ("B" * impexpd.SOCAT_OPTION_MAXLEN)),
       CmdBuilderConfig(host="localhost", port=1234,

@anarcat
Copy link
Contributor Author

anarcat commented Jun 26, 2023

What do you think of this?

nice, thanks! LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inter-Cluster Instance Transfer fail due to socat TLS verification
3 participants