Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf: Reset CT entry for CT_REOPENED #32614

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented May 17, 2024

Reset the CT entry when an old entry has been found when looking up an entry for TCP SYN packet (CT_REOPENED). This way the old entry has no effect to the entry for the new connection and security identity, as well as proxy and nodeport related flags get set correctly for the new connection.

This prevents failing proxy redirect when the CT entry of a non-redirected connection would have been reused for a redirected connection, which has happened in the CI where an egress L7 policy was applied and an old CT entry from time without the L7 policy existed.

For this bug to trigger a pod needs to open a new TCP connection using the same (ephemeral) source port to the same destination before and after a change in policy adding an L7 policy applicable to that connection. While this is a bug that has surfaced in CI, this bug is less likely to be hit when policies are more stable. Given this we need further discussion about backporting this bug fix to the stable releases.

Fixes: #27762 (see #27762 (comment))

Datapath conntrack entries for reopened connections are fully reinitialized to fix rare L7 proxy redirect failures.

@jrajahalme jrajahalme added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. release-note/bug This PR fixes an issue in a previous release of Cilium. labels May 17, 2024
@jrajahalme jrajahalme requested a review from a team as a code owner May 17, 2024 22:21
@jrajahalme jrajahalme requested a review from markpash May 17, 2024 22:21
@jrajahalme jrajahalme marked this pull request as draft May 17, 2024 22:21
@jrajahalme jrajahalme force-pushed the bpf-ct-reopened-reset-proxy-redirect branch from cee504a to 81a8e00 Compare May 17, 2024 22:44
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme marked this pull request as ready for review May 17, 2024 23:03
@jrajahalme jrajahalme force-pushed the bpf-ct-reopened-reset-proxy-redirect branch from 81a8e00 to c901696 Compare May 19, 2024 15:37
@jrajahalme
Copy link
Member Author

Simplified bpf changes a bit.

@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme force-pushed the bpf-ct-reopened-reset-proxy-redirect branch from c901696 to 130f2ea Compare May 19, 2024 15:46
@jrajahalme
Copy link
Member Author

bpf lint fixes.

@jrajahalme
Copy link
Member Author

/test

@julianwiedmann julianwiedmann self-requested a review May 21, 2024 05:28
@julianwiedmann
Copy link
Member

Nice catch! For clarity - this change would only help TCP connections, correct? I think we need to find a solution for UDP / SCTP as well (or CT_ESTABLISHED in general). As something easy to backport, we could just follow the current approach and add the additional constraints here.

But long-term I think the CT lookups from bpf_lxc should follow the approach we've already started for Service-related CT entries, and use more precise matching in the low-level conntrack code. Which is also why I would prefer if we split the nodeport.h changes off into a separate patch (or even PR?). There's a good chance that the lookup code there is already good enough ...

@julianwiedmann julianwiedmann added feature/conntrack dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. labels May 21, 2024
@jrajahalme
Copy link
Member Author

Nice catch! For clarity - this change would only help TCP connections, correct? I think we need to find a solution for UDP / SCTP as well (or CT_ESTABLISHED in general). As something easy to backport, we could just follow the current approach and add the additional constraints here.

TCP only, due to this snippet in bpf/lib/conntrack.h (assuming SCTP does not have TCP_FLAG_SYN):

	if (unlikely(flags.value & TCP_FLAG_SYN))
		return ACTION_CREATE;

Our Envoy proxy redirects are TCP only, so to fix the Envoy issue fixing this for TCP suffices for now. This same issue could come up with DNS proxy though, so it would be good to solve this for UDP (and SCTP) as well.

But long-term I think the CT lookups from bpf_lxc should follow the approach we've already started for Service-related CT entries, and use more precise matching in the low-level conntrack code. Which is also why I would prefer if we split the nodeport.h changes off into a separate patch (or even PR?). There's a good chance that the lookup code there is already good enough ...

I'll split off the nodeport.h changes to a separate commit.

Reset the CT entry when an old entry has been found when looking up an
entry for TCP SYN packet (CT_REOPENED). This way the old entry has no
effect to the entry for the new connection and security identity, as well
as proxy and nodeport related flags get set correctly for the new
connection.

This prevents failing proxy redirect when the CT entry of a
non-redirected connection would have been reused for a redirected
connection, which has happened in the CI where an egress L7 policy was
applied and an old CT entry from time without the L7 policy existed.

Fixes: cilium#27762
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Reset the CT entry when an old entry has been found when looking up an
entry for TCP SYN packet (CT_REOPENED). This way the old entry has no
effect to the entry for the new connection and security identity, as well
as proxy and nodeport related flags get set correctly for the new
connection.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme force-pushed the bpf-ct-reopened-reset-proxy-redirect branch from 130f2ea to d813c38 Compare May 21, 2024 08:46
@jrajahalme
Copy link
Member Author

jrajahalme commented May 21, 2024

Nice catch! For clarity - this change would only help TCP connections, correct? I think we need to find a solution for UDP / SCTP as well (or CT_ESTABLISHED in general). As something easy to backport, we could just follow the current approach and add the additional constraints here.

I considered this approach, but concluded that it would easily result into more brittle code, and this simpler approach is better, for the following reasons:

  • we'd have to expand the comparison logic to cover all the possible differences between the old entry and the new state
  • this could be extracted to a helper function
  • but still the helper function could get stale
  • CT_REOPENED happens only for new connections, and we already typically create new entries for new connections; it is simpler to just do the CT_NEW behavior for CT_REOPENED

Given this it might be better to get rid of CT_REOPENED altogether and only return CT_NEW though?

Current lib/conntrack.h updates the given ct_state for CT_REOPENED, so we have to explicitly clear it. would be simpler if lib/conntrack.h simply returned CT_NEW (without touching the given *ct_state), if the policy always is to reset the CT entry anyway?

As for changes for CT_ESTABLISHED: changes in proxy redirection mid-connection will not work, but currently policy changes break existing connections for which redirection status changes, due to forward direction immediately behaving accoring to the new policy, while return direction takes the redirection status from the CT entry, that is not currently updated for CT_ESTABLISHED. I guess this breaks DNS proxy for UDP across policy changes as well, and updating a stale CT entry for CT_ESTABLISHED would fix this.

But long-term I think the CT lookups from bpf_lxc should follow the approach we've already started for Service-related CT entries, and use more precise matching in the low-level conntrack code.

Looks like ct_entry_matches_types() would have to to get much more complex if proxy redirection would be added in there?

@jrajahalme
Copy link
Member Author

/test

Comment on lines -1984 to -1988
# ifdef ENABLE_DSR
/* Clear .dsr_internal flag for old connections: */
if (ret == CT_REOPENED && ct_state->dsr_internal)
ct_update_dsr(get_ct_map4(tuple), tuple, false);
# endif /* ENABLE_DSR */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's address this part separately (#32642), there's more to it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was conditionally setting the flag to false, removed this as now the flag is false by default. IMO removal of this is the right thing to do in this PR (no functional change for the parts that were already updated for CT_REOPENED.

Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty! I took a closer look at the nodeport.h changes.

Comment on lines -2335 to 2337
/* Maybe we can be a bit more selective about CT_REOPENED?
* But we have to assume that both the CT and the SNAT entry are stale.
*/
case CT_REOPENED:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the comment without any explanation why it no longer applies, doesn't feel very helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR changes to reset the CT entry for CT_REOPENED, so there is no longer any concern for being more selective?

Comment on lines -2348 to 2349
ct_state_new.dsr_internal = 1;
ct_state_new.proxy_redirect = 0;
ct_state_new.from_l7lb = 0;
ct_state_new.dsr_internal = true;
/* all other flags are already false */

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a cleanup, correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, stemming from the fact that other sites were missing these explicit = 0 statements.

Comment on lines +2899 to 2906
case CT_REOPENED:
memset(&ct_state, 0, sizeof(ct_state));
ct_state.rev_nat_index = ct_state_svc.rev_nat_index;
fallthrough;
case CT_NEW:
ct_state.src_sec_id = src_sec_identity;
ct_state.node_port = 1;
ct_state.node_port = true;
#ifndef HAVE_FIB_IFINDEX
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still believe that teaching ct_entry_matches_types() about src_sec_id and ifindex for the case of CT_ENTRY_NODEPORT would get us much further here. And address any concerns for CT_ESTABLISHED.

We could literally just copy the code that fills the ct_state in front of the CT lookup, and say "this is what I'm expecting".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean the ct_* function would insert the new entry then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't do policy lookup before CT lookup, as we can do policy lookup on a forward direction only, and a CT lookup is needed to know if a packet is in forward or reply direction. Also, some of the CT entry contents come from the policy lookup, so it does not seem feasible to be able to give a full expected CT entry before the CT lookup.

@julianwiedmann
Copy link
Member

julianwiedmann commented May 21, 2024

Given this it might be better to get rid of CT_REOPENED altogether and only return CT_NEW though?

Current lib/conntrack.h updates the given ct_state for CT_REOPENED, so we have to explicitly clear it. would be simpler if lib/conntrack.h simply returned CT_NEW (without touching the given *ct_state), if the policy always is to reset the CT entry anyway?

If we follow your change suggestion here, that would be my conclusion as well. We're already clearing quite a few parts of the entry, so why not go all the way. But I don't have sufficient historical context on why we've introduced CT_REOPENED in the first place. Therefore I'm a bit hesitant to just drop it when it feels inconvenient.

As for changes for CT_ESTABLISHED: changes in proxy redirection mid-connection will not work, but currently policy changes break existing connections for which redirection status changes, due to forward direction immediately behaving accoring to the new policy, while return direction takes the redirection status from the CT entry, that is not currently updated for CT_ESTABLISHED. I guess this breaks DNS proxy for UDP across policy changes as well, and updating a stale CT entry for CT_ESTABLISHED would fix this.

Ack. So we need to decide whether this PR is only about fixing the L7 proxy aspect, or wants to address fuzzy CT lookups as a whole.

@jrajahalme
Copy link
Member Author

Ack. So we need to decide whether this PR is only about fixing the L7 proxy aspect, or wants to address fuzzy CT lookups as a whole.

I would like this PR to be a backportable bug fix for the l7 proxy redirect issue/flake. In that light maybe I should remove the cleanups as well to make this truly minimal?

THB, I don't know if the nodeport.h changes are absolutely necessary for the l7 proxy redirect fix, though.

@jrajahalme
Copy link
Member Author

Given this it might be better to get rid of CT_REOPENED altogether and only return CT_NEW though?
Current lib/conntrack.h updates the given ct_state for CT_REOPENED, so we have to explicitly clear it. would be simpler if lib/conntrack.h simply returned CT_NEW (without touching the given *ct_state), if the policy always is to reset the CT entry anyway?

If we follow your change suggestion here, that would be my conclusion as well. We're already clearing quite a few parts of the entry, so why not go all the way. But I don't have sufficient historical context on why we've introduced CT_REOPENED in the first place. Therefore I'm a bit hesitant to just drop it when it feels inconvenient.

Here's the commit that added CT_REOPENED` (#13340):

commit ada53a91a98618d24dad2abda97dd9c60923003c
Author: Zang Li <zangli@google.com>
Date:   Mon Sep 28 12:15:14 2020 -0700

    Add the return code CT_REOPENED for conntrack table lookup.
    
    When a tcp SYN packet hits a conntrack entry that is in close-wait state,
    the entry timeout will be modified and the entry will be reopened. But
    because the return code is CT_ESTABLISHED, policy-verdict event will not
    be generated even this is a new connection. The patch fix the issue by
    returning the code CT_REOPENED instead in this case. Except for policy
    verdict events, all other handlings are the same as if CT_ESTABLISHED
    is returned.

It looks to me CT_REOPENED was added to get policy verdicts also for re-opened connections. Furthermore, from the discussion therein I'd conclude that CT_REOPENED is in fact more likely to be a new connection (due to CT entry lingering longer than TCP state at the endpoints) rather than the same connection really being re-opened from TCP viewpoint. @joestringer argued in the PR that even a real TCP re-open should likely be policy enforced like a new connection. This is doubly so if there is a policy change between the original SYN and the new one!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. feature/conntrack release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Conformance E2E: client-egress-l7-named-port/pod-to-pod: command terminated with exit code 28 (timeout)
2 participants