Add fault-tolerance for cache system errors #577

Open
wants to merge 10 commits into base: main

Conversation

@johnnyshields commented Mar 18, 2022

Fixes #511.

This PR adds several features designed to increase fault-tolerance in case the cache infrastructure (Redis, Memcached, etc.) fails. This introduces a few new configs; the intention here is to make a "sensible default" that will work for 99% of real-world Rails apps, while adding customizability for the DIY crowd.

- Catch and handle all errors once per request.
- Remove the `rescuing` blocks from the store proxies; rescuing per-method (read, write, increment) is bad because (a) it may result in undefined behavior, and (b) it will trigger repeated connection timeouts if your cache is down, e.g. N * M * timeout latency, where N is the number of Rack::Attack metrics and M is the number of cache requests per metric.
- Add `Rack::Attack.allowed_errors` config. This defaults to Dalli::DalliError and Redis::BaseError.
- Add `Rack::Attack.failure_cooldown` config. This temporarily disables Rack::Attack after an error occurs (including allowed errors), to prevent cache connection latency. The default is 60 seconds.
- Add `Rack::Attack.error_handler`, which takes a Proc for custom error handling. It's probably not needed, but there may be esoteric use cases for it. You can also use the shortcut symbols :block, :throttle, and :allow to respond to errors with those behaviors.
- Add a `Rack::Attack.calling?` method, which uses Thread.current (or RequestStore, if available) to indicate that Rack::Attack code is executing. This exists so that a Rails cache error handler can re-raise an error for Rack::Attack to handle (i.e. if the error occurred while Rack::Attack was executing); refer to the readme, and see the sketch below.
- Add a "Fault Tolerance & Error Handling" section to the readme which covers all of the above.
- A bit of refactoring of the main Rack::Attack class; the code is now split into more methods.
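To make the configuration surface concrete, here is a minimal usage sketch. The setting names are the ones proposed in this PR; the values and the Rails cache-store wiring below are illustrative assumptions, not merged API:

```ruby
# config/initializers/rack_attack.rb (illustrative sketch of the proposed settings)

# Cache errors of these classes are caught and the request is allowed through.
Rack::Attack.allowed_errors = [Dalli::DalliError, Redis::BaseError]

# After any failure, skip Rack::Attack entirely for this long, so every request
# does not pay repeated cache connection timeouts while the store is down.
Rack::Attack.failure_cooldown = 60 # seconds (the proposed default)

# Optional custom handling; the shortcut symbols :block, :throttle, or :allow
# can be used instead of a Proc.
Rack::Attack.error_handler = ->(error) do
  Rails.logger.warn("rack-attack cache error: #{error.class}: #{error.message}")
end
```

If `Rails.cache` shares the same store, `Rack::Attack.calling?` lets the cache store's own error handler re-raise only the errors that happen inside Rack::Attack, so they are handled by the settings above instead of being swallowed (sketch; the keyword arguments follow Rails' RedisCacheStore `error_handler` convention):

```ruby
# config/environments/production.rb (illustrative)
config.cache_store = :redis_cache_store, {
  url: ENV["REDIS_URL"],
  error_handler: ->(method:, returning:, exception:) {
    raise exception if Rack::Attack.calling?
  }
}
```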

In addition, I have added RSpec Mocks as a development dependency, specifically for tests related to raising/handling errors. I find RSpec Mocks a heck of a lot easier to work with than MiniTest; however, if there are any MiniTest gurus out there who can suggest a pure-MiniTest approach for my tests, I'd be happy to change it. Note that RSpec Mocks is added by itself, without requiring all of RSpec.

Lastly, I'd like to give a special shoutout to Amazon MemoryDB for Redis for causing the failure that inspired me to write this PR.

@johnnyshields (Author)

@grzuy can you please review this PR?

@johnnyshields (Author)

FYI I've been running this in production now for several weeks with no issue.

@johnnyshields (Author)

@grzuy any consideration to merge this?

@timdiggins left a comment


I hope this PR can be merged. Just adding a few points and a review in case helpful

Inline review comments on lib/rack/attack.rb and README.md (outdated, now resolved)
@johnnyshields (Author)

@timdiggins I've updated the PR in response to your comments. Thanks!

@johnnyshields (Author)

@grzuy any chance to merge this? I've been using it in production for 6 months.

@johnnyshields (Author)

@grzuy please review.

@justinhoward

I love the idea of this PR. I just have one concern about it being enabled by default. If an attacker is able to influence a system to enter the bypass mode, they might be able to compromise a critical security feature in some cases.

Consider if someone uses rack-attack to throttle requests to their sign in form. They also use the same Redis cache for other application features. If an attacker is able to DOS the system so that Redis fails and rack-attack enters bypass mode, they may be able to also bypass the throttling on the sign in form.

Personally, I'd classify rack-attack as a security component, and it violates the "principle of least surprise" for a security system to automatically disable itself when under threat. I do think there are use cases where this is useful, but it should probably be something that users need to enable explicitly once they understand what it does.

Of course, I could be wrong, and maybe there are mitigating factors here I don't understand. I'm happy to be corrected. Thanks either way!

@johnnyshields (Author) commented Jan 20, 2023

@justinhoward I'm fine to disable this behavior by default in this PR. If I do that, can someone please merge?

@jrochkind

I am very interested in this too; I'm running into problem cases similar to those described in the original issue -- if your store is misbehaving, it can wind up bringing your whole app down, which is pretty disastrous. The ability to configure rack-attack to, in some cases, just ignore a failure and proceed with app processing seems crucial.

@jrochkind

Actually, now trying to trace what's going on in my app, I am realizing that when using Memcached/Dalli, similar behavior may actually already be built into Dalli? E.g. down_retry_delay reminds me of what I was looking for with failure_cooldown here...

But I think you really do want to suspend rack-attack at this higher level notwithstanding... I realize I am not sure I understand what's going on.
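For reference, the Dalli-level knob looks roughly like this (a sketch; `socket_timeout` and `down_retry_delay` are Dalli client options, the values here are illustrative):

```ruby
require "dalli"

# When a memcached server is marked down, Dalli waits this long before trying it
# again, which caps repeated connection timeouts at the driver level. It is
# similar in spirit to failure_cooldown, but it does not stop Rack::Attack itself
# from raising when the store is unreachable.
cache = Dalli::Client.new("localhost:11211",
                          socket_timeout: 1,     # seconds per socket operation
                          down_retry_delay: 30)  # seconds before retrying a down server
```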

@johnnyshields (Author)

@grzuy please review.

@henryaj commented Nov 13, 2023

@grzuy any chance of getting this into master? I just had my production app nearly taken out by a bad memcached!

@santib (Collaborator) commented Nov 14, 2023

Hi @johnnyshields, I think that the problem you are solving is legit, thanks for your contribution 🙌

As you said, this PR adds several features, which IMO makes it hard for maintainers to accept and merge. Maybe if we keep this PR as simple as possible and separate the other features/improvements (like allowed_errors, failure_cooldown, calling?, etc.) into their own PRs, it'd be easier to discuss each of them and get them approved. So I'm wondering: what would be the simplest solution that solves the main issue while also taking @justinhoward's comment into consideration?

What comes to my mind is something like:

  1. Catch one error per request (to avoid the N*M timeouts)
  2. Allow devs to handle this exception (by default re-raise it to keep existing behavior)

Maybe this could be done by keeping the existing catch statements and re-raising a new custom-tailored exception, Rack::Attack::CacheError, that bubbles up and is caught in the main method with something like:

```ruby
rescue Rack::Attack::CacheError => e
  Rack::Attack.exception_handler.call(e)
  @app.call(env)
end
```

My question is: if we do that, will it have solved the issue you experienced and the ones described in #511? What else would be missing? Does the solution have any drawbacks in terms of behavior, security, or any other aspect?

@johnnyshields (Author) commented Nov 14, 2023

@santib I am happy to try to break this into a series of smaller PRs if I can get your commitment to review and merge in a timely manner.

FWIW I've served billions of requests at TableCheck in the last ~1 year using this code, with no issues.

@johnnyshields (Author)

> I just had my production app nearly taken out by a bad memcached!

@henryaj welcome to the club!

@johnnyshields (Author) commented Nov 14, 2023

@santib

> Maybe this could be done by keeping the existing catch statements.. and introduce a custom tailored Rack::Attack::CacheError

That would be possible, but since the internal Dalli/RedisProxy classes would now be raising an error (instead of suppressing it as previously), from an app-owner perspective I don't think it makes a huge difference. TBH I think it is desirable to have the original database-driver-specific error messages rather than catching and raising a new one. By removing all the silly `rescuing {}` blocks, the code is both cleaner and has (slightly) better performance.
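Sketched out, the native-error approach looks roughly like this (a simplification, not this PR's actual code; `failure_cooldown_active?`, `start_failure_cooldown!`, and `process` are hypothetical helpers standing in for the real internals):

```ruby
def call(env)
  return @app.call(env) if failure_cooldown_active?

  process(env) # safelist / blocklist / throttle checks hit the cache in here
rescue *Rack::Attack.allowed_errors => e
  # The driver's own error class (Dalli::DalliError, Redis::BaseError, ...)
  # reaches this single rescue, so the failure is handled once per request.
  start_failure_cooldown!
  Rack::Attack.error_handler&.call(e)
  @app.call(env)
end
```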

@santib (Collaborator) commented Nov 14, 2023

Hey, just in case, no, I'm not a maintainer.

> That would be possible, but since the internal Dalli/RedisProxy classes would be now raising an error (instead of suppressing it as previously), from an app-owner perspective I don't think it makes a huge difference.

Agree, the difference is mostly for the maintenance of this gem, so that each store_proxy continues to encapsulate the specifics of each store.

Leaving here an example of what I'm referring to: main...santib:rack-attack:system-hardening-refactor

Apart from the introduced Rack::Attack::CacheError, the changes are:

- Letting the errors bubble up from the cache stores and catching them in the main flow allows us to avoid the N*M timeouts (there is just one) and solves the denial-of-service issue.
- The cache_error_handler does nothing by default (keeping existing behavior) but allows devs to notify/log about it.
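A rough illustration of the encapsulation being described (hypothetical proxy method, not the code in the linked branch):

```ruby
module Rack
  class Attack
    # Gem-owned error class, so the main flow never needs to know which store is in use.
    class CacheError < StandardError; end
  end
end

# Inside a store proxy method (hypothetical; the real proxies are per-store):
def read(key)
  @store.read(key)
rescue Dalli::DalliError => e
  # Wrap the driver-specific error and let it bubble up to the main call flow.
  raise Rack::Attack::CacheError, "#{e.class}: #{e.message}"
end
```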

@johnnyshields (Author) commented Nov 14, 2023

Yep I understand what you are proposing. IMHO that's actually more complex than my PR. My preference is to keep things as simple as possible and just use the native errors--less code, less bloat, no new not-yet-seen error types, etc.

That being said I will defer to the maintainers whether they prefer the bloat. If a maintainer can chime in here I'd appreciate it.

@jrochkind

@johnnyshields if you fork rack-attack, I'll use your fork! It does not appear to me that current maintainers of rack-attack are interested in improvements.

@johnnyshields (Author)

@jrochkind it will be the second gem I've forked this week 🤣 OK, give me a few days; I'll do it over the weekend.

@jamiemccarthy

> It does not appear to me that current maintainers of rack-attack are interested in improvements

I will note that over on #578 one of the maintainers has been engaging with me on the PR I submitted last year. He's had some great feedback, and some progress has been made just in this past week.

I'm not sure I'd give up on them just yet!

@jrochkind

@jamiemccarthy it looks to me like over on #578 you have been discussing your PR with @santib .... who said here on this issue that they are not a maintainer....

@santib, if you give detailed feedback and requests for changes that sound like they are coming from a maintainer, and people invest their time in discussing and making those changes thinking that if they meet your approval their PR will get merged by you... they may be dismayed when they find out you actually have no influence over that process...

@santib (Collaborator) commented Nov 14, 2023

@jamiemccarthy I'm sorry if I left the wrong impression 🙏

@jrochkind Don't worry, I'll make sure to always clarify I'm not a maintainer from now on 👍

What I do think is that if we make PRs simpler they'll eventually get reviewed by maintainers. But of course you don't have to trust me.

@jamiemccarthy

Oops! My misunderstanding. Well, I'm not dismayed — @santib I do think your feedback has been helpful!

In any case the repo has had 8 PRs merged in October, so it doesn't seem like it's abandoned.

@grzuy (Collaborator) commented Nov 15, 2023

Thank you @johnnyshields for your work and patience on this PR.

And thank you @timdiggins, @justinhoward, @jrochkind, @henryaj, @santib and @jamiemccarthy for all the helpful comments and feedback.

I agree with @justinhoward's and @santib's comments in that we would want to narrow this down to the smallest changeset possible that addresses your issue in production, as an opt-in setting. Given the considerable number of rack-attack users and the maturity of the gem, I would prefer that mostly nothing changes for existing users who don't opt in.

Thanks again.

@johnnyshields (Author)

@grzuy Appreciate the comment. Let me try to break this into a series of smaller PRs.

One thing that will be inevitable is that we need to raise errors from the Redis/DalliProxy classes to the parent level; currently they are swallowed inside the proxy classes themselves. Two options here:

  1. @santib has proposed to introduce a new "wrapper" error class such as Rack::Attack::CacheError
  2. I am proposing to use the existing Dalli/Redis errors, which may have more info and are easier to work with, IMHO. It makes the code simpler, and we can always go @santib's way later if there is a strong reason to do so.

Do you have a strong opinion on which way we should go?

@grzuy (Collaborator) commented Nov 18, 2023

> @jamiemccarthy it looks to me like over on #578 you have been discussing your PR with @santib .... who said here on this issue that they are not a maintainer....

Just for what it's worth, he is a maintainer now (#637).


Successfully merging this pull request may close these issues.

Don't fail when Redis is misbehaving, e.g. OOM command not allowed when used memory > 'maxmemory'