Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow regex under certain circumstances involving \b #8166

Open
mrckzgl opened this issue Mar 25, 2024 · 4 comments
Open

Slow regex under certain circumstances involving \b #8166

mrckzgl opened this issue Mar 25, 2024 · 4 comments

Comments

@mrckzgl
Copy link

mrckzgl commented Mar 25, 2024

Environment: Linux 6.5.0-26-generic #26~22.04.1-Ubuntu

Consider following minimal example:

@time_hash = {}

def perf_time(id)
  @time_hash[id] = Time.now
end

def perf_time_end(id)
  puts("#{id.to_s}: #{format('%.5f', Time.now - @time_hash[id])}")
end

test_string_mixed = "this is a copied colophon
انتهى هذا الكتاب من نسخه بحمد الله وعونه ... احمد بن عبد الله بن الحاج احمد بن الحاج محمد بن سعيد بن يحيى الحاحي ثم التغمامي
96v and the same again on 97r
انتهى هذا الكتاب من نسخه بحمد الله وعونه ... عبد العزيز بن محمد بن الحاج محمد بن احمد المكنى اشكربان الاندلسي الاشبلي اصلا المراكشي داراً ومنشأ ... نسخه لاخيه في الله الشيخ علي ابن ابراهيم بن الحاج السعدي الهوزالي/الهوزاىي ... نجز من نسخه بتمازت في زاوية الولي الصالح البركة القطب الرباني ابي السرور سيدي عياد بن عبد الله السوسي وكان الفراغ منه عشية يوم الخميس العشرين من ربيع الثاني من سنة ثلاث واربعين وماية والف"

test_string_ascii = "this is a copied colophon
dddddddd dd dddddd dddd dd dddd dd dddd ddddd dd dddd ddddd dd dddd ddd dd dddd ... ddddd dddd dddd dddd dd dddddv ddd ddddd
96v and the same again on 97r
dddd ddddd ddddddd ddd ddd dd dddddd dddd dd ddddddd dddddd ddd dddd ddd dddddd dddd dddddd dddd ddd dd dddd dddd dddddd ddd ddddddd ddddd dddddd dddddd ddddd ddddd dd dddddd dddd dd ddd ... dddddddd dddddddd dddddd ddddd dd ddddddd ddd ddd ddddd dddd dd dddd dddd ... ddddd dddd dddddddd ddd dddddd ddddddd ddddddd dddddd dddd dd dddd ddddd dd dddd dd dddddd ddd ... ddddd dddd dddd dddd dd dddddd ddd ddddd"

perf_time("op te \\b mixed")
test_string_mixed.match?(/\b[^\b]*op[^\b]*\b.*[^\b]*te[^\b]*\b.*/mi)
perf_time_end("op te \\b mixed")

perf_time("op te \\w mixed")
test_string_mixed.match?(/\w[^\w]*op[^\w]*\w.*[^\w]*te[^\w]*\w.*/mi)
perf_time_end("op te \\w mixed")

perf_time("op te \\b ascii")
test_string_ascii.match?(/\b[^\b]*op[^\b]*\b.*[^\b]*te[^\b]*\b.*/mi)
perf_time_end("op te \\b ascii")

perf_time("te op \\b mixed")
test_string_mixed.match?(/\b[^\b]*te[^\b]*\b.*[^\b]*op[^\b]*\b.*/mi)
perf_time_end("te op \\b mixed")

perf_time("te op \\b ascii")
test_string_ascii.match?(/\b[^\b]*te[^\b]*\b.*[^\b]*op[^\b]*\b.*/mi)
perf_time_end("te op \\b ascii")

Here are timing results (in seconds) for jruby-9.4.6.0 (very similar timings in current jruby-head):

op te \b mixed: 11.01084
op te \w mixed: 0.04852
op te \b ascii: 7.47750
te op \b mixed: 0.00185
te op \b ascii: 0.00026

For ruby 3.1.3:

op te \b mixed: 7.49875
op te \w mixed: 0.00360
op te \b ascii: 3.17154
te op \b mixed: 0.00020
te op \b ascii: 0.00000

For ruby 3.2.1:

op te \b mixed: 0.00056
op te \w mixed: 0.00043
op te \b ascii: 0.00015
te op \b mixed: 0.00022
te op \b ascii: 0.00000

So the issue is that for certain (but not all) regex employing \b there are drastic performance issues when run against multi line strings consisting of a lot of words. Issue does not happen if instead of \b \w is used. At first, I thought it may be related to the arabic characters, but it is not, happening also on a purely ASCII string. The issue is present also in regular ruby, but on 3.2 it seems fixed. I haven't found a matching ruby lang issue, if anyone knows I would be interested.

Question for me is if this will be fixed "automatically", once jruby targets ruby 3.2 compatibility (e.g. if the underlying regex engine is purely implemented in ruby), or if a separate fix is needed?

@headius
Copy link
Member

headius commented Mar 26, 2024

This is likely an optimization that standard ruby has implemented in their regex engine that we are missing. It would be an enormous help if you can figure out what they changed! You might try checking their bug tracker to see if you can find a similar report.

@lopex
Copy link
Member

lopex commented Mar 26, 2024

@headius
Copy link
Member

headius commented Mar 26, 2024

@lopex That would make sense. Along with that change came the linear_time? method. I don't understand the implementation yet but it doesn't seem too huge.

@enebo
Copy link
Member

enebo commented Mar 27, 2024

@headius @lopex and I briefly looked at how onigmo did linear_time? and the method they use is way more than what is needed for just that feature. The method it calls does quite a bit more for what I guess for caching logic mentioned above? I think we determined it really only is non-linear if we push a StackEntry (which is not very many joni opcodes). @lopex you can correct me if that is wrong.

In any case this if this issue does involve regexp cache then perhaps we do need the whole method :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants