Pattern Analysis #86

Andersama · 2019-12-29T03:17:18Z

Adds functionality to analyze the minimum and maximum # of
characters a regex may match.

hanickadot · 2019-12-29T09:23:43Z

I don't understand the motivation for this PR. What's the use case?

Andersama · 2019-12-29T16:56:44Z

In my use case I'm using it to categorize regexes at compile time into two sets: ones which may consume characters when they succeed, and ones which always do. I'm using your library with a C++ library, taocpp PEGTL, which also mimics regular expression rules but is meant for constructing more complex grammars. You can try to analyze the grammar for problems, for example whether you've created a grammar which contains an infinite loop, but in order to work that out their library needs to classify each rule under one of four categories: one of the two I've described, or two others, a rule that behaves like an alternation or a rule that behaves like a sequence. The latter two will eventually boil down into one of the first two.

The other use case would be rejecting input strings which are too short. If an input string for example was 10 characters long but we know the regex requires a bare minimum of 11, then we can reject the result at the start as opposed to processing all the rules. Rewriting the evaluation function a bit with an extra control struct at the start you could take these results and perform a size check on the start and end iterators.

If we're looking for an exact match we can also ignore input strings which are longer than the maximum.

You could also use it for the search function in order to terminate early since you'll have what would be the window size of the regex.

Say for example we construct a regex "aaaa" and run it against an input string of "aaa", you can reject the result immediately since we need at least 4 characters in sequence, but the input was only 3.

If we were searching and had a regex like "aaa" but an input string of "aaaa" we know in advance that we only need to evaluate the regex at most two times.
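A minimal sketch of the two size checks described above, assuming the minimum and maximum lengths are already known. The helper names may_match and search_attempts are made up for illustration and are not part of CTRE.

```cpp
#include <cstddef>

namespace sketch {

// Exact match: only inputs whose length lies in [min_len, max_len] can match.
constexpr bool may_match(std::size_t input_len, std::size_t min_len, std::size_t max_len) {
	return input_len >= min_len && input_len <= max_len;
}

// Search: count the start positions that still leave at least min_len characters.
constexpr std::size_t search_attempts(std::size_t input_len, std::size_t min_len) {
	return input_len < min_len ? 0 : input_len - min_len + 1;
}

// "aaaa" against the input "aaa": rejected before any rule is evaluated.
static_assert(!may_match(3, 4, 4), "input shorter than the minimum");
// Searching for "aaa" in "aaaa": at most two evaluations are needed.
static_assert(search_attempts(4, 3) == 2, "two candidate start positions");

} // namespace sketch
```

Both checks need the input length up front, so they only pay off when computing that length is cheap.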

If we had a regex like "aaa?a", one of the a characters is optional, so the minimum length string we could match is 3 (1 + 1 + 1 * 0 + 1), because that character may match anywhere between 0 and 1 times. By the same token the maximum is computed the same way and comes to 4 (1 + 1 + 1 * 1 + 1). And we can work this out without an input string.

With how you've structured the structs it becomes really quick to calculate the whole expression, because it generalizes. Any given regex could have been wrapped in a capture group, which is itself a regex, and then had one of the modifiers applied. Equivalently, any regex on its own is the same as that regex wrapped in a capture group followed by {1,1}. So the minimum or maximum for any given expression is the minimum or maximum of an inner expression multiplied by some repetition modifier.

It might stand out a bit more with the equivalent regex "(a){3,4}". There's no way for any string under 3 characters long to match this regex, because we'll need at least one character repeated at least 3 times over. Any sequence of rules adds together, and any alternation means we take the smallest minimum and the largest maximum among the alternatives. So "(a){3,4}|(b){2,3}" requires at least 2 characters and at most 4. Input strings "a" or "b" can be rejected even though they'll match the first rules to start.
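The arithmetic above can be sketched as a small compile-time analysis. This is only an illustration under stated assumptions, not the code in this PR: the node types character, sequence, alternation and repeat are hypothetical stand-ins rather than CTRE's real types, and treating an upper bound of 0 as "unbounded" is a convention chosen for the sketch.

```cpp
#include <cstddef>
#include <limits>

namespace sketch {

// Hypothetical pattern nodes (stand-ins, not CTRE's real types).
struct character {};                              // exactly one character
template <typename... Ts> struct sequence {};     // Ts matched one after another
template <typename... Ts> struct alternation {};  // any one of Ts
// Content repeated Min..Max times; Max == 0 means "unbounded" in this sketch.
template <std::size_t Min, std::size_t Max, typename Content> struct repeat {};

struct min_max { std::size_t min; std::size_t max; };

constexpr std::size_t unbounded = std::numeric_limits<std::size_t>::max();

// Saturating arithmetic, so an unbounded maximum stays unbounded.
constexpr std::size_t sat_add(std::size_t a, std::size_t b) {
	return (a > unbounded - b) ? unbounded : a + b;
}
constexpr std::size_t sat_mul(std::size_t a, std::size_t b) {
	return (a == 0 || b == 0) ? 0 : (a > unbounded / b) ? unbounded : a * b;
}

template <typename T> struct analysis; // primary template intentionally undefined

template <> struct analysis<character> {
	static constexpr min_max get() { return {1, 1}; }
};

// A sequence adds the bounds of its parts.
template <typename... Ts> struct analysis<sequence<Ts...>> {
	static constexpr min_max get() {
		const min_max parts[] = {min_max{0, 0}, analysis<Ts>::get()...};
		min_max total{0, 0};
		for (const min_max& p : parts) {
			total.min = sat_add(total.min, p.min);
			total.max = sat_add(total.max, p.max);
		}
		return total;
	}
};

// An alternation takes the smallest minimum and the largest maximum.
template <typename T, typename... Ts> struct analysis<alternation<T, Ts...>> {
	static constexpr min_max get() {
		const min_max all[] = {analysis<T>::get(), analysis<Ts>::get()...};
		min_max best = all[0];
		for (const min_max& p : all) {
			if (p.min < best.min) best.min = p.min;
			if (p.max > best.max) best.max = p.max;
		}
		return best;
	}
};

// A repetition multiplies the inner bounds by the repeat counts.
template <std::size_t Min, std::size_t Max, typename Content>
struct analysis<repeat<Min, Max, Content>> {
	static constexpr min_max get() {
		return {sat_mul(analysis<Content>::get().min, Min),
		        sat_mul(analysis<Content>::get().max, Max == 0 ? unbounded : Max)};
	}
};

// "aaa?a" -> between 3 and 4 characters
using aaa_opt_a = sequence<character, character, repeat<0, 1, character>, character>;
static_assert(analysis<aaa_opt_a>::get().min == 3 && analysis<aaa_opt_a>::get().max == 4, "aaa?a");

// "(a){3,4}|(b){2,3}" -> between 2 and 4 characters
using a_or_b = alternation<repeat<3, 4, character>, repeat<2, 3, character>>;
static_assert(analysis<a_or_b>::get().min == 2 && analysis<a_or_b>::get().max == 4, "(a){3,4}|(b){2,3}");

} // namespace sketch
```

The sketch counts characters only. It does not model captures, anchors or character classes.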

template <typename Pattern>
static constexpr auto trampoline_analysis(Pattern) noexcept;

template <typename... Patterns>
static constexpr auto trampoline_analysis(ctll::list<Patterns...>) noexcept;

template<typename T, typename R>
static constexpr auto trampoline_analysis(T, R captures) noexcept;

// calling with pattern prepare stack and triplet of iterators
template <typename Iterator, typename EndIterator, typename Pattern> 
constexpr inline auto match_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
	using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
	const analysis_results min_max_range = trampoline_analysis(ctll::list<Pattern>(), return_type{});
	const size_t input_size = std::distance(begin, end);
	//perform a single size check at the start
	if (input_size < min_max_range.first || input_size > min_max_range.second)
		return return_type{};
	else
		return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, assert_end, end_mark, accept>());
}

Andersama · 2020-09-10T05:09:52Z

I didn't consider variable-length encodings before. Looking back, here's a better sketch of what would be possible:

// calling with pattern prepare stack and triplet of iterators
template <typename Iterator, typename EndIterator, typename Pattern> 
constexpr inline auto match_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
	using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
	if constexpr (!ctre::is_variable_length_encoded<Iterator>() && std::is_same<typename std::iterator_traits<Iterator>::iterator_category, std::random_access_iterator_tag>::value) {
		constexpr auto lengths = ctre::pattern_match_minmax_characters(ctll::list<start_mark, Pattern, assert_end, end_mark, accept>(), return_type());
		//check the size of the input string
		auto length = std::distance(begin, end);
		if (length >= lengths.min && length <= lengths.max) //if not within bounds we can avoid this call
			return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, assert_end, end_mark, accept>());
		else
			return return_type{};
	} else {
		//normal unchecked size call
		return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, assert_end, end_mark, accept>());
	}
}

template <typename Iterator, typename EndIterator, typename Pattern> 
constexpr inline auto starts_with_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
	using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
	if constexpr (!ctre::is_variable_length_encoded<Iterator>() && std::is_same<typename std::iterator_traits<Iterator>::iterator_category, std::random_access_iterator_tag>::value) {
		constexpr auto lengths = ctre::pattern_match_minmax_characters(ctll::list<start_mark, Pattern, end_mark, accept>(), return_type());
		//check the size of the input string
		auto length = std::distance(begin, end);
		if (length >= lengths.min) //we only check the minimum size requirement since starts with implicitly trails w/ (.*) meaning the max is infinite
			return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
		else
			return return_type{};
	} else {
		//normal unchecked size call
		return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
	}
}

template <typename Iterator, typename EndIterator, typename Pattern> 
constexpr inline auto search_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
	using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
	
	constexpr bool fixed = starts_with_anchor(ctll::list<Pattern>{});
	if constexpr (!ctre::is_variable_length_encoded<Iterator>() && std::is_same<typename std::iterator_traits<Iterator>::iterator_category, std::random_access_iterator_tag>::value) {
		constexpr auto lengths = ctre::pattern_match_minmax_characters(ctll::list<start_mark, Pattern, end_mark, accept>(), return_type());
		//check the size of the input string
		auto length = std::distance(begin, end);

		auto it = begin;
		for (; end != it && !fixed && length >= lengths.min; ++it) { //similar to starts_with, but we loop
			if (auto out = evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>())) {
				return out;
			}
			length--;
		}

		// in case the RE is empty or fixed
		return evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
	}
	else {
		//normal unchecked size loop
		auto it = begin;
		for (; end != it && !fixed; ++it) {
			if (auto out = evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>())) {
				return out;
			}
		}

		// in case the RE is empty or fixed
		return evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
	}
}

The line const analysis_results min_max_range = trampoline_analysis(ctll::list<Pattern>(), return_type{}); from the earlier sketch would become a call to ctre::pattern_match_minmax_characters<...>()

I'm probably going to rebase this PR onto my rewrite, which does the above:
Andersama@531a9ba

Andersama · 2020-09-11T00:27:31Z

Had a thought. Pattern analysis would let you perform these transformations:

ctll::list<assert_begin, consuming sequence, assert_begin, Content...> -> ctll::list<reject>
ctll::list<assert_end, consuming sequence, assert_end, Content...> -> ctll::list<reject>

You'd only need to prove that the sequence between the anchors consumes at least 1 character, which the algorithm can provide.
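A rough sketch of the first transformation, assuming a trait that reports the minimum number of characters a node consumes. Everything here is a hypothetical stand-in defined locally for illustration (list, assert_begin, reject, character, optional_character, min_chars, simplify); none of it is CTRE's or ctll's actual code.

```cpp
#include <cstddef>
#include <type_traits>

namespace sketch {

// Local stand-ins for pattern nodes and the type list.
struct assert_begin {};
struct reject {};
struct character {};            // always consumes one character
struct optional_character {};   // may consume nothing
template <typename... Ts> struct list {};

// Assumed result of the pattern analysis: minimum characters consumed by a node.
template <typename T> struct min_chars : std::integral_constant<std::size_t, 0> {};
template <> struct min_chars<character> : std::integral_constant<std::size_t, 1> {};

// By default a list is left unchanged.
template <typename List> struct simplify { using type = List; };

// assert_begin, Middle, assert_begin, ...: if Middle must consume at least one
// character, the second assert_begin can never hold, so the list becomes reject.
template <typename Middle, typename... Rest>
struct simplify<list<assert_begin, Middle, assert_begin, Rest...>> {
	using original = list<assert_begin, Middle, assert_begin, Rest...>;
	using type = typename std::conditional<(min_chars<Middle>::value >= 1), list<reject>, original>::type;
};

static_assert(std::is_same<simplify<list<assert_begin, character, assert_begin>>::type,
                           list<reject>>::value, "consuming middle is rejected");
static_assert(std::is_same<simplify<list<assert_begin, optional_character, assert_begin>>::type,
                           list<assert_begin, optional_character, assert_begin>>::value,
              "non-consuming middle is kept");

} // namespace sketch
```

The assert_end case would follow the same shape with the anchor type swapped.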

hanickadot · 2020-09-11T06:13:26Z

Keep in mind that every such analysis or transformation instantiates a large number of templates and has a big impact on compile time.

Also optimizers can see a lot of things already:
https://compiler-explorer.com/z/MoYhff

Andersama · 2020-09-11T20:07:36Z

I guess that optimization should be expected since it would likely track the value of the pointer and work out that it would be incremented later.

It just strikes me how much compile-time cost you're paying for doing this with templates when, for the most part, regexes are relatively straightforward and could probably be turned into bytecode, passed to a compiler, optimized, and so on, and would likely give you back the same results in almost no time. Say someone were particularly crazy and made a frontend specifically for regex to LLVM and had it build out a function per regex. That doesn't work out in terms of having something native for C++. But if there were a way to generate bytecode that the compiler could take and use to define a function, a lot of this would probably be much less painful.

I really wish you weren't fighting compile-time costs, because in terms of writing optimizations you've made it pretty easy, even if the compiler's showing me up by working those out in the end anyway. For example, in #72 you mention (a*)*, which I'm fairly certain can be safely reduced to (a*). Being able to just write a function overload to do that is incredibly handy.
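A minimal sketch of what such an overload could look like. The star and character types and the simplify function are hypothetical stand-ins for illustration, not CTRE's real node types, and the sketch ignores how capture groups would be affected.

```cpp
#include <type_traits>

namespace sketch {

// Stand-in nodes: star<Content...> models "Content repeated zero or more times".
struct character {};
template <typename... Content> struct star {};

// Generic case: leave the pattern as it is.
template <typename T>
constexpr T simplify(T) { return {}; }

// More specialized overload: a star directly wrapping a star collapses to one star.
template <typename... Content>
constexpr star<Content...> simplify(star<star<Content...>>) { return {}; }

// (a*)* becomes a*
static_assert(std::is_same<decltype(simplify(star<star<character>>{})), star<character>>::value,
              "nested star collapses");
// a* is untouched
static_assert(std::is_same<decltype(simplify(star<character>{})), star<character>>::value,
              "single star unchanged");

} // namespace sketch
```

Overload resolution picks the second function for nested stars because it is the more specialized template. It removes one level of nesting per call.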

Andersama force-pushed the pattern_fold_analysis branch from 2b62f7e to 230377d Compare December 29, 2019 03:35

Pattern Analysis

bc5d07b

Adds functionality to analyze the minimum and maximum # of characters a regex may match.

Andersama force-pushed the pattern_fold_analysis branch from 230377d to bc5d07b Compare December 29, 2019 03:51

Andersama added 3 commits December 29, 2019 16:23

fix constexpr and issue with select

d44f1c8

Handle characterlike things as opposed to just characters

94b3a93

Merge branch 'master' into pattern_fold_analysis

2ee6ead
