Pattern Analysis #86

Open
wants to merge 4 commits into
base: master
from

Conversation

Andersama
Contributor

Adds functionality to analyze the minimum and maximum # of
characters a regex may match.

@hanickadot
Owner

I don't understand motivation of the PR. What's the use-case?

@Andersama
Contributor Author

Andersama commented Dec 29, 2019

In my use case I'm using it to categorize regexes at compile time into two sets: ones that may consume characters when they succeed and ones that always do. I'm using your library together with taocpp's PEGTL, a C++ library that also mimics regular expression rules but is used to construct more complex grammars. PEGTL can analyze a grammar for problems, for example a grammar that contains an infinite loop, but to work that out it needs to place every rule under one of four categories: the two I've described, plus a rule that behaves like an alternation and a rule that behaves like a sequence. The latter two eventually boil down to one of the first two.

The other use case would be rejecting input strings which are too short. If an input string is 10 characters long, for example, but we know the regex requires a bare minimum of 11, then we can reject it at the start instead of processing all the rules. With a small rewrite of the evaluation function, say an extra control struct at the start, you could take these results and perform a size check on the start and end iterators.

If we're looking for an exact match we can also ignore input strings which are longer than the maximum.

You could also use it for the search function in order to terminate early since you'll have what would be the window size of the regex.

Say for example we construct a regex "aaaa" and run it against an input string of "aaa", you can reject the result immediately since we need at least 4 characters in sequence, but the input was only 3.

If we were searching and had a regex like "aaa" but an input string of "aaaa" we know in advance that we only need to evaluate the regex at most two times.
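
A rough sketch of that window arithmetic, assuming fixed-width characters. The name max_search_attempts is made up for illustration and is not part of CTRE:

```cpp
#include <cstddef>

namespace window_sketch {

// Number of start positions worth trying when searching: a match that needs
// at least min_match_length characters cannot start closer than that to the end.
constexpr std::size_t max_search_attempts(std::size_t input_length, std::size_t min_match_length) {
	return input_length < min_match_length ? 0 : input_length - min_match_length + 1;
}

// "aaa" searched in "aaaa": only offsets 0 and 1 can possibly match.
static_assert(max_search_attempts(4, 3) == 2);
// "aaaa" matched against "aaa": nothing to try at all.
static_assert(max_search_attempts(3, 4) == 0);

} // namespace window_sketch
```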

If we had a regex like "aaa?a", one of the a characters is optional, so the minimum length we could match is 3 (1 + 1 + 1 * 0 + 1), because the optional character matches anywhere between 0 and 1 times. By the same token the maximum is 4 (1 + 1 + 1 * 1 + 1). And we can work this out without an input string.

The way you've structured the structs makes it really quick to calculate the whole expression, because it generalizes: any given regex could be wrapped in a capture group, which is itself a regex, and then have one of the modifiers applied, and any regex on its own is equivalent to that regex wrapped in a capture group followed by {1,1}. So the minimum or maximum for any given expression is the minimum or maximum of the inner expression multiplied by the matching bound of its repetition modifier.

It might stand out a bit more with the equivalent regex "(a){3,4}". There's no way for any string under 3 characters long to match this regex, because we need one character repeated at least 3 times over. A sequence of rules adds its bounds together, and an alternation takes the smallest minimum and the largest maximum across its branches. So "(a){3,4}|(b){2,3}" requires at least 2 characters and at most 4. Input strings "a" or "b" can be rejected even though they'd match the first rule to start.
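
To make the rules above concrete, here's a minimal standalone sketch. The node types (character, sequence, alternation, repeat) and the analyze function are hypothetical stand-ins I'm assuming for illustration, not CTRE's actual types or the code in this PR:

```cpp
#include <cstddef>
#include <limits>

namespace length_sketch {

struct length_range {
	std::size_t min;
	std::size_t max;
};

// Sentinel for "no upper bound" (e.g. the max of a* or a+).
inline constexpr std::size_t unbounded = std::numeric_limits<std::size_t>::max();

// Saturating arithmetic so that unbounded stays unbounded.
constexpr std::size_t sat_add(std::size_t a, std::size_t b) {
	return (a == unbounded || b == unbounded || a > unbounded - b) ? unbounded : a + b;
}
constexpr std::size_t sat_mul(std::size_t a, std::size_t b) {
	if (a == 0 || b == 0) return 0;
	if (a == unbounded || b == unbounded || a > unbounded / b) return unbounded;
	return a * b;
}

// Hypothetical pattern nodes.
template <char C> struct character {};
template <typename... Ts> struct sequence {};
template <typename... Ts> struct alternation {};
template <std::size_t Min, std::size_t Max, typename T> struct repeat {};

template <char C> constexpr length_range analyze(character<C>);
template <typename... Ts> constexpr length_range analyze(sequence<Ts...>);
template <typename T, typename... Ts> constexpr length_range analyze(alternation<T, Ts...>);
template <std::size_t Min, std::size_t Max, typename T> constexpr length_range analyze(repeat<Min, Max, T>);

// A single character always consumes exactly one character.
template <char C> constexpr length_range analyze(character<C>) {
	return {1, 1};
}

// A sequence adds the bounds of its parts.
template <typename... Ts> constexpr length_range analyze(sequence<Ts...>) {
	length_range out{0, 0};
	const length_range parts[] = {length_range{0, 0}, analyze(Ts{})...};
	for (const auto & p : parts) {
		out.min = sat_add(out.min, p.min);
		out.max = sat_add(out.max, p.max);
	}
	return out;
}

// An alternation takes the smallest min and the largest max of its branches.
template <typename T, typename... Ts> constexpr length_range analyze(alternation<T, Ts...>) {
	length_range out = analyze(T{});
	const length_range rest[] = {out, analyze(Ts{})...};
	for (const auto & r : rest) {
		if (r.min < out.min) out.min = r.min;
		if (r.max > out.max) out.max = r.max;
	}
	return out;
}

// A repetition multiplies the inner bounds by its own bounds.
template <std::size_t Min, std::size_t Max, typename T> constexpr length_range analyze(repeat<Min, Max, T>) {
	const length_range inner = analyze(T{});
	return {sat_mul(inner.min, Min), sat_mul(inner.max, Max)};
}

using a = character<'a'>;
using b = character<'b'>;

// "aaa?a"
constexpr length_range optional_a = analyze(sequence<a, a, repeat<0, 1, a>, a>{});
static_assert(optional_a.min == 3 && optional_a.max == 4);

// "(a){3,4}|(b){2,3}"
constexpr length_range either = analyze(alternation<repeat<3, 4, a>, repeat<2, 3, b>>{});
static_assert(either.min == 2 && either.max == 4);

} // namespace length_sketch
```

The saturating helpers are my own addition so that patterns like "a*" keep an unbounded maximum instead of overflowing.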

template <typename Pattern>
static constexpr auto trampoline_analysis(Pattern) noexcept;

template <typename... Patterns>
static constexpr auto trampoline_analysis(ctll::list<Patterns...>) noexcept;

template<typename T, typename R>
static constexpr auto trampoline_analysis(T, R captures) noexcept;

// calling with pattern prepare stack and triplet of iterators
template <typename Iterator, typename EndIterator, typename Pattern> 
constexpr inline auto match_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
	using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
	const analysis_results min_max_range = trampoline_analysis(ctll::list<Pattern>(), return_type{});
	const auto input_size = static_cast<size_t>(std::distance(begin, end)); // distance is signed, the bounds are not
	// perform a single size check at the start
	if (input_size < min_max_range.first || input_size > min_max_range.second)
		return return_type{};
	else
		return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, assert_end, end_mark, accept>());
}

@Andersama
Contributor Author

Andersama commented Sep 10, 2020

I didn't consider variable-length encodings before. Looking back, here's a better sketch of what would be possible:

// calling with pattern prepare stack and triplet of iterators
template <typename Iterator, typename EndIterator, typename Pattern> 
constexpr inline auto match_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
	using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
	if constexpr (!ctre::is_variable_length_encoded<Iterator>() && std::is_same<typename std::iterator_traits<Iterator>::iterator_category, std::random_access_iterator_tag>::value) {
		constexpr auto lengths = ctre::pattern_match_minmax_characters(ctll::list<start_mark, Pattern, assert_end, end_mark, accept>(), return_type());
		//check the size of the input string
		auto length = std::distance(begin, end);
		if (length >= lengths.min && length <= lengths.max) //if not within bounds we can avoid this call
			return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, assert_end, end_mark, accept>());
		else
			return return_type{};
	} else {
		//normal unchecked size call
		return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, assert_end, end_mark, accept>());
	}
}

template <typename Iterator, typename EndIterator, typename Pattern> 
constexpr inline auto starts_with_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
	using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
	if constexpr (!ctre::is_variable_length_encoded<Iterator>() && std::is_same<typename std::iterator_traits<Iterator>::iterator_category, std::random_access_iterator_tag>::value) {
		constexpr auto lengths = ctre::pattern_match_minmax_characters(ctll::list<start_mark, Pattern, end_mark, accept>(), return_type());
		//check the size of the input string
		auto length = std::distance(begin, end);
		if (length >= lengths.min) //we only check the minimum size requirement since starts with implicitly trails w/ (.*) meaning the max is infinite
			return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
		else
			return return_type{};
	} else {
		//normal unchecked size call
		return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
	}
}

template <typename Iterator, typename EndIterator, typename Pattern> 
constexpr inline auto search_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
	using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
	
	constexpr bool fixed = starts_with_anchor(ctll::list<Pattern>{});
	if constexpr (!ctre::is_variable_length_encoded<Iterator>() && std::is_same<typename std::iterator_traits<Iterator>::iterator_category, std::random_access_iterator_tag>::value) {
		constexpr auto lengths = ctre::pattern_match_minmax_characters(ctll::list<start_mark, Pattern, end_mark, accept>(), return_type());
		//check the size of the input string
		auto length = std::distance(begin, end);

		auto it = begin;
		for (; end != it && !fixed && length >= lengths.min; ++it) { //similar to starts_with, but we loop
			if (auto out = evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>())) {
				return out;
			}
			length--;
		}

		// in case the RE is empty or fixed
		return evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
	}
	else {
		//normal unchecked size loop
		auto it = begin;
		for (; end != it && !fixed; ++it) {
			if (auto out = evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>())) {
				return out;
			}
		}

		// in case the RE is empty or fixed
		return evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
	}
}

The const analysis_results min_max_range = trampoline_analysis(ctll::list<Pattern>(), return_type{}); from before would now be ctre::pattern_match_minmax_characters<...>()

I'm probably going to rebase this PR onto my rewrite, which does the above:
Andersama@531a9ba

@Andersama
Contributor Author

Had a thought: pattern analysis can allow you to perform these transformations:

ctll::list<assert_begin, consuming sequence, assert_begin, Content...> -> ctll::list<reject>
ctll::list<assert_end, consuming sequence, assert_end, Content...> -> ctll::list<reject>

You'd only need to prove that the sequence between the anchors consumes a minimum of 1 character, which the algorithm can provide.
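
Roughly, the first transformation could look like the sketch below. Everything here (list, reject, optional, min_length, simplify) is a hypothetical stand-in I'm assuming for illustration, not CTRE's or ctll's actual definitions:

```cpp
#include <cstddef>
#include <type_traits>

namespace anchor_sketch {

// Hypothetical stand-ins for the pattern nodes.
struct assert_begin {};
struct reject {};
template <char C> struct character {};
template <typename T> struct optional {};
template <typename... Ts> struct list {};

// Minimum number of characters each node must consume.
template <char C> constexpr std::size_t min_length(character<C>) { return 1; }
template <typename T> constexpr std::size_t min_length(optional<T>) { return 0; }

// By default a pattern is left untouched.
template <typename Pattern> struct simplify {
	using type = Pattern;
};

// assert_begin, something, assert_begin: if "something" must consume at least
// one character, the second anchor can never hold, so the pattern always fails.
template <typename Middle, typename... Rest>
struct simplify<list<assert_begin, Middle, assert_begin, Rest...>> {
	using original = list<assert_begin, Middle, assert_begin, Rest...>;
	using type = std::conditional_t<(min_length(Middle{}) >= 1), list<reject>, original>;
};

// "^a^" can never match.
static_assert(std::is_same_v<simplify<list<assert_begin, character<'a'>, assert_begin>>::type, list<reject>>);
// "^a?^" can still match the empty string, so it is kept as is.
static_assert(std::is_same_v<simplify<list<assert_begin, optional<character<'a'>>, assert_begin>>::type,
	list<assert_begin, optional<character<'a'>>, assert_begin>>);

} // namespace anchor_sketch
```

This ignores multiline mode, where a begin anchor could hold again after a newline.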

@hanickadot
Owner

hanickadot commented Sep 11, 2020

Mind that every such analysis / transformation instantiates a large number of templates and has a big impact on compile time.

Also optimizers can see a lot of things already:
https://compiler-explorer.com/z/MoYhff

@Andersama
Contributor Author

Andersama commented Sep 11, 2020

I guess that optimization should be expected, since the compiler would likely track the value of the pointer and work out that it gets incremented later.

It just strikes me how much compile time cost you're paying for having to do this with templates, when for the most part regexes are relatively straightforward and could probably be turned into bytecode, passed to a compiler, optimized, etc., and would likely give you back the same results in almost no time. Say someone were particularly crazy and made a frontend specifically for regex to LLVM and had it build out a function per regex. That doesn't work out in terms of having something native for C++. But if there were an ability to generate bytecode that the compiler could take and use to define a function, a lot of this would probably be much less painful.

I really wish you weren't fighting compile time costs, because in terms of writing optimizations you've made it pretty easy, even if the compiler's showing me up by working those out in the end anyway. Like in #72, where you mention (a*)*, which I'm fairly certain can be safely reduced to (a*). Being able to just write an overload of a function to do that is incredibly handy.
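
Something like this is what I have in mind. It's a sketch with made-up types (star, character, simplify), not the library's actual ones, and it only considers which strings match, not what ends up in capture groups:

```cpp
#include <type_traits>

namespace reduce_sketch {

// Hypothetical stand-ins for the pattern nodes.
template <char C> struct character {};
template <typename T> struct star {}; // zero or more repetitions of T

// Fallback: leave the pattern unchanged.
template <typename T> constexpr T simplify(T) {
	return {};
}

// (x*)* matches the same strings as x*, so peel off the outer star.
// The recursion also collapses deeper nests such as ((x*)*)*.
template <typename T> constexpr auto simplify(star<star<T>>) {
	return simplify(star<T>{});
}

using a = character<'a'>;

static_assert(std::is_same_v<decltype(simplify(star<star<a>>{})), star<a>>);
static_assert(std::is_same_v<decltype(simplify(star<star<star<a>>>{})), star<a>>);
static_assert(std::is_same_v<decltype(simplify(star<a>{})), star<a>>);

} // namespace reduce_sketch
```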
