Re: Row pattern recognition - Mailing list pgsql-hackers
From | Jacob Champion |
---|---|
Subject | Re: Row pattern recognition |
Date | |
Msg-id | fcf6dcfa-5a61-00ea-4311-d3d003b790ed@timescale.com Whole thread Raw |
In response to | Re: Row pattern recognition (Tatsuo Ishii <ishii@sraoss.co.jp>) |
Responses |
Re: Row pattern recognition
|
List | pgsql-hackers |
Hello! > (1) I completely changed the pattern matching engine so that it > performs backtracking. Now the engine evaluates all pattern elements > defined in PATTER against each row, saving matched pattern variables > in a string per row. For example if the pattern element A and B > evaluated to true, a string "AB" is created for current row. If I understand correctly, this strategy assumes that one row's membership in a pattern variable is independent of the other rows' membership. But I don't think that's necessarily true: DEFINE A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', ... > See row_is_in_reduced_frame, search_str_set and search_str_set_recurse > in nodeWindowAggs.c for more details. For now I use a naive depth > first search and probably there is a lot of rooms for optimization > (for example rewriting it without using > recursion). The depth-first match is doing a lot of subtle work here. For example, with PATTERN ( A+ B+ ) DEFINE A AS TRUE, B AS TRUE (i.e. all rows match both variables), and three rows in the partition, our candidates will be tried in the order aaa aab aba abb ... bbb The only possible matches against our regex `^a+b+` are "aab" and "abb", and that order matches the preferment order, so it's fine. But it's easy to come up with a pattern where that's the wrong order, like PATTERN ( A+ (B|A)+ ) Now "aaa" will be considered before "aab", which isn't correct. Similarly, the assumption that we want to match the longest string only works because we don't allow alternation yet. > Suggestions/patches are welcome. Cool, I will give this piece some more thought. Do you mind if I try to add some more complicated pattern quantifiers to stress the architecture, or would you prefer to tackle that later? Just alternation by itself will open up a world of corner cases. > With the new engine, cases above do not fail anymore. See new > regression test cases. Thanks for providing valuable test cases! You're very welcome! On 8/9/23 01:41, Tatsuo Ishii wrote: > - I found a bug with pattern matching code. It creates a string for > subsequent regular expression matching. It uses the initial letter > of each define variable name. For example, if the varname is "foo", > then "f" is used. Obviously this makes trouble if we have two or > more variables starting with same "f" (e.g. "food"). To fix this, I > assign [a-z] to each variable instead of its initial letter. However > this way limits us not to have more than 26 variables. I hope 26 is > enough for most use cases. There are still plenty of alphanumerics left that could be assigned... But I'm wondering if we might want to just implement the NFA directly? The current implementation's Cartesian explosion can probably be pruned aggressively, but replaying the entire regex match once for every backtracked step will still duplicate a lot of work. -- I've attached another test case; it looks like last_value() is depending on some sort of side effect from either first_value() or nth_value(). I know the window frame itself is still under construction, so apologies if this is an expected failure. Thanks! --Jacob
Attachment
pgsql-hackers by date: