Home > mailing lists

Re: Row pattern recognition - Mailing list pgsql-hackers

From	Jacob Champion
Subject	Re: Row pattern recognition
Date	September 8, 2023 00:00:07
Msg-id	fcf6dcfa-5a61-00ea-4311-d3d003b790ed@timescale.com Whole thread Raw
In response to	Re: Row pattern recognition (Tatsuo Ishii <ishii@sraoss.co.jp>)
Responses	Re: Row pattern recognition
List	pgsql-hackers

Tree view

Hello!

> (1) I completely changed the pattern matching engine so that it
> performs backtracking. Now the engine evaluates all pattern elements
> defined in PATTER against each row, saving matched pattern variables
> in a string per row. For example if the pattern element A and B
> evaluated to true, a string "AB" is created for current row.

If I understand correctly, this strategy assumes that one row's
membership in a pattern variable is independent of the other rows'
membership. But I don't think that's necessarily true:

    DEFINE
      A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A',
      ...

> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
> in nodeWindowAggs.c for more details. For now I use a naive depth
> first search and probably there is a lot of rooms for optimization
> (for example rewriting it without using
> recursion).

The depth-first match is doing a lot of subtle work here. For example, with

    PATTERN ( A+ B+ )
    DEFINE A AS TRUE,
           B AS TRUE

(i.e. all rows match both variables), and three rows in the partition,
our candidates will be tried in the order

    aaa
    aab
    aba
    abb
    ...
    bbb

The only possible matches against our regex `^a+b+` are "aab" and "abb",
and that order matches the preferment order, so it's fine. But it's easy
to come up with a pattern where that's the wrong order, like

    PATTERN ( A+ (B|A)+ )

Now "aaa" will be considered before "aab", which isn't correct.

Similarly, the assumption that we want to match the longest string only
works because we don't allow alternation yet.

> Suggestions/patches are welcome.

Cool, I will give this piece some more thought. Do you mind if I try to
add some more complicated pattern quantifiers to stress the
architecture, or would you prefer to tackle that later? Just alternation
by itself will open up a world of corner cases.

> With the new engine, cases above do not fail anymore. See new
> regression test cases. Thanks for providing valuable test cases!

You're very welcome!

On 8/9/23 01:41, Tatsuo Ishii wrote:
> - I found a bug with pattern matching code. It creates a string for
>   subsequent regular expression matching. It uses the initial letter
>   of each define variable name. For example, if the varname is "foo",
>   then "f" is used. Obviously this makes trouble if we have two or
>   more variables starting with same "f" (e.g. "food"). To fix this, I
>   assign [a-z] to each variable instead of its initial letter. However
>   this way limits us not to have more than 26 variables. I hope 26 is
>   enough for most use cases.

There are still plenty of alphanumerics left that could be assigned...

But I'm wondering if we might want to just implement the NFA directly?
The current implementation's Cartesian explosion can probably be pruned
aggressively, but replaying the entire regex match once for every
backtracked step will still duplicate a lot of work.

--

I've attached another test case; it looks like last_value() is depending
on some sort of side effect from either first_value() or nth_value(). I
know the window frame itself is still under construction, so apologies
if this is an expected failure.

Thanks!
--Jacob

Attachment

test-last_value.diff.txt

pgsql-hackers by date:

From: Michael Paquier
Date: 07 September 2023, 23:48:12
Subject: Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

From: Michael Paquier
Date: 08 September 2023, 00:07:13
Subject: Re: Impact of checkpointer during pg_upgrade

Re: Row pattern recognition - Mailing list pgsql-hackers

Attachment

Previous

Next