Re: Row pattern recognition - Mailing list pgsql-hackers

From Henson Choi
Subject Re: Row pattern recognition
Date
Msg-id CAAAe_zAgyfStcRAdm9j6QrA9p0pOk3scxpmkAkxi3goSdgxo7Q@mail.gmail.com
Whole thread Raw
In response to Re: Row pattern recognition  (Henson Choi <assam258@gmail.com>)
Responses Re: Row pattern recognition
List pgsql-hackers
Hi Ishii-san,


The attached patch implements streaming NFA-based pattern matching.
This is based on your v37 patch and the pattern grammar work I submitted earlier.

This is a testable draft for review, not production-ready yet. It allows
in-depth verification of pattern optimization, pattern matching/merging,
context absorption, and window behavior.


Summary of changes:

1. NFA-based executor replacing regex approach
   - Streaming execution: processes one row at a time
   - Eliminates regex compilation overhead and 26-variable limit
   - Supports up to 252 pattern variables

   Note on 252 limit:
   Each match path stores a sequence of classifiers (one per matched row).
   Using 1-byte encoding keeps memory usage reasonable for long matches.
   Some values are reserved for pattern representation; this limit may
   decrease if more are needed. 200+ should be sufficient for practical use.

2. Pattern optimization moved to planner phase (as you suggested)
   - pg_get_viewdef() now shows original pattern
   - EXPLAIN shows optimized pattern
   - Optimizations: VAR merge (A A A -> A{3}), GROUP merge ((A B) (A B)+ -> (A B){2,}),
     duplicate removal, SEQ/ALT flattening, quantifier multiplication
   - These optimizations also enable absorption for patterns that wouldn't
     otherwise qualify (e.g., (A B) (A B)+ -> (A B){2,} becomes absorbable)

3. Multi-context support with context absorption
   - Multiple concurrent NFA contexts for overlapping matches
   - Absorption optimization: earlier context absorbs later ones
     when they reach the same endpoint

   Note on absorption scope:
   I haven't fully worked out the theoretical bounds for safe absorption yet.
   The current implementation covers most practically useful cases, but a
   complete solution would require more formal analysis.

   Current implementation handles cases where the behavior is clear:
   - Only patterns with unbounded first element (A+, A*, (A B)+, etc.)
   - Only when there's exactly one unbounded element at top level
   - Per-branch analysis for alternation patterns
   This may be extended as we better understand the theoretical limits.

   Examples:
   - A+ B     : absorbable (A+ unbounded first, same endpoint guaranteed)
   - A* B     : absorbable (A* unbounded first)
   - A B+     : NOT absorbable (unbounded not first, different start points)
   - A+ B+    : partial (absorbable while in A+, not after entering B+)
   - A? B     : NOT absorbable (? is bounded: {0,1})
   - A{2,} B  : absorbable (unbounded quantifier at first element)
   - A{2,5} B : NOT absorbable (bounded quantifier)
   - (A B){2,}: absorbable (GROUP with unbounded quantifier at top level)
   - A+ | B+  : absorbable per-branch (each branch analyzed independently)
   - A+ | B C : partial per-branch (A+ absorbable, B C not)
   - (A+ B)+   : NOT YET (absorbable at first A+ entry, nested analysis not implemented)

4. Reluctant quantifiers
   - Will be supported in a follow-up patch
   - I need to study some edge cases in the spec first

Also updated: parser/README, typedefs.list


Testing:

The rpr regression test has been expanded significantly with tests for:
- Alternation and grouping patterns
- Quantifiers ({n}, {n,}, {n,m}, +, *, ?)
- Pattern optimization verification
- Multi-context matching
- Context absorption behavior
- Lexical order compliance

Below is a reference query designed to test NFA functionality in a clean,
unambiguous way. Using explicit ARRAY flags eliminates edge cases and makes
the pattern matching behavior completely predictable, allowing clear verification
of multi-context execution and alternation handling.

Example (multi-context overlapping match test):
Uses ARRAY flags with ANY() to make each row's pattern membership explicit,
making test behavior predictable and easy to verify.

  WITH data AS (
    SELECT * FROM (VALUES
      (1, ARRAY['A']), (2, ARRAY['B']), (3, ARRAY['C']),
      (4, ARRAY['D']), (5, ARRAY['E']), (6, ARRAY['F'])
    ) AS t(id, flags)
  )
  SELECT id, flags, first_value(id) OVER w AS start, last_value(id) OVER w AS end
  FROM data
  WINDOW w AS (
    ORDER BY id
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    AFTER MATCH SKIP TO NEXT ROW
    PATTERN (A B C D E | B C D | C D E F)
    DEFINE A AS 'A' = ANY(flags), B AS 'B' = ANY(flags), ...
  );

  -- Result (SKIP TO NEXT ROW enables multi-context):
   id | flags | start | end
  ----+-------+-------+-----
    1 | {A}   |     1 |   5   <- A B C D E
    2 | {B}   |     2 |   4   <- B C D
    3 | {C}   |     3 |   6   <- C D E F
    4 | {D}   |       |
    5 | {E}   |       |
    6 | {F}   |       |


Performance:

Regression test timing (rpr test):

  Parallel group (with other tests):
    Before (v37 regex):  198 ms
    After (NFA):         123 ms  (~38% faster)
    After (NFA + tests): 127 ms  (still ~36% faster)

  Single test (isolated):
    Before (v37 regex):  38 ms
    After (NFA):         26 ms  (~32% faster)
    After (NFA + tests): 41 ms  (+8%, more tests added)

Note: The rpr test was moved out of the parallel group in the schedule
for more accurate timing measurement.

The streaming NFA approach eliminates regex compilation overhead per-row
and reduces redundant DEFINE evaluations, resulting in better performance
especially for complex patterns.


Let me know if you have any questions or concerns.


Best regards,
Henson

Attachment

pgsql-hackers by date:

Previous
From: Japin Li
Date:
Subject: Re: Add IS_INDEX macro to brin and gist index
Next
From: Japin Li
Date:
Subject: Re: Add IS_INDEX macro to brin and gist index