Home > mailing lists

Re: Row pattern recognition - Mailing list pgsql-hackers

From	Henson Choi
Subject	Re: Row pattern recognition
Date	January 14 18:31:32
Msg-id	CAAAe_zAgyfStcRAdm9j6QrA9p0pOk3scxpmkAkxi3goSdgxo7Q@mail.gmail.com Whole thread Raw
In response to	Re: Row pattern recognition (Henson Choi <assam258@gmail.com>)
Responses	Re: Row pattern recognition
List	pgsql-hackers

Tree view

Hi Ishii-san,

The attached patch implements streaming NFA-based pattern matching.
This is based on your v37 patch and the pattern grammar work I submitted earlier.

This is a testable draft for review, not production-ready yet. It allows
in-depth verification of pattern optimization, pattern matching/merging,
context absorption, and window behavior.

Summary of changes:

1. NFA-based executor replacing regex approach
- Streaming execution: processes one row at a time
- Eliminates regex compilation overhead and 26-variable limit
- Supports up to 252 pattern variables

Note on 252 limit:
Each match path stores a sequence of classifiers (one per matched row).
Using 1-byte encoding keeps memory usage reasonable for long matches.
Some values are reserved for pattern representation; this limit may
decrease if more are needed. 200+ should be sufficient for practical use.

2. Pattern optimization moved to planner phase (as you suggested)
- pg_get_viewdef() now shows original pattern
- EXPLAIN shows optimized pattern
- Optimizations: VAR merge (A A A -> A{3}), GROUP merge ((A B) (A B)+ -> (A B){2,}),
duplicate removal, SEQ/ALT flattening, quantifier multiplication
- These optimizations also enable absorption for patterns that wouldn't
otherwise qualify (e.g., (A B) (A B)+ -> (A B){2,} becomes absorbable)

3. Multi-context support with context absorption
- Multiple concurrent NFA contexts for overlapping matches
- Absorption optimization: earlier context absorbs later ones
when they reach the same endpoint

Note on absorption scope:
I haven't fully worked out the theoretical bounds for safe absorption yet.
The current implementation covers most practically useful cases, but a
complete solution would require more formal analysis.

Current implementation handles cases where the behavior is clear:
- Only patterns with unbounded first element (A+, A*, (A B)+, etc.)
- Only when there's exactly one unbounded element at top level
- Per-branch analysis for alternation patterns
This may be extended as we better understand the theoretical limits.

Examples:
- A+ B : absorbable (A+ unbounded first, same endpoint guaranteed)
- A* B : absorbable (A* unbounded first)
- A B+ : NOT absorbable (unbounded not first, different start points)
- A+ B+ : partial (absorbable while in A+, not after entering B+)
- A? B : NOT absorbable (? is bounded: {0,1})
- A{2,} B : absorbable (unbounded quantifier at first element)
- A{2,5} B : NOT absorbable (bounded quantifier)
- (A B){2,}: absorbable (GROUP with unbounded quantifier at top level)
- A+ | B+ : absorbable per-branch (each branch analyzed independently)
- A+ | B C : partial per-branch (A+ absorbable, B C not)
- (A+ B)+ : NOT YET (absorbable at first A+ entry, nested analysis not implemented)

4. Reluctant quantifiers
- Will be supported in a follow-up patch
- I need to study some edge cases in the spec first

Also updated: parser/README, typedefs.list

Testing:

The rpr regression test has been expanded significantly with tests for:
- Alternation and grouping patterns
- Quantifiers ({n}, {n,}, {n,m}, +, *, ?)
- Pattern optimization verification
- Multi-context matching
- Context absorption behavior
- Lexical order compliance

Below is a reference query designed to test NFA functionality in a clean,
unambiguous way. Using explicit ARRAY flags eliminates edge cases and makes
the pattern matching behavior completely predictable, allowing clear verification
of multi-context execution and alternation handling.

Example (multi-context overlapping match test):
Uses ARRAY flags with ANY() to make each row's pattern membership explicit,
making test behavior predictable and easy to verify.

WITH data AS (
SELECT * FROM (VALUES
(1, ARRAY['A']), (2, ARRAY['B']), (3, ARRAY['C']),
(4, ARRAY['D']), (5, ARRAY['E']), (6, ARRAY['F'])
) AS t(id, flags)
)
SELECT id, flags, first_value(id) OVER w AS start, last_value(id) OVER w AS end
FROM data
WINDOW w AS (
ORDER BY id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
AFTER MATCH SKIP TO NEXT ROW
PATTERN (A B C D E | B C D | C D E F)
DEFINE A AS 'A' = ANY(flags), B AS 'B' = ANY(flags), ...
);

-- Result (SKIP TO NEXT ROW enables multi-context):
id | flags | start | end
----+-------+-------+-----
1 | {A} | 1 | 5 <- A B C D E
2 | {B} | 2 | 4 <- B C D
3 | {C} | 3 | 6 <- C D E F
4 | {D} | |
5 | {E} | |
6 | {F} | |

Performance:

Regression test timing (rpr test):

Parallel group (with other tests):
Before (v37 regex): 198 ms
After (NFA): 123 ms (~38% faster)
After (NFA + tests): 127 ms (still ~36% faster)

Single test (isolated):
Before (v37 regex): 38 ms
After (NFA): 26 ms (~32% faster)
After (NFA + tests): 41 ms (+8%, more tests added)

Note: The rpr test was moved out of the parallel group in the schedule
for more accurate timing measurement.

The streaming NFA approach eliminates regex compilation overhead per-row
and reduces redundant DEFINE evaluations, resulting in better performance
especially for complex patterns.

Let me know if you have any questions or concerns.

Best regards,
Henson

Attachment

0001-NFA-based-pattern-matching.txt

pgsql-hackers by date:

From: Japin Li
Date: 14 January, 18:31:11
Subject: Re: Add IS_INDEX macro to brin and gist index

From: Japin Li
Date: 14 January, 18:35:53
Subject: Re: Add IS_INDEX macro to brin and gist index

Re: Row pattern recognition - Mailing list pgsql-hackers

Attachment

Previous

Next