From: Henson Choi
Subject: Re: Row pattern recognition
Msg-id: CAAAe_zAgyfStcRAdm9j6QrA9p0pOk3scxpmkAkxi3goSdgxo7Q@mail.gmail.com
In response to: Re: Row pattern recognition (Henson Choi <assam258@gmail.com>)
List: pgsql-hackers
Hi Ishii-san,
The attached patch implements streaming NFA-based pattern matching.
This is based on your v37 patch and the pattern grammar work I submitted earlier.
This is a testable draft for review, not production-ready yet. It allows
in-depth verification of pattern optimization, pattern matching/merging,
context absorption, and window behavior.
Summary of changes:
1. NFA-based executor replacing regex approach
- Streaming execution: processes one row at a time
- Eliminates regex compilation overhead and the 26-variable limit
- Supports up to 252 pattern variables
Note on 252 limit:
Each match path stores a sequence of classifiers (one per matched row).
Using 1-byte encoding keeps memory usage reasonable for long matches.
Some values are reserved for pattern representation, so the limit may
decrease if more are needed; 200+ variables should be sufficient for
practical use.
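For reference, here is a minimal sketch of the streaming step. All names
(NfaContext, nfa_step, nfa_next) are hypothetical and simplified for
discussion; the patch's actual structures differ:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NFA_MAX_STATES 64           /* illustrative bound */
#define NFA_MAX_VARS   252          /* matches the limit above */

typedef struct NfaContext
{
    long    start_row;              /* first row of this candidate match */
    bool    active[NFA_MAX_STATES]; /* current NFA state set */
    uint8_t path[1024];             /* one classifier byte per matched row */
    int     pathlen;
} NfaContext;

/* Transition table built from PATTERN at plan time (construction
 * omitted); entries < 0 mean "no edge". */
static int16_t nfa_next[NFA_MAX_STATES][NFA_MAX_VARS];

/*
 * Feed one row into a context.  The DEFINE predicates are evaluated
 * once per row into var_matched[]; every active state is then advanced
 * on each variable that matched.  (The real executor must keep one
 * classifier path per surviving branch; a single path[] is shown here
 * only to illustrate the 1-byte-per-row encoding.)
 */
static bool
nfa_step(NfaContext *ctx, const bool *var_matched, int nvars)
{
    bool next_active[NFA_MAX_STATES] = {false};
    bool alive = false;

    for (int s = 0; s < NFA_MAX_STATES; s++)
    {
        if (!ctx->active[s])
            continue;
        for (int v = 0; v < nvars; v++)
        {
            if (var_matched[v] && nfa_next[s][v] >= 0)
            {
                next_active[nfa_next[s][v]] = true;
                alive = true;
            }
        }
    }
    memcpy(ctx->active, next_active, sizeof(next_active));
    return alive;                   /* context dies when no state survives */
}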
2. Pattern optimization moved to planner phase (as you suggested)
- pg_get_viewdef() now shows the original pattern
- EXPLAIN shows the optimized pattern
- Optimizations: VAR merge (A A A -> A{3}), GROUP merge ((A B) (A B)+ -> (A B){2,}),
duplicate removal, SEQ/ALT flattening, quantifier multiplication
- These optimizations also enable absorption for patterns that wouldn't
otherwise qualify (e.g., (A B) (A B)+ -> (A B){2,} becomes absorbable)
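To make the VAR merge rule concrete, here is a simplified sketch; RPNode
is a hypothetical node shape, not the patch's actual struct:

typedef enum { RP_VAR, RP_SEQ, RP_ALT, RP_GROUP } RPKind;

typedef struct RPNode
{
    RPKind          kind;
    int             var;        /* for RP_VAR: classifier id */
    int             qmin;       /* quantifier lower bound */
    int             qmax;       /* quantifier upper bound; -1 = unbounded */
    struct RPNode **child;      /* for RP_SEQ / RP_ALT / RP_GROUP */
    int             nchild;
} RPNode;

/* VAR merge: collapse adjacent occurrences of the same variable inside
 * a SEQ by adding their quantifiers, e.g. A A A -> A{3}, A A+ -> A{2,}. */
static void
merge_adjacent_vars(RPNode *seq)
{
    int out = 0;

    for (int i = 0; i < seq->nchild; i++)
    {
        RPNode *cur = seq->child[i];

        if (out > 0)
        {
            RPNode *prev = seq->child[out - 1];

            if (prev->kind == RP_VAR && cur->kind == RP_VAR &&
                prev->var == cur->var)
            {
                prev->qmin += cur->qmin;
                if (prev->qmax < 0 || cur->qmax < 0)
                    prev->qmax = -1;        /* unbounded stays unbounded */
                else
                    prev->qmax += cur->qmax;
                continue;                   /* cur folded into prev */
            }
        }
        seq->child[out++] = cur;
    }
    seq->nchild = out;
}

GROUP merge is the same idea applied to structurally equal subtrees, which
is what rewrites (A B) (A B)+ into (A B){2,}.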
3. Multi-context support with context absorption
- Multiple concurrent NFA contexts for overlapping matches
- Absorption optimization: earlier context absorbs later ones
when they reach the same endpoint
Note on absorption scope:
I haven't fully worked out the theoretical bounds for safe absorption yet.
The current implementation covers most practically useful cases, but a
complete solution would require more formal analysis.
The current implementation handles the cases where the behavior is clear:
- Only patterns with an unbounded first element (A+, A*, (A B)+, etc.)
- Only when there is exactly one unbounded element at the top level
- Per-branch analysis for alternation patterns
This may be extended as we better understand the theoretical limits; a rough
sketch of the check follows the examples below.
Examples:
- A+ B : absorbable (A+ unbounded first, same endpoint guaranteed)
- A* B : absorbable (A* unbounded first)
- A B+ : NOT absorbable (unbounded not first, different start points)
- A+ B+ : partial (absorbable while in A+, not after entering B+)
- A? B : NOT absorbable (? is bounded: {0,1})
- A{2,} B : absorbable (unbounded quantifier at first element)
- A{2,5} B : NOT absorbable (bounded quantifier)
- (A B){2,}: absorbable (GROUP with unbounded quantifier at top level)
- A+ | B+ : absorbable per-branch (each branch analyzed independently)
- A+ | B C : partial per-branch (A+ absorbable, B C not)
- (A+ B)+ : NOT YET (absorbable at first A+ entry, nested analysis not implemented)
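Here is a sketch of how I currently encode the rule described in the note
above, reusing the hypothetical RPNode shape from the previous sketch; this
only approximates what the patch does:

/* True if this element, taken as a context's first element, supports
 * absorption: it must carry an unbounded quantifier, and (for groups)
 * nothing unbounded may be nested inside, since nested analysis is
 * not implemented yet. */
static bool
element_absorbable(const RPNode *n)
{
    if (n->qmax >= 0)
        return false;               /* bounded: A, A?, A{2,5} */
    if (n->kind == RP_GROUP)
    {
        for (int i = 0; i < n->nchild; i++)
            if (n->child[i]->qmax < 0)
                return false;       /* e.g. (A+ B)+ : NOT YET */
    }
    return true;                    /* A+, A*, A{2,}, (A B){2,} */
}

/* Absorption scope for a whole pattern: a context is absorbable only
 * while the match is still inside the first element (hence A+ B+ is
 * only partially absorbable); ALT branches are analyzed independently. */
static bool
absorbable_in_first_element(const RPNode *pat)
{
    switch (pat->kind)
    {
        case RP_ALT:
            for (int i = 0; i < pat->nchild; i++)
                if (absorbable_in_first_element(pat->child[i]))
                    return true;    /* at least one branch qualifies */
            return false;
        case RP_SEQ:
            return pat->nchild > 0 &&
                   element_absorbable(pat->child[0]);
        default:
            return element_absorbable(pat);     /* bare VAR or GROUP */
    }
}

The per-branch ALT handling matches the A+ | B C example: the executor
tracks which branch a context entered and applies absorption only inside
qualifying branches.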
4. Reluctant quantifiers
- Will be supported in a follow-up patch
- I need to study some edge cases in the spec first
Also updated: parser/README, typedefs.list
Testing:
The rpr regression test has been expanded significantly with tests for:
- Alternation and grouping patterns
- Quantifiers ({n}, {n,}, {n,m}, +, *, ?)
- Pattern optimization verification
- Multi-context matching
- Context absorption behavior
- Lexical order compliance
Below is a reference query designed to exercise the NFA in a clean,
unambiguous way. Each row carries explicit ARRAY flags tested with ANY(),
so every row's pattern membership is explicit and the matching behavior is
completely predictable; this makes multi-context execution and alternation
handling easy to verify.

Example (multi-context overlapping match test):
WITH data AS (
  SELECT * FROM (VALUES
    (1, ARRAY['A']), (2, ARRAY['B']), (3, ARRAY['C']),
    (4, ARRAY['D']), (5, ARRAY['E']), (6, ARRAY['F'])
  ) AS t(id, flags)
)
SELECT id, flags,
       first_value(id) OVER w AS start,
       last_value(id) OVER w AS "end"   -- "end" must be quoted (reserved word)
FROM data
WINDOW w AS (
  ORDER BY id
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  AFTER MATCH SKIP TO NEXT ROW
  PATTERN (A B C D E | B C D | C D E F)
  DEFINE A AS 'A' = ANY(flags), B AS 'B' = ANY(flags), ...
);
-- Result (SKIP TO NEXT ROW enables multi-context):
 id | flags | start | end
----+-------+-------+-----
  1 | {A}   |     1 |   5   <- A B C D E
  2 | {B}   |     2 |   4   <- B C D
  3 | {C}   |     3 |   6   <- C D E F
  4 | {D}   |       |
  5 | {E}   |       |
  6 | {F}   |       |
Performance:

Regression test timing (rpr test):

  Parallel group (with other tests):
    Before (v37 regex):   198 ms
    After (NFA):          123 ms (~38% faster)
    After (NFA + tests):  127 ms (still ~36% faster)

  Single test (isolated):
    Before (v37 regex):   38 ms
    After (NFA):          26 ms (~32% faster)
    After (NFA + tests):  41 ms (+8%, more tests added)
Note: The rpr test was moved out of the parallel group in the schedule
for more accurate timing measurement.
The streaming NFA approach eliminates the per-row regex compilation
overhead and reduces redundant DEFINE evaluations, resulting in better
performance, especially for complex patterns.
Let me know if you have any questions or concerns.
Best regards,
Henson