Home > mailing lists

Re: Row pattern recognition - Mailing list pgsql-hackers

From	Henson Choi
Subject	Re: Row pattern recognition
Date	January 16 02:44:42
Msg-id	CAAAe_zC1Ui2=xjoPQ_9CAfP4vPW6utFWc+a7Vb_j_aaBB0YM4Q@mail.gmail.com Whole thread Raw
In response to	Re: Row pattern recognition (Henson Choi <assam258@gmail.com>)
List	pgsql-hackers

Tree view

Hi Ishii-san,

Thank you for raising the frame boundary concern. You're right that
different contexts have different frame boundaries.

2026년 1월 15일 (목) PM 10:28, Henson Choi <assam258@gmail.com>님이 작성:

Hi Ishii-san

We need to check the *frame" end, not the partition end.

I think your patch relies on !window_gettupleslot() to check whether
the row exists.

if (!window_gettupleslot(winobj, pos, slot))
return false; /* No row exists */

But the function only checks the row existence in the current partition:

* Fetch the pos'th tuple of the current partition into the slot,

Thus it is possible that window_gettupleslot() returns true but the
row is not in the current frame in case that the partition is divided
into some frames. You need to check the row existence in a frame. For
this purpose you can use row_is_in_frame().

Your suggestion was to use row_is_in_frame() to check frame boundaries
for each row. However, I took a simpler approach: just disable context
absorption entirely when the frame is limited.

The root cause is that context absorption assumes all contexts see
the same rows. With limited frames, each context starting at a different
row has a different visible range, so we cannot absorb one context into
another.

As I mentioned before, I think we should use absorption conservatively
for now - only in cases where it's clearly safe (e.g., A* at the
beginning of the pattern). We can extend it later through more in-depth
research.

The fix:

if (winstate->rpSkipTo == ST_PAST_LAST_ROW && !hasLimitedFrame)
nfa_absorb_contexts(winstate, targetCtx, currentPos);

I added a test case to verify this:

-- Pattern: A+ B, Data: A(0) A(1) A(2) B(3), Frame: CURRENT ROW AND 2 FOLLOWING
-- Row 0: frame [0,2], cannot see B at row 3 -> no match
-- Row 1: frame [1,3], can see A A B -> should match rows 1-3
WITH frame_absorb_test AS (
SELECT * FROM (VALUES
(0, 'A'), (1, 'A'), (2, 'A'), (3, 'B')
) AS t(id, flag)
)
SELECT id, flag, first_value(id) OVER w AS match_start, last_value(id) OVER w AS match_end
FROM frame_absorb_test
WINDOW w AS (
ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A+ B)
DEFINE
A AS flag = 'A',
B AS flag = 'B'
);

Before fix (absorption incorrectly absorbs row 1's context into row 0's):

id | flag | match_start | match_end
----+------+-------------+-----------
0 | A | |
1 | A | |
2 | A | |
3 | B | |

After fix (row 1 correctly matches):

id | flag | match_start | match_end
----+------+-------------+-----------
0 | A | |
1 | A | 1 | 3
2 | A | |
3 | B | |

Note that for SKIP TO NEXT ROW, absorption is already disabled,
so your test case with SKIP TO NEXT ROW should
work correctly.

I'm attaching three related commits:

1. Row pattern recognition: Improve NFA state memory management

This commit refactors NFA state management for better readability
and maintainability.

2. Row pattern recognition: Disable context absorption for limited frame

This commit further restricts absorption: even with SKIP PAST LAST ROW,
absorption is disabled when the frame is limited. With limited frames,
different contexts have different frame boundaries, making absorption
unsafe.

I ran following query and got the result with v38 (which includes your
NFA patches).

WITH data AS (
SELECT * FROM (VALUES
('A', 1), ('A', 2),
('B', 3), ('B', 4)
) AS t(gid, id))
SELECT gid, id, array_agg(id) OVER w
FROM data
WINDOW w AS (
PARTITION BY gid
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
AFTER MATCH SKIP TO NEXT ROW
PATTERN (A+)
DEFINE A AS id < 10
);
gid | id | array_agg
-----+----+-----------
A | 1 | {1,2}
A | 2 |
B | 3 | {3,4}
B | 4 |
(4 rows)

I think the second and 4th rows are expected to return some data in
array_agg colum. In fact v37 patch returns following results for the
same query:

gid | id | array_agg
-----+----+-----------
A | 1 | {1,2}
A | 2 | {2}
B | 3 | {3,4}
B | 4 | {4}
(4 rows)

While investigating the issue you mentioned, I found that the problem
is caused by the context absorption logic, not the frame boundary check.

The absorption logic removes a newer context when
older->matchEndRow >= newer->matchEndRow:

Example with pattern A+:
- Context at row 1: matchEndRow = 2 (match {1,2})
- Context at row 2: matchEndRow = 2 (match {2})
→ Newer context absorbed, row 2's result lost

With SKIP TO NEXT ROW, each row should produce its own independent match.
The absorption was incorrectly removing these valid overlapping matches.

I modified the absorption logic to only apply to SKIP PAST LAST ROW mode,
where it provides performance benefits without affecting correctness.

For SKIP TO NEXT ROW, absorption is now disabled, allowing each row
to produce independent overlapping matches as expected:

gid | id | array_agg
-----+----+-----------
A | 1 | {1,2}
A | 2 | {2}
B | 3 | {3,4}
B | 4 | {4}
(4 rows)

I'm attaching 3 patches:

1. Test cases from Jacob Champion's branch
- Added 355 lines of test cases and 51 lines of executor changes

2. Refactored update_reduced_frame for readability and performance
- Extracted large code blocks into separate functions
- Early return for already-processed positions
- Integrated nfa_unlink_context into nfa_context_free
- Used oldest-first context ordering for early termination

3. Enable context absorption only for SKIP PAST LAST ROW
- Fixes overlapping match issue for SKIP TO NEXT ROW
- In my 5000-row test, absorption showed significant performance gain for SKIP PAST LAST ROW
- Added your test case to regression tests

Since I implemented it in the order of parser/planner/nfa/executor,
the executor flow became somewhat confusing.
I plan to continue improving it by reviewing and refactoring
from the top-level functions downward.

During refactoring, I plan to simplify pattern optimization and context
absorption logic, keeping only the parts that are clearly correct, even
if this means some performance loss in the initial version. We can add
optimizations back later once the core logic is stable and well-tested.

I'm now reviewing the remaining parts: pattern optimization in planner,
absorption function, and step function. Once the review is complete,
I'll send the rest of the patches.

Best regards,

Henson

Attachment

pgsql-hackers by date:

From: Chao Li
Date: 16 January, 02:35:04
Subject: Re: Extended Statistics set/restore/clear functions.

From: Movead
Date: 16 January, 03:04:32
Subject: Re: Can we change pg_rewind used without wal_log_hints and data_checksums

Re: Row pattern recognition - Mailing list pgsql-hackers

Attachment

Previous

Next