Thread: jsonpath duplication result

jsonpath duplication result

From

Charles Paperman

Date:

02 September 2021, 13:06:20

Hi, 

I am looking at the documentation and how jsonpath works and I fall on the following example:

(On PostgreSQL 12.4, package for Debian 12.4-3):

SELECT jsonb_path_query(j, '$.**.a.**.b') FROM (SELECT '{"a":{"a":{"b":{"c":3}}}'::jsonb as j) as T;

The result:

 jsonb_path_query 
------------------
 {"c": 3}
 {"c": 3}
(2 lignes)


From what in understand, it matches twice the last `b` with every `a` before. For instance, adding one more 'a' on the
tree:

 SELECT jsonb_path_query(j, '$.**.a.**.b') FROM (SELECT '{"a":{"a":{"a":{"b":{"c":3}}}}}'::jsonb as j) as T;
 jsonb_path_query 
------------------
 {"c": 3}
 {"c": 3}
 {"c": 3}

This could be a fine semantic for jsonpath query (one could call that the Bag semantic), 
however it implies that in some situation the number of output duplicated can become huge.
Actually in the worst case, it could be exponential in the depth of the tree.

On more example (for the fun):

SELECT Count(*) FROM (SELECT jsonb_path_query(j,
'$.**.a.**.a.**.a.**.a.**.a.**.a.**.a.**.a.**.a.**.a.**.a.**.a.**.a.**.b')FROM (SELECT
'{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"a":{"b":{"c":3}}}}}}}}}}}}}}}}}}}}}}}}}}}}}'::jsonb
asj) as T) as T2;
 

Takes 24s a returns 20058300 rows.

This semantic for jsonpath is not so classical (other implementation choice to return only once each selected element,
inthe SET semantic spirit) 
 
and differs from the semantic of xpath, as shown here, which follows the SET semantic:

SELECT xpath('//a//a//b', '<a><a><a><b>1</b></a></a></a>');

Returns: 

   xpath    
------------
 {<b>1</b>}


We can of course get the appropriate semantic by adding DISTINCT, but the initial list of rows is computed nevertheless
(aO(n) algorithm exists, so it is avoidable).
 

I also wonder if it is problematic to have simple small query like that that can be design to make the database work
endlessly?

Best, 

Charles Paperman
Maitre de Conférence à l'université de Lille

Re: jsonpath duplication result

From

"David G. Johnston"

Date:

02 September 2021, 13:55:45

On Thu, Sep 2, 2021 at 6:06 AM Charles Paperman <charles.paperman@inria.fr> wrote:

I also wonder if it is problematic to have simple small query like that that can be design to make the database work endlessly?

No, because the "simple small" part of that doesn't matter. A user with access has various ways to affect denial-of-service on the server. If you don't trust someone from doing this kind of thing intentionally don't provide them a login. Accidental issues of this nature should be accounted for in other ways - like testing. And the process-oriented nature of the system helps too since CPU starving becomes difficulty (so memory ends up being the main concern).

The pathological case shown here is not all that concerning to me either.

That the implementation choice could have been different, and match other implementations, is interesting to me. But changing this kind of behavior is usually not an option. Though adding new features to accomplish the new behavior is something to consider.

David J.

Re: jsonpath duplication result

From

Charles Paperman

Date:

02 September 2021, 18:00:03

Le 09/02 06:55, David G. Johnston a écrit :
> On Thu, Sep 2, 2021 at 6:06 AM Charles Paperman <charles.paperman@inria.fr>
> wrote:
> 
>     I also wonder if it is problematic to have simple small query like that
>     that can be design to make the database work endlessly?
> 
> 
> 
> No, because the "simple small" part of that doesn't matter.  A user with access
> has various ways to affect denial-of-service on the server.  If you don't trust
> someone from doing this kind of thing intentionally don't provide them a
> login.  Accidental issues of this nature should be accounted for in other ways
> - like testing.  And the process-oriented nature of the system helps too since
> CPU starving becomes difficulty (so memory ends up being the main concern).

I see. Thanks for the precision.

> The pathological case shown here is not all that concerning to me either. 
> That the implementation choice could have been different, and match other
> implementations, is interesting to me.  But changing this kind of behavior is
> usually not an option.  Though adding new features to accomplish the new
> behavior is something to consider.

I looked into https://goessner.net/articles/JsonPath/ (which seems to be the initial implementation others take as the
standard).
And it seems that it follows the BAG semantics (the one of PostgreSQL).
On the SET semantic sides, without looking to much I found jq (which is somehow a standard as well)  and
jsonpath-python(which claims

to follow the Goessner's one but actually doesn't). My guess is that depending on specific details on the
implementation,you can fall in

one semantic or the other without thinking about it. It is not something tested in the test suits of the packaged I
lookedor in the initial

Goessner examples.

I can see a reason to prefer the BAG semantic: Goessner provides in the API a way to get all the paths and not just
extracting
the final node. Then selecting the nodes could be a way to find other kind of information that is within the path.
I think that providing the path as a json function in postgresql is definitely a features that should 
not be to hard to add and which could be handy.

On the other hand, the Bag semantic means that any descendant predicate will increase the complexity (and possibly the
setof results).

Looking at the code of `jsonpath_exec` I can see that adding the SET semantic as a new feature might be not obvious ...