Re: Collecting statistics about contents of JSONB columns - Mailing list pgsql-hackers
From: Mahendra Singh Thalor
Subject: Re: Collecting statistics about contents of JSONB columns
Msg-id: CAKYtNAo1Pf0c-4t2UYobg73UR+fkkfBA8ZYHkeUGpV5bE2i7Sw@mail.gmail.com
In response to: Re: Collecting statistics about contents of JSONB columns (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Collecting statistics about contents of JSONB columns
List: pgsql-hackers
> On 2/4/22 03:47, Tomas Vondra wrote:
> > ./json-generate.py 30000 2 8 1000 6 1000
>
> Sorry, this should be (different order of parameters):
>
> ./json-generate.py 30000 2 1000 8 6 1000
>
Thanks, Tomas, for this test case.

Hi Hackers,

Below is a summary of my findings:
Point 1) ANALYZE takes a very long time for large documents:

For large JSON documents, ANALYZE takes far longer than on the current head. For reference, I am attaching the test file (generated with ./json-generate.py 30000 2 1000 8 6 1000).
Head: analyze test ; Time: 120.864 ms
With patch: analyze test ; Time: more than 2 hours
ANALYZE takes such a long time because, with these patches, we first iterate over all the sample rows (30000 in the above case), and we store all the paths we find (here, around 850k paths).
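For context, the collection phase looks roughly like this. This is a simplified sketch of my reading of the patch; jsonAnalyzeCollectAllPaths and jsonAnalyzeCollectPaths as written here are illustrative names, not necessarily the patch's actual functions:

    static void
    jsonAnalyzeCollectAllPaths(JsonAnalyzeContext *ctx, Datum *values,
                               int numrows)
    {
        int         row;

        for (row = 0; row < numrows; row++)     /* 30000 sample rows here */
        {
            Jsonb      *jb = DatumGetJsonbP(values[row]);

            /* walk the whole document, upserting every path into the hash */
            jsonAnalyzeCollectPaths(ctx, jb);
        }

        /* afterwards the hash holds ~850k paths, each with ~1 KB of stats */
    }

So both the work and the hash table grow with (sample rows x distinct paths), which is what makes this test case so slow.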
Point 2) Memory consumption grows rapidly for large documents:
In the above test case, there are 851k paths in total, and we allocate 1120 bytes to keep the stats for each path.

Memory to keep the stats for 1 path: 1120 bytes ~ 1 KB
(sizeof(JsonValueStats) = 1120, allocated in the "Analyze Column" context)
Total memory for all paths: 852689 * 1120 = 955011680 bytes ~ 955 MB

On top of that, each path needs even more memory: while analyzing a path, we allocate additional memory based on its frequency and other factors. To keep all entries (851k paths) in the hash table, we use around 1 GB of memory, which is very large.
Point 3) A review comment noticed by Tomas Vondra:
+ oldcxt = MemoryContextSwitchTo(ctx->stats->anl_context);
+ pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+ MemoryContextSwitchTo(oldcxt);
The above should be:
+ oldcxt = MemoryContextSwitchTo(ctx->mcxt);
+ pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+ MemoryContextSwitchTo(oldcxt);
+/*
+ * jsonAnalyzeCollectPath
+ * Extract a single path from JSON documents and collect its values.
+ */
+static void
+jsonAnalyzeCollectPath(JsonAnalyzeContext *ctx, Jsonb *jb, void *param)
+{
+ JsonPathAnlStats *pstats = (JsonPathAnlStats *) param;
+ JsonbValue jbvtmp;
+ JsonbValue *jbv = JsonValueInitBinary(&jbvtmp, jb);
+ JsonPathEntry *path;
+ JsonPathEntry **entries;
+ int i;
+
+ entries = palloc(sizeof(*entries) * pstats->depth);
+
+ /* Build entry array in direct order */
+ for (path = &pstats->path, i = pstats->depth - 1;
+ path->parent && i >= 0;
+ path = path->parent, i--)
+ entries[i] = path;
+
+ jsonAnalyzeCollectSubpath(ctx, pstats, jbv, entries, 0);
+
+ pfree(entries);
+}

In many cases here, we call palloc with size zero, and entries then points to memory that is never filled in (because pstats->depth = 0), so I think we should not call palloc with size 0.
Fix (a fuller sketch follows below):
+ if (pstats->depth)
+     entries = palloc(sizeof(*entries) * pstats->depth);
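A minimal sketch of the full function with that guard applied; the NULL initialization and the guarded pfree are my additions on top of the quoted patch code:

    static void
    jsonAnalyzeCollectPath(JsonAnalyzeContext *ctx, Jsonb *jb, void *param)
    {
        JsonPathAnlStats *pstats = (JsonPathAnlStats *) param;
        JsonbValue  jbvtmp;
        JsonbValue *jbv = JsonValueInitBinary(&jbvtmp, jb);
        JsonPathEntry *path;
        JsonPathEntry **entries = NULL;     /* stays NULL when depth == 0 */
        int         i;

        /* allocate only when there is at least one path entry to collect */
        if (pstats->depth)
            entries = palloc(sizeof(*entries) * pstats->depth);

        /* Build entry array in direct order */
        for (path = &pstats->path, i = pstats->depth - 1;
             path->parent && i >= 0;
             path = path->parent, i--)
            entries[i] = path;

        jsonAnalyzeCollectSubpath(ctx, pstats, jbv, entries, 0);

        if (entries)
            pfree(entries);
    }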
To reduce this, we could set a limit (like the MCV limit) on the number of paths, or we could allow giving an explicit path so that stats are collected only for that particular path, or we could pass only a subset of the JSON values. In the above case there are 851k paths in total, but if we collected stats for only the 1000 most common paths, we could cut down both the time and the memory, and we might even keep at least the frequencies for the non-analyzed paths. A sketch of such pruning is below.
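For illustration, a minimal sketch of such MCV-style pruning. JsonPathAnlStats comes from the patch, but the count field, the helper names, and the limit below are my assumptions, not the patch's actual API:

    #define JSONPATH_STATS_LIMIT 1000   /* hypothetical cutoff, like MCV */

    /* qsort comparator: most frequently seen paths first */
    static int
    compare_path_count_desc(const void *a, const void *b)
    {
        const JsonPathAnlStats *pa = *(const JsonPathAnlStats *const *) a;
        const JsonPathAnlStats *pb = *(const JsonPathAnlStats *const *) b;

        if (pa->count == pb->count)
            return 0;
        return (pa->count < pb->count) ? 1 : -1;
    }

    /*
     * Sort the collected paths by how often each one was seen and return
     * how many of them to analyze: at most JSONPATH_STATS_LIMIT instead
     * of all ~851k.
     */
    static int
    jsonAnalyzeSelectCommonPaths(JsonPathAnlStats **paths, int npaths)
    {
        qsort(paths, npaths, sizeof(JsonPathAnlStats *),
              compare_path_count_desc);

        return Min(npaths, JSONPATH_STATS_LIMIT);
    }

This would cap the per-path stats memory at roughly 1000 * ~1 KB instead of ~955 MB, and would similarly bound the time spent building stats.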
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com