Re: Is it really such a good thing for newNode() to be a macro? - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: Is it really such a good thing for newNode() to be a macro?
Msg-id: 26133.1220037409@sss.pgh.pa.us
In response to: Re: Is it really such a good thing for newNode() to be a macro? (Peter Eisentraut <peter_e@gmx.net>)
List: pgsql-hackers
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane wrote:
>> I considered that one, but since part of my argument is that inlining
>> this is a waste of code space, it seems like a better inlining
>> technology isn't really the answer.

> The compiler presumably has the intelligence and the command-line options
> to control how much inlining one wants to do.  But without any size vs.
> performance measurements it is an idle discussion.  Getting rid of a global
> variable and macro ugliness is a worthwhile goal of its own.

I got around to doing some experiments.  The method suggested by Heikki
(out-of-line subroutine for everything except the MemSetTest) reduces the
size of the backend executable by about 0.5% (about 20K) in CVS HEAD on
Fedora 9 x86_64, in a non-assert-enabled build.  However, it also makes it
measurably slower.  I couldn't detect any difference in a regular pgbench
run, so instead I timed iterations of this:

explain select * from tenk1 a
  join tenk1 b using (unique1)
  join tenk1 c on a.unique1 = c.unique2
  join tenk1 d on a.unique1 = d.thousand
  join tenk1 e on a.unique1 = e.ten
  join tenk1 f on a.unique1 = f.tenthous
  join tenk1 g on a.unique1 = g.unique2
  where exists (select 1 from int4_tbl where f1 = b.unique2);

in the regression database.  Put the above (as a single line!) into
"explainjoin.sql" and try

	pgbench -c 1 -t 100 -n -f explainjoin.sql regression

This is mostly stressing the planner, which is pretty newNode-heavy.
I get consistently about 14.1 tps on straight CVS HEAD and about 13.8 tps
with the partially out-of-line implementation.

I also tried the "static inline" implementation, but that doesn't work
at all: gcc refuses to inline it, which makes the palloc0fast call a
dead loss.  So indeed what we need here is a better inlining technology.
I looked into using gcc's "a compound statement enclosed in parentheses
is an expression" extension, thus:

#define newNode(size, tag) \
({	Node   *newNodeMacroHolder; \
\
	AssertMacro((size) >= sizeof(Node));	/* need the tag, at least */ \
	newNodeMacroHolder = (Node *) palloc0fast(size); \
	newNodeMacroHolder->type = (tag); \
	newNodeMacroHolder; \
})

This gets rid of the global, but incredibly, it's even slower: 13.5 tps
on the explain test.  I do not understand that result.  I looked at the
generated machine code to verify that it was what I expected, and indeed
it's about the same as CVS HEAD except that there's no store-and-fetch
into a global.

Getting rid of the global variable accesses reduces the size of the
backend by about 12K on this architecture, and the only theory I can
think of is that that moves things around enough to make the instruction
cache less efficient on some code path that this test happens to
exercise heavily.  In theory the above implementation of newNode should
be a clear win, so I'm thinking this result must be an artifact of some
kind.  I'm going to go try it on PPC and HPPA machines next; does anyone
want to try it on something else?

			regards, tom lane