WIP: Generic functions for Node types using generated metadata - Mailing list pgsql-hackers
From: Andres Freund
Subject: WIP: Generic functions for Node types using generated metadata
Msg-id: 20190828234136.fk2ndqtld3onfrrp@alap3.anarazel.de
List: pgsql-hackers
Responses:
  Re: WIP: Generic functions for Node types using generated metadata
  Re: WIP: Generic functions for Node types using generated metadata
Hi,

There have been a lot of complaints over the years about how annoying it is to keep the out/read/copy/equalfuncs.c functions in sync with the actual underlying structs. There have been various calls for automating their generation, but no actual patches that I am aware of.

There also recently has been discussion about generating a more efficient memory layout for node trees that we know are read-only (e.g. plan trees inside the plancache), and about copying such trees more efficiently (e.g. by having one big allocation, and then just adjusting pointers).

One way to approach this problem would be to parse the type definitions, and directly generate code for the various functions. But that does mean that such a code generator needs to be expanded for each such function.

An alternative approach is to have a parser of the node definitions that doesn't generate code directly, but instead generates metadata. And then use that metadata to write node-aware functions. This seems more promising to me.

In the attached, *very experimental WIP*, patch I've played around with this approach. I have written a small C program that uses libclang (more on that later) to parse relevant headers, and generate metadata about node types.

Currently the generated metadata roughly looks like this:

#define TYPE_CAT_SCALAR      (1 << 0)	// FIXME: this isn't well named rn
#define TYPE_CAT_POINTER     (1 << 1)
#define TYPE_CAT_UNION       (1 << 2)	// FIXME: unused
#define TYPE_CAT_INCOMPLETE  (1 << 3)

#define KNOWN_TYPE_UNKNOWN   0
#define KNOWN_TYPE_STRING    1
#define KNOWN_TYPE_INT       2
#define KNOWN_TYPE_NODE      3
#define KNOWN_TYPE_BITMAPSET 4

typedef struct NodeTypeInfo
{
	/* FIXME: intern me */
	const char *structname;
	int16		nelems;
	int16		start_at;
	int16		size;
} NodeTypeInfo;

typedef struct NodeTypeComponents
{
	/* FIXME: intern me */
	const char *fieldname;
	/* FIXME: intern me */
	const char *fieldtype;
	int16		offset;
	int16		size;
	int16		flags;
	int16		node_type_id;
	int16		known_type_id;
} NodeTypeComponents;

NodeTypeInfo node_type_info[] = {
	{0},
	{.structname = "IndexInfo", .nelems = 22, .start_at = 2128, .size = sizeof(IndexInfo)},
	...
	{.structname = "Append", .nelems = 4, .start_at = 1878, .size = sizeof(Append)},
	...

NodeTypeComponents node_type_components[] = {
	...
	{.fieldname = "plan", .fieldtype = "struct Plan",
	 .offset = offsetof(struct Append, plan), .size = sizeof(struct Plan),
	 .flags = TYPE_CAT_SCALAR, .node_type_id = 9,
	 .known_type_id = KNOWN_TYPE_UNKNOWN},
	{.fieldname = "appendplans", .fieldtype = "struct List *",
	 .offset = offsetof(struct Append, appendplans), .size = sizeof(struct List *),
	 .flags = TYPE_CAT_POINTER, .node_type_id = 223,
	 .known_type_id = KNOWN_TYPE_UNKNOWN},
	{.fieldname = "first_partial_plan", .fieldtype = "int",
	 .offset = offsetof(struct Append, first_partial_plan), .size = sizeof(int),
	 .flags = TYPE_CAT_SCALAR, .node_type_id = -1,
	 .known_type_id = KNOWN_TYPE_UNKNOWN},
	{.fieldname = "part_prune_info", .fieldtype = "struct PartitionPruneInfo *",
	 .offset = offsetof(struct Append, part_prune_info), .size = sizeof(struct PartitionPruneInfo *),
	 .flags = TYPE_CAT_POINTER, .node_type_id = 53,
	 .known_type_id = KNOWN_TYPE_UNKNOWN},
	...

I.e. each node type is listed in node_type_info, indexed by the NodeTag value. The fields building up the node are defined in node_type_components. By including the size for node structs and components, and the offset for components, we can write generic routines accessing node trees.
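To illustrate what "generic" means here, a routine driven by this metadata could look roughly like the following minimal sketch (not part of the patch; dump_node_fields is a name made up here), which walks all fields of an arbitrary node and prints their layout:

/*
 * Sketch only: iterate over a node's fields using the generated
 * metadata arrays shown above.
 */
static void
dump_node_fields(const Node *node)
{
	const NodeTypeInfo *ti = &node_type_info[nodeTag(node)];

	for (int i = 0; i < ti->nelems; i++)
	{
		const NodeTypeComponents *c =
			&node_type_components[ti->start_at + i];

		fprintf(stderr, "%s.%s: type \"%s\", offset %d, size %d%s\n",
				ti->structname, c->fieldname, c->fieldtype,
				(int) c->offset, (int) c->size,
				(c->flags & TYPE_CAT_POINTER) ? " (pointer)" : "");
	}
}

Presumably the same loop structure, with a switch on known_type_id / node_type_id instead of the fprintf(), is what functions like the size estimator described below would be built around.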
In the attached I used this metadata to write a function that can compute the amount of memory a node tree needs (without currently considering memory allocator overhead). The function for doing so has a pretty limited amount of type awareness - it needs to know about strings, the various list types, value nodes and Bitmapset.

I'm fairly sure this metadata can also be used to write the other currently existing node functions. I *suspect* that it'll be quite OK performance-wise, compared to bespoke code, but I have *NOT* measured that.

The one set of fields this currently cannot deal with is the various arrays that we directly reference from nodes. E.g. for

typedef struct Sort
{
	Plan		plan;
	int			numCols;		/* number of sort-key columns */
	AttrNumber *sortColIdx;		/* their indexes in the target list */
	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
	Oid		   *collations;		/* OIDs of collations */
	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
} Sort;

the generic code has no way of knowing that sortColIdx, sortOperators, collations and nullsFirst are all numCols-long arrays. I can see various ways of dealing with that:

1) We could convert them all to lists, now that we have fast arbitrary access. But that'd add a lot of indirection / separate allocations.

2) We could add annotations to the source code, to declare that association. That's probably not trivial, but wouldn't be that hard - one disadvantage is that we probably couldn't use that declaration for automated asserts etc.

3) We could introduce a few macros to create array types that include the length of the members. That'd duplicate the length for each of those arrays, but I have a bit of a hard time believing that that's a meaningful enough overhead. I'm thinking of a macro that'd be used like

	ARRAY_FIELD(AttrNumber) *sortColIdx;

which would generate code like

	struct
	{
		size_t		nmembers;
		AttrNumber	members[FLEXIBLE_ARRAY_MEMBER];
	} *sortColIdx;

plus a set of macros (or perhaps inline functions + macros) to access them.
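Such macros could look roughly like this hypothetical sketch (ARRAY_NMEMBERS, ARRAY_NTH and process_sort_col are names invented here, not from the patch):

#define ARRAY_FIELD(type) \
	struct \
	{ \
		size_t		nmembers; \
		type		members[FLEXIBLE_ARRAY_MEMBER]; \
	}

#define ARRAY_NMEMBERS(arr)	((arr)->nmembers)
#define ARRAY_NTH(arr, n)	((arr)->members[(n)])

/* declaration in a node: */
	ARRAY_FIELD(AttrNumber) *sortColIdx;

/* and a hypothetical use: */
	for (size_t i = 0; i < ARRAY_NMEMBERS(sort->sortColIdx); i++)
		process_sort_col(ARRAY_NTH(sort->sortColIdx, i));

Note that each ARRAY_FIELD() use expands to a distinct anonymous struct type, so such fields would have to be accessed through these macros (or through generic metadata-driven code) rather than passed around by their type.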
With any of these we could add enough metadata for node type members to be able to handle such arrays in a generic manner, rather than having to write code for each of them.

With regards to using libclang for the parsing: I chose that because it seemed the easiest to experiment with, compared to annotating all the structs with enough metadata to be able to easily parse them from a perl script. The node definitions are after all distributed over quite a few headers.

I think it might even be the correct way forward, over inventing our own mini-languages and writing ad-hoc parsers for those. It sure is easier to understand plain C code, compared to having to understand various embedded mini-languages consisting of macros.

The obvious drawback is that it'd require more people to install libclang - a significant imposition. I think it might be OK if we only required it to be installed when changing node types (although it's not necessarily trivial how to detect that node types haven't changed, if we wanted to allow other code changes in the relevant headers), and stored the generated code in the repository.

Alternatively we could annotate the code enough to be able to write our own parser, or use some other C parser.

I don't really want to invest significantly more time into this without first debating the general idea. The attached patch really just is a fairly minimal prototype. It does not:

- properly integrate into the build system, to automatically re-generate the node information when necessary - instead one has to "make -C src/backend/catalog gennodes-run"
- solve the full problem (see discussion of referenced arrays above)
- actually use the metadata in a useful manner (there just is a function that estimates node tree sizes, which is used for parse/plan trees)
- have clean code

One interesting - but less important - bit is that the patch currently has the declared node type id available for node members, e.g.:

typedef struct FieldStore
{
	Expr		xpr;
	Expr	   *arg;			/* input tuple value */
	List	   *newvals;		/* new value(s) for field(s) */
	List	   *fieldnums;		/* integer list of field attnums */
	Oid			resulttype;		/* type of result (same as type of arg) */
	/* Like RowExpr, we deliberately omit a typmod and collation here */
} FieldStore;

but we cannot rely on arg actually pointing to a node of type Expr - it'll always point to a "subclass". Obviously we can handle that by dispatching using nodeTag(arg), but it'd be more efficient if we didn't need that. And we also lose some error checking due to that.

I wonder if we could collect enough metadata to determine which potential subclasses a node type has, and then dispatch more efficiently when there is only one, and assert that it's a legal subclass in the other cases.

Greetings,

Andres Freund