Reducing data type space usage - Mailing list pgsql-hackers
From | Gregory Stark |
---|---|
Subject | Reducing data type space usage |
Date | |
Msg-id | 87mz91ow7m.fsf@enterprisedb.com Whole thread Raw |
Responses |
Re: Reducing data type space usage
Re: Reducing data type space usage Re: Reducing data type space usage |
List | pgsql-hackers |
Following up on the recent discussion on list about wasted space in data representations I want to summarise what we found and make some proposals: As I see it there are two cases: Case 1) Data types that are variable length but often quite small. This includes things like NUMERIC which in common usewill rarely be larger than 12-20 bytes and often things like text. In cases like these we really only need 1 or sometimes 2 byte varlena header overhead, not 4 as we currently do. In factwe *never* need more than 2 bytes of varlena header on disk anyways with the standard configuration. Case 2) Data types that are different sizes depending on the typmod but are always the same size that can be determinedstatically for a given typmod. In the case of a ASCII encoded database CHAR(n) fits this category and in any case we'll eventually have per-column encoding. NUMERC(a,b) could also be made to fit this as well. In cases like thesewe don't need *any* varlena header. If we could arrange for the functions to have enough information to know how largethe data must be. Solutions proposed: Case 1) We've discussed the variable sized varlena headers and I think it's clear that that's the most realistic way toapproach it. I don't think any other approaches were even suggested. Tom said he wanted a second varlena format for numeric that wouldhave 2-byte alignment. But I think we could always just say that we always use the 2-byte varlena header on data typeswith 2-byte alignment and the 4-byte header on data types with 4-byte alignment needs. Or heap_form_tuple could beeven cleverer about it but I'm not sure it's worth it. This limits the wasted space to 1-2% for most variable sized data that are 50 bytes long or more. But for very small datasuch as the quite common cases where those are often only 1-4 bytes it still means a 25-100% performance drain. Case 2) Solving this is quite difficult without introducing major performance problems or security holes. The one approachwe have that's practical right now is introducing special data types such as the oft-mentioned "char" data type."char" doesn't have quite the right semantics to use as a transparent substitute for CHAR but we could define a CHAR(1)with exactly the right semantics and substitute it transparently in parser/analyze.c (btw having two files namedanalyze.c is pretty annoying). We could do the same with NUMERIC(a,b) for sufficiently small values of a and b withsomething like D'Arcy's CASH data type (which uses an integer internally). The problem with defining lots of data types is that the number of casts and cross-data-type comparisons grows quadraticallyas the number of data types grows. In theory we would save space by defining a CHAR(n) for whatever size nthe user needs but I can't really see anything other than CHAR(1) being worthwhile. Similarly a 4-byte NUMERIC substitutelike CASH (with full NUMERIC semantics though) and maybe a 2-byte and 8-byte substitute might be reasonable butanything else would be pointless. I see these two solutions as complementary. The variable varlena headers take care of the larger data and the special-purpose data types take care of the extremely small data. And pretty important to cover both cases data that fits in 1-4 bytes is quite common. You often see databases with dozens of CHAR(1) flag columns or NUMERIC(10,2) currency columns. With a CHAR(1) and CASH style numeric substitute we won't have 25-100% performance lost on the things that would fit in 1-4 bytes. And with the variable sized varlena header we'll limit to 25% at worst and 1-2% usually the performance drain due to wasted space on larger data. Doing better would require a complete solution to data types that can understand how large they are based on their typmod. That would imply more dramatic solutions like I mused about involving passing around structures that contain the Datum as well as the attlen or atttypmod. The more I think about these ideas the more I think they may have merit but they would be awfully invasive and require more thought. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
pgsql-hackers by date: