Thread: Datatype sizes; a space and speed issue?
Hi again, Something I have been wondering about and haven't found an answer to yet is how the size of a datatype (I hope that is the right term) effects performance. What effect is there if I specify "TEXT" instead of say "VARCHAR(255)"? How much benefit is there do being more aggresive and say cutting it in half again by using "VARCHAR(128)"? I hope this isn't too basic a question! Thanks! Madison
On Tuesday 22 June 2004 11:26 pm, Madison Kelly wrote: > Something I have been wondering about and haven't found an answer to > yet is how the size of a datatype (I hope that is the right term) > effects performance. What effect is there if I specify "TEXT" instead of > say "VARCHAR(255)"? How much benefit is there do being more aggresive > and say cutting it in half again by using "VARCHAR(128)"? There is no performance difference between varchar, char and text. Infact, text is equal to varchar with no length specified (a Postgresql extension). I use varchar when I want a max limit on the number of characters. Char when I want a set number of characters (auto space padded). And use text all other times, because it is more readable than varchar. All this and more can be found in the Docs - 7.4.2: Ch 8: Data Types, Sec 8.3: Character Types. good luck, Vams
> What effect is there if I specify "TEXT" instead of > say "VARCHAR(255)"? Well, theoretically, in the latter case, the database will allocate 256 (257? 259?) bytes for that field in the record itself. That is, that field will consume 256 bytes for each record stored. In the former case, the database will (again, theoretically) allocate a string pool (think of a separate file, as one way to do it) for the field, and will allocate a pointer (4 or 8 bytes) into the string pool for the field. This means that the field will consume only four or eight bytes for each record stored, plus (don't forget this, it's important) however many bytes are actually needed in the string pool for the text value. If you have, for instance, a fifteen character name stored in Unicode UTF32, there will be ninety, erm, octets (think bytes for most CPUs) in the pool for that field of that record, plus a few pointers and such so the system can tell where to collect garbage. > How much benefit is there do being more aggresive > and say cutting it in half again by using "VARCHAR(128)"? Well, that would reduce the storage requirements for that field by half. It might also prevent you from storing necessary information. That's easily 128 characters if you're only using US-ASCII in UTF-8, but it's only 32 characters of Unicode in UTF-32, and it could be anywhere between 128 and 32 in Unicode UTF-8. Making a good prediction about the tradeoff is one of the things a database engineer is paid for. > I hope this isn't too basic a question! Since you ask, yeah, it is, but unless things have changed here recently, the people here aren't going to raise the oven temperature too high when they roast me for answering it. -- Joel Matthew <rees@ddcom.co.jp>
Vams wrote: > On Tuesday 22 June 2004 11:26 pm, Madison Kelly wrote: > >> Something I have been wondering about and haven't found an answer to >>yet is how the size of a datatype (I hope that is the right term) >>effects performance. What effect is there if I specify "TEXT" instead of >>say "VARCHAR(255)"? How much benefit is there do being more aggresive >>and say cutting it in half again by using "VARCHAR(128)"? > > > There is no performance difference between varchar, char and text. Infact, > text is equal to varchar with no length specified (a Postgresql extension). > > I use varchar when I want a max limit on the number of characters. Char when > I want a set number of characters (auto space padded). And use text all > other times, because it is more readable than varchar. > > All this and more can be found in the Docs - 7.4.2: Ch 8: Data Types, Sec 8.3: > Character Types. > > good luck, > > Vams Thank you very much Vams!! Madison
On Wed, 23 Jun 2004, Joel Matthew wrote: > > What effect is there if I specify "TEXT" instead of > > say "VARCHAR(255)"? > > Well, theoretically, in the latter case, the database will allocate 256 > (257? 259?) bytes for that field in the record itself. That is, that > field will consume 256 bytes for each record stored. Both text and varchar are stored in PostgreSQL as length + string so varchar(256) doesn't not require storage of the unused bytes. char space pads so it's the oddball (requiring length + padded string). > > How much benefit is there do being more aggresive > > and say cutting it in half again by using "VARCHAR(128)"? > > Well, that would reduce the storage requirements for that field by half. > It might also prevent you from storing necessary information. That's > easily 128 characters if you're only using US-ASCII in UTF-8, but it's > only 32 characters of Unicode in UTF-32, and it could be anywhere > between 128 and 32 in Unicode UTF-8. Making a good prediction about the > tradeoff is one of the things a database engineer is paid for. varchar lengths should be in characters not bytes, so that should be 128 characters in any of the encodings, but the actual number of bytes that those 128 characters can take up may vary.
> Both text and varchar are stored in PostgreSQL as length + string so > varchar(256) doesn't not require storage of the unused bytes. char space > pads so it's the oddball (requiring length + padded string). Yeah, I saw Vams's post and r3m3mb3r3d that I'd f0rg0t. > varchar lengths should be in characters not bytes, so that should be 128 > characters in any of the encodings, but the actual number of bytes that > those 128 characters can take up may vary. And that's actually the best rationale, IMO, for making the optimization a full-fledged feature. Boy, the things you forget ... -- Joel Matthew <rees@ddcom.co.jp>
On Wed, Jun 23, 2004 at 01:15:23PM +0900, Joel Matthew wrote: > > What effect is there if I specify "TEXT" instead of > > say "VARCHAR(255)"? > > Well, theoretically, in the latter case, the database will allocate 256 > (257? 259?) bytes for that field in the record itself. That is, that > field will consume 256 bytes for each record stored. Nope. Actually, in both cases the length will be stored first (4 bytes) and the actual content following it, using the indicated amount of bytes. There's absolutely no difference in storage. A varchar(256) field will allow you to store a text not with 256 bytes max, but 256 _chars_ max. Think multibyte encodings such as utf8 -- the varchar(256) can take anything from 4 + 1 bytes (a single byte string) to 4 + 256 * max_bytes_per_char. The difference you cite is for char(N) fields, which are always padded with blanks to fill the N chars. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "There is evil in the world. There are dark, awful things. Occasionally, we get a glimpse of them. But there are dark corners; horrors almost impossible to imagine... even in our worst nightmares." (Van Helsing, Dracula A.D. 1972)
Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > The difference you cite is for char(N) fields, which are always padded > with blanks to fill the N chars. And, again, that's N logical *characters*, not bytes. We used to have some attempts at optimizing on the assumption that char(n) fields were physically fixed-width, but we gave it up as a bad job several major releases back ... it was never more than a very marginal optimization anyway ... regards, tom lane
Alvaro Herrera wrote: > On Wed, Jun 23, 2004 at 01:15:23PM +0900, Joel Matthew wrote: > >>>What effect is there if I specify "TEXT" instead of >>>say "VARCHAR(255)"? >> >>Well, theoretically, in the latter case, the database will allocate 256 >>(257? 259?) bytes for that field in the record itself. That is, that >>field will consume 256 bytes for each record stored. > > > Nope. Actually, in both cases the length will be stored first (4 bytes) > and the actual content following it, using the indicated amount of > bytes. There's absolutely no difference in storage. > > A varchar(256) field will allow you to store a text not with 256 bytes > max, but 256 _chars_ max. Think multibyte encodings such as utf8 -- the > varchar(256) can take anything from 4 + 1 bytes (a single byte string) to > 4 + 256 * max_bytes_per_char. > > > The difference you cite is for char(N) fields, which are always padded > with blanks to fill the N chars. Thank you everyone for enswering, that actually clears up a lot. Thank you too for not flaming this n00b! Madison
> We used to have some attempts at optimizing on the assumption that > char(n) fields were physically fixed-width, but we gave it up as a > bad job several major releases back ... it was never more than a > very marginal optimization anyway ... Does that mean that PostGreSQL fixes character width at thirty-two bits, or that it uses UTF-8, or that it just stores what it gets? (Checked chapter 8.3 in the manual, didn't see the answer there. Not that I really want to know. With Unicode, trying to optimize record sizes for char/text fields is a little like trying to play Russian Roulette. Wait, is that no longer politically correct? Should it be called six-chamber roulette, now? Don't want to offend anyone.) -- Joel Matthew <rees@ddcom.co.jp>
> ... the length will be stored first (4 bytes) > and the actual content following it Section 8.3 of the manual seems to indicate that the actual character data is stored in a separate file (background table -- I guess that might not be a separate file?) for text, unspecified width char, and char fields which exceed the specified length. -- Joel Matthew <rees@ddcom.co.jp>
Joel Matthew <rees@ddcom.co.jp> writes: >> ... the length will be stored first (4 bytes) >> and the actual content following it > Section 8.3 of the manual seems to indicate that the actual character > data is stored in a separate file (background table -- I guess that > might not be a separate file?) for text, unspecified width char, and > char fields which exceed the specified length. We may push very-wide fields out to a separate table ("toast table"). This has nothing to do with whether the field is text, varchar(n), unspecified varchar, or whatever, but only with the physical size of the data. regards, tom lane
Joel Matthew <rees@ddcom.co.jp> writes: > Does that mean that PostGreSQL fixes character width at thirty-two bits, > or that it uses UTF-8, or that it just stores what it gets? We store text data in the form indicated by the database encoding setting. UCS-32 is not a supported encoding, but UTF-8 is --- among others. > (Checked chapter 8.3 in the manual, didn't see the answer there. Not > that I really want to know. With Unicode, trying to optimize record > sizes for char/text fields is a little like trying to play Russian > Roulette. No, it's entirely like pointless. You just don't know how many bytes will be taken up by N characters. regards, tom lane
On Wed, Jun 23, 2004 at 02:43:06PM +0900, Joel Matthew wrote: > > We used to have some attempts at optimizing on the assumption that > > char(n) fields were physically fixed-width, but we gave it up as a > > bad job several major releases back ... it was never more than a > > very marginal optimization anyway ... > > Does that mean that PostGreSQL fixes character width at thirty-two bits, > or that it uses UTF-8, or that it just stores what it gets? It'll use utf8 if configured to do so; it'll just store what it gets if configured as SQL_ASCII (it isn't really ASCII); or it will convert from the client encoding to the server encoding before storing, if they are different. > (Checked chapter 8.3 in the manual, didn't see the answer there. Not > that I really want to know. With Unicode, trying to optimize record > sizes for char/text fields is a little like trying to play Russian > Roulette. Wait, is that no longer politically correct? Should it be > called six-chamber roulette, now? Don't want to offend anyone.) Of course, the optimization could have only worked on fixed-width encodings (not utf8 -- maybe possible with utf32, but Postgres doesn't support that AFAIK), but since current versions enable multibyte by default there's really no point in trying. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "I would rather have GNU than GNOT." (ccchips, lwn.net/Articles/37595/)