Thread: Datatype sizes; a space and speed issue?

Datatype sizes; a space and speed issue?

From

Madison Kelly

Date:

23 June 2004, 00:26:34

Hi again,

   Something I have been wondering about and haven't found an answer to
yet is how the size of a datatype (I hope that is the right term)
effects performance. What effect is there if I specify "TEXT" instead of
say "VARCHAR(255)"? How much benefit is there do being more aggresive
and say cutting it in half again by using "VARCHAR(128)"?

   I hope this isn't too basic a question! Thanks!

Madison

Re: Datatype sizes; a space and speed issue?

From

Vams

Date:

23 June 2004, 00:58:57

On Tuesday 22 June 2004 11:26 pm, Madison Kelly wrote:
>    Something I have been wondering about and haven't found an answer to
> yet is how the size of a datatype (I hope that is the right term)
> effects performance. What effect is there if I specify "TEXT" instead of
> say "VARCHAR(255)"? How much benefit is there do being more aggresive
> and say cutting it in half again by using "VARCHAR(128)"?

There is no performance difference between varchar, char and text.  Infact,
text is equal to varchar with no length specified (a Postgresql extension).

I use varchar when I want a max limit on the number of characters.  Char when
I want a set number of characters (auto space padded).  And use text all
other times, because it is more readable than varchar.

All this and more can be found in the Docs - 7.4.2: Ch 8: Data Types, Sec 8.3:
Character Types.

good luck,

  Vams

Re: Datatype sizes; a space and speed issue?

From

Joel Matthew

Date:

23 June 2004, 01:15:38

> What effect is there if I specify "TEXT" instead of
> say "VARCHAR(255)"?

Well, theoretically, in the latter case, the database will allocate 256
(257? 259?) bytes for that field in the record itself. That is, that
field will consume 256 bytes for each record stored.

In the former case, the database will (again, theoretically) allocate a
string pool (think of a separate file, as one way to do it) for the
field, and will allocate a pointer (4 or 8 bytes) into the string pool
for the field. This means that the field will consume only four or eight
bytes for each record stored, plus (don't forget this, it's important)
however many bytes are actually needed in the string pool for the text
value. If you have, for instance, a fifteen character name stored in
Unicode UTF32, there will be ninety, erm, octets (think bytes for most
CPUs) in the pool for that field of that record, plus a few pointers and
such so the system can tell where to collect garbage.

> How much benefit is there do being more aggresive
> and say cutting it in half again by using "VARCHAR(128)"?

Well, that would reduce the storage requirements for that field by half.
It might also prevent you from storing necessary information. That's
easily 128 characters if you're only using US-ASCII in UTF-8, but it's
only 32 characters of Unicode in UTF-32, and it could be anywhere
between 128 and 32 in Unicode UTF-8. Making a good prediction about the
tradeoff is one of the things a database engineer is paid for.

> I hope this isn't too basic a question!

Since you ask, yeah, it is, but unless things have changed here recently,
the people here aren't going to raise the oven temperature too high when
they roast me for answering it.

--
Joel Matthew <rees@ddcom.co.jp>

Re: Datatype sizes; a space and speed issue?

From

Madison Kelly

Date:

23 June 2004, 01:25:57

Vams wrote:
> On Tuesday 22 June 2004 11:26 pm, Madison Kelly wrote:
>
>>   Something I have been wondering about and haven't found an answer to
>>yet is how the size of a datatype (I hope that is the right term)
>>effects performance. What effect is there if I specify "TEXT" instead of
>>say "VARCHAR(255)"? How much benefit is there do being more aggresive
>>and say cutting it in half again by using "VARCHAR(128)"?
>
>
> There is no performance difference between varchar, char and text.  Infact,
> text is equal to varchar with no length specified (a Postgresql extension).
>
> I use varchar when I want a max limit on the number of characters.  Char when
> I want a set number of characters (auto space padded).  And use text all
> other times, because it is more readable than varchar.
>
> All this and more can be found in the Docs - 7.4.2: Ch 8: Data Types, Sec 8.3:
> Character Types.
>
> good luck,
>
>   Vams

   Thank you very much Vams!!

Madison

Re: Datatype sizes; a space and speed issue?

From

Stephan Szabo

Date:

23 June 2004, 01:34:22

On Wed, 23 Jun 2004, Joel Matthew wrote:

> > What effect is there if I specify "TEXT" instead of
> > say "VARCHAR(255)"?
>
> Well, theoretically, in the latter case, the database will allocate 256
> (257? 259?) bytes for that field in the record itself. That is, that
> field will consume 256 bytes for each record stored.

Both text and varchar are stored in PostgreSQL as length + string so
varchar(256) doesn't not require storage of the unused bytes.  char space
pads so it's the oddball (requiring length + padded string).

> > How much benefit is there do being more aggresive
> > and say cutting it in half again by using "VARCHAR(128)"?
>
> Well, that would reduce the storage requirements for that field by half.
> It might also prevent you from storing necessary information. That's
> easily 128 characters if you're only using US-ASCII in UTF-8, but it's
> only 32 characters of Unicode in UTF-32, and it could be anywhere
> between 128 and 32 in Unicode UTF-8. Making a good prediction about the
> tradeoff is one of the things a database engineer is paid for.

varchar lengths should be in characters not bytes, so that should be 128
characters in any of the encodings, but the actual number of bytes that
those 128 characters can take up may vary.

Re: Datatype sizes; a space and speed issue?

From

Joel Matthew

Date:

23 June 2004, 01:47:36

> Both text and varchar are stored in PostgreSQL as length + string so
> varchar(256) doesn't not require storage of the unused bytes.  char space
> pads so it's the oddball (requiring length + padded string).

Yeah, I saw Vams's post and r3m3mb3r3d that I'd f0rg0t.

> varchar lengths should be in characters not bytes, so that should be 128
> characters in any of the encodings, but the actual number of bytes that
> those 128 characters can take up may vary.

And that's actually the best rationale, IMO, for making the optimization
a full-fledged feature. Boy, the things you forget ...

--
Joel Matthew <rees@ddcom.co.jp>

Re: Datatype sizes; a space and speed issue?

From

Alvaro Herrera

Date:

23 June 2004, 01:52:06

On Wed, Jun 23, 2004 at 01:15:23PM +0900, Joel Matthew wrote:
> > What effect is there if I specify "TEXT" instead of
> > say "VARCHAR(255)"?
>
> Well, theoretically, in the latter case, the database will allocate 256
> (257? 259?) bytes for that field in the record itself. That is, that
> field will consume 256 bytes for each record stored.

Nope.  Actually, in both cases the length will be stored first (4 bytes)
and the actual content following it, using the indicated amount of
bytes.  There's absolutely no difference in storage.

A varchar(256) field will allow you to store a text not with 256 bytes
max, but 256 _chars_ max.  Think multibyte encodings such as utf8 -- the
varchar(256) can take anything from 4 + 1 bytes (a single byte string) to
4 + 256 * max_bytes_per_char.

The difference you cite is for char(N) fields, which are always padded
with blanks to fill the N chars.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"There is evil in the world. There are dark, awful things. Occasionally, we get
a glimpse of them. But there are dark corners; horrors almost impossible to
imagine... even in our worst nightmares." (Van Helsing, Dracula A.D. 1972)

Re: Datatype sizes; a space and speed issue?

From

Tom Lane

Date:

23 June 2004, 02:15:25

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> The difference you cite is for char(N) fields, which are always padded
> with blanks to fill the N chars.

And, again, that's N logical *characters*, not bytes.

We used to have some attempts at optimizing on the assumption that
char(n) fields were physically fixed-width, but we gave it up as a
bad job several major releases back ... it was never more than a
very marginal optimization anyway ...

            regards, tom lane

Re: Datatype sizes; a space and speed issue?

From

Madison Kelly

Date:

23 June 2004, 02:35:58

Alvaro Herrera wrote:
> On Wed, Jun 23, 2004 at 01:15:23PM +0900, Joel Matthew wrote:
>
>>>What effect is there if I specify "TEXT" instead of
>>>say "VARCHAR(255)"?
>>
>>Well, theoretically, in the latter case, the database will allocate 256
>>(257? 259?) bytes for that field in the record itself. That is, that
>>field will consume 256 bytes for each record stored.
>
>
> Nope.  Actually, in both cases the length will be stored first (4 bytes)
> and the actual content following it, using the indicated amount of
> bytes.  There's absolutely no difference in storage.
>
> A varchar(256) field will allow you to store a text not with 256 bytes
> max, but 256 _chars_ max.  Think multibyte encodings such as utf8 -- the
> varchar(256) can take anything from 4 + 1 bytes (a single byte string) to
> 4 + 256 * max_bytes_per_char.
>
>
> The difference you cite is for char(N) fields, which are always padded
> with blanks to fill the N chars.

Thank you everyone for enswering, that actually clears up a lot. Thank
you too for not flaming this n00b!

Madison

Re: Datatype sizes; a space and speed issue?

From

Joel Matthew

Date:

23 June 2004, 02:43:20

> We used to have some attempts at optimizing on the assumption that
> char(n) fields were physically fixed-width, but we gave it up as a
> bad job several major releases back ... it was never more than a
> very marginal optimization anyway ...

Does that mean that PostGreSQL fixes character width at thirty-two bits,
or that it uses UTF-8, or that it just stores what it gets?

(Checked chapter 8.3 in the manual, didn't see the answer there. Not
that I really want to know. With Unicode, trying to optimize record
sizes for char/text fields is a little like trying to play Russian
Roulette. Wait, is that no longer politically correct? Should it be
called six-chamber roulette, now? Don't want to offend anyone.)

--
Joel Matthew <rees@ddcom.co.jp>

Re: Datatype sizes; a space and speed issue?

From

Joel Matthew

Date:

23 June 2004, 02:54:35

> ... the length will be stored first (4 bytes)
> and the actual content following it

Section 8.3 of the manual seems to indicate that the actual character
data is stored in a separate file (background table -- I guess that
might not be a separate file?) for text, unspecified width char, and
char fields which exceed the specified length.

--
Joel Matthew <rees@ddcom.co.jp>

Re: Datatype sizes; a space and speed issue?

From

Tom Lane

Date:

23 June 2004, 03:18:15

Joel Matthew <rees@ddcom.co.jp> writes:
>> ... the length will be stored first (4 bytes)
>> and the actual content following it

> Section 8.3 of the manual seems to indicate that the actual character
> data is stored in a separate file (background table -- I guess that
> might not be a separate file?) for text, unspecified width char, and
> char fields which exceed the specified length.

We may push very-wide fields out to a separate table ("toast table").
This has nothing to do with whether the field is text, varchar(n),
unspecified varchar, or whatever, but only with the physical size
of the data.

            regards, tom lane

Re: Datatype sizes; a space and speed issue?

From

Tom Lane

Date:

23 June 2004, 03:22:33

Joel Matthew <rees@ddcom.co.jp> writes:
> Does that mean that PostGreSQL fixes character width at thirty-two bits,
> or that it uses UTF-8, or that it just stores what it gets?

We store text data in the form indicated by the database encoding
setting.  UCS-32 is not a supported encoding, but UTF-8 is --- among
others.

> (Checked chapter 8.3 in the manual, didn't see the answer there. Not
> that I really want to know. With Unicode, trying to optimize record
> sizes for char/text fields is a little like trying to play Russian
> Roulette.

No, it's entirely like pointless.  You just don't know how many bytes
will be taken up by N characters.

            regards, tom lane

Re: Datatype sizes; a space and speed issue?

From

Alvaro Herrera

Date:

27 June 2004, 03:45:35

On Wed, Jun 23, 2004 at 02:43:06PM +0900, Joel Matthew wrote:
> > We used to have some attempts at optimizing on the assumption that
> > char(n) fields were physically fixed-width, but we gave it up as a
> > bad job several major releases back ... it was never more than a
> > very marginal optimization anyway ...
>
> Does that mean that PostGreSQL fixes character width at thirty-two bits,
> or that it uses UTF-8, or that it just stores what it gets?

It'll use utf8 if configured to do so; it'll just store what it gets if
configured as SQL_ASCII (it isn't really ASCII); or it will convert from
the client encoding to the server encoding before storing, if they are
different.

> (Checked chapter 8.3 in the manual, didn't see the answer there. Not
> that I really want to know. With Unicode, trying to optimize record
> sizes for char/text fields is a little like trying to play Russian
> Roulette. Wait, is that no longer politically correct? Should it be
> called six-chamber roulette, now? Don't want to offend anyone.)

Of course, the optimization could have only worked on fixed-width
encodings (not utf8 -- maybe possible with utf32, but Postgres doesn't
support that AFAIK), but since current versions enable multibyte by
default there's really no point in trying.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"I would rather have GNU than GNOT."  (ccchips, lwn.net/Articles/37595/)