Thread: reducing IO and memory usage: sending the content of a table to multiple files
From: Ivan Sergio Borgonovo
This is the workflow I have in mind:

1a) take out *all* the data from a table in chunks (M records per file, or one big file?) (\copy? from inside a scripting language?)
2a) process each file with awk to produce N files that are all very similar to each other (essentially turn them into very simple XML)
3a) gzip them

or alternatively:

2b) use a scripting language to do the processing and the gzipping, avoiding a bit of disk IO

Does PostgreSQL offer any contrib, module, technique... that would save me some IO (and maybe disk space for temporary results)?

Are there any memory usage implications if I do a

  pg_query("select a,b,c from verylargetable; --no where clause");

vs. the \copy equivalent? Any way to avoid them?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
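A minimal sketch of the \copy-style route (steps 1a and 3a) from the shell; the database name "mydb" and the awk filter "make_xml.awk" are hypothetical, and COPY (SELECT ...) needs PostgreSQL 8.2 or later:

    # stream the three columns straight out of the server, filter, compress;
    # nothing touches disk except the final .gz file
    psql -d mydb -c "COPY (SELECT a, b, c FROM verylargetable) TO STDOUT" \
      | awk -f make_xml.awk \
      | gzip > verylargetable.xml.gz

One pipeline like this per output file means one table scan per file; producing all N files from a single scan is what the client-side loop discussed later in the thread does instead.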
Re: reducing IO and memory usage: sending the content of a table to multiple files
From: Sam Mason
On Thu, Apr 02, 2009 at 11:20:02AM +0200, Ivan Sergio Borgonovo wrote:
> This is the work-flow I've in mind:
>
> 1a) take out *all* data from a table in chunks (M record for each
> file, one big file?) (\copy??, from inside a scripting language?)

What about using cursors here?

> 2a) process each file with awk to produce N files very similar each
> other (substantially turn them into very simple xml)
> 3a) gzip them

GZIP uses significant CPU time; there are various lighter-weight schemes available that may be better, depending on where this data is going.

> 2b) use any scripting language to process and gzip them avoiding a
> bit of disk IO

What disk IO are you trying to save, and why?

> Does PostgreSQL offer me any contrib, module, technique... to save
> some IO (and maybe disk space for temporary results?).
>
> Are there any memory usage implication if I'm doing a:
> pg_query("select a,b,c from verylargetable; --no where clause");
> vs.
> the \copy equivalent
> any way to avoid them?

As far as I understand it, the client will pull all the data from the database into memory first and only then does your code get a chance to run. For large datasets this obviously doesn't work well. CURSORs are your friend here.

--
Sam  http://samason.me.uk/
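For reference, the bare SQL behind the cursor suggestion looks something like this (cursor name and batch size are arbitrary); the PHP wrapping comes up later in the thread:

    BEGIN;
    DECLARE cur NO SCROLL CURSOR FOR SELECT a, b, c FROM verylargetable;
    FETCH FORWARD 1000 FROM cur;   -- repeat until it returns no rows
    CLOSE cur;
    COMMIT;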
Re: reducing IO and memory usage: sending the content of a table to multiple files
From: Ivan Sergio Borgonovo
On Thu, 2 Apr 2009 17:27:55 +0100 Sam Mason <sam@samason.me.uk> wrote:
> On Thu, Apr 02, 2009 at 11:20:02AM +0200, Ivan Sergio Borgonovo wrote:
> > This is the work-flow I've in mind:
> >
> > 1a) take out *all* data from a table in chunks (M record for each
> > file, one big file?) (\copy??, from inside a scripting language?)
>
> What about using cursors here?

The only examples I've seen of using cursors with PHP involve defining a function... It looks a bit messy for a 10-line script to have to define a function just as a shell for an SQL statement. I'd even write it in Python if the project weren't already starting to look like a small Frankenstein... and sooner or later I bet I'll have to include some PHP files to recycle some functions.

I didn't find any elegant example of cursor use in PHP... OK, PHP is not the most elegant language around... but still, is there any good example someone could point me at? If you could point me to some clean way to use cursors in PHP I'd be happy to learn.

I was also thinking about using another embedded language that better suits data processing (an untrusted one?) so I could write the output files directly from within a PostgreSQL function...

> > 2a) process each file with awk to produce N files very similar
> > each other (substantially turn them into very simple xml)
> > 3a) gzip them
>
> GZIP uses significant CPU time; there are various lighter weight
> schemes available that may be better depending on where this data
> is going.

That's a requirement.

> > 2b) use any scripting language to process and gzip them avoiding
> > a bit of disk IO
>
> What disk IO are you trying to save and why?

Because this is going to be the largest write operation the whole system will have to handle during the day. I'm not interested in fast complicated queries, planning, transactions, caching... I just need to read a whole table, pass it through a filter and output several filtered "versions" of that table. So I think the largest cost of the operation will be IO.

\copy should be optimised for "raw" data output, but maybe all its advantages get lost once I have to use pipes and add complexity to the filtering.

> > Does PostgreSQL offer me any contrib, module, technique... to
> > save some IO (and maybe disk space for temporary results?).
> >
> > Are there any memory usage implication if I'm doing a:
> > pg_query("select a,b,c from verylargetable; --no where clause");
> > vs.
> > the \copy equivalent
> > any way to avoid them?
>
> As far as I understand it will get all the data from the database
> into memory first and then your code gets a chance. For large
> datasets this obviously doesn't work well. CURSORs are you friend
> here.

I was reading the PHP documentation trying to understand how buffers and memory usage work, and I had a look at the MySQL documentation too... MySQL has mysql_unbuffered_query. So I was wondering how memory is managed on the server and on the client.

What is going to happen when I do

  $result = pg_query("select * from t1;");
  while ($row = pg_fetch_array($result)) {
  }

vs. using cursors... vs. asynchronous queries (those just look like non-blocking queries, with no obvious relationship to memory usage)? Where are the buffers, etc.?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Re: reducing IO and memory usage: sending the content of a table to multiple files
From: Sam Mason
On Thu, Apr 02, 2009 at 09:48:33PM +0200, Ivan Sergio Borgonovo wrote:
> I didn't find any elegant example of cursor use in PHP... OK PHP is
> not the most elegant language around... but still any good exapmle
> someone could point me at?

I don't program PHP; but my guess would be something like:

  pg_query("BEGIN;");
  pg_query("DECLARE cur CURSOR FOR SELECT * FROM t1;");
  while (pg_num_rows($result = pg_query("FETCH 1000 FROM cur;")) > 0) {
      while ($row = pg_fetch_array($result)) {
      }
  }
  pg_query("COMMIT;");

You can obviously increase the "FETCH" upwards, and if you're feeling fancy you could even run the FETCH async from the code that processes the results. Maybe something like:

  pg_query($conn, "BEGIN;");
  pg_query($conn, "DECLARE cur CURSOR FOR SELECT * FROM t1;");
  pg_send_query($conn, "FETCH 1000 FROM cur;");
  while (1) {
      $result = pg_get_result($conn);
      if (pg_get_result($conn)) {
          // badness, only expecting a single result
      }
      if (pg_num_rows($result) == 0)
          break;
      // ask for the next batch while we process the current one
      pg_send_query($conn, "FETCH 1000 FROM cur;");
      while ($row = pg_fetch_array($result)) {
      }
  }
  pg_query($conn, "COMMIT;");

Note, I've never tried to do PG database stuff from PHP, let alone stuff like this, so it may be all wrong! AFAICT there's no need to bother with pg_connection_busy because the call to pg_get_result will block until the results come back from the database.

> So I think the largest cost of the operation will be IO.
> \copy should be optimised for "raw" data output, but maybe all its
> advantages get lost once I've to use pipes and adding complexity to
> filtering.

Streaming IO is pretty fast; I think you'll be hard pushed to keep up with it from PHP and you'll end up CPU bound in no time. It would be interesting to find out, though.

> I was reading about all the php documents and trying to understand
> how buffers and memory usage works, so I gave a look to MySQL
> documents too...

Not sure about PG, but the C API pretty much always buffers everything in memory first. There was mention of getting control of this, but I've got no idea where that got to.

--
Sam  http://samason.me.uk/
Re: reducing IO and memory usage: sending the content of a table to multiple files
From: Scott Marlowe
On Thu, Apr 2, 2009 at 7:05 PM, Sam Mason <sam@samason.me.uk> wrote:
> On Thu, Apr 02, 2009 at 09:48:33PM +0200, Ivan Sergio Borgonovo wrote:
>> I didn't find any elegant example of cursor use in PHP... OK PHP is
>> not the most elegant language around... but still any good exapmle
>> someone could point me at?
>
> I don't program PHP; but my guess would be something like:
>
>   pg_query("BEGIN;");
>   pg_query("DECLARE cur CURSOR FOR SELECT * FROM t1;");
>   while (pg_num_rows($result = pg_query("FETCH 1000 FROM cur;")) > 0) {
>       while ($row = pg_fetch_array($result)) {
>       }
>   }
>   pg_query("COMMIT;");

I've done something similar and it worked just fine.
Re: reducing IO and memory usage: sending the content of a table to multiple files
From: Ivan Sergio Borgonovo
On Fri, 3 Apr 2009 02:05:19 +0100 Sam Mason <sam@samason.me.uk> wrote:
> On Thu, Apr 02, 2009 at 09:48:33PM +0200, Ivan Sergio Borgonovo wrote:
> > I didn't find any elegant example of cursor use in PHP... OK PHP
> > is not the most elegant language around... but still any good
> > exapmle someone could point me at?
>
> You can obviously increase the "FETCH" upwards and if you're
> feeling fancy you could even run the FETCH async from the code
> that processes the results. Maybe something like:

What kind of advantage should I get from running asynchronously? Oh, I didn't mean you were suggesting any advantage... just wondering. It could be an option if, once everything is up, I want to keep the resources sucked up by this process under control (?).

> > So I think the largest cost of the operation will be IO.
> > \copy should be optimised for "raw" data output, but maybe all
> > its advantages get lost once I've to use pipes and adding
> > complexity to filtering.
>
> Streaming IO is pretty fast, I think you'll be hard pushed to keep
> up with it from PHP and you'll end up CPU bound in no time. Be
> interesting to find out though.

Filtering is currently very simple... I'm building a very simple XML file, just queueing constant strings and whatever comes out of the DB. But if I had to parse a CSV (split), or just assign names to columns (and I expect this stuff to be adjusted frequently), or use regexps... I was expecting to waste more human cycles or CPU cycles than I would save by relying on the optimised IO of \copy (if any).

Most of the operation ends up being:

  $output = SOMEHEADER;
  gzwrite($f1, $output); gzwrite($f2, $output);
  gzwrite($f3, $output); gzwrite($f4, $output);
  while (...) {
      $record1 = '<sometag someattr>' . SOMECONST1 . $row['col4'] . '</sometag>'
               . '<someothertag someattr>' . $row['col3'] . '</someothertag>';
      $record2 = '<sometag someattr>' . SOMECONST2 . $row['col4'] . '</sometag>'
               . '<someothertag someattr>' . $row['col3'] . '</someothertag>';
      // ...and similarly for the other output files
      gzwrite($f1, $record1); gzwrite($f2, $record2);
      gzwrite($f3, $record3); gzwrite($f4, $record4);
  }
  $output = SOMEFOOTER;
  gzwrite($f1, $output); gzwrite($f2, $output);
  gzwrite($f3, $output); gzwrite($f4, $output);

I have the largest table of my DB to slice into multiple XML files that have to be written to disk. So it is actually: 1) reading the whole table and returning 30% of its fields, 2) writing all that data multiple times. This is by far the largest write load the server is going to incur in a day. It may well be the largest CPU load it incurs in a day too, considering I have to gzip all the files. Still, I think I've read on this list that compression was going to be more of a bottleneck than IO.

I just did a preliminary test, and xml-ing and gzipping 80K records out of 1M takes less than 2 sec. So maybe I was over-concerned. Anyway, I'd like to understand a bit better how IO and memory consumption are managed with a cursor vs. a plain select, with the client drivers in the middle.

> > I was reading about all the php documents and trying to
> > understand how buffers and memory usage works, so I gave a look
> > to MySQL documents too...
>
> Not sure about PG, but the C api pretty much always buffers
> everything in memory first. There was mention of getting control
> of this, but I've got no idea where it got.

Buffer *everything*?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
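Putting the cursor suggestion and that loop together gives a minimal sketch like the following; the connection string, file names, column names, constants and batch size are all assumptions, and real code would still need to XML-escape the values:

  <?php
  // Hypothetical sketch: one table scan feeding several gzipped XML files,
  // with client memory bounded by the FETCH batch size.
  $db = pg_connect("dbname=mydb");
  $f1 = gzopen("feed1.xml.gz", "w6");
  $f2 = gzopen("feed2.xml.gz", "w6");

  gzwrite($f1, "<items>\n");
  gzwrite($f2, "<items>\n");

  pg_query($db, "BEGIN");
  pg_query($db, "DECLARE cur NO SCROLL CURSOR FOR SELECT col3, col4 FROM verylargetable");
  while (pg_num_rows($res = pg_query($db, "FETCH 1000 FROM cur")) > 0) {
      while ($row = pg_fetch_assoc($res)) {
          // one variant per output file, built from the same row
          gzwrite($f1, "<item kind=\"1\"><a>" . $row['col4'] . "</a><b>" . $row['col3'] . "</b></item>\n");
          gzwrite($f2, "<item kind=\"2\"><a>" . $row['col4'] . "</a><b>" . $row['col3'] . "</b></item>\n");
      }
  }
  pg_query($db, "COMMIT");

  gzwrite($f1, "</items>\n"); gzclose($f1);
  gzwrite($f2, "</items>\n"); gzclose($f2);

With this shape only one FETCH batch plus the gzip buffers is ever resident in PHP, while the server does a single sequential scan of the table.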
Re: reducing IO and memory usage: sending the content of a table to multiple files
From: Sam Mason
On Fri, Apr 03, 2009 at 11:09:56AM +0200, Ivan Sergio Borgonovo wrote:
> On Fri, 3 Apr 2009 02:05:19 +0100 Sam Mason <sam@samason.me.uk> wrote:
> > On Thu, Apr 02, 2009 at 09:48:33PM +0200, Ivan Sergio Borgonovo wrote:
> > > I didn't find any elegant example of cursor use in PHP... OK PHP
> > > is not the most elegant language around... but still any good
> > > exapmle someone could point me at?
>
> > You can obviously increase the "FETCH" upwards and if you're
> > feeling fancy you could even run the FETCH async from the code
> > that processes the results. Maybe something like:
>
> What kind of advantage should I get running asynchronously?

PG will be running the next "FETCH" command while you're processing the last one. Otherwise you're serialising everything; i.e.

  1) ask for results
  2) PG executes query
  3) PG sends you results
  4) you process results

all happen sequentially, even though they can be done in parallel. If you're doing it async, I'd expect steps 2 and 3 to be done in the background while your code is running step 4. I've done this sort of thing in C before but not from PHP; there shouldn't be much difference though. If you've got a step 5 of sending the results off somewhere else, then you may be able to arrange for that to happen in parallel too.

I'd stay away from threads if at all possible; it's very easy to introduce obscure bugs that you don't notice for a long time. It's easy to write multi-threaded code; what's *very* difficult is to write correct multi-threaded code that solves a non-trivial problem.

> oh I didn't mean you were suggesting any advantage... just wondering.
> It could be an option if once everything is up I want to keep under
> control resources sucked by this process (?).

It's only ever going to get one "block" (i.e. 1000 rows in the example above) ahead of itself, so the resource consumption is quite tightly bounded.

> Filtering is currently very simple... I'm building a very simple xml
> just queueing constant strings and what comes out of the DB.
> But if I had to parse a CSV (split) or just assign names to columns
> (and I expect this stuff is going to be adjusted frequently) or use
> regexp... I was expecting to waste more human cycles or CPU cycles
> than avoiding to rely on optimised IO of \copy (if any).

OK, this was what I was wondering about. I wasn't sure if you were just gzipping some sort of CSV output or something. Once you have to start pulling individual fields apart you're probably better off using normal queries.

> Still I think I've read on this list that compression was going
> to be a bottleneck more than IO.
> I just did a preliminary test and xml-ing and gzipping 80K records
> out of 1M takes less than 2sec.
> So maybe I was over concerned.

The interesting measurement is throughput in bytes per second. A single hard drive will stream data out at 80MB/s; gzip has been much slower than that whenever I've tried it. There are lighter-weight compression schemes, but I think you said you need gzip, so it's not worth going into them much.

> > Not sure about PG, but the C api pretty much always buffers
> > everything in memory first. There was mention of getting control
> > of this, but I've got no idea where it got.
>
> buffer *everything*?

As far as I know (i.e. this is how the C library works), the data behind your "$result" is copied from PG into memory on the machine that's running your PHP code before pg_query() returns. If you're asking for the whole of a table with a billion rows you're probably going to run out of memory and not get any useful work done.

When you use cursors, your PHP code is only dealing with, say, 1000 rows at a time and PG worries about where to keep the rest. Thus resource consumption is under your control with cursors; otherwise it's at the mercy of you writing sensible queries.

The motivating reason for this behaviour is the handling of errors in processing; the assumption is that your code shouldn't get a result set back until it's known that those results are complete and consistent. The easiest way to do that is to buffer them in memory until everything has come back from PG. E.g. what if the last row generated a divide-by-zero exception, or something caused the backend to die?

--
Sam  http://samason.me.uk/