Thread: multiple sampling from tables and saving output

multiple sampling from tables and saving output

From
David Orme
Date:
Hi,

I need to perform a statistical bootstrap on data held in a postgresql
database and I was wondering if anyone could recommend strategies.

The process I need to do is a loop of  1000 repetitions of the
following:

1) select a random subset of the data from a table
2) save various summaries of the randomly selected data

I can think of various external ways of doing this - my current plan is
to use a shell script to resend the same set of instructions repeated
times using 'psql -f instruction_set.sql'  - but I was wondering if
there was a canonical way of doing this within pgsql. I've had a bit of
a look at procedural languages but I'm not sure which is best for
handling this kind of process.

I've included an example of the set of instructions I want to repeat.

Cheers,
David Orme

[running psql 7.3.4 on RHEL 3]

-- Select 1096 species subsets
select grid_id, species_id
     into temp random_locs
     from possible_locations
    where species_id in (select rand_pick.species_id from
         (select species_id, random() as random_id
         from species
         order by random_id
         limit 1096) as rand_pick);

-- set up tab delimited unaligned
\a
\f '\t'

-- export summary table by grid_id
\o curr_gridid.txt
-- get a count by grid id including nulls
select grid_id, cnt from
     behr_grid left join (
     select grid_id, count(distinct(species_id)) as cnt
         from random_locs
         group by grid_id) as loc_count
     using (grid_id)
     order by grid_id;


Re: multiple sampling from tables and saving output

From
Tom Lane
Date:
David Orme <d.orme@imperial.ac.uk> writes:
> The process I need to do is a loop of  1000 repetitions of the
> following:

> 1) select a random subset of the data from a table
> 2) save various summaries of the randomly selected data

> I can think of various external ways of doing this - my current plan is
> to use a shell script to resend the same set of instructions repeated
> times using 'psql -f instruction_set.sql'  - but I was wondering if
> there was a canonical way of doing this within pgsql.

If you want a sample of, say, 1% of the rows in a table, you can do

    select * from mytable where random() < 0.01;

and get a genuinely unbiased sample.  Keep in mind though that you can't
get an exact sample size this way --- it'll be close to 1% but probably
not spot on.

            regards, tom lane