Re: any solution for doing a data file import spawning it on multiple processes - Mailing list pgsql-general
From: Edson Richter
Subject: Re: any solution for doing a data file import spawning it on multiple processes
Date: 
Msg-id: BLU0-SMTP47A450268B2B3573457921CFFA0@phx.gbl
In response to: Re: any solution for doing a data file import spawning it on multiple processes ("hb@101-factory.eu" <hb@101-factory.eu>)
Responses: Re: any solution for doing a data file import spawning it on multiple processes
List: pgsql-general
On 16/06/2012 12:59, hb@101-factory.eu wrote:
> thanks, i thought about splitting the file, but that did not work out well.
>
> so we receive 2 files every 30 seconds and need to import them as fast as possible.
>
> we do not run java currently but maybe it's an option.
> are you willing to share your code?
>
> also i was thinking of using perl for it
>
>
> henk
>
> On 16 jun. 2012, at 17:37, Edson Richter <edsonrichter@hotmail.com> wrote:
>
>> On 16/06/2012 12:04, hb@101-factory.eu wrote:
>>> hi there,
>>>
>>> I am trying to import large data files into pg.
>>> for now i used the xargs linux command to spawn the file line by line and set and use the maximum available connections.
>>>
>>> we use pgpool as connection pool to the database, and so try to maximize the concurrent data import of the file.
>>>
>>> the problem for now is that it seems to work well but we miss a line once in a while, and that is not acceptable. also it creates zombies ;(
>>>
>>> does anybody have any other tricks that will do the job?
>>>
>>> thanks,
>>>
>>> Henk
>> I've used a custom Java application with connection pooling (limited to 1000 connections, meaning 1000 concurrent file imports).
>>
>> I'm able to import more than 64000 XML files (about 13Kb each) in 5 minutes, without memory leaks or zombies, and (of course) no missing records.
>>
>> Besides having each thread import a separate file, I have another situation where separate threads import different lines of the same file. No problems at all. Do not forget to check your OS "open files" limit (it was a big issue in the past for me due to Lucene indexes generated during import).
>>
>> Server: 8 core Xeon, 16Gig, SAS 15000 rpm disks, PgSQL 9.1.3, Linux CentOS 5, Sun Java 1.6.27.
>>
>> Regards,
>>
>> Edson Richter
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general

I'm not allowed to publish my company's code, but the logic is very easy to understand (you will have to "invent" your own solution; the code below is bare bones):

import java.io.File;
import java.io.FileFilter;

class MainThread implements Runnable {

    // volatile so the change made by stopWorker() is visible to run()
    private volatile boolean keepRunning = true;

    public void run() {
        while (keepRunning) {
            try {
                executeFiles();
                Thread.sleep(30000); // sleep 30 seconds
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

    private void executeFiles() {
        File monitorDir = new File("/var/mydatafolder/");
        File processingDir = new File("/var/myprocessingfolder/");

        // I'll import only files with names like "data20120621.csv":
        FileFilter fileFilter = new FileFilter() {
            public boolean accept(File file) {
                boolean isfile = file.isFile() && !file.isHidden() && !file.isDirectory();
                if (!isfile)
                    return false;
                String fname = file.getName();
                return fname.startsWith("data") && fname.endsWith("csv");
            }
        };

        // listFiles() returns an array (or null if the directory is missing)
        File[] forProcessing = monitorDir.listFiles(fileFilter);
        if (forProcessing == null)
            return;

        for (File fileFound : forProcessing) {
            // FileUtil is a utility class, you will have to create your own...
            // your move method will vary according to your operating system
            FileUtil.move(fileFound, processingDir);

            // ProcessFile is a class that implements Runnable; do your stuff there...
            Thread t = new Thread(new ProcessFile(processingDir, fileFound.getName()));
            t.start();
        }
    }

    /** Use this method to stop the thread from another place in your complex system! */
    public synchronized void stopWorker() {
        keepRunning = false;
    }

    public static void main(String[] args) {
        Thread t = new Thread(new MainThread());
        t.start();
    }
}
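The ProcessFile worker referenced above is deliberately left to the reader. As a rough illustration only — the table name (mytable), two-column CSV layout, and connection URL below are assumptions, not details from the post — such a worker might read one file and batch-insert it over plain JDBC in a single transaction:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical worker sketch: table, columns, and credentials are
// illustrative placeholders, not taken from the original message.
class ProcessFile implements Runnable {

    private final File file;

    public ProcessFile(File dir, String fileName) {
        this.file = new File(dir, fileName);
    }

    public void run() {
        Connection con = null;
        try {
            Class.forName("org.postgresql.Driver"); // not needed with JDBC4 drivers
            con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
            con.setAutoCommit(false); // one transaction per file

            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO mytable (col1, col2) VALUES (?, ?)");
            BufferedReader in = new BufferedReader(new FileReader(file));
            try {
                String line;
                int count = 0;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split(",", -1); // assumes two comma-separated columns
                    ps.setString(1, fields[0]);
                    ps.setString(2, fields[1]);
                    ps.addBatch();
                    if (++count % 1000 == 0)
                        ps.executeBatch(); // flush every 1000 rows
                }
                ps.executeBatch();
                con.commit(); // all-or-nothing: no partially imported file
            } finally {
                in.close();
                ps.close();
            }
        } catch (Exception ex) {
            ex.printStackTrace();
            try {
                if (con != null) con.rollback(); // leave nothing half-imported
            } catch (Exception ignore) {
            }
        } finally {
            try {
                if (con != null) con.close();
            } catch (Exception ignore) {
            }
        }
    }
}

Committing once per file makes each import all-or-nothing, so a failed file can be re-run whole instead of losing individual lines, which is the failure mode described in the original question.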