Re: any solution for doing a data file import spawning it on multiple processes - Mailing list pgsql-general
From: Edson Richter
Subject: Re: any solution for doing a data file import spawning it on multiple processes
Date: 
Msg-id: BLU0-SMTP47A450268B2B3573457921CFFA0@phx.gbl
In response to: Re: any solution for doing a data file import spawning it on multiple processes ("hb@101-factory.eu" <hb@101-factory.eu>)
Responses: Re: any solution for doing a data file import spawning it on multiple processes
List: pgsql-general
On 16/06/2012 12:59, hb@101-factory.eu wrote:
> thanks, i thought about splitting the file, but that did not work out well.
>
> so we receive 2 files every 30 seconds and need to import them as fast as possible.
>
> we do not run java currently but maybe it's an option.
> are you willing to share your code?
>
> also i was thinking of using perl for it
>
>
> henk
>
> On 16 jun. 2012, at 17:37, Edson Richter <edsonrichter@hotmail.com> wrote:
>
>> On 16/06/2012 12:04, hb@101-factory.eu wrote:
>>> hi there,
>>>
>>> I am trying to import large data files into pg.
>>> for now i used the xargs linux command to spawn the file line by line and set and use the maximum available connections.
>>>
>>> we use pgpool as connection pool to the database, and so try to maximize the concurrent data import of the file.
>>>
>>> the problem for now is that it seems to work well but we miss a line once in a while, and that is not acceptable. also it creates zombies ;(
>>>
>>> does anybody have any other tricks that will do the job?
>>>
>>> thanks,
>>>
>>> Henk
>> I've used a custom Java application with connection pooling (limited to 1000 connections, meaning 1000 concurrent file imports).
>>
>> I'm able to import more than 64000 XML files (about 13Kb each) in 5 minutes, without memory leaks or zombies, and (of course) no missing records.
>>
>> Besides having each thread import a separate file, I have another situation where separate threads import different lines of the same file. No problems at all. Do not forget to check your OS "open files" limit (it was a big issue in the past for me due to Lucene indexes generated during import).
>>
>> Server: 8 core Xeon, 16Gig, SAS 15000 rpm disks, PgSQL 9.1.3, Linux CentOS 5, Sun Java 1.6.27.
>>
>> Regards,
>>
>> Edson Richter
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general

I'm not allowed to publish my company's code, but the logic is very easy to understand (you will have to "invent" your own solution; the code below is bare bones):

import java.io.File;
import java.io.FileFilter;

class MainThread implements Runnable {

    // volatile so the change made by stopWorker() is visible to run()
    private volatile boolean keepRunning = true;

    public void run() {
        while (keepRunning) {
            try {
                executeFiles();
                Thread.sleep(30000); // sleep 30 seconds
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

    private void executeFiles() {
        File monitorDir = new File("/var/mydatafolder/");
        File processingDir = new File("/var/myprocessingfolder/");

        // I'll import only files with names like "data20120621.csv":
        FileFilter fileFilter = new FileFilter() {
            public boolean accept(File file) {
                boolean isfile = file.isFile() && !file.isHidden() && !file.isDirectory();
                if (!isfile)
                    return false;
                String fname = file.getName();
                return fname.startsWith("data") && fname.endsWith("csv");
            }
        };

        // listFiles() returns an array (or null if the directory is missing)
        File[] forProcessing = monitorDir.listFiles(fileFilter);
        if (forProcessing == null)
            return;

        for (File fileFound : forProcessing) {
            // FileUtil is a utility class, you will have to create your own...
            // your move method will vary according to your operating system
            FileUtil.move(fileFound, processingDir);

            // ProcessFile is a class that implements Runnable; do your stuff there...
            Thread t = new Thread(new ProcessFile(processingDir, fileFound.getName()));
            t.start();
        }
    }

    /** Use this method to stop the thread from another place in your complex system! */
    public synchronized void stopWorker() {
        keepRunning = false;
    }

    public static void main(String[] args) {
        Thread t = new Thread(new MainThread());
        t.start();
    }
}
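The ProcessFile worker referenced above is deliberately left to the reader. As a rough illustration only — the table name (mytable), two-column CSV layout, and connection URL below are assumptions, not details from the post — such a worker might read one file and batch-insert it over plain JDBC in a single transaction:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical worker sketch: table, columns, and credentials are
// illustrative placeholders, not taken from the original message.
class ProcessFile implements Runnable {

    private final File file;

    public ProcessFile(File dir, String fileName) {
        this.file = new File(dir, fileName);
    }

    public void run() {
        Connection con = null;
        try {
            Class.forName("org.postgresql.Driver"); // not needed with JDBC4 drivers
            con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
            con.setAutoCommit(false); // one transaction per file

            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO mytable (col1, col2) VALUES (?, ?)");
            BufferedReader in = new BufferedReader(new FileReader(file));
            try {
                String line;
                int count = 0;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split(",", -1); // assumes two comma-separated columns
                    ps.setString(1, fields[0]);
                    ps.setString(2, fields[1]);
                    ps.addBatch();
                    if (++count % 1000 == 0)
                        ps.executeBatch(); // flush every 1000 rows
                }
                ps.executeBatch();
                con.commit(); // all-or-nothing: no partially imported file
            } finally {
                in.close();
                ps.close();
            }
        } catch (Exception ex) {
            ex.printStackTrace();
            try {
                if (con != null) con.rollback(); // leave nothing half-imported
            } catch (Exception ignore) {
            }
        } finally {
            try {
                if (con != null) con.close();
            } catch (Exception ignore) {
            }
        }
    }
}

Committing once per file makes each import all-or-nothing, so a failed file can be re-run whole instead of losing individual lines, which is the failure mode described in the original question.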