PG 7.1.2 Crash: cannot read xlog dir - Mailing list pgsql-admin
From | kay |
---|---|
Subject | PG 7.1.2 Crash: cannot read xlog dir |
Msg-id | NGBBKFMOILMAGDABPFEGEEADENAA.efesar@nmia.com |
Responses | Re: PG 7.1.2 Crash: cannot read xlog dir |
List | pgsql-admin |
My situation: PG 7.1.2, Redhat 7.2, running in a chroot jail on a "VDS" server at my new ISP. I can't recompile anything and can't upgrade PG (basically, I'm stuck with 7.1.2).

This issue was previously noted in a thread in late 2002. The actual thread in which Tom Lane suggests it might be a permissions issue is missing from the archive, but I found it in Google's cache (for two Webcrawler docs: http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%22cannot+read+xlog+dir%22+&btnG=Google+Search ). As to why they aren't on archives.postgresql.org ... ya got me.

I changed permissions to the most permissive setting I know (0777), plus I own the directory, I own the files, and I own the postmaster process, so the only thing I can think is that 'readdir' is badly linked or has some freaky kernel interaction. I have Python, perl and PHP on the system, and they all use 'opendir', 'readdir' and 'closedir' just fine on the pg_xlog directory.

My problem: I've deduced that the 'readdir' call is broken in my PG. I examined the source code for 7.1 very thoroughly ( http://developer.postgresql.org/cvsweb.cgi/pgsql-server/src/backend/access/transam/xlog.c?rev=1.65.2.1&content-type=text/x-cvsweb-markup&only_with_tag=REL7_1_STABLE see MoveOfflineLogs). What I've found is that 'opendir' seems to open the directory fine (it does not return NULL), but when 'readdir' tries to grab a filename, something bombs with the file system error 'No such file or directory': 'readdir' returns NULL and 'errno' gets set. The strange thing is that it gets in there ONCE and does ONE file (0000000000000000), and then it won't do any more, ever again, until I stop the server and run initdb again. At this point I know that there's nothing wrong with the XLOG directory or the files in it, because PG has been writing transactions fine for 7-8 hours up to this point. It can only be a bad 'readdir' call.
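The directory walk described above can be tried outside the server. What follows is only a minimal sketch of the same opendir/readdir/errno pattern in a standalone C file, assuming a compiler is available somewhere with the same libc. The function names `is_offline_segment` and `count_offline_segments` are hypothetical helpers made up for illustration and are not part of PostgreSQL. Nothing is unlinked here; the scan is read-only.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: the same name filter MoveOfflineLogs applies.
 * A name matches if it is 16 upper-case hex characters and sorts at or
 * before the last offline segment name. */
static int is_offline_segment(const char *name, const char *lastoff)
{
    assert(name != NULL && lastoff != NULL);
    return strlen(name) == 16 &&
           strspn(name, "0123456789ABCDEF") == 16 &&
           strcmp(name, lastoff) <= 0;
}

/* Hypothetical helper: walk a directory read-only.
 * Returns the number of matching entries, or -1 with errno set if
 * opendir or readdir fails. */
static int count_offline_segments(const char *dir, const char *lastoff)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    int count = 0;
    int saved;

    if (d == NULL)
        return -1;

    errno = 0;
    while ((de = readdir(d)) != NULL)
    {
        if (is_offline_segment(de->d_name, lastoff))
            count++;
        errno = 0;
    }
    saved = errno;              /* nonzero only if readdir itself failed */
    closedir(d);

    if (saved != 0)
    {
        errno = saved;
        return -1;
    }
    return count;
}
```

Calling `count_offline_segments` on the pg_xlog path and printing `strerror(errno)` when it returns -1 would show whether a plain read-only scan fails on its own or whether the failure only appears once files are being removed.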
My question: Is there some runtime setting I can use to prevent MoveOfflineLogs() from ever being called? I would MUCH rather have a couple of old XLOGs lying around than a fatal crash. Maybe by CHECKPOINTing every hour or something ... I've tried playing with a bunch of different WAL settings and ... I can't stop MoveOfflineLogs from being called.

Please keep in mind my hands are tied, and I can't recompile and I can't upgrade. Even if I could upgrade, I imagine that 'readdir' would still be broken, and I'd still have this issue.

If anybody can think of a workaround I'd really appreciate it. I've been racking my brain on this for a week.

Thanks
-Keith

==================

Here's the log.

/usr/local/pgsql/bin/postmaster: reaping dead processes...
/usr/local/pgsql/bin/postmaster: CleanupProc: pid 24626 exited with status 0
XLogFlush: rqst 0/12259528; wrt 0/0; flsh 0/0
XLogFlush: rqst 0/17078212; wrt 0/17078248; flsh 0/17078248
XLogFlush: rqst 0/17078152; wrt 0/17078248; flsh 0/17078248
XLogFlush: rqst 0/0; wrt 0/17078248; flsh 0/17078248
INSERT @ 0/17078248: prev 0/17078212; xprev 0/0; xid 0: XLOG - checkpoint: redo 0/17078248; undo 0/0; sui 28; xid 3495; oid 36603; online
XLogFlush: rqst 0/17078312; wrt 0/17078248; flsh 0/17078248
DEBUG: MoveOfflineLogs: remove 0000000000000000
FATAL 2: MoveOfflineLogs: cannot read xlog dir: No such file or directory
DEBUG: proc_exit(2)
DEBUG: shmem_exit(2)
DEBUG: exit(2)
/usr/local/pgsql/bin/postmaster: reaping dead processes...
/usr/local/pgsql/bin/postmaster: CleanupProc: pid 24736 exited with status 512
Server process (pid 24736) exited with status 512 at Sat May 31 09:57:57 2003
Terminating any active server processes...
Server processes were terminated at Sat May 31 09:57:57 2003
Reinitializing shared memory and semaphores
invoking IpcMemoryCreate(size=1236992)
DEBUG: database system was interrupted at 2003-05-31 09:57:57 EDT
DEBUG: CheckPoint record at (0, 17078248)
DEBUG: Redo record at (0, 17078248); Undo record at (0, 0); Shutdown FALSE
DEBUG: NextTransactionId: 3495; NextOid: 36603
DEBUG: database system was not properly shut down; automatic recovery in progress...
DEBUG: ReadRecord: record with zero len at (0, 17078312)
DEBUG: redo is not required
INSERT @ 0/17078312: prev 0/17078248; xprev 0/0; xid 0: XLOG - checkpoint: redo 0/17078312; undo 0/0; sui 28; xid 3495; oid 36603; shutdown
XLogFlush: rqst 0/17078376; wrt 0/17078312; flsh 0/17078312
FATAL 2: MoveOfflineLogs: cannot read xlog dir: No such file or directory
DEBUG: proc_exit(2)
DEBUG: shmem_exit(2)
DEBUG: exit(2)

=========================

Here's the code from 7.1.

static void
MoveOfflineLogs(uint32 log, uint32 seg)
{
    DIR *xldir;
    struct dirent *xlde;
    char lastoff[32];
    char path[MAXPGPATH];

    Assert(XLOG_archive_dir[0] == 0); /* ! implemented yet */

    xldir = opendir(XLogDir);
    if (xldir == NULL)
        elog(STOP, "MoveOfflineLogs: cannot open xlog dir: %m");

    sprintf(lastoff, "%08X%08X", log, seg);

    errno = 0;
    while ((xlde = readdir(xldir)) != NULL)
    {
        if (strlen(xlde->d_name) == 16 &&
            strspn(xlde->d_name, "0123456789ABCDEF") == 16 &&
            strcmp(xlde->d_name, lastoff) <= 0)
        {
            elog(LOG, "MoveOfflineLogs: %s %s",
                 (XLOG_archive_dir[0]) ? "archive" : "remove",
                 xlde->d_name);
            sprintf(path, "%s%c%s", XLogDir, SEP_CHAR, xlde->d_name);
            if (XLOG_archive_dir[0] == 0)
                unlink(path);
        }
        errno = 0;
    }
    if (errno)
        elog(STOP, "MoveOfflineLogs: cannot read xlog dir: %m");
    closedir(xldir);
}
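In the log, the FATAL comes right after the first "remove" line, so one thing that could be checked outside the server is whether 'readdir' keeps working after an 'unlink' in the directory being read. The sketch below mirrors the loop above in standalone C and returns the errno left by 'readdir' instead of calling elog(STOP). It is an illustration under assumptions, not PostgreSQL code: the helper names (`make_scratch_dir`, `make_empty_file`, `scan_and_unlink`) and the `/tmp/xlogscan-<pid>` scratch path are hypothetical, and it should only ever be pointed at a scratch directory, never at a live pg_xlog.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a private scratch directory under /tmp
 * (the location is an assumption) and write its path into buf.
 * Returns 0 on success, -1 on failure. */
static int make_scratch_dir(char *buf, size_t len)
{
    int n = snprintf(buf, len, "/tmp/xlogscan-%ld", (long) getpid());

    if (n < 0 || (size_t) n >= len)
        return -1;
    return mkdir(buf, 0700);
}

/* Hypothetical helper: create an empty file dir/name.
 * Returns 0 on success, -1 on failure. */
static int make_empty_file(const char *dir, const char *name)
{
    char path[1024];
    FILE *f;
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);

    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    return fclose(f) == 0 ? 0 : -1;
}

/* Unlink matching entries while the directory is still being read,
 * as the MoveOfflineLogs loop does. Returns 0 if readdir reached the
 * end of the directory cleanly, otherwise the errno from opendir or
 * readdir. *removed receives the number of files unlinked. */
static int scan_and_unlink(const char *dir, const char *lastoff, int *removed)
{
    DIR *d;
    struct dirent *de;
    char path[1024];
    int err;

    assert(removed != NULL);
    *removed = 0;

    d = opendir(dir);
    if (d == NULL)
        return errno;

    errno = 0;
    while ((de = readdir(d)) != NULL)
    {
        if (strlen(de->d_name) == 16 &&
            strspn(de->d_name, "0123456789ABCDEF") == 16 &&
            strcmp(de->d_name, lastoff) <= 0)
        {
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
            if (unlink(path) == 0)
                (*removed)++;
        }
        errno = 0;              /* discard errno left by unlink, as the original does */
    }
    err = errno;                /* nonzero only if readdir itself failed */
    closedir(d);
    return err;
}
```

On an ordinary local filesystem `scan_and_unlink` should return 0. A nonzero return inside the jail, with the scratch directory on the same filesystem as pg_xlog, would suggest the problem lies in how 'readdir' behaves there after an 'unlink' rather than in permissions.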