Implementation Proposal For Add Free Behind Capability For Large Sequential Scan - Mailing list pgsql-hackers

From Amit Kumar Khare
Subject Implementation Proposal For Add Free Behind Capability For Large Sequential Scan
Date
Msg-id 20020223133036.93175.qmail@web10104.mail.yahoo.com
List pgsql-hackers

Hi All,

 

Here is my implementation idea for the TODO item titled "Add Free-Behind Capability For Large Sequential Scans". I cordially invite your comments.

 

Part I:

 

Problem: The present default page replacement policy of PostgreSQL is Least Recently Used (LRU). This policy works well for many database access patterns, but it is inefficient for large sequential scans, since it renders the cache useless. For example:

Let the shared buffer size be 1 MB and the size of the relation 2 MB, and assume a sequential scan is applied to this relation over and over again. The cache becomes useless because, by the time the second read of the table happens, the first 1 MB has already been forced out of the cache.

 

The problem appeared in the TODO list of PostgreSQL under section CACHE titled:

Add Free-Behind Capability For Large Sequential Scans

 

Solution: In his paper "Operating System Support for Database Management", Professor Michael Stonebraker examines various database access patterns in INGRES (the grandfather of PostgreSQL) and suggests that for sequential access to blocks that will be cyclically re-referenced, the buffer management strategy should be to "toss immediately", unless all n pages of the relation can be kept in the cache. He further suggests that for buffer management to be of use to an application like a DBMS, there should be some provision for it to accept "advice" from the application concerning the replacement strategy.

 

Taking these suggestions as our baseline, we have decided to implement a Most Recently Used (MRU) cache replacement policy. Our task will be to identify a large sequential scan and then advise the buffer manager to apply MRU for cache replacement.

 

Detailed Plan of Implementation:

 

(I) Identifying the query execution flow

 

Goals of identifying the query execution flow: (i) We have to find the spot at which we can recognize that a block access is actually part of a large sequential scan; from that point onward we intend to send advice (in the form of a parameter) to the buffer manager.

(ii) Through this query execution trace our aim is to reach one of the files listed below, named "freelist.c". This is the file where all the functions related to the replacement policy are implemented. Hence our trace ends on reaching "freelist.c".

 

 

(1) Files Involved (important files listed):

 

(1) bin/psql/mainloop.c
(2) bin/psql/common.c
(3) src/interfaces/libpq/fe-exec.c
(4) src/backend/tcop/postgres.c
(5) src/backend/tcop/pquery.c
(6) src/backend/executor/execMain.c
(7) src/backend/executor/execProcnode.c
(8) src/backend/access/heap/heapam.c
(9) src/backend/utils/cache/relcache.c
(10) src/backend/utils/hash/dynahash.c
(11) src/backend/storage/smgr/smgr.c
(12) src/include/utils/rel.h
(13) src/backend/access/common/scankey.c
(14) src/backend/storage/buffer/bufmgr.c
(15) src/backend/storage/buffer/buf_table.c
(16) src/backend/storage/buffer/freelist.c
(17) src/backend/executor/nodeSeqscan.c
(18) src/backend/executor/execScan.c
(19) src/backend/executor/execTuples.c

 

 

(2) Function Call Trace:

 

 

Notation convention: (i) the arrow "->" represents a call; (ii) functions are listed as <FileName>.<FunctionName>, e.g. common.SendQuery means the SendQuery function defined in the file common.c.

 

(i) Query submitted to psql

 

mainloop.MainLoop -> common.SendQuery

common.SendQuery -> fe-exec.PQexec

 

- SendQuery sends the query string to the backend and is a "front door" way to send a query.

- PQexec sends the query to the backend and packages up the results.

 

(ii) Query execution in the "postgres" backend process

 

postgres.pg_exec_query_string -> pquery.ProcessQuery

pquery.ProcessQuery -> execMain.ExecutorStart

 

(This routine is called at the beginning of any execution of any query plan.)

pquery.ProcessQuery -> execMain.ExecutorRun

(This is the main routine of the executor module. It accepts the query descriptor from the traffic cop and executes the query plan.)

pquery.ProcessQuery -> execMain.ExecutorEnd

(This routine must be called at the end of execution of any query plan.)

 

 

We will follow the execution trace from execMain.ExecutorRun.

 

execMain.ExecutorRun -> execMain.ExecutePlan -> execProcnode.ExecProcNode -> nodeSeqscan.ExecSeqScan -> execScan.ExecScan

 

ExecSeqScan scans the relation sequentially and returns the next qualifying tuple. It passes SeqNext as the access method to ExecScan as one of the parameters.

 

SeqNext is a function defined in the file nodeSeqscan.c; it is called through a function pointer (slot = (*accessMtd)(node)) in an infinite for loop, until the slot it returns contains NULL.

 

execScan.ExecScan -> nodeSeqscan.SeqNext -> heapam.heap_getnext -> heapam.heapgettup

 

heapgettup returns immediately if the relation is empty, after making a call to bufmgr.ReleaseBuffer; otherwise bufmgr.ReleaseAndReadBuffer is called.

 

bufmgr.ReleaseAndReadBuffer -> bufmgr.ReadBufferInternal

 

ReadBufferInternal calls smgr.smgrnblocks to get the accurate number of POSTGRES blocks in the supplied relation if the caller asks for P_NEW.

 

bufmgr.ReadBufferInternal -> bufmgr.BufferAlloc

 

BufferAlloc creates a new buffer tag and searches for it in the buffer pool (buf_table.BufTableLookup) to see whether it is already present. If it is found, BufferAlloc pins the buffer and starts the buffer I/O (bufmgr.StartBufferIO).

 

If it is not found in the buffer pool, BufferAlloc initializes a new buffer, grabbing one from the free list by calling freelist.GetFreeBuffer, which returns NULL if all the buffers are already pinned.

 

BufferAlloc at times calls freelist.UnpinBuffer, for example if somebody pins the buffer while the current transaction is doing I/O: it clears the buffer's I/O flags, removes its pin, and starts all over again (see the comments and code in bufmgr.c from line 456 onward).

 

bufmgr.BufferAlloc -> freelist.UnpinBuffer -> freelist.AddBufferToFreelist

 

 

(3) Analysis of the Execution Trace:

 

(i) freelist.AddBufferToFreelist is the right place to implement any buffer replacement policy. The comment on this function clearly states:

 

In theory, this is the only routine that needs to be changed if the buffer replacement strategy changes. Just change the manner in which buffers are added to the freelist queue. Currently, they are added on an LRU basis.

 

(ii) We feel bufmgr.ReadBufferInternal is the right place to guess whether the buffer access is part of a large sequential scan. We say this because:

 

(a) We get a pointer (reln) to the Relation as one of the parameters, through which we can get the size of the relation (size = reln->rd_nblocks).

 

(b) Secondly, if the blockNum parameter is equal to P_NEW (which is InvalidBlockNumber, a constant defined in block.h meaning "grow the file to get a new page"), ReadBufferInternal calls smgr.smgrnblocks to get the accurate file length.

 

Hence at this point we have accurate knowledge of the size of the relation being scanned.

 

(iii) An external variable NBuffers, declared in bufmgr.h and defined in globals.c, contains the size of the shared buffer pool.

 

(iv) Hence in this function we can guess whether the scan is a large sequential one by comparing the relation size with the shared buffer size: if the relation is larger than the shared buffer pool, it is clearly a large sequential scan.

 

(4) First Probable Implementation Details:

 

(i) We are thinking of declaring constants in bufmgr.h as follows:

#define LRU 1
#define MRU 2

(ii) These constants can be passed further down to freelist.AddBufferToFreelist. In this function, after a check, the appropriate policy can be applied.

(iii) Functions affected by the implementation: nearly all the functions in freelist.c will be changed to implement MRU; in addition, bufmgr.ReadBufferInternal and bufmgr.BufferAlloc are the most likely to change.

 

Second Probable Implementation Details:

 

We assume that the LRU "SharedFreeList" circular queue keeps the least recently used buffers at the head of the queue and the most recently used buffers at the tail. If this is the case, there is no need to change AddBufferToFreelist in freelist.c at all: let it add buffers to the shared free list as it presently does. We just change freelist.GetFreeBuffer so that, for the MRU scheme, it removes (MRU) buffers from the tail of the queue instead of removing (LRU) buffers from the head. (We are not sure whether GetFreeBuffer is the only function that hands out buffers from the free list; if there is some other function too, we have yet to trace it out.)

 

Part II:

 

Testing Objective:

 

(i) First, we have to test the performance of the new MRU algorithm that we intend to implement, and then decide whether we need sequential-scan-specific code at all.

(ii) Second, we would like to compare the performance of MRU with that of the LRU-K algorithm, and more specifically LRU-2.

 

A brief introduction to LRU-K:

 

LRU-K is an approach to database disk buffering proposed by Elizabeth J. O'Neil, Patrick E. O'Neil, and Gerhard Weikum. The basic idea of LRU-K is to keep track of the times of the last K references to popular database pages, using this information to statistically estimate the interarrival times of references on a page-by-page basis. The algorithm is self-tuning and adapts to evolving access patterns.

 

A patch for LRU-2 was provided to us by Mr. Bruce Momjian; the algorithm was implemented and tested by Mr. Tom Lane. Both are from the PostgreSQL development group. Mr. Lane's tests were not conclusive, however, and did not include any sequential scans, so this would be an interesting exercise.

 

Testing Plan:

 

(i) We plan to run a big sequential scan and, at the same time, a small sequential scan. What should happen is that the small sequential scan keeps its pages in the buffer while the large scan is in progress; if it doesn't, we have a problem.

(ii) We have populated two tables, simple and simple1, each with a single integer field, with 8192 values each. We have to make one of the tables bigger still, so that one can be used for the large sequential scan and the other for the small sequential scan.

(iii) Then we plan to put some simple queries like those below in a file, say "samplequeries.txt":

 

SELECT * FROM simple WHERE x = 98765432;

SELECT * FROM simple1 WHERE y = 567890432;

 

Here, assume simple is the large table and simple1 the smaller one. The large table should be larger than the shared buffer size, and the smaller table should be, say, 10% smaller than the shared buffer size.

 

(iv) Then we have to run psql with the -f flag, giving the file as a parameter, and keep track of the time while doing so.

 

(v) Another way is to set the various statistics flags true, such as parser statistics, planner statistics, executor statistics, etc. All the statistics get collected into a logfile. The following is a sample of the statistics generated during our test run on the default LRU algorithm:

 

2002-02-22 18:40:58 DEBUG:  EXECUTOR STATISTICS
! system usage stats:
!       0.050108 elapsed 0.000000 user 0.000000 system sec
!       [0.050000 user 0.130000 sys total]
!       4/0 [50/0] filesystem blocks in/out
!       4/0 [50/0] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [2/5] messages rcvd/sent
!       4/0 [149/17] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:        3 read,        0 written, buffer hit rate = 25.00%
!       Local blocks:         0 read,        0 written, buffer hit rate = 0.00%
!       Direct blocks:        0 read,        0 written

 

(vi) Another way is to run the query with the EXPLAIN command in VERBOSE and ANALYZE mode. This prints the total execution time of the query along with other metrics such as the startup cost of the sequential scan, the total cost, etc.

 

 

Yardstick of Measurement:

 

(i) We will run the test queries on the MRU and LRU-2 implementations. If the execution time for MRU is less than that of LRU-2, we will have to consider having specific code for sequential scans.

 

(ii) Otherwise, if the execution time of LRU-2 is less than or close to the MRU execution time, we would suggest replacing the default LRU code of PostgreSQL with the LRU-2 algorithm: then there would be only one algorithm for all access patterns, no need to "advise" the buffer manager about the replacement policy, and no need for a separate algorithm for large sequential access.

 

Conclusion:

 

To conclude, fetching data from disk takes at least a factor of 1000 more time than fetching data from a RAM buffer. For this reason, good use of the buffer can significantly improve the throughput and response time of any data-intensive system. Equally important is choosing an efficient buffer replacement algorithm. Thus we aim at improving the performance of PostgreSQL by studying which of MRU and LRU-2 will prove the better bet.

 

We thank Professor Kevin Chang and the TAs for introducing us to such a lifetime opportunity and for encouraging us to enhance PostgreSQL.

 

We thank Mr. Bruce Momjian for all the timely help and advice that he has rendered to us wholeheartedly. It is his encouragement that has directed our efforts towards really contributing to PostgreSQL development.

Amit Kumar Khare

 

 

 



