Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
For the first assignment, you modified the storage layers of PostgreSQL , in particular the buffer manager. The focus of this assignment is on query processing. In order to implement a runtime optimization, you will have to modify hash-joins in PostgreSQL .
As in the previous assignment, you will be modifying PostgreSQL 7.4.13. The code for this assignment does not build upon the code of the previous one, and you will be modifying the source code as downloaded from the web.
PostgreSQL implements the hybrid hash-join algorithm. A short overview of this algorithm is provided below; for more details, you can to consult other resources.
During our discussion, we will assume that we are computing
R daR.A=S.B S (1)
-relation S will be referred as the inner(build) relation,and
-relation R will be referred as the outer(probe)
Suppose now that there is a way to filter out tuples from the probe input which will not join with any build tuple. For such filtering to be useful, it should be performed during the first pass (when a tuple is read for the first time). If we detect that a (probe) tuple will not join with any (build) tuple, we can drop that tuple right then.
Essentially, we want to be able to do membership tests on the join attribute of the probe input, in order to check whether there is a build tuple that might match with it. Convince yourselves that such a membership test
but it should not have false-negatives: a tuple that could join with another tuple is dropped (this should not happen).
For such a membership test to be effective, the false-positive rates have to be low.
The purpose of this assignment is to implement the optimization described above, by using bloom filters.
A Bloom ftlter is a probabilistic data structure that allows us to implement the membership test described in Section 2.1,
Bloom filters are extremely useful in practice, and are used in a wide variety of application domains.
-A brief description of the Bloom filters is provided below; however, you are encouraged to refer to other resources for a more detailed overview.
-ABloom filter for representing a set S = x1, x2, . . . , xn of n elements is described by an array of m bits, which are all initially set to 0.
-Thefilter uses k independent hash functions h1, . . . , hk within the range {1, . . . , m}.
-For eachelement x ∈S, we compute h1(x), . . . , hk(x), and set the corresponding positions in the bit array to 1.
Once we have done this operation for each element of S, we should have a bit array that acts as an approximate representation of the set.
-To check if y S, we check whether for each of the k hash functions, the position hi(y) is set to 1 in the bit If at least one of these positions is set to 0, it is clear that y / S. If however, all the positions are set to 1, we know that y may be a member of S with someprobability.
** (You may check the derivation of probabilities of error rates if you are interested).
In order to use Bloom filters to optimize the hash-join algorithm, we can consider the elements of the join attribute (of either relation) as the members of the set (S above); the details are indicated in the following:
For this assignment, the following assumptions are allowed:
In a more general case, when the join attributes are on general domains, your code may not work (it can even crash) under these assumptions. However, this should not be a major issue if your evaluation is on tables containing only integers.
-This file contains the hash-join source
2.src/backend/executor/nodeHash.c
-This file contains the hash operation source
3.src/include/nodes/execnodes.h
src/include/nodes/plannodes.h
-Theseheader files contain the data structures which maintain the state of an You may need to modify them.
4.src/include/executor/nodeHash.h
src/include/executor/nodeHashjoin.h
These header files contain function definitions for the hash/hash-join
5.src/backend/executor/execProcnode.c
Thisfile contains the source code used by PostgreSQL to control the tuple flow across
You should be able to complete this assignment by modifying only the files from (1) to (4). Another file you might need as a reference is the following:
6. src/include/postgres.h
-This file contains definitions of the PostgreSQL type system. Everything in PostgreSQL is of the type “Datum”. From this data type it is possible to extract the usual C-representation of the data (in this assignment you will be extractingintegers).
Note that this assignment does not build upon the first assignment. You need to change the code of regular unmodified PostgreSQL .