Sxooter and I are working on a script to adaptively partition a database so that we create as few partitions as possible, yet enough that no partition is larger than some size (e.g., 1MB). This will supposedly be useful for databases (like one that I have) where data is clustered quite unevenly across some range.

The example table I'm using has 3 fields:

CREATE TABLE my_points (
    id integer NOT NULL,
    fx real,
    fy real
);

Think of each record as a point on a plane. Although a few scattered points are way out on the edges, most points are clustered together somewhere in the middle. As you can imagine, partitioning the entire range of x and y values into equally sized sections would produce a bunch of empty partitions and a few with a ton of records in them. Unfortunately the data is distributed in a totally arbitrary way, so I cannot parametrize my partitions up front.

I've written a script which creates an m by n matrix. I then rip through my data table and count up how many records fall into each 'cell' of the matrix. In addition to the count of records it contains, each cell in the matrix can be characterized and uniquely identified in a number of ways:
1) by its minimum and maximum x values and its minimum and maximum y values;
2) by a unique 'number' assigned by counting the cell at x=0, y=0 as #1, the one at x=1, y=0 as #2, and so on (the arithmetic for this is sketched just after this list);
3) by its x and y coordinates.
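
Just for concreteness, here is the arithmetic behind numbering scheme 2. cell_number() and cell_coords() are throwaway names, and $x_count is the number of columns, as in the config section of the script further down.

<?php
// Numbering scheme 2: cells are counted row by row, so the cell at
// x=0, y=0 is #1, the one at x=1, y=0 is #2, and so on.
function cell_number($x, $y, $x_count) {
    return $y * $x_count + $x + 1;
}

function cell_coords($number, $x_count) {
    $i = $number - 1;
    return array('x' => $i % $x_count, 'y' => (int) floor($i / $x_count));
}
?>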

At any rate, I know how to convert these cells into db partitions, and I know how to further sub-partition the ones that are too crowded. What I still need is a good algorithm to merge cells that are empty or hold very little data into larger cells; any tips would be appreciated. For what it's worth, the database partitions generally restrict cell mergers to rectangular areas, and I've put a rough greedy sketch of my current thinking right after the matrix code below, in case it helps frame the question.

here's my matrix-forming code:

<?php
define('DB_ENGINE', 'postgresql');
$path_to_web_root = '../';
include($path_to_web_root . '_top.php');

set_time_limit(0);

// == config section
$headroom = 0.0001; // margin subtracted from the min and added to the max so the extreme values fall inside the grid
$precision = 4; // number of decimal places used for display of values
$x_count = 100; // divisions in the x direction
$x_field = 'fy'; // the db field I assign to my x axis
$y_count = 100; // divisions in the y direction
$y_field = 'fx'; // the db field I assign to my y axis
$cell_size = 5; // the size of each table cell in pixels
$table_name = 'sxooter_float';
// == end config section


$sql = "SELECT MIN(" . $x_field . ") AS x_min, MIN(" . $y_field . ") AS y_min, MAX(" . $x_field . ") as x_max, MAX(" . $y_field . ") as y_max FROM " . $table_name . ";";
$start = get_microtime();
$result = $db->query($sql)
  or die ('min/max query failed');
echo "min/max query elapsed time:" . (get_microtime() - $start) . " seconds<br>";
$row = $db->fetchrow($result);
echo 'actual x_min:' . $row['x_min'] . '<br>';
echo 'actual x_max:' . $row['x_max'] . '<br>';
echo 'actual y_min:' . $row['y_min'] . '<br>';
echo 'actual y_max:' . $row['y_max'] . '<br>';

$x_min = $row['x_min'] - $headroom;
$x_max = $row['x_max'] + $headroom;
$x_step = ($x_max - $x_min) / $x_count;

$y_min = $row['y_min'] - $headroom;
$y_max = $row['y_max'] + $headroom;
$y_step = ($y_max - $y_min) / $y_count;

// init the partition array
$matrix = array_fill(0, $x_count, array_fill(0, $y_count, 0));

$start = get_microtime();
$sql = "SELECT id, " . $x_field . ", " . $y_field . " FROM " . $table_name . ";";
$result = $db->query($sql)
  or die ('fetch-all query failed');
echo 'query to fetch all records time elapsed:' . (get_microtime() - $start) . '<br>';

$start = get_microtime();
$max_value = 0;
while($row = $db->fetchrow($result)) {
  $x = floor(($row[$x_field] - $x_min) / $x_step);
  $y = floor(($row[$y_field] - $y_min) / $y_step);

  $num = $matrix[$x][$y]; // read existing record count
  $num++; // increment it
  $max_value = ($num > $max_value) ? $num : $max_value; // check if it's the highest
  $matrix[$x][$y] = $num; // store the incremented value
} // foreach record retrieved
echo 'time elapsed processing records:' . (get_microtime() - $start) . '<br>';

// at this point, all records have been counted for each cell.
?>
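
And here is the rough greedy merge I mentioned above. It's only a sketch: greedy_merge() and $threshold are made-up names, the matrix layout matches the counting code above, and I don't yet know whether rectangles grown this way play nicely with the actual partition constraints. Each unclaimed cell seeds a rectangle that grows rightward and then downward for as long as the total count stays at or below the threshold.

<?php
// A rough greedy pass over the counted matrix: each unclaimed cell seeds a
// rectangle, which grows rightward and then downward for as long as the
// combined count stays at or below $threshold. Crowded single cells come
// out as 1x1 rectangles and get sub-partitioned separately.
// Returns a list of rectangles as array(x0, y0, x1, y1, count), inclusive.
function greedy_merge($matrix, $x_count, $y_count, $threshold) {
    $claimed = array_fill(0, $x_count, array_fill(0, $y_count, false));
    $rects = array();

    for ($y0 = 0; $y0 < $y_count; $y0++) {
        for ($x0 = 0; $x0 < $x_count; $x0++) {
            if ($claimed[$x0][$y0]) { continue; }

            // start as a 1x1 rectangle and grow rightward
            $x1 = $x0;
            $sum = $matrix[$x0][$y0];
            while ($x1 + 1 < $x_count && !$claimed[$x1 + 1][$y0]) {
                if ($sum + $matrix[$x1 + 1][$y0] > $threshold) { break; }
                $sum += $matrix[$x1 + 1][$y0];
                $x1++;
            }

            // then grow downward one full row at a time
            $y1 = $y0;
            while ($y1 + 1 < $y_count) {
                $row_sum = 0;
                $blocked = false;
                for ($x = $x0; $x <= $x1; $x++) {
                    if ($claimed[$x][$y1 + 1]) { $blocked = true; break; }
                    $row_sum += $matrix[$x][$y1 + 1];
                }
                if ($blocked || $sum + $row_sum > $threshold) { break; }
                $sum += $row_sum;
                $y1++;
            }

            // claim the cells and record the merged rectangle
            for ($x = $x0; $x <= $x1; $x++) {
                for ($y = $y0; $y <= $y1; $y++) {
                    $claimed[$x][$y] = true;
                }
            }
            $rects[] = array($x0, $y0, $x1, $y1, $sum);
        }
    }
    return $rects;
}
?>

Calling greedy_merge($matrix, $x_count, $y_count, $threshold) hands back a list of rectangles that together cover the whole grid; the per-cell min/max values then give the partition boundaries.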

    As an alternative to merging empty and underpopulated partitions, perhaps you could go the other way: start off with everything in one big partition, and then repeat the following:

    1. Find the largest partition (the one with the most records)

    2. If it's below the threshold, you're done

    3. Otherwise, find the median x or y value within the partition (which axis you use is up to you: you could strictly alternate, or pick whichever coordinate has the biggest spread between its max and min; obviously this generalises to n dimensions)

    4. Make two new partitions, one with all the records whose x/y values are <= the median, and one where they are all > the median

    5. Repeat from step 1.

    What I'm picturing here is comparable to the job of quantising a truecolour image to a limited palette - the difference there is that the cutoff comes when there are a certain number of partitions, while here it's when the partitions are all below a certain size.
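
    Purely as a sketch of what I mean (not tested, and it assumes you can afford to hold the points in memory): split_partition() and $threshold are made-up names, and the 'fx'/'fy' keys just mirror your table.

    <?php
    // Rough sketch of the recursive median split described above. $points is
    // an in-memory array of rows like array('fx' => ..., 'fy' => ...), and
    // $threshold is the maximum number of records allowed per partition.
    function split_partition($points, $threshold) {
        if (count($points) <= $threshold) {
            return array($points); // small enough, keep it as one partition
        }

        // pick the axis with the widest spread (strict alternation works too)
        $fx = array(); $fy = array();
        foreach ($points as $p) { $fx[] = $p['fx']; $fy[] = $p['fy']; }
        $axis = (max($fx) - min($fx)) >= (max($fy) - min($fy)) ? 'fx' : 'fy';

        // find the median value on that axis
        $values = ($axis == 'fx') ? $fx : $fy;
        sort($values);
        $median = $values[(int) (count($values) / 2)];

        // records <= median go low, the rest go high
        $low = array(); $high = array();
        foreach ($points as $p) {
            if ($p[$axis] <= $median) { $low[] = $p; } else { $high[] = $p; }
        }

        // guard against a degenerate split when many values are identical
        if (count($low) == 0 || count($high) == 0) {
            return array($points);
        }

        return array_merge(
            split_partition($low, $threshold),
            split_partition($high, $threshold)
        );
    }
    ?>

    split_partition($points, $threshold) returns an array of point sets, one per partition; the min/max of each set give you that partition's boundaries.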

    That's the result of my first 30 seconds' thought.

    Ten seconds later:
    That's good enough to be starting off with, but over time the partition boundaries can shift. Perhaps monitor the size of partitions as records are added and removed: if a partition tips over the threshold, split it in the manner above. When removing a record, compare the new size of the partition with the sizes of adjacent partitions; if the new size, combined with the size of one of the adjacent partitions, is below the threshold, merge them. To prevent too much fiddling about with frequent splits and merges, build in a bit of hysteresis: allow a partition to drop to, say, 75% of capacity before considering a merge, and to reach 125% before splitting.
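
    In code terms, one way to read those thresholds is nothing more than this (the 75%/125% factors are the ones above; the function names are placeholders):

    <?php
    // Hypothetical hysteresis checks run after inserting into or removing
    // from a partition. $count is the partition's new record count and
    // $threshold is the target maximum size per partition.
    function should_split($count, $threshold) {
        // only split once the partition is well past the target size
        return $count > 1.25 * $threshold;
    }

    function should_merge($count, $neighbour_count, $threshold) {
        // only consider a merge once this partition has shrunk well below
        // the target, and only if the merged result would still fit
        return $count < 0.75 * $threshold
            && ($count + $neighbour_count) <= $threshold;
    }
    ?>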

      I like this algorithm idea... my intuition tells me it would likely result in a highly effective partitioning scheme with relatively few iterations. The only drawback I see is that my table currently has 10 million records (and possibly more later), so every query against it is a big performance hit. That's exactly why it needs to be partitioned! My original thinking was to use a lot of RAM to get a bead on where the data lies.

      And ultimately we still have the task of merging partitions. I guess I'll go ahead and try to implement your suggestion and see what happens.

        And as I am writing code to implement your suggestion, it occurs to me that my previous approach might be useful for building a coarse 'dummy' version of the main table with much less detail, which could cut down the number of times we need to visit the db.
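
        For example (just a sketch, assuming a PostgreSQL version that has width_bucket(); the variable names are the ones from my matrix script above), the binning could be pushed into the db so only one row per non-empty cell comes back instead of all 10 million records:

        <?php
        // Let the database bin and count the points: width_bucket(v, low, high, n)
        // maps values in [low, high) onto buckets 1..n, so each row that comes
        // back is one non-empty cell plus its record count.
        $sql = "SELECT width_bucket(" . $x_field . ", " . $x_min . ", " . $x_max . ", " . $x_count . ") AS cx, "
             . "width_bucket(" . $y_field . ", " . $y_min . ", " . $y_max . ", " . $y_count . ") AS cy, "
             . "COUNT(*) AS n "
             . "FROM " . $table_name . " GROUP BY cx, cy;";

        $result = $db->query($sql)
          or die('binned count query failed');

        while ($row = $db->fetchrow($result)) {
            // width_bucket is 1-based, the matrix is 0-based; empty cells stay 0
            $matrix[$row['cx'] - 1][$row['cy'] - 1] = (int) $row['n'];
        }
        ?>

        If that works it would replace both the fetch-everything query and the per-record counting loop in my script.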
