Sxooter and I are working on a script to adaptively partition a database so that we create as few partitions as possible, yet enough partitions that no partition is larger than some size (e.g., 1M records). This will supposedly be useful for databases (like one that I have) where data is clustered quite unevenly across some range.
The example table I'm using has 3 fields:
CREATE TABLE my_points (
id integer NOT NULL,
fx real,
fy real
);
Think of each record as a point on a plane. Although a few scattered points are way out on the edge, most points are clustered together in the middle somewhere. As you can imagine, partitioning the entire range of x and y values into equally sized sections is going to result in a bunch of empty partitions and then a few with a ton of records in them. Unfortunately the data is distributed in a totally random way, so I cannot parametrize my partitions ahead of time.
I've written a script which creates an m by n matrix. I then rip through my data table and count up how many records go into each 'cell' in my matrix. In addition to how many records it contains, each cell in the matrix can be characterized and uniquely identified in a number of ways (see the sketch after this list):
1) by its minimum and maximum x values and its minimum and maximum y values;
2) by a unique 'number', assigned by counting the cell at x=0 and y=0 as #1, the one at x=1 and y=0 as #2, and so on;
3) by its x and y coordinates.
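Just to make the terminology concrete, here's a minimal sketch of how those three identifications map onto each other, using the same variable names as the script below; the helper functions themselves are made up for illustration:

<?php
// Hypothetical helpers, illustrating the three ways a cell is identified.
// Unique number: the cell at x=0, y=0 is #1, the one at x=1, y=0 is #2, and so on.
function cell_number($x, $y, $x_count) {
    return $y * $x_count + $x + 1;
}

// Back from the unique number to grid coordinates (x, y).
function cell_coords($num, $x_count) {
    $num--; // zero-based
    return array('x' => $num % $x_count, 'y' => (int) floor($num / $x_count));
}

// Bounding box (min/max x and y values) of the cell at grid coordinates (x, y).
function cell_bounds($x, $y, $x_min, $x_step, $y_min, $y_step) {
    return array(
        'x_min' => $x_min + $x * $x_step,
        'x_max' => $x_min + ($x + 1) * $x_step,
        'y_min' => $y_min + $y * $y_step,
        'y_max' => $y_min + ($y + 1) * $y_step,
    );
}
?>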
At any rate, I know how to convert these cells into db partitions (sketched below) and I know how to further sub-partition ones that are too crowded. What I need is a good algorithm to merge cells that are empty, or which have very little data, into larger cells; any tips would be appreciated. For what it's worth, the database partitions generally restrict cell mergers to rectangular areas.
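For context, by 'convert a cell into a db partition' I mean roughly the following. This is only a sketch, assuming PostgreSQL table inheritance with CHECK constraints; $part_num, $px_min and friends are placeholder variables describing one rectangular cell (or merged group of cells):

<?php
// Sketch only: one child table per (possibly merged) rectangular cell,
// constrained to that rectangle's bounding box. Names are illustrative.
$sql = "CREATE TABLE my_points_p" . $part_num . " (
            CHECK (fx >= " . $px_min . " AND fx < " . $px_max . "
               AND fy >= " . $py_min . " AND fy < " . $py_max . ")
        ) INHERITS (my_points);";
$db->query($sql)
    or die ('create partition failed');
?>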
Here's my matrix-forming code:
<?php
define('DB_ENGINE', 'postgresql');
$path_to_web_root = '../';
include($path_to_web_root . '_top.php');
set_time_limit(0);
// == config section
$headroom = 0.0001; // padding subtracted from the min and added to the max so boundary values land inside the grid
$precision = 4; // number of decimal places used for display of values
$x_count = 100; // divisions in the x direction
$x_field = 'fy'; // the db field i assign to my x axis
$y_count = 100; // divisions in the y direction
$y_field = 'fx'; // the db field i assign to my y axis
$cell_size = 5; // the size of each table cell in pixels
$table_name = 'sxooter_float';
// == end config section
$sql = "SELECT MIN(" . $x_field . ") AS x_min, MIN(" . $y_field . ") AS y_min, MAX(" . $x_field . ") as x_max, MAX(" . $y_field . ") as y_max FROM " . $table_name . ";";
$start = get_microtime();
$result = $db->query($sql)
or die ('min/max query failed');
echo "min/max query elapsed time:" . (get_microtime() - $start) . " seconds<br>";
$row = $db->fetchrow($result);
echo 'actual x_min:' . $row['x_min'] . '<br>';
echo 'actual x_max:' . $row['x_max'] . '<br>';
echo 'actual y_min:' . $row['y_min'] . '<br>';
echo 'actual y_max:' . $row['y_max'] . '<br>';
$x_min = $row['x_min'] - $headroom;
$x_max = $row['x_max'] + $headroom;
$x_step = ($x_max - $x_min) / $x_count;
$y_min = $row['y_min'] - $headroom;
$y_max = $row['y_max'] + $headroom;
$y_step = ($y_max - $y_min) / $y_count;
// init the partition array
$matrix = array_fill(0, $x_count, array_fill(0, $y_count, 0));
$start = get_microtime();
$sql = "SELECT id, " . $x_field . ", " . $y_field . " FROM " . $table_name . ";";
$result = $db->query($sql)
or die ('fetch-all query failed');
echo 'query to fetch all records time elapsed:' . (get_microtime() - $start) . '<br>';
$start = get_microtime();
$max_value = 0;
while($row = $db->fetchrow($result)) {
$x = floor(($row[$x_field] - $x_min) / $x_step);
$y = floor(($row[$y_field] - $y_min) / $y_step);
$num = $matrix[$x][$y]; // read existing record count
$num++; // increment it
$max_value = ($num > $max_value) ? $num : $max_value; // check if it's the highest
$matrix[$x][$y] = $num; // store the incremented value
} // end of per-record counting loop
echo 'time elapsed processing records:' . (get_microtime() - $start) . '<br>';
// at this point, all records have been counted for each cell.
?>
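In case it helps to visualize, $cell_size and $max_value feed the display step, which isn't shown above. Here's a minimal sketch of one way such a heat-map rendering could look (illustrative only; the shading scheme is arbitrary):

<?php
// Illustrative only: render the count matrix as an HTML heat map, one
// $cell_size-pixel cell per matrix cell, shaded by count relative to $max_value.
echo '<table cellspacing="0" cellpadding="0">';
for ($y = $y_count - 1; $y >= 0; $y--) { // largest y first so the y axis points up on screen
    echo '<tr>';
    for ($x = 0; $x < $x_count; $x++) {
        // 255 (white) for empty cells down to 0 (full red) for the busiest cell
        $shade = ($max_value > 0) ? (int) (255 - 255 * $matrix[$x][$y] / $max_value) : 255;
        echo '<td style="width:' . $cell_size . 'px;height:' . $cell_size . 'px;'
           . 'background-color:rgb(255,' . $shade . ',' . $shade . ')"></td>';
    }
    echo '</tr>';
}
echo '</table>';
?>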