It occurred to me that some of my confusion here may be related to notation -- we use square brackets in PHP to denote arrays, but in octave they denote matrices or vectors. Code that does one thing in PHP might do something entirely different in octave. For example, this concatenates three 1x3 matrices into a single 1x9 matrix:

x1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x1 =

   1   2   3   4   5   6   7   8   9

In octave you delimit rows with semicolons:

x1 = [[1, 2, 3]; [4, 5, 6]; [7, 8, 9]]
x1 =

   1   2   3
   4   5   6
   7   8   9

And the treatment of brackets, commas, and semicolons suggests to me that there's no concept of any matrix/tensor/object with a rank greater than 2:

x1 = [[1; 2; 3]; [4; 5; 6]; [7; 8; 9]]
x1 =

   1
   2
   3
   4
   5
   6
   7
   8
   9
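
For contrast, PHP preserves the nesting -- the same brackets produce an array of three arrays, not a 1x9 row:

$x1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]];
echo count($x1) . "\n";    // 3
echo count($x1[0]) . "\n"; // 3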

Weedpacket Makes sense (or, at least, is consistent). It's a vector; it only has one dimension, while the transpose needs at least two dimensions to work with (so at least a matrix), seeing as it's expected to swap them.

I have a dim inkling that this is perhaps too clever by half to say that there's no such thing as a matrix with a singleton dimension. Or perhaps there is a distinction between a vector with 4 elements and a 1x4 matrix or 4x1 matrix. Octave certainly seems to make the distinction, and complains when dimensions don't match.

Regarding performance, I have some very disappointing news for these PHP libs. I have a data object distilled from a corpus of ham & spam messages. The training set, Xtrain, has 1375 elements, each of which represents one email message in my corpus. Each element is an array with 1966 elements, each of which is a zero or one indicating whether a particular word (from a vocabulary of 1966 words) is present in that particular email message. I munged my data corpus with some other scripts and exported this 2-D array into a json file, which can be loaded thusly:

$data_file = '../../../machine-learning/training_data_sets.json';
$data = json_decode(file_get_contents($data_file), TRUE);
$keys = array_keys($data); // Xtrain, ytrain, Xval, yval, Xtest, ytest
echo "Xtrain has " . sizeof($data['Xtrain']) . " elements\n"; // 1375 elements
echo "Xtrain[0] has " . sizeof($data['Xtrain'][0]) . " elements\n"; // 1966 elements

One of the most time-consuming calculations in the svm training script is to take this Xtrain corpus and multiply it by its transpose. In octave, this happens almost instantly with an older, slightly different corpus:

size(X)
ans =

   1382   1996

% this returns instantly
K = X * X';

I've tried this using all 3 of the matrix libs I've stumbled across, and they run MUCH slower:

NumPHP:

$X = new NumArray($data['Xtrain']);
$start = microtime(TRUE);
$XT = $X->getTranspose();
echo "getTranspose in " . (microtime(TRUE) - $start) . " seconds\n"; // 5.9611051082611 seconds
$start = microtime(TRUE);
$K = $X->dot($XT);
echo "dot product in " . (microtime(TRUE) - $start) . " seconds\n"; // 466.75154590607 seconds

MarkBaker:

$X = new Matrix\Matrix($data['Xtrain']);
$start = microtime(TRUE);
$XT = $X->transpose();
echo "transpose in " . (microtime(TRUE) - $start) . " seconds\n"; // 0.13952708244324 seconds
$start = microtime(TRUE);
$K = $X->multiply($XT);
echo "multiply in " . (microtime(TRUE) - $start) . " seconds\n"; // 1821.0363881588 seconds

PHP-ML:

$X = new Matrix($data['Xtrain']);
$start = microtime(TRUE);
$XT = $X->transpose();
echo "transpose in " . (microtime(TRUE) - $start) . " seconds\n"; // 0.14473080635071 seconds
$start = microtime(TRUE);
$K = $X->multiply($XT);
echo "multiply in " . (microtime(TRUE) - $start) . " seconds\n"; // 292.54629206657 seconds

I've yet to investigate whether these short PHP scripts are calculating the correct multiplication, but the performance seems really slow, especially given the speed at which octave seems to perform the calculation. Notational confusion aside, I am now questioning whether these libs are suitable at all for this application.

sneakyimp but the performance seems really slow

Time to switch to Rust? https://youtu.be/oY0XwMOSzq4
😉

I haven't dived into it yet, but a guy on our team is doing amazing things with it to manage the processing of huge JSON files of health insurance data.

NogDog
Ugh. I confess that rust sounds interesting, and I quite enjoy hearing from you, NogDog, but that narrator claims to have written backends in python, ruby, and node, then digresses into a rant about the linux kernel adoption being proof of rust's awesomeness. Then digresses further to discuss Jack@twitter, Discord, and the perils of garbage collection and high-level languages. His accent, which exudes a kindly-but-pedantic hauteur, is infuriating. He mentions haskell, julia, and scala, but not once are matrix operations mentioned.

Octave (a FOSS variant of matlab) handles matrix operations effortlessly. I might offload the matrix operations to octave via exec, but I might as well rely on the PHP-ML lib if I must offload my calculations to some other programming language. My stated goal is to make a dependency-free PHP implementation for use in plugins for a variety of popular frameworks/CMSes. I might further refine that goal to avoid use of exec-type functions.

    sneakyimp I have a dim inkling that this is perhaps too clever by half to say that there's no such thing as a matrix with a singleton dimension. Or perhaps there is a distinction between a vector with 4 elements and a 1x4 matrix or 4x1 matrix. Octave certainly seems to make the distinction, and complains when dimensions don't match.

    Well, it's doing it by using different punctuation to indicate different levels; none of your octave samples until now mentioned that. So [1,3,5,7], [[1,3,5,7]] and [[1],[3],[5],[7]] are all the same thing to octave, and its matrix multiplication operation treats them all as 1×4 matrices; since they're all the same object as far as octave is concerned, dot is treating them all the same.

    So dot([1,1,1,1],[1,3,5,7]) == [1,1,1,1]*[1;3;5;7], right? Will the latter give you a number or a 1x1 matrix? Or, for that matter, a length-1 vector? How would you even distinguish them? Either way, I expect the numerical value would be the same. What do dot([1;1;1;1],[1;3;5;7]), dot([1,1,1,1],[1;3;5;7]), and dot([1;1;1;1],[1,3,5,7]) produce?

    (Incidentally, personally I've been using a different notation again; I've been translating it into PHP and what I thought was octave before learning about ; for the sake of not adding another notation into the mix.)

    Basically, since octave doesn't do tensors and its 1-row matrices look exactly like vectors (and its 1-column matrices look exactly like column vectors), it's ended up with two distinct functions to do the same job (one treats [1,3,5,7] as a vector and one as a matrix); to get the same results, one of them requires you to distinguish "row" vectors and "column" vectors by using , vs. ;, and the other doesn't bother to distinguish "row" vectors and "column" vectors even if you do.

    The NumPHP and NumPY libraries don't do this confounding-vectors-with-matrices thing; they retain rank information. As intended, their dot product function works on tensors with arbitrary rank — including vectors and matrices. The dot product of two vectors is a scalar (rank 1+1-2 = 0) and the dot product of a 1×n matrix with an n×1 matrix is a 1×1 matrix (rank 2+2-2 = 2) with its sole element having the same numerical value as the scalar. The same function also of course handles multiplying length-n vectors by n×m matrices and n×m matrices by length-m vectors, with the result in both cases being a vector.

    sneakyimp

    Weedpacket Makes sense (or, at least, is consistent). It's a vector; it only has one dimension, while the transpose needs at least two dimensions to work with (so at least a matrix), seeing as it's expected to swap them.

    I have a dim inkling that this is perhaps too clever by half to say that there's no such thing as a matrix with a singleton dimension.

    Yup. Octave's dot function chooses to pretend they don't exist; its matrix-multiplication operator doesn't. (The other problem is that by ignoring ranks higher than 2, it's stuck with having to come up with some other mechanism to distinguish "row vectors" and "column vectors".)

    sneakyimp Octave certainly seems to make the distinction

    Only if you use , and ; as appropriate to distinguish between "row" and "column" vectors (or 1×n and n×1 matrices, which to octave are the same thing). Otherwise it doesn't make the distinction, which is why it needs two different functions to do the same job.

    sneakyimp Regarding performance, I have some very disappointing news for these PHP libs.... Octave (a FOSS variant of matlab) handles matrix operations effortlessly.

    Well, octave was built with maths processing in mind, and its matrix handling is probably heavily optimised. PHP ... isn't. (Actually, that splitting of dot product and matrix multiplication, and limiting itself to rank-2, may have been deliberate performance decisions.)

    Even NumPY resorts to compiled machine code when it can (and the GPU when it can find one). If you were to go the compiled route you'd probably want to make (or find) a library that you can use PHP's FFI extension with.
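
    To give a flavour of what FFI usage looks like in the simplest case, here's a toy example that binds a single libc math function (the library name assumes a typical glibc Linux box):

    $libm = FFI::cdef("double cos(double x);", "libm.so.6");
    var_dump($libm->cos(M_PI)); // float(-1)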

    Weedpacket So [1,3,5,7], [[1,3,5,7]] and [[1],[3],[5],[7]] are all the same thing to octave and its matrix multiplication operation treats them all as 1×4 matrices

    Yes.

    Weedpacket So dot([1,1,1,1],[1,3,5,7]) == [1,1,1,1]*[1;3;5;7], right?

    Yes. Both of those appear to return a scalar value. HOWEVER, if you reverse the order of operands, the two expressions no longer agree:

    octave:82> dot([1,1,1,1],[1,3,5,7])
    ans = 16
    octave:83> [1,1,1,1]*[1;3;5;7]
    ans = 16
    octave:84> dot([1,3,5,7], [1,1,1,1])
    ans = 16
    octave:85> dot([1;3;5;7], [1,1,1,1])
    ans = 16
    octave:86> [1;3;5;7]*[1,1,1,1]
    ans =
    
       1   1   1   1
       3   3   3   3
       5   5   5   5
       7   7   7   7

    Weedpacket What do dot([1;1;1;1],[1;3;5;7]), dot([1,1,1,1],[1;3;5;7]), and dot([1;1;1;1],[1,3,5,7]) produce?

    octave:86> dot([1;1;1;1],[1;3;5;7])
    ans = 16
    octave:87> dot([1,1,1,1],[1;3;5;7])
    ans = 16
    octave:88> dot([1;1;1;1],[1,3,5,7])
    ans = 16

    Weedpacket (Incidentally, personally I've been using a different notation again; I've been translating it into PHP and what I thought was octave before learning about ; for the sake of not adding another notation into the mix.)

    I appreciate your efforts, and have only slowly come to realize where my confusion lies. You are enormously helpful, and hardly to blame for Octave's quirks or dissimilarities to PHP--or my confused & anemic descriptions.

    Weedpacket ended up with two distinct functions to do the same job

    The dot function seems like a convenience function more than anything. I haven't seen it used in the code I've been looking at. Like many PHP functions, it surely has its own quirks to which one must become accustomed.

    Weedpacket The NumPHP and NumPY libraries don't do this confounding-vectors-with-matrices thing; they retain rank information

    This may indeed be the desired behavior from a sort of abstract mathematical perspective, but I would point out that matrix multiplication is not symmetric -- see those first dot-versus-multiplication results at the top of this post. I apologize that I'm unable to foresee the broader mathematical/code significance of this. My comprehension of matrix operations and their relation to the underlying algebra is dim, at best. I remain confused about what it means that dot always returns a scalar (as NumPY presumably does) when multiplying these vectors/singleton matrices, whereas the octave/matlab multiplication operation returns a scalar in one case and a 4x4 matrix when the operands are reversed.

    Weedpacket Well, octave was built with maths processing in mind, and its matrix handling is probably heavily optimised. PHP ... isn't. (Actually, that splitting of dot product and matrix multiplication, and limiting itself to rank-2, may have been deliberate performance decisions.)

    Quite frankly, I am far more interested in performance for the task at hand which has two main parts:

    • convert this svm code from octave/matlab to PHP such that it doesn't require installation of additional PHP extensions or calls to exec-type functions
    • ensure that performance is reasonable for common use cases

    I'm far less interested in any mathematical orthodoxy as far as distinctions between a vector and a 1xn or nx1 matrix. In fact, you might say I'm sort of accustomed to the quirky octave behaviors and rather hooked on the stubborn refusal of the octave multiplication operator to proceed if the dimensions don't match. When the code makes assumptions about what I'm trying to multiply, this, to me, introduces ambiguity. I like that [1,1,1,1] * [1;3;5;7] and [1;3;5;7]*[1,1,1,1] produce different results. When they don't, I feel like things are getting unpredictable.

    There can be no question that octave is highly optimized for multiplying a matrix by its transpose -- at least compared to these PHP libs I've identified. Octave is three or four orders of magnitude faster on a 1375x1966 matrix. I wrote my own multiply function, and it takes over ten minutes on this laptop to multiply this matrix by its transpose. The transposition happens in about 0.14 seconds. It's the multiplication of the 1375x1966 matrix by the 1966x1375 matrix that takes so long. My function:

            public function multiply($matrix) {
                    $iteration_count = $this->colCount();
                    if ($iteration_count !== $matrix->rowCount()) {
                            throw new Exception('nonconformant arguments (op1 is ' . $this->sizeAsString() . ', op2 is ' . $matrix->sizeAsString() . ')');
                    }
    
                    $result_rows = $this->rowCount();
                    $result_cols = $matrix->colCount();
    
                    $ret_array = [];
                    for($i=0; $i<$result_rows; $i++) {
                            $ret_array[$i] = array_fill(0, $result_cols, 0);
                            for($j=0; $j<$result_cols; $j++) {
                                    for($k=0; $k<$iteration_count; $k++) {
                                            $ret_array[$i][$j] += ($this->data[$i][$k] * $matrix->data[$k][$j]);
                                    }
                            }
                    }
    
                    return new static($ret_array);
            }

    multiply in 734.63768601418 seconds

    I looked into the PHP-ML code, which had the fastest multiplication operation in my earlier testing, and appropriated its multiplication algorithm to produce this function:

            public function multiply2($matrix) {
                    // this fn largely borrowed from php-ml, which said:
                    /*
                    - To speed-up multiplication, we need to avoid use of array index operator [ ] as much as possible( See #255 for details)
                    - A combination of "foreach" and "array_column" works much faster then accessing the array via index operator
                    */
    
                    $iteration_count = $this->colCount();
                    if ($iteration_count !== $matrix->rowCount()) {
                            throw new Exception('nonconformant arguments (op1 is ' . $this->sizeAsString() . ', op2 is ' . $matrix->sizeAsString() . ')');
                    }
    
                    $result_cols = $matrix->colCount();
                    $product = [];
                    foreach ($this->data as $row => $rowData) {
                            for ($col = 0; $col < $result_cols; ++$col) {
                                    $columnData = array_column($matrix->data, $col);
                                    $sum = 0;
                                    foreach ($rowData as $key => $valueData) {
                                            $sum += $valueData * $columnData[$key];
                                    }
                                    $product[$row][$col] = $sum;
                            }
                    }
    
                    return new self($product);
            }

    It's nearly 3 times faster:

    multiply2 in 288.33355784416 seconds

    However, as I mentioned previously, octave performs a calculation like this instantly. If I had to guess, I'd say octave is probably capitalizing on some kind of contiguous memory structure where there's no need to traverse/search an array to locate a particular index. I expect there might also be some shrewd mathematical optimizations -- there's probably a symmetry that results when you multiply a matrix by its transpose, and perhaps they're using some kind of middle-out optimization or something. I seriously doubt it'll be possible to massage these PHP multiplication functions to boost performance by 3 orders of magnitude, but I'm open to suggestions. This is a drag.
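
    One easy win I might try (untested sketch): multiply2 recomputes array_column for every output cell, but the columns of the right-hand matrix never change, so they could be extracted once before the loops:

        // sketch: hoist the column extraction out of the per-cell loop
        $columns = [];
        for ($col = 0; $col < $result_cols; ++$col) {
            $columns[$col] = array_column($matrix->data, $col);
        }
        $product = [];
        foreach ($this->data as $row => $rowData) {
            foreach ($columns as $col => $columnData) {
                $sum = 0;
                foreach ($rowData as $key => $valueData) {
                    $sum += $valueData * $columnData[$key];
                }
                $product[$row][$col] = $sum;
            }
        }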

    sneakyimp but I would point out that matrix multiplication is not symmetric -- see those first dot-versus-multiplication results at the top of this post.

    This isn't really an issue if you keep in mind that vectors are not matrices; the same formula applies for vectors and matrices.
    For vectors it reduces to

    T_{t1} · U_{u1} = ∑_k T_k U_k

    And U·T will produce the same result

    For matrix×matrix it's

    T_{t1,t2} · U_{u1,u2} = ∑_k T_{t1,k} U_{k,u2}

    which isn't the same thing as

    U_{u1,u2} · T_{t1,t2} = ∑_k U_{u1,k} T_{k,t2}

    because you're pairing different elements together (in the first, the elements in one row of T with the elements in a column of U, and in the second the elements in a row of U with the elements in a column of T). That's why matrix multiplication isn't symmetric, and also why it doesn't matter for vector dot product; vectors don't have that many dimensions so there is less freedom for how their elements can be paired up.

    sneakyimp I like that [1,1,1,1] * [1;3;5;7] and [1;3;5;7]*[1,1,1,1] produce different results. When they don't, I feel like things are getting unpredictable.

    Which is what you'd expect from multiplying a 1×4 matrix and a 4×1 matrix vs. a 4×1 matrix and a 1×4 matrix. Not having that ,/; distinction I'd write them as

    {{1,1,1,1}}.{{1},{3},{5},{7}} = {{16}}

    and

    {{1},{3},{5},{7}}.{{1,1,1,1}} = {{1, 1, 1, 1}, {3, 3, 3, 3}, {5, 5, 5, 5}, {7, 7, 7, 7}}

    And the brackets are significant because without them they're not matrices but vectors, which means

    {1, 1, 1, 1} . {1, 3, 5, 7}={1, 3, 5, 7} . {1, 1, 1, 1}=16

    sneakyimp I remain confused about what it means that dot always returns a scalar

    That I think comes back to dot deliberately ignoring singleton dimensions; it throws away the distinction between [1,1,1,1] and [1;1;1;1], which means dot([1;1;1;1],[1,1,1,1]) and dot([1,1,1,1],[1;1;1;1]) are indistinguishable; so no wonder they produce the same answer. The matrix multiplication operator doesn't do this.
    Octave's notation makes it impossible to distinguish between a length-n vector and a 1×n matrix, so it has to have different functions where one treats it as a vector and the other treats it as a matrix. The one that treats it as a vector has to also treat n×1 matrices as vectors as well.

    Weedpacket (in the first, the elements in one row of T with the elements in a column of U, and in the second the elements in a row of U with the elements in a column of T). That's why matrix multiplication isn't symmetric, and also why it doesn't matter for vector dot product; vectors don't have that many dimensions so there is less freedom for how their elements can be paired up.

    Thanks once again, Weedpacket, for patiently explaining these things. I do have a much better grip, I think, on vector/matrix operations. I feel stronger in my conceptual understanding of what is transpiring in these operations. My only lingering question: should we always assume 1xn or nx1 matrices are vectors, or are there circumstances where we'd want to treat them as actual matrices? All the code and explanations I've seen here suggest that 1xn/nx1 matrices are always intended as vectors.

    And, lastly, I think it seems clear that these three PHP libs we've examined are sorely inadequate for a pure PHP implementation (no FANN, no exec of other executables or scripts) of my Support Vector Machine. My spam filter example requires one to multiply a 1375x1966 matrix by its inverse 1966x1375 matrix and it is a hundred times too slow. Unless someone can suggest a way to construct a data structure in pure PHP that can efficiently access matrix indices, PHP is going to be doing some searching/traversal of these large arrays. I suspect (but am not sure) that the PHP slowness here is due to array search/traversal, whereas in C or Octave (or perhaps Rust), one might be able to construct an efficient, contiguous memory segment where accessing $matrix[$i][$j] is simply a matter of calculating a memory offset from some starting memory location and then it's simple RAM access rather than array search/traversal.
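
    (To illustrate the offset idea in PHP terms -- a sketch, not something I've benchmarked -- one flat SplFixedArray standing in for a contiguous block:)

        $rows = 1375; $cols = 1966;
        $flat = new SplFixedArray($rows * $cols);
        // element (i, j) lives at offset i * cols + j
        $i = 2; $j = 5;
        $flat[$i * $cols + $j] = 1.0;
        echo $flat[$i * $cols + $j]; // 1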

    sneakyimp should we always assume 1xn or nx1 matrices are vectors, or are there circumstances where we'd want to treat them as actual matrices? All the code and explanations I've seen here suggest that 1xn/nx1 matrices are always intended as vectors.

    Er ... if you mean my code and explanations then that's the opposite of what I was saying.

    sneakyimp My spam filter example requires one to multiply a 1375x1966 matrix by its inverse 1966x1375 matrix

    You mean "transpose", of course, because multiplying a matrix by its inverse is really easy if you assume it has one 🙂

    I do have a couple of suggestions; one is to use an SplFixedArray (especially since the array's sizes are all known).
    The other, if AA' is the main slow part, is to use an algorithm optimised for that case. Mainly, since the result is symmetric, you only need to compute half of it; the other half would be a mirror image. You could either calculate the half and then copy everything into the other half (if you're using SplFixedArrays this won't make much difference memory-wise, but the time spent on it may save more time later on), or just leave the array half-filled and redirect lookups as appropriate (so, every time you want to look up $symm[$j][$i] you'd check to see if $i>$j and if it is you'd look up $symm[$i][$j] instead.)
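
    A rough sketch of that second idea (plain arrays, untested): since element (i,j) of AA' is just the dot product of rows i and j of A, you don't even need to build the transpose:

        // K = X * X', computing only the lower triangle and mirroring it
        function gram(array $rows): array {
            $n = count($rows);
            $K = [];
            for ($i = 0; $i < $n; $i++) {
                $ri = $rows[$i];
                for ($j = 0; $j <= $i; $j++) {
                    $sum = 0;
                    foreach ($rows[$j] as $k => $v) {
                        $sum += $v * $ri[$k];
                    }
                    $K[$i][$j] = $sum;
                    $K[$j][$i] = $sum; // mirror into the upper triangle
                }
            }
            return $K;
        }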

    Weedpacket You mean "transpose", of course, because multiplying a matrix by its inverse is really easy if you assume it has one 🙂

    OOF yes I meant transpose. Inverse is something else entirely. Sorry for the mistake.

    I will of course attempt a multiply function using SPLFixedArray.

      Not having much luck with SplFixedArray. Here's a script:

      class JMatrix {
      	public $data;
      
      	public function __construct($arr) {
      		if (is_a($arr, 'SplFixedArray')) {
      			// shortcut...provide SplFixedArray param, bypass checks and just assign
      			$this->data = $arr;
      		} elseif (!is_array($arr)) {
      			// if not SplFixedArray, must be standard array
      			throw new Exception('Constructor param must be an array');
      		} else {
      			// if it's an array, check if 2D array (matrix) or just 1D array vector, which we treat as matrix
      			if (!is_array($arr[0])) {
      				// the supplied parameter is a simple array, so we'll construct a row vector
      				// from its elements
      				$row_count = 1;
      				$col_count = sizeof($arr);
      				$this->data = new SplFixedArray(1);
      				$this->data->offsetSet(0, new SplFixedArray($col_count));
      				foreach($arr as $j => $v) {
      					$this->data->offsetGet(0)->offsetSet($j, $v);
      				}
      
      			} else {
      				// an array of arrays...sanity check on element counts
      				$row_count = sizeof($arr);
      				$col_count = sizeof($arr[0]);
      				foreach($arr as $v) {
      					if (sizeof($v) !== $col_count) {
      						throw new Exception('Invalid matrix array. Inconsistent column count in constructor array.');
      					}
      				}
      				$this->data = new SplFixedArray($row_count);
      				foreach($arr as $i => $row) {
      					$this->data->offsetSet($i, new SplFixedArray($col_count));
      					foreach($row as $j => $v) {
      						$this->data->offsetGet($i)->offsetSet($j, $v);
      					}
      				}
      				
      			}
      		}
      	}
      
      	public function rowCount() {
      		return $this->data->getSize();
      	}
      	public function colCount() {
      		return $this->data->offsetGet(0)->getSize();
      	}
      
      	public function size() {
      		return [$this->rowCount(), $this->colCount()];
      	}
      	public function sizeAsString() {
      		return $this->rowCount() . 'x' . $this->colCount();
      	}
      
      	public function __toString() {
      		$retval = "[\n";
      		foreach($this->data as $i => $row) {
      			$retval .= "\t[" . implode(",", (array)$row) . "]\n";
      		}
      		$retval .= "]\n";
      		return $retval;
      	}
      
      	public function multiply($matrix) {
      		$iteration_count = $this->colCount();
      		if ($iteration_count !== $matrix->rowCount()) {
      			throw new Exception('nonconformant arguments (op1 is ' . $this->sizeAsString() . ', op2 is ' . $matrix->sizeAsString() . ')');
      		}
      
      		$result_rows = $this->rowCount();
      		$result_cols = $matrix->colCount();
      
      		// accumulate results in this array
      		$ret_array = new SplFixedArray($result_rows);
      
      		// temp vars to avoid object property lookup
      		$td = $this->data;
      		$md = $matrix->data;
      
      		for($i=0; $i<$result_rows; $i++) {
      			// current row from this matrix
      			$drow = $td->offsetGet($i);
      
      			// accumulate output matrix row here
      			$retrow = new SplFixedArray($result_cols);
      
      			// loop thru $matrix columns
      			for($j=0; $j<$result_cols; $j++) {
      
      				// accumulate product sum here
      				$sum = 0;
      
      				// iteration count is number of columns in this matrix == number of rows in $matrix
      				for($k=0; $k<$iteration_count; $k++) {
      					$sum += ($drow->offsetGet($k) * $md->offsetGet($k)->offsetGet($j));
      				}
      
      				// all products summed, store it in our output matrix row
      				$retrow->offsetSet($j, $sum);
      			}
      
      			// put the output matrix row into our output matrix
      			$ret_array->offsetSet($i, $retrow);
      		}
      		return new static($ret_array);
      	}
      
      	public function multiply2($matrix) {
      		$iteration_count = $this->colCount();
      		if ($iteration_count !== $matrix->rowCount()) {
      			throw new Exception('nonconformant arguments (op1 is ' . $this->sizeAsString() . ', op2 is ' . $matrix->sizeAsString() . ')');
      		}
      
      		$result_rows = $this->rowCount();
      		$result_cols = $matrix->colCount();
      
      		// accumulate results in this array
      		$ret_array = new SplFixedArray($result_rows);
      
      		// temp vars to avoid object property lookup
      		$td = $this->data;
      		$md = $matrix->data;
      
      
      		$td->rewind();
      		$i = 0;
      		while ($td->valid()) {
      			// current row from this matrix
      			$drow = $td->current();
      
      			// accumulate output matrix row here
      			$retrow = new SplFixedArray($result_cols);
      
      			// loop thru $matrix columns
      			for($j=0; $j<$result_cols; $j++) {
      
      				// accumulate product sum here
      				$sum = 0;
      
      				// iteration count is number of columns in this matrix == number of rows in $matrix
      				$md->rewind();
      				$k = 0;
      				while($md->valid()) {
      					$mrow = $md->current();
      					$sum += ($drow->offsetGet($k) * $mrow->offsetGet($j));
      
      					$md->next();
      					$k++;
      				}
      
      				// all products summed, store it in our output matrix row
      				$retrow->offsetSet($j, $sum);
      			}
      
      			// put the output matrix row into our output matrix
      			$ret_array->offsetSet($i, $retrow);
      
      			$td->next();
      			$i++;
      		}
      		return new static($ret_array);
      	}
      
      	public function transpose() {
      		$arr = [];
      		foreach($this->data as $row) {
      			$arr[] = (array)$row;
      		}
      		// array_map with a null callback zips corresponding elements
      		// of the rows together, i.e. transposes the matrix
      		$arr = array_map(null, ...$arr);
      		return new static($arr);
      	}
      }
      
      
      echo "SANITY CHECK\n";
      $a = new JMatrix([
      	[1, 4, -2],
      	[3, 5, -6]
      ]);
      echo $a;
      echo "size: " .  $a->sizeAsString() . "\n";
      
      $b = new JMatrix([
      	[5, 2, 8, -1],
      	[3, 6, 4, 5],
      	[-2, 9, 7, -3]
      ]);
      echo $b;
      echo "size: " .  $b->sizeAsString() . "\n";
      
      $v = $a->multiply2($b);
      echo $v . "\n";
      echo "size: " .  $v->sizeAsString() . "\n";
      
      $vt = $v->transpose();
      echo "$vt\n";
      echo "size: " .  $vt->sizeAsString() . "\n";
      
      
      $data_file = './training_data_sets.json';
      $data = json_decode(file_get_contents($data_file), TRUE);
      
      echo 'Xtrain rows: ' . sizeof($data['Xtrain']) . "\n";
      echo 'Xtrain cols: ' . sizeof($data['Xtrain'][0]) . "\n";
      
      $X = new JMatrix($data['Xtrain']);
      
      $start = microtime(TRUE);
      $XT = $X->transpose();
      echo "transpose in " . (microtime(TRUE) - $start) . " seconds\n";
      
      $start = microtime(TRUE);
      $K = $X->multiply2($XT);
      echo "multiply2 in " . (microtime(TRUE) - $start) . " seconds\n";

      multiply, my first attempt, runs in 973 seconds:

      Xtrain rows: 1375
      Xtrain cols: 1966
      transpose in 0.79622912406921 seconds
      multiply in 973.16395616531 seconds

      multiply2, where I attempt to avoid indexed lookups, actually takes longer -- about 20 minutes 😢.

      Xtrain rows: 1375
      Xtrain cols: 1966
      transpose in 0.79264497756958 seconds
      multiply2 in 1199.6731460094 seconds

        Oh, well. Worth a try. (Incidentally, the instanceof operator could be used instead of the is_a function. Since SplFixedArray implements Countable, you can use count() instead of calling ->getSize() manually; implementing IteratorAggregate means that the ->rewind(), ->valid(), ->current(), and ->next() methods will be called by foreach as and when needed; and implementing the ArrayAccess interface means you can use array index notation in place of ->offsetGet() and ->offsetSet(), e.g. $this->data->offsetGet(0)->offsetSet($j, $v); becomes $this->data[0][$j] = $v;. And I'm guessing you're using PHP 7: SplFixedArray dropped the rewind() method in v8 when it was switched from being an Iterator to an IteratorAggregate. Woo. Big sidebar. But I did get some improvement by making those changes. Possibly because of the dedicated opcodes for those things.)
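
        For example, with those interfaces the innermost loop of multiply() reduces to something like this (sketch):

        $sum = 0;
        foreach ($drow as $k => $value) {
            // ArrayAccess: $md[$k][$j] instead of offsetGet($k)->offsetGet($j)
            $sum += $value * $md[$k][$j];
        }
        $retrow[$j] = $sum;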

        But I think PHP will (even with JIT enabled) lag behind domain-specific languages that will have had dedicated native code for those tasks for which the language is intended ("a matrix multiplication? I have machine code for that compiled from C++ right here"); PHP's execution would still be passing through the various reference-counted zval structures and hashtables and whatnot before getting right to the gist of the task. (See, e.g. Internal value representation in PHP 7 - Part 2.)

        Weedpacket Oh, well. Worth a try. (Incidentally, the instanceof operator could be used instead of the is_a function...I did get some improvement by making those changes. Possibly because of the dedicated opcodes for those things.)

        Thanks for these helpful clarifications. It had occurred to me that the transpose operation is quite fast, and part of the slowness seems to be due to the array indexing required. Note how the fastest PHP implementation, PHP-ML, makes use of the array_column function and explicitly tries to avoid array indexing as much as possible. It therefore occurred to me that I might quickly transpose the matrix supplied to my multiply function and benefit from slightly more efficient array operations. However, I don't think these little changes will bring us anywhere near the supremely fast performance I see in Octave. I exported my 1375x1966 array to an octave-friendly text format, loaded it in octave, and multiplied it with this little octave/matlab script:

        Xtrain = dlmread("Xtrain.txt");
        sz = size(Xtrain)
        t0 = clock();
        K = Xtrain * Xtrain';
        elapsed_time = etime(clock(), t0);
        fprintf("transpose multiplication completed in %d seconds\n", elapsed_time);

        Octave completes the matrix multiplication in a stunning 67ms:

        sz =
        
           1375   1966
        
        transpose multiplication completed in 0.0668411 seconds

        This is 4376.7 times faster than the fastest PHP code I've found, which is accomplished by the matrix classes in the PHP-ML library.

        NOTE that the PHP-ML library apparently punts to some kind of compiled executable to perform its SVM training operations. It constructs a command and invokes the external executable with exec.

        I consulted with a friend who's a talented python jockey who offered some interesting initial observations on matrix multiplication algorithms:

        Strassen's was the first notable speedup, 1970 — O(n ^ 2.8) iirc — and people have knocked the exponent down further and further in the ensuing decades. In fact just within the last two months I read about this recent improvement: as per https://en.wikipedia.org/wiki/Matrix_multiplication#Computational_complexity

        As of December 2020, the best matrix multiplication algorithm is by Josh Alman and
        Virginia Vassilevska Williams and has complexity O(n ^ 2.3728596).

        That statement just quoted has a footnote to this paper: https://arxiv.org/abs/2010.05846, "A Refined Laser Method and Faster Matrix Multiplication".
        It's a tiny speedup over the previous best-known algo: best was formerly 

        O(n^{2.37287 + ε}) for all ε;

        the paper improves that bound to 

        O(n^{2.37286 + ε}) for all ε;

        My friend rightly observes:

        It would not be easy to whip up working code from the paper, written for actual and aspiring specialists.

        He pointed out an interesting exercise in matrix operation optimization and noted that numpy is very fast. Like the paper, these articles get into a lot of the nitty gritty (i.e., the weeds) of bottlenecks and various optimizations.

        I guess Octave and Matlab use BLAS (Basic Linear Algebra Subprograms), originally a Fortran library now ported to C or something, under the hood.

        My kind friend imported my 1375x1966 matrix for some python profiling. Some interesting results on his 2015 Macbook Pro:

        Numpy takes 2.06s to matmult Xtrain by its transpose

        He took the trouble to write a "pure python" function which he ran in a "notebook":

        PS I'm trying to time the naive algorithm in pure Python, operating either on the same numpy "ndarray"s, or on native Python data structures (list of lists of numbers)
        I've tried both, a few times; I give up waiting, I haven't seen either complete, it takes minutes (at least!) and I get bored.

        Safe conclusion: Performance of pure Python naive algorithm in a notebook is worse than PHP. Way worse: in a practical sense it doesn't even terminate.
        Performance at command line: (minutes later, still waiting… I'll check back in an hour, if it terminates ) 

        Maybe it's faster at the command line than in a notebook? I bet not enough to matter, but let's try. 
        With native Python datatypes: it takes at least tens of seconds... a minute… 5+ minutes… 
        Just checked back: it took 20m 29.15s
        (So, yes: faster than when running in a notebook, but still unusable.)

        He went on to make some minor improvements, but the 'pure python' approach, which doesn't make use of NumPY, is clearly slow like the pure PHP approach. Oddly enough, I find this something of a relief. I'd hate to think python is inherently better than my dear old PHP.

        The lesson here is that it's clearly advantageous--necessary!--to delegate matrix operations to something outside PHP (or python) which takes advantage of decades of algorithmic refinement by the world's finest nerds. I truly doubt that your typical linux server will have octave installed, but it's easy enough with apt if you have su access via CLI. I think your typical server is much more likely to have python installed. Or perhaps I could distribute my own executable to handle matrix operations? Any suggestions about farming out these matrix operations from PHP would be much appreciated. Ideally we'd find some tool for matrix operations that will already exist in your typical server so we can just exec it from PHP with reasonable hope of success.

          The last time I had to do anything numerically-matricy (it was actually a dynamic programming problem) that was just way too slow and/or large to run in PHP, I whipped up a program in C to do it and compiled (TinyCC was good enough) it as an executable that I communicated with via proc_open. That was years and major versions ago. If PHP had had FFI back then I would have considered compiling it as a library and calling the API directly (warning: requires third-party library, which I suspect is the same one NumPY ends up using) instead of that mucking around with stdio. (The PHP manual has a note that accessing FFI structures is twice as slow as accessing PHP ones, so "it makes no sense to use the FFI extension for speed". But accessing the structure isn't the bottleneck here.)
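
          In skeleton form, the stdio approach looked something like this (the ./matmul helper and the JSON wire format are just placeholders for whatever you build):

          // talk to a compiled helper over stdin/stdout via proc_open
          $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];
          $proc = proc_open('./matmul', $spec, $pipes);
          if (!is_resource($proc)) {
              throw new Exception('could not start helper');
          }
          fwrite($pipes[0], json_encode($X)); // send the operand matrix
          fclose($pipes[0]);
          $K = json_decode(stream_get_contents($pipes[1]), TRUE); // read the product
          fclose($pipes[1]);
          fclose($pipes[2]);
          proc_close($proc);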

          Thanks, Weedpacket, for the detail. Your post is over my head, and I'll have to do some remedial study on the particulars of compiling/linking/assembling to understand the dynamics here. Offhand, it seems tricky to find some speedy lib external to PHP which can perform fast matrix operations. I expect there are two paths:
          1) try using exec to run some CLI commands in the hope that python or octave or some other package has been installed on the machine (rough sketch after this list). Might also have to download a python lib or two like NumPy.
          2) try invoking some executable file that I cook up myself via exec or FFI or some other process-to-process method. This seems quite involved, and it sounds like a real chore to distribute executables that'll work on linux, MacOS, windows, or whatever.
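
          For path 1, a rough sketch of what I have in mind, assuming octave happens to be on the PATH (untested):

          // write the matrix out as CSV, have octave multiply it by its
          // transpose, and read the product back in
          $in = tempnam(sys_get_temp_dir(), 'mat');
          $out = $in . '.out';
          $fh = fopen($in, 'w');
          foreach ($data['Xtrain'] as $row) {
              fputcsv($fh, $row);
          }
          fclose($fh);
          $script = "X = dlmread('$in'); dlmwrite('$out', X * X');";
          exec('octave --no-gui --quiet --eval ' . escapeshellarg($script), $o, $status);
          if ($status !== 0) {
              throw new Exception('octave invocation failed');
          }
          $K = array_map('str_getcsv', file($out, FILE_IGNORE_NEW_LINES));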

          Obviously, my goal of a 'pure PHP' library is unattainable. I still hope that I might work up a solution that doesn't require su access to install (e.g., to configure PHP with FFI module or FANN, or to install octave or python).

            8 days later

            I googled around for 'php language binding blas' and ran across this NP lib. It requires PHP 8 with FFI, as well as libblas and liblapack (not sure what the latter two are). It's evidently a work in progress, and the composer command complained about stable versions on my MacOS laptop until I specified a dev branch:

            composer require ghostjat/np:dev-main

            This gets the composer install to work but then the PHP sample script they provide:

            require __DIR__ . '/vendor/autoload.php';
            use Np\matrix;
            
            $ta = matrix::randn(1000, 1000);    
            $tb = matrix::randn(1000, 1000); // to generate random 2d matrix
            $ta->dot($tb);                  // do a dot operation on given matrix
            $ta->getMemory();              // get memory use
            $ta->time();                  // get time

            fails, complaining that FFI::scope had a problem:

            PHP Fatal error: Uncaught FFI\Exception: Failed loading scope 'blas' in /Users/sneakyimp/np/vendor/ghostjat/np/src/core/blas.php:28
            Stack trace:
            #0 /Users/sneakyimp/np/vendor/ghostjat/np/src/core/blas.php(28): FFI::scope('blas')
            #1 /Users/sneakyimp/np/vendor/ghostjat/np/src/core/blas.php(73): Np\core\blas::init()
            #2 /Users/sneakyimp/np/vendor/ghostjat/np/src/linAlgb/linAlg.php(45): Np\core\blas::gemm(Object(Np\matrix), Object(Np\matrix), Object(Np\matrix))
            #3 /Users/sneakyimp/np/vendor/ghostjat/np/src/linAlgb/linAlg.php(30): Np\matrix->dotMatrix(Object(Np\matrix))
            #4 /Users/sneakyimp/np/foo.php(8): Np\matrix->dot(Object(Np\matrix))
            #5 {main}
            thrown in /Users/sneakyimp/np/vendor/ghostjat/np/src/core/blas.php on line 28
            
            The FFI docs are sorely incomplete, so I've no idea what this command is meant to do or what conditions must be met for it to succeed. There is mention of FFI_SCOPE there in the docs, and there's a blas.h file that contains a #define that looks related to this, but I don't know how to fix this problem.

              I've made a tiny bit of progress with ghostjat/np by reading the source code (which is not well-commented) and by reading the PHP FFI documentation (which is not great). I don't have much faith that it'll be well-behaved and robust, but I have managed to get ghostjat/np to multiply my large matrix by its transpose, and performance seems very fast. To get it working on my Ubuntu 20 workstation:

              ./composer.phar require ghostjat/np:dev-main

              Edit the file vendor/ghostjat/np/src/core/blas.h and change the value of FFI_LIB to refer to your BLAS shared library file. On my machine, this works:

              #define FFI_LIB "/usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so"

              NOTE I don't really know much about what this accomplishes. The FFI docs say:

              The header file may contain a #define statement for the FFI_LIB variable to specify the library it exposes. If it is a system library only the file name is required, e.g.: #define FFI_LIB "libc.so.6". If it is a custom library, a relative path is required, e.g.: #define FFI_LIB "./mylib.so".

              I'd also point out that I just got lucky by specifying that file. Just trial and error. A sudo apt search blas command returns a giant list of possible packages that one might or might not have installed on one's machine. My workstation has octave installed and I was able to locate one shared library, libopenblasp-r0.3.8.so, with this command:

              # barfs up lots of 'not permitted' warnings so route stderr to /dev/null
              sudo find / -name "*blas*.so" 2>/dev/null

              I'd like to take a moment to point out that this path is highly specific to my particular machine. THIS IS NOT REALLY PORTABLE AT ALL.

              Finally, I created this file bar.php in my working directory, which loads the JSON file containing my 1375x1966 matrix, among other things:

              require __DIR__ . '/vendor/autoload.php';
              use Np\matrix;
              
              $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
              
              // more trial and error...cooked this up after reading the ghostjat/np source code
              Np\core\blas::$ffi_blas = FFI::load(__DIR__ . '/vendor/ghostjat/np/src/core/blas.h');
              
              $ta = matrix::ar($data['Xtrain']);
              
              $start = microtime(true);
              $tb = $ta->transpose(); // transpose of $ta
              echo "transpose elapsed time: " . (microtime(true) - $start) . "seconds\n";
              
              $start = microtime(true);
              $tc = $ta->dot($tb);
              echo "dot elapsed time: " . (microtime(true) - $start) . "seconds\n";
              
              $arr = $tc->asArray();
              echo "rows: " . sizeof($arr) . "\n";
              echo "cols: " . sizeof($arr[0]) . "\n";
              
              $json = json_encode($arr, JSON_PRETTY_PRINT);
              file_put_contents(__DIR__ . '/np-product.json', $json);
              echo "json md5 is " . md5($json) . "\n";

              It's fast!

              transpose elapsed time: 0.17577886581421seconds
              dot elapsed time: 0.089972019195557seconds
              rows: 1375
              cols: 1375
              json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e

              The resulting JSON file (with pretty print turned on) has 1893376 lines. I calculate the md5 of the JSON for comparison with the other algorithms. These appear to all produce the exact same JSON:

              MarkBaker:

              require_once 'vendor/autoload.php';
              
              $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
              
              $X = new Matrix\Matrix($data['Xtrain']);
              
              $start = microtime(TRUE);
              $XT = $X->transpose();
              echo "getTranspose in " . (microtime(TRUE) - $start) . " seconds\n";
              
              $start = microtime(TRUE);
              $K = $X->multiply($XT);
              echo "dot product in " . (microtime(TRUE) - $start) . " seconds\n";
              
              $arr = $K->toArray();
              echo "rows: " . sizeof($arr) . "\n";
              echo "cols: " . sizeof($arr[0]) . "\n";
              $json = json_encode($arr, JSON_PRETTY_PRINT);
              echo "json md5 is " . md5($json) . "\n";
              file_put_contents(__DIR__ . '/markbaker-product.json', $json);
              
              echo "COMPLETE\n";

              Multiplication is crazy slow:

              getTranspose in 0.083314180374146 seconds
              dot product in 1227.2787439823 seconds
              rows: 1375
              cols: 1375
              json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e
              COMPLETE

              NumPHP:

              require 'vendor/autoload.php';
              
              use NumPHP\Core\NumArray;
              
              $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
              
              $X = new NumArray($data['Xtrain']);
              
              $start = microtime(TRUE);
              $XT = $X->getTranspose();
              echo "getTranspose in " . (microtime(TRUE) - $start) . " seconds\n"; // 5.9611051082611 seconds
              
              $start = microtime(TRUE);
              $K = $X->dot($XT);
              echo "dot product in " . (microtime(TRUE) - $start) . " seconds\n"; // 466.75154590607 seconds
              
              $arr = $K->getData();
              echo "rows: " . sizeof($arr) . "\n";
              echo "cols: " . sizeof($arr[0]) . "\n";
              $json = json_encode($arr, JSON_PRETTY_PRINT);
              echo "json md5 is " . md5($json) . "\n";
              file_put_contents(__DIR__ . '/numphp-product.json', $json);
              
              echo "COMPLETE\n";

              Transpose in NumPHP is surprisingly slow, and the dot product is unacceptably slow for deployment in a web server application:

              getTranspose in 3.0702431201935 seconds
              dot product in 224.07051301003 seconds
              rows: 1375
              cols: 1375
              json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e
              COMPLETE

              PHP-ML:

              require_once __DIR__ . '/vendor/autoload.php';
              
              use Phpml\Math\Matrix;
              
              $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
              
              $X = new Matrix($data['Xtrain']);
              
              $start = microtime(TRUE);
              $XT = $X->transpose();
              echo "transpose in " . (microtime(TRUE) - $start) . " seconds\n";
              
              $start = microtime(TRUE);
              $K = $X->multiply($XT);
              echo "multiply in " . (microtime(TRUE) - $start) . " seconds\n"; // 292.54629206657 seconds
              
              $arr = $K->toArray();
              echo "rows: " . sizeof($arr) . "\n";
              echo "cols: " . sizeof($arr[0]) . "\n";
              $json = json_encode($arr, JSON_PRETTY_PRINT);
              echo "json md5 is " . md5($json) . "\n";
              file_put_contents(__DIR__ . '/php-ml-product.json', $json);
              
              echo "COMPLETE\n";

              Perhaps the best 'pure php' implementation, but still unacceptably slow for web server applications:

              transpose in 0.082174062728882 seconds
              multiply in 178.16531610489 seconds
              rows: 1375
              cols: 1375
              json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e
              COMPLETE

              Home-rolled JMatrix:

              require_once __DIR__ . '/JMatrix.php';
              
              $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
              
              $X = new JMatrix($data['Xtrain']);
              
              $start = microtime(TRUE);
              $XT = $X->transpose();
              echo "transpose in " . (microtime(TRUE) - $start) . " seconds\n";
              
              $start = microtime(TRUE);
              $K = $X->multiply($XT);
              echo "multiply in " . (microtime(TRUE) - $start) . " seconds\n";
              
              $arr = [];
              foreach ($K->data as $row) {
                  $arr[] = $row->toArray();
              }
              echo 'rows: ' . sizeof($arr) . "\n";
              echo 'cols: ' . sizeof($arr[0]) . "\n";
              $json = json_encode($arr, JSON_PRETTY_PRINT);
              echo "json md5 is " . md5($json) . "\n";

              This is also super slow:

              transpose in 0.51202392578125 seconds
              multiply in 739.11785197258 seconds
              rows: 1375
              cols: 1375
              json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e

              The ghostjat/np method via FFI seems promising, but I have a lot of concerns:

              • where to find the right shared library on a given system? my macos laptop has no BLAS file ending in .so
              • does the blas.h file included in ghostjat/np match up with my shared lib file? Will different libraries have different .h files with different declarations?
              • which BLAS libraries are compatible with ghostjat/np?
              • portability: what to do on windows or macos?
              • how does FFI work? I don't yet understand why ghostjat/np can multiply so much faster, and there appears to be some new-fangled PHP at work. The matrix class extends nd, but neither of these has a dot function.

              sneakyimp I don't yet understand why ghostjat/np can multiply so much faster,

              blas is written in Fortran (and, for the purposes of Octave, translated to C during the build). It gets compiled to machine code via assembly. So the matrix multiplication is being done on CPU-native types using an optimised sequence of CPU instructions:

              	movl	$1, j.8(%rip)
              	jmp	.L2
              .L9:
              	movq	-56(%rbp), %rax
              	movl	(%rax), %eax
              	movl	%eax, -20(%rbp)
              	movl	$1, i__.7(%rip)
              	jmp	.L3
              .L4:
              	movl	j.8(%rip), %eax
              	imull	-8(%rbp), %eax
              	movl	%eax, %edx
              	movl	i__.7(%rip), %eax
              	addl	%edx, %eax
              	cltq
              	leaq	0(,%rax,8), %rdx
              	movq	56(%rbp), %rax
              	addq	%rdx, %rax
              	pxor	%xmm0, %xmm0
              	movsd	%xmm0, (%rax)
              	movl	i__.7(%rip), %eax
              	addl	$1, %eax
              	movl	%eax, i__.7(%rip)
              .L3:
              	movl	i__.7(%rip), %eax
              	cmpl	%eax, -20(%rbp)
              	jge	.L4
              	movq	-72(%rbp), %rax
              	movl	(%rax), %eax
              	movl	%eax, -20(%rbp)
              	movl	$1, l.6(%rip)
              	jmp	.L5
              .L8:
              	movl	j.8(%rip), %eax
              	imull	-16(%rbp), %eax
              	movl	%eax, %edx
              	movl	l.6(%rip), %eax
              	addl	%edx, %eax
              	cltq
              	leaq	0(,%rax,8), %rdx
              	movq	32(%rbp), %rax
              	addq	%rdx, %rax
              	movsd	(%rax), %xmm0
              	movsd	%xmm0, temp.5(%rip)
              	movq	-56(%rbp), %rax
              	movl	(%rax), %eax
              	movl	%eax, -12(%rbp)
              	movl	$1, i__.7(%rip)
              	jmp	.L6
              .L7:
              	movl	j.8(%rip), %eax
              	imull	-8(%rbp), %eax
              	movl	%eax, %edx
              	movl	i__.7(%rip), %eax
              	addl	%edx, %eax
              	cltq
              	leaq	0(,%rax,8), %rdx
              	movq	56(%rbp), %rax
              	addq	%rdx, %rax
              	movsd	(%rax), %xmm1
              	movl	l.6(%rip), %eax
              	imull	-4(%rbp), %eax
              	movl	%eax, %edx
              	movl	i__.7(%rip), %eax
              	addl	%edx, %eax
              	cltq
              	leaq	0(,%rax,8), %rdx
              	movq	16(%rbp), %rax
              	addq	%rdx, %rax
              	movsd	(%rax), %xmm2
              	movsd	temp.5(%rip), %xmm0
              	mulsd	%xmm2, %xmm0
              	movl	j.8(%rip), %eax
              	imull	-8(%rbp), %eax
              	movl	%eax, %edx
              	movl	i__.7(%rip), %eax
              	addl	%edx, %eax
              	cltq
              	leaq	0(,%rax,8), %rdx
              	movq	56(%rbp), %rax
              	addq	%rdx, %rax
              	addsd	%xmm1, %xmm0
              	movsd	%xmm0, (%rax)
              	movl	i__.7(%rip), %eax
              	addl	$1, %eax
              	movl	%eax, i__.7(%rip)
              .L6:
              	movl	i__.7(%rip), %eax
              	cmpl	%eax, -12(%rbp)
              	jge	.L7
              	movl	l.6(%rip), %eax
              	addl	$1, %eax
              	movl	%eax, l.6(%rip)
              .L5:
              	movl	l.6(%rip), %eax
              	cmpl	%eax, -20(%rbp)
              	jge	.L8
              	movl	j.8(%rip), %eax
              	addl	$1, %eax
              	movl	%eax, j.8(%rip)
              .L2:
              	movl	j.8(%rip), %eax
              	cmpl	%eax, -24(%rbp)
              	jge	.L9

              The PHP-only implementations are not. Even if the Zend VM's opcodes are jitted, it would still be doing all the reference-counted copy-on-write zobject manipulation stuff that is PHP's managed environment.

              Thanks, weedpacket. If I'm not mistaken, that is assembler code you posted? Where did this code come from? Are you inspecting source code for some BLAS lib? Thanks also for the blas link.

              It seems obvious to me that pure PHP approaches to matrix multiplication are futile. The fastest PHP implementation is about 2000 times slower on this particular $X->dot($XT) operation. The FFI approach, on the other hand, runs in 80 milliseconds, which is plenty fast for web server applications.

              My jury-rigged approach using the ghostjat/np library evidently does yield the correct matrix product, but I'm really concerned that I might be doing something wrong. I don't really know what could happen, but this warning in the FFI docs seems pretty stark:

              Caution
              FFI is dangerous, since it allows to interface with the system on a very low level. The FFI extension should only be used by developers having a working knowledge of C and the used C APIs. To minimize the risk, the FFI API usage may be restricted with the ffi.enable php.ini directive.

              My approach is clearly not portable, either. A given machine might be running Windows or MacOS or one of the myriad flavors of linux and, if I'm not mistaken, my workstation only happens to have libopenblasp-r0.3.8.so installed because I had previously installed octave on it using the Ubuntu package manager, apt. I hope there might be some way to provide the blazing speed of a proper BLAS library to PHP in a way that doesn't require a user to have root/su access to install system packages. Perhaps there's some way to distribute a shared library with my PHP source?
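
              If I get braver, perhaps I can skip the wrapper library entirely and bind just the one BLAS routine I need via FFI::cdef. A sketch of what I have in mind -- the integer values stand in for the CBLAS enums (CblasRowMajor=101, CblasNoTrans=111, CblasTrans=112 in cblas.h), the library path is the one specific to my machine, and none of this is tested:

              // bind cblas_dgemm directly; declaring the enum parameters as
              // plain int is an assumption (enums are int-sized in the C ABI)
              $blas = FFI::cdef('
                  void cblas_dgemm(int Order, int TransA, int TransB,
                                   int M, int N, int K,
                                   double alpha, const double *A, int lda,
                                   const double *B, int ldb,
                                   double beta, double *C, int ldc);
              ', '/usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so');

              $rows = $data['Xtrain'];
              $m = count($rows);    // 1375
              $k = count($rows[0]); // 1966

              // flatten into a contiguous row-major buffer of C doubles
              $A = FFI::new('double[' . ($m * $k) . ']');
              $i = 0;
              foreach ($rows as $row) {
                  foreach ($row as $v) {
                      $A[$i++] = $v;
                  }
              }
              $C = FFI::new('double[' . ($m * $m) . ']');

              // K = X * X': passing A as both operands with TransB=CblasTrans
              // avoids materializing the transpose at all
              $blas->cblas_dgemm(101, 111, 112, $m, $m, $k, 1.0, $A, $k, $A, $k, 0.0, $C, $m);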

I also am concerned that libopenblasp-r0.3.8.so might be totally incompatible with the blas.h file distributed in the ghostjat/np lib. I did a diff between that file and the cblas.h linked from netlib, and the two files have quite a few dissimilarities. For example, the ghostjat blas.h declares some things that cblas.h does not:

              void    openblas_set_num_threads(int num_threads);

              conversely, cblas defines functions ghostjat blas does not, e.g.

void cblas_zher2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
                  const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
                  const void *alpha, const void *A, const int lda,
                  const void *B, const int ldb, const double beta,
                  void *C, const int ldc);

              But they do seem to agree on some functions:

              //cblas.h
              double cblas_dsdot(const int N, const float *X, const int incX, const float *Y,
                const int incY);
              //ghostjat-blas.h
              double cblas_dsdot(const int N, const float *X, const int incX, const float *Y,
                const int incY);

I'd be content to devise some adaptation of ghostjat's lib that works on Linux platforms without the need for root/su permissions to install additional packages via yum/apt. I'd be delighted if I could get it to work on macOS, too. I note that my Mac laptop has some .so files containing the string "blas" which were apparently installed with Python:

              $ sudo find / -name "*blas*.so" 2>/dev/null
              Password:
              /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_dotblas.so
              /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/linalg/_fblas.so
              /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/fblas.so
              /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so
              /System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_dotblas.so
              /System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/linalg/_fblas.so
              /System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/fblas.so
              /System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so

I've tried putting the paths of both of those cblas.so libs in my blas.h file:

// tried this, doesn't work
#define FFI_LIB "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so"
// this also doesn't work
#define FFI_LIB "/System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so"

              but FFI complains about both of them:

              PHP Fatal error:  Uncaught FFI\Exception: Failed loading '/System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so' in /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php:8
              Stack trace:
              #0 /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php(8): FFI::load('/Users/sneakyimp/Des...')
              #1 {main}
                thrown in /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php on line 8

I searched the Mac file system for files containing the string "libblas" and found /usr/lib/libblas.dylib. I tried putting that in blas.h:

              #define FFI_LIB "/usr/lib/libblas.dylib"

And my PHP script doesn't like that either (presumably because the ghostjat blas.h declares openblas_set_num_threads, an OpenBLAS extension that Apple's libblas.dylib doesn't export):

              PHP Fatal error:  Uncaught FFI\Exception: Failed resolving C function 'openblas_set_num_threads' in /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php:8
              Stack trace:
              #0 /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php(8): FFI::load('/Users/sneakyimp/Des...')
              #1 {main}
                thrown in /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php on line 8
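That error at least tells me that FFI resolves the declared functions up front, so I suppose one could test whether a candidate library exports a given function before committing to a full header. A quick helper I sketched (hypothetical, not from any of these libs):

// Hypothetical helper: attempt a one-function cdef to see whether $lib
// actually exports the function declared in $decl. FFI resolves symbols
// when the FFI object is created, so an unresolvable function throws.
function lib_has(string $lib, string $decl): bool
{
    try {
        FFI::cdef($decl, $lib);
        return true;
    } catch (FFI\Exception $e) {
        return false;
    }
}

// presumably false here, given the error above
var_dump(lib_has('/usr/lib/libblas.dylib',
    'void openblas_set_num_threads(int num_threads);'));
// true only if the library exports cblas_ddot
var_dump(lib_has('/usr/lib/libblas.dylib',
    'double cblas_ddot(int N, const double *X, int incX, const double *Y, int incY);'));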

              sneakyimp Thanks, weedpacket. If I'm not mistaken, that is assembler code you posted? Where did this code come from? Are you inspecting source code for some BLAS lib? Thanks also for the blas link.

              Yes; DGEMM is the BLAS routine that does matrix-matrix multiplication. I converted that to C (which probably added a layer of indirection complexity, but I didn't use any optimisation switches), cut out the premultiplier and accumulator (set alpha to 1 and beta to 0), compiled it to assembly, and excerpted the part that runs if neither matrix is to be transposed. The corresponding Fortran is

                            DO 90 J = 1,N
                                IF (BETA.EQ.ZERO) THEN
                                    DO 50 I = 1,M
                                        C(I,J) = ZERO
                 50                 CONTINUE
                                ELSE IF (BETA.NE.ONE) THEN
                                    DO 60 I = 1,M
                                        C(I,J) = BETA*C(I,J)
                 60                 CONTINUE
                                END IF
                                DO 80 L = 1,K
                                    TEMP = ALPHA*B(L,J)
                                    DO 70 I = 1,M
                                        C(I,J) = C(I,J) + TEMP*A(I,L)
                 70                 CONTINUE
                 80             CONTINUE
                 90         CONTINUE

ALPHA and BETA were hardcoded to 1 and 0, respectively, which is why they don't show up in the assembler.

              sneakyimp My jury-rigged approach using the ghostjat/np library evidently does yield the correct matrix product, but I'm really concerned that I might be doing something wrong. I don't really know what could happen, but this warning in the FFI docs seems pretty stark:

Well, that's just a warning that external and OS libraries can do anything on your system. It's like the warning around eval, only more so, since now you've got the ability to execute arbitrary code at the operating system level (for example, on Windows you can call the InitiateSystemShutdown function in advapi32.dll).

              However, like I said, I haven't had that much experience with FFI. You might like to consider downloading cblas.tgz and cblas.h from netlib above and compiling the library yourself. Then you'd have something you could distribute with the rest of the application.

              Weedpacket However, like I said, I haven't had that much experience with FFI. You might like to consider downloading cblas.tgz and cblas.h from netlib above and compiling the library yourself. Then you'd have something you could distribute with the rest of the application.

This sounds like a reasonable next step. I would point out that PHP-ML delegates some operations to libsvm, and distributes distinct executables/libs for Linux, macOS, and Windows. If I distribute my own shared library, that would eliminate the need to search for whatever flavor of BLAS may (or may not) be installed on the target machine, and we could be sure that we had the correct blas.h file.

I am somewhat sheepishly wondering whether a CBLAS library compiled on my Linux workstation would work on other Linux distros. I dimly suspect that the chip architecture matters, and I harbor a feeble hope that if I compile CBLAS on my Intel system, the resulting .so file would also run on Red Hat or CentOS if that system also had a 64-bit Intel architecture, and possibly also on machines with AMD chips. Sadly, I'm not very knowledgeable about these things.

It seems to me that ghostjat/np is not designed to locate whichever BLAS library may (or may not) be installed on a given machine, much less to use the correct blas.h file for that library. The blas.h distributed with ghostjat/np just refers to libblas.so:

              #define FFI_LIB "libblas.so"
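If I do end up bundling my own .so, maybe I could sidestep the hard-coded name altogether by generating the header at runtime. An untested sketch; blas-decls.h (the declarations only, with no FFI_LIB line) and lib/libcblas.so are hypothetical paths within my app:

// Untested sketch: generate a header whose FFI_LIB points at a shared
// library bundled with the application, then FFI::load() it.
$decls   = file_get_contents(__DIR__ . '/blas-decls.h');
$libPath = __DIR__ . '/lib/libcblas.so';
$header  = '#define FFI_LIB "' . $libPath . '"' . "\n" . $decls;

$tmp = tempnam(sys_get_temp_dir(), 'blas');
file_put_contents($tmp, $header);
$blas = FFI::load($tmp); // an FFI object bound to the bundled library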