sneakyimp but I would point out that matrix multiplication is not symmetric -- see those first dot-versus-multiplication results at the top of this post.

This isn't really an issue if you keep in mind that vectors are not matrices; the same formula applies for vectors and matrices.
For vectors it reduces to

T_{t1} · U_{u1} = ∑_k T_k U_k

And U·T will produce the same result

For matrix×matrix it's

T_{t1,t2} · U_{u1,u2} = ∑_k T_{t1,k} U_{k,u2}

which isn't the same thing as

U_{u1,u2} · T_{t1,t2} = ∑_k U_{u1,k} T_{k,t2}

because you're pairing different elements together (in the first, the elements in one row of T with the elements in a column of U, and in the second the elements in a row of U with the elements in a column of T). That's why matrix multiplication isn't symmetric, and also why it doesn't matter for vector dot product; vectors don't have that many dimensions so there is less freedom for how their elements can be paired up.

sneakyimp I like that [1,1,1,1] * [1;3;5;7] and [1;3;5;7]*[1,1,1,1] produce different results. When they don't, I feel like things are getting unpredictable.

Which is what you'd expect from multiplying a 1×4 matrix and a 4×1 matrix vs. a 4×1 matrix and a 1×4 matrix. Not having that ,/; distinction I'd write them as

{{1,1,1,1}}.{{1},{3},{5},{7}} = {{16}}

and

{{1},{3},{5},{7}}.{{1,1,1,1}} = {{1, 1, 1, 1}, {3, 3, 3, 3}, {5, 5, 5, 5}, {7, 7, 7, 7}}

And the brackets are significant because without them they're not matrices but vectors, which means

{1, 1, 1, 1} . {1, 3, 5, 7}={1, 3, 5, 7} . {1, 1, 1, 1}=16

sneakyimp I remain confused about what it means that dot always returns a scalar

That I think comes back to dot deliberately ignoring singleton dimensions; it throws away the distinction between [1,1,1,1] and [1;1;1;1], which means dot([1;1;1;1],[1,1,1,1]) and dot([1,1,1,1],[1;1;1;1]) are indistinguishable; so no wonder they produce the same answer. The matrix multiplication operator doesn't do this.
Octave's notation makes it impossible to distinguish between a length-n vector and a 1×n matrix, so it has to have different functions where one treats the operand as a vector and the other treats it as a matrix. The one that treats it as a vector then has to treat n×1 matrices as vectors too.

Weedpacket (in the first, the elements in one row of T with the elements in a column of U, and in the second the elements in a row of U with the elements in a column of T). That's why matrix multiplication isn't symmetric, and also why it doesn't matter for vector dot product; vectors don't have that many dimensions so there is less freedom for how their elements can be paired up.

Thanks once again, Weedpacket, for patiently explaining these things. I do have a much better grip, I think, on vector/matrix operations. I feel stronger in my conceptual understanding of what is transpiring in these operations. My only lingering question: should we always assume 1xn or nx1 matrices are vectors, or are there circumstances where we'd want to treat them as actual matrices? All the code and explanations I've seen here suggest that 1xn/nx1 matrices are always intended as vectors.

And, lastly, I think it seems clear that these three PHP libs we've examined are sorely inadequate for a pure PHP implementation (no FANN, no exec of other executable or script) of my Support Vector Machine. My spam filter example requires one to multiply a 1375x1966 matrix by its inverse 1966x1375 matrix and it is a hundred times too slow. Unless someone can suggest a way to construct a data structure in pure PHP that can efficiently access matrix indices, PHP is going to be doing some searching/traversal of these large arrays. I suspect (but am not sure) that the PHP slowness here is due to array search/traversal, whereas in C or Octave (or perhaps Rust), one might be able to construct an efficient, contiguous memory segment where accessing $matrix[$i][$j] is simply a matter of calculating a memory offset from some starting memory location and then it's simple RAM access rather than array search/traversal.
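
Here's a rough, hypothetical sketch (pure PHP, nothing benchmarked) of the kind of offset-based indexing I have in mind: one flat array where element ($i, $j) lives at offset $i * $cols + $j, instead of an array of arrays.

// Hypothetical illustration: emulate "base address + offset" element access
// with a single flat array rather than nested arrays.
$rows = 1375;
$cols = 1966;
$flat = array_fill(0, $rows * $cols, 0.0);

// element ($i, $j) lives at offset $i * $cols + $j
$i = 2; $j = 5;
$flat[$i * $cols + $j] = 3.14;
echo $flat[$i * $cols + $j], "\n"; // prints 3.14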

sneakyimp should we always assume 1xn or nx1 matrices are vectors, or are there circumstances where we'd want to treat them as actual matrices? All the code and explanations I've seen here suggest that 1xn/nx1 matrices are always intended as vectors.

Er ... if you mean my code and explanations then that's the opposite of what I was saying.

sneakyimp My spam filter example requires one to multiply a 1375x1966 matrix by its inverse 1966x1375 matrix

You mean "transpose", of course, because multiplying a matrix by its inverse is really easy if you assume it has one 🙂

I do have a couple of suggestions; one is to use an SplFixedArray (especially since the arrays' sizes are all known).
The other, if AA' is the main slow part, is to use an algorithm optimised for that case. Mainly, since the result is symmetric, you only need to compute half of it; the other half would be a mirror image. You could either calculate the half and then copy everything into the other half (if you're using SplFixedArrays this won't make much difference memory-wise, but the time spent on it may save more time later on), or just leave the array half-filled and redirect lookups as appropriate (so, every time you want to look up $symm[$j][$i] you'd check to see if $i>$j and if it is you'd look up $symm[$i][$j] instead).
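
A tiny hypothetical accessor for the half-filled variant (assuming only the entries with $i >= $j were computed and stored):

// Sketch only: the product is symmetric, so just the half with $i >= $j is
// stored; lookups in the other half get mirrored.
function symmGet(SplFixedArray $symm, int $i, int $j)
{
    return ($i >= $j) ? $symm[$i][$j] : $symm[$j][$i];
}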

Weedpacket You mean "transpose", of course, because multiplying a matrix by its inverse is really easy if you assume it has one 🙂

OOF yes I meant transpose. Inverse is something else entirely. Sorry for the mistake.

I will of course attempt a multiply function using SPLFixedArray.

    Not having much luck with SplFixedArray. Here's a script:

    class JMatrix {
    	public $data;
    
    	public function __construct($arr) {
    		if (is_a($arr, 'SplFixedArray')) {
    			// shortcut...provide SplFixedArray param, bypass checks and just assign
    			$this->data = $arr;
    		} elseif (!is_array($arr)) {
    			// if not SplFixedArray, must be standard array
    			throw new Exception('Constructor param must be an array');
    		} else {
    			// if it's an array, check if 2D array (matrix) or just 1D array vector, which we treat as matrix
    			if (!is_array($arr[0])) {
    				// the supplied parameter is a simple array, so we'll construct a row vector
    				// from its elements
    				$row_count = 1;
    				$col_count = sizeof($arr);
    				$this->data = new SplFixedArray(1);
    				$this->data->offsetSet(0, new SplFixedArray($col_count));
    				foreach($arr as $j => $v) {
    					$this->data->offsetGet(0)->offsetSet($j, $v);
    				}
    
    			} else {
    				// an array of arrays...sanity check on element counts
    				$row_count = sizeof($arr);
    				$col_count = sizeof($arr[0]);
    				foreach($arr as $v) {
    					if (sizeof($v) !== $col_count) {
    						throw new Exception('Invalid matrix array. Inconsistent column count in constructor array.');
    					}
    				}
    				$this->data = new SplFixedArray($row_count);
    				foreach($arr as $i => $row) {
    					$this->data->offsetSet($i, new SplFixedArray($col_count));
    					foreach($row as $j => $v) {
    						$this->data->offsetGet($i)->offsetSet($j, $v);
    					}
    				}
    				
    			}
    		}
    	}
    
    	public function rowCount() {
    		return $this->data->getSize();
    	}
    	public function colCount() {
    		return $this->data->offsetGet(0)->getSize();
    	}
    
    	public function size() {
    		return [$this->rowCount(), $this->colCount()];
    	}
    	public function sizeAsString() {
    		return $this->rowCount() . 'x' . $this->colCount();
    	}
    
    	public function __toString() {
    		$retval = "[\n";
    		foreach($this->data as $i => $row) {
    			$retval .= "\t[" . implode(",", (array)$row) . "]\n";
    		}
    		$retval .= "]\n";
    		return $retval;
    	}
    
    	public function multiply($matrix) {
    		$iteration_count = $this->colCount();
    		if ($iteration_count !== $matrix->rowCount()) {
    			throw new Exception('nonconformant arguments (op1 is ' . $this->sizeAsString() . ', op2 is ' . $matrix->sizeAsString() . ')');
    		}
    
    		$result_rows = $this->rowCount();
    		$result_cols = $matrix->colCount();
    
    		// accumulate results in this array
    		$ret_array = new SplFixedArray($result_rows);
    
    		// temp vars to avoid object property lookup
    		$td = $this->data;
    		$md = $matrix->data;
    
    		for($i=0; $i<$result_rows; $i++) {
    			// current row from this matrix
    			$drow = $td->offsetGet($i);
    
    			// accumulate output matrix row here
    			$retrow = new SplFixedArray($result_cols);
    
    			// loop thru $matrix columns
    			for($j=0; $j<$result_cols; $j++) {
    
    				// accumulate product sum here
    				$sum = 0;
    
    				// iteration count is number of columns in this matrix == number of rows in $matrix
    				for($k=0; $k<$iteration_count; $k++) {
    					$sum += ($drow->offsetGet($k) * $matrix->data->offsetGet($k)->offsetGet($j));
    				}
    
    				// all products summed, store it in our output matrix row
    				$retrow->offsetSet($j, $sum);
    			}
    
    			// put the output matrix row into our output matrix
    			$ret_array->offsetSet($i, $retrow);
    		}
    		return new static($ret_array);
    	}
    
    	public function multiply2($matrix) {
    		$iteration_count = $this->colCount();
    		if ($iteration_count !== $matrix->rowCount()) {
    			throw new Exception('nonconformant arguments (op1 is ' . $this->sizeAsString() . ', op2 is ' . $matrix->sizeAsString() . ')');
    		}
    
    		$result_rows = $this->rowCount();
    		$result_cols = $matrix->colCount();
    
    		// accumulate results in this array
    		$ret_array = new SplFixedArray($result_rows);
    
    		// temp vars to avoid object property lookup
    		$td = $this->data;
    		$md = $matrix->data;
    
    
    		$td->rewind();
    		$i = 0;
    		while ($td->valid()) {
    			// current row from this matrix
    			$drow = $td->current();
    
    			// accumulate output matrix row here
    			$retrow = new SplFixedArray($result_cols);
    
    			// loop thru $matrix columns
    			for($j=0; $j<$result_cols; $j++) {
    
    				// accumulate product sum here
    				$sum = 0;
    
    				// iteration count is number of columns in this matrix == number of rows in $matrix
    				$md->rewind();
    				$k = 0;
    				while($md->valid()) {
    					$mrow = $md->current();
    					$sum += ($drow->offsetGet($k) * $mrow->offsetGet($j));
    
    					$md->next();
    					$k++;
    				}
    
    				// all products summed, store it in our output matrix row
    				$retrow->offsetSet($j, $sum);
    			}
    
    			// put the output matrix row into our output matrix
    			$ret_array->offsetSet($i, $retrow);
    
    			$td->next();
    			$i++;
    		}
    		return new static($ret_array);
    	}
    
    	public function transpose() {
    		$arr = [];
    		foreach($this->data as $row) {
    			$arr[] = (array)$row;
    		}
    		$arr = array_map(null, ...$arr);
    		return new static($arr);
    	}
    }
    
    
    echo "SANITY CHECK\n";
    $a = new JMatrix([
    	[1, 4, -2],
    	[3, 5, -6]
    ]);
    echo $a;
    echo "size: " .  $a->sizeAsString() . "\n";
    
    $b = new JMatrix([
    	[5, 2, 8, -1],
    	[3, 6, 4, 5],
    	[-2, 9, 7, -3]
    ]);
    echo $b;
    echo "size: " .  $b->sizeAsString() . "\n";
    
    $v = $a->multiply2($b);
    echo $v . "\n";
    echo "size: " .  $v->sizeAsString() . "\n";
    
    $vt = $v->transpose();
    echo "$vt\n";
    echo "size: " .  $vt->sizeAsString() . "\n";
    
    
    $data_file = './training_data_sets.json';
    $data = json_decode(file_get_contents($data_file), TRUE);
    
    echo 'Xtrain rows: ' . sizeof($data['Xtrain']) . "\n";
    echo 'Xtrain cols: ' . sizeof($data['Xtrain'][0]) . "\n";
    
    $X = new JMatrix($data['Xtrain']);
    
    $start = microtime(TRUE);
    $XT = $X->transpose();
    echo "transpose in " . (microtime(TRUE) - $start) . " seconds\n";
    
    $start = microtime(TRUE);
    $K = $X->multiply2($XT);
    echo "multiply2 in " . (microtime(TRUE) - $start) . " seconds\n";

    multiply, my first attempt, runs in 973 seconds:

    Xtrain rows: 1375
    Xtrain cols: 1966
    transpose in 0.79622912406921 seconds
    multiply in 973.16395616531 seconds

    multiply2, where I attempt to avoid indexed lookups, actually takes longer -- about 20 minutes 😢.

    Xtrain rows: 1375
    Xtrain cols: 1966
    transpose in 0.79264497756958 seconds
    multiply2 in 1199.6731460094 seconds

      Oh, well. Worth a try. (Incidentally, the instanceof operator could be used instead of the is_a function. Since SplFixedArray implements Countable you can use count() instead of calling ->getSize() manually; implementing Iterator means the ->rewind(), ->valid(), ->current(), and ->next() methods will be called by foreach as and when needed; and implementing the ArrayAccess interface means you can use array index notation instead of ->offsetGet() and ->offsetSet(), e.g. $this->data->offsetGet(0)->offsetSet($j, $v); becomes $this->data[0][$j] = $v;. And I'm guessing you're using PHP 7: SplFixedArray dropped the rewind() method in v8 when it was switched from being an Iterator to an IteratorAggregate. Woo. Big sidebar. But I did get some improvement by making those changes. Possibly because of the dedicated opcodes for those things.)
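
      For instance, the constructor's array branch could tighten up to something like this (just a sketch of the same logic, no behaviour change):

      // same behaviour as the original loop, leaning on Countable/ArrayAccess
      $this->data = new SplFixedArray($row_count);
      foreach ($arr as $i => $row) {
          $this->data[$i] = new SplFixedArray($col_count);
          foreach ($row as $j => $v) {
              $this->data[$i][$j] = $v; // instead of ->offsetGet($i)->offsetSet($j, $v)
          }
      }
      // and, e.g., colCount() becomes: return count($this->data[0]);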

      But I think PHP will (even with JIT enabled) lag behind domain-specific languages that will have had dedicated native code for those tasks for which the language is intended ("a matrix multiplication? I have machine code for that compiled from C++ right here"); PHP's execution would still be passing through the various reference-counted zval structures and hashtables and whatnot before getting right to the gist of the task. (See, e.g. Internal value representation in PHP 7 - Part 2.)

      Weedpacket Oh, well. Worth a try. (Incidentally, the instanceof operator could be used instead of the is_a function...I did get some improvement by making those changes. Possibly because of the dedicated opcodes for those things.)

      Thanks for these helpful clarifications. It had occurred to me that the transpose operation is quite fast, and part of the slowness seems to be due to the array indexing required. Note how the fastest PHP implementation, PHP-ML, makes use of the array_column function and explicitly tries to avoid array indexing as much as possible. It therefore occurred to me that I might quickly transpose the matrix supplied to my multiply function and benefit from slightly more efficient array operations. However, I don't think these little changes will bring us anywhere near the supremely fast performance I see in Octave. I exported my 1375x1966 array to an Octave-friendly text format, loaded it up in Octave, and multiplied it by its transpose with this little Octave/MATLAB script:

      Xtrain = dlmread("Xtrain.txt");
      sz = size(Xtrain)
      t0 = clock();
      K = Xtrain * Xtrain';
      elapsed_time = etime(clock(), t0);
      fprintf("transpose multiplication completed in %d seconds\n", elapsed_time);

      Octave completes the matrix multiplication in a stunning 67ms:

      sz =
      
         1375   1966
      
      transpose multiplication completed in 0.0668411 seconds

      This is 4376.7 times faster than the fastest PHP code I've found, which is accomplished by the matrix classes in the PHP-ML library.

      NOTE that the PHP-ML library apparently punts to some kind of compiled executable to perform its SVM training operations. It constructs a command and invokes the external executable with exec.

      I consulted with a friend who's a talented python jockey who offered some interesting initial observations on matrix multiplication algorithms:

      Strassen's was the first notable speedup, 1970 — O(n ^ 2.8) iirc — and people have knocked the exponent down further and further in the ensuing decades. In fact just within the last two months I read about this recent improvement: as per https://en.wikipedia.org/wiki/Matrix_multiplication#Computational_complexity

      As of December 2020, the best matrix multiplication algorithm is by Josh Alman and
      Virginia Vassilevska Williams and has complexity O(n ^ 2.3728596).

      That statement just quoted has a footnote to this paper: https://arxiv.org/abs/2010.05846, "A Refined Laser Method and Faster Matrix Multiplication".
      It's a tiny speedup over the previous best-known algo: best was formerly 

      O(n^{2.37287 + ε}) for all ε;

      the paper improves that bound to 

      O(n^{2.37286 + ε}) for all ε;

      My friend rightly observes:

      It would not be easy to whip up working code from the paper, written for actual and aspiring specialists.

      He pointed out an interesting exercise in matrix operation optimization and noted that numpy is very fast. Like the paper, these articles get into a lot of the nitty gritty (i.e., the weeds) of bottlenecks and various optimizations.

      I guess Octave and Matlab use BLAS (Basic Linear Algebra Subprograms), originally a Fortran library now ported to C or something, under the hood.

      My kind friend imported my 1375x1966 matrix for some python profiling. Some interesting results on his 2015 Macbook Pro:

      Numpy takes 2.06s to matmult Xtrain by its transpose

      He took the trouble to write a "pure python" function which he ran in a "notebook":

      PS I'm trying to time the naive algorithm in pure Python, operating either on the same numpy "ndarray"s, or on native Python data structures (list of lists of numbers)
      I've tried both, a few times; I give up waiting, I haven't seen either complete, it takes minutes (at least!) and I get bored.

      Safe conclusion: Performance of pure Python naive algorithm in a notebook is worse than PHP. Way worse: in a practical sense it doesn't even terminate.
      Performance at command line: (minutes later, still waiting… I'll check back in an hour, if it terminates ) 

      Maybe it's faster at the command line than in a notebook? I bet not enough to matter, but let's try. 
      With native Python datatypes: it takes at least tens of seconds... a minute… 5+ minutes… 
      Just checked back: it took 20m 29.15s
      (So, yes: faster than when running in a notebook, but still unusable.)

      He went on to make some minor improvements, but the 'pure python' approach, which doesn't make use of NumPy, is clearly slow like the pure PHP approach. Oddly enough, I find this something of a relief. I'd hate to think Python is inherently better than my dear old PHP.

      The lesson here is that it's clearly advantageous--necessary!--to delegate matrix operations to something outside PHP (or Python) which takes advantage of decades of algorithmic refinement by the world's finest nerds. I truly doubt that your typical Linux server will have Octave installed, but it's easy enough to install with apt if you have su access via CLI. I think your typical server is much more likely to have Python installed. Or perhaps I could distribute my own executable to handle matrix operations? Any suggestions about farming out these matrix operations from PHP would be much appreciated. Ideally we'd find some tool for matrix operations that already exists on your typical server so we can just exec it from PHP with reasonable hope of success.
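
      For example, if the octave CLI happened to be installed, the whole product could be farmed out with something like this (hypothetical sketch; the file names are placeholders):

      // assumes the octave binary is on the PATH and Xtrain.txt already exists
      $script = 'X = dlmread("Xtrain.txt"); dlmwrite("K.txt", X * X\');';
      exec('octave --quiet --eval ' . escapeshellarg($script), $output, $status);
      // $status === 0 on success; K.txt then holds the 1375x1375 product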

        The last time I had to do anything numerically-matricy (it was actually a dynamic programming problem) that was just way too slow and/or large to run in PHP, I whipped up a program in C to do it, compiled it (TinyCC was good enough) as an executable, and communicated with it via proc_open. That was years and major versions ago. If PHP had had FFI back then I would have considered compiling it as a library and calling the API directly (warning: requires a third-party library, which I suspect is the same one NumPy ends up using) instead of that mucking around with stdio. (The PHP manual has a note that accessing FFI structures is twice as slow as accessing PHP ones, so "it makes no sense to use the FFI extension for speed". But accessing the structure isn't the bottleneck here.)
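
        A stripped-down sketch of that proc_open plumbing, in case it's useful (hypothetical ./matmul helper; JSON over stdio is just one possible wire format):

        // launch the helper and talk to it over stdin/stdout
        $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];
        $proc = proc_open('./matmul', $spec, $pipes);
        if (!is_resource($proc)) {
            throw new RuntimeException('could not start helper');
        }
        fwrite($pipes[0], json_encode(['a' => $a, 'b' => $b])); // send the operands
        fclose($pipes[0]);
        $product = json_decode(stream_get_contents($pipes[1]), true); // read the result
        fclose($pipes[1]);
        fclose($pipes[2]);
        proc_close($proc);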

        Thanks, Weedpacket, for the detail. Your post is over my head, and I'll have to do some remedial study on the particulars of compiling/linking/assembling to understand the dynamics here. Offhand, it seems tricky to find some speedy lib external to PHP which can perform fast matrix operations. I expect there are two paths:
        1) try using exec to run some CLI commands in the hope that python or octave or some other package has been installed on the machine. Might also have to download a python lib or two like NumPy.
        2) try invoking some executable file that I cook up myself via exec or FFI or some other process-to-process method. This seems quite involved, and it sounds like a real chore to distribute executables that'll work on linux, MacOS, windows, or whatever.

        Obviously, my goal of a 'pure PHP' library is unattainable. I still hope that I might work up a solution that doesn't require su access to install (e.g., to configure PHP with FFI module or FANN, or to install octave or python).

          8 days later

          I googled around for 'php language binding blas' and ran across this NP lib. It requires PHP 8 with FFI, as well as libblas and liblapack (not sure what the latter two are). It's evidently a work in progress, and the composer command complained about stable versions on my MacOS laptop until I specified a dev branch:

          composer require ghostjat/np:dev-main

          This gets the composer install to work but then the PHP sample script they provide:

          require __DIR__ . '/vendor/autoload.php';
          use Np\matrix;
          
          $ta = matrix::randn(1000, 1000);    
          $tb = matrix::randn(1000, 1000); // to generate random 2d matrix
          $ta->dot($tb);                  // do a dot operation on given matrix
          $ta->getMemory();              // get memory use
          $ta->time();                  // get time

          fails, complaining that FFI::scope had a problem:

          PHP Fatal error: Uncaught FFI\Exception: Failed loading scope 'blas' in /Users/sneakyimp/np/vendor/ghostjat/np/src/core/blas.php:28
          Stack trace:
          #0 /Users/sneakyimp/np/vendor/ghostjat/np/src/core/blas.php(28): FFI::scope('blas')
          #1 /Users/sneakyimp/np/vendor/ghostjat/np/src/core/blas.php(73): Np\core\blas::init()
          #2 /Users/sneakyimp/np/vendor/ghostjat/np/src/linAlgb/linAlg.php(45): Np\core\blas::gemm(Object(Np\matrix), Object(Np\matrix), Object(Np\matrix))
          #3 /Users/sneakyimp/np/vendor/ghostjat/np/src/linAlgb/linAlg.php(30): Np\matrix->dotMatrix(Object(Np\matrix))
          #4 /Users/sneakyimp/np/foo.php(8): Np\matrix->dot(Object(Np\matrix))
          #5 {main}
          thrown in /Users/sneakyimp/np/vendor/ghostjat/np/src/core/blas.php on line 28
          

          The FFI docs are sorely incomplete, so I've no idea what this command is meant to do or what conditions must be met for it to succeed. There is mention of FFI_SCOPE there in the docs, and there's a blas.h file that contains a #define that looks related to this, but I don't know how to fix this problem.

            I've made a tiny bit of progress with ghostjat/np by reading the source code (which is not well-commented) and by reading the PHP FFI documentation (which is not great). I don't have much faith that it'll be well-behaved and robust, but I have managed to get ghostjat/np to multiply my large matrix by its transpose and performance seems very fast. To get it working on my Ubuntu 20 workstation:

            ./composer.phar require ghostjat/np:dev-main

            Edit the file vendor/ghostjat/np/src/core/blas.h and change the value of FFI_LIB to refer to your BLAS shared library file. On my machine, this works:

            #define FFI_LIB "/usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so"

            NOTE I don't really know much about what this accomplishes. The FFI docs say:

            The header file may contain a #define statement for the FFI_LIB variable to specify the library it exposes. If it is a system library only the file name is required, e.g.: #define FFI_LIB "libc.so.6". If it is a custom library, a relative path is required, e.g.: #define FFI_LIB "./mylib.so".

            I'd also point out that I just got lucky by specifying that file. Just trial and error. A sudo apt search blas command returns a giant list of possible packages that one might or might not have installed on one's machine. My workstation has Octave installed, and I was able to locate one shared library, libopenblasp-r0.3.8.so, with this command:

            # barfs up lots of 'not permitted' warnings so route stderr to /dev/null
            sudo find / -name "*blas*.so" 2>/dev/null

            I'd like to take a moment to point out that this path is highly specific to my particular machine. THIS IS NOT REALLY PORTABLE AT ALL.

            Finally, I created this file bar.php in my working directory, which loads the JSON file containing my 1375x1966 matrix, among other things:

            require __DIR__ . '/vendor/autoload.php';
            use Np\matrix;
            
            $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
            
            // more trial and error...cooked this up after reading the ghostjat/np source code
            Np\core\blas::$ffi_blas = FFI::load(__DIR__ . '/vendor/ghostjat/np/src/core/blas.h');
            
            $ta = matrix::ar($data['Xtrain']);
            
            $start = microtime(true);
            $tb = $ta->transpose(); // transpose of $ta
            echo "transpose elapsed time: " . (microtime(true) - $start) . "seconds\n";
            
            $start = microtime(true);
            $tc = $ta->dot($tb);
            echo "dot elapsed time: " . (microtime(true) - $start) . "seconds\n";
            
            $arr = $tc->asArray();
            echo "rows: " . sizeof($arr) . "\n";
            echo "cols: " . sizeof($arr[0]) . "\n";
            
            $json = json_encode($arr, JSON_PRETTY_PRINT);
            file_put_contents(__DIR__ . '/np-product.json', $json);
            echo "json md5 is " . md5($json) . "\n";

            It's fast!

            transpose elapsed time: 0.17577886581421seconds
            dot elapsed time: 0.089972019195557seconds
            rows: 1375
            cols: 1375
            json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e

            The resulting JSON file (with pretty print turned on) has 1893376 lines. I calculate the md5 of the JSON for comparison with the other algorithms. These appear to all produce the exact same JSON:

            MarkBaker:

            require_once 'vendor/autoload.php';
            
            $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
            
            $X = new Matrix\Matrix($data['Xtrain']);
            
            $start = microtime(TRUE);
            $XT = $X->transpose();
            echo "getTranspose in " . (microtime(TRUE) - $start) . " seconds\n";
            
            $start = microtime(TRUE);
            $K = $X->multiply($XT);
            echo "dot product in " . (microtime(TRUE) - $start) . " seconds\n";
            
            $arr = $K->toArray();
            echo "rows: " . sizeof($arr) . "\n";
            echo "cols: " . sizeof($arr[0]) . "\n";
            $json = json_encode($arr, JSON_PRETTY_PRINT);
            echo "json md5 is " . md5($json) . "\n";
            file_put_contents(__DIR__ . '/markbaker-product.json', $json);
            
            echo "COMPLETE\n";

            Multiplication is crazy slow:

            getTranspose in 0.083314180374146 seconds
            dot product in 1227.2787439823 seconds
            rows: 1375
            cols: 1375
            json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e
            COMPLETE

            NumPHP:

            require 'vendor/autoload.php';
            
            use NumPHP\Core\NumArray;
            
            $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
            
            $X = new NumArray($data['Xtrain']);
            
            $start = microtime(TRUE);
            $XT = $X->getTranspose();
            echo "getTranspose in " . (microtime(TRUE) - $start) . " seconds\n"; // 5.9611051082611 seconds
            
            $start = microtime(TRUE);
            $K = $X->dot($XT);
            echo "dot product in " . (microtime(TRUE) - $start) . " seconds\n"; // 466.75154590607 seconds
            
            $arr = $K->getData();
            echo "rows: " . sizeof($arr) . "\n";
            echo "cols: " . sizeof($arr[0]) . "\n";
            $json = json_encode($arr, JSON_PRETTY_PRINT);
            echo "json md5 is " . md5($json) . "\n";
            file_put_contents(__DIR__ . '/numphp-product.json', $json);
            
            echo "COMPLETE\n";

            Transpose in NumPHP is surprisingly slow, and the dot product is unacceptably slow for deployment in a web server application:

            getTranspose in 3.0702431201935 seconds
            dot product in 224.07051301003 seconds
            rows: 1375
            cols: 1375
            json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e
            COMPLETE

            PHP-ML:

            require_once __DIR__ . '/vendor/autoload.php';
            
            use Phpml\Math\Matrix;
            
            $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
            
            $X = new Matrix($data['Xtrain']);
            
            $start = microtime(TRUE);
            $XT = $X->transpose();
            echo "transpose in " . (microtime(TRUE) - $start) . " seconds\n";
            
            $start = microtime(TRUE);
            $K = $X->multiply($XT);
            echo "multiply in " . (microtime(TRUE) - $start) . " seconds\n"; // 292.54629206657 seconds
            
            $arr = $K->toArray();
            echo "rows: " . sizeof($arr) . "\n";
            echo "cols: " . sizeof($arr[0]) . "\n";
            $json = json_encode($arr, JSON_PRETTY_PRINT);
            echo "json md5 is " . md5($json) . "\n";
            file_put_contents(__DIR__ . '/php-ml-product.json', $json);
            
            echo "COMPLETE\n";

            Perhaps the best 'pure PHP' implementation, but still unacceptably slow for web server applications:

            transpose in 0.082174062728882 seconds
            multiply in 178.16531610489 seconds
            rows: 1375
            cols: 1375
            json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e
            COMPLETE

            Home-rolled JMatrix:

            require_once __DIR__ . '/JMatrix.php';
            
            $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
            
            $X = new JMatrix($data['Xtrain']);
            
            $start = microtime(TRUE);
            $XT = $X->transpose();
            echo "transpose in " . (microtime(TRUE) - $start) . " seconds\n";
            
            $start = microtime(TRUE);
            $K = $X->multiply($XT);
            echo "multiply in " . (microtime(TRUE) - $start) . " seconds\n";
            
            $arr = $K->data;
            echo 'rows: ' . sizeof($arr) . "\n";
            echo 'cols: ' . sizeof($arr[0]) . "\n";

            This is also super slow:

            transpose in 0.51202392578125 seconds
            multiply in 739.11785197258 seconds
            rows: 1375
            cols: 1375
            json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e

            The ghostjat/np method via FFI seems promising, but I have a lot of concerns:

            • where to find the right shared library on a given system? my MacOS laptop has no BLAS file ending in .so
            • does the blas.h file included in ghostjat/np match up with my shared lib file? Will different libraries have different .h files with different declarations?
            • which BLAS libraries are compatible with ghostjat/np?
            • portability: what to do on Windows or MacOS?
            • how does FFI work? I don't yet understand why ghostjat/np can multiply so much faster, and there appears to be some new-fangled PHP at work. The matrix class extends nd, but neither of these has a dot function.

            sneakyimp I don't yet understand why ghostjat/np can multiply so much faster,

            blas is written in Fortran (and, for the purposes of Octave, translated to C during the build). It gets compiled to machine code via assembly. So the matrix multiplication is being done on CPU-native types using an optimised sequence of CPU instructions:

            	movl	$1, j.8(%rip)
            	jmp	.L2
            .L9:
            	movq	-56(%rbp), %rax
            	movl	(%rax), %eax
            	movl	%eax, -20(%rbp)
            	movl	$1, i__.7(%rip)
            	jmp	.L3
            .L4:
            	movl	j.8(%rip), %eax
            	imull	-8(%rbp), %eax
            	movl	%eax, %edx
            	movl	i__.7(%rip), %eax
            	addl	%edx, %eax
            	cltq
            	leaq	0(,%rax,8), %rdx
            	movq	56(%rbp), %rax
            	addq	%rdx, %rax
            	pxor	%xmm0, %xmm0
            	movsd	%xmm0, (%rax)
            	movl	i__.7(%rip), %eax
            	addl	$1, %eax
            	movl	%eax, i__.7(%rip)
            .L3:
            	movl	i__.7(%rip), %eax
            	cmpl	%eax, -20(%rbp)
            	jge	.L4
            	movq	-72(%rbp), %rax
            	movl	(%rax), %eax
            	movl	%eax, -20(%rbp)
            	movl	$1, l.6(%rip)
            	jmp	.L5
            .L8:
            	movl	j.8(%rip), %eax
            	imull	-16(%rbp), %eax
            	movl	%eax, %edx
            	movl	l.6(%rip), %eax
            	addl	%edx, %eax
            	cltq
            	leaq	0(,%rax,8), %rdx
            	movq	32(%rbp), %rax
            	addq	%rdx, %rax
            	movsd	(%rax), %xmm0
            	movsd	%xmm0, temp.5(%rip)
            	movq	-56(%rbp), %rax
            	movl	(%rax), %eax
            	movl	%eax, -12(%rbp)
            	movl	$1, i__.7(%rip)
            	jmp	.L6
            .L7:
            	movl	j.8(%rip), %eax
            	imull	-8(%rbp), %eax
            	movl	%eax, %edx
            	movl	i__.7(%rip), %eax
            	addl	%edx, %eax
            	cltq
            	leaq	0(,%rax,8), %rdx
            	movq	56(%rbp), %rax
            	addq	%rdx, %rax
            	movsd	(%rax), %xmm1
            	movl	l.6(%rip), %eax
            	imull	-4(%rbp), %eax
            	movl	%eax, %edx
            	movl	i__.7(%rip), %eax
            	addl	%edx, %eax
            	cltq
            	leaq	0(,%rax,8), %rdx
            	movq	16(%rbp), %rax
            	addq	%rdx, %rax
            	movsd	(%rax), %xmm2
            	movsd	temp.5(%rip), %xmm0
            	mulsd	%xmm2, %xmm0
            	movl	j.8(%rip), %eax
            	imull	-8(%rbp), %eax
            	movl	%eax, %edx
            	movl	i__.7(%rip), %eax
            	addl	%edx, %eax
            	cltq
            	leaq	0(,%rax,8), %rdx
            	movq	56(%rbp), %rax
            	addq	%rdx, %rax
            	addsd	%xmm1, %xmm0
            	movsd	%xmm0, (%rax)
            	movl	i__.7(%rip), %eax
            	addl	$1, %eax
            	movl	%eax, i__.7(%rip)
            .L6:
            	movl	i__.7(%rip), %eax
            	cmpl	%eax, -12(%rbp)
            	jge	.L7
            	movl	l.6(%rip), %eax
            	addl	$1, %eax
            	movl	%eax, l.6(%rip)
            .L5:
            	movl	l.6(%rip), %eax
            	cmpl	%eax, -20(%rbp)
            	jge	.L8
            	movl	j.8(%rip), %eax
            	addl	$1, %eax
            	movl	%eax, j.8(%rip)
            .L2:
            	movl	j.8(%rip), %eax
            	cmpl	%eax, -24(%rbp)
            	jge	.L9

            The PHP-only implementations are not. Even if the Zend VM's opcodes are jitted, it would still be doing all the reference-counted copy-on-write zobject manipulation stuff that is PHP's managed environment.

            Thanks, weedpacket. If I'm not mistaken, that is assembler code you posted? Where did this code come from? Are you inspecting source code for some BLAS lib? Thanks also for the blas link.

            It seems obvious to me that pure PHP approaches to matrix multiplication are futile. The fastest PHP implementation is about 2000 times slower on this particular $X->dot($XT) operation. The FFI approach, on the other hand, runs in 80 milliseconds, which is plenty fast for web server applications.

            My jury-rigged approach using the ghostjat/np library evidently does yield the correct matrix product, but I'm really concerned that I might be doing something wrong. I don't really know what could happen, but this warning in the FFI docs seems pretty stark:

            Caution
            FFI is dangerous, since it allows to interface with the system on a very low level. The FFI extension should only be used by developers having a working knowledge of C and the used C APIs. To minimize the risk, the FFI API usage may be restricted with the ffi.enable php.ini directive.

            My approach is clearly not portable, either. A given machine might be running Windows or MacOS or one of the myriad flavors of linux and, if I'm not mistaken, my workstation only happens to have libopenblasp-r0.3.8.so installed because I had previously installed octave on it using the Ubuntu package manager, apt. I hope there might be some way to provide the blazing speed of a proper BLAS library to PHP in a way that doesn't require a user to have root/su access to install system packages. Perhaps there's some way to distribute a shared library with my PHP source?

            I also am concerned that libopenblasp-r0.3.8.so might be totally incompatible with the blas.h file distributed in the ghostjat/np lib. I did a diff between that file and the cblas.h file linked on the netlib site and the two files have quite a few dissimilarities. For example, the ghostjat blas.h defines some things that cblas.h does not, e.g.

            void    openblas_set_num_threads(int num_threads);

            Conversely, cblas.h declares functions that the ghostjat blas.h does not, e.g.

            void cblas_zher2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
              const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
              const void *alpha, const void *A, const int lda,
               const void *B, const int ldb, const double beta,
               void *C, const int ldc);

            But they do seem to agree on some functions:

            //cblas.h
            double cblas_dsdot(const int N, const float *X, const int incX, const float *Y,
              const int incY);
            //ghostjat-blas.h
            double cblas_dsdot(const int N, const float *X, const int incX, const float *Y,
              const int incY);

            I'd be content to devise some adaptation of ghostjat's lib that works on Linux platforms without the need for root/su permissions to install additional packages via yum/apt. I'd be delighted if I could get it to work on MacOS, too. I note that my Mac laptop has some .so files containing the string "blas" which were apparently installed with Python:

            $ sudo find / -name "*blas*.so" 2>/dev/null
            Password:
            /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_dotblas.so
            /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/linalg/_fblas.so
            /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/fblas.so
            /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so
            /System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_dotblas.so
            /System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/linalg/_fblas.so
            /System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/fblas.so
            /System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so

            I've tried putting the path of both of those cblas.so libs in my blas.h file:

            // tried this, doesn't work
            #define FFI_LIB "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so"
            // this also doesn't work
            #define FFI_LIB "/System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so"

            but FFI complains about both of them:

            PHP Fatal error:  Uncaught FFI\Exception: Failed loading '/System/Volumes/Data/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/lib/blas/cblas.so' in /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php:8
            Stack trace:
            #0 /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php(8): FFI::load('/Users/sneakyimp/Des...')
            #1 {main}
              thrown in /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php on line 8

            I searched the mac file system for files containing the string "libblas" and found /usr/lib/libblas.dylib. I tried putting that in blas.h:

            #define FFI_LIB "/usr/lib/libblas.dylib"

            And my php script doesn't like that either:

            PHP Fatal error:  Uncaught FFI\Exception: Failed resolving C function 'openblas_set_num_threads' in /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php:8
            Stack trace:
            #0 /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php(8): FFI::load('/Users/sneakyimp/Des...')
            #1 {main}
              thrown in /Users/sneakyimp/Desktop/biz/machine-learning/np/foo.php on line 8

            sneakyimp Thanks, weedpacket. If I'm not mistaken, that is assembler code you posted? Where did this code come from? Are you inspecting source code for some BLAS lib? Thanks also for the blas link.

            Yes; DGEMM is the BLAS routine that does matrix-matrix multiplication. I converted that to C (which probably added a layer of indirection complexity, but I didn't use any optimisation switches), cut out the premultiplier and accumulator (set alpha to 1 and beta to 0), compiled it to assembly, and excerpted the part that runs if neither matrix is to be transposed. The corresponding Fortran is

                          DO 90 J = 1,N
                              IF (BETA.EQ.ZERO) THEN
                                  DO 50 I = 1,M
                                      C(I,J) = ZERO
               50                 CONTINUE
                              ELSE IF (BETA.NE.ONE) THEN
                                  DO 60 I = 1,M
                                      C(I,J) = BETA*C(I,J)
               60                 CONTINUE
                              END IF
                              DO 80 L = 1,K
                                  TEMP = ALPHA*B(L,J)
                                  DO 70 I = 1,M
                                      C(I,J) = C(I,J) + TEMP*A(I,L)
               70                 CONTINUE
               80             CONTINUE
               90         CONTINUE

            With ALPHA and BETA being hardcoded to 1 and 0, respectively, which is why they don't show up in the assembler.

            sneakyimp My jury-rigged approach using the ghostjat/np library evidently does yield the correct matrix product, but I'm really concerned that I might be doing something wrong. I don't really know what could happen, but this warning in the FFI docs seems pretty stark:

            Well, that's just a warning that external and OS libraries can do anything on your system. It's like the warning around eval only more so, since now you've got the ability to execute arbitrary code at the operating system level (like, on Windows you can call the InitiateSystemShutdown function in advapi32.dll).

            However, like I said, I haven't had that much experience with FFI. You might like to consider downloading cblas.tgz and cblas.h from netlib above and compiling the library yourself. Then you'd have something you could distribute with the rest of the application.

            Weedpacket However, like I said, I haven't had that much experience with FFI. You might like to consider downloading cblas.tgz and cblas.h from netlib above and compiling the library yourself. Then you'd have something you could distribute with the rest of the application.

            This sounds like a reasonable next step. I would point out that PHP-ML delegates some operations to libsvm, and distributes distinct executables/libs for linux, macos, and windows. If I distribute my own shared library, that would eliminate the need to search for whatever flavor of BLAS may (or may not) be installed on the target machine, and we could be sure that we had the correct blas.h file.

            I am somewhat sheepishly wondering whether a CBLAS library compiled on my Linux workstation would work on other Linux distros. I dimly suspect that the chip architecture matters, and harbor a feeble hope that if I compile CBLAS on my Intel system the resulting .so file would also run on Red Hat or CentOS if that system also had a 64-bit Intel architecture -- and possibly also might run on machines with AMD chips? Sadly, I'm not very knowledgeable about these things.

            It seems to me that ghostjat/np is not well designed to address the need to specifically locate whichever BLAS library may (or may not) be installed on one's machine and then parse the correct blas.h file for that library. The blas.h distributed with ghostjat/np just refers to libblas.so:

            #define FFI_LIB "libblas.so"

              I'm not having much luck compiling CBLAS on either MacOS or my Ubuntu workstation. Part of the problem is that the compiler output is Greek to me. If I'm not mistaken, the process appears to build an archive file, lib/cblas_LINUX.a, but the make operation fails. I posted on Stack Overflow seeking help and I'm getting confusing/confused posts about either installing libopenblas instead or installing libblas. I thought compiling CBLAS would be creating a BLAS library?

              I've been sniffing around the OpenBLAS library releases and the various zip/tgz archives I've downloaded don't appear to contain any .so files.

              Lots of trial and error so far. I've been trying some of the .h files and .a files here, but FFI doesn't seem to like any of them except its original blas.h file, modified to refer to the /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so shared lib which happens to be installed on my machine. It also seems like FFI has its own ideas about what a valid blas.h file is. It complains about semicolons and other #define statements in the cblas.h file from that netlib site, for instance.

              For my own reference: further googling turned up some suggested steps, which may be helpful.
              EDIT: those instructions didn't work, either. I'm starting to wonder if perhaps the complaint is related to the Fortran code? Calling make from the CBLAS folder fails and this is the final output:

              gcc -I../include -O3 -DADD_ -c c_sblas1.c
              gfortran -O3   -c c_sblat1.f
              c_sblat1.f:214:48:
              
                214 |                CALL STEST1(SNRM2TEST(N,SX,INCX),STEMP,STEMP,SFAC)
                    |                                                1
              Error: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
              c_sblat1.f:218:48:
              
                218 |                CALL STEST1(SASUMTEST(N,SX,INCX),STEMP,STEMP,SFAC)
                    |                                                1
              Error: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
              make[1]: *** [c_sblat1.o] Error 1
              make: *** [alltst] Error 2

              I wonder if perhaps the BLAS Fortran code is old somehow?

                Soooo I have been able to download the latest OpenBLAS, build it on both Ubuntu 20 and MacOS (Catalina, I think? With an Intel chip), and get it to work with ghostjat/np on both platforms.

                The steps, should anyone be curious. These are the Ubuntu steps but should also work with little modification on MacOS -- e.g., download the file with your browser instead of using wget

                #make a dir for openblas:
                mkdir openblas
                cd openblas
                #download latest OpenBlas, extract it, cd into the new dir and make the project.
                # make operation takes a while, and your machine will need a compiler and probably gfortran, etc.:
                wget https://github.com/xianyi/OpenBLAS/releases/download/v0.3.21/OpenBLAS-0.3.21.tar.gz
                tar -xvzf OpenBLAS-0.3.21.tar.gz
                cd OpenBLAS-0.3.21
                make

                Take note of the library's location, which will probably have a name specific to your machine's CPU architecture. In my case, this is it: libopenblas_sandybridgep-r0.3.21.so

                Take note of the full path to that lib:
                /home/sneakyimp/biz/machine-learning/openblas/OpenBLAS-0.3.21/libopenblas_sandybridgep-r0.3.21.so

                We can now put that shared lib path in the blas.h file included with the ghostjat/np PHP code. I repeat here the steps to set up a brand new ghostjat/np project:

                cd ..
                mkdir np
                cd np
                # you may need to get composer, see: https://getcomposer.org/download/
                ./composer.phar require ghostjat/np:dev-main
                #edit the blas.h file that comes with it:
                nano vendor/ghostjat/np/src/core/blas.h

                In the blas.h file, change the FFI_LIB to point to the full path of the OpenBLAS lib you just built:
                #define FFI_LIB "/home/sneakyimp/biz/machine-learning/openblas/OpenBLAS-0.3.21/libopenblas_sandybridgep-r0.3.21.so"

                Now you should be able to run this PHP file, bar.php which performs a matrix multiplication:

                <?php
                require __DIR__ . '/vendor/autoload.php';
                use Np\matrix;
                
                // NOTE THIS IS MY DATA YOU'LL NEED TO COOK YOUR OWN
                // OR REFER BACK TO OUR PRIOR CORRESPONDENCE
                $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
                
                // more trial and error...cooked this up after reading the ghostjat/np source code
                Np\core\blas::$ffi_blas = FFI::load(__DIR__ . '/vendor/ghostjat/np/src/core/blas.h');
                
                $ta = matrix::ar($data['Xtrain']);
                
                $start = microtime(true);
                $tb = $ta->transpose(); // transpose of $ta
                echo "transpose elapsed time: " . (microtime(true) - $start) . "seconds\n";
                
                $start = microtime(true);
                $tc = $ta->dot($tb);
                echo "dot elapsed time: " . (microtime(true) - $start) . "seconds\n";
                
                $arr = $tc->asArray();
                echo "rows: " . sizeof($arr) . "\n";
                echo "cols: " . sizeof($arr[0]) . "\n";
                
                $json = json_encode($arr, JSON_PRETTY_PRINT);
                file_put_contents(__DIR__ . '/np-product.json', $json);
                echo "json md5 is " . md5($json) . "\n";

                While I am delighted that bar.php runs and produces the correct output, I would point out that I'm just using the blas.h file that is distributed with ghostjat/np. I'm still quite concerned that blas.h might be totally mismatched against the latest OpenBLAS. I tried using the cblas.h that is distributed with OpenBLAS, but FFI::load complains about it. Here's a typical error:

                PHP Fatal error:  Uncaught FFI\ParserException: ';' expected, got '<STRING>' at line 8 in /home/jaith/biz/machine-learning/np2/bar.php:9
                Stack trace:
                #0 /home/jaith/biz/machine-learning/np2/bar.php(9): FFI::load()
                #1 {main}

                It would appear that FFI has some stunted/limited ability to parse these .h files, but its limitations are not very clear, nor am I sure how to convert the OpenBLAS header file to something FFI can digest. The documentation offers a little help:

                C preprocessor directives are not supported, i.e. #include, #define and CPP macros do not work, except for special cases listed below.
                The header file should contain a #define statement for the FFI_SCOPE variable, e.g.: #define FFI_SCOPE "MYLIB". Refer to the class introduction for details.
                The header file may contain a #define statement for the FFI_LIB variable to specify the library it exposes. If it is a system library only the file name is required, e.g.: #define FFI_LIB "libc.so.6". If it is a custom library, a relative path is required, e.g.: #define FFI_LIB "./mylib.so".

                There's also a user comment that seems promising:

                Since #include's and #define's other than FFI_LIB and FFI_SCOPE are not supported in the header, you may want to use the C preprocessor to pre-process your headers in order to resolve all #include's and macros.

                I use -D"__attribute__(ARGS)=" to remove function attributes as well, which are not supported by FFI either.

                This is my script:

                echo '#define FFI_SCOPE "YOUR_SCOPE"' > header-ffi.h
                echo '#define FFI_LIB "/path/to/your_lib.so"' >> header-ffi.h
                cpp -P -C -D"__attribute__(ARGS)=" header_original >> header-ffi.h

                Unfortunately, I tried it and, although it produced a 15,000-line header-ffi.h when I ran it on cblas.h, this file also produced FFI errors. I've not been able to use any blas.h/cblas.h file from CBLAS or OpenBLAS and get it working. Any advice on how I might do this would be much appreciated.

                I'm also a bit concerned about using OpenBLAS for a couple of reasons:

                • this is a big project, probably overkill. It takes a while to make and the shared lib libopenblas_sandybridgep-r0.3.21.so is 14MB
                • I worry a bit about security. The contributors include quite a few far flung devs and the Chinese Academy of Sciences

                I have been able to build CBLAS on my ubuntu machine by first separately downloading and building BLAS and editing the CBLAS Makefile to refer to the static library it produces in this line:

                BLLIB = /home/sneakyimp/biz/machine-learning/blas/BLAS-3.11.0/blas_LINUX.a

                This won't build on MacOS, complaining about the previously mentioned Fortran stuff, but on Ubuntu 20 it seems to build. Unfortunately, it does not produce any .so shared libraries. It just produces a static library, lib/cblas_LINUX.a. I've tried that with ghostjat but it fails.

                Is there some simple flag I can provide to make so that CBLAS will build the shared libraries (i.e., the .so files)? Can anyone suggest an approach to creating a blas.h file for ghostjat/np that better matches either OpenBLAS or CBLAS?

                  More trial and error...

                  Looks like ghostjat/np requires OpenBLAS rather than CBLAS. Thanks to this little recipe I was finally able to get CBLAS compiled into a cblas_LINUX.so shared lib by making some changes to the CBLAS makefile. I then edited the blas.h file vendor/ghostjat/np/src/core/blas.h to point to this newly built library:

                  #define FFI_LIB "/home/sneakyimp/biz/machine-learning/blas/CBLAS/lib/cblas_LINUX.so"

                  and tried to run my bar.php file, but ghostjat/np is evidently using OpenBLAS-specific functions:

                  $ php bar.php
                  PHP Fatal error:  Uncaught FFI\Exception: Failed resolving C function 'openblas_set_num_threads' in /home/sneakyimp/biz/machine-learning/np/bar.php:9
                  Stack trace:
                  #0 /home/sneakyimp/biz/machine-learning/np/bar.php(9): FFI::load()
                  #1 {main}
                    thrown in /home/sneakyimp/biz/machine-learning/np/bar.php on line 9

                  This is disappointing as CBLAS seemed much more compact. I.e., cblas_LINUX.so is only 130K, whereas libopenblas_sandybridgep-r0.3.21.so is a hulking 14MB. Kinda makes one wonder what is in that shared library?

                  This now brings me back to my attempts to improve the blas.h file that's currently in ghostjat/np. It is quite dissimilar to the .h files in either OpenBLAS or CBLAS. I've tried manually removing preprocessor directives from the various blas.h and cblas.h files that exist in these libraries, but I can't seem to get FFI to parse the files that result. I tried that approach mentioned in the PHP FFI documentation and it yielded a giant, 15000-line .h file that also barfed.

                  Is it safe to use the blas.h that comes with ghostjat/np if I am compiling OpenBlas or CBLAS from source?

                  sneakyimp Is it safe to use the blas.h that comes with ghostjat/np if I am compiling OpenBlas or CBLAS from source?

                  For the purposes of FFI, I reckon the header file only needs to declare the interface that you're actually using. So you'd be writing it from scratch (starting with the #define FFI_LIB to name the library).
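
                  Something along these lines, for example (untested sketch; the cblas_dgemm signature is the standard CBLAS one, with the enum arguments declared as plain int: 101 = CblasRowMajor, 111 = CblasNoTrans). The same declaration, plus a #define FFI_LIB "..." line, could equally live in a cut-down blas.h handed to FFI::load():

                  // declare only the routine actually being called
                  $ffi = FFI::cdef('
                      void cblas_dgemm(int Order, int TransA, int TransB,
                                       int M, int N, int K,
                                       double alpha, const double *A, int lda,
                                       const double *B, int ldb,
                                       double beta, double *C, int ldc);
                  ', '/path/to/your/libopenblas.so'); // machine-specific path

                  // 2x3 times 3x2, row-major, flat arrays
                  $A = FFI::new('double[6]'); foreach ([1, 2, 3, 4, 5, 6] as $i => $v) { $A[$i] = $v; }
                  $B = FFI::new('double[6]'); foreach ([7, 8, 9, 10, 11, 12] as $i => $v) { $B[$i] = $v; }
                  $C = FFI::new('double[4]');
                  $ffi->cblas_dgemm(101, 111, 111, 2, 2, 3, 1.0, $A, 3, $B, 2, 0.0, $C, 2);
                  echo $C[0], ' ', $C[1], ' / ', $C[2], ' ', $C[3], "\n"; // 58 64 / 139 154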

                  sneakyimp Kinda makes one wonder what is in that shared library?

                  At a brief glance, for a start it looks like the whole of LAPACK is in there as well.
                  https://www.openblas.net/

                    I am not a C/C++ pro, but I cooked up some PHP to parse the ghostjat/np blas.h file and the OpenBLAS cblas.h file and report some stuff. I can't guarantee this grabs all the fn definitions declared, but it seems pretty good:

                    // ghostjat file
                    $gj_h_file = __DIR__ . '/vendor/ghostjat/np/src/core/blas.h';
                    echo "parsing $gj_h_file\n";
                    $code = file_get_contents($gj_h_file);
                    $pattern = '#^([a-z\*_]+)\s+([a-z0-9_]+)\s*\(([^)]+)\)#ismU';
                    $matches = null;
                    $mcount = preg_match_all($pattern, $code, $matches);
                    echo "$mcount functions declared in ghostjat\n";
                    $gj_fns = $matches[2];
                    
                    // there are several blas.h files in OpenBlas, i just parse the first
                    //openblas/OpenBLAS-0.3.21/cblas.h
                    //openblas/OpenBLAS-0.3.21/lapack-netlib/CBLAS/include/cblas.h
                    //openblas/OpenBLAS-0.3.21/relapack/src/blas.h
                    $ob_h_file = realpath(__DIR__ . '/../openblas/OpenBLAS-0.3.21/cblas.h');
                    echo "parsing $ob_h_file\n";
                    $code = file_get_contents($ob_h_file);
                    $pattern = '#^([a-z\*_]+)\s+([a-z0-9_]+)\s*\(([^)]+)\)#ismU';
                    $matches = null;
                    $mcount = preg_match_all($pattern, $code, $matches);
                    echo "$mcount functions declared in openblas\n";
                    $ob_fns = $matches[2];
                    
                    $gj_not_ob = array_diff($gj_fns, $ob_fns);
                    echo sizeof($gj_not_ob) . " fns in gj but not ob\n";
                    //echo implode("\n", $gj_not_ob), "\n";
                    
                    $ob_not_gj = array_diff($ob_fns, $gj_fns);
                    echo sizeof($ob_not_gj) . " fns in ob but not gj\n";
                    //echo implode("\n", $ob_not_gj), "\n";

                    OpenBLAS appears to declare some 200 functions, and FFI doesn't like a lot of the stuff in there. I've not yet managed to produce any version of this file that survives FFI::load.

                    parsing /home/sneakyimp/biz/machine-learning/np/vendor/ghostjat/np/src/core/blas.h
                    44 functions declared in ghostjat
                    parsing /home/sneakyimp/biz/machine-learning/openblas/OpenBLAS-0.3.21/cblas.h
                    201 functions declared in openblas
                    0 fns in gj but not ob
                    157 fns in ob but not gj

                    I haven't bothered to compare return types or parameter declarations, but it looks like every function in the ghostjat/np blas.h file is accounted for in the OpenBLAS cblas.h file. I have a dim inkling this is good, and imagine it means that FFI will locate all the functions declared in blas.h in the libopenblas_sandybridgep-r0.3.21.so shared library and, conversely, we aren't declaring functions in PHP that don't exist in the library. It seems like it'd be bad if our PHP code thought there was some function in the library that wasn't in there and we tried to jump to the wrong memory location? I really have no idea, though.