sneakyimp As it currently is, I have no idea what it's multiplying.

It's just dummy data in the arrays. Basically, matrixmult bench n 6 1 1375 1966 1375 says: multiply a 1375x1966 array by a 1966x1375 array filled with dummy data (1, 2, 3, 4, 5, 6, 7...) one time, using the non-contiguous memory version of Implementation 6. The validate routine confirms that all 18 variants are functioning correctly (i.e. multiplying two matrices works properly).

At any rate, it looks like Implementation 6 is the best overall winner for non-contiguous memory. It's what I was leaning toward anyway because it is simpler than the later Implementations.

sneakyimp I admire your ambition, but the 'hodgepodge' nature might prove an impediment to acceptance by the curmudgeonly community.

Yup. I'm rather aware of how difficult the PHP core developers can be to work with at times. The extension isn't intended to be a standalone extension but rather a series of functions - each one addressing its own unique pain point and able to be isolated/relegated to its own RFC as needed. I figure it is better to make one extension with 15 functions instead of 15 separate extensions that just implement one function each. As such, it's more a proof of concept extension with working code than something intended to be run in production environments. That is why adding a matrix multiplication function isn't a big deal.

As a side note, if you are looking for a Bayesian-style spam filter, Spambayes is written in Python. It should be relatively easy to port the training and classifier portions to PHP.

cubic At any rate, it looks like Implementation 6 is the best overall winner for non-contiguous memory. It's what I was leaning toward anyway because it is simpler than the later Implementations.

I'm curious what the plan is to feed PHP data structures into your matrixmult extension. If I'm not mistaken, these C programs need some kind of memory address (i.e., a pointer) fed to them so they know where to find the data structures to be multiplied.

I would point out that pretty much every matrix multiplication scheme I've examined in this thread tends to have a purpose-built class: you feed it a PHP array of identically-sized PHP arrays (vectors) via a constructor or some static method. I note that ghostjat/np delegates the heavy lifting of its constructor to its setData method, and that the $data property of its matrix class appears to have 'flattened' the 2D PHP array representing a matrix into a single array of values while it separately keeps track of row and column counts. The constructor makes use of FFI::cast somehow. I'm still not sure how our PHP data gets handed off as C-style data objects to FFI for the OpenBLAS operations, but I have a dim inkling that $data is of type FFI\CData, which can evidently be manipulated directly with PHP array operators like square brackets, and that we can pass this $data object directly to our FFI-wrapped C functions.

There's a lot of inheritance and delegation in ghostjat/np, but it looks like my matrix multiplication gets handled by its dotMatrix method, defined in a PHP trait, which delegates the multiplication to a static method, blas::gemm, which is just a PHP wrapper around the FFI-loaded cblas_dgemm function declared in the blas.h file. This elaborate approach works with blinding speed with OpenBLAS, but CBLAS doesn't like the multithreading declarations.
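
In case it's useful to anyone following along, the happy path through ghostjat/np looks roughly like this (a minimal sketch based on my reading of its source; I believe these method names are real, but treat the snippet as illustrative):

<?php
require __DIR__ . '/vendor/autoload.php';
use Np\matrix;

// build matrix objects from plain PHP arrays-of-arrays; internally each
// becomes a flat FFI\CData buffer plus separate row/column counts
$a = matrix::ar([[1.0, 2.0], [3.0, 4.0]]);
$b = matrix::ar([[5.0, 6.0], [7.0, 8.0]]);

// dot() hands the two flat buffers and their dimensions to blas::gemm,
// i.e. to the FFI-loaded cblas_dgemm
$c = $a->dot($b);

print_r($c->asArray()); // [[19, 22], [43, 50]]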

I wonder how CBLAS might compare, performance-wise, to your matrixmult tool? These instructions successfully compile CBLAS into a shared library on my Ubuntu workstation (but not my macOS laptop). This shared lib doesn't match up with the ghostjat/np blas.h file, but we might generate an FFI-friendly cblas.h file using this suggestion from the docs that I mentioned above:

echo '#define FFI_SCOPE "YOUR_SCOPE"' > header-ffi.h
echo '#define FFI_LIB "/path/to/your_lib.so"' >> header-ffi.h
cpp -P -C -D"__attribute__(ARGS)=" header_original >> header-ffi.h

Armed with cblas_LINUX.so compiled directly from BLAS and CBLAS, and a new cblas.h file distilled with this bit of cpp code, we might be able to cook up a test. I'm still trying to get my head around how this stuff all fits together.

cubic As such, it's more a proof of concept extension with working code than something intended to be run in production environments

I appreciate your desire to tinker with the PHP guts and make your life better. At the same time, something like OpenBLAS seems like massive overkill. CBLAS, on the other hand, is quite compact, and may actually improve performance for these matrix operations and others. An improvement from O(n^3) to O(n^2.37286) could be of great benefit*.

*My FANN script is still running as I type this.

cubic As a side note, if you are looking for a Bayesian style spam filter, Spambayes is written in Python. Should be relatively easy to port the training and classifier portions to PHP.

That might be of interest, but I'm already well dug into this attempt to create an efficient SVM/ANN spam filter. It's for learning as much as anything. I hope others might appreciate my sharing of the particulars I've uncovered. Personally, I'm appalled at the lack of fast matrix ops in PHP.

    Interesting development! I compiled CBLAS according to this recipe that I linked previously. I took the resulting cblas_LINUX.so file and copied it to a new folder and downloaded the blas.h file from the ghostjat/np library:

    mkdir ~/cblas
    cp cblas_LINUX.so ~/cblas/
    cd ~/cblas
    curl https://raw.githubusercontent.com/ghostjat/Np/main/src/core/blas.h > blas.h

    I edited this line in blas.h:

    #define FFI_LIB "libblas.so"

    to point to the location of my cblas_LINUX.so:

    #define FFI_LIB "/home/sneakyimp/cblas/cblas_LINUX.so"

    I started with a simple bar.php file, which loads that blas.h file via FFI:

    <?php
    $ffi_blas = FFI::load(__DIR__ . '/blas.h');

    This script complains because blas.h declares functions that do not exist in cblas_LINUX.so like so:

    PHP Fatal error:  Uncaught FFI\Exception: Failed resolving C function 'openblas_get_config' in /home/sneakyimp/biz/cblas/bar.php

    So I methodically ran it over and over again, each time deleting the function declaration in blas.h that caused a complaint, yielding this blas.h file:

    #define FFI_SCOPE "blas"
    #define FFI_LIB "/home/sneakyimp/cblas/cblas_LINUX.so"
    typedef  size_t CBLAS_INDEX_t;
    
    size_t cblas_idamax(const int N, const float *X, const int incX);
    double cblas_dsdot(const int N, const float *X, const int incX, const float *Y,
                       const int incY);
    double cblas_ddot(const int N, const double *X, const int incX,
                      const double *Y, const int incY);
    double cblas_dnrm2(const int N, const double *X, const int incX);
    double cblas_dasum(const int N, const double *X, const int incX);
    
    void cblas_dswap(const int N, double *X, const int incX, 
                     double *Y, const int incY);
    void cblas_dcopy(const int N, const double *X, const int incX, 
                     double *Y, const int incY);
    void cblas_daxpy(const int N, const double alpha, const double *X,
                     const int incX, double *Y, const int incY);
    void cblas_drotg(double *a, double *b, double *c, double *s);
    void cblas_drotmg(double *d1, double *d2, double *b1, const double b2, double *P);
    void cblas_drot(const int N, double *X, const int incX,
                    double *Y, const int incY, const double c, const double  s);
    void cblas_drotm(const int N, double *X, const int incX,
                    double *Y, const int incY, const double *P);
    void cblas_dscal(const int N, const double alpha, double *X, const int incX);
    void cblas_dgemv(const enum CBLAS_ORDER order,
                     const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
                     const double alpha, const double *A, const int lda,
                     const double *X, const int incX, const double beta,
                     double *Y, const int incY);
    void cblas_dgbmv(const enum CBLAS_ORDER order,
                     const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
                     const int KL, const int KU, const double alpha,
                     const double *A, const int lda, const double *X,
                     const int incX, const double beta, double *Y, const int incY);
    void cblas_dtrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                     const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                     const int N, const double *A, const int lda, 
                     double *X, const int incX);
    void cblas_dtbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                     const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                     const int N, const int K, const double *A, const int lda, 
                     double *X, const int incX);
    void cblas_dtpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                     const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                     const int N, const double *Ap, double *X, const int incX);
    void cblas_dtrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                     const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                     const int N, const double *A, const int lda, double *X,
                     const int incX);
    void cblas_dtbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                     const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                     const int N, const int K, const double *A, const int lda,
                     double *X, const int incX);
    void cblas_dtpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                     const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                     const int N, const double *Ap, double *X, const int incX);
    void cblas_dsymv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                     const int N, const double alpha, const double *A,
                     const int lda, const double *X, const int incX,
                     const double beta, double *Y, const int incY);
    void cblas_dsbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                     const int N, const int K, const double alpha, const double *A,
                     const int lda, const double *X, const int incX,
                     const double beta, double *Y, const int incY);
    void cblas_dspmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                     const int N, const double alpha, const double *Ap,
                     const double *X, const int incX,
                     const double beta, double *Y, const int incY);
    void cblas_dger(const enum CBLAS_ORDER order, const int M, const int N,
                    const double alpha, const double *X, const int incX,
                    const double *Y, const int incY, double *A, const int lda);
    void cblas_dsyr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                    const int N, const double alpha, const double *X,
                    const int incX, double *A, const int lda);
    void cblas_dspr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                    const int N, const double alpha, const double *X,
                    const int incX, double *Ap);
    void cblas_dsyr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                    const int N, const double alpha, const double *X,
                    const int incX, const double *Y, const int incY, double *A,
                    const int lda);
    void cblas_dspr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                    const int N, const double alpha, const double *X,
                    const int incX, const double *Y, const int incY, double *A);
    void cblas_dgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
                     const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
                     const int K, const double alpha, const double *A,
                     const int lda, const double *B, const int ldb,
                     const double beta, double *C, const int ldc);
    void cblas_dsymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                     const enum CBLAS_UPLO Uplo, const int M, const int N,
                     const double alpha, const double *A, const int lda,
                     const double *B, const int ldb, const double beta,
                     double *C, const int ldc);
    void cblas_dsyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
                     const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
                     const double alpha, const double *A, const int lda,
                     const double beta, double *C, const int ldc);
    void cblas_dsyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
                      const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
                      const double alpha, const double *A, const int lda,
                      const double *B, const int ldb, const double beta,
                      double *C, const int ldc);
    void cblas_dtrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                     const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA,
                     const enum CBLAS_DIAG Diag, const int M, const int N,
                     const double alpha, const double *A, const int lda,
                     double *B, const int ldb);
    void cblas_dtrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                     const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA,
                     const enum CBLAS_DIAG Diag, const int M, const int N,
                     const double alpha, const double *A, const int lda,
                     double *B, const int ldb);

    I then emulated the code in ghostjat/np to create the necessary FFI\CData objects that I could feed to the cblas_dgemm function to perform matrix multiplication using CBLAS:

    <?php
    // load CBLAS lib
    $ffi_blas = FFI::load(__DIR__ . '/blas.h');
    
    // load my data
    $data = json_decode(file_get_contents(__DIR__ . '/../machine-learning/training_data_sets.json'), TRUE);
    
    // this is the 1375 x 1966 matrix i want to multiply times its transpose
    $m1 = $data['Xtrain'];
    
    $start = microtime(true);
    // transpose it
    $m1t = array_map(null, ...$m1);
    echo "transpose elapsed time: " . (microtime(true) - $start) . "seconds\n";
    
    // wrote this by examining code in ghostjat/np, converts PHP array/array obj to FFI\CData
    function get_cdata($m) {
    	// this check should be more specific, but works for now
    	if (!is_array($m) || !is_array($m[0])) {
    		throw new Exception('param must be array of arrays');
    	}
    
    	$rowcount = sizeof($m);
    	$colcount = sizeof($m[0]);
    	$flat_m = [];
    	$size = $rowcount * $colcount;
    	$cdata = \FFI::cast('double *', \FFI::new("double[$size]"));
    	$i = 0;
    	foreach($m as $row) {
    		foreach($row as $val) {
    			$flat_m[$i] = $val;
    			$cdata[$i] = $val;
    			$i++;
    		}
    	}
    
    	return [
    		'rows' => $rowcount,
    		'cols' => $colcount,
    		'flat_m' => $flat_m,
    		'cdata' => $cdata
    	];
    }
    
    // FFI function needs CData objects as params
    $m1_c = get_cdata($m1);
    $m1t_c = get_cdata($m1t);
    
    // these constants mean something to CBLAS. copied from ghostjat/np
    define('CBLAS_ROW_MAJOR', 101);
    define('CBLAS_NO_TRANS', 111);
    
    // some dimensions of our expected output matrix
    $mr_size = $m1_c['rows'] * $m1t_c['cols'];
    $mr_rows = $m1_c['rows'];
    $mr_cols = $m1t_c['cols'];
    $mr_cdata = \FFI::cast('double *', \FFI::new("double[$mr_size]"));
    
    $start = microtime(true);
    // INVOKE CBLAS VIA FFI. This fn writes its result through the C pointer; cblas_dgemm returns void (so $some_val is null)
    $some_val = $ffi_blas->cblas_dgemm(CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS, $m1_c['rows'], $m1t_c['cols'], $m1_c['cols'], 1.0, $m1_c['cdata'], $m1_c['cols'], $m1t_c['cdata'], $m1t_c['cols'], 0.0, $mr_cdata, $mr_cols);
    echo "cblas_dgemm elapsed time: " . (microtime(true) - $start) . "seconds\n";
    
    // convert resulting cdata to php object
    $idx = 0;
    $mr = [];
    for($i=0; $i<$mr_rows; $i++) {
    	$mr[$i] = [];
    	for($j=0; $j<$mr_cols; $j++) {
    		$mr[$i][$j] = $mr_cdata[$idx];
    		$idx++;
    	}
    }
    
    
    echo "rows: " . sizeof($mr) . "\n";
    echo "cols: " . sizeof($mr[0]) . "\n";
    
    $json = json_encode($mr, JSON_PRETTY_PRINT);
    file_put_contents(__DIR__ . '/cblas-product.json', $json);
    echo "json md5 is " . md5($json) . "\n";

    It runs in roughly a quarter of a second, and appears to yield the same (correct?) output as all the other functions I've tested:

    transpose elapsed time: 0.056370973587036seconds
    cblas_dgemm elapsed time: 0.28544211387634seconds
    rows: 1375
    cols: 1375
    json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e

      I tried just uploading this little folder from my Ubuntu 20.04 workstation running PHP 8.2 to another workstation running Ubuntu 22.04 with PHP 8.1. I adjusted the paths accordingly for this machine but it does not work:

      $ php bar.php
      transpose elapsed time: 0.14942502975464seconds
      php: symbol lookup error: /home/sneakyimp/cblas/cblas_LINUX.so: undefined symbol: dgemm_

      I also tried uploading this little folder to an Ubuntu 20.04 machine running PHP 7.4 and it had the same problem:

      $ php bar.php
      transpose elapsed time: 0.12622904777527seconds
      php: symbol lookup error: /home/sneakyimp/cblas/cblas_LINUX.so: undefined symbol: dgemm_

      On this latter machine, I even tried to recompile BLAS and CBLAS on it and it still barfs.

        If anyone has any suggestions about how to get CBLAS-via-FFI working on these virtual machines, I'd certainly appreciate advice. I have no idea what this error means:

        php: symbol lookup error: /home/sneakyimp/cblas/cblas_LINUX.so: undefined symbol: dgemm_

        NOTE: I have posted a detailed question about this on stack overflow. I feel that if I can get CBLAS to work here, that might just be the best approach. OpenBLAS is giant and bloated. CBLAS, although not quite as fast, is still pretty darn fast, and much more compact. I've been googling for hours and can't seem to make a whit of progress on this.

          I'm pleased (but also bemused) to report that I got BLAS invoked from PHP via FFI--bemused because I'm unable to get it working with the self-compiled cblas_LINUX.so. I was able to get a speedy BLAS matrix multiplication working on all my various Ubuntu machines by first installing libblas-dev:

          sudo apt install libblas-dev

          and by using the cblas.h file that I had previously reduced by removing openblas functions etc. from the ghostjat/np blas.h file. Apart from the FFI_LIB line, the reduced file (which I have saved as cblas.h) is identical to the one posted above; it begins:

          #define FFI_SCOPE "blas"
          #define FFI_LIB "libblas.so"
          typedef  size_t CBLAS_INDEX_t;
          
          /* ...the remaining declarations are identical to the reduced blas.h posted above... */

          NOTE: FFI_LIB is simply 'libblas.so' without any path information. This should result in Linux using the shared library installed by the libblas-dev package.

          Lastly, here's the script that loads my JSON file and multiplies my training set by its transpose; apart from loading the new cblas.h, it is identical to the bar.php listing above. The cblas_dgemm call executes in about 284ms on my workstation. Slower on the virtual machines for whatever reason.

          
          $data = json_decode(file_get_contents(__DIR__ . '/../machine-learning/training_data_sets.json'), TRUE);
          
          // load the reduced cblas.h (whose FFI_LIB is now simply "libblas.so")
          $ffi_blas = FFI::load(__DIR__ . '/cblas.h');
          
          // ...the remainder of the script (get_cdata(), the cblas_dgemm() call, and the
          // conversion of the result back to the PHP array $mr) is identical to the
          // bar.php listing above...

          The result, $mr, is precisely what the other matrix multiply/dot functions returned. The output:

          transpose elapsed time: 0.057420015335083seconds
          cblas_dgemm elapsed time: 0.28933596611023seconds
          rows: 1375
          cols: 1375
          json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e
            10 days later

            I've got native PHP matrix addition (mat_add()) and subtraction (mat_sub()) routines working well in the extension now. Had to come up with a new way to cleanly access PHP hash tables because none of the existing C macros would work for independent iteration over two distinct arrays.

            Assuming all goes well, I hope to have a native PHP matrix multiply implemented sometime this week.

            Slower on the virtual machines for whatever reason.

            Virtual machines are...virtual. AWS and other cloud providers supply vCores, which are slices of CPU time. Your VM is sharing system resources with other VMs on the same hardware. Some providers let users reserve whole CPU cores for just their VM, but those are much more expensive than the ultra-cheap VMs where CPU cores are time-shared. I believe some low-quality providers also overprovision RAM to cram more VMs onto one piece of hardware, which can push very-low-usage VMs out to swap and cause memory thrashing if they become active while the system is too busy. Your local machine is almost certainly more powerful and capable than any VM you reserve over in the cloud.

            cubic I've got native PHP matrix addition (mat_add()) and subtraction (mat_sub()) routines working well in the extension now. Had to come up with a new way to cleanly access PHP hash tables because none of the existing C macros would work for independent iteration over two distinct arrays.

            I suspect (but have no real evidence) that the slowness of my pure PHP implementations of these matrix operations is due to the need to traverse some data structure to locate row $i and column $j. I suspect (but have no evidence) that the implementations written in C benefit from some algorithm that quickly locates the data element i,j. I admire your courage for digging into the PHP source, and am somewhat curious what you are doing differently than the existing php codebase. Seems to me PHP itself might benefit from your approach.

            cubic Assuming all goes well, I hope to have a native PHP matrix multiply implemented sometime this week.

            That sounds pretty exciting!

            I have some possibly relevant news to report regarding real-world performance of BLAS. As previously posted, I managed to get BLAS-via-FFI working in PHP by installing libblas-dev. I even got it working on my macOS laptop. I'm not sure which library is handling PHP FFI's cblas_* calls, but I'm loading a modified blas.h file from the ghostjat repo. In particular, the FFI_LIB declaration is:

            #define FFI_LIB "libblas.dylib"

            and I seem to have removed the various openblas declarations that were causing complaints before. I don't really know which libblas.dylib is being summoned, but I see that I've installed openblas and lapack via brew. A search of the filesystem shows libblas.dylib files in numerous places. It appears to be part of both lapack and openblas, but the most official looking location is /usr/lib/libblas.dylib which is symbolic link to some file in /System/Library/Frameworks which did not show up in my file system 'find' command results. If anyone has thoughts on how I determine which libblas.dylib is being invoked, I'd appreciate it.

            My simple bar.php file, which uses the ghostjat/np lib (which, in turn, uses FFI to load libblas), multiplies my 1375x1966 sample matrix by its transpose in about 115ms on this macOS laptop. This is very fast compared to the other methods we've been looking at here.

            Encouraged by this quick result, I laboriously converted the original svm-training matlab/octave code to PHP. This was laborious because expressions that are really tidy and efficient in octave are not tidy or efficient in PHP. I'll try to provide some specifics here shortly -- in particular, the fact that ghostjat/np doesn't distinguish between column vectors and row vectors required me to write some ungainly looping code to probe into the raw data of these objects. The resulting code is not intuitive and at times seems pretty kludgy, but it works in reasonable time frames. By 'works in reasonable time frames' I mean that I can train a spam-filter SVM using my matrices in anywhere from 1 to about 10 minutes. I've yet to thoroughly test everything, but I can train the spam filter on the 1375x1966 training set, typically in 2-3 minutes, and then use the resulting SVM on another test set with about 450 messages and get the results back in 1-3 seconds. Depending on linear or gaussian kernel, and depending on the parameters chosen for C and sigma, I can find configurations that achieve about 95% accuracy in the spam/ham evaluation. The results in aggregate look the same as what I get in Octave.

            HOWEVER, Octave still seems to be about 5 or 6 times faster. My most grueling application so far is to sweep through these values of C and sigma, training a total of 441 different gaussian SVMs to see which values might be optimal:

            // range from 1.4^-10 to 1.4^10
            $cvals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
            $sigma_vals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
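
            The sweep itself is conceptually just a nested loop over the two grids (schematic only -- train_svm() and evaluate() are hypothetical stand-ins for my actual routines):

            $best = ['accuracy' => 0.0, 'C' => null, 'sigma' => null];
            foreach ($cvals as $C) {
                foreach ($sigma_vals as $sigma) {
                    $model = train_svm($xtrain, $ytrain, $C, $sigma); // 1-10 minutes each
                    $accuracy = evaluate($model, $xtest, $ytest);
                    if ($accuracy > $best['accuracy']) {
                        $best = ['accuracy' => $accuracy, 'C' => $C, 'sigma' => $sigma];
                    }
                }
            }
            // 21 x 21 = 441 SVMs trained in total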

            My PHP code chose C=2.744 and sigma=2.744, which achieved correct classification of 95.633% of test messages. It took 67202.334 seconds to run this PHP code. The Octave code trains the 441 SVMs on the same message data, varying the same C and sigma parameters in a much faster 12440.5 seconds. Octave chose the same optimal C and sigma, reporting the same success rate. The PHP code does a bit more to accumulate stats, etc., but this extra work adds much less than 5000 seconds in total.

            Using the stats from my PHP code, I was able to generate this surface plot. The z-axis indicates the accuracy of my SVM in predicting spam for various C and sigma values. Ironically, I had to export a CSV file from PHP and import it into Octave to make this plot. One will note that if I want to focus on the more promising regions of the graph, I must adjust $cvals and $sigma_vals and run the whole 18-hour process again. Or I can run it in Octave in about 4 hours.

            sneakyimp I suspect (but have no real evidence) that the slowness of my pure PHP implementations of these matrix operations is due to the need to traverse some data structure to locate row $i and column $j.

            PHP userland does a TON of work for every reference to something like $c[$i][$j]. It goes something like this:

            1. Check to see if $c exists.
            2. Sees the array/string dereference (i.e. the brackets). Is $c an array or a string? If so, continue.
            3. Check to see if $i exists. Is it an integer or a string? If so, continue.
            4. Check to see if $i is in bounds of the array and/or the key exists. If so, continue.
            5. Sees the array/string dereference (i.e. the brackets). Is $c[$i] an array or string? If so, continue.
            6. Check to see if $j exists. Is it an integer or a string? If so, continue.
            7. Check to see if $j is in bounds of the array and/or the key exists. If so, continue.

            Now PHP has located the correct spot in the arrays/hash tables to conduct the operation. Array lookups will almost certainly call several C functions. And the right side of the assignment utilizes at least one temporary variable. Doing those steps one time? Not a big deal. Doing those steps a billion times? Very slow. PHP userland code can't really be optimized all that much by the Zend engine. Steps 3 and 4 could be skipped by using PHP references instead of constantly doing $c[$i], but that's about it.
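
            To illustrate that reference trick, here's a hypothetical naive multiply (names made up) where the row lookups are hoisted so $c[$i] and $a[$i] are resolved once per row instead of once per cell:

            // $a is m x n, $b is n x p, $c is the m x p result
            for ($i = 0; $i < $m; $i++) {
                $arow = $a[$i];
                $crow = &$c[$i]; // one hash-table lookup per row, not per cell
                for ($j = 0; $j < $p; $j++) {
                    $sum = 0.0;
                    for ($k = 0; $k < $n; $k++) {
                        $sum += $arow[$k] * $b[$k][$j];
                    }
                    $crow[$j] = $sum;
                }
                unset($crow); // break the reference before the next row
            }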

            C is faster because it can precalculate all of those checks in advance, avoiding them on every cycle of the inner loop; it also doesn't need to do type conversion/coercion and won't call a bunch of functions in the innermost loop. The tradeoffs are the potential for instability and reduced system security. One wrong/misplaced pointer in C and an application will segfault (segfault == probable security vulnerability).

            I've now got mat_mult() written for both double and zend_long 2D arrays and also have max_execution_time timeout support. I still need to add scalar multiplies for completionist purposes and write the requisite test suite to properly validate the implementation. It won't run as fast as libblas does on your computer but it will be a native implementation.

            sneakyimp If anyone has thoughts on how I determine which libblas.dylib is being invoked, I'd appreciate it.

            You can probably figure that out via strace (or equivalent).

            cubic One wrong/misplaced pointer in C and an application will segfault (segfault == probable security vulnerability).

            I have never managed to get comfortable with pointers, and frequently run into trouble if I try to 'go native' and write some C with pointers.

            I was casually thinking that C could do matrix multiplication extremely quickly if it could simply calculate the memory offset from i and j and add it to double *ptr, but it later occurred to me that it might be unlikely that the memory allocated to a matrix would be contiguous. I was wondering how the C code might handle a matrix stored in non-contiguous memory and still be fast.

            cubic You can probably figure that out via strace (or equivalent).

            Oooh that seems very helpful! Will look into that.

            In the meantime, I thought it might be worth pointing out some of the challenges I've had translating octave code to PHP. I used ghostjat/np for performance reasons. It's the one library I've found with performance remotely comparable to octave. Sadly, it has a lot of problems so I'm currently working with my own modified version of it. Perhaps the biggest problem I've faced harks back to the lack of distinction between row vectors and column vectors. I expressed confusion about this earlier in this thread and discussed it with @Weedpacket. I want to re-emphasize this as a point of confusion, and illustrate with the challenge I faced translating this bit of octave code:

                % Vectorized RBF Kernel
                % This is equivalent to computing the kernel on every pair of examples
                X1 = sum(X.^2, 2);
                X2 = sum(model.X.^2, 2)';
                K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
                K = model.kernelFunction(1, 0) .^ K;
                K = bsxfun(@times, model.y', K);
                K = bsxfun(@times, model.alphas', K);

            In this case, X is something like my 1375x1966 training matrix and model.X is some different set with a different number of rows (m) but the same number of features (n). Calculating X1 and X2 is easy with ghostjat/np:

                    $x1 = $x->square()->sumRows();
                    $x2 = $model['x']->square()->sumRows();

            Calculating K is something of a nightmare. We can do - 2 * X * model.X' pretty easily, although the sequence of operations in our PHP starts to shift around. Remember that matrix multiplication is not commutative:

            // - 2 * X * model.X'
            $K = $x->dot($model['x']->transpose())->multiply(-2);

            Just to accomplish that inner bsxfun operation requires us to embark on looping structures:

            // do the inner bsxfun(plus...)
            // ghostjat has no means to add a ROW vector to a matrix soooo we fake it
            $kshape = $K->getShape();
            $km = $kshape->m;
            $kn = $kshape->n;
            $x2size = $x2->getSize();
            // i are columns, j are rows
            for($i=0; $i<$x2size; $i++) {
                $x2val = $x2->data[$i];
                for($j=0; $j<$km; $j++) {
                    // add the ith x2 value to the ith column of the jth row
                    $K->data[($j * $kn) + $i] += $x2val;
                }
            }

            I'd like to clarify that bsxfun(@plus, X2, - 2 * X * model.X') is not an especially elaborate operation. If we define a (a substitute for X2) and b (a substitute for the last parameter, a dot product) thusly:

            // a 1 x 2 row vector
            a = [10 20]
            // a 3 x 2 matrix
            b = [1 1; 2 2; 3 3;]
            inner_bsx = bsxfun(@plus, a, b)
                inner_bsx =
            
                   11   21
                   12   22
                   13   23

            We are basically just adding the row vector a to each of the 3 rows in b. I've been unable to identify any matrix sum or add method which can perform this very simple operation, so I had to resort to the looping code above. This quickly gets more maddening. To implement the outer bsxfun operation in that one line of code, we have to run another nested loop pair to add a column vector (X1 in the real code, c in this toy example) to the result of the inner bsxfun operation.

            // a 3 x 1 column vector, same size as X1
            c = [100; 1000; 10000]
            // outer bsxfun
            outer_bsx = bsxfun(@plus, c, inner_bsx)
                outer_bsx =
            
                     111     121
                    1012    1022
                   10013   10023

            You'll note that octave knows that c is a column vector so it sums c to each column of inner_bsx. You'll also note that ghostjat/np doesn't distinguish between row and column vectors. Nor does it offer any bsxfun operation or methods like sumAsColumn or sumAsRow one might use to expressly specify how the addition is to be performed. The logic I've seen in some of these other matrix operation libs when multiplying a matrix and vector is to check dimensions and then make a guess about what was intended. I assert this guessing behavior is sorely inadequate, given that order of operations matters in matlab/octave, and given that ambiguity arises when multiplying square-but-non-symmetric matrices.
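
            For what it's worth, the row/column broadcast loops above (and the similar ones in the function below) could be folded into one small helper so the intent reads a bit more like bsxfun. This is a hypothetical sketch against ghostjat/np's flat, row-major ->data buffer (broadcast_add is my own name, not part of the library):

            // broadcast-add a vector onto an m x n row-major buffer; $data may be a
            // plain PHP array or an FFI\CData buffer (both support [] access)
            // 'row': treat $vec as a row vector, adding $vec[$j] down every column j
            // 'col': treat $vec as a column vector, adding $vec[$i] across every row i
            function broadcast_add(&$data, int $m, int $n, $vec, string $axis): void {
                for ($i = 0; $i < $m; $i++) {
                    for ($j = 0; $j < $n; $j++) {
                        $data[$i * $n + $j] += ($axis === 'row') ? $vec[$j] : $vec[$i];
                    }
                }
            }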

            The difficulties continue. Those six lines or so of octave/matlab yield this function, which was a chore to write (juggling the nested loops and making sure the offsets are calculated correctly, bounds not exceeded, etc) and doesn't seem to be very efficient at all.

            function get_gaussian_predict_k(Np\matrix $x, array $model) {
            
            	// orig octave code in svmPredict
            	// Vectorized RBF Kernel
            	// This is equivalent to computing the kernel on every pair of examples
            	//X1 = sum(X.^2, 2);
            	//X2 = sum(model.X.^2, 2)';
            	//K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
            	//K = model.kernelFunction(1, 0) .^ K;
            	//K = bsxfun(@times, model.y', K);
            	//K = bsxfun(@times, model.alphas', K);
            
            	$x1 = $x->square()->sumRows();
            	//echo "x1 ", $x1, "\n";
            
            	// we don't need to transpose this because ghostjat/np doesn't distinguish col vs row vectors
            	$x2 = $model['x']->square()->sumRows();
            	//echo "x2 ", $x2, "\n";
            
            	// need to build K.
            	$K = $x->dot($model['x']->transpose())->multiply(-2);
            
            	// do the inner bsxfun(plus...)
            	// ghostjat has no means to add a ROW vector to a matrix soooo we fake it
            	$kshape = $K->getShape();
            	$km = $kshape->m;
            	$kn = $kshape->n;
            	$x2size = $x2->getSize();
            	// $kn should match the size of $x2
            	// sanity check
            	if ($x2size !== $kn) {
            		throw new \Exception("x2 size ($x2size) does not match kn ($kn)");
            	}
            	// i are columns, j are rows
            	for($i=0; $i<$x2size; $i++) {
            		$x2val = $x2->data[$i];
            		for($j=0; $j<$km; $j++) {
            			// add the ith x2 value to the ith column of the jth row
            			$K->data[($j * $kn) + $i] += $x2val;
            		}
            	}
            
            	// do the outer bsxfun(plus...)
            	// ghostjat has no means to add a COLUMN vector soooo we fake it
            	$x1size = $x1->getSize();
            	// $km should match the dimensions of $x1
            	// sanity check
            	if ($x1size !== $km) {
            		throw new \Exception("x1 size ($x1size) does not match km ($km)");
            	}
            	// i are rows, j are columns
            	for($i=0; $i<$x1size; $i++) {
            		$x1val = $x1->data[$i];
            		for($j=0; $j<$kn; $j++) {
            			// add the ith x1 value to the jth column of the ith row
            			//$offset = ($i * $kn) + $j;
            			$K->data[($i * $kn) + $j] += $x1val;
            		}
            	}
            
            	$kf = gaussian_kernel(Np\vector::ar([1]), Np\vector::ar([0]), $model['sigma']);
            	//echo "kf ", $kf, "\n";
            	$K = $K->map(fn($v) => (pow($kf, $v)));
            
            	$mysize = $model['y']->getSize();
            	// $kn should match the size of $model['y']
            	// sanity check
            	if ($mysize !== $kn) {
            		throw new \Exception("model.y size ($mysize) does not match kn ($kn)");
            	}
            	// i are columns, j are rows
            	for($i=0; $i<$mysize; $i++) {
            		$yval = $model['y']->data[$i];
            		for($j=0; $j<$km; $j++) {
            			// multiply the ith y value by the ith column of the jth row
            			$K->data[($j * $kn) + $i] *= $yval;
            		}
            	}
            
            
            	$alphasize = $model['alphas']->getSize();
            	// $kn should match the size of $model['alphas']
            	// sanity check
            	if ($alphasize !== $kn) {
            		throw new \Exception("model.alpha size ($alphasize) does not match kn ($kn)");
            	}
            	// i are columns, j are rows
            	for($i=0; $i<$alphasize; $i++) {
            		$aval = $model['alphas']->data[$i];
            		for($j=0; $j<$km; $j++) {
            			// multiply the ith alpha value by the ith column of the jth row
            			$K->data[($j * $kn) + $i] *= $aval;
            		}
            	}
            
            	return $K;
            }

            Given how tricky it is to get PHP FFI working with BLAS or OpenBLAS, how elaborate the PHP effort is here, and the fact that even with all that Octave is five times faster, I'm starting to think I'd be better off using exec to delegate these matrix/svm calculations to Octave.

            sneakyimp but it later occurred to me that it might be unlikely that the memory allocated to a matrix would be contiguous.

            If you allocate it as a single array, the C standard requires it to be contiguously addressable. Hence the tendency to use mat[i * stride + j] instead of mat[i][j]. The former, once the compiler starts working on it, is defined as *((mat) + (i * stride + j)).
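
            In PHP terms the same flat, row-major idea looks like this (toy sketch with made-up values):

            // a 2 x 3 matrix stored flat in row-major order:
            // element (i, j) lives at offset i * cols + j
            $cols = 3;
            $flat = [1, 2, 3,
                     4, 5, 6];
            $i = 1; $j = 2;
            echo $flat[$i * $cols + $j]; // 6 -- row 1, column 2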

            Okay. I've got a working, validated mat_mult() function in a PHP extension with zero external library dependencies. But wow do large arrays chug a bunch of RAM for breakfast. That's not my fault. A 1375x1966 2D array in PHP consumes about 136MB RAM. Doing the math, the optimal memory usage is 1375x1966x8 = ~21MB. So there are about 40 bytes of overhead per cell, which sounds about right for a zval.

            I just realized that your benchmark times do not include the time taken to convert between PHP zvals and the format required by the cblas_dgemm() function call AND the time taken to convert back to PHP zvals when done. I'd wager you excluded those conversions to measure the speed of the function call itself, but the surrounding conversion times are important as well and make the actual function call time perhaps a little less impressive.

            At any rate, the new mat_mult() function is about 3x slower than the matrixmult benchmark tool on a very old piece of hardware - an ancient Intel Celeron running Linux; if something runs reasonably fast there, it's probably very fast everywhere else. The slowdown probably comes from the overhead of dealing with PHP zvals AND allocating memory for 1.9 million zvals for the return array. The matrixmult benchmark tool multiplies 1375x1966x1375 in about 17.4 seconds (I said the machine was slow!) while the mat_mult() PHP function takes 47.9 seconds on the same hardware. So I expect it to run the example matrix in about 7-9 seconds on modern hardware. Still WAY faster - by a factor of around 20x - than doing a matrix multiply in pure PHP userland. There might be some further optimizations to be made, but I'm pretty satisfied with the results. Also, I haven't had any segfaults/crashes and only ran into one easily fixed bug.

            (In case you are wondering, the aforementioned ancient system is currently my only dedicated Linux system on my LAN that is set up for PHP extension development. I plan on replacing the system this year. I mostly run Windows but writing PHP extensions is most easily done first on Linux and then porting to other OSes after the initial development there is done.)

            The next step is to clean up the extension a little bit and get it ready to publish.

            Having basic mat_add(), mat_sub(), and mat_mult() implementations in PHP core will alleviate a lot of linear algebra performance problems that currently exist in PHP. I don't believe the effort is wasted even if you end up going the exec route because you desire better performance.
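
            In case it helps, here's the sort of usage I'm aiming for (hypothetical -- the final names/signatures may change before the extension is published):

            // plain 2D PHP arrays in, a new 2D PHP array out
            $a = [[1.0, 2.0], [3.0, 4.0]];
            $b = [[5.0, 6.0], [7.0, 8.0]];

            var_dump(mat_mult($a, $b)); // [[19.0, 22.0], [43.0, 50.0]]
            var_dump(mat_add($a, $b));  // [[6.0, 8.0], [10.0, 12.0]]
            var_dump(mat_sub($b, $a));  // [[4.0, 4.0], [4.0, 4.0]]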

            cubic A 1375x1966 2D array in PHP consumes about 136MB RAM. Of course, doing the math, the most optimal memory usage is 1375x1966x8 = 21MB. So there are about 40 bytes of overhead per cell, which sounds about right for a zval.

            I understand you need the zval structs for a loosely typed language but OUCH that's some pretty serious overhead. I was curious about memory consumption and cooked up this script:

            echo "START\n";
            echo number_format(memory_get_usage()) . "\n";
            
            require __DIR__ . '/vendor/autoload.php';
            use Np\matrix;
            echo "AFTER AUTOLOAD\n";
            echo number_format(memory_get_usage()) . "\n";
            
            $x = [];
            for($i=0; $i<1375; $i++) {
            	$x[$i] = [];
            	for($j=0; $j<1966; $j++) $x[$i][$j] = lcg_value();
            }
            echo "AFTER X GENERATED\n";
            echo number_format(memory_get_usage()) . "\n";
            
            echo "x has " . sizeof($x) . " rows\n";
            echo "x has " . sizeof($x[0]) . " cols\n";
            
            $xmatrix = matrix::ar($x);
            echo "AFTER MATRIX DEFINED\n";
            echo number_format(memory_get_usage()) . "\n";
            
            
            unset($x);
            echo "AFTER X UNSET\n";
            echo number_format(memory_get_usage()) . "\n";
            
            var_dump($xmatrix->data);

            For some reason the 1375 x 1966 native PHP array of floats only consumes about 50MB on my workstation (Ubuntu 20.04):

            START
            398,448
            AFTER AUTOLOAD
            452,696
            AFTER X GENERATED
            51,254,616
            x has 1375 rows
            x has 1966 cols
            AFTER MATRIX DEFINED
            73,111,352
            AFTER X UNSET
            22,309,432
            object(FFI\CData:double*)#4 (1) {
              [0]=>
              float(0.558039688348575)
            }

            Also note that the ghostjat/np class stores its data as an FFI\CData double* array, and its memory consumption appears to be close to the ideal of about 21MB.

            cubic I just realized that your benchmark times are not including the time taken for converting between PHP zvals and the format required by the cblas_dgemm() function call AND the time taken to convert back to PHP zvals when done.

            In practice I've been loading data from some external format (JSON, CSV, etc.), then defining the ghostjat/np vector/matrix objects and using them for all my manipulations. But here's a simplified script for comparison. I've timed the steps after $xtrain is defined to construct the matrix object, transpose it, multiply, and convert the result back to a PHP array:

            echo 'xtrain has ' . sizeof($xtrain) . " rows\n";
            echo 'xtrain has ' . sizeof($xtrain[0]) . " cols\n";
            
            $start = microtime(true);
            
            $ta = matrix::ar($xtrain);
            $tb = $ta->transpose(); // use the transpose as the second operand
            
            $tc = $ta->dot($tb);
            
            $arr = $tc->asArray();
            echo "total elapsed time: " . (microtime(true) - $start) . "seconds\n";
            
            echo "rows: " . sizeof($arr) . "\n";
            echo "cols: " . sizeof($arr[0]) . "\n";

            About 550ms:

            xtrain has 1375 rows
            xtrain has 1966 cols
            total elapsed time: 0.55272507667542seconds
            rows: 1375
            cols: 1375

            This particular configuration was using OpenBLAS. The machine is an Intel Core i7-4820K CPU @ 3.70GHz, about 10 years old but still fairly muscular, with 4 physical cores / 8 threads of simultaneous execution.

            cubic Having basic mat_add(), mat_sub(), and mat_mult() implementations in PHP core will alleviate a lot of linear algebra performance problems that currently exist in PHP.

            Your results seem pretty impressive to me, and I think it might be quite helpful to have such functions in PHP without needing sudo access to the machine or jumping through various hoops to link up other modules/libs. I would point out that the really intensive part of my octave/matlab stuff is training the SVM on thousands of sample inputs. Your functions might be very helpful for running the SVM on a single input vector in a production environment. I.e., it takes a lot of heavy CPU time (1-10 minutes) to train an SVM, so training would probably need to be offloaded (e.g. via exec) anyway, whereas running an already-trained SVM to classify an incoming message as ham or spam is much less intensive, but would certainly benefit from fast matrix/vector functions.

              9 days later

              With some effort, I finally got a working build of qolfuncs on my Windows PC against PHP 8.2.1, which gets me as close to PHP master as I plan on getting until I start opening pull requests.

              For 1375x1966 using mat_mult(): 1.82 seconds. Only a tad slower than the matrixmult tool. It's definitely in the zone though. The old and crusty hardware I originally wrote the function on doesn't even stack up to a Core i7.

              Running a homegrown userland implementation on the same hardware: 85.4 seconds. Even factoring in the variance between our hardware, my userland implementation appears to be notably faster than PHP-ML. This means mat_mult() is about 46 times faster than the best userland implementation. There was some initial pushback on the PHP internals list over matrix operations but I think the now-confirmed massive gain in performance will convince most folks that even basic matrix operations should be included in PHP core. Plenty of use-cases also exist.
