I tried simply uploading this little folder from my Ubuntu 20.04 workstation running PHP 8.2 to another workstation running Ubuntu 22.04 with PHP 8.1. I adjusted the paths accordingly for this machine, but it does not work:

$ php bar.php
transpose elapsed time: 0.14942502975464seconds
php: symbol lookup error: /home/sneakyimp/cblas/cblas_LINUX.so: undefined symbol: dgemm_

I also tried uploading this little folder to an Ubuntu 20.04 machine running PHP 7.4 and it had the same problem:

$ php bar.php
transpose elapsed time: 0.12622904777527seconds
php: symbol lookup error: /home/sneakyimp/cblas/cblas_LINUX.so: undefined symbol: dgemm_

On this latter machine I even tried recompiling BLAS and CBLAS, and it still barfs with the same error.

    If anyone has any suggestions about how to get CBLAS-via-FFI working on these virtual machines, I'd certainly appreciate advice. I have no idea what this error means:

    php: symbol lookup error: /home/sneakyimp/cblas/cblas_LINUX.so: undefined symbol: dgemm_

    NOTE: I have posted a detailed question about this on Stack Overflow. I feel that if I can get CBLAS to work here, that might just be the best approach: OpenBLAS is giant and bloated, while CBLAS, although not quite as fast, is still pretty darn fast and much more compact. I've been googling for hours and can't seem to make a whit of progress on this.

      I'm pleased (but also bemused) to report that I got BLAS invoked from PHP via FFI--bemused because I'm unable to get it working with the self-compiled cblas_LINUX.so. I was able to get a speedy BLAS matrix multiplication working on all my various Ubuntu machines by first installing libblas-dev:

      sudo apt install libblas-dev

      and by using the cblas.h file that I had previously reduced from the ghostjat/np blas.h file (removing the openblas functions, etc.). Here is the entire reduced file, which I have saved as cblas.h:

      #define FFI_SCOPE "blas"
      #define FFI_LIB "libblas.so"
      typedef size_t CBLAS_INDEX_t;
      
      /* standard CBLAS enums; the prototypes below need these declared */
      enum CBLAS_ORDER {CblasRowMajor=101, CblasColMajor=102};
      enum CBLAS_TRANSPOSE {CblasNoTrans=111, CblasTrans=112, CblasConjTrans=113};
      enum CBLAS_UPLO {CblasUpper=121, CblasLower=122};
      enum CBLAS_DIAG {CblasNonUnit=131, CblasUnit=132};
      enum CBLAS_SIDE {CblasLeft=141, CblasRight=142};
      
      size_t cblas_idamax(const int N, const double *X, const int incX);
      
      double cblas_dsdot(const int N, const float *X, const int incX, const float *Y,
                         const int incY);
      double cblas_ddot(const int N, const double *X, const int incX,
                        const double *Y, const int incY);
      double cblas_dnrm2(const int N, const double *X, const int incX);
      double cblas_dasum(const int N, const double *X, const int incX);
      
      void cblas_dswap(const int N, double *X, const int incX, 
                       double *Y, const int incY);
      void cblas_dcopy(const int N, const double *X, const int incX, 
                       double *Y, const int incY);
      void cblas_daxpy(const int N, const double alpha, const double *X,
                       const int incX, double *Y, const int incY);
      void cblas_drotg(double *a, double *b, double *c, double *s);
      void cblas_drotmg(double *d1, double *d2, double *b1, const double b2, double *P);
      void cblas_drot(const int N, double *X, const int incX,
                      double *Y, const int incY, const double c, const double  s);
      void cblas_drotm(const int N, double *X, const int incX,
                      double *Y, const int incY, const double *P);
      void cblas_dscal(const int N, const double alpha, double *X, const int incX);
      void cblas_dgemv(const enum CBLAS_ORDER order,
                       const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
                       const double alpha, const double *A, const int lda,
                       const double *X, const int incX, const double beta,
                       double *Y, const int incY);
      void cblas_dgbmv(const enum CBLAS_ORDER order,
                       const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
                       const int KL, const int KU, const double alpha,
                       const double *A, const int lda, const double *X,
                       const int incX, const double beta, double *Y, const int incY);
      void cblas_dtrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                       const int N, const double *A, const int lda, 
                       double *X, const int incX);
      void cblas_dtbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                       const int N, const int K, const double *A, const int lda, 
                       double *X, const int incX);
      void cblas_dtpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                       const int N, const double *Ap, double *X, const int incX);
      void cblas_dtrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                       const int N, const double *A, const int lda, double *X,
                       const int incX);
      void cblas_dtbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                       const int N, const int K, const double *A, const int lda,
                       double *X, const int incX);
      void cblas_dtpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                       const int N, const double *Ap, double *X, const int incX);
      void cblas_dsymv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                       const int N, const double alpha, const double *A,
                       const int lda, const double *X, const int incX,
                       const double beta, double *Y, const int incY);
      void cblas_dsbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                       const int N, const int K, const double alpha, const double *A,
                       const int lda, const double *X, const int incX,
                       const double beta, double *Y, const int incY);
      void cblas_dspmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                       const int N, const double alpha, const double *Ap,
                       const double *X, const int incX,
                       const double beta, double *Y, const int incY);
      void cblas_dger(const enum CBLAS_ORDER order, const int M, const int N,
                      const double alpha, const double *X, const int incX,
                      const double *Y, const int incY, double *A, const int lda);
      void cblas_dsyr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                      const int N, const double alpha, const double *X,
                      const int incX, double *A, const int lda);
      void cblas_dspr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                      const int N, const double alpha, const double *X,
                      const int incX, double *Ap);
      void cblas_dsyr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                      const int N, const double alpha, const double *X,
                      const int incX, const double *Y, const int incY, double *A,
                      const int lda);
      void cblas_dspr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                      const int N, const double alpha, const double *X,
                      const int incX, const double *Y, const int incY, double *A);
      void cblas_dgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
                       const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
                       const int K, const double alpha, const double *A,
                       const int lda, const double *B, const int ldb,
                       const double beta, double *C, const int ldc);
      void cblas_dsymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                       const enum CBLAS_UPLO Uplo, const int M, const int N,
                       const double alpha, const double *A, const int lda,
                       const double *B, const int ldb, const double beta,
                       double *C, const int ldc);
      void cblas_dsyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
                       const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
                       const double alpha, const double *A, const int lda,
                       const double beta, double *C, const int ldc);
      void cblas_dsyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
                        const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
                        const double alpha, const double *A, const int lda,
                        const double *B, const int ldb, const double beta,
                        double *C, const int ldc);
      void cblas_dtrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                       const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA,
                       const enum CBLAS_DIAG Diag, const int M, const int N,
                       const double alpha, const double *A, const int lda,
                       double *B, const int ldb);
      void cblas_dtrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                       const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA,
                       const enum CBLAS_DIAG Diag, const int M, const int N,
                       const double alpha, const double *A, const int lda,
                       double *B, const int ldb);

      NOTE: FFI_LIB is simply 'libblas.so' without any path information. This should result in Linux using the shared library installed by the libblas-dev package.
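
      As an aside: if you only need a handful of functions, FFI::cdef() can bind them inline, without a separate header file or FFI_SCOPE. A minimal sketch -- the versioned soname 'libblas.so.3' is an assumption on my part, and the enum parameters are declared as plain ints, which is ABI-compatible:

      // sketch: bind just cblas_dgemm and let dlopen resolve the library;
      // "libblas.so.3" is an assumed soname -- adjust to what your system provides
      $blas = FFI::cdef('
          void cblas_dgemm(int Order, int TransA, int TransB,
                           int M, int N, int K,
                           double alpha, const double *A, int lda,
                           const double *B, int ldb,
                           double beta, double *C, int ldc);
      ', 'libblas.so.3');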

      Lastly, here's a simplified script that loads my JSON file and multiplies my training set by its transpose. The cblas_dgemm call executes in about 284ms on my workstation. Slower on the virtual machines for whatever reason.

      
      $data = json_decode(file_get_contents(__DIR__ . '/../machine-learning/training_data_sets.json'), TRUE);
      
      // more trial and error...cooked this up after reading the ghostjat/np source code
      $ffi_blas = FFI::load(__DIR__ . '/cblas.h');
      
      
      $m1 = $data['Xtrain'];
      
      
      $start = microtime(true);
      $m1t = array_map(null, ...$m1); // transpose: array_map(null, ...) zips the rows into columns
      echo "transpose elapsed time: " . (microtime(true) - $start) . "seconds\n";
      
      // lifted from ghostjat/np, converts matrix to obj with FFI\CData
      function get_cdata($m) {
      	// this check should be more specific, but works for now
      	if (!is_array($m) || !is_array($m[0])) {
      		throw new Exception('param must be array of arrays');
      	}
      
      	$rowcount = sizeof($m);
      	$colcount = sizeof($m[0]);
      	$flat_m = [];
      	$size = $rowcount * $colcount;
      	$cdata = \FFI::cast('double *', \FFI::new("double[$size]"));
      	$i = 0;
      	foreach($m as $row) {
      		foreach($row as $val) {
      			$flat_m[$i] = $val;
      			$cdata[$i] = $val;
      			$i++;
      		}
      	}
      
      	return [
      		'rows' => $rowcount,
      		'cols' => $colcount,
      		'flat_m' => $flat_m,
      		'cdata' => $cdata
      	];
      }
      
      $m1_c = get_cdata($m1);
      $m1t_c = get_cdata($m1t);
      
      define('CBLAS_ROW_MAJOR', 101);
      define('CBLAS_NO_TRANS', 111);
      
      $mr_size = $m1_c['rows'] * $m1t_c['cols'];
      $mr_rows = $m1_c['rows'];
      $mr_cols = $m1t_c['cols'];
      $mr_cdata = \FFI::cast('double *', \FFI::new("double[$mr_size]"));
      
      $start = microtime(true);
      // cblas_dgemm returns void; the product is written into $mr_cdata via the pointer
      $ffi_blas->cblas_dgemm(CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS, $m1_c['rows'], $m1t_c['cols'], $m1_c['cols'], 1.0, $m1_c['cdata'], $m1_c['cols'], $m1t_c['cdata'], $m1t_c['cols'], 0.0, $mr_cdata, $mr_cols);
      echo "cblas_dgemm elapsed time: " . (microtime(true) - $start) . "seconds\n";
      
      
      // convert cdata to php object
      $idx = 0;
      $mr = [];
      for($i=0; $i<$mr_rows; $i++) {
      	$mr[$i] = [];
      	for($j=0; $j<$mr_cols; $j++) {
      		$mr[$i][$j] = $mr_cdata[$idx];
      		$idx++;
      	}
      }
      
      
      echo "rows: " . sizeof($mr) . "\n";
      echo "cols: " . sizeof($mr[0]) . "\n";
      
      $json = json_encode($mr, JSON_PRETTY_PRINT);
      file_put_contents(__DIR__ . '/cblas-product.json', $json);
      echo "json md5 is " . md5($json) . "\n";

      The result, $mr, is precisely what the other matrix multiply/dot functions returned. The output:

      transpose elapsed time: 0.057420015335083seconds
      cblas_dgemm elapsed time: 0.28933596611023seconds
      rows: 1375
      cols: 1375
      json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e
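
      In case anyone wants to verify the row-major conventions before trusting a big multiply, here is a tiny hand-checkable sanity test using the same FFI scope and constants as above (my own sketch, not part of the script):

      // sanity check: [1 2; 3 4] x [5 6; 7 8] should equal [19 22; 43 50]
      $a = \FFI::new('double[4]'); foreach ([1, 2, 3, 4] as $i => $v) $a[$i] = $v;
      $b = \FFI::new('double[4]'); foreach ([5, 6, 7, 8] as $i => $v) $b[$i] = $v;
      $c = \FFI::new('double[4]');
      $ffi_blas->cblas_dgemm(CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS,
          2, 2, 2, 1.0, \FFI::cast('double *', $a), 2,
          \FFI::cast('double *', $b), 2, 0.0, \FFI::cast('double *', $c), 2);
      echo $c[0], ' ', $c[1], ' ', $c[2], ' ', $c[3], "\n"; // 19 22 43 50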
        10 days later

        I've got native PHP matrix addition (mat_add()) and subtraction (mat_sub()) routines working well in the extension now. Had to come up with a new way to cleanly access PHP hash tables because none of the existing C macros would work for independent iteration over two distinct arrays.

        Assuming all goes well, I hope to have a native PHP matrix multiply implemented sometime this week.

        Slower on the virtual machines for whatever reason.

        Virtual machines are...virtual. AWS and other cloud providers supply vCores, which are slices of CPU time: your VM shares system resources with other VMs on the same hardware. Some providers let users reserve whole CPU cores for their VM alone, but those are much more expensive than the ultra-cheap VMs where CPU cores are time-shared. I believe some low-quality providers also overprovision RAM to cram more VMs onto one piece of hardware, which can cause very-low-usage VMs to be dumped to swap and can lead to memory thrashing if they start being used and the system gets too busy. Your local machine is almost certainly more powerful and capable than any VM you reserve in the cloud.

        cubic I've got native PHP matrix addition (mat_add()) and subtraction (mat_sub()) routines working well in the extension now. Had to come up with a new way to cleanly access PHP hash tables because none of the existing C macros would work for independent iteration over two distinct arrays.

        I suspect (but have no real evidence) that the slowness of my pure PHP implementations of these matrix operations is due to the need to traverse some data structure to locate row $i and column $j. I suspect (again without evidence) that the implementations written in C benefit from some means of quickly locating element i,j. I admire your courage in digging into the PHP source, and am somewhat curious what you are doing differently from the existing PHP codebase. It seems to me PHP itself might benefit from your approach.

        cubic Assuming all goes well, I hope to have a native PHP matrix multiply implemented sometime this week.

        That sounds pretty exciting!

        I have some possibly relevant news to report regarding real-world performance of BLAS. As previously posted, I managed to get BLAS-via-FFI working in PHP by installing libblas-dev. I even got it working on my MacOS laptop. I'm not sure which library is handling PHP::FFI's cblas_* calls, but I'm loading a modified blas.h file from the ghostjat repo. In particular, the FFI_LIB declaration is:

        #define FFI_LIB "libblas.dylib"

        and I seem to have removed the various openblas declarations that were causing complaints before. I don't really know which libblas.dylib is being summoned, but I see that I've installed openblas and lapack via brew. A search of the filesystem shows libblas.dylib files in numerous places. It appears to be part of both lapack and openblas, but the most official-looking location is /usr/lib/libblas.dylib, which is a symbolic link to some file in /System/Library/Frameworks that did not show up in my filesystem 'find' results. If anyone has thoughts on how to determine which libblas.dylib is being invoked, I'd appreciate it.

        My simple bar.php file, which multiplies my example matrix by its transpose using the ghostjat/np lib (which, in turn, uses FFI to load libblas), completes the 1375x1966-by-transpose multiplication in about 115ms on this MacOS laptop. This is very fast compared to the other methods we've been looking at here.

        Encouraged by this quick result, I laboriously converted the original svm-training matlab/octave code to PHP. This was laborious because expressions that are really tidy and efficient in octave are not tidy or efficient in PHP. I'll try to provide some specifics here shortly -- in particular, the fact that ghostjat/np doesn't distinguish between column vectors and row vectors required me to write some ungainly looping code to probe into the raw data of these objects. The resulting code is not intuitive and at times seems pretty kludgy, but it works in reasonable time frames. By 'works in reasonable time frames' I mean that I can train a spam-filter SVM using my matrices in anywhere from one to about ten minutes. I've yet to thoroughly test everything, but I can train the spam filter on a 1375x1966 training set, typically in 2-3 minutes, and then run the resulting SVM on another test set of about 450 messages and get the results back in 1-3 seconds. Depending on linear or gaussian kernel, and depending on the parameters chosen for C and sigma, I can find configurations that achieve about 95% accuracy in the spam/ham evaluation. The results in aggregate look the same as what I get in Octave.

        HOWEVER, Octave still seems to be about 5 or 6 times faster. My most grueling application so far is to sweep through these values of C and sigma, training a total of 441 different gaussian SVMs to see which values might be optimal:

        // range from 1.4^-10 to 1.4^10
        $cvals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
        $sigma_vals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
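
        Both lists are just consecutive powers of 1.4, by the way, so they could be generated instead of hard-coded; a one-line sketch:

        // equivalent to the hard-coded lists above: 1.4^-10 through 1.4^10
        $cvals = array_map(fn($e) => 1.4 ** $e, range(-10, 10));
        $sigma_vals = $cvals;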

        My PHP code chose C=2.744 and sigma=2.744, which achieved correct classification of 95.633% of test messages. It took 67202.334 seconds to run this PHP code. The Octave code trains the 441 SVMs on the same message data, varying the same C and sigma parameters in a much faster 12440.5 seconds. Octave chose the same optimal C and sigma, reporting the same success rate. The PHP code does a bit more to accumulate stats, etc., but this extra work adds much less than 5000 seconds in total.

        Using the stats from my PHP code, I was able to generate this surface plot. The z-axis indicates the accuracy of my SVM in predicting spam for various C and sigma values. Ironically, I had to export a CSV file from PHP and import it into octave to make this plot. One will note that if I want to focus on the more promising regions of the graph, I must adjust $cvals and $sigma_vals and run the whole 18-hour process again. Or I can run it in Octave in about 4 hours.

        sneakyimp I suspect (but have no real evidence) that the slowness of my pure PHP implementations of these matrix operations is due to the need to traverse some data structure to locate row $i and column $j.

        PHP userland does a TON of work for every reference to something like $c[$i][$j]. It goes something like this:

        1. Check whether $c exists.
        2. See the array/string dereference (i.e. the brackets). Is $c an array or a string? If so, continue.
        3. Check whether $i exists. Is it an integer or a string? If so, continue.
        4. Check whether $i is in bounds of the array and/or the key exists. If so, continue.
        5. See the array/string dereference (i.e. the brackets). Is $c[$i] an array or a string? If so, continue.
        6. Check whether $j exists. Is it an integer or a string? If so, continue.
        7. Check whether $j is in bounds of the array and/or the key exists. If so, continue.

        Now PHP has located the correct location in the arrays/hash tables to conduct the operation. Array lookups will almost certainly call several C functions, and the right side of the equation uses at least one temporary variable. Doing those steps one time? Not a big deal. Doing those steps a billion times? Very slow. PHP userland code can't really be optimized all that much by the Zend engine. Steps 3 and 4 could be skipped by using PHP references instead of constantly doing $c[$i], but that's about it -- see the sketch below.
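
        A rough sketch of that reference trick (the variable names here are placeholders, not code from the thread): hoisting the row lookup out of the inner loop means PHP resolves $c[$i] once per row instead of once per cell.

        // sketch: $a, $c, $rows, $cols stand in for your own data
        for ($i = 0; $i < $rows; $i++) {
            $crow = &$c[$i];  // resolve the row once; skips steps 3-4 per cell
            $arow = $a[$i];
            for ($j = 0; $j < $cols; $j++) {
                $crow[$j] = $arow[$j] * 2.0;  // no repeated $c[$i] lookup in here
            }
            unset($crow);  // drop the reference before rebinding on the next row
        }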

        C is faster because it can precalculate all of those checks in advance to avoid making them for every cycle of the inner loop and also doesn't need to do type conversion/coercion and also won't call a bunch of functions in the innermost loop. The tradeoffs are the potential for instability and reduced system security. One wrong/misplaced pointer in C and an application will segfault (segfault == probable security vulnerability).

        I've now got mat_mult() written for both double and zend_long 2D arrays, and I also have max_execution_time timeout support. I still need to add scalar multiplies for completeness and write the requisite test suite to properly validate the implementation. It won't run as fast as libblas does on your computer, but it will be a native implementation.

        sneakyimp If anyone has thoughts on how to determine which libblas.dylib is being invoked, I'd appreciate it.

        You can probably figure that out via strace (or equivalent).

        cubic One wrong/misplaced pointer in C and an application will segfault (segfault == probable security vulnerability).

        I have never managed to get comfortable with pointers, and frequently run into trouble if I try to 'go native' and write some C with pointers.

        I was casually thinking that C could do matrix multiplication extremely quickly if it could simply calculate the memory offset (i * ncols + j) from double *ptr, but it later occurred to me that it might be unlikely that the memory allocated to a matrix would be contiguous. I was wondering how the C code might handle a matrix stored in non-contiguous memory and still be fast?

        cubic You can probably figure that out via strace (or equivalent).

        Oooh that seems very helpful! Will look into that.

        In the meantime, I thought it might be worth pointing out some of the challenges I've had translating octave code to PHP. I used ghostjat/np for performance reasons. It's the one library I've found with performance remotely comparable to octave. Sadly, it has a lot of problems so I'm currently working with my own modified version of it. Perhaps the biggest problem I've faced harks back to the lack of distinction between row vectors and column vectors. I expressed confusion about this earlier in this thread and discussed it with @Weedpacket. I want to re-emphasize this as a point of confusion, and illustrate with the challenge I faced translating this bit of octave code:

            % Vectorized RBF Kernel
            % This is equivalent to computing the kernel on every pair of examples
            X1 = sum(X.^2, 2);
            X2 = sum(model.X.^2, 2)';
            K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
            K = model.kernelFunction(1, 0) .^ K;
            K = bsxfun(@times, model.y', K);
            K = bsxfun(@times, model.alphas', K);

        In this case, X is something like my 1375x1966 training matrix and model.X is some different set with a different number of rows (m) but the same number of features (n). Calculating X1 and X2 is easy with ghostjat/np:

                $x1 = $x->square()->sumRows();
                $x2 = $model['x']->square()->sumRows();

        Calculating K is something of a nightmare. We can do - 2 * X * model.X' pretty easily, although the sequence of operations in our PHP starts to shift around. Remember that matrix multiplication is not commutative:

        // - 2 * X * model.X'
        $K = $x->dot($model['x']->transpose())->multiply(-2);

        Just accomplishing that inner bsxfun operation requires us to embark on looping structures:

        // do the inner bsxfun(plus...)
        // ghostjat has no means to add a ROW vector to a matrix soooo we fake it
        $kshape = $K->getShape();
        $km = $kshape->m;
        $kn = $kshape->n;
        $x2size = $x2->getSize();
        // i are columns, j are rows
        for($i=0; $i<$x2size; $i++) {
            $x2val = $x2->data[$i];
            for($j=0; $j<$km; $j++) {
                // add the ith x2 value to the ith column of the jth row
                $K->data[($j * $kn) + $i] += $x2val;
            }
        }

        I'd like to clarify that bsxfun(@plus, X2, - 2 * X * model.X') is not an especially elaborate operation. If we define a (a substitute for X2) and b (a substitute for the last parameter, a dot product) as follows:

        // a 1 x 2 row vector
        a = [10 20]
        // a 3 x 2 matrix
        b = [1 1; 2 2; 3 3;]
        inner_bsx = bsxfun(@plus, a, b)
            inner_bsx =
        
               11   21
               12   22
               13   23

        We are basically just adding the row vector a to each of the 3 rows in b. I've been unable to identify any matrix sum or add method that can perform this very simple operation, so I had to resort to the looping code above. This quickly gets more maddening. To implement the outer bsxfun operation in that one line of code, we have to run another nested loop pair to sum a column vector (X1 in the real code; c in the example below) into the result of the inner bsxfun operation.

        // a 3 x 1 column vector, same size as X1
        c = [100; 1000; 10000]
        // outer bsxfun
        outer_bsx = bsxfun(@plus, c, inner_bsx)
            outer_bsx =
        
                 111     121
                1012    1022
               10013   10023

        You'll note that octave knows that c is a column vector, so it sums c to each column of inner_bsx. You'll also note that ghostjat/np doesn't distinguish between row and column vectors, nor does it offer any bsxfun operation or methods like sumAsColumn or sumAsRow that one might use to expressly specify how the addition is to be performed. The logic I've seen in some other matrix-operation libs when multiplying a matrix and a vector is to check dimensions and then make a guess about what was intended. I assert this guessing behavior is sorely inadequate, given that order of operations matters in matlab/octave, and given that ambiguity arises when multiplying square-but-non-symmetric matrices.
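
        To make the point concrete, here is a hypothetical sketch of the kind of explicit API I have in mind -- broadcast_add() and its signature are my own invention, not anything ghostjat/np offers. The caller names the orientation instead of the library guessing:

        // hypothetical: add a vector to a flat row-major matrix, orientation stated explicitly
        function broadcast_add(array $flat, int $rows, int $cols, array $vec, string $axis): array {
            if ($axis === 'row') {
                // $vec is a row vector with $cols entries: added to every row
                for ($i = 0; $i < $rows; $i++)
                    for ($j = 0; $j < $cols; $j++)
                        $flat[$i * $cols + $j] += $vec[$j];
            } elseif ($axis === 'col') {
                // $vec is a column vector with $rows entries: $vec[$i] added across row $i
                for ($i = 0; $i < $rows; $i++)
                    for ($j = 0; $j < $cols; $j++)
                        $flat[$i * $cols + $j] += $vec[$i];
            } else {
                throw new \InvalidArgumentException("axis must be 'row' or 'col'");
            }
            return $flat;
        }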

        The difficulties continue. Those six lines or so of octave/matlab yield this function, which was a chore to write (juggling the nested loops and making sure the offsets are calculated correctly, bounds not exceeded, etc.) and doesn't seem to be very efficient at all.

        function get_gaussian_predict_k(Np\matrix $x, array $model) {
        
        	// orig octave code in svmPredict
        	// Vectorized RBF Kernel
        	// This is equivalent to computing the kernel on every pair of examples
        	//X1 = sum(X.^2, 2);
        	//X2 = sum(model.X.^2, 2)';
        	//K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
        	//K = model.kernelFunction(1, 0) .^ K;
        	//K = bsxfun(@times, model.y', K);
        	//K = bsxfun(@times, model.alphas', K);
        
        	$x1 = $x->square()->sumRows();
        	//echo "x1 ", $x1, "\n";
        
        	// we don't need to transpose this because ghostjat/np doesn't distinguish col vs row vectors
        	$x2 = $model['x']->square()->sumRows();
        	//echo "x2 ", $x2, "\n";
        
        	// need to build K.
        	$K = $x->dot($model['x']->transpose())->multiply(-2);
        
        	// do the inner bsxfun(plus...)
        	// ghostjat has no means to add a ROW vector to a matrix soooo we fake it
        	$kshape = $K->getShape();
        	$km = $kshape->m;
        	$kn = $kshape->n;
        	$x2size = $x2->getSize();
        	// $kn should match the size of $x2
        	// sanity check
        	if ($x2size !== $kn) {
        		throw new \Exception("x2 size ($x2size) does not match kn ($kn)");
        	}
        	// i are columns, j are rows
        	for($i=0; $i<$x2size; $i++) {
        		$x2val = $x2->data[$i];
        		for($j=0; $j<$km; $j++) {
        			// add the ith x2 value to the ith column of the jth row
        			$K->data[($j * $kn) + $i] += $x2val;
        		}
        	}
        
        	// do the outer bsxfun(plus...)
        	// ghostjat has no means to add a COLUMN vector soooo we fake it
        	$x1size = $x1->getSize();
        	// $km should match the dimensions of $x1
        	// sanity check
        	if ($x1size !== $km) {
        		throw new \Exception("x1 size ($x1size) does not match km ($km)");
        	}
        	// i are rows, j are columns
        	for($i=0; $i<$x1size; $i++) {
        		$x1val = $x1->data[$i];
        		for($j=0; $j<$kn; $j++) {
        			// add the ith x1 value to the jth column of the ith row
        			//$offset = ($i * $kn) + $j;
        			$K->data[($i * $kn) + $j] += $x1val;
        		}
        	}
        
        	$kf = gaussian_kernel(Np\vector::ar([1]), Np\vector::ar([0]), $model['sigma']);
        	//echo "kf ", $kf, "\n";
        	$K = $K->map(fn($v) => (pow($kf, $v)));
        
        	$mysize = $model['y']->getSize();
        	// $kn should match the size of $model['y']
        	// sanity check
        	if ($mysize !== $kn) {
        		throw new \Exception("model.y size ($mysize) does not match kn ($kn)");
        	}
        	// i are columns, j are rows
        	for($i=0; $i<$mysize; $i++) {
        		$yval = $model['y']->data[$i];
        		for($j=0; $j<$km; $j++) {
        			// multiply the ith y value by the ith column of the jth row
        			$K->data[($j * $kn) + $i] *= $yval;
        		}
        	}
        
        
        	$alphasize = $model['alphas']->getSize();
        	// $kn should match the size of $model['alphas']
        	// sanity check
        	if ($alphasize !== $kn) {
        		throw new \Exception("model.alpha size ($alphasize) does not match kn ($kn)");
        	}
        	// i are columns, j are rows
        	for($i=0; $i<$alphasize; $i++) {
        		$aval = $model['alphas']->data[$i];
        		for($j=0; $j<$km; $j++) {
        			// multiply the ith alpha value by the ith column of the jth row
        			$K->data[($j * $kn) + $i] *= $aval;
        		}
        	}
        
        	return $K;
        }

        Given how tricky it is to get PHP::FFI working with BLAS or OpenBLAS, and how elaborate the PHP effort is here, and the fact that even with all that octave is five times faster, I'm starting to think I'd be better off using exec to delegate these matrix/svm calculations to octave.
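
        In case I do go that route, the delegation itself could be as simple as something like this (a sketch: the script name, data paths, and argument handling are placeholders, not a tested pipeline):

        // sketch: hand the heavy SVM training off to Octave via exec()
        $in  = escapeshellarg(__DIR__ . '/training_data.csv');  // placeholder path
        $out = escapeshellarg(__DIR__ . '/model.mat');          // placeholder path
        exec("octave --no-gui --quiet train_svm.m $in $out 2>&1", $output, $status);
        if ($status !== 0) {
            throw new \RuntimeException("octave failed:\n" . implode("\n", $output));
        }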

        sneakyimp but it later occurred to me that it might be unlikely that the memory allocated to a matrix would be contiguous.

        If you allocate it as a single array the C standard requires it to be contiguously-addressable. Hence the tendency to use mat[i * stride + j] instead of mat[i][j]. The former code is defined as, once the compiler starts working on it, *((mat) + (i * stride + j)).
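
        Incidentally, the FFI buffer that get_cdata() allocates earlier in the thread is exactly such a single contiguous block, so the same stride arithmetic applies there. A small illustration:

        // one contiguous block for an R x C matrix; element (i, j) sits at offset i*C + j
        $rows = 3; $cols = 4;
        $m = \FFI::new('double[' . ($rows * $cols) . ']');
        $i = 2; $j = 1;
        $m[$i * $cols + $j] = 42.0;  // the PHP analogue of *((mat) + (i * stride + j))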

        Okay. I've got a working, validated mat_mult() function in a PHP extension with zero external library dependencies. But wow, do large arrays ever eat RAM for breakfast. That's not my fault. A 1375x1966 2D array in PHP consumes about 136MB RAM. Of course, doing the math, the most optimal memory usage is 1375x1966x8 = 21MB. So there are about 40 bytes of overhead per cell, which sounds about right for a zval.

        I just realized that your benchmark times are not including the time taken for converting between PHP zvals and the format required by the cblas_dgemm() function call AND the time taken to convert back to PHP zvals when done. I'd wager you excluded those conversions to measure the speed of the function call itself, but the surrounding conversion times are important as well and make the actual function call time perhaps a little less impressive.

        At any rate, the new mat_mult() function is about 3x slower than the matrixmult benchmark tool on a very old piece of hardware -- an ancient Intel Celeron running Linux, where if something runs reasonably fast, it's probably very fast everywhere else. The slowdown probably comes from the overhead of dealing with PHP zvals AND allocating memory for 1.9 million zvals for the return array. The matrixmult benchmark tool multiplies 1375x1966x1375 in about 17.4 seconds (I said the machine was slow!) while the mat_mult() PHP function takes 47.9 seconds on the same hardware. So I expect it to run the example matrix in about 7-9 seconds on modern hardware. That's still WAY faster, by a factor of around 20x, than doing a matrix multiply in pure PHP userland. There might be some further optimizations to be made, but I'm pretty satisfied with the results. Also, I haven't had any segfaults/crashes and only ran into one easily fixed bug.

        (In case you are wondering, the aforementioned ancient system is currently my only dedicated Linux system on my LAN that is set up for PHP extension development. I plan on replacing the system this year. I mostly run Windows but writing PHP extensions is most easily done first on Linux and then porting to other OSes after the initial development there is done.)

        The next step is to clean up the extension a little bit and get it ready to publish.

        Having basic mat_add(), mat_sub(), and mat_mult() implementations in PHP core will alleviate a lot of linear algebra performance problems that currently exist in PHP. I don't believe the effort is wasted even if you end up going the exec route because you desire better performance.
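
        For illustration, usage would presumably look something like this (a sketch on my part; I'm assuming 2D-array-in/2D-array-out signatures, which may not match the final qolfuncs API):

        // assumed signatures: each takes two 2D arrays and returns a new 2D array
        $a = [[1.0, 2.0], [3.0, 4.0]];
        $b = [[5.0, 6.0], [7.0, 8.0]];
        $sum  = mat_add($a, $b);   // [[6.0, 8.0], [10.0, 12.0]]
        $diff = mat_sub($a, $b);   // [[-4.0, -4.0], [-4.0, -4.0]]
        $prod = mat_mult($a, $b);  // [[19.0, 22.0], [43.0, 50.0]]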

        cubic A 1375x1966 2D array in PHP consumes about 136MB RAM. Of course, doing the math, the most optimal memory usage is 1375x1966x8 = 21MB. So there are about 40 bytes of overhead per cell, which sounds about right for a zval.

        I understand you need the zval structs for a loosely typed language, but OUCH, that's some pretty serious overhead. I was curious about memory consumption and cooked up this script:

        echo "START\n";
        echo number_format(memory_get_usage()) . "\n";
        
        require __DIR__ . '/vendor/autoload.php';
        use Np\matrix;
        echo "AFTER AUTOLOAD\n";
        echo number_format(memory_get_usage()) . "\n";
        
        $x = [];
        for($i=0; $i<1375; $i++) {
        	$x[$i] = [];
        	for($j=0; $j<1966; $j++) $x[$i][$j] = lcg_value();
        }
        echo "AFTER X GENERATED\n";
        echo number_format(memory_get_usage()) . "\n";
        
        echo "x has " . sizeof($x) . " rows\n";
        echo "x has " . sizeof($x[0]) . " cols\n";
        
        $xmatrix = matrix::ar($x);
        echo "AFTER MATRIX DEFINED\n";
        echo number_format(memory_get_usage()) . "\n";
        
        
        unset($x);
        echo "AFTER X UNSET\n";
        echo number_format(memory_get_usage()) . "\n";
        
        var_dump($xmatrix->data);

        For some reason the 1375 x 1966 native PHP array of floats only consumes about 50MB on my workstation (Ubuntu 20.04):

        START
        398,448
        AFTER AUTOLOAD
        452,696
        AFTER X GENERATED
        51,254,616
        x has 1375 rows
        x has 1966 cols
        AFTER MATRIX DEFINED
        73,111,352
        AFTER X UNSET
        22,309,432
        object(FFI\CData:double*)#4 (1) {
          [0]=>
          float(0.558039688348575)
        }

        Also note that the ghostjat/np class uses an FFI\CData double* array, and its memory consumption appears to be close to the ideal, at about 21MB.

        cubic I just realized that your benchmark times are not including the time taken for converting between PHP zvals and the format required by the cblas_dgemm() function call AND the time taken to convert back to PHP zvals when done.

        In practice I've been loading data from some external format (JSON, CSV, etc.), then defining the ghostjat/np vector/matrix objects and using them for all my manipulations. But here's a simplified script for comparison. I've timed the steps after $xtrain is defined: constructing the matrix object, transposing it, multiplying, and converting back to a PHP array:

        echo 'xtrain has ' . sizeof($xtrain) . " rows\n";
        echo 'xtrain has ' . sizeof($xtrain[0]) . " cols\n";
        
        $start = microtime(true);
        
        $ta = matrix::ar($xtrain);
        $tb = $ta->transpose(); // second operand for the multiply
        
        $tc = $ta->dot($tb);
        
        $arr = $tc->asArray();
        echo "total elapsed time: " . (microtime(true) - $start) . "seconds\n";
        
        echo "rows: " . sizeof($arr) . "\n";
        echo "cols: " . sizeof($arr[0]) . "\n";

        About 550ms:

        xtrain has 1375 rows
        xtrain has 1966 cols
        total elapsed time: 0.55272507667542seconds
        rows: 1375
        cols: 1375

        This particular configuration was using OpenBLAS. The machine has an Intel Core i7-4820K CPU @ 3.70GHz, about 10 years old but still fairly muscular, with 4 physical cores / 8 threads of simultaneous execution.

        cubic Having basic mat_add(), mat_sub(), and mat_mult() implementations in PHP core will alleviate a lot of linear algebra performance problems that currently exist in PHP.

        Your results seem pretty impressive to me, and I think it could be quite helpful to have such functions in PHP without needing sudo access to the machine and jumping through various hoops to link up other modules/libs. I would point out that the really intensive part of my octave/matlab stuff is training the SVM on thousands of sample inputs. Your functions might be very helpful for running the SVM on a single input vector in a production environment. I.e., it takes a lot of heavy CPU time (1-10 minutes) to train an SVM, so that would probably need to be offloaded (e.g. via exec) anyway, whereas running an already-trained SVM to classify an incoming message as ham or spam is much less intensive, but would certainly benefit from fast matrix/vector functions.

          9 days later

          With some effort, I finally got a working build of qolfuncs on my Windows PC against PHP 8.2.1, which gets me as close to PHP master as I plan on getting until I start opening pull requests.

          For 1375x1966 using mat_mult(): 1.82 seconds. Only a tad slower than the matrixmult tool. It's definitely in the zone though. The old and crusty hardware I originally wrote the function on doesn't even stack up to a Core i7.

          Running a homegrown userland implementation on the same hardware: 85.4 seconds. Even factoring in the variance between our hardware, my userland implementation appears to be notably faster than PHP-ML. This means mat_mult() is about 46 times faster than the best userland implementation. There was some initial pushback on the PHP internals list over matrix operations but I think the now-confirmed massive gain in performance will convince most folks that even basic matrix operations should be included in PHP core. Plenty of use-cases also exist.
