I'm not having much luck compiling CBLAS on either MacOS or my Ubuntu workstation. Part of the problem is that the compiler output is Greek to me. If I'm not mistaken, the process appears to build an archive file, lib/cblas_LINUX.a, but the make operation fails. I posted on Stack Overflow seeking help and I'm getting confusing/confused replies suggesting I either install libopenblas instead or install libblas. I thought compiling CBLAS would create a BLAS library?

I've been sniffing around the OpenBLAS library releases and the various zip/tgz archives I've downloaded don't appear to contain any .so files.

Lots of trial and error so far. I've been trying some of the .h files and .a files here but FFI doesn't seem to like any of them except its original blas.h file, modified to refer to the /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so shared lib which happens to be installed on my machine. It also seems like FFI has its own ideas about what a valid blas.h file is. It complains about semicolons and #define statements in the cblas.h file from that netlib site, for instance.

For my own reference: with further googling I found some suggested steps, which may be helpful.
EDIT: those instructions didn't work, either. I'm starting to wonder if perhaps the complaint is related to the Fortran code? Calling make from the CBLAS folder fails and this is the final output:

gcc -I../include -O3 -DADD_ -c c_sblas1.c
gfortran -O3   -c c_sblat1.f
c_sblat1.f:214:48:

  214 |                CALL STEST1(SNRM2TEST(N,SX,INCX),STEMP,STEMP,SFAC)
      |                                                1
Error: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
c_sblat1.f:218:48:

  218 |                CALL STEST1(SASUMTEST(N,SX,INCX),STEMP,STEMP,SFAC)
      |                                                1
Error: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
make[1]: *** [c_sblat1.o] Error 1
make: *** [alltst] Error 2

I wonder if perhaps the BLAS Fortran code is old somehow?

    Soooo I have been able to download and build the latest OpenBLAS on both Ubuntu 20 and MacOS (Catalina, I think? With an Intel chip) and get it to work with ghostjat/np on both platforms.

    The steps, should anyone be curious. These are the Ubuntu steps but should also work with little modification on MacOS -- e.g., download the file with your browser instead of using wget

    #make a dir for openblas:
    mkdir openblas
    cd openblas
    #download latest OpenBlas, extract it, cd into the new dir and make the project.
    # make operation takes a while, and your machine will need a compiler and probably gfortran, etc.:
    wget https://github.com/xianyi/OpenBLAS/releases/download/v0.3.21/OpenBLAS-0.3.21.tar.gz
    tar -xvzf OpenBLAS-0.3.21.tar.gz
    cd OpenBLAS-0.3.21
    make

    Take note of the shared library this builds; it will probably have a name specific to your machine's CPU architecture. In my case, this is it: libopenblas_sandybridgep-r0.3.21.so

    Take note of the full path to that lib:
    /home/sneakyimp/biz/machine-learning/openblas/OpenBLAS-0.3.21/libopenblas_sandybridgep-r0.3.21.so

    We can now put that shared lib path in the blas.h file included with the ghostjat/np PHP code. I repeat here the steps to set up a brand new ghostjat/np file:

    cd ..
    mkdir np
    cd np
    # you may need to get composer, see: https://getcomposer.org/download/
    ./composer.phar require ghostjat/np:dev-main
    #edit the blas.h file that comes with it:
    nano vendor/ghostjat/np/src/core/blas.h

    In the blas.h file, change the FFI_LIB to point to the full path of the OpenBLAS lib you just built:
    #define FFI_LIB "/home/sneakyimp/biz/machine-learning/openblas/OpenBLAS-0.3.21/libopenblas_sandybridgep-r0.3.21.so"

    Now you should be able to run this PHP file, bar.php, which performs a matrix multiplication:

    <?php
    require __DIR__ . '/vendor/autoload.php';
    use Np\matrix;
    
    // NOTE THIS IS MY DATA YOU'LL NEED TO COOK YOUR OWN
    // OR REFER BACK TO OUR PRIOR CORRESPONDENCE
    $data = json_decode(file_get_contents(__DIR__ . '/../training_data_sets.json'), TRUE);
    
    // more trial and error...cooked this up after reading the ghostjat/np source code
    Np\core\blas::$ffi_blas = FFI::load(__DIR__ . '/vendor/ghostjat/np/src/core/blas.h');
    
    $ta = matrix::ar($data['Xtrain']);
    
    $start = microtime(true);
    $tb = $ta->transpose(); // transpose the matrix
    echo "transpose elapsed time: " . (microtime(true) - $start) . "seconds\n";
    
    $start = microtime(true);
    $tc = $ta->dot($tb);
    echo "dot elapsed time: " . (microtime(true) - $start) . "seconds\n";
    
    $arr = $tc->asArray();
    echo "rows: " . sizeof($arr) . "\n";
    echo "cols: " . sizeof($arr[0]) . "\n";
    
    $json = json_encode($arr, JSON_PRETTY_PRINT);
    file_put_contents(__DIR__ . '/np-product.json', $json);
    echo "json md5 is " . md5($json) . "\n";

    While I am delighted that bar.php runs and produces the correct output, I would point out that I'm just using the blas.h file that is distributed with ghostjat/np. I'm still quite concerned that blas.h might be totally mismatched against the latest OpenBlas. I tried using the cblas.h that is distributed with OpenBLAS, but FFI::load complains about it. Here's a typical error:

    PHP Fatal error:  Uncaught FFI\ParserException: ';' expected, got '<STRING>' at line 8 in /home/jaith/biz/machine-learning/np2/bar.php:9
    Stack trace:
    #0 /home/jaith/biz/machine-learning/np2/bar.php(9): FFI::load()
    #1 {main}

    It would appear that FFI has some stunted/limited ability to parse these h files, but its limitations are not very clear, nor am I sure how to convert the OpenBlas header file to something FFI can digest. The documentation offers a little help:

    C preprocessor directives are not supported, i.e. #include, #define and CPP macros do not work, except for special cases listed below.
    The header file should contain a #define statement for the FFI_SCOPE variable, e.g.: #define FFI_SCOPE "MYLIB". Refer to the class introduction for details.
    The header file may contain a #define statement for the FFI_LIB variable to specify the library it exposes. If it is a system library only the file name is required, e.g.: #define FFI_LIB "libc.so.6". If it is a custom library, a relative path is required, e.g.: #define FFI_LIB "./mylib.so".

    There's also a user comment that seems promising:

    Since #include's and #define's other than FFI_LIB and FFI_SCOPE are not supported in the header, you may want to use the C preprocessor to pre-process your headers in order to resolve all #include's and macros.

    I use -D"attribute(ARGS)=" to remove function attributes as well, which are not supported by FFI either.

    This is my script:

    echo '#define FFI_SCOPE "YOUR_SCOPE"' > header-ffi.h
    echo '#define FFI_LIB "/path/to/your_lib.so"' >> header-ffi.h
    cpp -P -C -D"attribute(ARGS)=" header_original >> header-ffi.h

    Unfortunately, I tried it and, although it produced a 15,000-line header-ffi.h when I ran it on cblas.h, this file also produced FFI errors. I've not been able to use any blas.h/cblas.h file from CBLAS or OpenBLAS and get it working. Any advice on how I might do this would be much appreciated.

    I'm also a bit concerned about using OpenBLAS for a couple of reasons:

    • this is a big project, probably overkill. It takes a while to make and the shared lib libopenblas_sandybridgep-r0.3.21.so is 14MB
    • I worry a bit about security. The contributors include quite a few far flung devs and the Chinese Academy of Sciences

    I have been able to build CBLAS on my ubuntu machine by first separately downloading and building BLAS and editing the CBLAS Makefile to refer to the static library it produces in this line:

    BLLIB = /home/sneakyimp/biz/machine-learning/blas/BLAS-3.11.0/blas_LINUX.a

    This won't build on MacOS, complaining about the previously mentioned Fortran stuff, but on Ubuntu 20 it seems to build. Unfortunately, it does not produce any .so shared libraries; it just produces a static library, lib/cblas_LINUX.a. I've tried that with ghostjat/np but it fails.

    Is there some simple flag I can provide to make so that CBLAS will build the shared libraries (i.e., the .so files)? Can anyone suggest an approach to creating a blas.h file for ghostjat/np that better matches either OpenBLAS or CBLAS?

      More trial and error...

      Looks like ghostjat/np requires OpenBLAS rather than CBLAS. Thanks to this little recipe I was finally able to get CBLAS compiled into a cblas_LINUX.so shared lib by making some changes to the CBLAS makefile. I then edited the blas.h file vendor/ghostjat/np/src/core/blas.h to point to this newly built library:

      #define FFI_LIB "/home/sneakyimp/biz/machine-learning/blas/CBLAS/lib/cblas_LINUX.so"

      and tried to run my bar.php file; ghostjat/np is evidently using OpenBLAS functions:

      $ php bar.php
      PHP Fatal error:  Uncaught FFI\Exception: Failed resolving C function 'openblas_set_num_threads' in /home/sneakyimp/biz/machine-learning/np/bar.php:9
      Stack trace:
      #0 /home/sneakyimp/biz/machine-learning/np/bar.php(9): FFI::load()
      #1 {main}
        thrown in /home/sneakyimp/biz/machine-learning/np/bar.php on line 9

      This is disappointing as CBLAS seemed much more compact. I.e., cblas_LINUX.so is only 130K, whereas libopenblas_sandybridgep-r0.3.21.so is a hulking 14MB. Kinda makes one wonder what is in that shared library?

      This now brings me back to my attempts to try and improve the blas.h file that's currently in ghostjat/np. It is quite dissimilar to the h files in either OpenBlas or CBLAS. I've tried manually removing preprocessor directives from the various blas.h and cblas.h files that exist in these libraries, but I can't seem to get FFI to parse the files that result. I tried that approach mentioned in the PHP-FFI documentation and it yielded a giant, 15000-line h file that also barfed.

      Is it safe to use the blas.h that comes with ghostjat/np if I am compiling OpenBlas or CBLAS from source?

      sneakyimp Is it safe to use the blas.h that comes with ghostjat/np if I am compiling OpenBlas or CBLAS from source?

      For the purposes of FFI, I reckon the header file only needs to declare the interface that you're actually using. So you'd be writing it from scratch (starting with the #define FFI_LIB to name the library).

      sneakyimp Kinda makes one wonder what is in that shared library?

      At a brief glance, for a start it looks like the whole of LAPACK is in there as well.
      https://www.openblas.net/

        I am not a C/C++ pro but I cooked up some PHP to parse the ghostjat/np blas.h file and the OpenBLAS cblas.h file and report some stuff. I can't guarantee this grabs all the function declarations, but it seems pretty good:

        // ghostjat file
        $gj_h_file = __DIR__ . '/vendor/ghostjat/np/src/core/blas.h';
        echo "parsing $gj_h_file\n";
        $code = file_get_contents($gj_h_file);
        $pattern = '#^([a-z\*_]+)\s+([a-z0-9_]+)\s*\(([^)]+)\)#ismU';
        $matches = null;
        $mcount = preg_match_all($pattern, $code, $matches);
        echo "$mcount functions declared in ghostjat\n";
        $gj_fns = $matches[2];
        
        // there are several blas.h files in OpenBlas, i just parse the first
        //openblas/OpenBLAS-0.3.21/cblas.h
        //openblas/OpenBLAS-0.3.21/lapack-netlib/CBLAS/include/cblas.h
        //openblas/OpenBLAS-0.3.21/relapack/src/blas.h
        $ob_h_file = realpath(__DIR__ . '/../openblas/OpenBLAS-0.3.21/cblas.h');
        echo "parsing $ob_h_file\n";
        $code = file_get_contents($ob_h_file);
        $pattern = '#^([a-z\*_]+)\s+([a-z0-9_]+)\s*\(([^)]+)\)#ismU';
        $matches = null;
        $mcount = preg_match_all($pattern, $code, $matches);
        echo "$mcount functions declared in openblas\n";
        $ob_fns = $matches[2];
        
        $gj_not_ob = array_diff($gj_fns, $ob_fns);
        echo sizeof($gj_not_ob) . " fns in gj but not ob\n";
        //echo implode("\n", $gj_not_ob), "\n";
        
        $ob_not_gj = array_diff($ob_fns, $gj_fns);
        echo sizeof($ob_not_gj) . " fns in ob but not gj\n";
        //echo implode("\n", $ob_not_gj), "\n";

        OpenBlas appears to declare some 200 functions, and FFI doesn't like a lot of the stuff in there. I've not yet managed any version of this file that survives FFI::load.

        parsing /home/sneakyimp/biz/machine-learning/np/vendor/ghostjat/np/src/core/blas.h
        44 functions declared in ghostjat
        parsing /home/sneakyimp/biz/machine-learning/openblas/OpenBLAS-0.3.21/cblas.h
        201 functions declared in openblas
        0 fns in gj but not ob
        157 fns in ob but not gj

        I haven't bothered to compare return types or parameter declarations, but it looks like every function in the ghostjat/np blas.h file is accounted for in the OpenBLAS cblas.h file. I have a dim inkling this is good, and imagine it means that FFI will locate all the functions declared in blas.h in the libopenblas_sandybridgep-r0.3.21.so shared library and, conversely, that we aren't declaring functions in PHP that don't exist in the library. It seems like it'd be bad if our PHP code thought there was some function in the library that wasn't actually there and we tried to jump to the wrong memory location? I really have no idea, though.

          Well, the header file is used to define the object returned by FFI::cdef: which methods and types it exposes. If they're not in the header they're not part of the object's interface, and trying to call something else would be a plain old "Call to undefined method" error. The basic examples have the header declaration within the PHP code, instead of using the external file/preloader stuff used by FFI::load. The latter is more for when you want to provide a full-featured OpenBLAS interface to PHP users as part of your application. If you're only going to call dgemm then that and a couple of enums (CBLAS_ORDER, CBLAS_TRANSPOSE) are all you need to declare. And if dgemm is all you're using, you might be able to compile a library from source code containing just that function after translating the Fortran to C yourself (or just write a matrix multiplication routine from scratch and use the BLAS code as a guideline).
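
          For instance, a bare-bones FFI::cdef along those lines might look like the sketch below -- an illustration only, not the ghostjat/np header. It assumes a CBLAS-style libblas.so is resolvable on your system (adjust the name/path to whatever BLAS .so you actually have) and declares the enum parameters as plain ints, which is what they amount to at the calling-convention level:

          <?php
          // Declare only what we call: cblas_dgemm.
          $blas = FFI::cdef('
              void cblas_dgemm(int Order, int TransA, int TransB,
                               int M, int N, int K, double alpha, const double *A, int lda,
                               const double *B, int ldb, double beta, double *C, int ldc);
          ', 'libblas.so'); // adjust to whatever BLAS .so you actually have

          // values copied from the CBLAS_ORDER / CBLAS_TRANSPOSE enums
          define('CBLAS_ROW_MAJOR', 101);
          define('CBLAS_NO_TRANS', 111);

          // 2x3 times 3x2 = 2x2, row-major, flattened into FFI buffers
          $a = FFI::cast('double *', FFI::new('double[6]'));
          $b = FFI::cast('double *', FFI::new('double[6]'));
          $c = FFI::cast('double *', FFI::new('double[4]'));
          foreach ([1, 2, 3, 4, 5, 6] as $i => $v) { $a[$i] = $v; }
          foreach ([7, 8, 9, 10, 11, 12] as $i => $v) { $b[$i] = $v; }

          $blas->cblas_dgemm(CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS,
                             2, 2, 3, 1.0, $a, 3, $b, 2, 0.0, $c, 2);
          echo $c[0], ' ', $c[1], "\n", $c[2], ' ', $c[3], "\n"; // expect 58 64 / 139 154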

          8 days later

          While the OpenBlas performance is certainly lightning fast, I'm beginning to have serious doubts about how to distribute any code I might write which requires distribution of a libopenblas.so file. I downloaded the openblas source and built it on my Ubuntu 20 workstation:

          mkdir ~/openblas
          cd ~/openblas
          wget https://github.com/xianyi/OpenBLAS/releases/download/v0.3.21/OpenBLAS-0.3.21.tar.gz
          tar -xvzf OpenBLAS-0.3.21.tar.gz
          cd OpenBLAS-0.3.21
          make

          Because of the openblas configuration, this yields a file which apparently has a name specific to my intel CPU's architecture, libopenblas_sandybridgep-r0.3.21.so.

          I uploaded this libopenblas_sandybridgep-r0.3.21.so file and my bar.php along with its dependencies to a freshly set up Ubuntu 22.04 virtual machine (an Amazon EC2 instance with PHP 8.1 installed). I modified the bar.php script and blas.h file to refer to the correct file locations on the EC2 VM and tried to run it, but it hates the .so file:

          $ php bar.php
          PHP Fatal error:  Uncaught FFI\Exception: Failed loading '/home/sneakyimp/np/libopenblas_sandybridgep-r0.3.21.so' in /home/sneakyimp/np/bar.php:9
          Stack trace:
          #0 /home/sneakyimp/np/bar.php(9): FFI::load()
          #1 {main}
            thrown in /home/sneakyimp/np/bar.php on line 9

          It would appear that OpenBLAS doesn't build shared libraries that are at all portable, at least not by default. I suspect this might be due to highly optimized code, e.g., assembly targeting a specific CPU.

          I also note that this freshly established Ubuntu 22.04 VM lacks the tools to build OpenBLAS. Attempting the download & make steps above on the VM yields a complaint:

          $ make
          Command 'make' not found, but can be installed with:
          sudo apt install make        # version 4.3-4.1build1, or
          sudo apt install make-guile  # version 4.3-4.1build1

          SUMMARY: Portability of OpenBLAS is a problem. It appears to build a lib specific not just to the OS but also to the CPU architecture. I can't even build a shared lib on my U20 workstation that'll run on another machine running U22, never mind other Linux distros or MacOS or Windows. Also, the tools to build the OpenBLAS source code are not installed by default; to even build the source, you'd have to install the tools. It'd be completely impractical to try and precompile a lib .so file for every common architecture/distro/release to include as part of some Composer package. I might add that the .so lib file is 14 megabytes and it takes quite a while to compile -- a minute perhaps? Two minutes?

          I wonder just how narrowly compatible a particular OpenBLAS lib is? My workstation has an Intel(R) Core(TM) i7-4820K CPU on Ubuntu 20, and the Amazon EC2 Ubuntu 22.04 instance reports an Intel(R) Xeon(R) CPU E5-2676 v3. Is there some way to compile for wider compatibility? I wonder how NumPy deals with this?

            I've been lurking on this thread for the past couple of weeks while I've been busy testing matrix multiplication algorithms written in pure ANSI C/C++:

            https://github.com/cubiclesoft/matrix-multiply

            If you've got a few minutes and don't mind running a stranger's code, I'd like to see what output you get for:

            matrixmult bench n 6 1 1375 1966 1375
            matrixmult bench n 8 1 1375 1966 1375
            matrixmult bench n 9 1 1375 1966 1375
            matrixmult bench c 6 1 1375 1966 1375
            matrixmult bench c 8 1 1375 1966 1375
            matrixmult bench c 9 1 1375 1966 1375

            I get this as output on my computer:

            matrixmult bench n 6 1 1375 1966 1375
            1.702430
            
            matrixmult bench n 8 1 1375 1966 1375
            1.718223
            
            matrixmult bench n 9 1 1375 1966 1375
            1.491469
            
            matrixmult bench c 6 1 1375 1966 1375
            1.972144
            
            matrixmult bench c 8 1 1375 1966 1375
            1.742858
            
            matrixmult bench c 9 1 1375 1966 1375
            1.781659

            So about 1.5 to 2 seconds for your test array size of 1375x1966 on my computer. Obviously not as fast as CBLAS but it compiles and runs and is roughly 100 times faster than the fastest pure PHP userland implementation you've attempted to use. And should hopefully be in the ballpark of NumPy's runtime.

            CBLAS might be the fastest BUT getting it to build, based on watching both your AND previously Christoph's (cmb69) attempts to build it:

            https://github.com/RubixML/Tensor/issues/22

            reveals that Tensor/CBLAS/LAPACK is unlikely to build in certain environments and is a pain in the neck to build in all the others. The RubixML Tensor project also appears to be largely dead:

            https://github.com/RubixML/Tensor/issues/32

            You might have also noticed that FFI only works with the PHP CLI. The regular web server versions (web SAPIs) of PHP do not have FFI compiled in for security reasons.

            As to building in the cloud, you need to install the "build-essential" package to get gcc, make, etc. As to architectures for OpenBLAS, the clue is in the name "Sandy Bridge." Intel has different microarchitectures and it looks like BLAS/CBLAS/OpenBLAS targets each one very specifically. As to why BLAS runs so fast, there's a second clue in the path where you find it: pthreads. BLAS starts multiple threads and splits up the workload across them. You either run longer (single threaded) OR draw more wattage from the wall (multithreaded). However, if multiple users are using the same machine, multithreading might actually be a hindrance rather than a help.

            As to the question about NumPy, the official NumPy website says their implementation is written in pure C. So they probably aren't using BLAS/CBLAS/LAPACK, which explains why it takes a couple of seconds of CPU to complete instead of 0.08 seconds.

            There is some good news though. I've been working on something for the past couple of months: A new PHP extension that is actually a hodgepodge of functions I'm calling the "Quality of Life Improvement Functions" extension. I figured I was basically done until you suddenly showed up on the PHP Internals list with your comments about matrix multiplication. The next step is to take one of the matrix multiplication algorithms and implement it into the extension. The extension itself already has a bunch of random, fully tested functions that look like someone barfed up 15 different ideas and threw them all into one extension...because that's pretty much what happened. So what's a couple more functions?

            By the way, transposing an array can already be accomplished with array_map():

            array_map(null, ...$arraytotranspose);   // The ... prefix expands the array as individual parameters.
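
            For example, on a tiny matrix (a quick illustration of my own):

            <?php
            $m = [
                [1, 2, 3],
                [4, 5, 6],
            ];
            $t = array_map(null, ...$m);
            print_r($t); // [[1, 4], [2, 5], [3, 6]]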

            I'm not certain if array_map() is the fastest way to transpose an array in PHP but it is already built into the language. The extension I've made aims to be free of superfluous stuff that will likely be rejected by the core devs because I intend to push for adoption into PHP core. So a dedicated mat_transpose() function won't fit into that model since array_map() can already handle that operation natively.

            Multiplying two matrices together in PHP will also need some careful thought about how to avoid someone using all available CPU time on the web SAPIs without making a bunch of unnecessary calls to a timer routine to avoid exceeding the web SAPI timeout period.

            Hope this at least answers some of your questions. But it will probably be a few more weeks until I have the extension ready for any sort of use since the holidays are upon us.

            cubic If you've got a few minutes and don't mind running a stranger's code, I'd like to see what output you get for...

            Here are my results:

            matrixmult bench n 6 1 1375 1966 1375
            2.204161
            matrixmult bench n 8 1 1375 1966 1375
            2.312766
            matrixmult bench n 9 1 1375 1966 1375
            2.729322
            matrixmult bench c 6 1 1375 1966 1375
            2.254812
            matrixmult bench c 8 1 1375 1966 1375
            2.321597
            matrixmult bench c 9 1 1375 1966 1375
            2.810967

            My machine has a 4-core/8-thread Intel Core i7-4820K CPU @ 3.70GHz and 28GB of RAM. I note that the Ubuntu system monitor shows one thread/core max out while it's running. Assuming you can efficiently feed 1375 x 1966 ints/floats into it, those performance numbers are quite respectable. As it currently is, I have no idea what it's multiplying.

            Tensor doesn't seem to work for me. My workstation has PHP 8.2 and pecl stubbornly refused to work:

            $ sudo pecl install tensor
            [sudo] password for sneakyimp: 
            WARNING: channel "pecl.php.net" has updated its protocols, use "pecl channel-update pecl.php.net" to update
            pecl/tensor requires PHP (version >= 7.4.0, version <= 8.1.99), installed version is 8.2.0
            No valid packages found
            install failed

            Trying to compile it directly also fails:

            /home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c: In function ‘zephir_function_exists’:
            /home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c:285:101: warning: comparison between pointer and integer
              285 |  if (zend_hash_str_exists(CG(function_table), Z_STRVAL_P(function_name), Z_STRLEN_P(function_name)) != NULL) {
                  |                                                                                                     ^~
            /home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c: In function ‘zephir_function_exists_ex’:
            /home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c:301:76: warning: comparison between pointer and integer
              301 |  if (zend_hash_str_exists(CG(function_table), function_name, function_len) != NULL) {
                  |                                                                            ^~
            /home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c: In function ‘zephir_get_arg’:
            /home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c:571:9: error: too many arguments to function ‘zend_forbid_dynamic_call’
              571 |     if (zend_forbid_dynamic_call("func_get_arg()") == FAILURE) {
                  |         ^~~~~~~~~~~~~~~~~~~~~~~~
            In file included from /usr/include/php/20220829/main/php.h:35,
                             from /home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c:16:
            /usr/include/php/20220829/Zend/zend_API.h:782:39: note: declared here
              782 | static zend_always_inline zend_result zend_forbid_dynamic_call(void)
                  |                                       ^~~~~~~~~~~~~~~~~~~~~~~~
            make: *** [Makefile:213: kernel/main.lo] Error 1

            This guy offers an update that will compile, and I built and installed it. You get the tensor extension listed in the installed extensions and you get various classes defined, but it doesn't seem to work when you try to multiply my data (click that issue to see my results). It seems to work sometimes, but the dimensions of the result are wrong. Other times it segfaults.

            cubic You might have also noticed that FFI only works with the PHP CLI. The regular web server versions (web SAPIs) of PHP do not have FFI compiled in for security reasons.

            I think it's possible to get FFI working in a web server environment. The docs have some detail about preloading FFI definitions and libs 'at PHP startup':

            FFI definition parsing and shared library loading may take significant time. It is not useful to do it on each HTTP request in a Web environment. However, it is possible to preload FFI definitions and libraries at PHP startup, and to instantiate FFI objects when necessary. Header files may be extended with special FFI_SCOPE defines (e.g. #define FFI_SCOPE "foo"; the default scope is "C") and then loaded by FFI::load() during preloading. This leads to the creation of a persistent binding, that will be available to all the following requests through FFI::scope(). Refer to the complete PHP/FFI/preloading example for details.
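
            So the rough shape would be something like this sketch (untested by me; the file name, path, and the "blas" scope are placeholders, and the header would need its own #define FFI_SCOPE "blas" and FFI_LIB lines):

            <?php
            // preload.php -- pointed to by opcache.preload in php.ini.
            // Parses blas.h once at PHP startup; the header's FFI_SCOPE ("blas")
            // becomes a persistent binding available to later requests.
            FFI::load(__DIR__ . '/blas.h');

            // ...then, in ordinary request-time code (ffi.enable permitting):
            // $blas = FFI::scope('blas');
            // $blas->cblas_dgemm(/* ... */);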

            cubic As to building in the cloud, you need to install the "build-essential" package to get gcc, make, etc. As

            Yep. And this is more confusing/complicated than just sudo apt install php-module-whatever and therefore might be a greater impediment.

            cubic As to architectures for OpenBlas, the clue is in the name "Sandy bridge." Intel has different microcode architectures and it looks like BLAS/CBLAS/OpenBlas targets each one very specifically

            Curiously, my CPU is Ivy Bridge, not Sandy Bridge. I might add that OpenBLAS in its current incarnation is MASSIVE (the compiled shared lib is 14MB!) and features contributions from Russian devs with yandex addresses as well as the Institute of Software Chinese Academy of Sciences.

            cubic As to why BLAS runs so fast, there's a second clue in the path where you find it: pthreads. BLAS starts multiple threads and splits up the workload across them. You either run longer (single threaded) OR draw more wattage from the wall (multithreaded). However, if multiple users are using the same machine, multithreading might actually be a hindrance rather than a help.

            Good observations here.

            cubic A new PHP extension that is actually a hodgepodge of functions I'm calling the "Quality of Life Improvement Functions" extension.

            I admire your ambition, but the 'hodgepodge' nature might prove an impediment to acceptance by the curmudgeonly community.

            cubic As to the question about NumPy, the official NumPy website says their implementation is written in pure C. So they probably aren't using BLAS/CBLAS/LAPACK, which explains why it takes a couple of seconds of CPU to complete instead of 0.08 seconds.

            I have a vague recollection of seeing some BLAS libs in various python paths on my Mac. Sorry I do not have additional information. I suspect NumPy is doing something ffi-like, but can't offer any authoritative detail.

            cubic transposing an array can already be accomplished with array_map():

            I saw someone else doing that and that's how my own sad matrix multiplier was defining transpose().

            cubic Multiplying two matrices together in PHP will also need some careful thought about how to avoid someone using all available CPU time on the web SAPIs without making a bunch of unnecessary calls to a timer routine to avoid exceeding the web SAPI timeout period.

            You might keep a running tally of how much time a calculation has taken and examine the system load average? Maybe let the calling scope specify some limits on how greedy it can be? I've done something like this in an image-munging asset pipeline.
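
            Something along these lines, maybe (purely illustrative -- the time budget, load threshold, and dummy data here are all made up):

            <?php
            // Bail out of a long-running calculation if we exceed a time budget
            // or the machine's 1-minute load average climbs too high.
            $rows    = array_fill(0, 500, array_fill(0, 500, 1.0)); // dummy matrix data
            $budget  = 5.0;  // seconds we allow ourselves
            $maxLoad = 4.0;  // load-average cap (machine-specific)
            $start   = microtime(true);
            $sum     = 0.0;

            foreach ($rows as $i => $row) {
                $sum += array_sum($row); // stand-in for one row's worth of real work
                if ((microtime(true) - $start) > $budget || sys_getloadavg()[0] > $maxLoad) {
                    throw new RuntimeException("aborted after row $i: time budget or load exceeded");
                }
            }
            echo "done, sum = $sum\n";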

            cubic Hope this at least answers some of your questions. But it will probably be a few more weeks until I have the extension ready for any sort of use since the holidays are upon us.

            It's nice to have input from someone who seems to understand and appreciate the issue, so thanks for your input here. I'd very much like to feed my matrix into your multiplier and get a matrix/array (a php array of arrays) out so I can check the results.

            In related news, I tried to take my JSON training data (the matrix I've been trying to multiply) and use FANN to train a neural network spam filter. The process involves exporting my JSON data to the format described in the documentation's xor example, creating a FANN object, and then calling fann_train_on_file to train the neural net using this data file. I started this script running about 48 hours ago on an Amazon EC2 instance and it's still running. "Fast" Artificial Neural Network? Harrumph!

              sneakyimp As it currently is, I have no idea what it's multiplying.

              It's just dummy data in the arrays. Basically, matrixmult bench n 6 1 1375 1966 1375 says multiply a 1375x1966 sized array with a 1966x1375 array filled with dummy data (1, 2, 3, 4, 5, 6, 7...) one time using the non-contiguous memory version of Implementation 6. The validate routine confirms that all 18 variants are functioning correctly (i.e. multiplying two matrices works properly).

              At any rate, it looks like Implementation 6 is the best overall winner for non-contiguous memory. It's what I was leaning toward anyway because it is simpler than the later Implementations.

              sneakyimp I admire your ambition, but the 'hodgepodge' nature might prove an impediment to acceptance by the curmudgeonly community.

              Yup. I'm rather aware of how difficult the PHP core developers are to work with at times. The extension isn't intended to be a standalone extension but rather a series of functions - each one addressing its own unique pain point and able to be isolated/relegated to its own RFC as needed. I figure it is better to make one extension with 15 functions instead of 15 separate extensions that just implement one function each. As such, it's more a proof of concept extension with working code than something intended to be run in production environments. That is why adding a matrix multiplication function isn't a big deal.

              As a side note, if you are looking for a Bayesian style spam filter, Spambayes is written in Python. Should be relatively easy to port the training and classifier portions to PHP.

              cubic At any rate, it looks like Implementation 6 is the best overall winner for non-contiguous memory. It's what I was leaning toward anyway because it is simpler than the later Implementations.

              I'm curious what the plan is to feed PHP data structures into your matrixmult extension. If I'm not mistaken, these C programs need some kind of memory address (i.e., a pointer) fed to them so they know where to find the data structures to be multiplied.

              I would point out that pretty much every matrix multiplication scheme I've examined in this thread tends to have a purpose-built class: you feed it a PHP array of identically-sized PHP arrays (vectors) via the constructor or some static method. I note that ghostjat/np delegates the heavy lifting of the constructor to its setData method, which looks like simple matrix operations, but the $data property of its matrix class appears to have 'flattened' a 2d PHP array representing a matrix into a single array of values while it separately keeps track of row and column counts. The constructor makes use of FFI::cast somehow. I'm still not sure how our PHP data gets handed off as C-style data objects to FFI for the OpenBLAS operations. I have a dim inkling that $data is of type FFI\CData, which can evidently be directly manipulated in this case with PHP array operators like square brackets, and that we can pass this $data object directly to our FFI-wrapped C functions. There's a lot of inheritance and functions delegating to other functions in ghostjat/np, but it looks like my matrix multiplication gets handled by the dotMatrix method, defined in a PHP trait, which delegates the multiplication to a static method, blas::gemm, which is just a PHP wrapper around the FFI-loaded cblas_dgemm function declared in the blas.h file. This elaborate approach works with blinding speed with OpenBLAS, but CBLAS doesn't like the multithreading declarations.
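
              As a sanity check on that inkling, here's a tiny sketch of my own (not ghostjat/np's code) showing that an FFI::new buffer really is an FFI\CData you can poke at with square brackets:

              <?php
              $buf = FFI::new('double[4]');   // flat storage for a 2x2 matrix
              $buf[0] = 1.0; $buf[1] = 2.0;   // row 0
              $buf[2] = 3.0; $buf[3] = 4.0;   // row 1
              var_dump($buf instanceof FFI\CData); // bool(true)
              echo $buf[2], "\n";                  // 3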

              I wonder how CBLAS might compare, performance-wise, to your matrixmult? These instructions successfully compile CBLAS into a shared library on my Ubuntu workstation (but not my MacOS laptop). This shared lib doesn't match up with the ghostjat/np blas.h file, but we might generate an FFI-friendly cblas.h file using this suggestion from the docs that I mentioned above:

              echo '#define FFI_SCOPE "YOUR_SCOPE"' > header-ffi.h
              echo '#define FFI_LIB "/path/to/your_lib.so"' >> header-ffi.h
              cpp -P -C -D"attribute(ARGS)=" header_original >> header-ffi.h

              Armed with cblas_LINUX.so compiled directly from BLAS and CBLAS, and a new cblas.h file distilled with this bit of cpp code, we might be able to cook up a test. I'm still trying to get my head around how this stuff all fits together.

              cubic As such, it's more a proof of concept extension with working code than something intended to be run in production environments

              I appreciate your desire to tinker with the PHP guts and make your life better. At the same time, something like OpenBLAS seems massively overkill. CBLAS, on the other hand, is quite compact, and may actually improve performance for these matrix operations and others. An improvement from n^3 to n^2.37286 could be of great benefit*.

              *My FANN script is still running as I type this.

              cubic As a side note, if you are looking for a Bayesian style spam filter, Spambayes is written in Python. Should be relatively easy to port the training and classifier portions to PHP.

              That might be of interest, but I'm already well dug into this attempt to create an efficient SVM/ANN spam filter. It's for learning as much as anything. I hope others might appreciate my sharing of the particulars I've uncovered. Personally, I'm appalled at the lack of fast matrix ops in PHP.

                Interesting development! I compiled CBLAS according to this recipe that I linked previously. I took the resulting cblas_LINUX.so file and copied it to a new folder and downloaded the blas.h file from the ghostjat/np library:

                mkdir ~/cblas
                cp cblas_LINUX.so ~/cblas/
                cd ~/cblas
                curl https://raw.githubusercontent.com/ghostjat/Np/main/src/core/blas.h > blas.h

                I edited this line in blas.h:

                #define FFI_LIB "libblas.so"

                to point to the location of my cblas_LINUX.so

                #define FFI_LIB "/home/sneakyimp/cblas/cblas_LINUX.so"

                I started with a simple bar.php file, which loads that blas.h file via FFI:

                <?php
                $ffi_blas = FFI::load(__DIR__ . '/blas.h');

                This script complains because blas.h declares functions that do not exist in cblas_LINUX.so like so:

                PHP Fatal error:  Uncaught FFI\Exception: Failed resolving C function 'openblas_get_config' in /home/sneakyimp/biz/cblas/bar.php

                So I methodically ran it over and over again, each time deleting the function declaration in blas.h that caused a complaint, yielding this blas.h file:
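
                In hindsight, that delete-and-retry loop could probably be scripted. Something like the sketch below, which leans on the 'Failed resolving C function' wording shown above; the regex for stripping a declaration is crude and the whole thing is untested, so treat it as an outline only:

                <?php
                $header = __DIR__ . '/blas.h';
                while (true) {
                    try {
                        FFI::load($header);
                        echo "loaded OK\n";
                        break;
                    } catch (\FFI\Exception $e) {
                        // expect messages like: Failed resolving C function 'openblas_get_config'
                        if (!preg_match("/Failed resolving C function '([A-Za-z0-9_]+)'/", $e->getMessage(), $m)) {
                            throw $e; // some other problem; give up
                        }
                        $fn  = $m[1];
                        $src = file_get_contents($header);
                        // strip the whole declaration: return type, name, argument list, semicolon
                        $src = preg_replace('/^[ \t]*[A-Za-z_][\w \t\*]*\b' . preg_quote($fn, '/') . '\s*\([^;]*\)\s*;[ \t]*$/m',
                                            '', $src, 1, $count);
                        if (!$count) {
                            throw new RuntimeException("could not find declaration for $fn");
                        }
                        file_put_contents($header, $src);
                        echo "removed $fn, retrying...\n";
                    }
                }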

                #define FFI_SCOPE "blas"
                #define FFI_LIB "/home/sneakyimp/cblas/cblas_LINUX.so"
                typedef  size_t CBLAS_INDEX_t;
                
                size_t cblas_idamax(const int N, const float *X, const int incX);
                double cblas_dsdot(const int N, const float *X, const int incX, const float *Y,
                                   const int incY);
                double cblas_ddot(const int N, const double *X, const int incX,
                                  const double *Y, const int incY);
                double cblas_dnrm2(const int N, const double *X, const int incX);
                double cblas_dasum(const int N, const double *X, const int incX);
                
                void cblas_dswap(const int N, double *X, const int incX, 
                                 double *Y, const int incY);
                void cblas_dcopy(const int N, const double *X, const int incX, 
                                 double *Y, const int incY);
                void cblas_daxpy(const int N, const double alpha, const double *X,
                                 const int incX, double *Y, const int incY);
                void cblas_drotg(double *a, double *b, double *c, double *s);
                void cblas_drotmg(double *d1, double *d2, double *b1, const double b2, double *P);
                void cblas_drot(const int N, double *X, const int incX,
                                double *Y, const int incY, const double c, const double  s);
                void cblas_drotm(const int N, double *X, const int incX,
                                double *Y, const int incY, const double *P);
                void cblas_dscal(const int N, const double alpha, double *X, const int incX);
                void cblas_dgemv(const enum CBLAS_ORDER order,
                                 const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
                                 const double alpha, const double *A, const int lda,
                                 const double *X, const int incX, const double beta,
                                 double *Y, const int incY);
                void cblas_dgbmv(const enum CBLAS_ORDER order,
                                 const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
                                 const int KL, const int KU, const double alpha,
                                 const double *A, const int lda, const double *X,
                                 const int incX, const double beta, double *Y, const int incY);
                void cblas_dtrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                 const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                 const int N, const double *A, const int lda, 
                                 double *X, const int incX);
                void cblas_dtbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                 const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                 const int N, const int K, const double *A, const int lda, 
                                 double *X, const int incX);
                void cblas_dtpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                 const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                 const int N, const double *Ap, double *X, const int incX);
                void cblas_dtrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                 const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                 const int N, const double *A, const int lda, double *X,
                                 const int incX);
                void cblas_dtbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                 const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                 const int N, const int K, const double *A, const int lda,
                                 double *X, const int incX);
                void cblas_dtpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                 const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                 const int N, const double *Ap, double *X, const int incX);
                void cblas_dsymv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                 const int N, const double alpha, const double *A,
                                 const int lda, const double *X, const int incX,
                                 const double beta, double *Y, const int incY);
                void cblas_dsbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                 const int N, const int K, const double alpha, const double *A,
                                 const int lda, const double *X, const int incX,
                                 const double beta, double *Y, const int incY);
                void cblas_dspmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                 const int N, const double alpha, const double *Ap,
                                 const double *X, const int incX,
                                 const double beta, double *Y, const int incY);
                void cblas_dger(const enum CBLAS_ORDER order, const int M, const int N,
                                const double alpha, const double *X, const int incX,
                                const double *Y, const int incY, double *A, const int lda);
                void cblas_dsyr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                const int N, const double alpha, const double *X,
                                const int incX, double *A, const int lda);
                void cblas_dspr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                const int N, const double alpha, const double *X,
                                const int incX, double *Ap);
                void cblas_dsyr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                const int N, const double alpha, const double *X,
                                const int incX, const double *Y, const int incY, double *A,
                                const int lda);
                void cblas_dspr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                const int N, const double alpha, const double *X,
                                const int incX, const double *Y, const int incY, double *A);
                void cblas_dgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
                                 const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
                                 const int K, const double alpha, const double *A,
                                 const int lda, const double *B, const int ldb,
                                 const double beta, double *C, const int ldc);
                void cblas_dsymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                                 const enum CBLAS_UPLO Uplo, const int M, const int N,
                                 const double alpha, const double *A, const int lda,
                                 const double *B, const int ldb, const double beta,
                                 double *C, const int ldc);
                void cblas_dsyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
                                 const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
                                 const double alpha, const double *A, const int lda,
                                 const double beta, double *C, const int ldc);
                void cblas_dsyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
                                  const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
                                  const double alpha, const double *A, const int lda,
                                  const double *B, const int ldb, const double beta,
                                  double *C, const int ldc);
                void cblas_dtrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                                 const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA,
                                 const enum CBLAS_DIAG Diag, const int M, const int N,
                                 const double alpha, const double *A, const int lda,
                                 double *B, const int ldb);
                void cblas_dtrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                                 const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA,
                                 const enum CBLAS_DIAG Diag, const int M, const int N,
                                 const double alpha, const double *A, const int lda,
                                 double *B, const int ldb);

                I then emulated the code in ghostjat/np to create the necessary FFI\CData objects that I could feed to the cblas_dgemm function to perform matrix multiplication using CBLAS:

                <?php
                // load CBLAS lib
                $ffi_blas = FFI::load(__DIR__ . '/blas.h');
                
                // load my data
                $data = json_decode(file_get_contents(__DIR__ . '/../machine-learning/training_data_sets.json'), TRUE);
                
                // this is the 1375 x 1966 matrix i want to multiply times its transpose
                $m1 = $data['Xtrain'];
                
                $start = microtime(true);
                // transpose it
                $m1t = array_map(null, ...$m1);
                echo "transpose elapsed time: " . (microtime(true) - $start) . "seconds\n";
                
                // wrote this by examining code in ghostjat/np, converts PHP array/array obj to FFI\CData
                function get_cdata($m) {
                	// this check should be more specific, but works for now
                	if (!is_array($m) || !is_array($m[0])) {
                		throw new Exception('param must be array of arrays');
                	}
                
                	$rowcount = sizeof($m);
                	$colcount = sizeof($m[0]);
                	$flat_m = [];
                	$size = $rowcount * $colcount;
                	$cdata = \FFI::cast('double *', \FFI::new("double[$size]"));
                	$i = 0;
                	foreach($m as $row) {
                		foreach($row as $val) {
                			$flat_m[$i] = $val;
                			$cdata[$i] = $val;
                			$i++;
                		}
                	}
                
                	return [
                		'rows' => $rowcount,
                		'cols' => $colcount,
                		'flat_m' => $flat_m,
                		'cdata' => $cdata
                	];
                }
                
                // FFI function needs CData objects as params
                $m1_c = get_cdata($m1);
                $m1t_c = get_cdata($m1t);
                
                // these constants mean something to CBLAS. copied from ghostjat/np
                define('CBLAS_ROW_MAJOR', 101);
                define('CBLAS_NO_TRANS', 111);
                
                // some dimensions of our expected output matrix
                $mr_size = $m1_c['rows'] * $m1t_c['cols'];
                $mr_rows = $m1_c['rows'];
                $mr_cols = $m1t_c['cols'];
                $mr_cdata = \FFI::cast('double *', \FFI::new("double[$mr_size]"));
                
                $start = microtime(true);
                // INVOKE CBLAS VIA FFI. cblas_dgemm returns void; the result is written into $mr_cdata via pointer
                $some_val = $ffi_blas->cblas_dgemm(CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS, $m1_c['rows'], $m1t_c['cols'], $m1_c['cols'], 1.0, $m1_c['cdata'], $m1_c['cols'], $m1t_c['cdata'], $m1t_c['cols'], 0.0, $mr_cdata, $mr_cols);
                echo "cblas_dgemm elapsed time: " . (microtime(true) - $start) . "seconds\n";
                
                // convert resulting cdata to php object
                $idx = 0;
                $mr = [];
                for($i=0; $i<$mr_rows; $i++) {
                	$mr[$i] = [];
                	for($j=0; $j<$mr_cols; $j++) {
                		$mr[$i][$j] = $mr_cdata[$idx];
                		$idx++;
                	}
                }
                
                
                echo "rows: " . sizeof($mr) . "\n";
                echo "cols: " . sizeof($mr[0]) . "\n";
                
                $json = json_encode($mr, JSON_PRETTY_PRINT);
                file_put_contents(__DIR__ . '/cblas-product.json', $json);
                echo "json md5 is " . md5($json) . "\n";

                It runs in 250ms, and appears to yield the same (correct?) output as all the other functions I've tested:

                transpose elapsed time: 0.056370973587036seconds
                cblas_dgemm elapsed time: 0.28544211387634seconds
                rows: 1375
                cols: 1375
                json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e

                  I tried just uploading this little folder from my Ubuntu 20 workstation running PHP 8.2 to another workstation running Ubuntu 22.04 with PHP 8.1. I adjusted the paths accordingly for this machine but it does not work:

                  $ php bar.php
                  transpose elapsed time: 0.14942502975464seconds
                  php: symbol lookup error: /home/sneakyimp/cblas/cblas_LINUX.so: undefined symbol: dgemm_

                  I also tried uploading this little folder to an Ubuntu 20.04 machine running PHP 7.4 and it had the same problem:

                  $ php bar.php
                  transpose elapsed time: 0.12622904777527seconds
                  php: symbol lookup error: /home/sneakyimp/cblas/cblas_LINUX.so: undefined symbol: dgemm_

                  On this latter machine, I even tried to recompile BLAS and CBLAS, and it still barfs.

                    If anyone has any suggestions about how to get CBLAS-via-FFI working on these virtual machines, I'd certainly appreciate advice. I have no idea what this error means:

                    php: symbol lookup error: /home/sneakyimp/cblas/cblas_LINUX.so: undefined symbol: dgemm_

                    NOTE: I have posted a detailed question about this on Stack Overflow. I feel that if I can get CBLAS to work here, that might just be the best approach. OpenBLAS is giant and bloated. CBLAS, although not quite as fast, is still pretty darn fast, and much more compact. I've been googling for hours and can't seem to make a whit of progress on this.

                      I'm pleased (but also bemused) to report that I got BLAS invoked from PHP via FFI--bemused because I'm unable to get it working with the self-compiled cblas_LINUX.so. I was able to get a speedy BLAS matrix multiplication working on all my various Ubuntu machines by first installing libblas-dev:

                      sudo apt install libblas-dev

                      and by using the cblas.h file that I had previously reduced by removing OpenBLAS functions etc. from the ghostjat/np blas.h file. Here is the entire reduced file, which I have saved as cblas.h:

                      #define FFI_SCOPE "blas"
                      #define FFI_LIB "libblas.so"
                      typedef  size_t CBLAS_INDEX_t;
                      
                      
                      size_t cblas_idamax(const int N, const float *X, const int incX);
                      
                      double cblas_dsdot(const int N, const float *X, const int incX, const float *Y,
                                         const int incY);
                      double cblas_ddot(const int N, const double *X, const int incX,
                                        const double *Y, const int incY);
                      double cblas_dnrm2(const int N, const double *X, const int incX);
                      double cblas_dasum(const int N, const double *X, const int incX);
                      
                      void cblas_dswap(const int N, double *X, const int incX, 
                                       double *Y, const int incY);
                      void cblas_dcopy(const int N, const double *X, const int incX, 
                                       double *Y, const int incY);
                      void cblas_daxpy(const int N, const double alpha, const double *X,
                                       const int incX, double *Y, const int incY);
                      void cblas_drotg(double *a, double *b, double *c, double *s);
                      void cblas_drotmg(double *d1, double *d2, double *b1, const double b2, double *P);
                      void cblas_drot(const int N, double *X, const int incX,
                                      double *Y, const int incY, const double c, const double  s);
                      void cblas_drotm(const int N, double *X, const int incX,
                                      double *Y, const int incY, const double *P);
                      void cblas_dscal(const int N, const double alpha, double *X, const int incX);
                      void cblas_dgemv(const enum CBLAS_ORDER order,
                                       const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
                                       const double alpha, const double *A, const int lda,
                                       const double *X, const int incX, const double beta,
                                       double *Y, const int incY);
                      void cblas_dgbmv(const enum CBLAS_ORDER order,
                                       const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
                                       const int KL, const int KU, const double alpha,
                                       const double *A, const int lda, const double *X,
                                       const int incX, const double beta, double *Y, const int incY);
                      void cblas_dtrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                       const int N, const double *A, const int lda, 
                                       double *X, const int incX);
                      void cblas_dtbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                       const int N, const int K, const double *A, const int lda, 
                                       double *X, const int incX);
                      void cblas_dtpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                       const int N, const double *Ap, double *X, const int incX);
                      void cblas_dtrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                       const int N, const double *A, const int lda, double *X,
                                       const int incX);
                      void cblas_dtbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                       const int N, const int K, const double *A, const int lda,
                                       double *X, const int incX);
                      void cblas_dtpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                       const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                                       const int N, const double *Ap, double *X, const int incX);
                      void cblas_dsymv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                       const int N, const double alpha, const double *A,
                                       const int lda, const double *X, const int incX,
                                       const double beta, double *Y, const int incY);
                      void cblas_dsbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                       const int N, const int K, const double alpha, const double *A,
                                       const int lda, const double *X, const int incX,
                                       const double beta, double *Y, const int incY);
                      void cblas_dspmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                       const int N, const double alpha, const double *Ap,
                                       const double *X, const int incX,
                                       const double beta, double *Y, const int incY);
                      void cblas_dger(const enum CBLAS_ORDER order, const int M, const int N,
                                      const double alpha, const double *X, const int incX,
                                      const double *Y, const int incY, double *A, const int lda);
                      void cblas_dsyr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                      const int N, const double alpha, const double *X,
                                      const int incX, double *A, const int lda);
                      void cblas_dspr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                      const int N, const double alpha, const double *X,
                                      const int incX, double *Ap);
                      void cblas_dsyr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                      const int N, const double alpha, const double *X,
                                      const int incX, const double *Y, const int incY, double *A,
                                      const int lda);
                      void cblas_dspr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                                      const int N, const double alpha, const double *X,
                                      const int incX, const double *Y, const int incY, double *A);
                      void cblas_dgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
                                       const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
                                       const int K, const double alpha, const double *A,
                                       const int lda, const double *B, const int ldb,
                                       const double beta, double *C, const int ldc);
                      void cblas_dsymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                                       const enum CBLAS_UPLO Uplo, const int M, const int N,
                                       const double alpha, const double *A, const int lda,
                                       const double *B, const int ldb, const double beta,
                                       double *C, const int ldc);
                      void cblas_dsyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
                                       const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
                                       const double alpha, const double *A, const int lda,
                                       const double beta, double *C, const int ldc);
                      void cblas_dsyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
                                        const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
                                        const double alpha, const double *A, const int lda,
                                        const double *B, const int ldb, const double beta,
                                        double *C, const int ldc);
                      void cblas_dtrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                                       const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA,
                                       const enum CBLAS_DIAG Diag, const int M, const int N,
                                       const double alpha, const double *A, const int lda,
                                       double *B, const int ldb);
                      void cblas_dtrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side,
                                       const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA,
                                       const enum CBLAS_DIAG Diag, const int M, const int N,
                                       const double alpha, const double *A, const int lda,
                                       double *B, const int ldb);

                      NOTE: FFI_LIB is simply 'libblas.so' without any path information. On Linux this should resolve to the shared library installed by the libblas-dev package.

                      Lastly, here's a simplified script that loads my JSON file and multiplies my training set by its transpose. The cblas_dgemm call executes in about 284ms on my workstation. Slower on the virtual machines for whatever reason.

                      
                      $data = json_decode(file_get_contents(__DIR__ . '/../machine-learning/training_data_sets.json'), TRUE);
                      
                      // more trial and error...cooked this up after reading the ghostjat/np source code
                      $ffi_blas = FFI::load(__DIR__ . '/cblas.h');
                      
                      
                      $m1 = $data['Xtrain'];
                      
                      
                      $start = microtime(true);
                      $m1t = array_map(null, ...$m1);
                      echo "transpose elapsed time: " . (microtime(true) - $start) . "seconds\n";
                      
                      // lifted from ghostjat/np, converts matrix to obj with FFI\CData
                      function get_cdata($m) {
                      	// this check should be more specific, but works for now
                      	if (!is_array($m) || !is_array($m[0])) {
                      		throw new Exception('param must be array of arrays');
                      	}
                      
                      	$rowcount = sizeof($m);
                      	$colcount = sizeof($m[0]);
                      	$flat_m = [];
                      	$size = $rowcount * $colcount;
                      	$cdata = \FFI::cast('double *', \FFI::new("double[$size]"));
                      	$i = 0;
                      	foreach($m as $row) {
                      		foreach($row as $val) {
                      			$flat_m[$i] = $val;
                      			$cdata[$i] = $val;
                      			$i++;
                      		}
                      	}
                      
                      	return [
                      		'rows' => $rowcount,
                      		'cols' => $colcount,
                      		'flat_m' => $flat_m,
                      		'cdata' => $cdata
                      	];
                      }
                      
                      $m1_c = get_cdata($m1);
                      $m1t_c = get_cdata($m1t);
                      
                      define('CBLAS_ROW_MAJOR', 101);
                      define('CBLAS_NO_TRANS', 111);
                      
                      $mr_size = $m1_c['rows'] * $m1t_c['cols'];
                      $mr_rows = $m1_c['rows'];
                      $mr_cols = $m1t_c['cols'];
                      $mr_cdata = \FFI::cast('double *', \FFI::new("double[$mr_size]"));
                      
                      $start = microtime(true);
                      // cblas_dgemm is declared void; it writes the product into the $mr_cdata buffer
                      $some_val = $ffi_blas->cblas_dgemm(CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS, $m1_c['rows'], $m1t_c['cols'], $m1_c['cols'], 1.0, $m1_c['cdata'], $m1_c['cols'], $m1t_c['cdata'], $m1t_c['cols'], 0.0, $mr_cdata, $mr_cols);
                      echo "cblas_dgemm elapsed time: " . (microtime(true) - $start) . "seconds\n";
                      
                      
                      // convert cdata to php object
                      $idx = 0;
                      $mr = [];
                      for($i=0; $i<$mr_rows; $i++) {
                      	$mr[$i] = [];
                      	for($j=0; $j<$mr_cols; $j++) {
                      		$mr[$i][$j] = $mr_cdata[$idx];
                      		$idx++;
                      	}
                      }
                      
                      
                      echo "rows: " . sizeof($mr) . "\n";
                      echo "cols: " . sizeof($mr[0]) . "\n";
                      
                      $json = json_encode($mr, JSON_PRETTY_PRINT);
                      file_put_contents(__DIR__ . '/cblas-product.json', $json);
                      echo "json md5 is " . md5($json) . "\n";

                      The result, $mr, is precisely what the other matrix multiply/dot functions returned. The output:

                      transpose elapsed time: 0.057420015335083seconds
                      cblas_dgemm elapsed time: 0.28933596611023seconds
                      rows: 1375
                      cols: 1375
                      json md5 is 4a33a230c1cf8bb6ab75dbd0d3d7bf0e
                        10 days later

                        I've got native PHP matrix addition (mat_add()) and subtraction (mat_sub()) routines working well in the extension now. Had to come up with a new way to cleanly access PHP hash tables because none of the existing C macros would work for independent iteration over two distinct arrays.

                        Assuming all goes well, I hope to have a native PHP matrix multiply implemented sometime this week.

                        Slower on the virtual machines for whatever reason.

                        Virtual machines are...virtual. AWS and other cloud providers supply vCores, which are slices of CPU time. Your VM is sharing system resources with other VMs on the same hardware. Some providers let users reserve whole CPU cores for just their VM, but those are much more expensive than the ultra-cheap VMs where CPU cores are time-shared. I believe some low-quality providers also overprovision RAM to cram more VMs onto one piece of hardware, which can cause very-low-usage VMs to be pushed out to swap and can lead to memory thrashing if those VMs get busy and the system becomes overcommitted. Your local machine is almost certainly more powerful and capable than any cheap VM you reserve in the cloud.

                        cubic I've got native PHP matrix addition (mat_add()) and subtraction (mat_sub()) routines working well in the extension now. Had to come up with a new way to cleanly access PHP hash tables because none of the existing C macros would work for independent iteration over two distinct arrays.

                        I suspect (but have no real evidence) that the slowness of my pure PHP implementations of these matrix operations is due to the need to traverse some data structure to locate row $i and column $j. I likewise suspect (but have no evidence) that the implementations written in C benefit from some way of quickly locating element i,j. I admire your courage for digging into the PHP source, and am somewhat curious what you are doing differently from the existing PHP codebase. Seems to me PHP itself might benefit from your approach.

                        cubic Assuming all goes well, I hope to have a native PHP matrix multiply implemented sometime this week.

                        That sounds pretty exciting!

                        I have some possibly relevant news to report regarding real-world performance of BLAS. As previously posted, I managed to get BLAS-via-FFI working in PHP by installing libblas-dev. I even got it working on my MacOS laptop. I'm not sure which library is actually handling PHP FFI's cblas_* calls, but I'm loading a modified blas.h file from the ghostjat repo. In particular, the FFI_LIB declaration is:

                        #define FFI_LIB "libblas.dylib"

                        and I seem to have removed the various openblas declarations that were causing complaints before. I don't really know which libblas.dylib is being summoned, but I see that I've installed openblas and lapack via brew. A search of the filesystem shows libblas.dylib files in numerous places. It appears to be part of both lapack and openblas, but the most official-looking location is /usr/lib/libblas.dylib, which is a symbolic link to some file in /System/Library/Frameworks that did not show up in my filesystem 'find' results. If anyone has thoughts on how I determine which libblas.dylib is being invoked, I'd appreciate it.

                        My simple bar.php file, which uses the ghostjat/np lib (which in turn uses FFI to load libblas), multiplies my 1375x1966 sample matrix by its transpose in about 115ms on this MacOS laptop. This is very fast compared to the other methods we've been looking at here.

                        Encouraged by this quick result, I laboriously converted the original svm-training matlab/octave code to PHP. This was laborious because expressions that are really tidy and efficient in octave are not tidy or efficient in PHP. I'll try to provide some specifics here shortly -- in particular, the fact that ghostjat/np doesn't distinguish between column vectors and row vectors required me to write some ungainly looping code to probe into the raw data of these objects. The resulting code is not intuitive and at times seems pretty kludgy, but it works in reasonable time frames. By 'works in reasonable time frames' I mean that I can train a spam-filter SVM using my matrices in anywhere from 1 to about 10 minutes. I've yet to thoroughly test everything, but I can train the spam filter on a 1375x1966 training set, typically in 2-3 minutes, and then use the resulting SVM on another test set with about 450 messages and get the results back in 1-3 seconds. Depending on linear or gaussian kernel, and depending on the parameters chosen for C and sigma, I can find configurations that achieve about 95% accuracy in the spam/ham evaluation. The results in aggregate look the same as what I get in Octave.

                        HOWEVER, Octave still seems to be about 5 or 6 times faster. My most grueling application so far is to sweep through these values of C and sigma, training a total of 441 different gaussian SVMs to see which values might be optimal:

                        // range from 1.4^-10 to 1.4^10
                        $cvals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
                        $sigma_vals = [0.034571613033608,0.048400258247051,0.067760361545871,0.09486450616422,0.13281030862991,0.18593443208187,0.26030820491462,0.36443148688047,0.51020408163265,0.71428571428571,1,1.4,1.96,2.744,3.8416,5.37824,7.529536,10.5413504,14.75789056,20.661046784,28.9254654976];
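                        Incidentally, those hard-coded values are just 1.4^-10 through 1.4^10, so the sweep range could be generated instead of typed out. A tiny sketch (assuming you want the same 21 values for both parameters):

                        // build 1.4^-10 .. 1.4^10 (21 values) rather than hard-coding them
                        $cvals = [];
                        for ($e = -10; $e <= 10; $e++) {
                            $cvals[] = pow(1.4, $e);
                        }
                        $sigma_vals = $cvals;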

                        My PHP code chose C=2.744 and sigma=2.744, which achieved correct classification of 95.633% of test messages. It took 67202.334 seconds to run this PHP code. The Octave code trains the 441 SVMs on the same message data, varying the same C and sigma parameters in a much faster 12440.5 seconds. Octave chose the same optimal C and sigma, reporting the same success rate. The PHP code does a bit more to accumulate stats, etc., but this extra work adds much less than 5000 seconds in total.

                        Using the stats from my PHP code, I was able to generate this surface plot. The z-axis indicates the accuracy of my SVM in predicting spam for various C and sigma values. Ironically, I had to export a CSV file from PHP and import it into Octave to make this plot. One will note that if I want to focus on the more promising regions of the graph, I must adjust $cvals and $sigma_vals and run the whole 18-hour process again. Or I can run it in Octave in about 4 hours.

                        sneakyimp I suspect (but have no real evidence) that the slowness of my pure PHP implementations of these matrix operations is due to the need to traverse some data structure to locate row $i and column $j.

                        PHP userland does a TON of work for every reference to something like $c[$i][$j]. It goes something like this:

                        1. Check to see if $c exists.
                        2. Sees the array/string dereference (i.e. the brackets). Is $c an array or a string? If so, continue.
                        3. Check to see if $i exists. Is it an integer or a string? If so, continue.
                        4. Check to see if $i is in bounds of the array and/or the key exists. If so, continue.
                        5. Sees the array/string dereference (i.e. the brackets). Is $c[$i] an array or string? If so, continue.
                        6. Check to see if $j exists. Is it an integer or a string? If so, continue.
                        7. Check to see if $j is in bounds of the array and/or the key exists. If so, continue.

                        Now PHP has located the correct slot in the arrays/hash tables to conduct the operation. Array lookups will almost certainly call several C functions, and the right-hand side of the assignment uses at least one temporary variable. Doing those steps one time? Not a big deal. Doing those steps a billion times? Very slow. PHP userland code can't really be optimized all that much by the Zend engine. Steps 3 and 4 could be skipped by using PHP references instead of constantly doing $c[$i], but that's about it.
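
                        Here's a minimal toy sketch of that reference trick (my own example, not code from the extension); the row lookup happens once per row instead of once per element:

                        // toy 2x3 matrices as nested PHP arrays
                        $a = [[1, 2, 3], [4, 5, 6]];
                        $b = [[10, 20, 30], [40, 50, 60]];
                        $rows = count($a);
                        $cols = count($a[0]);

                        // naive: $c[$i][$j] re-resolves $c[$i] (steps 1-4 above) on every inner iteration
                        $c = [];
                        for ($i = 0; $i < $rows; $i++) {
                            for ($j = 0; $j < $cols; $j++) {
                                $c[$i][$j] = $a[$i][$j] + $b[$i][$j];
                            }
                        }

                        // with references: the row lookups happen once per row
                        $c2 = [];
                        for ($i = 0; $i < $rows; $i++) {
                            $arow = &$a[$i];
                            $brow = &$b[$i];
                            $c2[$i] = [];
                            $crow = &$c2[$i];
                            for ($j = 0; $j < $cols; $j++) {
                                $crow[$j] = $arow[$j] + $brow[$j];
                            }
                            unset($arow, $brow, $crow); // avoid reusing stale references on the next iteration
                        }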

                        C is faster because it can hoist all of those checks out of the inner loop, doesn't need to do type conversion/coercion, and doesn't call a bunch of functions in the innermost loop. The tradeoffs are the potential for instability and reduced system security. One wrong/misplaced pointer in C and an application will segfault (segfault == probable security vulnerability).

                        I've now got mat_mult() written for both double and zend_long 2D arrays and also have max_execution_time timeout support. I still need to add scalar multiplies for completionist purposes and write the requisite test suite to properly validate the implementation. It won't run as fast as libblas does on your computer but it will be a native implementation.

                        sneakyimp If anyone has thoughts on how I determine which libblas.dylib is being invoked, I'd appreciate it.

                        You can probably figure that out via strace (or equivalent).

                        cubic One wrong/misplaced pointer in C and an application will segfault (segfault == probable security vulnerability).

                        I have never managed to get comfortable with pointers, and frequently run into trouble if I try to 'go native' and write some C with pointers.

                        I was casually thinking that C could do matrix multiplication extremely quickly if it could simply calculate the memory offset (i * ncols + j) from a double *ptr, but it later occurred to me that the memory allocated to a matrix might not be contiguous. I was wondering how C code might handle a matrix stored in non-contiguous memory and still be fast.
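
                        For what it's worth, if the matrix lives in one contiguous, row-major buffer (like the flat double[$size] that get_cdata() allocates above), the offset arithmetic really is that simple -- a toy sketch:

                        // a 2x3 matrix stored row-major in one flat, contiguous array
                        $rows = 2;
                        $cols = 3;
                        $flat = [1, 2, 3,
                                 4, 5, 6];

                        // element (i, j) lives at offset i * cols + j
                        $i = 1;
                        $j = 2;
                        echo $flat[($i * $cols) + $j], "\n"; // prints 6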

                        cubic You can probably figure that out via strace (or equivalent).

                        Oooh that seems very helpful! Will look into that.

                        In the meantime, I thought it might be worth pointing out some of the challenges I've had translating octave code to PHP. I used ghostjat/np for performance reasons. It's the one library I've found with performance remotely comparable to octave. Sadly, it has a lot of problems so I'm currently working with my own modified version of it. Perhaps the biggest problem I've faced harks back to the lack of distinction between row vectors and column vectors. I expressed confusion about this earlier in this thread and discussed it with @Weedpacket. I want to re-emphasize this as a point of confusion, and illustrate with the challenge I faced translating this bit of octave code:

                            % Vectorized RBF Kernel
                            % This is equivalent to computing the kernel on every pair of examples
                            X1 = sum(X.^2, 2);
                            X2 = sum(model.X.^2, 2)';
                            K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
                            K = model.kernelFunction(1, 0) .^ K;
                            K = bsxfun(@times, model.y', K);
                            K = bsxfun(@times, model.alphas', K);

                        In this case, X is something like my 1375x1966 training matrix and model.X is some different set with a different number of rows (m) but the same number of features (n). Calculating X1 and X2 is easy with ghostjat/np:

                                $x1 = $x->square()->sumRows();
                                $x2 = $model['x']->square()->sumRows();

                        Calculating K is something of a nightmare. We can do - 2 * X * model.X' pretty easily, although the sequence of operations in our PHP starts to shift around. Remember that matrix multiplication is not commutative, so operand order matters:

                        // - 2 * X * model.X'
                        $K = $x->dot($model['x']->transpose())->multiply(-2);

                        Just to accomplish that inner bsxfun operation requires us to embark on looping structures:

                        // do the inner bsxfun(plus...)
                        // ghostjat has no means to add a ROW vector to a matrix soooo we fake it
                        $kshape = $K->getShape();
                        $km = $kshape->m;
                        $kn = $kshape->n;
                        $x2size = $x2->getSize();
                        // i are columns, j are rows
                        for($i=0; $i<$x2size; $i++) {
                            $x2val = $x2->data[$i];
                            for($j=0; $j<$km; $j++) {
                                // add the ith x2 value to the ith column of the jth row
                                $K->data[($j * $kn) + $i] += $x2val;
                            }
                        }

                        I'd like to clarify that bsxfun(@plus, X2, - 2 * X * model.X') is not an especially elaborate operation. If we define a (a substitute for X2) and b (a substitute for the last parameter, a dot product) thusly:

                        // a 1 x 2 row vector
                        a = [10 20]
                        // a 3 x 2 matrix
                        b = [1 1; 2 2; 3 3;]
                        inner_bsx = bsxfun(@plus, a, b)
                            inner_bsx =
                        
                               11   21
                               12   22
                               13   23

                        We are basically just adding the row vector a to each of the 3 rows in b. I've been unable to identify any matrix sum or add method which can perform this very simple operation, so I had to resort to the looping code above. This quickly gets more maddening. To implement the outer bsxfun operation in that one line of code, we have to run another nested loop pair to add a column vector (X1 in the real code, c in this toy example) to the result of the inner bsxfun operation.

                        // a 3 x 1 column vector, same size as X1
                        c = [100; 1000; 10000]
                        // outer bsxfun
                        outer_bsx = bsxfun(@plus, c, inner_bsx)
                            outer_bsx =
                        
                                 111     121
                                1012    1022
                               10013   10023

                        You'll note that Octave knows that c is a column vector, so it adds c to each column of inner_bsx. You'll also note that ghostjat/np doesn't distinguish between row and column vectors. Nor does it offer any bsxfun operation or methods like sumAsColumn or sumAsRow that one might use to expressly specify how the addition is to be performed. The logic I've seen in some of these other matrix-operation libs when multiplying a matrix and a vector is to check dimensions and then make a guess about what was intended. I assert this guessing behavior is sorely inadequate, given that operand order matters in matlab/octave, and given that the guess is ambiguous when the matrix is square (but not symmetric).
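
                        For reference, here's a minimal plain-PHP sketch of those two broadcast adds using nested arrays (the helper names here are made up, not part of ghostjat/np):

                        // bsxfun(@plus, rowvec, M): add a row vector to every row of M
                        function add_row_vector(array $m, array $row): array {
                            foreach ($m as $i => $r) {
                                foreach ($r as $j => $v) {
                                    $m[$i][$j] = $v + $row[$j];
                                }
                            }
                            return $m;
                        }

                        // bsxfun(@plus, colvec, M): add a column vector to every column of M
                        function add_col_vector(array $m, array $col): array {
                            foreach ($m as $i => $r) {
                                foreach ($r as $j => $v) {
                                    $m[$i][$j] = $v + $col[$i];
                                }
                            }
                            return $m;
                        }

                        $b = [[1, 1], [2, 2], [3, 3]];
                        $inner = add_row_vector($b, [10, 20]);                 // [[11,21],[12,22],[13,23]]
                        $outer = add_col_vector($inner, [100, 1000, 10000]);   // [[111,121],[1012,1022],[10013,10023]]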

                        The difficulties continue. Those six lines or so of octave/matlab yield this function, which was a chore to write (juggling the nested loops and making sure the offsets are calculated correctly, bounds not exceeded, etc) and doesn't seem to be very efficient at all.

                        function get_gaussian_predict_k(Np\matrix $x, array $model) {
                        
                        	// orig octave code in svmPredict
                        	// Vectorized RBF Kernel
                        	// This is equivalent to computing the kernel on every pair of examples
                        	//X1 = sum(X.^2, 2);
                        	//X2 = sum(model.X.^2, 2)';
                        	//K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
                        	//K = model.kernelFunction(1, 0) .^ K;
                        	//K = bsxfun(@times, model.y', K);
                        	//K = bsxfun(@times, model.alphas', K);
                        
                        	$x1 = $x->square()->sumRows();
                        	//echo "x1 ", $x1, "\n";
                        
                        	// we don't need to transpose this because ghostjat/np doesn't distinguish col vs row vectors
                        	$x2 = $model['x']->square()->sumRows();
                        	//echo "x2 ", $x2, "\n";
                        
                        	// need to build K.
                        	$K = $x->dot($model['x']->transpose())->multiply(-2);
                        
                        	// do the inner bsxfun(plus...)
                        	// ghostjat has no means to add a ROW vector to a matrix soooo we fake it
                        	$kshape = $K->getShape();
                        	$km = $kshape->m;
                        	$kn = $kshape->n;
                        	$x2size = $x2->getSize();
                        	// $kn should match the size of $x2
                        	// sanity check
                        	if ($x2size !== $kn) {
                        		throw new \Exception("x2 size ($x2size) does not match kn ($kn)");
                        	}
                        	// i are columns, j are rows
                        	for($i=0; $i<$x2size; $i++) {
                        		$x2val = $x2->data[$i];
                        		for($j=0; $j<$km; $j++) {
                        			// add the ith x2 value to the ith column of the jth row
                        			$K->data[($j * $kn) + $i] += $x2val;
                        		}
                        	}
                        
                        	// do the outer bsxfun(plus...)
                        	// ghostjat has no means to add a COLUMN vector soooo we fake it
                        	$x1size = $x1->getSize();
                        	// $km should match the dimensions of $x1
                        	// sanity check
                        	if ($x1size !== $km) {
                        		throw new \Exception("x1 size ($x1size) does not match km ($km)");
                        	}
                        	// i are rows, j are columns
                        	for($i=0; $i<$x1size; $i++) {
                        		$x1val = $x1->data[$i];
                        		for($j=0; $j<$kn; $j++) {
                        			// add the ith x1 value to the jth column of the ith row
                        			//$offset = ($i * $kn) + $j;
                        			$K->data[($i * $kn) + $j] += $x1val;
                        		}
                        	}
                        
                        	$kf = gaussian_kernel(Np\vector::ar([1]), Np\vector::ar([0]), $model['sigma']);
                        	//echo "kf ", $kf, "\n";
                        	$K = $K->map(fn($v) => (pow($kf, $v)));
                        
                        	$mysize = $model['y']->getSize();
                        	// $kn should match the size of $model['y']
                        	// sanity check
                        	if ($mysize !== $kn) {
                        		throw new \Exception("model.y size ($mysize) does not match kn ($kn)");
                        	}
                        	// i are columns, j are rows
                        	for($i=0; $i<$mysize; $i++) {
                        		$yval = $model['y']->data[$i];
                        		for($j=0; $j<$km; $j++) {
                        			// multiply the ith y value by the ith column of the jth row
                        			$K->data[($j * $kn) + $i] *= $yval;
                        		}
                        	}
                        
                        
                        	$alphasize = $model['alphas']->getSize();
                        	// $kn should match the size of $model['alphas']
                        	// sanity check
                        	if ($alphasize !== $kn) {
                        		throw new \Exception("model.alpha size ($alphasize) does not match kn ($kn)");
                        	}
                        	// i are columns, j are rows
                        	for($i=0; $i<$alphasize; $i++) {
                        		$aval = $model['alphas']->data[$i];
                        		for($j=0; $j<$km; $j++) {
                        			// multiply the ith alpha value by the ith column of the jth row
                        			$K->data[($j * $kn) + $i] *= $aval;
                        		}
                        	}
                        
                        	return $K;
                        }

                        Given how tricky it is to get PHP::FFI working with BLAS or OpenBLAS, how elaborate the PHP effort is here, and the fact that, even with all that, Octave is still about five times faster, I'm starting to think I'd be better off using exec to delegate these matrix/svm calculations to octave.
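
                        If I go that route, the delegation itself is not much code. A rough sketch of the idea (the train_svm.m script name and the JSON-in/JSON-out convention here are hypothetical, just to show the shape of it):

                        // stand-in for the real training matrix
                        $matrix = [[1, 2], [3, 4]];

                        // write the input where a (hypothetical) train_svm.m script can read it
                        $in  = tempnam(sys_get_temp_dir(), 'svm_in_');
                        $out = tempnam(sys_get_temp_dir(), 'svm_out_');
                        file_put_contents($in, json_encode($matrix));

                        // run octave in batch mode; train_svm.m would read argv(), do the work, write JSON to $out
                        $cmd = sprintf('octave-cli -q train_svm.m %s %s 2>&1',
                            escapeshellarg($in), escapeshellarg($out));
                        exec($cmd, $output_lines, $status);
                        if ($status !== 0) {
                            throw new Exception("octave failed:\n" . implode("\n", $output_lines));
                        }

                        $result = json_decode(file_get_contents($out), TRUE);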