cubic If you've got a few minutes and don't mind running a stranger's code, I'd like to see what output you get for...
Here are my results:
matrixmult bench n 6 1 1375 1966 1375
2.204161
matrixmult bench n 8 1 1375 1966 1375
2.312766
matrixmult bench n 9 1 1375 1966 1375
2.729322
matrixmult bench c 6 1 1375 1966 1375
2.254812
matrixmult bench c 8 1 1375 1966 1375
2.321597
matrixmult bench c 9 1 1375 1966 1375
2.810967
My machine's CPU is a 4-core/8-thread Intel Core i7-4820K CPU @ 3.70GHz. I've got 28GB of RAM. I note that the Ubuntu system monitor shows one thread/core maxed out while it's running. Assuming you can efficiently feed 1375 x 1966 ints/floats into it, those performance numbers are quite respectable. As it currently is, I have no idea what it's multiplying.
Tensor doesn't seem to work for me. My workstation has PHP 8.2 and pecl stubbornly refused to work:
$ sudo pecl install tensor
[sudo] password for sneakyimp:
WARNING: channel "pecl.php.net" has updated its protocols, use "pecl channel-update pecl.php.net" to update
pecl/tensor requires PHP (version >= 7.4.0, version <= 8.1.99), installed version is 8.2.0
No valid packages found
install failed
Trying to compile it directly also fails:
/home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c: In function ‘zephir_function_exists’:
/home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c:285:101: warning: comparison between pointer and integer
285 | if (zend_hash_str_exists(CG(function_table), Z_STRVAL_P(function_name), Z_STRLEN_P(function_name)) != NULL) {
| ^~
/home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c: In function ‘zephir_function_exists_ex’:
/home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c:301:76: warning: comparison between pointer and integer
301 | if (zend_hash_str_exists(CG(function_table), function_name, function_len) != NULL) {
| ^~
/home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c: In function ‘zephir_get_arg’:
/home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c:571:9: error: too many arguments to function ‘zend_forbid_dynamic_call’
571 | if (zend_forbid_dynamic_call("func_get_arg()") == FAILURE) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/php/20220829/main/php.h:35,
from /home/sneakyimp/biz/machine-learning/tensor/Tensor/ext/kernel/main.c:16:
/usr/include/php/20220829/Zend/zend_API.h:782:39: note: declared here
782 | static zend_always_inline zend_result zend_forbid_dynamic_call(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~
make: *** [Makefile:213: kernel/main.lo] Error 1
This guy offers an update that will compile and I made/installed it. You get the tensor extension listed in the installed extensions and you get various classes defined, but it doesn't seem to work when you try to multiply my data (click that issue to see my results). It seems to work sometimes, but the dimensions of the result are wrong. Other times it segfaults.
cubic You might have also noticed that FFI only works with the PHP CLI. The regular web server versions (web SAPIs) of PHP do not have FFI compiled in for security reasons.
I think it's possible to get FFI working in a web server environment. The docs have some detail about preloading FFI definitions and libs 'at PHP startup':
FFI definition parsing and shared library loading may take significant time. It is not useful to do it on each HTTP request in a Web environment. However, it is possible to preload FFI definitions and libraries at PHP startup, and to instantiate FFI objects when necessary. Header files may be extended with special FFI_SCOPE defines (e.g. #define FFI_SCOPE "foo"; the default scope is "C") and then loaded by FFI::load() during preloading. This leads to the creation of a persistent binding, that will be available to all the following requests through FFI::scope(). Refer to the complete PHP/FFI/preloading example for details.
cubic As to building in the cloud, you need to install the "build-essential" package to get gcc, make, etc. As
Yep. And this is more confusing/complicated than just sudo apt install php-module-whatever and therefore might be a greater impediment.
cubic As to architectures for OpenBlas, the clue is in the name "Sandy bridge." Intel has different microcode architectures and it looks like BLAS/CBLAS/OpenBlas targets each one very specifically
Curiously, my CPU is Ivy Bridge, not Sandy Bridge. I might add that OpenBLAS in its current incarnation is MASSIVE (the compiled shared lib is 14MB!) and features contributions from Russian devs with yandex addresses as well as the Institute of Software, Chinese Academy of Sciences.
cubic As to why BLAS runs so fast, there's a second clue in the path where you find it: pthreads. BLAS starts multiple threads and splits up the workload across them. You either run longer (single threaded) OR draw more wattage from the wall (multithreaded). However, if multiple users are using the same machine, multithreading might actually be a hindrance rather than a help.
Good observations here.
cubic A new PHP extension that is actually a hodgepodge of functions I'm calling the "Quality of Life Improvement Functions" extension.
I admire your ambition, but the 'hodgepodge' nature might prove an impediment to acceptance by the curmudgeonly community.
cubic As to the question about NumPy, the official NumPy website says their implementation is written in pure C. So they probably aren't using BLAS/CBLAS/LAPACK, which explains why it takes a couple of seconds of CPU to complete instead of 0.08 seconds.
I have a vague recollection of seeing some BLAS libs in various Python paths on my Mac. Sorry I do not have additional information. I suspect NumPy is doing something FFI-like, but can't offer any authoritative detail.
cubic transposing an array can already be accomplished with array_map():
I saw someone else doing that and that's how my own sad matrix multiplier was defining transpose().
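For anyone following along, here is a minimal sketch of that array_map() transpose. The function name transpose() is just my own label, and it assumes a rectangular array of arrays (ragged rows get padded with null). The one-row case needs special handling because array_map(null, $row) with a single array just hands the row back.

```php
<?php
/**
 * Transpose a rectangular array of arrays.
 * Sketch only: assumes every row has the same number of elements.
 */
function transpose(array $matrix): array
{
    // Re-index so string keys are not treated as named arguments when spread.
    $rows = array_values($matrix);

    if ($rows === []) {
        return [];
    }

    if (count($rows) === 1) {
        // With one input array, array_map(null, ...) returns it unchanged,
        // so wrap each element in its own single-element row by hand.
        return array_map(fn($v) => [$v], array_values($rows[0]));
    }

    // A null callback with several arrays zips them: element i of the
    // result holds the i-th element of every row, i.e. column i.
    return array_map(null, ...$rows);
}

$t = transpose([[1, 2, 3], [4, 5, 6]]);
// $t is a 3 x 2 array: [[1, 4], [2, 5], [3, 6]]
```

It is tidy, but it copies the whole matrix into a new PHP array, so for something like 1375 x 1966 it costs real memory and time on top of the multiplication itself.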
cubic Multiplying two matrices together in PHP will also need some careful thought about how to avoid someone using all available CPU time on the web SAPIs without making a bunch of unnecessary calls to a timer routine to avoid exceeding the web SAPI timeout period.
You might keep a running tally of how much time a calculation has taken and examine the system load average? Maybe let the calling scope specify some limits on how greedy it can be? I've done something like this in an image-munging asset pipeline.
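Roughly what I have in mind, as a sketch rather than a finished design: the ComputeBudget class and rowSums() function below are hypothetical names of my own, not part of any extension. The calling scope sets a time limit and, optionally, a ceiling on the 1-minute load average, and the long-running loop checks the budget once per row rather than once per element, so the timer calls stay cheap.

```php
<?php
/**
 * Hypothetical helper: tracks elapsed time and (optionally) system load
 * against limits supplied by the calling scope.
 */
final class ComputeBudget
{
    /** @var int|float monotonic start time in nanoseconds */
    private $startedAt;

    public function __construct(
        private float $maxSeconds,
        private ?float $maxLoad = null
    ) {
        $this->startedAt = hrtime(true);
    }

    /** Seconds elapsed since the budget was created. */
    public function elapsed(): float
    {
        return (hrtime(true) - $this->startedAt) / 1e9;
    }

    /** Throws once either limit has been reached. */
    public function check(): void
    {
        if ($this->elapsed() >= $this->maxSeconds) {
            throw new RuntimeException('time budget exhausted');
        }
        // sys_getloadavg() is not available on every platform.
        if ($this->maxLoad !== null && function_exists('sys_getloadavg')) {
            $load = sys_getloadavg();
            if ($load !== false && $load[0] > $this->maxLoad) {
                throw new RuntimeException('system load too high');
            }
        }
    }
}

/** Stand-in for an expensive row-by-row calculation. */
function rowSums(array $matrix, ComputeBudget $budget): array
{
    $sums = [];
    foreach ($matrix as $row) {
        $budget->check();          // one check per row, not per element
        $sums[] = array_sum($row);
    }
    return $sums;
}

$sums = rowSums([[1, 2], [3, 4]], new ComputeBudget(5.0));
```

A native extension could not call back into PHP like this mid-multiply, of course, but the same idea (check a clock every N rows) would apply on the C side.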
cubic Hope this at least answers some of your questions. But it will probably be a few more weeks until I have the extension ready for any sort of use since the holidays are upon us.
It's nice to have input from someone who seems to understand and appreciate the issue, so thanks for your input here. I'd very much like to feed my matrix into your multiplier and get a matrix/array (a php array of arrays) out so I can check the results.
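By "check the results" I mean something like the sketch below: a slow, naive reference product plus a comparison that looks at dimensions first (since wrong dimensions is exactly what I saw from Tensor) and then compares values within a tolerance. Both function names are my own, and the code assumes plain zero-indexed arrays of arrays.

```php
<?php
/** Naive triple-loop reference product; slow, but easy to reason about. */
function naiveMultiply(array $a, array $b): array
{
    $inner = count($b);
    if ($a === [] || $inner === 0 || count($a[0]) !== $inner) {
        throw new InvalidArgumentException('inner dimensions do not match');
    }
    $cols = count($b[0]);
    $out = [];
    foreach ($a as $i => $row) {
        $out[$i] = array_fill(0, $cols, 0);
        for ($k = 0; $k < $inner; $k++) {
            $aik = $row[$k];
            for ($j = 0; $j < $cols; $j++) {
                $out[$i][$j] += $aik * $b[$k][$j];
            }
        }
    }
    return $out;
}

/** True when both matrices have the same shape and nearly equal values. */
function matricesMatch(array $x, array $y, float $tol = 1e-9): bool
{
    if (count($x) !== count($y)) {
        return false;                       // row counts differ
    }
    foreach ($x as $i => $row) {
        if (!isset($y[$i]) || count($row) !== count($y[$i])) {
            return false;                   // column counts differ
        }
        foreach ($row as $j => $v) {
            // Relative tolerance for large values, absolute for small ones.
            if (abs($v - $y[$i][$j]) > $tol * max(1.0, abs($v))) {
                return false;
            }
        }
    }
    return true;
}

$expected = naiveMultiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]);
// $expected is [[19, 22], [43, 50]]
```

With that in hand, whatever your multiplier returns could be passed to matricesMatch() alongside the naive result for the same inputs.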