I want to write an application that uses latent semantic analysis (LSA) to extract the most important keywords from a text.
Explanation of the method on en.wikipedia.org:
http://en.wikipedia.org/wiki/Latent_semantic_analysis
My approach is as follows:
1) Extract all the text from the document and split it into single words, which are then stemmed => array containing the word stems [split()]
2) Delete stop words ("and", "but", ...) [foreach loop: check whether the word is in the stop words array]
3) Calculate weights for the terms:
After this code ...
$term_document_matrix = array();
$counts = array_count_values($words); // number of occurrences of each word in this document
$max = max($counts);                  // number of occurrences of the most frequent word in this document
foreach ($counts as $word => $freq) {
    // $doc1 is the number of documents in the database;
    // $doc2 is the number of documents in the database containing $word;
    $weight = $freq / $max * log($doc1 / $doc2);
    $term_document_matrix[$word] = $weight;
}
... I should have an array with all words as the keys and the weights as the values.
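To make steps 1 to 3 concrete for more than one document, here is a minimal sketch that keeps one weight per term and per document, so the result really is a matrix rather than a single column. The function names tokenize() and buildTermDocumentMatrix(), the sample documents and the stop word list are made up for illustration, the corpus is held in memory instead of a database, and stemming is left out.

```php
<?php
// Hypothetical helper: lowercases the text, splits it on non-letters
// and drops stop words. Stemming is omitted in this sketch.
function tokenize(string $text, array $stopWords): array {
    $tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($tokens, $stopWords));
}

// Hypothetical helper: returns $matrix[$term][$docIndex] = tf-idf weight,
// using the same formula as above: freq / max * log(numDocs / docFreq).
function buildTermDocumentMatrix(array $documents, array $stopWords): array {
    $counts = array();  // per-document term counts
    $docFreq = array(); // number of documents containing each term
    foreach ($documents as $d => $text) {
        $counts[$d] = array_count_values(tokenize($text, $stopWords));
        foreach (array_keys($counts[$d]) as $term) {
            $docFreq[$term] = ($docFreq[$term] ?? 0) + 1;
        }
    }
    $numDocs = count($documents);
    $matrix = array();
    foreach ($docFreq as $term => $df) {
        foreach ($counts as $d => $termCounts) {
            $freq = $termCounts[$term] ?? 0;
            $max = $termCounts ? max($termCounts) : 1; // guard against empty documents
            $matrix[$term][$d] = $freq / $max * log($numDocs / $df);
        }
    }
    return $matrix;
}

// Made-up sample data.
$documents = array(
    'The cat sat on the mat and the cat slept',
    'The dog chased the cat',
    'Dogs and cats are pets',
);
$stopWords = array('the', 'and', 'on', 'are');

$matrix = buildTermDocumentMatrix($documents, $stopWords);
print_r($matrix['cat']); // one weight per document for the term "cat"
```

Each row of $matrix is a term and each column a document. Note that a term occurring in every document gets the weight 0, because log(1) = 0.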
But there is still a lot of work to do after that, isn't there?
Wikipedia says that I should use singular value decomposition (SVD) to split the term-document matrix into three components:
A = U S V^T
The orthogonal matrices U and V contain the eigenvectors of A A^T and A^T A respectively; S is a diagonal matrix holding the square roots of the eigenvalues of A^T A, which are called the singular values.
Using the singular values in S, you can control the linear feature extraction by successively dropping the smallest singular values until only the k largest remain, where k is a rank you choose.
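One possible way to code the decomposition without an external library is power iteration on A^T A: it finds the largest singular value first, and the next ones are found by keeping each new vector orthogonal to those already found. The sketch below is only an illustration under assumptions: the function names (matVec, transpose, norm, orthonormalize, truncatedSvd, reconstruct) are made up, A is passed as a plain list of numeric rows, k is assumed not to exceed the rank of A, and the sample matrix is arbitrary. For large matrices a numerical library would be the more robust choice.

```php
<?php
// Multiplies matrix $a (a list of numeric rows) by vector $x.
function matVec(array $a, array $x): array {
    $y = array();
    foreach ($a as $row) {
        $sum = 0.0;
        foreach ($row as $j => $value) { $sum += $value * $x[$j]; }
        $y[] = $sum;
    }
    return $y;
}

function transpose(array $a): array {
    $t = array();
    foreach ($a as $i => $row) {
        foreach ($row as $j => $value) { $t[$j][$i] = $value; }
    }
    return $t;
}

function norm(array $x): float {
    $sum = 0.0;
    foreach ($x as $value) { $sum += $value * $value; }
    return sqrt($sum);
}

// Removes from $x its components along the right singular vectors found so
// far, then scales it to unit length. Returns null for a (near) zero vector.
function orthonormalize(array $x, array $found): ?array {
    foreach ($found as $triplet) {
        $dot = 0.0;
        foreach ($x as $j => $value) { $dot += $value * $triplet['v'][$j]; }
        foreach ($x as $j => $value) { $x[$j] -= $dot * $triplet['v'][$j]; }
    }
    $len = norm($x);
    if ($len < 1e-12) { return null; }
    return array_map(function ($value) use ($len) { return $value / $len; }, $x);
}

// Returns up to $k triplets array('sigma' => ..., 'u' => ..., 'v' => ...),
// ordered from the largest singular value downwards.
function truncatedSvd(array $a, int $k, int $iterations = 200): array {
    $at = transpose($a);
    $found = array();
    for ($r = 0; $r < $k; $r++) {
        // Deterministic start vector (an arbitrary choice for this sketch).
        $start = array();
        for ($j = 0; $j < count($at); $j++) { $start[$j] = 1.0 / ($j + 1 + $r); }
        $v = orthonormalize($start, $found);
        for ($it = 0; $v !== null && $it < $iterations; $it++) {
            // One power iteration step: v <- normalized (A^T A) v
            $v = orthonormalize(matVec($at, matVec($a, $v)), $found);
        }
        if ($v === null) { break; } // nothing left to extract
        $av = matVec($a, $v);
        $sigma = norm($av);         // singular value = |A v|
        if ($sigma < 1e-12) { break; }
        $u = array_map(function ($value) use ($sigma) { return $value / $sigma; }, $av);
        $found[] = array('sigma' => $sigma, 'u' => $u, 'v' => $v);
    }
    return $found;
}

// Rebuilds the rank-k approximation: the sum of sigma * u * v^T over the triplets.
function reconstruct(array $triplets, int $rows, int $cols): array {
    $approx = array_fill(0, $rows, array_fill(0, $cols, 0.0));
    foreach ($triplets as $t) {
        for ($i = 0; $i < $rows; $i++) {
            for ($j = 0; $j < $cols; $j++) {
                $approx[$i][$j] += $t['sigma'] * $t['u'][$i] * $t['v'][$j];
            }
        }
    }
    return $approx;
}

// Arbitrary sample matrix; its singular values are sqrt(40) and sqrt(10).
$a = array(array(4, 0), array(3, -5));
$top = truncatedSvd($a, 1);            // keep only the largest singular value
printf("%.4f\n", $top[0]['sigma']);
print_r(reconstruct($top, 2, 2));      // rank-1 approximation of $a
```

Choosing k here plays the role of "leaving out the smallest singular values": truncatedSvd($a, $k) simply stops after the k largest ones. How fast the iteration converges depends on how well separated the singular values are, so the fixed iteration count is a simplification.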
But my problem is that I don't know how to code what the Wikipedia article describes. Could you please help me find an approach for coding that? That would be great! Thanks in advance!