I am working on a corpus (plain text files, with a single space between each word) where I need to get word frequencies for single words and for words grouped by threes, fours, and fives. I know I could wade through it and eventually get what I want, but as only a semi-pro, before I delve in I wanted some feedback from those of you with experience in text manipulation, so I'm not reinventing the wheel.
Two main problems. One: my text files are huge, from 50,000 to several million words per file (about 20 files), and I've never worked with files that size. What would you suggest for opening and working with such large files (to ease the processing/memory burden)? If I open a file and work with only part of it at a time, I have to be sure the file is split at the end of a sentence (signalled by a period); I can't just randomly split it into thirds or such. I could split them into smaller files manually, but that would take some serious work and would seem to waste a good resource: the computer.
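Just so you can see roughly what I'm picturing (untested, and the filename and the process_text() helper are only stand-ins for whatever counting I end up doing), something along these lines: read the file in chunks and only hand on complete sentences, carrying the leftover partial sentence into the next pass.

<?php
// Read the file in fixed-size chunks, but only process text up to the last
// period in each chunk; the leftover partial sentence is prepended to the next chunk.
$handle = fopen('corpus01.txt', 'r');  // placeholder filename
$carry  = '';

while (!feof($handle)) {
    $chunk = $carry . fread($handle, 1048576);  // roughly 1 MB at a time
    $cut   = strrpos($chunk, '.');              // last sentence boundary in this chunk
    if ($cut === false) {
        $carry = $chunk;                        // no period yet; keep reading
        continue;
    }
    $text  = substr($chunk, 0, $cut + 1);       // complete sentences only
    $carry = substr($chunk, $cut + 1);          // partial sentence, saved for next pass
    process_text($text);                        // stand-in for the actual counting step
}
if ($carry !== '') {
    process_text($carry);                       // whatever is left at the end of the file
}
fclose($handle);

Does something like that seem sane, or is there a better built-in way to handle files this big?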
Second, there are probably countless ways I could group the words by twos, threes, and fours, using arrays or regex-friendly functions, and then get their frequencies. But I was hoping someone with experience might have some general knowledge to point me in the right direction, so that by the time I'm finished my work is at least quasi-efficient (I'm an English professor, not a full-time programmer, though I consider myself a semi-pro).
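The crude version I had in mind (again untested, and count_ngrams() is just a name I made up) is to split the text into an array of words and slide a window of the right size across it:

<?php
// Split a stretch of text into words, then slide a window of $n words
// across it and tally each group in an associative array.
function count_ngrams($text, $n) {
    $counts = array();
    // the files are single-spaced, but splitting on any whitespace is safer
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);

    for ($i = 0; $i + $n <= count($words); $i++) {
        $gram = implode(' ', array_slice($words, $i, $n));
        if (!isset($counts[$gram])) {
            $counts[$gram] = 0;
        }
        $counts[$gram]++;
    }
    return $counts;
}

// e.g. $threes = count_ngrams($text, 3); arsort($threes);  // most frequent first

I realize I'd have to merge the per-chunk counts into one running total, and any group straddling a chunk boundary would get missed, but that's the general shape of it.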
I'm not looking for anyone to write the code for me or anything, just a general algorithmic approach and maybe pointers to the best functions for working with such large text files.
Although my PHP is pretty good, should I take up Perl for this, as I hear it is good at text manipulation?
sincerely,
clarkepeters.