I am working on a corpus (plain text files, with a single space between each word) where I need to get word frequencies for single words and for words grouped by threes, fours, and fives. I know I could wade through it and eventually get what I want, but as only a semi-pro, before I delve in I wanted some feedback from those of you with experience in text manipulation, so I'm not reinventing the wheel.
Two main problems. One: my text files are huge, from 50,000 to several million words per file (about 20 files), and I've never worked with files that size. What would you suggest for opening and working with such large files (to ease the processing/memory burden)? If I open a file and work with only part of it at a time, I have to be sure the file is split at the end of a sentence (signalled by a period); I can't just randomly split it into thirds or such. I could split them into smaller files manually, but that would take some serious work and would seem to waste a good resource: the computer.
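Just so you can see roughly what I'm picturing (untested, and the filename and the process_text() helper are only stand-ins for whatever counting I end up doing), something along these lines: read the file in chunks and only hand on complete sentences, carrying the leftover partial sentence into the next pass.

<?php
// Read the file in fixed-size chunks, but only process text up to the last
// period in each chunk; the leftover partial sentence is prepended to the next chunk.
$handle = fopen('corpus01.txt', 'r');  // placeholder filename
$carry  = '';

while (!feof($handle)) {
    $chunk = $carry . fread($handle, 1048576);  // roughly 1 MB at a time
    $cut   = strrpos($chunk, '.');              // last sentence boundary in this chunk
    if ($cut === false) {
        $carry = $chunk;                        // no period yet; keep reading
        continue;
    }
    $text  = substr($chunk, 0, $cut + 1);       // complete sentences only
    $carry = substr($chunk, $cut + 1);          // partial sentence, saved for next pass
    process_text($text);                        // stand-in for the actual counting step
}
if ($carry !== '') {
    process_text($carry);                       // whatever is left at the end of the file
}
fclose($handle);

Does something like that seem sane, or is there a better built-in way to handle files this big?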
Second, there are probably countless ways I could group the words by twos, threes, and fours, using arrays or regex-friendly functions, and then get their frequencies. But I was hoping someone with experience might have some general knowledge to point me in the right direction, so that by the time I'm finished my work is at least quasi-efficient (I'm an English professor, not a full-time programmer, though I consider myself a semi-pro).
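The crude version I had in mind (again untested, and count_ngrams() is just a name I made up) is to split the text into an array of words and slide a window of the right size across it:

<?php
// Split a stretch of text into words, then slide a window of $n words
// across it and tally each group in an associative array.
function count_ngrams($text, $n) {
    $counts = array();
    // the files are single-spaced, but splitting on any whitespace is safer
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);

    for ($i = 0; $i + $n <= count($words); $i++) {
        $gram = implode(' ', array_slice($words, $i, $n));
        if (!isset($counts[$gram])) {
            $counts[$gram] = 0;
        }
        $counts[$gram]++;
    }
    return $counts;
}

// e.g. $threes = count_ngrams($text, 3); arsort($threes);  // most frequent first

I realize I'd have to merge the per-chunk counts into one running total, and any group straddling a chunk boundary would get missed, but that's the general shape of it.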
I'm not looking for anyone to write the code for me or anything, just a general algorithmic approach and maybe pointers to the best functions for working with such large text files.
Although my PHP is pretty good, should I take up Perl for this, as I hear it is good at text manipulation?
sincerely,
clarkepeters.