One method is to count stretches of contiguous whitespace (preg_match_all('/\s+/', $text)) - that plus 1 is a reasonable estimate of the number of words in the text. Don't forget to trim() the text first.
If it's 500 words or fewer, then you don't need to do any more, so it's reasonable to check first and save yourself unnecessary work 🙂 Only if there are more than 500 words does it make sense to work at limiting it. So with that check out of the way, we can happily go ahead and assume that what we're looking at does have more than 500 words.
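Here is a rough sketch of that count-and-check step. The function name estimate_word_count() and the sample text are made up for illustration, and the empty-string guard is my own addition so that blank input reports 0 rather than 1.

```php
<?php
// Hypothetical helper: estimates the word count as (whitespace runs + 1).
function estimate_word_count(string $text): int
{
    $text = trim($text);
    if ($text === '') {
        return 0; // no words at all; avoids reporting 1 for an empty string
    }
    // preg_match_all() returns the number of full pattern matches it found.
    return preg_match_all('/\s+/', $text) + 1;
}

$limit = 500; // the word limit under discussion
$text  = "Just a   short\nexample text.";

if (estimate_word_count($text) <= $limit) {
    // Short enough already: nothing to cut.
    $short_text = $text;
}
```

Anything that fails the check goes on to one of the trimming approaches.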
You can preg_split() on the whitespace (same expression as above), array_slice() out the first 500 words from the array that results, and then join() them again with spaces. The downside of course is that you lose any niceties of whitespace formatting in the string (much as HTML pages don't respect such things either unless you use <pre> tags). If there are paragraph marks they will be lost.
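A minimal sketch of the split, slice and join approach, assuming it is acceptable to collapse every run of whitespace into a single space. The name first_words_flat() is hypothetical, and implode() is used because join() is simply an alias for it.

```php
<?php
// Hypothetical helper: keeps the first $limit words, joined by single spaces.
function first_words_flat(string $text, int $limit = 500): string
{
    // Split on runs of whitespace; trim() first so there is no empty
    // leading or trailing element.
    $words = preg_split('/\s+/', trim($text));

    // Keep at most $limit words and glue them back together.
    return implode(' ', array_slice($words, 0, $limit));
}
```

Something like first_words_flat($text) would then give the shortened text, with paragraph breaks flattened into spaces as described.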
I haven't tried it, but I suspect anything involving preg_match('/((\S+\s+){500})/', $text, $matches) would be a bad idea. On the other hand, you could have
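$short_text = '';
for ($i = 0; $i < 500; ++$i) {
    // Peel one word plus its trailing whitespace off the front of $text.
    // The s modifier lets . match newlines, so multi-line text survives.
    if (!preg_match('/^(\S+\s+)(.*)/s', $text, $matches)) {
        break; // no further word followed by whitespace
    }
    $short_text .= $matches[1];
    $text = $matches[2];
}
$short_text = trim($short_text);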
This would preserve the whitespace.
Something else I haven't tried at all:
$text_bits = preg_split('/\b/', $text);
and split the text on word boundaries. The result would be an array of alternating words and chunks of whitespace (at least, that would in my opinion be the Right Thing for it to do). Join the first 999 or 1000 (I'm not sure if \b would match the very start of the string or not) with "" and whitespace would be preserved, including paragraph marks.
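One caveat with \b is that it also matches around apostrophes and hyphens, so a word like "don't" would come back as three pieces. A variation that sidesteps this (my substitution, not the \b idea itself) is to split on whitespace and have preg_split() hand back the whitespace as well, using the PREG_SPLIT_DELIM_CAPTURE flag. The function name below is hypothetical, and this is only a sketch under the assumption that a "word" means any run of non-whitespace characters.

```php
<?php
// Hypothetical helper: keeps the first $limit words with their original
// whitespace (including paragraph breaks) between them.
function first_words_keep_whitespace(string $text, int $limit = 500): string
{
    if ($limit < 1) {
        return '';
    }
    // The capturing group makes preg_split() return the whitespace chunks
    // too, so the array alternates: word, whitespace, word, whitespace, ...
    $parts = preg_split('/(\s+)/', trim($text), -1, PREG_SPLIT_DELIM_CAPTURE);

    // $limit words plus the ($limit - 1) whitespace chunks between them.
    return implode('', array_slice($parts, 0, 2 * $limit - 1));
}
```

Because the text is trimmed first, the array always starts with a word, so there is no question of whether to take 999 or 1000 pieces: 999 gives exactly 500 words.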
That's just a couple of ideas that sprang to mind when I read the question. Usually my very first ideas aren't the best 🙂