How to get the string if the position is known

starbbs · Jan 20, 2005

Well laserlight... I am really grateful for your help. I did not realise that this request isn't all that simple.

But did you take a look at my example?
print array_pop(explode(' ',substr($text,0,$positie)));

laserlight · Jan 20, 2005

print array_pop(explode(' ',substr($text,0,$positie)));

At first glance, this might be ok, though you'll have problems with multiple delimiters.

The code example I gave you was originally coded by defining words as alphanumeric strings, so ctype_alnum() was used to test instead of a direct comparison with the delimiter.
This allows you to work with more than one delimiter with just a small modification of the function.
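
To illustrate the multiple-delimiter point, here is a minimal sketch. The sample text and the helper name lastWordBefore() are made up for this example and are not code from this thread. Splitting on a single space leaves punctuation attached to the word; splitting on a character class with preg_split() is one possible way to treat several delimiters alike.

```php
<?php
// Hypothetical helper: returns the last word that starts before $pos,
// treating any run of the characters in $delims as one delimiter.
function lastWordBefore($text, $pos, $delims = " ,.;:!?\t\n")
{
    $left = substr($text, 0, $pos);
    // preg_quote() escapes the delimiters so they are taken literally.
    $pattern = '/[' . preg_quote($delims, '/') . ']+/';
    $parts = preg_split($pattern, $left, -1, PREG_SPLIT_NO_EMPTY);
    return $parts ? end($parts) : '';
}

$text = 'My name, as you know, is Bob.';

// Splitting on a space only: the comma stays attached.
$parts = explode(' ', substr($text, 0, 8));
echo end($parts), "\n";               // name,

// Splitting on several delimiters: the comma is dropped.
echo lastWordBefore($text, 8), "\n";  // name
```

Which characters belong in $delims is a judgment call; the default above is only an assumption.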

Weedpacket · Jan 20, 2005

Obviously, the real code is the content of the second loop. The loops are only there as test harnesses to make sure that the right words are returned for every possible value of $pos.

$sentence = "Dit is een simpele text die nergens op slaat. I'd write this sentence in Dutch if I knew any Dutch."; 

// Okay, let's test this.
for($pos = 0; $pos<strlen($sentence); ++$pos)
	echo $pos,' ',array_pop(explode(' ',substr($sentence,0,$pos))),"\n";

// Not what is desired, I take it.

// Regexps? Okay, let's have a crack with them.

for($pos=0; $pos<strlen($sentence); ++$pos)
{
	// Break the sentence into two pieces at $pos. The word we want will be the 
	// last one on the left. If $pos was inside the word at the time, a fragment
	// of the word will end up on the right.
	$left = substr($sentence, 0, $pos);
	$right = substr($sentence, $pos);
	// Isolate the last word to the left of $pos and the first word to the right of $pos.
	// There may be some whitespace after the last word, and there may be some before the first.
	preg_match('/(\\S*)(\\s*)$/', $left, $last_word_and_space);
	preg_match('/^(\\s*)(\\S*)/', $right, $space_and_first_word);
	// We don't need $junk - it's the entire string matched - word and space and all.
	list($junk, $last_word, $space_on_left) = $last_word_and_space;
	list($junk, $space_on_right, $first_word) = $space_and_first_word;
	if($space_on_left=='' && $space_on_right=='')
	{
		// $pos lay inside a word, which got bro
		// ken into two pieces.
		$word = $last_word.$first_word;
	}
	elseif($space_on_right!='')
	{
		// $pos was in whitespace
		//  at the end of a word
		$word = $last_word;
	}
	elseif($space_on_left!='') // Don't really need to test this
	{
		// $pos was positioned 
		// at the start of the word
		$word = $first_word;
	}
	echo $pos,' ',$word,"\n";
}

(Edit: just noticed vBulletin had turned my \s's and \S's into s's and S's. That's not right.)

starbbs · Jan 21, 2005

Thanks for this one... but I really have to make this last code work, because it does not return the word found to the left of the position I know/want?

So what does this one do?

starbbs · Jan 21, 2005

function parseForWord($str, $pos, $delim = ' ')

I also saw that this one also returns words like:

name,

You see the , at the end? I thought it tests for this to filter it out?

BlackenedSky · Jan 22, 2005

What about doing a search for the previous delimiter, then using the mid function to return the word using the two positions?
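
A rough sketch of that idea, assuming a single-character delimiter and 0 <= $pos < strlen($text). PHP has no mid(); substr() plays that role here. The helper name wordBetweenDelims() is hypothetical.

```php
<?php
// Hypothetical helper: finds the delimiters on either side of $pos and
// cuts out what lies between them.
function wordBetweenDelims($text, $pos, $delim = ' ')
{
    // Last delimiter strictly before $pos (false if there is none).
    $before = strrpos(substr($text, 0, $pos), $delim);
    $start = ($before === false) ? 0 : $before + 1;
    // First delimiter at or after $pos (false if there is none).
    $after = strpos($text, $delim, $pos);
    $end = ($after === false) ? strlen($text) : $after;
    return substr($text, $start, $end - $start);
}

echo wordBetweenDelims('Dit is een simpele text', 13), "\n"; // simpele
echo wordBetweenDelims('Dit is een simpele text', 10), "\n"; // een
```

When $pos sits on the delimiter itself, this returns the word to its left. With two delimiters in a row it can return an empty string, so it is no more robust against multiple delimiters than the explode() one-liner.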

laserlight · Jan 22, 2005

You see the , at the end? I thought it tests for this to filter it out?

It doesn't, because you define the delimiter as a space.

It does for my own version, because I defined words as alphanumeric strings, rather than as non-delimiters.

starbbs · Jan 22, 2005

So you wrote a similar function for yourself? Can you post this?

laserlight · Jan 22, 2005

function parseForWord($str, $pos) {
    $len = strlen($str);
    //perform bounds checking
    if ($pos >= 0 && $pos < $len) {
        //is the character at $pos part of a word (alphanumeric)?
        if (ctype_alnum($str[$pos])) {
            //$pos marks a character within a word
            $word_len = 1;
            //scan left for start of word
            for ($start = $pos - 1; $start >= 0; $start--) {
                if (ctype_alnum($str[$start])) {
                    $word_len++;
                } else {
                    break;
                }
            }
            //scan right to end of word
            for ($i = $pos + 1; $i < $len; $i++) {
                if (ctype_alnum($str[$i])) {
                    $word_len++;
                } else {
                    break;
                }
            }
            //$start + 1 here because of $start-- or an out-of-bounds $start
            return substr($str, $start + 1, $word_len);
        } else {
            //$pos marks position of a delimiter
            $word_len = 0;
            //scan left only
            for ($i = $pos - 1; $i >= 0; $i--) {
                if (ctype_alnum($str[$i])) {
                    $word_len = 1;
                    break;
                }
            }
            //$word_len = 1 > 0 if (last character of) word found
            if ($word_len > 0) {
                //scan left for start of word
                for ($start = $i - 1; $start >= 0; $start--) {
                    if (ctype_alnum($str[$start])) {
                        $word_len++;
                    } else {
                        break;
                    }
                }
                return substr($str, $start + 1, $word_len);
            } else {
                return '';
            }
        }
    } else {
        return false;
    }
}

mtmosier · Jan 22, 2005

Ooh! Ooh! Can I play too?

function getWord($text, $pos) {
    if ($pos > strlen($text)) {
        return '';
    }
    if ($pos < 1)  $pos = 1;

    /*  Get a copy of the string up until $pos  */
    $tmp = substr($text, 0, $pos);
    /*  Match the last word of the temporary string keeping any trailing non-word characters  */
    $tmp = preg_replace('/^.*?([\w\'\-]+[^\w\'\-]*)$/s', '\1', $tmp);
    /*  Append the second part of the original string back onto our temporary string  */
    $tmp .= substr($text, $pos);
    /*  Match the word at the beginning of the string and return it  */
    return preg_replace('/^([\w\'\-]*).*$/s', '\1', $tmp);
}

This one counts the first character in the string as position 1. It uses regex's "word" character to find word breaks, which should vary by locale.

Edited to include apostrophes and hyphens in words. Also changed PHP tags to CODE since it's nearly impossible to get backslashes right in PHP blocks. Grrr.

ripat · Jan 23, 2005

Just my two (euro) cent:

function getWord2($text, $pos) {
  if ($pos > strlen($text)) { 
    return 'Tekst is veel te kort!'; 
  } 
  preg_match_all('#\b\w+#', substr($text, 0, $pos), $out);
  return array_pop($out[0]);
}

Using the \b assertion (which consumes no characters) does the trick.

Edit: This BBcode doesn't like the regex pattern. Hence the CODE tag!

laserlight · Jan 23, 2005

This BBcode doesn't like the regex pattern. Hence the CODE tag !

Yeah, I think it's a bug with the php bbcode tag.

There's a bug with your solution though, in that if the position is in the middle of the word, not the whole word is returned (due to the substr()).

Weedpacket · Jan 24, 2005

Originally posted by mtmosier
This one counts the first character in the string as position 1. It uses regex's "word" character to find word breaks, which should vary by locale.

It also breaks on "I'd", a word I deliberately used in my test string because of this. It's also why I didn't do anything about punctuation, since nothing was specified about them in the original problem (even though I asked).

mtmosier · Jan 24, 2005

It also breaks on "I'd", a word I deliberately used in my test string because of this.

Quite true. Easily fixed, but then the question becomes: what else is a valid part of a word? A hyphen I suppose, but there must be more. Alternatively, one could simply define what constitutes a character on which to break.

I think I need more detailed specs.

ripat · Jan 24, 2005

There's a bug with your solution though, in that if the position is in the middle of the word, not the whole word is returned (due to the substr())

True. If that's what he needs, it is easy to correct it by adding just one line:

function getWord2($text, $pos) {
  if ($pos > strlen($text)) { 
    return 'Tekst is veel te kort!'; 
  }
  while ($text[$pos] != ' ') $pos++;        //  <--------   added line
  preg_match_all('#\b\w+#', substr($text, 0, $pos), $out);
  return array_pop($out[0]);
}

As for the "I'd" problem, PCRE syntax considers it as two words. Which is grammatically correct, I guess (forgive me if I am wrong, English is not my native language!).

If not, change the regex pattern to #\b[\w']+#.

Et voilà. That's all there is to it!

Weedpacket · Jan 24, 2005

Originally posted by ripat
If not, change the regex pattern with #\b[\w']+#.

Which would then include the closing (but not opening) quotes in 'single quoted strings'.

I could change mine to trim off non-wordlike characters from either end after matching on whitespace. I'd use trim() if it supported negated character classes, but it doesn't. So preg_replace('/^\W+|\W+$/', '', $word) it is then.

But there will still be failures on some strings' apostrophes.

But maybe Dutch has different conventions.
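
For what it's worth, a small sketch of that trimming step. The helper name trimNonWord() and the sample tokens are invented for illustration; it assumes the token was already isolated by splitting on whitespace.

```php
<?php
// Hypothetical helper: strips non-word characters from both ends of a
// whitespace-delimited token, leaving interior apostrophes alone.
function trimNonWord($token)
{
    return preg_replace('/^\W+|\W+$/', '', $token);
}

echo trimNonWord("'quoted'"), "\n";  // quoted
echo trimNonWord("I'd"), "\n";       // I'd
echo trimNonWord("strings'"), "\n";  // strings
```

The last case is the kind of failure mentioned above: a trailing possessive apostrophe is indistinguishable from a closing quote, so it gets trimmed too.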

ripat · Jan 24, 2005

Yes, there will always be 'special cases'.

Just for the sake of efficiency I tried to optimize my code above to make it run even faster.

function getWord4($text, $pos) {
  $pos = strpos( $text, ' ', $pos);
  if ($pos === FALSE) $pos = strlen($text);
  $out = preg_split("#[^\w'-]#", substr($text, 0, $pos), -1, PREG_SPLIT_NO_EMPTY);
  return array_pop($out);
}

preg_split() does the job faster, and strpos() is more efficient than my while() loop (which, btw, had a bug when $pos was in the last word: the loop did not stop).

We all had fun in this thread, but what does the original poster think of all this?

laserlight · Jan 24, 2005

Haven't tested your new version yet, but wouldn't end($out) be better than array_pop($out) in this case?

ripat · Jan 24, 2005

Originally posted by laserlight
Haven't tested your new version yet, but wouldn't end($out) be better than array_pop($out) in this case?

It is indeed better, as there is no need to shorten the $out array.

I guess we all have our (bad) habits.

Thanks.
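
A tiny sketch of the difference, using a made-up array: both calls return the last element, but only array_pop() removes it.

```php
<?php
$words = array('Dit', 'is', 'een');

$last = end($words);          // reads the last element, array untouched
echo $last, ' ', count($words), "\n";    // een 3

$popped = array_pop($words);  // reads the last element and removes it
echo $popped, ' ', count($words), "\n";  // een 2
```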

ripat · Jan 26, 2005

I just read a post above that asked if strpos could work backwards. Well, I don't think so, but by using strrev() one could emulate it.

function strpos_backwards($text, $pos) {
  // finds the position of the next space starting at the given $pos
  $pos = strpos( $text, ' ', $pos);
  // if $pos in the middle of last word, this will position $pos at the end of string
  if ($pos === FALSE) $pos = strlen($text);
  // substr chops the string after the new $pos, and strrev reverses the chopped string
  $reversed_text = strrev(substr($text, 0, $pos));
  // it's here that strpos works backwards, i.e. forwards but on a reversed string
  $space = strpos($reversed_text, ' ');
  // take everything up to the first space, or the whole string if there is
  // no space (i.e. $pos was in the first word)
  $out = ($space === FALSE) ? $reversed_text : substr($reversed_text, 0, $space);
  // reverse back the found word and return it
  return strrev($out);
}

I benchmarked it against the above solutions and it's the fastest (2 times faster than my preg_split thing and laserlight's string function, which are both equally fast).
