How to get the string if the position is known

laserlight · Jan 22, 2005

You see the , at the end? I thought it tests for this to filter it out?

It doesn't, because you define the delimiter as a space.

It does for my own version, because I defined words as alphanumeric strings, rather than as non-delimiters.
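To make the difference concrete, here is a minimal sketch of the two definitions side by side. The sample sentence and both helper names are made up for illustration and are not anyone's posted code; both helpers assume a 0-based $pos inside the string.

```php
<?php
// Hypothetical helpers, for illustration only.

// "Word" = everything between spaces, so punctuation sticks to the word.
function wordBySpaces(string $str, int $pos): string
{
    $start = strrpos(substr($str, 0, $pos), ' ');
    $start = ($start === false) ? 0 : $start + 1;
    $end = strpos($str, ' ', $pos);
    if ($end === false) {
        $end = strlen($str);
    }
    return substr($str, $start, $end - $start);
}

// "Word" = the run of alphanumeric characters around $pos.
function wordByAlnum(string $str, int $pos): string
{
    $start = $pos;
    while ($start > 0 && ctype_alnum($str[$start - 1])) {
        $start--;
    }
    $end = $pos;
    while ($end < strlen($str) && ctype_alnum($str[$end])) {
        $end++;
    }
    return substr($str, $start, $end - $start);
}

$s = 'one two, three';
echo wordBySpaces($s, 5), "\n"; // two,
echo wordByAlnum($s, 5), "\n";  // two
```

With a space as the only delimiter the comma is part of the "word"; with the alphanumeric definition it is not.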

starbbs · Jan 22, 2005

So you wrote a similar function for yourself? Can you post it?

laserlight · Jan 22, 2005

function parseForWord($str, $pos) {
    $len = strlen($str);
    //perform bounds checking
    if ($pos >= 0 && $pos < $len) {
        //is the character at $pos part of a word (alphanumeric)?
        if (ctype_alnum($str[$pos])) {
            //$pos marks a character within a word
            $word_len = 1;
            //scan left for start of word
            for ($start = $pos - 1; $start >= 0; $start--) {
                if (ctype_alnum($str[$start])) {
                    $word_len++;
                } else {
                    break;
                }
            }
            //scan right to end of word
            for ($i = $pos + 1; $i < $len; $i++) {
                if (ctype_alnum($str[$i])) {
                    $word_len++;
                } else {
                    break;
                }
            }
            //$start + 1 here because of $start-- or an out-of-bounds $start
            return substr($str, $start + 1, $word_len);
        } else {
            //$pos marks position of a delimiter
            $word_len = 0;
            //scan left only
            for ($i = $pos - 1; $i >= 0; $i--) {
                if (ctype_alnum($str[$i])) {
                    $word_len = 1;
                    break;
                }
            }
            //$word_len = 1 > 0 if (last character of) word found
            if ($word_len > 0) {
                //scan left for start of word
                for ($start = $i - 1; $start >= 0; $start--) {
                    if (ctype_alnum($str[$start])) {
                        $word_len++;
                    } else {
                        break;
                    }
                }
                return substr($str, $start + 1, $word_len);
            } else {
                return '';
            }
        }
    } else {
        return false;
    }
}

mtmosier · Jan 22, 2005

Ooh! Ooh! Can I play too?

function getWord($text, $pos) {
    if ($pos > strlen($text)) {
        return '';
    }
    if ($pos < 1) {
        $pos = 1;
    }

    /*  Get a copy of the string up until $pos  */
    $tmp = substr($text, 0, $pos);
    /*  Match the last word of the temporary string keeping any trailing non-word characters  */
    $tmp = preg_replace('/^.*?([\w\'\-]+[^\w\'\-]*)$/s', '\1', $tmp);
    /*  Append the second part of the original string back onto our temporary string  */
    $tmp .= substr($text, $pos);
    /*  Match the word at the beginning of the string and return it  */
    return preg_replace('/^([\w\'\-]*).*$/s', '\1', $tmp);
}

This one counts the first character in the string as position 1. It uses regex's "word" character to find word breaks, which should vary by locale.

Edited to include apostrophes and hyphens in words. Also changed PHP tags to CODE since it's nearly impossible to get backslashes right in PHP blocks. grrr

ripat · Jan 23, 2005

Just my two (euro) cent:

function getWord2($text, $pos) {
  if ($pos > strlen($text)) {
    return 'Text is far too short!';
  }
  preg_match_all('#\b\w+#', substr($text, 0, $pos), $out);
  return array_pop($out[0]);
}

Using the \b assertion (which consumes no characters) does the trick.
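A small sketch of what the match list looks like, using a made-up sample sentence. \b matches at a position between a word character and a non-word character rather than on a character itself. As far as I can tell, a greedy \w+ without the \b returns the same list on this input, so the assertion acts more as a safeguard here than as a requirement.

```php
<?php
// Sample text is made up for illustration.
$text = 'The quick brown fox';
$head = substr($text, 0, 9); // "The quick"

// $matches[0] holds every full match, in order of appearance.
preg_match_all('#\b\w+#', $head, $withBoundary);
preg_match_all('#\w+#', $head, $withoutBoundary);

print_r($withBoundary[0]);                            // The, quick
var_dump($withBoundary[0] === $withoutBoundary[0]);   // bool(true)
```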

Edit: This BBcode doesn't like the regex pattern. Hence the CODE tag!

laserlight · Jan 23, 2005

This BBcode doesn't like the regex pattern. Hence the CODE tag!

Yeah, I think it's a bug with the php bbcode tag.

There's a bug with your solution though, in that if the position is in the middle of the word, not the whole word is returned (due to the substr()).
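A quick way to see the truncation, with a made-up sentence and position (a sketch of the effect, not the posted function itself):

```php
<?php
// Sample text and position are made up for illustration.
$text = 'The quick brown fox';
$pos = 6; // falls inside "quick", right after "qu"

// Cutting at $pos throws away the rest of the word...
$head = substr($text, 0, $pos); // "The qu"
preg_match_all('#\b\w+#', $head, $out);

// ...so the last match is only the part of the word before $pos.
$word = end($out[0]);
echo $word, "\n"; // qu
```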

Weedpacket · Jan 24, 2005

Originally posted by mtmosier
This one counts the first character in the string as position 1. It uses regex's "word" character to find word breaks, which should vary by locale.

It also breaks on "I'd", a word I deliberately used in my test string because of this. It's also why I didn't do anything about punctuation, since nothing was specified about them in the original problem (even though I asked).

mtmosier · Jan 24, 2005

It also breaks on "I'd", a word I deliberately used in my test string because of this.

Quite true. Easily fixed, but then the question becomes what else is a valid part of a word? A hyphen I suppose, but there must be more. Alternatively, one could simply define which characters to break on.

I think I need more detailed specs.

ripat · Jan 24, 2005

There's a bug with your solution though, in that if the position is in the middle of the word, not the whole word is returned (due to the substr())

True. If that's what he needs, it is easy to correct it by adding just one line:

function getWord2($text, $pos) {
  if ($pos > strlen($text)) {
    return 'Text is far too short!';
  }
  while ($text[$pos] != ' ') $pos++;        //  <--------   added line
  preg_match_all('#\b\w+#', substr($text, 0, $pos), $out);
  return array_pop($out[0]);
}

As for the "I'd" problem, PCRE syntax considers it as two words. Which is grammatically correct I guess (forgive me if I'm wrong, English is not my mother tongue!).

If not, change the regex pattern to #\b[\w']+#.

Et voilà. That's all there is to it!

Weedpacket · Jan 24, 2005

Originally posted by ripat
If not, change the regex pattern to #\b[\w']+#.

Which would then include the closing (but not opening) quotes in 'single quoted strings'.
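Here is a small sketch of that effect on a made-up test string. The opening quote is skipped because there is no word boundary between a space and an apostrophe, while the closing quote directly follows a word character and gets swallowed by the character class.

```php
<?php
// Test string is made up for illustration.
$text = "I'd say 'single quoted' strings";

// Plain \w+ splits "I'd" in two.
preg_match_all('#\b\w+#', $text, $plain);

// Adding the apostrophe keeps "I'd" whole, but also keeps the closing quote.
preg_match_all("#\\b[\\w']+#", $text, $withApostrophe);

print_r($plain[0]);          // I, d, say, single, quoted, strings
print_r($withApostrophe[0]); // I'd, say, single, quoted', strings
```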

I could change mine to first match on whitespace and then trim off non-wordlike characters from either end. I'd use trim if it had negated character classes, but it doesn't. So preg_replace('/^\W+|\W+$/', '', $word) it is then.

But there will still be failures on some strings' apostrophes.

But maybe Dutch has different conventions.
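The trimming idea above could be sketched like this. The helper name is hypothetical, and the pattern assumes a + quantifier so that runs of several non-word characters are stripped from each end. It also shows the kind of failure mentioned: a possessive apostrophe at the end of a word is lost.

```php
<?php
// Hypothetical helper name; strips non-word characters from both ends only.
function trimNonWord(string $word): string
{
    return preg_replace('/^\W+|\W+$/', '', $word);
}

echo trimNonWord("'quoted',"), "\n"; // quoted
echo trimNonWord("I'd"), "\n";       // I'd (inner apostrophe survives)
echo trimNonWord("strings'"), "\n";  // strings (possessive apostrophe is lost)
```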

ripat · Jan 24, 2005

Yes, there will always be 'special cases'.

Just for the sake of efficiency I tried to optimize my code above to make it run even faster.

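function getWord4($text, $pos) {
  // find the end of the word: the next space at or after $pos, or the end of the string
  $pos = strpos($text, ' ', $pos);
  if ($pos === FALSE) $pos = strlen($text);
  // hyphen placed last in the character class so it is always read as a literal
  $out = preg_split("#[^\w'-]#", substr($text, 0, $pos), -1, PREG_SPLIT_NO_EMPTY);
  return array_pop($out);
}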

preg_split does the job faster, and strpos() is more efficient than my while() loop (which, by the way, had a bug when $pos was in the last word: the loop never stopped).
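For reference, a sketch of what the split produces on a made-up string. It assumes the hyphen is written last in the character class, where it is always taken as a literal hyphen rather than as part of a range.

```php
<?php
// Sample string is made up for illustration.
$text = "well-known: I'd say";

// Split on anything that is not a word character, an apostrophe or a hyphen.
// PREG_SPLIT_NO_EMPTY drops the empty piece between ":" and the space.
$words = preg_split("#[^\\w'-]#", $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($words); // well-known, I'd, say
```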

We all had fun with this thread, but what does the original poster think of all this?

laserlight · Jan 24, 2005

Haven't tested your new version yet, but wouldn't end($out) be better than array_pop($out) in this case?

ripat · Jan 24, 2005

Originally posted by laserlight
Haven't tested your new version yet, but wouldn't end($out) be better than array_pop($out) in this case?

It is indeed better, as there is no need to shorten the $out array.
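A tiny sketch of the difference, with a made-up array: both calls hand back the last element, but only array_pop() removes it.

```php
<?php
// Sample array is made up for illustration.
$out = ['The', 'quick'];

// end() moves the internal pointer to the last element and returns it.
$last = end($out);
$sizeAfterEnd = count($out);   // still 2

// array_pop() returns the last element and removes it from the array.
$popped = array_pop($out);
$sizeAfterPop = count($out);   // now 1

echo $last, ' ', $popped, "\n"; // quick quick
```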

I guess we all have our (bad) habits.

Thanks.

ripat · Jan 26, 2005

I just read a post above that asked if strpos could work backwards. Well, I don't think so, but one could emulate it by using strrev().

function strpos_backwards($text, $pos) {
  // finds the position of the next space starting at the given $pos
  $pos = strpos($text, ' ', $pos);
  // if $pos is in the middle of the last word, this positions $pos at the end of the string
  if ($pos === FALSE) $pos = strlen($text);
  // substr chops off everything from the new $pos onwards; strrev reverses what is left
  $reversed_text = strrev(substr($text, 0, $pos));
  // it's here that strpos works backwards, i.e. forwards but on a reversed string
  $space = strpos($reversed_text, ' ');
  // no space found means the word is the first one in the string, so keep everything
  if ($space === FALSE) $space = strlen($reversed_text);
  // cut the reversed string at the first occurrence of a space
  $out = substr($reversed_text, 0, $space);
  // reverse the found word back and return it
  return strrev($out);
}

I benchmarked it against the above solutions and it's the fastest (2 times faster than my preg_split version and laserlight's string function, which are both equally fast).
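For what it's worth, PHP also has strrpos(), which finds the last occurrence of a substring, so the same idea can be sketched without reversing anything. The function name below is hypothetical, nothing here has been benchmarked, and it assumes 0 <= $pos <= strlen($text) with a space as the only delimiter.

```php
<?php
// Hypothetical helper: same approach as above, using strrpos() instead of strrev().
function wordAt(string $text, int $pos): string
{
    // End of the word: the next space at or after $pos, or the end of the string.
    $end = strpos($text, ' ', $pos);
    if ($end === false) {
        $end = strlen($text);
    }
    // Start of the word: one past the last space before $end, or the string start.
    $head = substr($text, 0, $end);
    $space = strrpos($head, ' ');
    $start = ($space === false) ? 0 : $space + 1;
    return substr($head, $start);
}

echo wordAt('The quick brown fox', 6), "\n";  // quick
echo wordAt('The quick brown fox', 1), "\n";  // The
echo wordAt('The quick brown fox', 17), "\n"; // fox
```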
