[RESOLVED] Trying to understand international character handling

jkurrle

I'm working on a small personal project to understand how international characters are handled. I have mbstring enabled on my WAMP setup and am trying to do the following:

1) HTML form accepts international characters (I'm using Russian), then POSTs data to PHP script
2) PHP script reads international character string
3) PHP takes every other letter and puts it into a $str1 variable
4) Remaining letters are appended to a $str2 variable

Normally, I'd treat the string as a pseudo-array and use a for() statement combined with strlen() to process one letter at a time. However, with international characters, this doesn't seem to work. This is my sample HTML:

<html>
<!-- enable UTF-8 encoding on this page -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>test page</title>
</head>

<body>
<form method="post" action="test3.php">
Input string below: <br />
<textarea name="str" id="str"></textarea>
<input type="submit">
</form>
</body>

</html>

The following is my processing script:

<?php
header('Content-type: text/html; charset=utf-8');

$str = $_POST['str'];
$str1 = "";
$str2 = "";
for($i=0;$i<=strlen($str)-1;$i++)
  {
  if($i % 2 == 1)
    {
    $str2 .= $str[$i];
    }
  else
    {
    $str1 .= $str[$i];
    }
  }

echo "String is $str <br />";
echo "String 1 is $str1 <br />";
echo "String 2 is $str2 <br />";
?>

The international characters do pass correctly, as the first echo statement at the end ($str) shows. However, the character split doesn't work: I get character boxes, instead of the characters themselves, in $str1 and $str2.

Can someone give me an idea of the proper way to process international strings on a character by character basis?

traq

See mb_strlen() in the PHP manual.

However, the $string[$i] syntax also works with bytes (not multibyte characters).
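To make the byte/character difference concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and the file is saved as UTF-8; the sample word is just an illustration, not taken from the original post.

```php
<?php
// Assumption: mbstring is enabled and this source file is UTF-8 encoded.
$word = 'Привет'; // hypothetical sample input: 6 Cyrillic letters

$byteCount = strlen($word);             // counts bytes, not letters
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

$firstByte = $word[0];                        // a single byte: only half of "П"
$firstChar = mb_substr($word, 0, 1, 'UTF-8'); // the complete first character

echo "bytes: $byteCount, characters: $charCount\n";
echo "first character: $firstChar\n";
```

Each Cyrillic letter takes two bytes in UTF-8, so indexing with $word[$i] cuts every letter in half, which is where the boxes come from.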

<?php
$utf8str = 'ᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩᚱ᛫ᛞᚱᛁᚻᛏᚾᛖ᛫ᛞᚩᛗᛖᛋ᛫ᚻᛚᛇᛏᚪᚾ᛬';

// split string at non-start, non-end character boundaries
// (the /u modifier makes the pattern treat the subject as UTF-8)
$utf8chars = preg_split( '/(?<!^)(?!$)/u',$utf8str );

// empty strings
$evens = $odds = '';

// loop through characters (index 0 is the 1st, i.e. an oddly positioned, character)
foreach( $utf8chars as $i=>$char ){
    if( $i % 2 === 1 ){
        $evens .= $char;
    }
    else{
        $odds .= $char;
    }
}

print "oddly positioned characters: $odds\nevenly positioned characters: $evens";

/* prints

oddly positioned characters: ᚷᚠᚻ᛫ᛁᛖᚠᚱᛞᛁᛏᛖᛞᛗᛋᚻᛇᚪ᛬
evenly positioned characters: ᛁ᛫ᛖᚹᛚ᛫ᚩ᛫ᚱᚻᚾ᛫ᚩᛖ᛫ᛚᛏᚾ

*/
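If you'd rather keep the for() loop from the original script, the same alternating split can probably be done with mb_strlen() and mb_substr() in place of strlen() and $str[$i]. This is only a sketch: splitAlternating() is a made-up helper name, the input is assumed to be valid UTF-8, and the type hints and [ ] destructuring need PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not from the original post); assumes mbstring and UTF-8 input.
function splitAlternating(string $str): array
{
    $str1 = '';
    $str2 = '';
    $length = mb_strlen($str, 'UTF-8'); // length in characters, not bytes

    for ($i = 0; $i < $length; $i++) {
        // take exactly one character starting at character offset $i
        $char = mb_substr($str, $i, 1, 'UTF-8');
        if ($i % 2 === 1) {
            $str2 .= $char;
        } else {
            $str1 .= $char;
        }
    }

    return [$str1, $str2];
}

[$str1, $str2] = splitAlternating('Привет'); // sample input for illustration
echo "String 1 is $str1\n";
echo "String 2 is $str2\n";
```

In the form handler you would pass $_POST['str'] instead of the sample word. Note that mb_substr() has to walk the string from the start on every call, so for long input splitting once (as with preg_split above, or mb_str_split() on PHP 7.4+) should be cheaper.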

jkurrle

This is very useful information. Thank you!

traq

welcome : )