[RESOLVED] A problem with Unicode canonical equivalence and substr()

steamPunk

I'm building a site in several languages and using the UTF-8 encoding to display all the characters properly.

This works very nicely, except when I use substr() and it cuts the text off just after an accented letter.

Example:

I have this text in the DB:

"Bienvenue à l’ESMA Aviation Academy !

Vous avez choisi d’entrer dans le monde aéronautique, monde de passion (...)"

When I do

<?php echo substr($str,0,100); ?>

the text cuts off at the "é" in "aéronautique", and the "é" itself is displayed as a rectangle. I couldn't work out why this was happening, as all the other accented characters were displaying properly.

However, I found that by lengthening the substr() by 1

<?php echo substr($str,0,101); ?>

the "é" character displays properly!

So I did some research and discovered that Unicode is a lot more complex than I thought: in UTF-8, an accented letter such as "é" is stored as two or more bytes even though it displays as a single character, and substr() counts bytes, which would explain the anomaly.
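A quick way to check what is going on, as a minimal sketch: it assumes the "é" is stored as the single precomposed character U+00E9, which takes two bytes in UTF-8. The variable names are only illustrative. substr() counts bytes, so a cut that lands between those two bytes leaves an invalid sequence that the browser shows as a rectangle.

```php
<?php
// "é" (U+00E9) written out as its two UTF-8 bytes, so the example does not
// depend on how this file itself is saved.
$e    = "\xC3\xA9";
$word = "a" . $e . "ro";                 // "aéro"

$byteLength = strlen($e);                // strlen() counts bytes, not characters
$hexBytes   = bin2hex($e);               // the raw bytes as hex

// Cutting after 2 bytes keeps "a" plus only the first byte of "é".
$cut = substr($word, 0, 2);
// preg_match() with the /u modifier fails on invalid UTF-8, so it works as a validity check.
$cutIsValidUtf8 = (preg_match('//u', $cut) === 1);

// One byte more keeps the whole "é".
$longer = substr($word, 0, 3);
$longerIsValidUtf8 = (preg_match('//u', $longer) === 1);

var_dump($byteLength, $hexBytes, $cutIsValidUtf8, $longerIsValidUtf8);
```

This matches the behaviour described above: a length that ends inside the "é" breaks it, and one more byte fixes the display.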

But I don't know what I can do to ensure that substr() doesn't cut such a character in the middle.

The only thing I can think of is to write a substr()-like function that never cuts a word in the middle, but instead continues to the end of the word and cuts at the next space.

But I haven't got a clue how to go about doing that.

Could someone suggest a method that I could use?

thanks

laserlight

Perhaps you could use the multibyte string functions, e.g., mb_substr().
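As a rough illustration of the difference, assuming the mbstring extension is loaded and the text is UTF-8 (the sample word and variable names are only for the demonstration):

```php
<?php
// "aéronautique", with "é" written as its UTF-8 bytes C3 A9.
$word = "a\xC3\xA9ronautique";

// substr() counts bytes: 2 bytes = "a" plus the first half of "é".
$byBytes = substr($word, 0, 2);

// mb_substr() counts characters in the given encoding: 2 characters = "aé".
$byChars = mb_substr($word, 0, 2, "UTF-8");

echo $byChars, "\n";
```

Note that mb_substr() counts code points. If the text contained decomposed letters (a base letter followed by a combining accent), it could still split them; the intl extension's grapheme_substr() may be worth a look in that case.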

steamPunk

Yes, that did the trick, thanks:

	$encoding = "UTF-8";	
	echo mb_substr($str,0,100,$encoding);

In fact, to kill two birds with one stone, I've extended it so that it won't cut in the middle of a word:

	
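$offset = 100;
$encoding = "UTF-8";
$search = " ";
// find the first space at or after $offset (only if the text is that long)
$n = (mb_strlen($str,$encoding) > $offset) ? mb_strpos($str,$search,$offset,$encoding) : false;
// no space found, or text shorter than $offset: show the whole string
echo ($n === false) ? $str : mb_substr($str,0,$n,$encoding);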

Is there a shorter way of writing it?

thanks again
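One possibly shorter form, offered only as a sketch: a single regular expression with the /u modifier, so that "." matches a whole UTF-8 character rather than a byte. The function name truncate_at_word() is made up for this example. It differs slightly from the mb_strpos() version in that it stops at any whitespace (including a newline), not only at a space, and it returns the whole text when the text is shorter than the limit.

```php
<?php
// Hypothetical helper: keep at least $min characters, then continue to the
// end of the current word. Assumes $text is valid UTF-8.
function truncate_at_word(string $text, int $min): string
{
    $min = max(0, $min);
    // .{0,$min} takes up to $min characters (the s modifier lets "." match newlines),
    // \S*       then runs on to the end of the word in progress.
    if (preg_match('/^.{0,' . $min . '}\S*/us', $text, $m) === 1) {
        return $m[0];
    }
    // preg_match() fails on invalid UTF-8; fall back to the untouched text.
    return $text;
}

echo truncate_at_word("monde a\xC3\xA9ronautique, monde de passion", 8), "\n";
```

With the original text this would be called as truncate_at_word($str, 100). PCRE limits a repeat count such as {0,100} to 65535, which should not matter for excerpt lengths.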