I've inherited a big nasty oscommerce project with a lot of dodgy modules installed. One of the many things I'm trying to get my head around is the way in which a product name (or category name, etc.) gets converted into an seo-friendly url.
The problem I'm having is that the product names in the database appear to be stored as iso-8859-1 and contain some unusual chars such as an ellipsis (ord=133). This is not a particular problem for most display purposes, as the charset declared in the html is iso-8859-1. The culprit appears to be this function, which is supposed to convert a string of text into a string of words-and-dashes that can serve as an seo-friendly url:
    function strip($string){
        if ( is_array($this->attributes['SEO_CHAR_CONVERT_SET']) ) $string = strtr($string, $this->attributes['SEO_CHAR_CONVERT_SET']);
        $pattern = $this->attributes['SEO_REMOVE_ALL_SPEC_CHARS'] == 'true'
            ? "([^[:alnum:]])+"
            : "([[:punct:]])+";
        $anchor = ereg_replace($pattern, '-', mb_convert_case($string, MB_CASE_LOWER, "utf-8"));
        $pattern = "([[:space:]]|[[:blank:]])+";
        $anchor = ereg_replace($pattern, '-', $anchor);
        return $this->short_name($anchor); // return the short filtered name
    } # end function
I'm not sure I see any wisdom in converting to a multibyte string while using single-byte ereg statements (and isn't ereg deprecated?). The net effect of this function is that the phrase "Something…and something else" becomes "somethingand-something-else". The ellipsis just disappears.
Removing the mb_convert_case function call and using strtolower helps (interestingly, [[:punct:]] fails to match the ellipsis), but I'm hoping to make this work even if folks are using kanji or arabic or whatever. I'm guessing I'll need to change the character encoding declaration of my pages at the very least.
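To make the goal concrete, here is a minimal sketch of the kind of replacement I have in mind. It is not the module's code: the function name `utf8_slug` and its `$sourceEncoding` parameter are made up for illustration, and it assumes PHP 7+ with the mbstring extension and a PCRE build with Unicode support. It also assumes the stored bytes are really Windows-1252 (where byte 133 is the ellipsis), which I have not confirmed.

```php
<?php
// Hypothetical replacement for strip(); all names here are illustrative only.
function utf8_slug(string $text, string $sourceEncoding = 'Windows-1252'): string
{
    // Convert the legacy single-byte data to UTF-8 before doing anything else.
    $utf8 = mb_convert_encoding($text, 'UTF-8', $sourceEncoding);

    // Lowercase with a multibyte-aware function.
    $lower = mb_strtolower($utf8, 'UTF-8');

    // Keep letters (\p{L}), combining marks (\p{M}) and digits (\p{N}) from any
    // script; collapse every other run of characters into a single dash.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $lower) ?? '';

    // Drop any leading or trailing dash left over from punctuation.
    return trim($slug, '-');
}

echo utf8_slug("Something\x85and something else"), "\n"; // something-and-something-else
```

With this approach the ellipsis becomes a word separator instead of vanishing, and passing 'UTF-8' as the second argument should let the same function handle input that is already UTF-8.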
Does anyone have some good general advice about making sure a website is utf-8 friendly? Some things occur to me:
1) Data is captured via browser in form inputs. I'm guessing these pages where I capture user/admin input should declare a meta tag that specifies utf-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
2) any PHP code that processes this input should use the multibyte string functions. I was sad to notice that there are no preg functions in this collection, just ereg.
3) My databases and database tables should all be declared with utf-8 as the charset. I have no idea how this relates to (and possibly conflicts with?) the collation declared for a given column.
4) Any PHP code that processes database output should also use the multibyte string functions.
5) When the html (or xml or whatever) is finally blasted out to the browser, it should have the meta tag declaring said output as utf-8 (see item 1 above). Does it also need to send a header?
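A rough sketch of how I imagine items 3 to 5 fitting together in PHP, to show what I'm asking about. The helper names `use_utf8_connection` and `e` are invented for this example, and it assumes PHP 7+ with the mbstring extension, plus mysqli and a MySQL version that supports the `utf8mb4` charset for the database part. I don't know yet whether this is the right set of steps.

```php
<?php
// Make UTF-8 the default encoding for the mb_* functions.
mb_internal_encoding('UTF-8');

// Declare the charset in the HTTP header as well as in the meta tag.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical helper: ask the database connection to talk UTF-8.
// Assumes $db is an already-connected mysqli instance.
function use_utf8_connection(mysqli $db): bool
{
    return $db->set_charset('utf8mb4');
}

// Hypothetical helper: escape text for HTML output, treating it as UTF-8.
function e(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo e('Something… <and> something else'), "\n";
```

The idea is that the header, the connection charset, and the escaping all name the same encoding, so no layer has to guess.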
Any guidance or wisdom here would be much appreciated.