[RESOLVED] Webpage title as UTF string

NRice

Hi,

I'm stuck with a silly problem here. I'm trying to fetch a webpage and parse its HTML title. It works fine for Western charset pages, but if I try to fetch a non-English (say Japanese) webpage then the returned page title is not well encoded (although the source webpage is UTF8 encoded). Here's the basic code I have (using Google News Japan as an example to print its webpage title):

$pagetext = file_get_contents("http://news.google.com/news?ned=jp");
preg_match("/<title>(.*?)<\/title>/is", $pagetext, $pagefound);
echo $pagefound[1];

Further, I store the webpage title in the database (in a UTF8 table with a UTF8 charset field). But the printed or stored value is not encoded correctly and has invalid characters like ??? in it.

Any clues as to how I can solve this problem will be helpful. Thanks!

dzysyak

Actually, I don't have many ideas about this problem, but here is what I can suggest: set the database connection encoding right after you connect to the database, something like this:

mysql_query("SET NAMES 'utf8'");

Hope this will help you.

MarkR

You should not attempt to parse an HTML document using regular expressions. It won't work in the general case.

I recommend that you feed the document into the DOMDocument::loadHTML function, which will parse it correctly provided it has an HTML meta content-type tag.

If the page has no meta content-type tag, you should read its encoding from the HTTP header - and somehow get loadHTML to treat that as the correct encoding. However, it is not straightforward to do this, as it seems it always uses latin1 as the default.

What I ended up doing when I did this was having some code manually insert a meta http-equiv content-type tag if one didn't exist in the document already, copying the charset from the http header.

This is a hack however.

All DOM methods return their string values in utf-8, regardless of the original encoding, so from that point forward there is no problem.
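
The meta-tag workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names charsetFromHeaders() and insertMetaCharset() are made up for this example, and the header lines are assumed to look like the $http_response_header array that PHP fills in after file_get_contents() on an http:// URL.

```php
<?php
// Hypothetical helper: pull the charset out of raw HTTP response header lines
// (for example the $http_response_header array set by file_get_contents()).
function charsetFromHeaders(array $headers) {
    $charset = null;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset=["\']?([^\s"\';]+)/i', $line, $m)) {
            $charset = $m[1]; // keep the last match, in case redirects add several
        }
    }
    return $charset;
}

// Hypothetical helper: add a meta http-equiv content-type tag to $html
// unless the document already contains one.
function insertMetaCharset($html, $charset) {
    if (preg_match('/<meta[^>]+http-equiv\s*=\s*["\']?content-type/i', $html)) {
        return $html; // the page already declares its encoding
    }
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">';
    // Put the tag right after <head> when there is one, otherwise at the very start.
    $patched = preg_replace_callback(
        '/<head(\s[^>]*)?>/i',
        function ($m) use ($meta) { return $m[0] . $meta; },
        $html,
        1,
        $count
    );
    return $count > 0 ? $patched : $meta . $html;
}
```

The patched string would then be passed to DOMDocument::loadHTML() in place of the raw page text. Like the original workaround, this is a hack: it edits the markup with a regular expression before the real parser sees it.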

Mark

MarkR

Storing stuff in a database is a separate issue. PHP is not very encoding-aware, therefore you have to be. If you're storing arbitrary strings (including non-Latin characters), you really have to use utf-8.

If you're using MySQL, yes you must do "SET NAMES utf8", but also you must ensure that the tables are all in utf8 encoding AND your application uses utf8 throughout.

Being inconsistent is the killer, as you will get a lot of problems if there is a mismatch at any stage (HTML, email, database client, database server, database tables, files on disc etc).

Mark

NRice

Thanks Jason and Mark.

@: I tried DOMDocument::loadHTML but it doesn't work either ...

$pagetext = file_get_contents("http://news.google.com/news?ned=jp");
$doc = new DOMDocument();
@$doc->loadHTML($pagetext);
$pagetitle = $doc->getElementsByTagName("title")->item(0)->firstChild->nodeValue;
echo $pagetitle;

The page title is parsed correctly, and the browser shows that the current page encoding is UTF-8, but the text is still not displayed correctly. The text comes up with invalid chars instead of valid Japanese text as in the title.

dzysyak

Have you found the issue?

By the way, have you set the correct encoding for the page and on the server (in the .htaccess file)? Maybe it is just an issue with the output to the browser.

scrupul0us

Do you have multi-lingual character support installed on your OS?

MarkR

NRice - the code you showed should work. But it doesn't.

I ran that code, and it appears that Google sends the output in a different encoding to that script than my browser. This is bad.

Google is sending Shift_JIS output to PHP, but utf8 to my browser. I don't know why. My guess is that it doesn't see the appropriate Accept-Charset header.

I don't think DOMDocument supports Shift_JIS.

Try setting an Accept-Charset header. I will post the code to do this shortly.

Mark

MarkR

Ok, I've got:

<?php

ini_set('default_charset', 'utf8');
$opts = array('http' =>
        array(
                'header' => "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n" .
                        "User-Agent: Mozilla/5.0 (Compatible; php page getter)"
        )
);
$context = stream_context_create($opts);
$pagetext = file_get_contents("http://news.google.com/news?ned=jp", false, $context);
file_put_contents("jap.html", $pagetext);
$doc = new DOMDocument();
$doc->loadHTML($pagetext);
$pagetitle = $doc->getElementsByTagName("title")->item(0)->firstChild->nodeValue;
echo $pagetitle;
echo "\n";
echo "Encoding is apparently " . $doc->encoding . "\n";
echo "actualEncoding = " . $doc->actualEncoding . "\n";

?>

Which sadly, still doesn't work. It manages to load the page in utf-8 now, but somewhere in the parsing or output, it's erroneously interpreted as latin1, which leads to corruption.
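
To make the kind of corruption described here concrete, the following small sketch simulates it in isolation. It assumes the iconv extension is available, and the sample string (the UTF-8 bytes of the two characters "日本") is chosen only for illustration; it is not taken from the Google page.

```php
<?php
// UTF-8 bytes for the two characters "日本", written as escapes so the
// example does not depend on the encoding of this file.
$utf8 = "\xE6\x97\xA5\xE6\x9C\xAC";

// Simulate the bug: treat those bytes as ISO-8859-1 (latin1) and re-encode as UTF-8.
$garbled = iconv('ISO-8859-1', 'UTF-8', $utf8);

// Each of the 6 original bytes has become its own 2-byte character.
echo strlen($utf8), " bytes before, ", strlen($garbled), " bytes after\n";

// Reversing the mistaken conversion restores the original bytes.
$restored = iconv('UTF-8', 'ISO-8859-1', $garbled);
var_dump($restored === $utf8);
```

If output looks like the garbled string, it suggests the data was valid UTF-8 at some point and was re-encoded by a step that assumed latin1.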

However, it is different. And at least I've managed to persuade Google to send me the data in utf-8.

I'll keep trying.

Mark

dzysyak

Have you tried changing the character encoding?

In .htaccess:

AddDefaultCharset UTF-8

or in php before output:

header("Content-Type: text/html; charset=UTF-8");

NRice

Guys, nothing seems to have helped still. I've tried everything discussed here.

NRice

Ok, I got it working!

function GetPageTitle($url) {
    $fd = @fopen($url, 'r');
    if (!$fd) {
        return false;
    }

    // Read only the first 5 KB; the title and charset normally sit in the head
    $html = stream_get_contents($fd, 5120);
    fclose($fd);

    // Get title from title tag
    $title = '';
    if (preg_match('/<title[^>]*>(.*?)<\/title>/si', $html, $matches)) {
        $title = trim($matches[1]);
    }

    // Get encoding from charset attribute
    $encoding = '';
    if (preg_match('/<meta[^>]*charset=["\']?([^\s"\';>]+)/i', $html, $matches)) {
        $encoding = strtoupper($matches[1]);
    }

    // Convert to UTF-8 from the original encoding
    if ($encoding !== '' && $encoding !== 'UTF-8' && function_exists('mb_convert_encoding')) {
        $title = @mb_convert_encoding($title, 'UTF-8', $encoding);
    }

    // utf8_strlen() is not a PHP built-in, so a plain length check is used instead
    if (strlen($title) > 0) {
        return $title;
    }

    // No title, so return filename
    $uriparts = explode('/', $url);
    return end($uriparts);
}

This function is from an external open source app, but it resolves the issue.
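
The step that actually fixes the title is the conversion from the page's declared encoding to UTF-8. As a minimal sketch of that step on its own, assuming either the mbstring or the iconv extension is loaded (toUtf8() is a hypothetical helper name, and the sample bytes are "日本" in Shift_JIS):

```php
<?php
// Hypothetical helper: convert $bytes from $encoding to UTF-8, using mbstring
// when it is loaded and falling back to iconv otherwise.
function toUtf8($bytes, $encoding) {
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }
    return iconv($encoding, 'UTF-8', $bytes);
}

// "日本" as Shift_JIS bytes, as a server might send them.
$sjis = "\x93\xFA\x96\x7B";

// After conversion the same two characters are stored as UTF-8.
$title = toUtf8($sjis, 'SJIS');
echo bin2hex($title), "\n";
```

The converted string can then be echoed on a UTF-8 page or stored in a UTF-8 table without further changes, provided the rest of the chain (connection charset, Content-Type header) is UTF-8 as well.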

bradgrafelman

Don't forget to mark this thread resolved.