Problem with replacing character "–"

gvanto

I am reading a text file (attached) which contains the character '–' (I think ASCII code 150, 0x96). The problem is that I can't seem to match or replace this character using preg_match / str_replace. When displayed, the character comes up as a "?" (the character with ASCII code 141 on this table: http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml).

And the weird thing is that when I use count_chars to see which ASCII code PHP thinks it is, I get 3 DIFFERENT values:

foreach (count_chars("–", 1) as $i => $val) {
	$this->debug("", "i=$i There were $val instance(s) of " . chr($i) . " in the string. ord = " . ord(chr($i)));
}

Outputs:

'i=128 There were 1 instance(s) of � in the string. ord = 128'

'i=147 There were 1 instance(s) of � in the string. ord = 147'

'i=226 There were 1 instance(s) of � in the string. ord = 226'

I think the problem might be in the way I'm reading the file contents (UTF-8 encoding by default for file_get_contents()?). Perhaps I need to read it as plain text so that the "–" character is preserved? Only, I'm not quite sure how to do a TEXT read using file / file_get_contents; the docs don't have an example of this.

Help would be much appreciated,
Gerry

Weedpacket

The en-dash (which is what that character is) does indeed live at code point 150 (0x96) in the Windows-1252 character set.

It does not appear in the ASCII character set, nor does it appear in ISO-8859-1 (aka Latin-1).

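Its Unicode code point is U+2013, which in a UTF-8 document is encoded as the three bytes 226, 128, 147 (0xE2 0x80 0x93). count_chars() lists bytes in ascending order of value, which is why they show up as 128, 147, 226 in your output.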
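
To make the byte layout concrete, here is a minimal sketch that inspects the UTF-8 encoding of the en-dash. The variable name $enDash is only illustrative, and the dash is written as explicit byte escapes so the result does not depend on how the PHP source file itself is saved.

```php
<?php
// U+2013 (en-dash) written as explicit UTF-8 bytes.
$enDash = "\xE2\x80\x93";

// strlen() counts bytes, not characters.
echo strlen($enDash), "\n";   // 3
echo bin2hex($enDash), "\n";  // e28093

// count_chars() in mode 1 returns byte value => count, keyed in
// ascending byte order: 128, 147, 226.
print_r(array_keys(count_chars($enDash, 1)));

// unpack('C*', ...) lists the bytes in the order they occur in the
// string: 226, 128, 147.
print_r(array_values(unpack('C*', $enDash)));
```

So three different values from count_chars() for one visible character is what a UTF-8 string literal should produce.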

It would appear that your source code is being saved as UTF-8, given the bytes reported as being in the string literal "–". If the text file is saved using the same encoding, then str_replace("–", $whatever, $text) would do the job: the correct bytes are being searched for and replaced. If not, then either convert the text file (with iconv) into the same encoding as the PHP source file containing the string literals, or specify the bytes explicitly as "\xe2\x80\x93" (a UTF-8 encoded en-dash) or "\x96" (an en-dash in Windows-1252).
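
The two options above can be sketched roughly as follows. This assumes the text file uses a Windows code page in which 0x96 is the en-dash (CP1252 is used here as an assumption; substitute the real encoding) and that the iconv extension is available. The sample string and variable names are made up for illustration.

```php
<?php
// Explicit bytes, so nothing depends on how this source file is saved.
$utf8Dash   = "\xE2\x80\x93"; // en-dash (U+2013) encoded as UTF-8
$cp1252Dash = "\x96";         // en-dash as a single Windows-1252 byte

// Hypothetical sample: "pages 10–20" as it would sit in a Windows-1252 file.
$text = "pages 10" . $cp1252Dash . "20";

// Option 1: convert the text to UTF-8 first, then search for the UTF-8 bytes.
$converted = iconv('CP1252', 'UTF-8', $text);
if ($converted === false) {
    throw new RuntimeException('iconv could not convert the text');
}
$viaIconv = str_replace($utf8Dash, '-', $converted);

// Option 2: leave the text as it is and search for the single byte.
$viaByte = str_replace($cp1252Dash, '-', $text);

echo $viaIconv, "\n"; // pages 10-20
echo $viaByte, "\n";  // pages 10-20
```

Either way, the search string has to be in the same encoding as the text being searched.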

(UTF-8 encoding by default for file_get_contents() )

Unless you're using PHP 6 already, file_get_contents() is binary-safe: the bytes it puts in the string are the same bytes that were in the original file (actually, binary-safe file reading is the default in PHP 6 as well).
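
A quick way to check this is to write a known byte to a scratch file and read it back. This is only a sketch: the temp file and variable names are hypothetical, and the UTF-8 validity check relies on PCRE rejecting invalid UTF-8 when the /u modifier is used.

```php
<?php
// Hypothetical scratch file containing "a", the single byte 0x96, then "b".
$path = tempnam(sys_get_temp_dir(), 'dash');
file_put_contents($path, "a\x96" . "b");

// file_get_contents() returns the bytes exactly as stored on disk.
$raw = file_get_contents($path);
unlink($path);

echo bin2hex($raw), "\n"; // 619662

// A lone 0x96 byte is not valid UTF-8, so matching with /u fails.
$isUtf8 = preg_match('//u', $raw) === 1;
var_dump($isUtf8); // bool(false)
```

If the bytes read back differ from what you expect, the tool that wrote the file is the place to look, not the read itself.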

gvanto

Hi Weedpacket
[like your signature btw!]

Thanks for your helpful response. I think I see how Windows' wonderful ingenuity has come to shoot me in the foot again :-)

I've tried the following, but without success, the en-dash (or whatever it is read as by file_get_contents) remains ....

$l = preg_replace("/([\x80\x93\xe2])|([\x96])/", "-", $l);

lol why does windows have these characters?!
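
A note on the pattern above, as a sketch on a made-up input line: a character class matches one byte at a time, so a UTF-8 en-dash would be replaced by three hyphens, not left untouched. Matching the three-byte sequence as a unit, with the single Windows byte as an alternative, replaces each dash exactly once.

```php
<?php
// Hypothetical input line containing both encodings of the en-dash.
$l = "utf8: 1\xE2\x80\x93" . "2, cp1252: 3\x96" . "4";

// Character class, as in the attempt above: every matching byte is
// replaced separately, so the UTF-8 dash turns into "---".
$perByte = preg_replace("/([\x80\x93\xe2])|([\x96])/", "-", $l);

// Alternation of whole byte sequences: the UTF-8 dash is matched as
// one three-byte unit, the Windows dash as one byte.
$perChar = preg_replace('/\xE2\x80\x93|\x96/', '-', $l);

echo $perByte, "\n"; // utf8: 1---2, cp1252: 3-4
echo $perChar, "\n"; // utf8: 1-2, cp1252: 3-4
```

If neither pattern changes the text, the dash is probably not stored as either of these byte sequences.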

NogDog

This might help: http://www.charles-reace.com/blog/2008/10/15/filtering-ms-word-text/

PS: As to "why", because the character set Windows uses was designed for Windows, not for the internet.

gvanto

Found the problem: pdftotext by default writes its output as Latin1, so I just had to manually set the -enc option to UTF-8 ... fark, that wasted a couple of hours!!!!