Hello,

Totally stumped on this one...any ideas?

Alright, I am reading a remote file into my script using fread(), then grabbing only portions of that document and modifying them using eregi() and str_replace().

Now, sometimes the remote file has special characters in it, for example:

’ - &#146;
“ - &#147;
” - &#148;
— - &#151;

And for some reason my script is interpreting this as gibberish. For example, the remote file will say:

John Doe’s site

And my script will interpret this as:

John Doe’s site

I have tried using str_replace() to fix this, for example:

$review[1] = str_replace("’", "'", $review[1]);

However, it doesn't seem to work. Any ideas?

Thanks!

    Have you tried reading the file in binary mode?

    fopen($file, "rb");

      No, I was just using "r". I tried changing it and nothing changed. Here is my complete code with some non-critical lines removed for readability:

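      $pi_source = "http://www.somesite.com/";
      $pi_contents = ""; // initialize so the .= below never appends to an undefined variable
      
      $pi_open = @fopen($pi_source, "rb");
      
      if (!empty($pi_open)) {
      	// read the remote file in 8 KB chunks until nothing is left
      	do {
      	    $pi_read = fread($pi_open, 8192);
      	    if (strlen($pi_read) == 0) {
      	        break;
      	    }
      	    $pi_contents .= $pi_read;
      	} while(true);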
      
      $pi_review = eregi("<!--IN THEATERS-->(.*)<!--ALSO IN THEATERS-->", $pi_contents, $review);
      
      // REMOVED: Find and replace HTML contents and other formatting
      
      $review[1] = str_replace("&#146;", "'", $review[1]);
      $review[1] = str_replace("&#147;", "\"", $review[1]);
      $review[1] = str_replace("&#148;", "\"", $review[1]);
      $review[1] = str_replace("&#151;", "--", $review[1]);
      
      fclose($pi_open);
      }
      
      echo $review[1];

      Any thoughts/ideas?

        Anyone have any ideas? 🙁

          Hi,

          The manual pages for utf8_encode and utf8_decode have user-contributed notes about cp1252 data converted to UTF-8 (which leads to some kind of invalid UTF-8 data).
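
          To illustrate the mismatch those notes describe, here is a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode(); the variable names are made up for the example.

          ```php
          <?php
          // Assumption: the mbstring extension is loaded.
          // 0x92 is the right single quotation mark in cp1252.
          $cp1252_byte = "\x92";

          // Reading the byte as ISO-8859-1 (which is what utf8_encode() does)
          // gives U+0092, a control character, encoded as C2 92.
          $as_latin1 = mb_convert_encoding($cp1252_byte, 'UTF-8', 'ISO-8859-1');

          // Reading it as Windows-1252 gives U+2019, encoded as E2 80 99.
          $as_cp1252 = mb_convert_encoding($cp1252_byte, 'UTF-8', 'Windows-1252');

          echo bin2hex($as_latin1), "\n";  // c292
          echo bin2hex($as_cp1252), "\n";  // e28099
          ```

          The two results differ, which is why a map between the C2 xx sequences and the real UTF-8 sequences is needed for the 0x80 to 0x9F range.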

          One user (dobersch at gmx dot net) provided a function to correctly translate that utf-8 data to iso-8859-1 data.

          Your example string displays correctly with that code:

          <?PHP
          ini_set('error_reporting', E_ALL);
          // map taken from http://de3.php.net/manual/de/function.utf8-encode.php#45226
          $cp1252_map = array(
             "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
             "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
             "\xc2\x83" => "\xc6\x92",    /* LATIN SMALL LETTER F WITH HOOK */
             "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
             "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
             "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
             "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
             "\xc2\x88" => "\xcb\x86",    /* MODIFIER LETTER CIRCUMFLEX ACCENT */
             "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
             "\xc2\x8a" => "\xc5\xa0",    /* LATIN CAPITAL LETTER S WITH CARON */
             "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
             "\xc2\x8c" => "\xc5\x92",    /* LATIN CAPITAL LIGATURE OE */
             "\xc2\x8e" => "\xc5\xbd",    /* LATIN CAPITAL LETTER Z WITH CARON */
             "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
             "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
             "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
             "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
             "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
             "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
             "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
          
             "\xc2\x98" => "\xcb\x9c",    /* SMALL TILDE */
             "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
             "\xc2\x9a" => "\xc5\xa1",    /* LATIN SMALL LETTER S WITH CARON */
             "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
             "\xc2\x9c" => "\xc5\x93",    /* LATIN SMALL LIGATURE OE */
             "\xc2\x9e" => "\xc5\xbe",    /* LATIN SMALL LETTER Z WITH CARON */
             "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
          );
          
          // I find this name a little misleading because the result won't be valid UTF8 data
          function cp1252_to_utf8($str) {
             global $cp1252_map;
             return  strtr(utf8_encode($str), $cp1252_map);
          }
          
          function cp1252_utf8_to_iso($str) { // the other way around...
            global $cp1252_map;
            return  utf8_decode( strtr($str, array_flip($cp1252_map)) );
          }
          
          $str = cp1252_utf8_to_iso("John Doe’s site");
          
          echo $str;
          ?>
          

          So basically remove the str_replace lines and then use cp1252_utf8_to_iso on $review[1].
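
          On newer PHP versions, where utf8_decode is deprecated, a similar conversion can probably be done in one call. This is only a sketch under assumptions: it needs the mbstring extension, it assumes the fetched text really is UTF-8, and utf8_to_cp1252 is a hypothetical helper name, not something from the manual notes.

          ```php
          <?php
          // Hypothetical helper (not from the manual notes): convert UTF-8 text
          // to Windows-1252 bytes. Assumes the mbstring extension is loaded.
          function utf8_to_cp1252(string $str): string {
              return mb_convert_encoding($str, 'Windows-1252', 'UTF-8');
          }

          // "John Doe" + U+2019 (right single quote) + "s site" as UTF-8 bytes.
          $review = "John Doe\xe2\x80\x99s site";
          $converted = utf8_to_cp1252($review);

          // The three-byte quote becomes the single cp1252 byte 0x92.
          echo bin2hex(substr($converted, 8, 1)), "\n"; // 92
          echo strlen($converted), "\n";                // 15
          ```

          The result only displays correctly if the page itself is served as Windows-1252 (or ISO-8859-1 in browsers that treat the two alike).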

          Thomas

            THANK YOU!!!!! 😃

            I can't tell you how relieved I am to have figured this thing out! 😉
