Hi, this is a monster regex and it's doing my head in. I'm trying to make a site XHTML compatible, and the main thing it's falling down on in the W3C XHTML validator is ampersands in the source. I need a way of sifting through PHP files and replacing ampersands with &amp; where necessary. Here is what it needs to do.

  1. find an ampersand
  2. if the ampersand is not within <?(|php|PHP) and ?> then skip 3.
  3. if the ampersand is not within "" or '' (taking account of backslashes) then ignore it.
  4. replace all ampersands that do not match criteria for possibly being html entities ( /\&[a-zA-Z0-9]{3,7};/ ) with &amp;

I'm having real difficulty with even where to begin. Could someone please help with this? Is it possible with one preg_replace? If so how?

What I've managed to put together for part 4 so far is
'/(\&)(?=[;]*(\s|"))/'
'/(\&)(?=[;]{2,7}(\s|")/'
but neither of them work quite right even just for that one section.
Thanks
Bubble
PS.
Even if you solve just part of the problem please post your 2c.

    Okay, here's a wacky idea off the top of my head. Since we're talking about working over PHP files, use PHP's own tokenizer to figure out which ampersands are which: which ones are being used as operators (and hence shouldn't be touched) and which bits are in strings. Then you only need to deal with the ampersands that appear inside string literals (since those are probably the only ones that will be appearing in the generated HTML).
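
    A rough sketch of that idea, assuming the script runs on a PHP build that has the tokenizer extension. The function names are made up for illustration, and the entity test is only a guess at what should count as an existing entity:

```php
<?php
// Hypothetical helper: escape ampersands that do not already start something
// shaped like an entity (the length limits are an assumption).
function escape_bare_ampersands($text)
{
    return preg_replace('/&(?![a-zA-Z0-9]{2,8};|#[0-9]{1,7};)/', '&amp;', $text);
}

// Hypothetical helper: walk the token stream and only touch inline HTML
// and string literals, leaving operators such as & and && alone.
function escape_ampersands_in_source($source)
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            $out .= $token; // plain one-character token, copied unchanged
            continue;
        }
        $text = $token[1];
        if ($token[0] == T_INLINE_HTML
            || $token[0] == T_CONSTANT_ENCAPSED_STRING
            || $token[0] == T_ENCAPSED_AND_WHITESPACE) {
            $text = escape_bare_ampersands($text);
        }
        $out .= $text;
    }
    return $out;
}
```

    This touches every string literal, including ones that never reach the browser (SQL, header() arguments), so the result would still want a look over by hand.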

      Thanks for the tip (I like the link to the code high-lighter at the bottom), however the server is running 4.2.2 without tokenizer. Any other ideas?
      Cheers
      Bubble

        Okay; it's more hassle, but it is easier than trying to accommodate every single possibility in one regexp.

        Basically, the idea is to take out bits you don't want altered (in this case, anything that might contain an & you don't want escaped), alter what's left, and then put those bits back in. You can use several regexps to remove different categories of thing until the text you're working on only contains & that you want escaped.

        I wrote some off-the-cuff code for this once; it's fairly usable, and I dropped it in this thread.

        Dunno how well that will work; in this case there might have to be two passes. First, take out PHP blocks and escape what's left (the HTML), put the PHP back in, then take the HTML out and escape the & characters that appear in string literals. (Alternatively, work on the array of PHP blocks to escape & characters in string literals before putting them back into the HTML.)
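
        A minimal sketch of the first pass only (take out the PHP, escape the HTML, put the PHP back). The function name is hypothetical, the block pattern is deliberately naive, and the entity test is an assumption:

```php
<?php
// Hypothetical helper name. The PHP-block pattern is naive: it assumes the
// closing tag never appears inside a PHP string or comment.
function escape_html_outside_php($source)
{
    // With PREG_SPLIT_DELIM_CAPTURE the pieces alternate:
    // even indexes are HTML, odd indexes are the captured PHP blocks.
    $pieces = preg_split('/(<\?.*?(?:\?>|\z))/s', $source, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($pieces as $i => $piece) {
        if ($i % 2 == 0) {
            // HTML: escape ampersands that do not already start an entity.
            $pieces[$i] = preg_replace('/&(?![a-zA-Z0-9]{2,8};|#[0-9]{1,7};)/', '&amp;', $piece);
        }
    }
    // Put everything back together in the original order.
    return implode('', $pieces);
}
```

        The second pass would do something similar to the odd-numbered pieces, but only inside their string literals.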

          Cool, I've just done a similar thing (but a hell of a lot sloppier) by splitting the PHP and the HTML into two arrays. While doing this, every time I reach a new HTML chunk I split the PHP that has just finished into code and strings. Doesn't handle \\" but I'm pretty sure that never happens in the site (I hope :p ).
          OK, no laughing now. Here it is.
          I'd like to use my method (purely out of pride...and because I've bloody worked on it for three hours already!!😉) so, do you have any comments?

          <?php
          //split php and html
          function strip_code($string)
          {
              $len=strlen($string);
              $intag=false;
              $html=array();
              $php=array();
              for($i=0,$a=-1,$b=-1;$i<$len;$i++) {
                  if(!$intag && $string[$i]=='<' && $string[$i+1]=='?') {
                      $html[$a].=$string[$i++].$string[$i];
                      if(preg_match('/php/i',$string[$i+1].$string[$i+2].$string[$i+3]))
                          $html[$a].=$string[++$i].$string[++$i].$string[++$i];
                      $intag=true;
                      $b++;
                  } elseif($intag && $string[$i]=='?' && $string[$i+1]=='>') {
                      $php[$b].=$string[$i++];
                      $intag=false;
                      $php[$b]=just_strings($php[$b]);
                      $a++;
                  } elseif(!$intag) {
                      $html[$a].=$string[$i];
                  } else {
                      $php[$b].=$string[$i];
                  }
              }
              return array('html'=>$html,'php'=>$php);
          }
          
          //for php; split code and strings
          function just_strings($string)
          {
              $quotes=array('"',"'");
              $inquotes=array(false,false);
              $qcount=count($quotes);
              $len=strlen($string);
              $code=array();
              $quote=array();
              for($i=0, $c=0, $q=0;$i<$len;$i++) {
                  if(!in_array(true,$inquotes) && !in_array($string[$i],$quotes)) {
                      $code[$c].=$string[$i];
                  } elseif(!in_array(true,$inquotes) && in_array($string[$i],$quotes)) {
                      $code[$c++].=$string[$i];
                      for($j=0;$j<$qcount;$j++) {
                          if($string[$i]==$quotes[$j]) {
                              $inquotes[$j]=true;
                          }
                      }
                  } elseif(in_array(true,$inquotes)) {
                      for($j=0;$j<$qcount;$j++) {
                          if($inquotes[$j]==true && $string[$i]==$quotes[$j] && ($string[$i-1]!="\\" || $string[$i-2]=="\\")) {
                              $code[$c].=$string[$i];
                              $q++;
                              $inquotes[$j]=false;
                          } elseif ($inquotes[$j]==true) {
                              $quote[$q].=$string[$i];
                          }
                      }
                  }
              }
              return array('code'=>$code, 'quote'=>$quote);
          }
          ?>

          Cheers
          Bubble

          PS
          Just out of interest, how would one create a regex to eliminate the contents of PHP tags? At first it seemed easy, but then you are allowed to use the ?> in a string and it is quite possible you would have to. Does this come back to lookaheads and lookbehinds?
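
          One possible shape for such a regex, as a sketch only: instead of lookbehinds, it consumes each quoted string as a single unit so a closing tag inside a string cannot end the block. The function name is hypothetical, and it assumes a PCRE build with possessive quantifiers:

```php
<?php
// Hypothetical helper: remove PHP blocks, treating quoted strings as single
// units so a closing tag inside a string does not end the block early.
function strip_php_blocks($source)
{
    $dq = '"(?:[^"\\\\]|\\\\.)*"';   // double-quoted string with backslash escapes
    $sq = "'(?:[^'\\\\]|\\\\.)*'";   // single-quoted string with backslash escapes
    // Inside a block: runs of ordinary characters, whole strings, or a
    // question mark that is not the start of the closing tag.
    $pattern = '/<\?(?:[^"\'?]++|' . $dq . '|' . $sq . '|\?(?!>))*+(?:\?>|\z)/s';
    return preg_replace($pattern, '', $source);
}
```

          A quote or apostrophe inside a PHP comment would still confuse it, since the pattern knows nothing about comments.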

            Using this for the ampersands
            '/&(?!amp|uml|lt|gt|quot)[;]/'
            The problem is, I have to explicitly list every possible entity name between & and ; because apparently you cannot have quantifiers in lookaheads. Is there a simple workaround for this kind of thing?
            Thanks
            Bubble
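
            For what it's worth, PCRE lookaheads can contain quantifiers; it is lookbehinds that PCRE has traditionally restricted to fixed-length alternatives. A minimal sketch, where the function name and the length limits are assumptions rather than anything official:

```php
<?php
// Hypothetical helper: escape every & that is not followed by something
// shaped like a named entity (2-8 letters/digits) or a decimal reference.
function escape_bare_ampersands($text)
{
    return preg_replace('/&(?![a-zA-Z0-9]{2,8};|#[0-9]{1,7};)/', '&amp;', $text);
}

// prints: R&amp;D &amp; &lt; &#38; &amp;
echo escape_bare_ampersands('R&D &amp; &lt; &#38; &'), "\n";
```

            The trade-off is that anything that merely looks like an entity is left alone, whether or not the name really exists.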

              Two things:
              Apostrophes in comments might cause problems.

              For the entities, you might have easier workings if you store them as arrays somehow and pass the arrays to the preg function (which can take arrays of regexps and process them all in a single call).

              Or expand all of the ampersands, and collapse back down those that shouldn't have been - those would be ones of the form &amp;[a-z]+; and &amp;#[0-9]+;
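
              A sketch of the expand-then-collapse idea, passing the patterns as arrays so one preg_replace() call runs them in order. The function name is made up, and the named-entity pattern is widened slightly to allow uppercase letters and digits, which is an assumption:

```php
<?php
// Hypothetical helper: expand every ampersand, then collapse the ones that
// were already part of an entity. preg_replace() applies the patterns in order,
// each one working on the result of the previous one.
function expand_then_collapse($text)
{
    $patterns = array(
        '/&/',                              // 1. expand everything
        '/&amp;([a-zA-Z][a-zA-Z0-9]*;)/',   // 2. collapse named entities
        '/&amp;(#[0-9]+;)/'                 // 3. collapse numeric references
    );
    $replacements = array('&amp;', '&$1', '&$1');
    return preg_replace($patterns, $replacements, $text);
}
```

              For example, 'a & b &amp; c &lt; &#169;' should come back as 'a &amp; b &amp; c &lt; &#169;'.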
