Pattern Matching Odyssey

sneakyimp

hi all

i am about to embark on a pattern matching journey. I'm hoping that someone can recommend which strains of pattern matching function are the best weapon to deal with my file issues. Perl-compatible or POSIX extended?? what's the difference?

my fleet is 4 or 5 directories full of handsome and seaworthy HTML files. I need to bravely steer them to new directories every bit as good as their old ones and named identically.

During this journey from an old folder named /src/foo, a typical html file will move to another folder named /output/foo and will undergo the following changes which will strip out headers and footers while retaining critical javascript and css references.

1) CSS file references in the header must be extracted and put into an array.

2) any javascript in the header must be extracted and saved in a variable. this variable must be cleansed of certain javascript functions which i will be including on every page as a js file reference in the html.

3) a fairly consistent set of headers and footers must be removed. there are 5 varieties of these header/footer pairs, with different image files but the same formatting--i'm nearly certain headers all end with <td background="images/1bg.gif">

footers all begin with:

	</td>
  </tr>
  <tr>
    <td><table width="100%" border="0" cellspacing="0" cellpadding="0">
        <tr>
          <td><table width="100%" border="0" cellspacing="0" cellpadding="0">
              <tr>
                <td height="14" background="images/one_of_5_images.gif">

4) all links internal to the site will be changed from .html to .php. external site references will be unaffected.
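a rough sketch of how steps 1 and 4 might look with the Perl-compatible (preg_*) functions. The function names extract_css_refs and rewrite_internal_links and the sample page are made up for illustration, and the patterns assume double-quoted href attributes, which may not hold for every file:

```php
<?php
// Hypothetical sample page standing in for one of the real HTML files.
$html = <<<'HTML'
<html><head>
<link rel="stylesheet" type="text/css" href="css/main.css">
<link rel="stylesheet" href="css/print.css" media="print">
</head><body>
<a href="about.html">About</a>
<a href="products/list.html#top">Products</a>
<a href="http://example.com/page.html">External</a>
</body></html>
HTML;

// Step 1: collect the href of every <link> tag into an array.
// Assumption: hrefs are double-quoted; rel="stylesheet" is not checked.
function extract_css_refs($html) {
    preg_match_all('~<link\b[^>]*\bhref="([^"]+)"[^>]*>~i', $html, $m);
    return $m[1];
}

// Step 4: change .html to .php only in hrefs that do not start with a
// scheme (http:, mailto:, ...) or with //, i.e. links internal to the site.
function rewrite_internal_links($html) {
    return preg_replace(
        '~(href=")(?![a-z][a-z0-9+.-]*:|//)([^"#?]+)\.html(?=["#?])~i',
        '$1$2.php',
        $html
    );
}

print_r(extract_css_refs($html));   // css/main.css, css/print.css
echo rewrite_internal_links($html); // about.php and products/list.php#top; the external link is untouched
```

an internal link written as a full absolute URL to the site's own domain would be treated as external by this sketch, so those would need a separate pass.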

There are more things i will need to tweak but i figured this is plenty to start with. if done right, this script will save me weeks of painful and messy hand coding.

any advice would be greatly appreciated! the pattern matching functions confuse me enough and i'd like to start off on the right foot.

sneakyimp

alrighty then.

i have managed to extract the header, the title from the header, any css references and scripts.

can anyone recommend a good preg_match_all for javascript functions? ideally it would match exactly one javascript function. can you nest functions in javascript? if not, i think i can do it...if so, i am screwed. i'm no good at back references.

sneakyimp

given a string containing javascript, this function ALMOST works....it doesn't grab all of the function, though...report_html2 is a function which formats HTML for display in a web page....

function report_javascript_functions($javascript) {
  $arr_script_functions = Array();
  $pattern = "|\n\s*function\s*([^ \(]+)[^\{]+{.+?(?!function)}|is";
  $num_matches = preg_match_all($pattern, $javascript, $arr_script_functions);
  if ($num_matches == 0) {
    jta_error("no functions in script");
  } else {
    echo "$num_matches functions found in script<br>";
    foreach($arr_script_functions[0] as $key=>$value) {
      echo "====== FUNCTION " . ($key+1) . " =====<br>";
      report_html2($value);
    } // for each function found
  } // if functions found
} // report_javascript_functions()
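a small test that seems to show where the match gets cut off. The lazy .+? stops at the first closing bracket it can, and the (?!function) lookahead is checked at the position right before that }, where the next character is } and never the word "function", so it always passes. The sample script below is invented for the test:

```php
<?php
// Hypothetical sample script: one function containing a nested if block.
$javascript = "\nfunction show(id) {\n  if (id) {\n    alert(id);\n  }\n  return true;\n}\n";

// Same pattern as in report_javascript_functions() above.
$pattern = "|\n\s*function\s*([^ \(]+)[^\{]+{.+?(?!function)}|is";
preg_match_all($pattern, $javascript, $m);

// The match ends at the bracket that closes the if block,
// so "return true;" and the function's own closing bracket are missing.
echo $m[0][0];
```

any function with a nested block (if, for, while, an object literal) will be truncated the same way.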

sneakyimp

ok...i'm starting to think that pattern matching javascript functions in the general case is literally impossible.

this can find nested brackets:

$pattern = "#{((?>[^{}]+)|(?R))*}#is";
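a quick check of that recursive pattern on an invented two-function script. (?R) re-enters the whole pattern at each nested opening bracket, so each match runs from a top-level { to its balancing }:

```php
<?php
$pattern = "#{((?>[^{}]+)|(?R))*}#is";

// Hypothetical sample: the first function has a nested block.
$javascript = "function a() { if (x) { y(); } }\nfunction b() { z(); }";

preg_match_all($pattern, $javascript, $m);
print_r($m[0]);
// [0] => { if (x) { y(); } }
// [1] => { z(); }
```

it only returns the bracketed bodies, not the "function name(args)" part in front of them.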

this can almost do it....problem is that it doesn't grab the entire function:

$pattern = "#(function)(\s*)([^ \(]+)([^{]+){.+}#Uis";

if you try to make that last + greedy, then the first match grabs every single function in your script (and all the stuff in between).

this is currently working for me, but requires that no functions are indented...the 'function' keyword and closing bracket must both be right after a line return.
$pattern = "#(function)(\s*)([^ \(]+)([^{]+)({.*\n}.*\n)#Uis";

I suspect there will be problems with that one too....
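one possible way around the indentation requirement is to combine the function header with the recursive bracket idea, recursing into the body group only (via (?2)) instead of the whole pattern. This is a sketch under assumptions, not a real javascript parser: the sample script is invented, it only handles named functions, and a { or } inside a string, comment or regex literal will throw off the bracket count.

```php
<?php
// Group 1: the function name.
// Group 2: the balanced {...} body; (?2) recurses into this same group
// for every nested bracket pair.
$pattern = '~\bfunction\s+([A-Za-z_$][\w$]*)\s*\([^)]*\)\s*(\{(?:[^{}]++|(?2))*\})~';

// Hypothetical sample: one multi-line function, one indented one-liner.
$javascript = <<<'JS'
function show(id) {
    if (id) {
        alert(id);
    }
    return true;
}
    function hide(id) { document.getElementById(id).style.display = 'none'; }
JS;

preg_match_all($pattern, $javascript, $m, PREG_SET_ORDER);

$names = array();
foreach ($m as $match) {
    $names[] = $match[1];   // $match[0] holds the whole function text
}
print_r($names);            // show, hide
```

javascript does allow functions nested inside other functions. With this pattern a nested function is swallowed as part of its outer function's body, so preg_match_all only reports the top-level ones.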