All right, I need to take the source of an entire HTML page as input and use regular expressions to extract information from it into arrays. The issue is that a large portion of the page's HTML repeats: the same chunk of markup appears many times, each time with different information to extract. The script needs to loop through the input and, for each instance of this chunk, build an array of the extracted information. I figure I'll use a set of regular expressions inside a while loop to get the information, but I need help breaking down the HTML so that the PHP script can loop through it. Does this make sense?

I would also need to delete everything from the beginning of the HTML up to a certain string.

    I think you are after preg_match_all(), which returns every match of a pattern in an input string. Use file_get_contents() to fetch the page at the URL into a variable.
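A minimal sketch of that suggestion, under a few assumptions: the marker string, the second table row, and the simplified row markup are made up for illustration, and an inline string stands in for the result of file_get_contents() so the example runs on its own. It first cuts off everything before the marker with strpos() and substr(), then lets preg_match_all() with PREG_SET_ORDER return one array per repeated chunk, so no hand-written while loop is needed.

```php
<?php
// Stand-in for: $html = file_get_contents($url);  ($url would be the real page)
// The markup below is simplified sample data, not the real page.
$html = "<html><head><title>Rankings</title></head><body>"
      . "<!-- START OF TABLE -->"
      . "<tr><td><a href=\"nation_drill_display.asp?Nation_ID=116286\">Ian Isles</a></td></tr>"
      . "<tr><td><a href=\"nation_drill_display.asp?Nation_ID=42\">Example Land</a></td></tr>"
      . "</body></html>";

// 1. Delete everything from the start of the page up to a known string.
//    The marker is hypothetical; use whatever string precedes the table.
$marker = "<!-- START OF TABLE -->";
$pos = strpos($html, $marker);
if ($pos !== false) {
    $html = substr($html, $pos + strlen($marker));
}

// 2. One pattern describes the repeating chunk. PREG_SET_ORDER groups the
//    results per match, so each element of $matches is one row's captures.
$pattern = '/Nation_ID=([0-9]+)">(.*?)<\/a>/';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// 3. Build one array per repeated chunk.
$nations = array();
foreach ($matches as $m) {
    $nations[] = array('id' => $m[1], 'name' => $m[2]);
}

print_r($nations);
```

Without PREG_SET_ORDER the results are grouped per capture group instead ($matches[1] holds all IDs, $matches[2] all names), which is usually less convenient when each row should become its own array.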

      All right, I tried making a regular expression to encompass the entire chunk of code that repeats, but now preg_match_all is returning no matches.

      $pattern = "/<tr bgcolor='#[0-9A-Fa-f]{6}'><td>
         <div style=\"width: 58;  overflow-x: hidden\">  
      [0-9]+\) <a title='Display Nation' href=\"nation_drill_display\.asp\?Nation_ID=([0-9]+)\">(.*?)<\/a>
      <\/div>.*?<center>([0-9]*,?[0-9]+\.[0-9]{3})<\/td><\/center><td><center>([0-9]*,?[0-9]+\.[0-9]{2})<\/td><\/center><td><center>([0-9]*,?[0-9]+\.[0-9]{2})<\/td><\/center><td><center>[1-5]<\/td><\/center><td><center> <img src='assets\/nuke\.gif' title=\"(This nation owns [0-9]{1,2} nuclear weapons\.|Nation supports nuclear weapons but does not own any\.|Nation does not support nuclear weapons\.)\"> <\/td><\/center><td><img src='images\/war\.gif' border=0 title='(War is an option|Peaceful Nation)'> <\/td><\/tr>/";

      $string = "<tr bgcolor='#E6E6E6'><td> <div style=\"width: 58; overflow-x: hidden\">
      1) <a title='Display Nation' href=\"nation_drill_display.asp?Nation_ID=116286\">Ian Isles</a>
      </div> </td><td>5/6/2008 12:42:53 AM</td><td><center>16<br>Days</td></center><td><center> <img border=\"0\" src=\"images/teams/team_Aqua.gif\" width=\"14\" height=\"13\" title=\"Team: Aqua\"> </td></center><td><center>42,574.764</td></center><td><center>5,689.99</td></center><td><center>2,675.34 </td></center><td><center>5</td></center><td><center> <img src='assets/nuke.gif' title=\"This nation owns 20 nuclear weapons.\"> </td></center><td><img src='images/war.gif' border=0 title='War is an option'> </td></tr>";

      The above input should match the regular expression. What's the issue here? Do I need to strip out the white space and line breaks beforehand, or is it something more?
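One likely cause, judging only from the pattern and input as posted: without the x modifier, every space and line break inside a pattern is literal and must match the input character for character. The pattern has a line break after the first <td> where the input has a single space, and the input has a space in "2,675.34 </td>" that the pattern does not allow. Separately, "." does not match line breaks unless the s modifier is set, which matters if the real page breaks lines between </div> and the first number. Below is a sketch of a whitespace-tolerant version rather than a definitive fix: it keeps the original sub-patterns, joins them with \s* wherever the spacing may vary, switches the delimiter to ~ so slashes need no escaping, and adds the s modifier. The $parts, $rows, and $count names are introduced here for illustration.

```php
<?php
// The sample input from the post, rebuilt with explicit "\n" line breaks.
$string = "<tr bgcolor='#E6E6E6'><td> <div style=\"width: 58; overflow-x: hidden\">\n"
    . "      1) <a title='Display Nation' href=\"nation_drill_display.asp?Nation_ID=116286\">Ian Isles</a>\n"
    . "      </div> </td><td>5/6/2008 12:42:53 AM</td><td><center>16<br>Days</td></center>"
    . "<td><center> <img border=\"0\" src=\"images/teams/team_Aqua.gif\" width=\"14\" height=\"13\" title=\"Team: Aqua\"> </td></center>"
    . "<td><center>42,574.764</td></center><td><center>5,689.99</td></center><td><center>2,675.34 </td></center>"
    . "<td><center>5</td></center><td><center> <img src='assets/nuke.gif' title=\"This nation owns 20 nuclear weapons.\"> </td></center>"
    . "<td><img src='images/war.gif' border=0 title='War is an option'> </td></tr>";

// The original sub-patterns, split at every point where whitespace may vary.
$parts = array(
    "<tr bgcolor='#[0-9A-Fa-f]{6}'><td>",
    "<div style=\"width: 58;\\s*overflow-x: hidden\">",
    "[0-9]+\\) <a title='Display Nation' href=\"nation_drill_display\\.asp\\?Nation_ID=([0-9]+)\">(.*?)</a>",
    "</div>.*?<center>([0-9]*,?[0-9]+\\.[0-9]{3})",
    "</td></center><td><center>([0-9]*,?[0-9]+\\.[0-9]{2})",
    "</td></center><td><center>([0-9]*,?[0-9]+\\.[0-9]{2})",
    "</td></center><td><center>[1-5]",
    "</td></center><td><center>",
    "<img src='assets/nuke\\.gif' title=\"(This nation owns [0-9]{1,2} nuclear weapons\\.|Nation supports nuclear weapons but does not own any\\.|Nation does not support nuclear weapons\\.)\">",
    "</td></center><td><img src='images/war\\.gif' border=0 title='(War is an option|Peaceful Nation)'>",
    "</td></tr>",
);

// \s* between parts accepts any run of spaces, tabs, or line breaks (or none).
// The s modifier lets ".*?" run across line breaks.
$pattern = '~' . implode('\s*', $parts) . '~s';

$count = preg_match_all($pattern, $string, $rows, PREG_SET_ORDER);

// $rows[0][1] is the nation ID, [2] the name, [3] to [5] the three numbers,
// [6] the nuclear status text, [7] the war status text.
print_r($rows);
```

Stripping white space and line breaks from the input beforehand would also work, but the pattern would then need its own literal line breaks and spaces removed too, so tolerating whitespace in the pattern tends to be less fragile.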
