I'm currently working on some code that parses the HTML content of a page and writes it into a downloadable .csv file. I've used this guide as a reference and the tool found here. Here is my code that parses the HTML:

<?php
include "simple_html_dom.php";
$html = file_get_html('http://siteurl.com');
header('Content-type: application/ms-excel');
header('Content-Disposition: attachment; filename=sample.csv');
$fp = fopen("php://output", "w");
foreach ($html->find('tr') as $element) {
    $td = array();
    foreach ($element->find('th') as $row) {
        $td[] = $row->plaintext;
    }
    fputcsv($fp, $td);
    $td = array();
    foreach ($element->find('td') as $row) {
        $td[] = $row->plaintext;
    }
    fputcsv($fp, $td);
}
fclose($fp);
?>

Everything is working perfectly except for one thing: it adds two blank rows below the row with table header cells and an additional blank row after each row with normal table cells. Here is a screenshot of the resulting Excel file:

[ATTACH]4921[/ATTACH]

How can I prevent it from doing this?

Oh, and one additional question regarding formatting: how can I make the date inside the csv file not be in military time? I want it to be this way as soon as the file is downloaded. In other words, the source HTML file has the date and time like this:

July 31, 2013 5:00 pm

but its output is this:

07/31/2013 17:00

I want it to show up with AM or PM (I'm okay with the '/'s).

Thanks for your help!

excel.jpg

    Scraping an HTML page is not a very robust way of getting information into a program; there are no tables at http://siteurl.com, so it's anyone's guess how well they're written, whether they contain blank rows, or how the tool you're using handles them.

      @ - thank you for your reply. I was using http://siteurl.com generically because I'm currently working locally. Here is the HTML code of the page I'm scraping. A very simple HTML table:

      <table>
      	<tr>
      		<th>title</th>
      		<th>contact</th>
      		<th>from_date</th>
      		<th>to_date</th>
      		<th>category</th>
      		<th>where</th>
      		<th>body</th>
      	</tr>
      	<tr>
      		<td>MS/US Parent Breakfast</td>
      		<td> ()</td>
      		<td>July 31, 2013 8:15 am</td>
      		<td>July 31, 2013 9:00 am</td>
      		<td>All School, Test, 		</td>
      		<td>Building A</td>
      		<td><p>But really, cool.</p></td>
      	</tr>
      	<tr>
      		<td>Club Applications Due</td>
      		<td>Brandon (email@gmail.com)</td>
      		<td>July 31, 2013 10:00 am</td>
      		<td>August 9, 2013 7:00 pm</td>
      		<td>All School		</td>
      		<td>Mr. Bert's Classroom</td>
      		<td></td></tr>
      </table>
      

      I'm attaching the PHP file I referenced in my original post here. It's converted to a .txt file so it could be uploaded here.
      Thanks again!

      simple_html_dom.txt
        brandonpence wrote:

        @ - thank you for your reply. I was using http://siteurl.com generically because I'm currently working locally

        Okay. FYI there is a standard domain name for use in examples (see http://www.example.com).

        So what your code does is that it goes through every row in the table. For every such row it generates a line with all the <th> elements in the row and then a second line with all the <td> elements in the row. So two lines are output for every row; if the row doesn't have any <th> elements in it, then the first line will be empty, and if the row doesn't have any <td> elements in it, the second line will be empty.

        So maybe check to make sure that the row is not empty before putting it in the file. See one of the stackoverflow answers immediately before the one you linked to.
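The "check that the row is not empty" idea can be sketched roughly as follows. This is only an illustration under some assumptions: put_row_if_not_empty() is a hypothetical helper name, the $rows array stands in for what the two inner loops of the original script would collect from simple_html_dom, and an in-memory stream replaces php://output so the result can be inspected.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): write a CSV line
// only when the row actually has cells. fputcsv() with an empty array
// still emits a line break, which is where the blank rows come from.
function put_row_if_not_empty($fp, array $cells): bool
{
    if (empty($cells)) {
        return false; // nothing to write, so no blank line
    }
    // Separator, enclosure and escape are passed explicitly (the usual defaults).
    return fputcsv($fp, $cells, ',', '"', '\\') !== false;
}

// Stand-in data: for each <tr>, the texts of its <th> and <td> cells.
$rows = array(
    array('th' => array('title', 'contact'), 'td' => array()),
    array('th' => array(), 'td' => array('Club Applications Due', 'Brandon')),
);

$fp = fopen('php://memory', 'w+'); // in-memory stream instead of php://output
foreach ($rows as $row) {
    put_row_if_not_empty($fp, $row['th']);
    put_row_if_not_empty($fp, $row['td']);
}
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

echo $csv; // one header line and one data line, no blank lines
```

In the original script the same effect would come from wrapping each fputcsv($fp, $td) call in a test such as if (!empty($td)).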

          Weedpacket wrote:

          Okay. FYI there is a standard domain name for use in examples (see http://www.example.com).

          Good to know for example URLs - thanks for the heads up!

          So maybe check to make sure that the row is not empty before putting it in the file. See one of the stackoverflow answers immediately before the one you linked to.

          I can't believe I missed that in Stackoverflow! I usually review all answers - so thanks for guiding me in that direction. That answer fixed my problem!

          Any guidance on how to get the date to render in non-military format? I also noticed that special characters are written as their HTML codes; the apostrophe, for example, becomes its HTML code: '

          Thanks!

            brandonpence wrote:

            Any guidance on how to get the date to render in non-military format?

            The content of the CSV file would be the same as the content of the HTML table. Excel, on the other hand, might (and usually does) have its own ideas about how anything that resembles a date should be formatted (keep in mind that a CSV file is not an Excel spreadsheet). Look through Excel's cell formatting options for a date and time format that shows AM/PM.
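A small sketch may help show that the text written to the CSV is exactly the text taken from the table, so the 12-hour date is still there in the file and only Excel's display changes it. It also touches the special-character question by decoding HTML entities before writing. The helper name clean_cell() is hypothetical, and the &#39; entity is an assumed example, since the thread does not show the exact code that appeared in the output.

```php
<?php
// Hypothetical helper: tidy one scraped cell before it goes into the CSV.
function clean_cell(string $text): string
{
    // Turn entities such as &#39; or &amp; back into literal characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim($decoded);
}

// Stand-in cell texts; the entity in the second one is an assumption.
$cells = array('July 31, 2013 5:00 pm', 'Mr. Bert&#39;s Classroom');

$fp = fopen('php://memory', 'w+'); // in-memory stream so the text can be read back
fputcsv($fp, array_map('clean_cell', $cells), ',', '"', '\\');
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

echo $csv; // the date text is unchanged; the entity is now a plain apostrophe
```

Opening the downloaded file in a plain text editor rather than Excel is a quick way to confirm what was actually written.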

              Weedpacket - thank you so much for your help on all of this!
