Hello all,

I am trying to pull in some content from an external website and managed this using:

<?php
// Fetch the remote page's HTML into a string and output it as-is.
$url = file_get_contents($this->wrapper->url);
echo $url;
?>

However, my first problem started when it transpired that the site I am pulling from has a DIV called 'footer' and my site has a DIV of the same name.

I managed to get around this using str_replace():

<?php
$url = file_get_contents($this->wrapper->url);
// Rename the remote site's 'footer' so it doesn't clash with our own DIV.
$output = str_replace('footer', 'credits', $url);
echo $output;
?>

So far, so good. However, the next hurdle is that the data I am pulling in contains links to more information. If I click on a link then I am taken away from my site and the results are shown directly on the originating site. I have used str_replace() again, as I needed to prefix the links with the correct HTTP path:

<?php
$url = file_get_contents($this->wrapper->url);

// Map each search string to its replacement: fix the DIV name clash and
// turn the relative player link into an absolute one.
$searchReplaceArray = array(
    'footer' => 'credits',
    'PlayerHistory.php?CWID=15025' => 'http://masterscoreboard.co.uk/results/PlayerHistory.php?CWID=15025'
);
$output = str_replace(
    array_keys($searchReplaceArray),
    array_values($searchReplaceArray),
    $url
);

echo $output;
?>

So my question is this: is it possible to use something similar to a nested file_get_contents() on the links so that I am not redirected away from my site?

If it is of any use, my site is being created in Joomla 1.6 and I have amended com_wrapper so that it doesn't use iFrames.

Sorry if my terminology is not good and hopefully you will see what I am trying to achieve.

Any 'eureka' moments will be greatly appreciated.

Thanks in advance...

    Yes, but this will be a tough one to explain properly, so be patient; it will make sense. First you need to search the site for URLs with preg_match_all(), grabbing all the URLs and storing them in an array. After that you will need to loop through the array and fetch the content as each link is called.
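
    Something along these lines, as an untested sketch (the URL and pattern here are only illustrative):

    <?php
    // Untested sketch: grab every href on the fetched page with preg_match_all().
    $html = file_get_contents('http://masterscoreboard.co.uk/results/EGA_HandicapList.php?CWID=15025');

    preg_match_all('/href="([^"]+)"/i', $html, $matches);
    $links = $matches[1]; // every href value found on the page

    foreach ($links as $link) {
        // Each relative link could then be fetched in turn, e.g.
        // $page = file_get_contents('http://masterscoreboard.co.uk/results/' . $link);
    }
    ?>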

    Also look into cURL. I have not dived into it yet, but from what I'm hearing it is a better option than file_get_contents().
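
    For what it's worth, the cURL equivalent of a file_get_contents() call would look something like this (again untested):

    <?php
    // Rough cURL equivalent of file_get_contents() for fetching a page.
    $ch = curl_init('http://masterscoreboard.co.uk/results/EGA_HandicapList.php?CWID=15025');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
    $html = curl_exec($ch);
    curl_close($ch);
    ?>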

      You would also need to change the href attribute on any links so that they point to a correct location on your own server.
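
      For example (the local script name here is purely hypothetical):

      <?php
      // Illustrative only: 'player.php' is a hypothetical script on your own
      // server that would fetch and display the real PlayerHistory page.
      $output = str_replace(
          'href="PlayerHistory.php?CWID=15025"',
          'href="player.php?CWID=15025"',
          $output
      );
      ?>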

        Never use regular expressions to try and manipulate HTML - that should be your last resort.

        Instead, a more robust solution would be to use something like [man]DOM[/man] to find all <a> tags and modify the 'href' attribute as desired.
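
        A rough sketch of that approach (sending the links to a 'proxy.php' on your own server is just one example of what you might do with them):

        <?php
        // Sketch: load the fetched HTML into DOM and rewrite every <a> href.
        $html = file_get_contents($this->wrapper->url);

        $dom = new DOMDocument();
        libxml_use_internal_errors(true); // the remote HTML may not be perfectly valid
        $dom->loadHTML($html);
        libxml_clear_errors();

        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            // Send each link through a (hypothetical) script on your own server.
            $a->setAttribute('href', 'proxy.php?page=' . urlencode($href));
        }

        echo $dom->saveHTML();
        ?>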

          jgetner;10978305 wrote:

          Yes, but this will be a tough one to explain properly, so be patient; it will make sense. First you need to search the site for URLs with preg_match_all(), grabbing all the URLs and storing them in an array. After that you will need to loop through the array and fetch the content as each link is called.

          Also look into cURL. I have not dived into it yet, but from what I'm hearing it is a better option than file_get_contents().

          OK this sounds painful! In real terms I can use an iFrame but the result is so ugly and unfriendly - I really would like to be able to format the output and not have scroll bars.

          I'm no coder but perhaps I will have a look at preg_match() and see what the requirements are.

          Thanks for your reply...

            johanafm;10978307 wrote:

            You would also need to change the href attribute on any links so that they point to a correct location on your own server.

            Hi johanafm,

            This also sounds complicated! Changing the href links to point to my server? That would suggest that I need to have the data pulled down to my server in the first place.

            I was hoping that, after executing the file_get_contents (which populates the table with the player names and handicaps), I could manipulate the data I have pulled down into $url by using multiple or nested file_get_contents or fopen calls to create another string, such as $member, which then goes off and pulls in the player information.
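
            Something like this is what I had in mind (I'm just guessing at the syntax):

            <?php
            // What I had in mind (syntax guessed): after pulling in the handicap
            // list, pull a single player's history into a second string.
            $url    = file_get_contents('http://masterscoreboard.co.uk/results/EGA_HandicapList.php?CWID=15025');
            $member = file_get_contents('http://masterscoreboard.co.uk/results/PlayerHistory.php?CWID=15025');
            ?>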

            As usual I think I have over-simplified what I believe should be possible.

            I guess that, without learning PHP very quickly, I may have to fall back to iFrames...

            Thanks for replying, it's much appreciated.

              bradgrafelman;10978308 wrote:

              Never use regular expressions to try and manipulate HTML - that should be your last resort.

              Instead, a more robust solution would be to use something like [man]DOM[/man] to find all <a> tags and modify the 'href' attribute as desired.

              First off bradgrafelman, many thanks for taking the time to reply.

              As I think I mentioned before, I am not a coder by any means, and whilst the first couple of replies were possibly slightly above my competence level, your reply has scared me senseless!! I had a very quick look at DOM and the options and permutations seemed almost limitless. Very powerful BUT way above my head...

              Perhaps iFrames are, for the moment, the only real way forward.

              Thanks again.

                itfidds wrote:

                Perhaps iFrames are, for the moment, the only real way forward.

                Actually, I'd call iFrames and everything mentioned above a workaround, not a real solution.

                The real solution, IMO, would be to see if the external website has some sort of mechanism for providing the raw content you need in a usable format (XML, etc.). Screen scraping the HTML output as if you were a web browser and then attempting to separate out the bits and pieces of HTML/data you actually want is terribly inefficient, cumbersome, error-prone, etc.
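
                For example, if they offered something like an XML feed (this URL and its structure are entirely made up), consuming it would be trivial by comparison:

                <?php
                // Entirely hypothetical: IF the site offered an XML feed of handicaps,
                // parsing it would be far simpler and more reliable than scraping HTML.
                $xml = simplexml_load_file('http://masterscoreboard.co.uk/results/handicaps.xml?CWID=15025');

                foreach ($xml->player as $player) {
                    echo $player->name, ' - ', $player->handicap, "\n";
                }
                ?>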

                  If you're willing to try the coding there are a couple of good tutorials out there. These might be easy enough to adapt without too much understanding of the complicated bits.

                  I recommend you give this a shot first; it's a tutorial on what bradgrafelman mentioned. Firefox has a great plugin that will give you the XPath query by selecting the element in the browser, which makes this easier than it first appears.
                  http://www.earthinfo.org/xpaths-with-php-by-example/
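
                  To give a flavour of what that tutorial covers (the query below is just an example):

                  <?php
                  // In the spirit of the tutorial above: query the fetched page with XPath.
                  $dom = new DOMDocument();
                  libxml_use_internal_errors(true);
                  $dom->loadHTML(file_get_contents('http://masterscoreboard.co.uk/results/EGA_HandicapList.php?CWID=15025'));
                  libxml_clear_errors();

                  $xpath = new DOMXPath($dom);
                  // e.g. print the text of every table cell on the page:
                  foreach ($xpath->query('//table//td') as $cell) {
                      echo $cell->textContent, "\n";
                  }
                  ?>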

                  I don't disagree that you should avoid using regex for HTML parsing. Since you said you'd otherwise be going back to iframes, we can consider it a last resort.
                  http://www.bradino.com/php/screen-scraping/

                    itfidds;10978319 wrote:

                    I was hoping that, after executing the file_get_contents (which populates the table with the player names and handicaps), I could manipulate the data I have pulled down into $url by using multiple or nested file_get_contents or fopen calls to create another string, such as $member, which then goes off and pulls in the player information.

                    Well, you could do that, assuming that you change the links to point to scripts on your server that use file_get_contents to display whatever is on the original site.
                    But if all you are providing your users with is a way to browse another site through your web server, then you've created nothing but a roundabout proxy, and might in my opinion just as well set up a proxy.
                    Also, for what you're doing, I'd suggest checking with the site providing the content whether they are OK with it. You are giving users access to their content by web scraping, without them actually getting the users onto their site, so they miss out on the occasional ad clicks that generate revenue. And if they are OK with you using their content then, as brad pointed out, they most likely have some other means of getting you the data without the markup, so that you can manipulate it more easily and create whatever output you desire.
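
                    For illustration, such a "roundabout proxy" script might look something like this (the file name, parameter and whitelist pattern are all just examples):

                    <?php
                    // proxy.php - hypothetical sketch of the "scripts on your server" idea
                    // above. A link rewritten to proxy.php?page=PlayerHistory.php%3FCWID%3D15025
                    // lands here, and this script fetches and echoes the remote page.
                    $base = 'http://masterscoreboard.co.uk/results/';

                    // Only allow the expected page names, so this doesn't become an open proxy.
                    $page = isset($_GET['page']) ? $_GET['page'] : '';
                    if (!preg_match('/^[A-Za-z_]+\.php\?CWID=\d+$/', $page)) {
                        die('Invalid page requested.');
                    }

                    echo file_get_contents($base . $page);
                    ?>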

                      Res and johanafm, again many thanks for the responses. I get the feeling that I am almost being pushed into using a hammer to crack a nut.

                      If you follow this link you will see content that a particular golf club have granted permission to be viewed by the public on the masterscoreboard.co.uk site:
                      http://masterscoreboard.co.uk/results/EGA_HandicapList.php?CWID=15025

                      The golf club that I am working on use the same system but do not allow their information to be viewed by the public via the masterscoreboard.co.uk site:
                      http://masterscoreboard.co.uk/results/EGA_HandicapList.php?CWID=15023

                      Instead this club have chosen to present the data via their own site. This is perfectly OK under the rules of the masterscoreboard.co.uk site and use of their software.

                      So to reiterate my dilemma: I can present the data using an iFrame and can access the links in the content, which is then still rendered on our site. The major stumbling block, other than not being able to control the formatting, is the age-old 100% height issue of the iFrame - we seriously do not want the extra scroll bar.

                      A similar issue occurs using the object element to embed the content.

                      So that's how I ended up looking at file_get_contents etc. in an effort to present the data using PHP. But my problem with that is that clicking on a link redirects me away from our website!

                      I have spent hours upon hours on this, trying many methods and reading article after article (some from 7 years ago - I would have thought this would have been sorted by now!) and still have limited success.

                      I guess that 100% height iFrames or objects are a holy grail akin to tableless design from years back??!?!?

                      And there we are, still open to suggestions...
