Greetings all, and thanks in advance to anyone taking the time to read this... double thanks if you post a reply.

That being said, I'm having to screen-scrape a website that is linked to my uncle's business's inventory database. His normal website checks in w/ the server at his business every night and updates the product prices. The people that set up his business software also set up his website, so I don't want to mess w/ anything there. Problem is, his website sucks.

So I've made an alternate site he hopes to use that looks great, and I'm using PHP w/ cURL to scrape his website. My site basically works like this (rough sketch below):
1) A visitor clicks a link to a catalog section.
2) If they're the first visitor that day to hit that page, the screen scrape is performed to get that page's info from the other site, which is then cached and served to all other visitors that day. This happens every day, and it's how I keep my site updated and in sync w/ his other website and business.
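
Roughly, the per-page logic looks something like this. It's just a sketch, not my actual code; the file layout, the remote URL, and the parsing step are all made up for illustration:

    <?php
    // catalog.php?cat=widgets -- serve today's cached copy, or scrape and build it
    $category  = basename($_GET['cat']);                  // e.g. "widgets"
    $cacheFile = __DIR__ . "/cache/{$category}.html";

    // Already scraped today? Serve the cached copy (the fast path).
    if (file_exists($cacheFile) && date('Y-m-d', filemtime($cacheFile)) === date('Y-m-d')) {
        readfile($cacheFile);
        exit;
    }

    // First hit of the day: scrape the other site w/ cURL (the slow 20-30 second path).
    $ch = curl_init('http://the-other-site.example/catalog.php?cat=' . urlencode($category));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // ...parse/clean $html here, then cache it for the rest of the day...
    file_put_contents($cacheFile, $html);
    echo $html;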

The problem is, it takes forever if you're the first person to click that link that day. 20-30 seconds sometimes. After it's cached, it's fast.

So, I'm wondering how I would go about automating the first click on every catalog link at 7am. That way, the site is current, updated, and cached before the first visitor even comes that day, drastically cutting my load time. The main reason I need to do this is b/c I think Google is penalizing the site b/c it takes so long to load.

I've thought a cron job might do it, but I don't really know anything about cron, so I have no idea how to set one up.

Any help would be greatly appreciated!
Thanks in advance!

    stymie wrote:

    So, I'm wondering how I would go about automating the first click on every catalog link at 7am.

    cron

      Thanks Brad,
      I thought cron would be the way to go, but as I said earlier, I'm very unfamiliar with it and its syntax.
      So would I need a PHP file that poses as a browser and clicks all the links, then target that w/ cron? Any further instruction would be a great help!
      Thanks again!

        Well, you might need to rethink the way you fetch the remote content. Can you not just loop through all of the "categories" and fetch them that way?

        Also... are you sure all of this is even necessary? There's no way you can get access to the DB itself and just skip this nasty screen-scraping business?

          I've thought long and hard about how to go about it, and I came to the conclusion that this is the best solution. I'm not allowed access to the main site's DB; that would solve it. But alas, it seems like I have to go east to get west in this situation.

          Everything I've got so far is working great and I would like to stick w/ this approach. I just need a way to fire those links off every morning to cache the content.

            stymie wrote:

            I just need a way to fire those links off every morning to cache the content.

            Why do you need the links themselves? Presumably, they just point to a PHP script on your site that checks to see if the requested category has already been downloaded and, if not, grabs it from the remote site.

            Can't you skip the entire "link" part and simply loop through all categories and perform the download for each?
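
            Something like this is all I'm getting at; the category list, file name, and fetch_and_cache() function here are placeholders for whatever your link-handler script already does:

                <?php
                // warm_cache.php -- loop every catalog section and run the same download
                // code the links currently trip (sketch only; names are placeholders)
                require __DIR__ . '/scraper.php';   // wherever your heavy-lifting code lives

                $categories = array('widgets', 'gadgets', 'doodads');   // your real sections

                foreach ($categories as $cat) {
                    fetch_and_cache($cat);   // hypothetical name for your existing function
                }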

              I guess I don't need the links themselves. I'm barely above a newbie when it comes to PHP, but I've gotten this far w/ some pretty complicated stuff and feel good about how it's working. And given that I couldn't come up w/ any other solutions, it'll have to do for now.

              The only thing I need the links for is to trip the code that runs curl to get everything for that category from the main site. Maybe a webbot that visits the site every day at 7 and trips them for me would do the trick. I'm just trying to build on what I have, not rebuild it. If I can get it working every day, I'll be done. Good luck to the poor soul who may have to work w/ it one day. I do have it written in OOP format, though, just a screwy way of going about it.

              Your suggestion may be the ticket. The links do point to a file that does all the heavy lifting, so if I can just feed that script the categories, it should be able to do it. But I'm not confident that I fully understand what you're saying. I do think a webbot could do what I need done too, though, posing as a person and clicking those links.

              Thoughts, suggestions?

                stymie wrote:

                The only thing I need the links for is to trip the code that runs curl to get everything for that category from the main site.

                So why not move the code that runs when a link is clicked into the cron script itself?

                stymie wrote:

                I do think a webbot could do what I need done too, though, posing as a person and clicking those links.

                Again, why not just move the code that does all the downloading into the cron script itself?

                If you're at the stove in the kitchen and need something from the fridge, do you go outside, walk around your house once, ring the doorbell, ask someone to hand you something from the fridge, walk back around the house, re-enter the kitchen, and then return to the stove?

                Assuming no, why use a "webbot" to simulate requests for links on your site when you can just have the cron script directly execute the code that does the downloading?
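
                To actually schedule it, a crontab entry along these lines would run such a script every morning at 7 (edit your crontab with "crontab -e"; the /usr/bin/php path is a guess, run "which php" to find yours, and the script path is a placeholder):

                    # min  hour  day  month  weekday  command
                      0    7     *    *      *        /usr/bin/php /path/to/warm_cache.php

                The five fields are minute, hour, day of month, month, and day of week, so "0 7 * * *" means 7:00am every day. No browser and no webbot needed; cron just runs the PHP directly from the command line.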

                  Quite simply, b/c I don't know how to do cron. That's why I'm posting to this forum. I'm willing to do the research and learn, but I'm looking for answers from someone who is more familiar w/ cron than me. I have no clue how to write the simplest of cron jobs, though I can learn. I really don't know how to execute PHP in a cron job. As far as going outside to get milk from the fridge beside me: no, I would not do that. But my solution is so backwards anyway that I figured it made sense here too. It made sense to Columbus also, and he discovered a whole new world in the process, so maybe I will too.
                  I'm really looking for answers here, not more questions. Thanks for your help, but I feel like I'm rehashing stuff I've already dwelt on for a while. It is what it is, and I have to deal with it. 95% is done; I just need to automate and schedule the scraping. If you can help me with that, it will solve this issue.
                  Thanks again!
