Need a good search engine for our site

dmacman1962 · May 7, 2013

Weedpacket, the actual events page is always the same, i.e. article.php?ID=1234. So that should give me the same results as your example. The thing that changes is the archive.php?page=2 page. The archive page has 15 items on it. So, every 15 days, the article you are searching for, goes to the next page. Does this give you more information for a possible suggestion?

Weedpacket · May 7, 2013

dmacman1962 wrote:
Weedpacket, the actual events page is always the same, i.e. article.php?ID=1234. So that should give me the same results as your example

But that's only for individual events. I'm talking about the archive listings that keep changing.

dmacman1962 wrote:
The thing that changes is the archive.php?page=2 page. The archive page has 15 items on it. So, every 15 days, the article you are searching for, goes to the next page. Does this give you more information for a possible suggestion?

Yes; don't keep moving articles from one page to another. It's why I suggested having an archive with URLs based on the date of articles (which doesn't change) rather than an arbitrary n most recent articles (which changes every time an article is added).
The first archive page (with yesterday's items) would be accessible via two URLs, say www.example.com/events/archive/latest and www.example.com/events/archive/2013-May-07 so that the archive itself is accessible via a constant URL (the former) so that all your links to the archive don't have to keep changing, and all of the items from 7 May 2013 are accessible via a constant URL (the latter) so that anyone in 2041 wanting to link to a particular archive page will have a constant URL they can use (including yourself: 2013-May-07 would link to 2013-May-06). Tomorrow, events/archive/2013-May-07 will still be around and will still have the same content, and events/archive/latest will be listing the events of 8 May and have a link to the page for 7 May (and the page for 7 May will have links to 6 May and 8 May).

The use of the word 'latest' being for the benefit of anyone whose search has turned up both URLs and is wondering if one is better than the other - assuming you don't put a noindex instruction into the document header when that URL is used so the search engine doesn't even bother indexing it. The search engine will find the page under its permanent URL easily enough.

If there are too many items in a day for a single page, they can be paginated individually; archive/2013-August-15?p=2. Once the day is over there won't be any additional items being added, so nothing will be pushing items from one page to the next.

Think of a physical archive. Newspaper publishers don't have all their back issues filed in cabinets labelled "three months ago", "four months ago", "five months ago", and taking the time every month to shift everything from one cabinet to another (or even just moving cabinet labels around) so that what was in the "four months ago" cabinet ends up in the "five months ago" cabinet and putting the oldest records in the newest cabinet. They have cabinets labelled "June 1886", "July 1886", "August 1886", and write a new label for a new cabinet.

There's another suggestion implicit in the above paragraph. Paginate to a strict 15-items-per-page format, but number them so that the oldest page is page 1. Again, the most recent page would have two URLs (one to reflect the fact that it is a page in the archive, and the other to reflect that it is the most recent page). It would mean that if the number of items is not a multiple of 15, then it's the most recent page that will have fewer (but your current problem basically boils down to the fact that your most recent archive page is already full and so you keep moving existing items to other pages to make room for new ones).
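The oldest-first numbering described above can be sketched in a few lines. This is only an illustration in Python (the site in this thread runs PHP), and the names page_of and latest_page are made up for the example. It assumes items are counted in chronological order, starting at 1 for the oldest.

```python
import math

PER_PAGE = 15  # page size mentioned in the thread


def page_of(position: int, per_page: int = PER_PAGE) -> int:
    """Page number for the item at 1-based chronological position (1 = oldest)."""
    return (position - 1) // per_page + 1


def latest_page(total_items: int, per_page: int = PER_PAGE) -> int:
    """Number of the newest page, the only one that may be partly filled."""
    return max(1, math.ceil(total_items / per_page))


# The 31st article ever published sits on page 3, however many follow it.
print(page_of(31))       # 3
print(latest_page(31))   # 3 (page 3 currently holds a single item)
print(latest_page(100))  # 7 (article 31 is still on page 3)
```

Because an item's page depends only on its own position, adding new articles never moves an existing item to a different page.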

dmacman1962 · May 8, 2013

Wow, ok, that's a lot of information. I will have to address this one item at a time (since my brain can't process all that at once, lol).

Weedpacket wrote:
The first archive page (with yesterday's items) would be accessible via two URLs, say www.example.com/events/archive/latest and www.example.com/events/archive/2013-May-07 so that the archive itself is accessible via a constant URL (the former) so that all your links to the archive don't have to keep changing, and all of the items from 7 May 2013 are accessible via a constant URL (the latter) so that anyone in 2041 wanting to link to a particular archive page will have a constant URL they can use (including yourself: 2013-May-07 would link to 2013-May-06).

I am trying to figure out how to code the constant URL archive page. I don't understand how to use the archive/2013-May-07 URL. I am guessing that I would use archive/article.php?date=2013-May-07. And then on that page I use the GET var, and convert it so I can look up the article by date instead of the ID like I am doing now.

For the archive/latest I could use archive/latest/archive.php and if there is no GET var, then I use yesterday's date by default.

Is that the correct approach?

bradgrafelman · May 8, 2013

dmacman1962;11027877 wrote:
Is that the correct approach?

It sounds like an approach that should work, so who's to say it's "incorrect"? That's like asking someone what the "correct" answer is to the question "What is the best color?"

Since both URLs perform similar functions, however, I personally would map both of them to a single PHP script. Otherwise, you could easily end up with separate PHP files that contain very similar code (causing dual maintenance issues).
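One way to picture the single-script idea: both URLs end up calling the same function, which only differs in how it picks the date. The sketch below is Python for illustration only (the site here is PHP). The function name resolve_archive_date and the YYYY-MM-DD input format are assumptions made for this example, not something settled in the thread.

```python
from datetime import date, timedelta
from typing import Optional


def resolve_archive_date(date_param: Optional[str], today: date) -> date:
    """Pick the day to list.

    A missing, empty, or 'latest' parameter means yesterday; anything else
    must be a YYYY-MM-DD string (an assumed format for this sketch).
    """
    if date_param is None or date_param in ("", "latest"):
        return today - timedelta(days=1)
    return date.fromisoformat(date_param)  # raises ValueError on bad input


today = date(2013, 5, 8)
print(resolve_archive_date(None, today))          # 2013-05-07
print(resolve_archive_date("latest", today))      # 2013-05-07
print(resolve_archive_date("2013-05-06", today))  # 2013-05-06
```

Everything after this point (the database lookup, the page template) is then shared, so there is only one copy of that code to maintain.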

dmacman1962 · May 8, 2013

That is the approach I was trying to explain. I will test for the GET['date'], and if there was no GET var, then I would use the next day's date (both using the same script like you said). I wasn't sure if there was a different technique for the page with no GET var. I was guessing that there may have been a different directory or file for each date (which would be a huge pain, and create a ton of files). I have seen SEO suggestions that you should use a distinct directory or file name for items you want to get a higher ranking for (ex: my_house_is_red_in_color).

I was just clarifying. Thanks for the information.
Don

dmacman1962 · May 8, 2013

This will address the individual article pages, but not the archive list. Currently, we have archive.php, which is the list. That lists the articles, adding a new one each day (M-F). With this method we would not be able to have a list of articles.

Each day there is only one article, so this would not apply:

If there are too many items in a day for a single page, they can be paginated individually; archive/2013-August-15?p=2. Once the day is over there won't be any additional items being added, so nothing will be pushing items from one page to the next.

But, how could I use this logic to create the archive listings, which will grow each day like it does now? I can do like you suggested and have the last article at the top of page 1, and have the older articles on the next page(s) ?p=2 etc.

Weedpacket · May 8, 2013

It's the responsibility of the web server to take a requested URL and determine which resource to serve in response. In this case Apache's URL-rewriting engine can be used to interpret the URL suitably.

RewriteRule: ^/events/archives/(latest|[0-9][0-9]?+-[a-z]+-[0-9][0-9][0-9][0-9])$  /events/archives.php?date=$1

Or something resembling that. Then a request for

http://www.example.com/events/archives/12-Octember-2011

would be handled as though it read

http://www.example.com/events/archives.php?date=12-Octember-2011

You could just use

http://www.example.com/events/archives.php?date=12-Octember-2011

and be done with it, but it's uglier and exposes bits of the inner workings of your system that might change and consequently break all those existing links to your pages.

dmacman1962 · May 9, 2013

To be sure I understand the rewrite, events/archive/latest?date=2013-05-08

latest|[0-9][0-9] == latest
?+-[a-z]+-[0-9][0-9][0-9][0-9] == ?2013-05-08
$ /events/archives.php?date=$1 == /events/archives.php?date=2013-05-08

Correct?

Weedpacket · May 9, 2013

I might have borked the regexp a bit (I was writing it off-hand - and I see a stray + in there now that I look at it and its handling of upper/lower case is off), but the idea I had when writing it was:

/events/archives/latest  => /events/archives.php?date=latest
/events/archives/4-June-1684 => /events/archives.php?date=4-June-1684

So, "either the word 'latest'; or, one or two digits, a hyphen, then a word, a hyphen, then four digits" - the word supposedly being the name of a month.
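The pattern as described in words can be tried out away from Apache. The sketch below uses Python's re module purely for illustration. It assumes the stray + is the one directly after the ?, and it uses a case-insensitive flag as one possible way of handling upper/lower case; neither choice is confirmed in the thread. The rewrite function is a made-up stand-in for what the server would do, not mod_rewrite itself.

```python
import re

# "latest", or: one or two digits, hyphen, a word, hyphen, four digits.
ARCHIVE_URL = re.compile(
    r"^/events/archives/(latest|[0-9][0-9]?-[a-z]+-[0-9][0-9][0-9][0-9])$",
    re.IGNORECASE,  # lets "June" match [a-z]+ despite the capital J
)


def rewrite(path: str) -> str:
    """Return the internal URL for a matching path, or the path unchanged."""
    match = ARCHIVE_URL.match(path)
    if match is None:
        return path  # no match: the rule does not apply
    return "/events/archives.php?date=" + match.group(1)


print(rewrite("/events/archives/latest"))       # /events/archives.php?date=latest
print(rewrite("/events/archives/4-June-1684"))  # /events/archives.php?date=4-June-1684
print(rewrite("/events/archives/2013-05-08"))   # /events/archives/2013-05-08
```

Note that the last path is left alone: a YYYY-MM-DD date does not fit the "digits, word, digits" shape, so a different pattern would be needed for that format.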

The date format was of course just an example - I figured referring to the months by name was more human-friendly than "1684-06-04" would have been ("04-06-1684" wouldn't be a good idea). Needless to say, there's no way to prevent people entering dates that are out of range (1684) or outright bogus (Octember), but like I say, needless to say.... If they enter something that doesn't match the pattern, the rule won't be applied and the server would proceed as before. If they enter something that does match the pattern - you'll need to check user input anyway.
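Checking the input in the script might look something like the following. It is a Python sketch for illustration (the real script would be PHP), and parse_archive_date is a hypothetical helper name. It only rejects strings that are not real calendar dates; a range check against the date of the site's first article is left out, since that date isn't given here.

```python
from datetime import date
from typing import Optional

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def parse_archive_date(text: str) -> Optional[date]:
    """Parse 'D-Month-YYYY'; return None if it is not a real calendar date."""
    parts = text.split("-")
    if len(parts) != 3:
        return None
    day, month_name, year = parts
    month = MONTHS.get(month_name.lower())
    if month is None or not (day.isdigit() and year.isdigit()):
        return None
    try:
        return date(int(year), month, int(day))
    except ValueError:  # e.g. 31-June-2013, or a year of 0
        return None


print(parse_archive_date("4-June-1684"))       # 1684-06-04
print(parse_archive_date("12-Octember-2011"))  # None
print(parse_archive_date("31-June-2013"))      # None
```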

At the expense of a more complex rewrite rule, the day, month, and year components could be separated into distinct querystring fields, saving you a parsing step in the script.

Other ways to format the URL are of course possible. archives/2011/06/04, say. Then there'd be an obvious choice of URL for annual and monthly listings/summaries. (I seem to recall someone advocating the principle that if /foo/bar/baz is a legitimate path in a URL, then /foo/bar should be as well.)

dmacman1962 · Jun 13, 2013

Where was the stray + and what uppercase was incorrect? I want to add this to my htaccess for that directory so I can use the user-friendly names for the date like your example "events/archive/2013-May-07".

RewriteRule: ^{/events/archives/(latest|[0-9][0-9]?+-[a-z]+-[0-9][0-9][0-9][0-9])$} /events/archives.php?date=$1

Thanks

dmacman1962 · Jun 14, 2013

I changed my code to take a full date var and explode it, then look up the archive article with that date. My GET is now archive.php?date=2013-06-01. I couldn't figure out how to write the rewrite code from your example. I asked a consultant to look into writing that for us.

Would this work well with the Google searches that work like your example?

Thanks,
Don

sneakyimp · Jun 14, 2013

What weedpacket is proposing is that your site be structured such that a list of links, once created, will always have the same url and will always show the same content. This sounds like a pretty good strategy in that the google search results should much more accurately reflect the contents of a particular page on your site. The archive on your site, rather than constantly changing every time a new article gets added, would be organized differently such that instead of just getting the N most recent articles, your site would say "here are articles from 2013-06-14" or whatever. While the contents of articles from a particular day might change during the course of that day, these contents would not change once it becomes June 15. So, once googlebot scans your page for 6/14 the following day, the contents of that page (and all its links) would not ever change. The result is consistency between your site's content and the contents of the google index for most pages.

For some pages (such as your home page that might be continually changing in response to new articles or RSS feeds or twitter feeds or whatever) you may not have the luxury of this kind of restructuring. For example, the forum page of phpbuilder.com changes continuously and it's not possible to adopt some other kind of linking scheme to get more consistency with the google index. Even so, this forum (and also linuxquestions.org) are highly optimized for search engines and a posting I make here often appears in a google search very quickly. You can improve the quality of your site indexing by tweaking your sitemap. Note that you can specify a changefreq value for the urls you specify in your sitemap which will tell the bot how often a particular page changes. You can specify any one of these values:
* always
* hourly
* daily
* weekly
* monthly
* yearly
* never

Keep in mind that if you put 'always' on a lot of pages, you might have the googlebot overwhelm your site with page requests.
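As a rough illustration of what such a sitemap contains, here is a Python sketch that builds one with a changefreq per URL (the site itself would more likely generate this from PHP). The build_sitemap function and the two example URLs are assumptions for the example; the namespace is the one defined by the sitemap protocol. Marking a dated archive page as 'never' is a guess at a sensible value for a page that no longer changes.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def build_sitemap(entries) -> str:
    """Build sitemap XML from (url, changefreq) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq in entries:
        if changefreq not in CHANGEFREQ_VALUES:
            raise ValueError("unknown changefreq: " + changefreq)
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


sitemap = build_sitemap([
    # The 'latest' page changes every working day.
    ("http://www.example.com/events/archive/latest", "daily"),
    # A dated archive page is not expected to change again.
    ("http://www.example.com/events/archive/2013-May-07", "never"),
])
print(sitemap)
```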

dmacman1962 · Jun 14, 2013

I want to use his method, and I tried to write the rewrite htaccess code for the archive directory, but I can't get it to work. I am having a hard time understanding regex commands, especially complicated ones like this one. Can you tell me where the 2 issues he mentioned are in his regex code? I want to put this in a htaccess file.

RewriteRule: ^{/events/archives/(latest|[0-9][0-9]?+-[a-z]+-[0-9][0-9][0-9][0-9])$} /events/archives.php?date=$1

He said:

I might have borked the regexp a bit (I was writing it off-hand - and I see a stray + in there now that I look at it and its handling of upper/lower case is off), but the idea I had when writing it was:
Code:
/events/archives/latest => /events/archives.php?date=latest
/events/archives/4-June-1684 => /events/archives.php?date=4-June-1684
So, "either the word 'latest'; or, one or two digits, a hyphen, then a word, a hyphen, then four digits" - the word supposedly being the name of a month.

in this code
RewriteRule: ^/events/archives/(latest|[0-9][0-9]?+-[a-z]+-[0-9][0-9][0-9][0-9])$  /events/archives.php?date=$1

Thanks,
Don

sneakyimp · Jun 14, 2013

Understanding RewriteRule directives can be tricky and facility will only come with practice. The documentation on mod_rewrite is pretty helpful, if a bit intimidating. If you were to look at it, you might see that there's no colon after the word RewriteRule in your apache configuration or htaccess file, for instance.

Before you can make use of any rewriting, you have to make sure that mod_rewrite is installed and switched on for your domain. Additionally, you need to make sure that apache is even paying any attention to your .htaccess file. When you say "it doesn't work" what do you mean?

There are a few important parts of the rule you are working on:
The format of a RewriteRule is

RewriteRule      REQUEST_TO_MATCH      WHAT_TO_CHANGE_IT_TO

* the ^ char indicates the start of the requested url. Your rule only works if the requested url starts with a forward slash. You can get an idea of what the actual requests are by looking at your apache log or, in a PHP script, looking at the contents of $_SERVER["REQUEST_URI"].
* parentheses are used for grouping, and the $1 bit corresponds to whatever was matched by your first group of parentheses.
* the pipe character | means "or"... so that parenthesized group could match the string "latest" or it might match the pattern described by all the brackets, dashes, numbers and letters in the other part of that section.
* [0-9] matches a single character that is a digit -- unless you follow it with a wildcard (see below)
* [a-z] matches a single character that is a letter between a and z -- unless you follow it with a wildcard (see below)
* a plus (+) or an asterisk (*) is a wildcard. A plus means "there has to be at least one of the preceding matched items." An asterisk means "there may be zero or more of the preceding items." So when a plus follows a bracket expression, as in [a-z]+, it means "at least one letter between a and z".
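A quick way to get a feel for these pieces is to try them on short strings. The snippet below uses Python's re module only as a convenient test bench (Apache uses its own regex engine, but these basic constructs behave the same way). The matches helper is made up for the example and checks whether a pattern matches a whole string.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"[a-z]+", "june"))         # True: one or more lowercase letters
print(matches(r"[a-z]+", ""))             # False: + needs at least one
print(matches(r"[a-z]*", ""))             # True: * allows zero
print(matches(r"[0-9][0-9]?", "4"))       # True: ? makes the second digit optional
print(matches(r"[0-9][0-9]?", "04"))      # True
print(matches(r"latest|[0-9]+", "latest"))  # True: | means "or"
print(matches(r"[a-z]+", "June"))         # False: capital J is outside a-z
```

The last line shows why upper/lower case matters for month names: [a-z] on its own does not accept a capital letter.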

dmacman1962 · Jun 14, 2013

Thank you for taking the time to explain it, and give us links to learn it more. To answer your server setting questions, we do use .htaccess files on our server, mostly staff directories. And we have the rewrite module loaded:

Loaded Modules: mod_rewrite
and (if this is required as well)
url_rewriter.tags a=href,area=href,frame=src,input=src,form=fakeentry

I will read up on this and try to get it to work so I can learn it myself.

Thanks,
Don
