hey all, I'm working on a site that is going to get some major media exposure and TONS of traffic spikes (like when TV ads hit and everyone rushes to their computers)... as a result I am trying to code it to be super non-processor-intensive... it does some database work so I want to really minimize the work that the PHP script is doing. (I have tons of hard drive space, fwiw)

the task I'm thinking about is how to count the number of times someone hits a specific page... here's the trick I came up with: append a single period to a file.

later, when I want to count the number of 'hits' to that page (something an admin user would do only once a day) I can just do a filesize check.
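
just to make it concrete, here's roughly what I had in mind (the path is just an example):

// one page view = one byte appended
$fp = fopen('/web/counters/keypage.cnt', 'ab');
if ($fp) {
    fwrite($fp, '.');
    fclose($fp);
}

// later, in the admin script:
clearstatcache();
echo 'hits so far: ' . filesize('/web/counters/keypage.cnt');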

is there a better way? the obvious solutions (keep track in the db, or even just a counter in a static text file) seem like they would use more resources.

thanks,
Eric

    If you use a file to track it, it takes almost no resources, but it's obviously worse for any statistical analysis and harder to read back and report on.
    If you use a DB (MySQL...) you can write whatever you want into it... I'd suggest the latter...

      I don't need much tracking-- just a quick count, but I hear ya... a db is the "right" way to do this (well, just using the server logs is really the right way, but I digress) but way too heavy for every time someone touches a page, at least in this app.

        If all you want is to know which user visited which page, you could do it the way you said and just put both the page name and something to identify the user (IP or username) in the name of the file.

          I have a feeling that a database will be MUCH LESS cpu intensive than opening a file and appending periods to it...

          PHP has to open that file EVERY TIME a user hits it, which causes hits to the filesystem ( read SLOW ) AND more than likely SERIOUS file locking issues if a lot of people try to access the file simultaneously ( which you seem to think they will do )

          MySQL will more than likely manage these queries in RAM MUCH more efficiently, and an update like SET hits = hits + 1 requires almost no CPU whatsoever.
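
          Something like this, roughly (table and column names are just examples, and assuming the connection is already open):

          // assumes a table like: CREATE TABLE page_hits (page VARCHAR(64) PRIMARY KEY, hits INT)
          mysql_query("UPDATE page_hits SET hits = hits + 1 WHERE page = 'keypage'");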

          Not to mention appending all those periods is a horrible waste of filespace, even if you do have 18 gigs.

            The real issue here is not how fast you can run the update a thousand or a million times in a row, but how fast can you update it with 200+ people all doing it at once. And avoiding race conditions.

            I'd do it with a sequence and nextval() in postgresql.

            But, you likely don't have that option. So, here's a few ideas.

            1: Try your way (appending a '.' character)
            2: fopen a file, read a number from it, increment it, and store it.
            3: Update a field in a MySQL table.

            For 1 and 2 you'll need to program in flock or you'll likely have a race condition.
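
            For 2, a rough sketch with flock (file name made up, and the file is assumed to already exist holding a number):

            $fp = @fopen('/tmp/hitcount.txt', 'r+b');
            if ($fp) {
                flock($fp, LOCK_EX);                    // serialize concurrent writers
                $count = (int) fread($fp, 32);          // read the current count
                ftruncate($fp, 0);
                rewind($fp);
                fwrite($fp, (string) ($count + 1));     // write it back, incremented
                flock($fp, LOCK_UN);
                fclose($fp);
            }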

            Next, run apache bench (or some other benchmark that can run parallel accesses) like so:

            ab -c 20 -n 1000 http://yourserver/yoururl.html

            and see if your counting method counts all 1000 updates. If it comes up short, you've got a race condition to worry about.

            IF you have postgresql, your problem is already solved. Use persistent connects (I'd use them for MySQL too here) and config the server for it (i.e. max postgresql connects must be > than max apache children, I usually drop apache to 50 and crank up pgsql to 100 to be safe) then you:

            create sequence counter;

            And each time a page loads, just:

            select nextval('counter');

            from that database.

            Postgresql sequences are guaranteed transactionally safe, and they're very fast.
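
            In PHP that ends up being something like this on each page load (connection string is obviously just a placeholder):

            $db = pg_pconnect('host=localhost dbname=stats user=web');   // persistent connect
            pg_query($db, "SELECT nextval('counter')");                  // one hit counted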

              Sxooter, much appreciate the thoughtful analysis. Why not MySQL for the same thing? Is it not guaranteed transactionally safe for simple increments like that?

              Love the apache bench testing idea... I'll def try it.

              best
              Eric

                The reason is that MySQL doesn't have sequences, it has autoincrementing fields, which don't work quite the same. It still might be made to work, but you'd be inserting rows in a table, with it growing. Maybe you could delete the old row to keep the size down, but deletes cost a fair bit of CPU time.

                MySQL's autoincrement fields are incremented by:

                lock table;
                select max(autoinc)+1;
                insert new row with autoinc value;
                unlock table;

                So they don't have very good concurrency. It would be more effective if you locked the table yourself and just incremented a single row over and over, since it's quicker to lock a single-row table than to keep adding new rows and watch the table grow larger and larger.
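
                i.e. roughly this on each hit (table name made up):

                mysql_query("LOCK TABLES hit_counter WRITE");
                mysql_query("UPDATE hit_counter SET hits = hits + 1");
                mysql_query("UNLOCK TABLES");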

                This exact same method would be horrifically slow for Postgresql since it uses an MVCC locking system where every update is actually a delete/insert, so in postgresql, locking a table with one row, updating it, and unlocking it would result in x dead rows in the table space for x updates. Very bad.

                I.e. either database can probably handle this, but they need to handle it differently because one is row locking (MySQL) i.e. update in place, while the other is MVCC (Postgresql) i.e. update=delete/insert.

                Since Postgresql sequences don't live in MVCC tables, they don't have the problem with update = delete / insert that a table update would.

                Then, to see the count, you would just run 'select last_value from counter;'.

                  You could always use MySQL-Max, which has InnoDB... with transactions... am I wrong?

                    The problem for MySQL here isn't a lack of transactions on MyISAM tables, it's a lack of good concurrent accessibility on its autoincrements. With the method I was mentioning using autoincs, MySQL does the same "lock table; select max(id)+1 from table; insert row with value from previous statement; unlock table"

                    It can't escape that by having the row locking inherent in InnoDB or BDB tables.

                    Keep in mind that what the original poster wants to do is not, in itself, a simple thing. It may well be that the best solution is to just create a table with no indexes or anything else, and insert a single row with a single character in it (doesn't matter what character) each time a page is hit, then do a select count(*) from that table.
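
                    Roughly (table name made up; one column, no indexes, connection assumed open):

                    mysql_query("INSERT INTO raw_hits VALUES ('x')");      // one row per page view

                    // and when the admin wants the number:
                    $res = mysql_query("SELECT COUNT(*) FROM raw_hits");
                    echo mysql_result($res, 0);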

                    MyISAM might be a bad choice for that, but InnoDB tables or postgresql should work well. However, often large tables start to behave poorly over time (i.e. if you've got 100M rows, adding one is usually gonna be slower than adding a row to an empty table), so the performance envelope over time of adding rows to a table may drop off to a point where it becomes slower than the method where you update a single row (MySQL) or a sequence (Postgresql).

                    Even though the updates in these two methods (single row MySQL, sequence Postgresql) are by their nature a bit of a bottleneck, they are bottlenecks that should stay the same over time.

                    Often the biggest headaches are things that run fast at first, then slow down over time. Postgresql is known for doing this due to its issues with vacuuming. MySQL is known for doing this with autoincrement fields in large tables. Note that in Postgresql the logical size of the table doesn't affect performance so much as the physical size on disk, which grows due to its MVCC locking mechanism. I.e. you can have a table with 1000 visible rows that takes up the same disk space as 100,000,000 rows due to unreclaimed space if you've failed to vacuum regularly.

                    So, yes, InnoDB is a good choice if you want to do something like just add a single small row each time someone hits the page, which is another method I hadn't really thought of til now.

                    Sounds like an interesting case to study and benchmark.

                      after digesting all this, and checking to see what my database options are (I only have mysql right now), I think I'm going to do this:

                      1. insert a row every time someone touches the page.

                      2. add a cron job and once a day, count the rows, store that count, and empty the table so it's blank again (and probably optimize the table, tho that may take too long to make it worthwhile)
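
                      the daily cron script would be something like this (table names made up, connection assumed open):

                      $res = mysql_query("SELECT COUNT(*) FROM page_hits_raw");
                      $count = mysql_result($res, 0);
                      mysql_query("INSERT INTO daily_counts (log_date, hits) VALUES (CURDATE(), $count)");
                      // note: hits landing between the COUNT and the TRUNCATE would get lost
                      mysql_query("TRUNCATE TABLE page_hits_raw");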

                      not bad?

                      E

                        yup, that should work if you only want one day's count.... but i guess it will accumulate and not be over-written every day? that makes sense then.

                          Originally posted by ahundiak
                          Why not just use your web logs?

                          I had the same thought while driving home last night... 🙂

                            lol I know, duh, web logs would be great... but I can't in this project. sucks, huh, since it's all tracked at that level!

                              hey!! I got them to give me access to the apache logs... so now the question is, do y'all know offhand of a class library or code snippet that will search an apache log and crunch it down? barring that, I'll just do it by brute force but as you can imagine, these things get big fast... I'm caching the stats but not for any longer than 30 minutes or so, and I can't imagine re-crunching the log file on demand like that. I need to go look at the log files on the disk (I haven't seen them yet)... maybe if they're stored by day, I could crunch them once a day and just store the results. hmm.

                                that depends on how the apache server is set up. I believe that by default there is a new access log (which is the one you're interested in) generated every day.

                                  Your app sounds like a professional one, so the only suitable solution is to use a professional tracking/analysis tool or service like Xiti.

                                  If you can't afford it, my advice is not to resort to tricks: use a DB to track your users and visitors. Tracking registered users is better because you don't need to know their IP, which can change after all. So the only thing you have to do is update a simple integer field in a "visits" table: visits = id_user, num. However, I don't really understand what you can do with such a basic tracker; IMHO it's useless since you don't know the date and time. A number of visits means nothing if you can't build statistics from it, per day, per hour... Moreover, you have to know the page URL they visit... See, it's not that simple, you can't just count the hits.

                                  So the question is: what information do you want to track with your tracking system? Only a "virtual" number of visitors? What's the point?

                                  JM. Molina

                                    lol. let me clarify. I am building a high-level "admin" dashboard for our app-- a pretty, clean, no-hassle interface for the execs at this company to get some numbers from their site. they want to see very clear figures like "how many people went to this key page" and "how many people clicked this important button." they don't care about visits, hits, sessions, or any of that important stuff.

                                    I do have a fantastic stats package running on the domain (urchin) and can easily get numbers there... BUT... the spec says these numbers will appear on the nice pretty php-driven dashboard page that I am coding, not nine clicks down in urchin.

                                    yes, I would like to just tell the execs "hey guys, you can just go into urchin and get a TON of data, all sorts of interesting things, and chop and slice it nine ways from sunday and see it however you want, and there's even pretty graphs and cool stuff," but that is not an option. I need to have this clean, simple page just report the figures on demand.

                                    ideally, Urchin would crunch the logs for me and then just save the results someplace in a nice format that I could import in PHP and display. wouldn't that be lovely? too bad it doesn't work like that. ;-) so I'm back to playing with the logs myself.

                                    (incidentally, I will be displaying these figures in four date ranges: today, this week, this month and total-to-date. basic log crunching.)

                                    if apache rolls the logs daily, I could figure out how to crunch them once a day and just store my results, then I can add them up on the fly. it's a lot of absurd coding, tho, for something that should be easier. which is why I like the idea of manually dropping counters on the pages that are important: it's not nearly as efficient as going back to the apache logs, but it will be a whole lot less work to display those stats. which is all that matters in the end.

                                    so yeah, if anyone has any better ideas, I'm all ears. it's a sort of interesting little coding job.

                                    best
                                    Eric

                                      If all that is important is the number of times given pages were visited in the last X, where X can be an hour, a day, a month, or to-date, I would do it with the period appended to a file like you suggest, plus a cron job that runs every hour.

                                      Cron would do this:
                                      1) If it is midnight on the 1st of the month (x-01-xxxx), then empty the monthly file.
                                      2) If it is midnight, empty the daily file.
                                      3) Figure hits in the last hour = bytes in the periods file.
                                      4) Empty the periods file.
                                      5) Add the hits in the last hour to the day, month and to-date files.
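
                                      A rough sketch of that cron script (paths made up; I'm keeping plain numbers in the day/month/to-date files rather than periods, and leaving steps 1 and 2 out for brevity):

                                      $hits = (int) @filesize('/web/logs/periods/keypage');        // 3) bytes == hits this hour
                                      $fp = fopen('/web/logs/periods/keypage', 'wb');              // 4) empty the periods file
                                      fclose($fp);
                                      foreach (array('day', 'month', 'todate') as $range) {        // 5) add to each rollup
                                          $file = "/web/logs/$range/keypage";
                                          $total = (int) @file_get_contents($file) + $hits;
                                          $out = fopen($file, 'wb');
                                          fwrite($out, (string) $total);
                                          fclose($out);
                                      }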

                                      Recording the hits:
                                      I would just create a class that creates a file in a logs folder, named after the current page, and appends a period to the end of it.

                                      I could then create different methods for reading in the files and displaying the data.
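
                                      Very roughly, the class might look like this (folder name made up):

                                      class HitLogger
                                      {
                                          var $dir = '/web/logs/periods';

                                          // append one period to a file named after the current page
                                          function hit()
                                          {
                                              $file = $this->dir . '/' . basename($_SERVER['PHP_SELF']);
                                              $fp = fopen($file, 'ab');
                                              if ($fp) {
                                                  fwrite($fp, '.');
                                                  fclose($fp);
                                              }
                                          }

                                          // hits for a given page = size of its file in bytes
                                          function count($page)
                                          {
                                              clearstatcache();
                                              $size = @filesize($this->dir . '/' . basename($page));
                                              return $size === false ? 0 : $size;
                                          }
                                      }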

                                      This solution would probably take less than a day to code, and if they don't like your dashboard everything else stays modular.

                                      Heck you could even make up a text file that contains a list of the pages that are to be logged and then use htaccess to prepend the logging function to every page. Then you just change this text file to change which pages are logged.
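
                                      For the htaccess part, something like this ought to do it (path made up), assuming mod_php and that AllowOverride lets you set PHP values:

                                      php_value auto_prepend_file /web/includes/hitlog.php

                                      where hitlog.php reads your text file of pages to log and calls the logging class only if the current script is on the list.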