I'm working on a real-time hit counter project. I would like to know the number of times each page has been displayed and record that number in a MySQL database. Another script will then extract the stats from the database and show the hit count for each page to the users.

Currently I'm using an SSI CGI written in C that works with a flat-file database, keeping track of the hits for each page in separate files (page1.html has a data file page1.stats storing its number of hits).

The problem with this method is that it's resource intensive, and the stats sometimes get reset to 0 when too many users access a page at the same time (for some reason, the file lock doesn't seem to work properly).

After getting some suggestions from people on PHPBuilder, I've narrowed my choices down to the following three:

  1. using a separate hit-tracker script, stats.php?page1.html, to update the hits in the database and then send the visitor to page1.html

  2. calling stats.php?page1.html from within page1.html via <script> or <img> tags (to avoid the server-side parsing my current SSI requires).

  3. analyzing the server access log at a 5-minute interval via a cron job to simulate real-time tracking (which I'm not quite sure how to do at this moment)

There are about 500 pages that I'm going to track, with a total of at least 1.5 million page accesses per day to these pages.

The server is currently receiving over 5 million hits per day, and the log file is quite large (logs are rotated 4 times a day).

My major concern right now is to minimize server resource usage. The server is equipped with dual 750 CPUs and 2 GB of RAM.

Can anyone give me a better suggestion of implementation?

    My site doesn't have quite as many pages (it's 95% PHP/MySQL driven, so ALL content is created dynamically), but I've set up a small counter.

    However, my implementation is a bit more basic and thus might not necessarily apply here. I'm not sure. I think it's an extremely fast way of doing it. Though I suppose you can't get much slower than working with a flat-file?

    Basically, I have a field set up in my table for each page I want to track:

    CREATE TABLE stats (stats_id int not null primary key auto_increment, page1 int, page2 int....);

    Perhaps you might want to use BIGINT? I've never used it and I'm not sure about the difference (I'm still quite new to PHP/MySQL), but with those kinds of views....

    Then, the rest is quite simple:

    On each page you want to track, you set the corresponding page name and then include("stats_counter.php"), like so:

    In Page1.php:

    <?
    $page = "page1";
    include("stats_counter.php");
    ?>
    ...rest of html

    stats_counter.php:
    <?
    // Assumes a MySQL connection is already open and that the stats table holds a single row.
    $stats_result = mysql_query("SELECT $page FROM stats");
    $stats = mysql_fetch_row($stats_result);

    // Read the current count and add one (note: this read-then-update is not
    // atomic, so two simultaneous hits may be counted as one).
    $counter = ++$stats[0];

    mysql_query("UPDATE stats SET $page='$counter'");
    ?>

    That should be it.

    A bit too simple, perhaps? It won't give you as much detailed info as you might want, and with the number of pages you're dealing with, you'll probably want to incorporate some kind of automatic field insertion. Furthermore, it's not really a hit tracker so much as a page-views counter. Not sure which you were looking for.

    Personally, I'd reckon that the simpler the better, but perhaps you need more detailed info.

    Pardon if this was too much in the wrong direction.

      Thank you for your comments.

      I'm fairly new to PHP+MySQL as well; this is my second project.

      All I need is a simple counter; I don't need any extra info beyond the number of hits to each page.

      The database structure I have in mind is a table with two fields, page_name (or page id) and hits.

      Whenever the script is called, I will update the hits by increasing its value by 1 where page_name is matched.

      I'll use mysql_affected_rows() to tell if the above is being done or not.

      If mysql_affected_rows() == 0, I know the entry is not in my database, so I'll insert a new entry.
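
      Roughly what I have in mind (just a sketch; it assumes the mysql_connect() connection is already open, the table is called stats, and $page_name already holds the page's id):

      <?
      // Bump the counter for this page; the WHERE clause does the matching.
      $page_name = addslashes($page_name);
      mysql_query("UPDATE stats SET hits = hits + 1 WHERE page_name = '$page_name'");

      // No row touched means the page isn't in the table yet, so create it.
      if (mysql_affected_rows() == 0) {
          mysql_query("INSERT INTO stats (page_name, hits) VALUES ('$page_name', 1)");
      }
      ?>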

      But what I'm interested in is maximizing the efficiency of this task and keeping my server work load down.

        MySQL doesn't have very good write performance when lots of clients are writing at once; the write load might overwhelm it.

        As far as your current method:

        The problem with this method is that it's resource intensive, and the
        stats sometimes get reset to 0 when too many users access a page at the
        same time (for some reason, the file lock doesn't seem to work
        properly).

        If your log files are plain text files, with each record being a single line (and thus of modest length), do you even need a lockfile? You don't mention the OS (or I missed it), but doesn't just about everybody follow the good old Unix semantics that low-level write()s are atomic? Hmm, I guess I'm just assuming you're using Apache on Linux/xxxBSD/Solaris, but maybe not.
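
        For what it's worth, here's a sketch of what I mean (the file name is made up): each hit just appends one short line in append mode, with no flock() at all:

        <?
        // Append one record per hit; a single small write in append mode should be
        // atomic on the usual Unixes, so concurrent hits shouldn't clobber each other.
        $fp = fopen("/var/log/hits/page1.log", "a");
        if ($fp) {
            fwrite($fp, time() . " page1\n");
            fclose($fp);
        }
        ?>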

        As for my somewhat tongue-in-cheek proposal elsewhere: hey, if you really are rotating logs every 15 minutes, maybe going to every 5 isn't that big a stretch.

          I'm no pro, but the method I described, which sounds exactly like what you just described, shouldn't be too deficient.

          That's a guess, but you're only making about 2-3 mysql_query() calls.

          If you have sections, perhaps that might optimize things?

          I.e., a table for section1 (containing pages 0-125), a table for section2 (containing pages 126-250), and so on. But with only 500 rows, it's probably not worth the bother. Anyone else have comments?

          Why not just plop the code on and track your server load?

            I'm not going to do a lot of writing into the database. The 500 web pages that I'm tracking are all static.

            The database structure I have in mind is:

            page_name    hits
            ---------    ----
            000000001    1123
            000000002    2332
            000000003    4225
            :
            000000500    3123

            I will do something like "UPDATE $table SET hits = hits + 1 WHERE page_name = '$page_name'".

            Then I'll check mysql_affected_rows() to see whether any row got updated.

            If not, I'll just INSERT a new entry starting with hits value 1.

            There won't be a tremendous amount of writing in process, but rather millions of updating processes.

            By the way, I'm using Red Hat Linux 6.1 with Apache.

            I haven't really had a chance to work with my server log file before, but I believe it's in the standard CLF (Common Log Format).

            The SSI CGI script I'm currently using was written in C by someone else. It's only about 50 lines of code, rather simple, using flock() to prevent simultaneous write access. It's called via /counter.cgi?0001 and generates a 0001.stats file containing only the total number of hits. Theoretically it should work, but the count still sometimes gets reset to zero.

            Since SSI increases the server work load by parsing the HTML files (in my case, 1.5 million times a day), I would like to find a more efficient method.

            Regarding analyzing the log file, how am I supposed to extract the lines containing a particular URL string? Via grep?

              My server is already under a heavy work load, so it would be hard for me to evaluate the exact performance of my scheme.

              I don't think breaking the table up would help either; the table itself is only going to have a maximum of 500 rows.

              I think the strategy is good enough, as far as I can think of for now.

              My second option is to try analyzing the server log file.

              I would like to figure out the best way before I start implementing it.

                I'm not going to do a lot of writing into the database
                ...
                There won't be a tremendous amount of writing in process, but rather
                millions of updating processes

                Uhh, not to upset your cart, but updates are writes. If you're right about your hit count ("at least 1.5 million page accesses per day"), I think you are headed for some trouble. That translates to an average of 17.3 writes (updates/inserts*) per second, which MySQL might manage, except that the load is surely not spread out perfectly evenly. So I would expect to see significant throughput problems at peak periods.

                Using the log files, and rotating them every 5 minutes, has a number of advantages. First of all, only a single process handles the calculation, so there are no contention problems. Furthermore, since it is a separate process, you can give it a nice value so that it doesn't have much impact on the webserver's throughput. Now, if the load is such that you can't process 5 minutes' worth of log files in 5 minutes, you're in trouble, but you would be anyway in that case. And yes, you can grep the file for interesting rows, sort it, etc.
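
                For example, a crontab entry along these lines (the processing script is hypothetical) takes care of both the 5-minute schedule and the nice value:

                # run the log processor every 5 minutes at low priority
                */5 * * * * nice -n 19 /usr/local/bin/process_hits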

                  * Inserts can be slower than updates if you have lots of indexes defined, but plain updates are certainly slow enough, and cause the table-locking contention for which MySQL is so famous.

                    I think I'll go with the log analysis method for my solution.

                    What will happen if I can't manage to process the log files in 5 minutes?

                    I now have another three proposals for implementing the log analysis:

                    1. grep out the lines I'm interested in, store them in an array of lines, parse them into an array of per-page hit counts, and, once the parsing is done, update the data in the MySQL database (roughly what the sketch below does).

                    2. grep out the lines I'm interested in, store them in an array of lines, go through them line by line, parse each one, and make the necessary inserts and updates to the MySQL database directly instead of accumulating the data in an array of variables.

                    3. Read the whole file in, parse it line by line, and then make inserts and updates as in #2 (which I believe is the least efficient way).

                    I'm not quite sure how efficient grep is; does it have to do pattern matching on each line to generate my desired result? In that case, would #3 be more efficient?
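
                    To make #1 concrete, here is a rough sketch of what I have in mind (the log path, URL pattern, and table layout are just placeholders; it assumes Common Log Format lines and an already-open MySQL connection):

                    <?
                    // Tally hits per tracked page in memory, then write each total to
                    // MySQL once, instead of issuing one query per hit.
                    $hits = array();

                    $fp = fopen("/var/log/httpd/access_log.0", "r");
                    while ($fp && !feof($fp)) {
                        $line = fgets($fp, 4096);
                        // CLF: host ident user [date] "GET /path HTTP/1.0" status bytes
                        if (preg_match('|"GET (/page\d+\.html)|', $line, $m)) {
                            if (!isset($hits[$m[1]])) $hits[$m[1]] = 0;
                            $hits[$m[1]]++;
                        }
                    }
                    if ($fp) fclose($fp);

                    // One UPDATE (or INSERT) per tracked page.
                    while (list($page, $count) = each($hits)) {
                        $page = addslashes($page);
                        mysql_query("UPDATE stats SET hits = hits + $count WHERE page_name = '$page'");
                        if (mysql_affected_rows() == 0) {
                            mysql_query("INSERT INTO stats (page_name, hits) VALUES ('$page', $count)");
                        }
                    }
                    ?>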

                    Another question I have in mind: how do those FREE traffic-analysis services work? They claim to have real-time stats while tracking billions of hits per day. I place their link code on my page, it displays a small icon with their ad, and they do the traffic tracking for me. Do they store everything straight into a database, or do they write log files and then run the analysis at some interval?
