Hi!
I'm working on a disscusion forum program (PHP + MySQL). When designing the table schema, I tried to follow the normalization principles and now I have 10 tables : Channel, Forum, Topic, Article, ChangeLog, BanList, MailingList, Moderator, Signature and User.

The problem is that I found I have to do lots of joins to get the information I need. For example, this is one of the queries I made to get the information of a Topic :

SELECT Topic.,
Channel.c_id,
Channel.title,
Channel.visit_priv,
Forum.title,
Forum.visit_priv,
COUNT(
) AS total_article,
MAX(Article.create_time) AS last_post,
User.account,
User.nickname
FROM Channel,Forum,Topic,User
LEFT JOIN Article
ON Topic.t_id = Article.t_id
WHERE Channel.c_id = Forum.c_id AND
Forum.f_id = Topic.f_id AND
Topic.u_id = User.u_id AND
Topic.t_id = $this->t_id
GROUP BY Topic.t_id

Although I've added indexes to my tables and the EXPLAIN result of this query is not too bad, I still wonder that, if I add some level of redundancy to my tables, I can improve the SELECT performance. For example, if I add a "last_post" field to my Topic table, then I won't have to join Article table to find the time of lastest Article in a Topic (EXPLAIN result shown that this is the bottleneck of this query).

I know that if I do this, I'll need to do an extra check and update Topic table every time an Article is posted or deleted. But times of posting / deleting an Article is far less than simply viewing the pages (times of INSERT/DELETE is far less than SELECT), I think this extra effort of keeping the "last_post" field updated should be worth it.

Please share your opinions. Thank you.

    I'm baffled by your statement that you "follow the normalization principles and now I have 10 tables" ... that doesn't look normalized to me.

    What is the difference between Channel, Forum, and Topic? Aren't they all containers?

    Why isn't Signature a property of a User record?

    Are you banning on a discussion-level basis, or globally? If the latter, what is the purpose of a separate Banned table?

    And what's MailingList for? If it's just a list of users who are to receive email, why isn't that a property of the User record?

      Forum probably holds a list of the forums (IE: discussion, feedback, etc.) as well as some style settings for individual forums. Topic probably contains all the actual messages, with a forum_id, user_id, etc.

      You would keep banlist seperated from the user table because you have to ban by IP, which can always change - so it's probably more a list of IPs than user names.

      One would probably keep signature seperate because it's pointless to set aside places for all that data when only a few users take advantage of it.

      Let's take a step back, however. Why attack his nitty gritty rather than answer the poor guy's question? He's asking whether we feel that a certain amount of redundancy is justified if you can significantly reduce the number or complexity of the database queries.

      In my opinion, yes, it's justified.

        Ack 'Marinne' may be a she, if this is the case, forgive me :-)

          But what's happening is an INCREASE in the complexity of the queries -- a lot of joins -- not a decrease.

            you have to ban by IP

            Only if you want to harm a lot of innocent bystanders! Oh, and let lots of perps with dialup accounts slip through the cracks un-banned. :-( Everybody repeat after me:

            There is not now, nor has there ever been, a one-to-one correspondence between users and IP addresses!

            (There, I feel better now!)

              Continuing on this topic ... here is a link to a PHPBuilder article on data model normalization:

              http://www.phpbuilder.com/columns/barry20000731.php3

              The key point is "eliminating redundancy and inconsistent dependency."

              I learned this stuff on the job, by working for a month or so on a project with a really, really good data model engineer. At some point in the process I decided I wanted more than anything to throw him out the nearest window if he asked another question of the form "what is X, exactly?" or "what is the relationship between X and Y?" It can be maddening. But it's important.

              I'll throw out this challenge: Model a message board that supports any number of users, any number of messages, organized in any number of containers at any arbitrary depth of nesting. Keep track of the last message read in each container so that you can compute the number of new messages.

              This requires no more than three tables, although I would probably treat individual messages as distinct from containers and therefore wind up with four tables (not counting index tables created when you designate a column as a key, etc.)

              When you're done, you'll find that your queries are VERY simple.

                Right, that's how it is now. If you put a "last post date" in the forum table, you'd only have to query that one, for the main page, make sense? By adding some extra REDUNDANT fields to some of the tables, you'd have to do less queries and less complex queries across less tables. At the cost of a timestamp, don't you think it'd be better? All you have to do then is make sure the forum table is updated each time a message is posted/deleted/edited/whatever.

                  Look, I think this is a well-known fact. But the point is, sometimes this is the only thing you can do. Sometimes, the rule-breaker is on a static IP. Sometimes, you MUST ban an entire IP range to get rid of the trouble makers. Have you ever run a high-trafficked forum? We have roughly 5000 registered users, and when one or two choose to act out for whatever reason, they flood the ENTIRE message board with posts for hours - the only way to get them to stop is to ban their IPs. Sure, this may catch some users from their same ISP, but I'd rather ban 10 people, even if 9 of them weren't breaking the rules, than let 1 person completely ruin the entire forum. Because if I let one person ruin the entire forum, BAM I may as well have just shut the whole thing down, because nobody wants to visit a forum like that anyway.

                    If you've got some magic formula, why not share it with the rest of us?

                      Nothing magic about it; it's just modeling.

                      In the following model, there are three primary entities: rooms (containers for either messages or other rooms), messages, and users. If you don't like the concept of "room," call it a "folder." Same thing.

                      The relationships are:

                      rooms-<messages (one to many). One room can contain many messages, but a message can exist in only one room.

                      users-<messages (one to many). Each message has one and only one author. An author may write many messages.

                      rooms-<rooms (one to many, recursive). This is how rooms can contain rooms (or folders can contain folders). This is why forums and topics are the same thing; they are just containers with a parent-child relationship. If you want to allow nesting to any arbitrary depth, the model allows for that.

                      The tricky one:

                      users>-<rooms (many to many). A user visits many rooms. A room is visited by many users. In every case we need to track a "high-water mark" -- in other words, a bookmark pointing to the last-read message. This lets us not only compute the number of unread messages, it lets us hide the old messages, a concept invented in the 1970s but sadly missing from many "modern" message boards.

                      To turn this into a physical model, start with one table for each entity. Rooms, users, messages. Three tables.

                      Then consider relationships.

                      One-to-many relationships are simple: a key in one table points to a record in another table. No additional tables are necessary.

                      Many-to-many relationships require another (relational) table.

                      So now we have four tables. See here:

                      http://stonemountain.yi.org/prattle/prattle.sql.txt

                      This is a model of a "linear" message board. If you like threading, it's fairly simple: add a one-to-many (parent-child) relation, message to messages. That requires an additional column in the message table, not an additional table.

                      If you need to "ban" IP's, that's a separate entity to track, so it deserves a separate table. I do agree with Kirk that it's not a good way of handling the problem.

                      I used to run a system with about 40,000 registered users. Once or twice we blocked a nuisance IP at the router or httpd level. Registration and email verification takes care of nearly all of the village idiots.

                        I'm not following how you can store 'forum' information in a messages table. In a typical forum system, you'd have say 3 seperate forums. General Discussion, Feedback, and Technical Help. Each of these forums can have different logo graphics, different colors, etc. They would also have to have descriptions, and rules regarding who is allowed to post, etc.

                        Your model sounds fine, but it sounds a bit too simple for a forum system in this day and age. People want features, and it's practically impossible to provide all the features people want with only 3 or 4 tables. I'm not saying I have a good answer, but I don't think your solutions are answers to the questions asked here.

                        Just to make the point of the signature table - in the MySQL manual, it says a text field requires 2 bytes of storage, even if the field is empty. This may not sound like much, but take your 40,000 users and do the math. It also depends on how one intends to implement the signature field - is it an option to put it on the end of each message? Most message boards put a seperating few characters "---" above the sig, which don't need to be there each time if there is no sig for the person. So in that case, you'd have to test for a null value (I think? or empty string? I'm not sure what mysql returns - but I do know that PHP creates the textfield variable, even if it's empty) returned from the select query.

                        But seriously, we can debate a type of table design for a forum system all day - there are many ways to do it, just look at the designs for all the popular php forums, phpBB, phorum, vBulletin, etc. There's room for different opinions.

                        But Steve, it doesn't seem that you have stated your opinion clearly -- do you think it's NOT okay to have redundancy in order to speed up or simplify the queries? For instance, you could reduce the load on the messages table if you had a "topics" table that simply stored the text field of the topic, the number of posts, user ID & name of the person that posted it, and a timestamp of the last reply. You'd only have to hit one table each time someone loaded the main forum page, rather than a messages table and a users table - consider the fact that you said you had 40,000 users, and that's a lot less data that mysql has to sift through.

                        As for the banning problem - My site caters to young people, it's based on computer games. Many of these people are still in grade school or middle school or high school. I'm sure it would be much different if I ran an IT site that catered to professionals, but not everyone has the luxury (or desire...) to make a site like that. Obviously this is a priority to some people (including myself and the person asking the question). I know from experience that email registration does not work - I've spent nights up removing hundreds of crap posts because some kid got mad at the staff for not posting his file fast enough. Also consider the tens of thousands (if not more) of free email providers... Trust me, these kids take advantage of them.

                          We have roughly 5000 registered users

                          "Registered" in what way? The brute-force ban-the-whole-subnet approach only seems necessary if your users are anonymous.

                          The way we typically handle this is to require a real email address, and send out a one-time access key. Until they respond to it, they only have read-only access. Sure, an obnoxious user can simply shift to another address, but there's a bit of effort and delay involved, so it does slow down the amount of damage a person can create. I am quite aware that not every kind of site will want this degree of confirmation, but if you have much less than this, what sense does it make to call those who post "users" much less "registered users"?

                          Such a casual-entry system seems to me to really only have "posts", some of which happen to allege to come from a certain individual or other. And I completely agree, in such a case the IP address is all you have to go on. That's the price of the near-anonymity, and for some systems that's a very valid tradeoff.

                            Look, I think this is a well-known fact.

                            Would that this were so, but apparently not based on the number of inquiries I see. One recent example:

                            www.phpbuilder.com/forum/read.php3?num=5&id=11309&loc=0&thread=11309

                              It's a UBB, and I require email confirmation upon registration - people just get a hundred email addresses and a bunch of different user names... Nothing I can do. In one of my above posts I explained that my audience is generally much younger than 'normal' I guess. So I'm sure that has a lot to do with it.

                                shrug I stand corrected. However, it seems that the original poster knew this, unless his banlist is somehow different than I imagine it.

                                  Write a Reply...