So I'm rebuilding a site that was constructed 15 years ago. The old, much-doctored code is still running in production. We have about 100 db tables and neglected to keep a data dictionary. While most of the data comes from government sources, some data is of unknown origin. Some years after launch, I started keeping a data dictionary in a spreadsheet, but it is proving to be far too vague.

Do you guys maintain data dictionaries? These seem to be a good idea for the old "what if our only developers gets hit by a bus scenario" but perhaps even more for understanding the source of your original data. What format do you use? What data points about each table and/or column?

Any anecdotal wisdom would be much appreciated.

    I don't know that we do, per se. After an accident or two during a recent deployment where a table got dropped without a backup*, I did create a "schemas.php" which really isn't PHP at all:

    <?php
    
    exit;
    
    ?>
    DB schemas for FOO.com; save for disaster recoveries.
    Note that these are only tables created for FOO 2.0 or
    other additional features beyond the original Gonkulate
    software installation; those tables should be backed up
    in several locations.
    
    # TOC: attribute_count, cart, cart_detail, cart_item, 
    items, listing_dimensions, logins_e, orders, pricing_plan, 
    search_feature_product, payment_methods, terms, table_salt,
    social_media
    
    mynock@localhost [foo_copy]> describe attribute_count;
    +--------+-------------+------+-----+---------+-------+
    | Field  | Type        | Null | Key | Default | Extra |
    +--------+-------------+------+-----+---------+-------+
    | cat    | int(11)     | YES  |     | NULL    |       |
    | attrib | varchar(48) | YES  |     | NULL    |       |
    | value  | varchar(48) | YES  |     | NULL    |       |
    | count  | int(11)     | YES  |     | NULL    |       |
    +--------+-------------+------+-----+---------+-------+
    mysql> describe cart
    etc ...

    As far as sourcing the data itself, that would be an exercise in analyzing what the data is and extrapolating based on that. User data is assumed to come from users. Product data generally comes from the Vendor but via our special tools unless "is_import" is false in the product table; in that case it could've come from the vendor directly via their control panel or from using via the admin panel ... and the only reference we have for that is a field called last_login_IP and we'd assume it was from them unless it had our IP address(es) in there. Everything else we probably generated ourselves or was placed there by the Original Programmers, and much of that is no longer viable (since we're on v2.5 or so of continuous development since 2010).

    *These tables were "new" tables that, somewhat fortunately, only had development data in them. The structure, alas, wasn't documented yet and had to be recalled/rebuilt from .mysql_history or code analysis ... (Argh!) Production server tables are backed up regularly. 😉

      I don't recall using a distinct data dictionary as such — I suspect I've tried on occasion but found such documentation went stale faster than code comments, probably to do with a separate data dictionary being separate — it means having to sync information in two places when "making it work" only involves changes in one.

      So at present the closest thing I do to having a data dictionary is putting comments on the tables/columns/types themselves so that they become part of the database schema.

      COMMENT ON COLUMN gallery_pages.index_page IS 'The index page this gallery was listed on.';
      

      Weedpacket COMMENT ON COLUMN

      Thanks for that ... now, where's that "What did you learn today" thread .... 😃

        Mind you, I'll point out that there are at least two "data dictionary" concepts. The one here is the mapping between database objects and the real-world entities the objects are supposed to be representing.

        Meanwhile the DBMS itself has its own data dictionary: a database in which it stores its internal representation of the users' databases. This "system catalogue" may or may not be available to users, but either way there is the standard INFORMATION_SCHEMA, which is typically implemented as a set of views into the system catalogue. But of course that only tells what the database objects are and (unless comments are included) doesn't tell you what they mean.

          Write a Reply...