Maybe I spoke too soon ;P. Everything was working fine yesterday, but today PG was failing with:

Warning: pg_query() [function.pg-query]: Query failed: ERROR: catalog is missing 1 attribute(s) for relid 456086 . in /var/www/html/lib.php on line 200

VACUUM gave the same error, but VACUUM FULL appears to have fixed whatever the problem was. Only one of the PG databases was giving this error. Any idea what the problem was/is? I'm using PG 7.3.2 and run the vacuumdb script nightly.

    It was a SELECT * from a view. The view is something like

    SELECT *
    FROM A
    LEFT JOIN B
    ON A.B_ID = B.ID

      Hmm, I don't have physical access to the server. Looks like a partition can be used instead of a boot disk for memtest? Is that possible/easy/advisable?

      Got a response from Tom Lane:

      CSN writes:

      ERROR: catalog is missing 1 attribute(s) for relid
      456086 . in /var/www/html/lib.php on line 200

      Ugh. This indicates corruption in the system catalogs --- specifically,
      a missing pg_attribute row. These rows will normally be picked up with an
      indexscan, so it's possible that pg_attribute itself is fine and the
      trouble is corruption in pg_attribute_relid_attnum_index.

      VACUUM gave the same error, but VACUUM FULL appears to
      have fixed whatever the problem was.

      Sounds like blind luck to me. I'd have expected that you might escape
      alive by reindexing pg_attribute, otherwise you have a big problem.

      The only way that an error like that would go away by itself is if it
      wasn't really there at all --- that is, the apparent corruption only
      existed in in-memory copies of pg_attribute pages, and not on disk at
      all. I'd strongly recommend running some hardware diagnostics
      (memtest86, badblocks, etc). You might not be so lucky with the next
      dropped bit.

      		regards, tom lane

      Maybe I'll try badblocks. At any rate, is this something that's bound to happen again? And even fubar everything? I've never encountered memory-related errors before (that I know of).

        Yep, it will happen again if you have bad memory. No database can make up for bad memory, and postgresql chews up memory like mad to buffer everything it can, and it trusts the machine to have all good memory.

        You should be able to get your hosting provider to test it for you. It's in their best interest as well as their customer's best interest to make sure their machines are solid and reliable.

        If you don't have direct access, a good way to test for bad memory is to compile postgresql with a -j 4 switch or larger. Run top in one window, and run the make with a larger and larger number to -j until you're almost swapping out memory. If you have enough memory you may not be able to use it all directly, but there will be plenty of disc buffer to use it and get corrupted. Either way, look for sig 11s when compiling. If you get sig-11s from gcc, in random places while compiling, then you have a broken machine underneath you.

          2 years later

          Hey Sxooter-

          What does the "-j" switch do? I couldn't find it in gcc's (version 3.2.2) help or man page.

            -j tells the compiler to run multiple threads, which uses more memory, which, in turn, makes machines with bad ram fail more often.

            Usually... 🙂

              Write a Reply...