hadoob024 wrote: Hmm. That is true! Yeah, I guess I should play it on the safe side. Quick question, with the suggestion you made, "where timestamp > now()- interval '30' days". Is this any safer than what I was planning on originally? I guess instead of just deleting the records, maybe I should just dump them into another table? How hard would this be to setup?
Like Sxooter, I believe that all data must be recoverable, ALWAYS.
I archive data out this way.
Have an archive table that is identical in structure to the master table, EXCEPT that any identity/autoinc columns are retyped to plain int (or whatever fits), so they keep the values they arrive with instead of generating new ones.
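Just to illustrate (main_table, main_id, timestamp and payload here are made-up placeholders, so swap in your real schema), the pair might look something like:

CREATE TABLE main_table (
    main_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- identity column
    timestamp DATETIME NOT NULL,
    payload   VARCHAR(255)
);

CREATE TABLE archive_table (
    main_id   INT NOT NULL PRIMARY KEY,  -- plain int, no AUTO_INCREMENT
    timestamp DATETIME NOT NULL,
    payload   VARCHAR(255)
);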
Copy all the data you want to archive into the archive table using a query like this:
$insert_query = "INSERT INTO archive_table
                 SELECT main_table.* FROM main_table
                 LEFT JOIN archive_table USING (main_id)
                 WHERE archive_table.main_id IS NULL
                 AND DATEDIFF(NOW(), main_table.timestamp) > 30";
The LEFT JOIN combined with the IS NULL check means that only those records not already in the archive table will be added. You could instead just let the unique indexes bounce off any duplicates (see the sketch below), but that means you will be processing many more records every time.
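If you did want to rely on the index bouncing duplicates, it would presumably just be something like this (same placeholder columns as above; INSERT IGNORE tells MySQL to skip any row that collides with the primary/unique key on main_id):

$insert_ignore_query = "INSERT IGNORE INTO archive_table
                        SELECT main_table.* FROM main_table
                        WHERE DATEDIFF(NOW(), main_table.timestamp) > 30";

Simpler, but every run re-reads and re-rejects everything already archived, which is the extra processing I mentioned.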
You then use an inner join to delete records in the main table that have been copied to the archive table.
$delete_query = "DELETE main_table.* FROM main_table
                 INNER JOIN archive_table USING (main_id)";
That will delete any records in the main table that are in the archive table.
For full-on data integrity checking you can match every column in the main table to its corresponding column in the archive table and only delete records that are identical. You would then have to check for any records that were only partial matches. That would be the boilerplate method, as it will trap any errors or corruption that occurred when the data was copied to the archive table.
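A rough sketch of that stricter delete, again using the placeholder columns from above (the <=> null-safe comparison is there so NULLs still count as a match; you'd list every non-key column of your real table):

$strict_delete_query = "DELETE main_table.* FROM main_table
                        INNER JOIN archive_table USING (main_id)
                        WHERE main_table.timestamp <=> archive_table.timestamp
                        AND main_table.payload <=> archive_table.payload";

Anything old enough to have been archived but still sitting in the main table after that run would be a partial match to investigate by hand.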