Hi All,
I inherited a crude system from a client who receives leads including name, address, and email and phone #. Data is coming in by API pretty heavy, from about 15 clients, 24 hrs/day. Each client/project has a separate table. Each table has the clients project/name like jimbob, jaguar, mentor, etc.. Data is inserted in to the tables based on the clientid field passed.
Almost no error checking is done on the data.
Reason for multiple tables was he was getting locking up trying to do it in one large table.
The data is highly duplicate prone, and there are more than 3 million records in some tables and growing by the second.
My strategy right now is to NOT touch these tables, leave the system, and from them pull the data into a master table, de-duplicate them, and indicate instead that this record occured in this table at this id# through a join table.
You can imagine this table will be HUGE. It will have only very basic fields
fname
mname
lname
email
phone1
phone2
address1
address2
city
state
zip
zipplus4
country
and no more
MY QUESTION IS:
Of course I will normalize the city, state, and country. However, should I also normalize the firstname and middlename? (almost no data has middlename) you can imagine some of the queries that would be made and I just want to make sure that having an id for the first, middle, and maybe even last name is a good idea
And BTW, if any one knows any links to "working with reeeeally big tables for beginners", let me know 🙂
Samuel