8 Lessons Learned Importing Legacy Data into a New System
Apparently, it’s not uncommon for a business to create some tables in an Access database, pass it around to anyone who needs it, and basically treat it like a big Excel spreadsheet. Not uncommon, I think, because in the last year I’ve been tasked with importing 4 different Access databases from 2 different clients into a new schema for use by web applications. Shockingly, these clients found that passing around a copy of an Access DB that anyone could put any old data into didn’t lead to the kind of data integrity they were hoping for.
I figure if this sort of work has come up 4 times in a year, I’m probably not going to be the only person who runs into it. So while this is still fresh in my head, I’m going to dump out the lessons I’ve learned from my experiences. I’m far from having mastered this, though, so caveat emptor.
1. Data Integrity? What’s that?
So, you and your team have created this fancy new DB schema for your web app. You’ve normalized tables, set up foreign keys, created unique and not-null constraints, and your web app itself performs all sorts of validation before allowing you to write to the database. You’re feeling pretty good about yourself! Good, hold on to that feeling. You’ll need it.
You take a quick glance at the Access DB, and it doesn’t look so bad. Okay, sure, they don’t have any not-null constraints set, but for all the columns you care about, they look like they’re always filled in. Usually. Okay, maybe one or two rows are missing a value. And see, tblFoo has a BarId column, which clearly references tblBar. I mean, there’s no actual constraint forcing the ID to map to anything in tblBar, but surely they wouldn’t just put random ids in there, would they?
Yes, yes they would. And they’ll have a reason for it. Like, “Oh, yeah, usually that is the id for tblBar, but if it starts with “100”, it actually means it references a Bar that we don’t have in our database. It’s just on paper in a file cabinet in my office. We’re not allowed to input that data for security reasons. But you need to import that id, because we need to know! Oh, and did I mention if it ends in 999, it references tblFoobared? But only for records entered before 1998.”
Even if we’re not talking about blatantly bad data like broken foreign keys, you’ll run into plenty of smaller problems. They’ve got a column which contains two-letter state abbreviations, except 4 of the records contain full state names. Another 12 have 3-letter abbreviations. One appears to actually be a province of Canada. 2 are the name of a city. Another 17 just appear to contain typos. One even says “N/A”. And then there’s the field you defined as a bool in your new schema, “IsFooBarred?”. Except once you look in their Access DB, you see that it actually contains 3 values: “Y”, “N”, and “FileNotFound”.
So how do you solve this problem?
Well, first off, if you know you’re going to have to import legacy data into the fancy new schema for the app you’re building them, try to get your hands on their database as soon as possible. Nail down right away which tables and columns they need imported. Make sure you have a plan of where those columns are going to map to in your new schema. Take a quick glance at the data and see if any weird values stand out, such as the “FileNotFound” in the column you assumed was a bool.
Now is also a good time to find out if you’ve got all the data you need for columns that you’re setting not-null constraints for in your new database. In an example from my last project, we created a table which stored information about users, customer contacts, and such. There were plenty of fields that could be filled in, but at a minimum we required First Name, Last Name, and email address. The idea was that a person would be entered in the system once, and various other things would just point at that person’s record. It’d be a great way to avoid having to type in information for someone already entered in the system, and it would create a single point of change for updates.
We knew the data we were importing had user references in it, but we never actually opened up the database to see how those people were referenced. This created one of the biggest challenges of the import… often the user data consisted only of a name column. Maybe a first name. Maybe a last name. Maybe both. Maybe a COMPANY name. Maybe just a phone number or email address. All mixed in the same column. It was basically a text field where they typed in whatever they wanted. One referenced person showed up 7 times with slight variations because he had a difficult-to-spell name. Had we looked at the data ahead of time, maybe we could have come up with a better plan for how to incorporate this data into our system. Sadly, we had to settle for taking a best guess at what fields we actually had when parsing the data, and filling in junk data when we were missing it. We also had to relax a few of the not-null constraints we’d placed on the table in our new system.
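That “best guess” parsing looked roughly like the sketch below (in Python for brevity; the regexes, field names, and the placeholder value are my own inventions, not the actual project code):

```python
import re

# Rough heuristics for a free-form "name" column that might hold a
# person's name, a company, a phone number, or an email address.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_RE = re.compile(r"[\d\-\(\) ]{7,}")
PLACEHOLDER = "UNKNOWN"  # junk filler for required fields we simply don't have

def guess_person_fields(raw):
    raw = raw.strip()
    person = {"first_name": PLACEHOLDER, "last_name": PLACEHOLDER,
              "email": None, "phone": None}
    if EMAIL_RE.fullmatch(raw):
        person["email"] = raw
    elif PHONE_RE.fullmatch(raw):
        person["phone"] = raw
    else:
        parts = raw.split()
        if len(parts) >= 2:
            person["first_name"], person["last_name"] = parts[0], parts[-1]
        elif parts:
            person["last_name"] = parts[0]  # a lone token: assume surname
    return person

print(guess_person_fields("Doug Smith"))
print(guess_person_fields("doug@example.com"))
```

It’s ugly, and it will misfile company names as surnames, but with no structure in the source column, heuristics plus junk filler was the best we could do.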
2. Create an error-logging framework
Looking at the data ahead of time might help you catch some problems, but others are going to slip by. This is where the single biggest piece of advice I can give you comes in. Take some time right at the beginning, before you start writing scripts to import the data, and create yourself a nice error-logging framework. Chances are you’re going to want it to log the errors within a row and keep on moving with the import. I found it was best to just insert null values (when possible) when a particular value couldn’t be parsed, rather than skip the entire row as soon as an error was found. It helped me find as many errors as possible all at once. Otherwise, if I just stopped looking at a row after one error, I’d just hit the next one in line when I got the first one fixed. You’re going to want to find as many problems as you can as soon as you can. More on this below.
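The core of that approach is tiny: a helper that logs the failure, substitutes a null, and lets the rest of the row continue. A minimal sketch (row shape and field names invented for illustration):

```python
# Collect every parse failure instead of aborting on the first one.
errors = []  # (row_id, field, raw_value, message)

def try_parse(row_id, field, raw, parser):
    """Run parser; on failure, log the error and fall back to None."""
    try:
        return parser(raw)
    except (ValueError, TypeError) as exc:
        errors.append((row_id, field, raw, str(exc)))
        return None

def import_row(row):
    # Every field gets its own chance to fail; the row still imports.
    return {
        "id": row["ID"],
        "quantity": try_parse(row["ID"], "Quantity", row["Qty"], int),
        "price": try_parse(row["ID"], "Price", row["Price"], float),
    }

row = {"ID": 17, "Qty": "twelve", "Price": "3.50"}
imported = import_row(row)  # quantity is None, price parses, row survives
print(imported)
print(errors)
```

One pass over the data now surfaces every bad field in every row, instead of one error per run.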
Also, do your best to categorize and group errors. If you just get a 3000-line log describing each of the 3000 errors you encountered, that’s more useful than no log at all, to be sure. But better still is if, along with it, you get a breakdown of the unique errors. Yes, you probably want to know which 15 records had incorrect state abbreviations, in case the client wants the ids for quick reference when you’re talking about the problem. But if you can also get a report showing that those really come down to two values, “WIS” (which should have been “WI”) and “MH” (which is… a typo for NH? Who knows?!), it helps you better communicate the problems back to the client. They don’t want to look through your 3000-line error log. And neither do you, really. You want to be able to clearly and concisely summarize the types of errors that need to be fixed when talking to your client.
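Grouping like that is cheap to build on top of a flat error list. A sketch, using invented sample errors:

```python
from collections import Counter

# Sample flat error list: (row_id, field, bad_value, message).
errors = [
    (101, "State", "WIS", "not a two-letter state code"),
    (205, "State", "WIS", "not a two-letter state code"),
    (318, "State", "MH", "not a two-letter state code"),
    (442, "IsFooBarred", "FileNotFound", "not Y/N"),
]

# Group by (field, bad value) so 15 "WIS" rows become one line + a count.
summary = Counter((field, raw) for _, field, raw, _ in errors)

for (field, raw), count in summary.most_common():
    ids = [rid for rid, f, r, _ in errors if (f, r) == (field, raw)]
    print(f"{field} = {raw!r}: {count} record(s), ids {ids}")
```

The full list still exists for reference; the summary is what goes in the email to the client.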
3. Find the problems as early as possible
I mean, this is self-evident, right? Not just in software, but just about everything in life, if there’s a problem you want to find it as early as possible before it has a chance to breed. By the end of my last project, I’d started referring to import tasks as a game of “Let’s find the hidden requirements!” We had what we’d thought was a full list of requirements that our app had to meet. We planned our new DB schema, UI, and validation rules with that in mind. And then we’d get to the import tasks and find that a column value really meant something totally different in certain circumstances, or that a single field in the Access DB was actually used to store a comma-separated list, or that when we were told their Foo process only had 3 phases, their legacy DB actually recorded 4 phases. We then either had to push back and say that we couldn’t import certain bits of data because they didn’t fit into the new system, or more commonly we had to expand our scope to cover these new “hidden” requirements.
There’s another reason you want to find them early, here. Unless your client has a dedicated resource at your beck and call, who has total knowledge of the data and perfect recall, you’re going to have quite a bit of turnaround time on getting answers to your questions, or getting their input on plans on how to deal with data issues. Sure, you can probably just have your script change “WIS” to “WI” without the client’s input (though you should still notify them of the plan), but what about the state with abbreviation of “MH”? You’re going to need a response from them before you’re confident that your import is working correctly.
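Those script-level fixes can be as simple as a lookup table plus a rule for what counts as valid; anything unmapped comes back as null and goes on the question list for the client. A sketch (in Python for brevity; the mappings and column semantics are invented for illustration):

```python
# The handful of valid codes we accept; a real script would list all of them.
VALID_STATES = {"WI", "NH", "MN"}

# Corrections the client has already signed off on.
STATE_FIXES = {"WIS": "WI", "Wisconsin": "WI", "MINN": "MN"}

def clean_state(raw):
    """Return a two-letter code, or None if the value can't be trusted."""
    value = (raw or "").strip()
    value = STATE_FIXES.get(value, value).upper()
    return value if value in VALID_STATES else None

def clean_is_foobarred(raw):
    """Map the legacy Y/N column to a bool; anything else is unknown."""
    return {"Y": True, "N": False}.get((raw or "").strip().upper())

print(clean_state("WIS"))                  # WI
print(clean_state("MH"))                   # None -> needs a client answer
print(clean_is_foobarred("FileNotFound"))  # None, not a guessed bool
```

The “MH” case deliberately falls through to None: the script shouldn’t guess at things only the client can answer.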
So, Monday afternoon you’ll send Tim the client an email detailing a handful of the problems you’re currently dealing with. He’ll see it come in but decide it’s too much to get into before the end of the day. And since helping out with this newfangled system being built is just a side responsibility and Tim still has his real job to do, maybe he’ll remember to get back to you tomorrow. After he’s done with 6 hours of meetings. And then he’ll say, “MH… hrm. Well, I certainly didn’t put that in there. Let’s see… looks like Doug wrote those records. I’ll email him to see what it means”. And 2 days later, maybe Doug will get back to Tim. And maybe the day after that, Tim will forward that response to you. And if you’re really, really lucky, it might even answer your question.
I’m not getting down on the clients here… I’ve never actually run into a “Bottleneck Bill”. They’re busy carrying out their day-to-day jobs and probably aren’t allotted time for dealing with all your problems. The point is, it was not uncommon for me to go a week or more before getting a response to my question. And sometimes, a week later, the response was a request for more information. This back and forth can play hell with your schedule, so you really need to find as many problems as possible early on.
4. Try to make the client fix the data themselves
For data that needs to be corrected, try to get the client to fix it themselves. For the first two of the four databases I worked on, I made the mistake of fixing it myself after getting them to sign off on the fix. If it’s a simple fix like changing “WIS” to “WI”, maybe it’s not such a big deal. You don’t even have to modify the data… you can (and probably should) simply add a line to your script that fixes the problem. But in one case, I was manually correcting foreign key ids for about 3000 records. Worse yet, the client had actually sent me the correction in an Excel sheet. Why it didn’t occur to me, until after I’d made the fix, that they could just as easily have corrected the data themselves as sent me a spreadsheet of corrections still bothers me. That was the biggest single correction I made, but there were a lot of smaller ones. The time added up, and none of it had been included in the estimate of work to be done. I was there to import their data, not fix their existing database. I was basically reproducing their work and doing manual data entry. That’s a waste of time and money.
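If you do end up holding a spreadsheet of corrections anyway, at least treat it as data the script consumes on every run, rather than hand-applying the edits once. A sketch (column names invented; in practice the CSV would be exported from the client’s spreadsheet):

```python
import csv
import io

# Inline stand-in for the client's corrections file, exported to CSV.
corrections_csv = """legacy_id,bad_bar_id,good_bar_id
12,100345,345
13,100777,777
"""

# Index corrections by (record id, bad value) for exact, auditable matches.
corrections = {
    (int(r["legacy_id"]), r["bad_bar_id"]): r["good_bar_id"]
    for r in csv.DictReader(io.StringIO(corrections_csv))
}

def fix_bar_id(legacy_id, bar_id):
    """Return the corrected id if the client supplied one, else unchanged."""
    return corrections.get((legacy_id, bar_id), bar_id)

print(fix_bar_id(12, "100345"))  # 345
print(fix_bar_id(99, "100345"))  # 100345 (no correction on file)
```

This way the fix survives every re-run of the import, and the corrections file itself documents exactly what was changed and why.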
There’s a secondary benefit to having the client make the fix themselves. When you put them in the position of correcting it themselves, they are more likely to find a problem with your proposed solution, or spot a better solution. They’re the domain experts, after all.
5. Plan to import the same DB multiple times
In all four cases, I was dealing with a live database. I had a copy of whatever it looked like when I started, and it’s what I used to test against, but while I was working on the import, the legacy database was still growing, and would continue to grow until our app and the new DB went live. This means that new data integrity errors were potentially being introduced that I wouldn’t see until the “final” import happened. This is another area where your error logging will really help you out… you’ll be well positioned to quickly deal with any new import issues that come up. By writing the import script such that it looks for already-existing records and skips them, you don’t have to worry about trying to turn the import into an all-or-nothing affair. When they shut down their legacy DB so they can start using their web app, they’re probably going to want as much of their legacy data in there as soon as possible. If 3 records out of 30000 have problems, they are going to want to see the other 29997 while you’re figuring out what’s wrong with those 3. And if it takes you a few days to fix them, they don’t want to have to worry that any changes made to those 29997 records on the new system got blown away when you re-ran the import.
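The skip-existing check can be as simple as keying on the legacy id. Here’s a sketch using SQLite as a stand-in for the real target database (table and column names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (legacy_id INTEGER PRIMARY KEY, name TEXT)")

def import_rows(rows):
    """Insert rows keyed by legacy id; skip any that already exist."""
    imported = skipped = 0
    for legacy_id, name in rows:
        exists = conn.execute(
            "SELECT 1 FROM foo WHERE legacy_id = ?", (legacy_id,)).fetchone()
        if exists:
            skipped += 1  # leave the existing row (and any edits) alone
            continue
        conn.execute("INSERT INTO foo (legacy_id, name) VALUES (?, ?)",
                     (legacy_id, name))
        imported += 1
    conn.commit()
    return imported, skipped

print(import_rows([(1, "alpha"), (2, "beta")]))   # (2, 0) on the first run
conn.execute("UPDATE foo SET name = 'edited' WHERE legacy_id = 1")
print(import_rows([(1, "alpha"), (3, "gamma")]))  # (1, 1): id 1 untouched
```

Re-running the import now only fills gaps; it never clobbers records the client has already started editing in the new system.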
6. You may not be the one running the final import
On the first of the four imports, I was in control of the production database, and I was the one who would run the import script. It meant I was present, ready, and able to deal with any issues that came up.
The other three imports I wrote didn’t have that luxury. The client manages the production database themselves, and we do not have access to it. For that matter, it doesn’t actually exist yet… the imports I’ve written in the last few months won’t be run on the production servers until the client finishes setting up the server, a task that will probably be completed long after our team has delivered the final product and moved on to other projects. This is an area I wish we had given more consideration to before starting. Those fancy error logs I talked about above? Well, I had less fancy versions than I would have liked. Hopefully they’re good enough that if their DB admin runs into problems, they’ll get most of their data in there and get a report of the problems that should help them get the rest in after some corrections. I feel I could have done better here, though, and I keep fearing there will be some unexpected error condition that brings the whole import to a halt. Hopefully anyone reading this who is tasked with a data import takes heed and plans for how to handle this case.
7. Your ORM probably doesn’t bulk import well
One of the four imports was done via SQL scripts. No problems there. The other three were done by pushing the Access data into our entities, which were then saved to our database via NHibernate. This is when I discovered that NHibernate isn’t all that good at bulk import. Painfully bad, in fact, particularly when you’ve got references to references to references. For the first of the three imports, we had a hierarchy where one imported record referenced up to 100 entries from another table, each of which could reference up to about 100 records themselves. And I was trying to do the entire database in one transaction. No joy! Even doing one top-level record per transaction looked like it would take days for the import to run. I finally found the answer… use a stateless session. This meant a little more work to make sure everything hooked up correctly, and meant we didn’t get to automatically create our audit records with Envers, but it dropped the import time from days to a little under an hour. This was especially important given how many times I had to re-run the import as data fixes came in, so that I could validate they worked as expected.
The other two databases didn’t have as much nesting, so I was able to use a regular session (which saved me from having to manually worry about audit records). However, I was noticing these were still much slower than I had expected. The solution, I found, was to wrap each top-level item in a transaction, as I had for the previous database, but also to clear the session at the end of every transaction. This cut imports that had been taking hours to run down to about 20 minutes.
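The shape of the loop that finally worked can be sketched without NHibernate itself. Here FakeSession is just a stub standing in for the ORM session, and begin/commit are elided; the point is one transaction per top-level record, plus a session clear after each:

```python
class FakeSession:
    """Stub standing in for an ORM session with a first-level cache."""
    def __init__(self):
        self.identity_map = {}  # grows on every save unless cleared
        self.saved = 0

    def save(self, entity):
        self.identity_map[id(entity)] = entity
        self.saved += 1

    def clear(self):
        self.identity_map.clear()  # this is what keeps memory/flush cost flat

def import_all(session, top_level_records):
    for record in top_level_records:
        # one transaction per top-level record (begin/commit elided)
        session.save(record)
        for child in record.get("children", []):
            session.save(child)
        # Without this, each flush re-walks every entity saved so far,
        # and the import slows down as it goes.
        session.clear()

session = FakeSession()
records = [{"name": f"rec{i}", "children": [{} for _ in range(3)]}
           for i in range(10)]
import_all(session, records)
print(session.saved, len(session.identity_map))  # 40 0
```

With a real ORM session, that clear-per-transaction is what turned an hours-long import into minutes: the session never accumulates more than one top-level record’s worth of tracked entities.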
The point is, though, that ORMs aren’t really designed to import this much data at once. Think carefully about whether or not using the ORM is your best option.
8. Build a user-friendly tool for the client that checks the integrity of their database and reports back errors
I haven’t done this, because I didn’t think of it until writing this blog post, but what if the first thing you did was build a tool that codified your understanding of how the legacy DB is supposed to behave and gave it a snappy enough interface so that the client could run it and understand the errors that were being reported?
This solves so many of the above problems for you. Finding data-integrity errors early? Check! Error-logging framework? This is pretty much just a modularized version of that with a fancy UI! Reducing turnaround time on client communication? Yes, they can try fixing the data and re-run the tool to see if the error goes away! Finding errors as early as possible? Yes, because you’re going to build the integrity-checking tool before you ever write a line of code to actually populate your new schema! The legacy DB might be constantly updated while you’re working on the import? No problem… with this tool, the client can check whenever they want if their updated DB is showing errors.
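At its core, such a tool mostly reduces to a list of named rules applied to every row, feeding the same kind of grouped report from lesson 2. A sketch, with invented rules and sample rows:

```python
# Each rule codifies one assumption about how the legacy DB should behave.
RULES = [
    ("State must be a valid 2-letter code",
     lambda r: r.get("State") in {"WI", "NH", "MN"}),
    ("IsFooBarred must be Y or N",
     lambda r: r.get("IsFooBarred") in {"Y", "N"}),
]

def check(rows):
    """Return {rule name: [offending row ids]} for all rule violations."""
    report = {}
    for row in rows:
        for name, ok in RULES:
            if not ok(row):
                report.setdefault(name, []).append(row["ID"])
    return report

rows = [
    {"ID": 1, "State": "WI", "IsFooBarred": "Y"},
    {"ID": 2, "State": "WIS", "IsFooBarred": "FileNotFound"},
]
for name, ids in check(rows).items():
    print(f"{name}: rows {ids}")
```

Wrap that in a friendly UI and the client can re-run it against their live DB whenever they like; the rule list is exactly the understanding of the data you’d have to build for the import anyway.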
Of course, this requires client participation. If they aren’t going to run the thing, you still get some benefits just from having it yourself, but nowhere near as many.
And as an added bonus, the UI and logging modules seem like a “write-once” sort of thing that you could reuse for any future import projects. All that would be left is defining the rules you expect the DB to follow… work you’re going to have to do anyway during the import.
In a perfect world…
If the data were perfect, imports would be pretty easy. Aside from a one-time cost of figuring out some ORM issues, the code for mapping one table’s values to another tends to be simple. The complexity lies in identifying the integrity issues, making changes to your new schema and UI to support them, and working with the client to get a plan for how things should be fixed. If you’re expecting a lot of work of this nature, spending some upfront time creating a simple-but-friendly tool for checking data integrity could pay off many times over.
I wonder how one can accurately estimate how long a data import will take, given that the hardest part of the work is dealing with all of the hidden work that bad data creates. I have no answer. It seems that we should definitely plan our estimates around that uncertainty, though.