What’s Step 1 in a Cloud Migration? Data Cleansing

Kevin Kline

Published On: October 14, 2019

Categories: Cloud, Data Lineage, Documentation 1

When planning a successful cloud migration, several different things need to happen in a specific order. The first step—and possibly the most critical—is to make sure you have clean data to migrate. It’s the perfect time to take a good, hard look at your data and make some assessments. Why assess and cleanse your data? Because storage of data in the cloud has a variable cost. The more there is to store, the more it costs. But if you don’t need or can’t use portions of your data, then performing an assessment literally means you will save money.

Three Options for Your Data

When assessing your data to determine what you should move to the cloud, you have three options. First, you can purge your data. Though there is a common trend these days to retain everything, a reason that you would want to purge is that you’ll have less data in the cloud, thereby lowering your cloud hosting costs.

As an example, let’s say in your organization’s early years there were significant data quality issues—so the data from that time is literally junk and not worth retaining. On the other hand, let's say you have a financial application, and this application is required by law to have data available for the owners of that data—your customers—for seven years (a Sarbanes-Oxley rule). But now that the data has “aged out” and you've passed that legal responsibility, you might not want to carry that data forward. So, in that case, you could get rid of that data and save yourself the expense of moving and storing it in the cloud.


Three options for determining what to move to the cloud

Conversely, even though you might have old data, it might not be low quality and has likely increased over the years due to customer growth. But very old data seldom—if ever—changes. Your second option is to use that data in a read-only manner for trending and reporting. Trending means you can archive that data and keep it available for read-only purposes. You can also use this option to perform business intelligence (BI), apply AI or machine learning, or do predictive analytics. When you're using inexpensive read-only data storage, you can employ Azure features like a Stretch Database, which uses a very inexpensive tier of storage to put that data on ice. You can query the archival data when you need it to look at trends by year or trends by months across many years, and it’s a low-cost way to keep that data available.

Thirdly, after doing this assessment phase, you’ll know exactly which data you intend to migrate to the cloud. Keep in mind that you're going to have to employ a data verification and validation period or “check phase” as you migrate data from on-premises to the cloud. Why not go straight into production with that data since it’s coming from a production system? We've seen people migrate without a check phase, and then they have customers calling back and saying, "Hey, my reports aren't adding up the way they used to. What is going on?" Basically, the data did not move cleanly for some reason. So, you’ll need to add this validation phase as well after you move the essential data. (And if you’re looking for a convenient solution for validating data, check out SentryOne Test.)

Data Assessment and Database Schema Quality

A database migration is the perfect time to perform a few additional quality checks, such as reviewing the database schema design for any of your internally designed databases. Since the database and application are taking a natural pause, it’s a good time to refactor any of significant findings that can impede solid performance.

When I'm doing a code review of a database, I like to confirm that:

  1. There are consistent naming standards for objects like tables, views, triggers, stored procedures, and User-Defined Functions (UDFs).
  2. Oversized columns aren’t in use. This situation occurs when you use a defined-length data type, like CHAR or NCHAR, of a much greater length than anything stored in that column, say CHAR(500) when the largest value in that column is only 32 characters long.
  3. There are no GUIDs as clustered indexes, except if the table is very small and is not growing. Global Unique Identifiers (GUIDs) as clustered primary keys are a terrible thing. They cause all kinds of performance problems. Now GUIDs as a column or as an alternative key are fine, but you don't want those to be your primary key or your clustered index.
  4. There aren’t any data types defined at MAX size, such as NVARCHAR(MAX). Any column defined with a MAX size requires a lot of special handling by the query optimizer and frequently does not perform as well as the same data type set at a very large size, like NVARCHAR(8000). Consequently, this is a good time to make sure you really need that data type size and weren’t just taking a shortcut. If you need it, then feel free to keep it. But if you don’t, it will save you some performance issues down the road to convert to the same data type with a smaller size.
  5. There aren’t any implicit conversions that can indicate severe code issues between the database and your code. Those should be fixed during your cloud migration.

To help simplify this for you, I do have a set of scripts that detect these issues. If you're interested in those scripts, contact me (see my contact information at the end of this blog) and I’ll send it to you. You can also go to www.SentryOne.com/KKline, where I have a script called Vendor Red Flags that includes all of these scripts that look at the quality of the code.

Another thing you need to look for is deprecated data types and other deprecated usages. There’s a PerfMon counter you can use to see how many and what types of deprecated usages are occurring on your on-premises SQL Server. One of which to watch out for is TEXT columns and IMAGE columns. In the next version of SQL Server, these data types are going to be gone and should be replaced with VARCHAR(MAX)/NVARCHAR(MAX) and VARBINARY(MAX), respectively.

The overall takeaway is simply to spend time looking at your data. I would suggest talking to the subject matter experts (SMEs) who determined business use case of the database and ask them, "What identifies junk data as opposed to useful data?" Once you figure those things out, it's going to make a big difference as to how much you move into the cloud. And since you're responsible for how much storage you consume and how much data movement happens, it will feel good knowing that you’re going to save money.

Reviewing and Refactoring Code

So now let's talk about the code: Reviewing and—if you can—refactoring the application code and the database code. We have two different high levels at which our code is written: application and database. They each require different skill sets. Your app developers are probably going to be C# coders (or something along those lines) and your database professionals are going to be T-SQL people. That said, there are some things that you need to check to make sure they aren't going to muck anything up when you move to the cloud.

3Essential_cloud_graphic_APP_Database blog

Reviewing and refactoring the application code and the database code

If you and your team are heavy users of Object Relational Mapping (ORM) tools, like NHibernate or Entity Framework, keep in mind that these tools will speed up coding in certain situations. However, they can also introduce problems. For example, if you use any of the ORMs, you're probably going to have a lot more implicit conversion problems than if you coded the stuff yourself. Also, if you coded for on-premises and you have certain kinds of query timeout times, then you might have even more in the cloud. You’ll need to update your code to be more resilient to such timeouts. So, check for those kinds of assumptions to ensure they don't shoot you in the foot once you’ve deployed your code to the cloud.

I’ve also seen lot of situations where people became enamored with a particular kind of object inside a database and used it all the time. The “Oooh shiny!” phenomena, as I’ve heard it called. For example, people regularly and excessively create temporary tables once they learn about that feature, or have an unbalanced use of views, Common Table Expressions (CTEs), cursors, and triggers. Some of those features are useful to improve code logic, but very few of them are explicitly helpful for performance. In the cloud, they might yield very different kinds of behaviors than what you're used to on-premises. As before, be ready to test those assumptions.

Finally, don't forget about the other parts of the system that are not directly application code. For example, I see a lot of people who have SSRS reports, or SSAS cubes built on top of their relational database, and they forget about SQL Agent jobs. They spend a lot of time and energy planning to move to Azure SQL Database, not even realizing that Azure SQL Database does not have SQL Agent. You must use Azure SQL Managed Instance if you're going to move to the cloud to get SQL Agent.

Data Documentation is the Answer

What does all this mean? It means you've got to go through and do some serious documentation of all your data sources, metadata, and code. I know that most of us are not crazy about writing documentation, but a data dictionary is essential at this stage of your process. Imagine that you have a data feed coming in using SSIS to load a daily batch of .CSV files, and it's pushing that data into a staging database, and then finally into production. What happens if you need to change one of the data types or modify them at the very first stage? What will that impact downstream? To answer those questions, you need to conduct data lineage analysis, and to do that by hand is very, very difficult and time consuming.

Data lineage analysis takes a lot of work if you try to do it on your own. You must document each and every table (possibly in multiple databases), document all their columns, and then look at all the stored procedures, UDFs, triggers, and views. What if you have views built on top of views and other views built on top of those? (I’ve seen views nested almost a dozen times, sometimes recursively!) Well, you've got to figure out all those cascading relationships. It's slow, difficult, and painstaking.

There are some free tools that can help. One is the Microsoft Assessment and Planning (MAP) toolkit. It will take an inventory of your SQL Server databases and give you some information on those instances of SQL Server and the databases. It’s a good start, but you’ll have to do more work after using MAP.

Then there's another one that I like even better: SQL Power Doc. It's a PowerShell library written by Kendal Van Dyke (@SQLDBA), formerly a Microsoft MVP who is now a Microsoft Premier Field Engineer (PFE). SQL Power Doc library collects information about your SQL Servers and builds a helpful data dictionary for you. However, you’ll still have to build all the linkages between tables and the objects that depend upon them (e.g. views, UDFs, and stored procedures) after you collect that information.

Make the Process a Painless One with SentryOne Document

After reading all the way down to this point, you can see that this first and very important step of documenting your data ahead of a cloud migration can be a very cumbersome and tiresome process. But it doesn’t have to be. SentryOne can help simplify—and automate it—with SentryOne Document, a solution that helps you efficiently and accurately document your entire Microsoft Data Platform environment.

SentryOne Document 10-7 Social-1

Visually analyze data lineage with SentryOne Document

SentryOne Document is a powerful accelerator to a successful cloud migration scenario. You need to know everything that's touching your databases, not just the applications that are using them. SentryOne Document features a set-up process that is very simple. You’ll add the different databases you want to migrate, provide connection information for those various resources, and then scan those resources. As a result of scanning, you can create documentation. You can even make your document a living document by scheduling regular scans so that you can see and update changes over time.

With SentryOne Document, you also receive documentation of your SSIS packages, SSAS cubes, and all the various pieces they touch, such as Power BI data visualizations. But the real power of SentryOne Document comes from exploring data lineages and the linkage between database objects. Understanding lineage makes figuring out how the database environment will be impacted by a migration much, much easier than having to go through the database schema by hand, mainly because the lineage analysis includes more than just the SQL references themselves.

If you’re seriously considering a cloud migration, I encourage you to take this solution for a spin. Download a free trial today and see for yourself how SentryOne Document can help you turn a cumbersome data documentation battle into a seamless process.

Also, for a full, comprehensive guide on migrating to the cloud, be sure to check out the SentryOne Cloud Migration Guides for AWS and Azure.

Kevin (@kekline) serves as Principal Program Manager at SentryOne. He is a founder and former president of PASS and the author of popular IT books like SQL in a Nutshell. Kevin is a renowned database expert, software industry veteran, Microsoft SQL Server MVP, and long-time blogger at SentryOne. As a noted leader in the SQL Server community, Kevin blogs about Microsoft Data Platform features and best practices, SQL Server trends, and professional development for data professionals.