How to Manage an ETL Data Migration Into the Cloud

As organizations move their workloads into the public cloud, their environments gradually evolve. Initially, an organization tends to aim for a 1:1 migration of resources, typically in the form of virtual machines (VMs). Although this approach allows for an easier data migration path, it’s more costly and offers less flexibility than taking advantage of some of the serverless and Platform-as-a-Service (PaaS) offerings in the public cloud.

Whether they’re database as a service (DBaaS), serverless functions, or extract, transform, and load (ETL) services, the flexibility and scale of these offerings make them attractive as cloud maturity increases.

Moving ETL to a service such as Azure Data Factory or AWS Glue can be challenging, especially for developers coming from an on-premises background. These projects often face several challenges, including unfamiliar data sources, differences in cloud security, and networking details.

A Wider Variety of Data Sources

Traditionally, most organizations have had online transaction processing (OLTP) systems, like point of sale (PoS) or manufacturing execution systems (MES), for running their actual operational processes. Those systems held the records of that activity, and that data was extracted and loaded into a separate database called a data warehouse. Typically, this data would be loaded nightly or at a slightly more frequent interval, such as every four hours. The data warehouse is a database with a structure optimized for reporting, holding data that has been transformed for consistent reporting.
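To make the pattern concrete, here’s a minimal sketch of a nightly warehouse load in Python. It assumes both the operational system and the warehouse are SQL Server databases; the connection strings, table names, and columns are hypothetical placeholders, not a prescribed design.

```python
from datetime import datetime, timedelta

import pandas as pd
import pyodbc
from sqlalchemy import create_engine

# Hypothetical connection details for the OLTP source and the warehouse.
oltp = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=pos-server;Database=pos_db;Trusted_Connection=yes"
)
warehouse = create_engine(
    "mssql+pyodbc://dw-server/dw_db?driver=ODBC+Driver+18+for+SQL+Server&trusted_connection=yes"
)

# Pull only rows changed since the last nightly run.
since = datetime.utcnow() - timedelta(days=1)
sales = pd.read_sql(
    "SELECT * FROM dbo.SalesTransactions WHERE ModifiedDate >= ?", oltp, params=[since]
)

# A light transformation, then append into the reporting structure.
sales["LoadDate"] = datetime.utcnow()
sales.to_sql("FactSales", warehouse, schema="dw", if_exists="append", index=False)
```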

As systems have evolved, there’s been demand from both business and technology organizations to have closer to real-time reporting. This is a basic idea but is complex to implement in practice. You could just report on the operational system, but you risk blocking key operational processes, and the data structure of those systems isn’t typically designed for reporting. On the other hand, loading data into a data warehouse in real time has many technical challenges. Relational databases can have issues ingesting streaming data. Modern data warehouse systems can use a combination of file or object storage and a relational engine to traverse the files. Examples of these solutions include Microsoft Azure Synapse, Amazon Redshift, and Snowflake.

One of the benefits of the cloud is the ease of using a wider variety of data stores for a given application. You may keep some data in a relational database, some in a data lake, and other data in a general-purpose cloud object store, such as Amazon S3 or Azure Blob Storage.

Each of these data sources has its own connectivity and security configurations, different APIs, and unique customization options for optimum performance. Although having a variety of data sources isn’t new in the ETL space, their ease of implementation and sheer variety increase in the cloud.

As the variety of data increases, you may also run into more data governance challenges. Especially when dealing with less controlled data sources, such as data lakes, you may need to implement more data quality checks and do so earlier in your ETL process. In some cases, you might move to an extract, load, and transform (ELT) model for data stored in object stores, as it can be more efficient.
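As a sketch of what checking quality early might look like, the snippet below validates a file landing in a data lake before it can propagate downstream. It uses pandas, and the column names and rules are hypothetical examples you’d replace with your own.

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Reject data that would break downstream reporting."""
    problems = []
    if df["order_id"].isna().any():
        problems.append("null order_id values")
    if df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if (df["order_total"] < 0).any():
        problems.append("negative order totals")
    if problems:
        # Fail fast so bad files never reach the warehouse.
        raise ValueError("Data quality check failed: " + "; ".join(problems))
    return df

raw = pd.read_parquet("lake/raw/orders/2023-01-15.parquet")  # hypothetical path
clean = validate_orders(raw)
```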

Security Is Different in the Cloud

Anyone who has ever written a SQL Server Integration Services package is likely familiar with the Kerberos double hop problem. This occurs when a package attempts to pass a user’s credentials from one server to another and Active Directory delegation isn’t configured correctly, causing login errors. Although public cloud security has moved from Kerberos to newer protocols designed with the web in mind, such as OAuth and SAML, security in the cloud is still a challenge.

Looking specifically at Azure Data Factory, which uses Azure Active Directory for identity and authorization, you need to create a managed identity for your Data Factory. This is somewhat similar to the concept of service accounts in Windows: your Data Factory gets an app registration in your Azure Active Directory tenant, and you then grant that identity the privileges it needs to access data sources and perform its tasks. Azure Active Directory is different from traditional Active Directory. If you’re using Azure Data Factory as your ETL tool, this isn’t an issue, but if you’re using an older ETL tool to connect to Azure SQL Database or Synapse Analytics, you’ll want to ensure it can connect using Azure AD or fall back to SQL authentication.
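To show what Azure AD authentication looks like from client code, here’s a small sketch using the azure-identity library, whose DefaultAzureCredential picks up a managed identity automatically when running inside Azure, and hands the token to pyodbc. The server and database names are placeholders.

```python
import struct

import pyodbc
from azure.identity import DefaultAzureCredential

# Acquire an Azure AD token for Azure SQL (uses a managed identity in Azure,
# or your developer credentials locally).
credential = DefaultAzureCredential()
raw_token = credential.get_token("https://database.windows.net/.default").token

# Pack the token in the format the ODBC driver expects.
token_bytes = raw_token.encode("utf-16-le")
token_struct = struct.pack(f"<I{len(token_bytes)}s", len(token_bytes), token_bytes)
SQL_COPT_SS_ACCESS_TOKEN = 1256  # connection attribute defined by the driver

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myserver.database.windows.net;Database=mydb",  # placeholder names
    attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct},
)
```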

Beyond those basics, many organizations have multiple subscriptions and, in some cases, multiple tenants, as shown in Figure 1. Subscriptions act as a security boundary, but still share the same identity provider, so your Data Factory must cross those boundaries. In the case of multiple tenants, your options are few, and you may need to rearchitect your design.

Figure 1: Relationships between tenants, subscriptions, and resources with Azure Active Directory

Networking Is Hard

One of our consultants who works mostly in business intelligence jokes that since the public cloud became a thing, she’s become a network engineer. While networking within cloud VMs is straightforward, as you move into serverless options such as Glue and ADF, or you move data back and forth from on-premises to the cloud, you might face networking challenges. Many of these services rely on public endpoints, which can raise security concerns amongst auditors and security teams.

You need to understand your security requirements and whether each of your data sources sits on a public or private network. While many cloud services are exposed on public endpoints by default, business rules might force you to enable private networking, which can require modifying firewall rules to allow your traffic to flow. Both AWS and Azure can isolate many of their PaaS resources, but different services support different approaches to networking. Azure has moved toward standardizing its network isolation with a service called Private Link, much as AWS does with PrivateLink inside a Virtual Private Cloud (VPC).

Picking the Right ETL Tools and ETL Processes for Your Business

It’s important to understand the automation technology for your cloud ETL process. Some solutions, like Azure Data Factory, have their own scheduling in the form of time-based triggers. Other processes may be event driven, such as when a row lands in a database system or a new file is created in cloud storage. Each cloud provider has a different solution for automating processes. Beyond automation, you need to understand the mechanism for alerting in the event of a job failure and how to configure your retry logic.
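Most managed ETL services build retries and alerting in, but as a minimal sketch of the logic involved, the snippet below wraps a pipeline step with retries and fires an alert when the job finally fails. The run function and the webhook endpoint are hypothetical stand-ins for your own job and alerting channel.

```python
import time

import requests

ALERT_WEBHOOK = "https://example.com/alerts"  # hypothetical alerting endpoint

def run_with_retries(job, attempts: int = 3, backoff_seconds: int = 60) -> None:
    """Run a pipeline step, retrying with backoff and alerting on final failure."""
    for attempt in range(1, attempts + 1):
        try:
            job()
            return
        except Exception as exc:
            if attempt == attempts:
                # Out of retries: raise the alarm before failing the run.
                requests.post(
                    ALERT_WEBHOOK,
                    json={"job": job.__name__, "error": str(exc)},
                    timeout=10,
                )
                raise
            time.sleep(backoff_seconds * attempt)  # back off between attempts
```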

Being able to consume a wide variety of source formats is a tenet of any modern ETL process. Your tools need to handle more than just relational databases, whether that means consuming data from a Salesforce API, pulling files from cloud storage, or reading from newer data stores like MongoDB or Snowflake. It’s also important to understand which data sources your company needs to integrate with, and then to think beyond that; you may have to consume data from business partners who run other systems. Having an ETL solution that works with a broad variety of data sources, new and old, provides the most flexibility for your BI solutions.
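For instance, ingesting from a REST API rather than a database can be as simple as the sketch below, which flattens a JSON response into a tabular shape for loading. The endpoint, token, and field names are hypothetical.

```python
import pandas as pd
import requests

# Pull records from a hypothetical partner API.
resp = requests.get(
    "https://api.example.com/v1/accounts",
    headers={"Authorization": "Bearer <token>"},  # placeholder credential
    timeout=30,
)
resp.raise_for_status()

# Flatten nested JSON into rows and columns ready for loading.
accounts = pd.json_normalize(resp.json()["records"])  # "records" key is assumed
```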

Likewise, your solution needs to support a wide variety of target systems, or destinations. While Hadoop has faded as a pure technology in recent years, as mentioned above, many cloud-based data warehouse systems rely on object storage like Azure Blob Storage or Amazon S3. The most common pattern is to write Parquet files, a columnar format that makes querying data at scale efficient. However, you still need to be able to communicate with relational databases and understand those workflows.
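Landing Parquet in object storage is a short operation from code. Here’s a sketch using pandas with pyarrow; writing to Azure Blob Storage via an abfs path assumes the adlfs package is installed, and the account, container, and key are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"sale_id": [1, 2], "amount": [19.99, 5.25]})

# Write a columnar Parquet file directly to Azure Blob Storage.
df.to_parquet(
    "abfs://container@myaccount.dfs.core.windows.net/sales/2023/sales.parquet",
    storage_options={"account_key": "<key>"},  # or authenticate via a managed identity
)
```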

If you’re new to cloud ETL, you may face challenges getting started. Whether you’re migrating from legacy SQL Server Integration Services (SSIS) or building a new environment, using a tool such as SolarWinds Task Factory can help you accelerate development.

Get started with a free trial of Task Factory and see for yourself what it can do.
