Configuring a PolyBase Cluster

There have been a few blog posts lately on how to configure PolyBase in SQL Server 2016. I'm not going to go into a lot of details with the install, but I do want to clear up some points so that your configuration will work consistently if you decide to utilize the scale-out group functionality. Honestly, why wouldn't you use the scale-out functionality?

Initial setup requires that you select the PolyBase components.

PolyBase Cluster Config 1

When you get to the PolyBase configuration screen, you have the option to run this as a standalone instance, or to use the SQL Server as part of a scale-out group. Personally, I don't think that anyone should ever choose the standalone instance option. You can always run a SQL Server configured for a scale-out group as a standalone instance, BUT you can't change (at least not today) a SQL Server configured as a standalone PolyBase instance to run as part of a PolyBase cluster once you have completed the install.

PolyBase Cluster Config 2

It is important to note the IP port range that is assigned by default. If this is going to be a production installation, it is recommended that you validate that these ports are acceptable with the network engineering team.

From here the only other piece of PolyBase configuration required during the install is to configure the services. It is recommended that you use a Windows service account, but that isn't necessary IF you are running this as a standalone instance. If you are running this as part of a cluster, then the Windows service account is required.

Once you have everything installed, there are still a couple of things that you need to do in order to get PolyBase fully configured.

Before going too far, I would recommend that you run the following to verify that PolyBase is installed;

-- Check to verify that PolyBase is installed
SELECT SERVERPROPERTY(N'IsPolybaseInstalled') AS IsPolybaseInstalled;
 

Since you will be connecting to an external data source, you will also need to run a few more statements and restart the services:

-- Configure external tables on Hortonworks 2.1/2.2/2.3 on Linux,

-- and Azure blob storage

EXEC sys.sp_configure

     @configname = N'hadoop connectivity',

     @configvalue = 7; -- Azure Blob Storage

GO

RECONFIGURE; -- This requires a restart of SQL and PolyBase services
 

Microsoft provides a full list of "hadoop connectivity" values that are available today. In this example, I utilized the value for Azure Blob Storage, as that was the easiest external data source to configure, and provided all the functionality that I needed for my initial testing.

I'm not going to cover the details of PolyBase exports in this post, but if you desire to write to the external data source, then you will need to run this step:

-- Configure PolyBase to allow for export of data to your external data source

EXEC sys.sp_configure @configname = N'allow polybase export', @configvalue = 1;

GO

RECONFIGURE;
 

Configuring PolyBase Cluster

Before we walk through the steps to configure the PolyBase cluster, it is important to understand some basic concepts.

  1. In a PolyBase cluster there is one head node, that can be associated with 0 to many compute nodes. When you have no compute nodes specified, the head node will also act as a compute node.
  2. At present a head node needs to be Enterprise version of SQL Server. The compute nodes can be Standard Edition.
  3. The head node and the compute nodes can be multipurpose, meaning that they do not need to be solely allocated for PolyBase. So, consider using those underutilized development SQL Servers, or your standby Always-On Availability Groups instances as part of the PolyBase cluster. Also, since Microsoft has made it really easy to move nodes in and out of the cluster, there is the possibility of moving a server into the cluster during periods of low activity, and then moving it out when you know it is going to be busy again.
  4. It is important to note that the PolyBase services on the head node, and the PolyBase data movement service on the compute node, will consume memory and CPU. So, you need to ensure that they do not interfere with your SQL Server performance.
  5. The steps to configure the external data source connectivity will need to be run on all nodes prior to joining them to the cluster.

There are a couple of ways to configure the PolyBase cluster. You can use SSMS by right-clicking on the PolyBase folder and selecting "Configure PolyBase Cluster." After you are there it is pretty self-explanatory on how to add a compute node. This operation would be performed on the head node.

PolyBase Scale-Out Configuration

If you want to stay old-school, then you can use the stored procedures provided with PolyBase to connect a compute node to the head node. These statements need to be run on the compute node.

-- Enter head node details:

-- head node machine name, head node dms control channel port, head node sql server name

EXEC sys.sp_polybase_join_group

    @head_node_address = N'YourHeadNode',

    @dms_control_channel_port = 16450,

    @head_node_sql_server_instance_name = N'MSSQLSERVER';
 

When you run this statement, the compute node will now be associated with the head node, and can be used for distributed queries to the external data source that you configured previously. Additionally, you will notice that the "SQL Server PolyBase Engine" service will be changed to disabled and stopped on the compute node(s). This allows for the head node that you defined in the statement to be the engine for the cluster. The Data Movement service will continue to run on all nodes.

At this point, you should have a fully functional PolyBase cluster. You can use the following query on either the head or compute node to see the different nodes associated with the cluster.

SELECT compute_node_id, type, name, address

  FROM sys.dm_exec_compute_nodes;
 

If you need to remove a node from the cluster, you can do that with the "Configure PolyBase Cluster" utility, or run;

EXEC sys.sp_polybase_leave_group;
 

Note that you cannot remove the head node from the cluster, so this command is only available on the compute nodes.

PolyBase is a great addition to SQL Server in environments that use external data sources like Hadoop or Azure Blob storage. Having a cluster configuration provides scale-out performance that will allow for optimal performance of queries that are hitting those external data sources. I'll share some details on how to see query performance details in another post, so look for that soon.

Call To Action

PolyBase allows for access to external data through standard T-SQL functionality that both developers and DBAs are comfortable with. Start to familiarize yourself with the power that this brings to SQL Server by configuring a PolyBase cluster in your environment, and connect to some sample external data sources. Then start to consider how you can use this functionality to improve your data platform architecture, and potentially offload work from SQL Server to other data platforms.

Additional Reading


Comments