Data Movement in Australia using Azure Data Factory

UPDATE: This slipped a little under the radar, but the Data Movement Service was announced as being available in Australia on March 8th, so just go ahead and use it as intended!

One of the key activities in enabling data analytics over large-scale datasets is moving data from one location to another for further processing.

Azure Data Factory has a Copy activity that allows you to specify a source and sink for data to be moved. The (nearly) globally available Data Movement Service performs the move based on the location of the data sink.
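
As an illustrative sketch (the dataset names are just placeholders, and the policy and scheduling settings are stripped out; the full pipeline JSON appears later in this post), a Copy activity boils down to a source type, a sink type, and the input and output datasets:

{
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "BlobSource"
        },
        "sink": {
            "type": "BlobSink"
        }
    },
    "inputs": [
        {
            "name": "SourceDataset"
        }
    ],
    "outputs": [
        {
            "name": "SinkDataset"
        }
    ],
    "name": "CopyExample"
}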

So if you have data in East US and need to move it to North Europe, the Data Movement Service in North Europe will perform the move, no matter where your Data Factory is located.

The one exception to this is Australia. Currently there is no Data Movement Service in Australia, so if you try to move data from Australia East to Australia East, for instance, the Copy activity will fail.

Since data sovereignty is a real issue for Australian businesses, a solution is required.

Any time one of the data locations is on-premises (which, as far as Data Factory is concerned, includes the file system of an Azure VM), the Data Management Gateway is used to move the data, irrespective of where the data sink is.

As a test, I wondered whether this behaviour could be used to work around the lack of a Data Movement Service in Australia.

Environment

In order to test the solution the following is required:

  1. Source Storage Account containing a simple CSV file (make sure this file exists before deploying the Pipeline; a sample is shown after this list)
  2. Sink Storage Account (can be the same account with a different container)
  3. Azure VM to act as intermediate storage
  4. Simple Azure Data Factory to perform the data movement
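
The CSV file itself only needs a couple of columns to line up with the dataset structure shown later; something like the following sample (the names are obviously just made-up test data) is enough:

John,Smith
Jane,Doe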

Setup

To copy data via an Azure VM, the VM needs to be running the Data Management Gateway. Once that is installed, the rest of the setup is straightforward.

Storage Account

The simplest way of testing the movement is to create a single storage account in one of the Australia datacentres, in our case Australia Southeast. NOTE: In order to use the Australia region you need an Azure subscription registered to an Australian credit card.
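
If you would rather script the account creation than click through the portal, a minimal ARM template along these lines should work (the account name is just a placeholder I made up, and Standard_LRS is an assumption; any replication option will do):

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "Microsoft.Storage/storageAccounts",
            "apiVersion": "2015-06-15",
            "name": "adfmovetest",
            "location": "australiasoutheast",
            "properties": {
                "accountType": "Standard_LRS"
            }
        }
    ]
}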

The storage account created has 2 containers, one for input and one for output.

ADF-Move-StorSetup

The idea here is to move the file between the two containers. Doing this directly leads to an error, so a staging folder on an Azure VM is required.

Once the VM is running and has the Data Management Gateway installed, the Data Factory can be created and tested.

Linked Services

We need a linked service to represent the storage account:

{
    "name": "StorageLinkedService",
    "properties": {
        "description": "",
        "hubName": "adfmovetest_hub",
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=[ACCOUNT_NAME];AccountKey=[ACCOUNT_KEY]"
        }
    }
}

And one for the file system on the Azure VM, accessed via the Data Management Gateway:

{
    "name": "OnPremisesFileServerLinkedService",
    "properties": {
        "description": "",
        "hubName": "adfmovetest_hub",
        "type": "OnPremisesFileServer",
        "typeProperties": {
            "host": "localhost",
            "gatewayName": "testgateway",
            "userId": "",
            "password": "",
            "encryptedCredential": "[REMOVED]"
        }
    }
}

Datasets

The input and output blob datasets are identical except for the folderPath, which points at the respective container (input shown below; a sketch of the output dataset follows it):

{
    "name": "InputBlob",
    "properties": {
        "structure": [
            {
                "name": "firstname",
                "type": "String"
            },
            {
                "name": "lastname",
                "type": "String"
            }
        ],
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "fileName": "people.csv",
            "folderPath": "input",
            "format": {
                "type": "TextFormat"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}

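The output dataset, referenced as OutputBlob by the pipeline below, would look something like the following sketch: the folderPath points at the output container, and external is set to false because this dataset is produced by the pipeline rather than supplied to it.

{
    "name": "OutputBlob",
    "properties": {
        "structure": [
            {
                "name": "firstname",
                "type": "String"
            },
            {
                "name": "lastname",
                "type": "String"
            }
        ],
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "fileName": "people.csv",
            "folderPath": "output",
            "format": {
                "type": "TextFormat"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": false,
        "policy": {}
    }
}
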
Likewise, the dataset for the on-premises (Azure VM) side uses the same structure, but with the FileShare type, the OnPremisesFileServerLinkedService, and a local folderPath:

{
    "name": "Staging",
    "properties": {
        "structure": [
            {
                "name": "firstname",
                "type": "String"
            },
            {
                "name": "lastname",
                "type": "String"
            }
        ],
        "published": false,
        "type": "FileShare",
        "linkedServiceName": "OnPremisesFileServerLinkedService",
        "typeProperties": {
            "fileName": "people.csv",
            "folderPath": "c:\\staging"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": false,
        "policy": {}
    }
}

Pipeline

To move the data, we need to create a simple pipeline that contains 2 Copy activities:

  1. Copy dataset from “input” container to staging folder on premises
  2. Copy dataset from staging folder on premises to “output” container

{
    "name": "TestAustraliaMove",
    "properties": {
        "description": "Test if you can use a staging folder to do a file move in Australia",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "FileSystemSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "InputBlob"
                    }
                ],
                "outputs": [
                    {
                        "name": "Staging"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "BlobToFile",
                "description": ""
            },
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "FileSystemSource"
                    },
                    "sink": {
                        "type": "BlobSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "Staging"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputBlob"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "FileToBlob",
                "description": ""
            }
        ],
        "start": "2016-01-18T23:59:00Z",
        "end": "2016-01-19T23:59:59Z",
        "isPaused": false,
        "hubName": "adfmovetest_hub",
        "pipelineMode": "Scheduled"
    }
}

You can see there are 2 activities:

  1. BlobToFile has a BlobSource and FileSystemSink
  2. FileToBlob has a FileSystemSource and BlobSink

Once the pipeline is deployed, it will execute based on the start and end dates specified.

Result

When all the activities have run, we would expect all datasets in the Data Factory diagram to show as green, indicating success.

ADF-Move-Pipeline

As we can see, all the lights are green.

If we look at the staging folder on the Azure VM we can see the file:

ADF-Move-LocalDiskFile

Likewise when we look at the output container in our storage account we can see the file has been moved:

ADF-Move-BlobOutput

Conclusion

The Data Movement Service is currently unavailable in the Australia regions, which limits the ability to move data between platform services within a region where data sovereignty is a real issue.

In order to achieve data movement and still stay within the Australia region, an Azure VM can be used as a staging location, with the Data Management Gateway performing the actual data copy.

Azure Data Factory and “on-premises” Azure VMs

A question came up recently on the MSDN Forum for Azure Data Factory around whether using an Azure VM counts as a cloud or an on-premises resource when it comes to billing. Checking the pricing for Azure Data Factory, you can see that the price for Data Movement differs depending on the source location of the data, so where the data lives has quite an impact on cost. So is an Azure VM considered a cloud location or an on-premises location? I thought I'd do a quick test to confirm.

Environment

In order to understand how the Data Movement Service sees an Azure VM, some setup is required.

  1. Create an Azure VM; I already had a Windows Server 2016 CTP4 one so I reused that
  2. Create an Azure Storage account that will act as the Sink for the data
  3. Create a simple Azure Data Factory that contains a Copy activity to move data from the VM to the Storage account

Setup

To allow data to be moved from an on-premises file system in Azure Data Factory, you need to use the Data Management Gateway on your server. When the server is a virtual machine in Azure the process is exactly the same: download, install and run the gateway. Once it is up and running you should see something like the following.

ADF-OnPremTest-DMG

For the Azure Data Factory, a number of artefacts need to be created:

  1. Data Management Gateway that will provide access to the VM
  2. Linked Service to a File system
  3. Linked Service to an Azure Blob
  4. Dataset representing source data
  5. Dataset representing sink data
  6. Pipeline containing a Copy activity

You also need some very basic data to be moved, which can be a simple CSV file containing a couple of items.

Data Management Gateway

Setting up the Data Management Gateway is very straightforward and follows the usual pattern for an on-premises service: it simply creates an endpoint on the server. The File System Linked Service then references that gateway:

{
    "name": "OnPremisesFileServerLinkedService",
    "properties": {
        "description": "",
        "hubName": "adfiaastest_hub",
        "type": "OnPremisesFileServer",
        "typeProperties": {
            "host": "localhost",
            "gatewayName": "IAASTEST",
            "userId": "",
            "password": "",
            "encryptedCredential": "[REMOVED]"
        }
    }
}

When you set up an on-premises Linked Service you can either store the credentials for the server directly in the configuration (NOTE: the password is always replaced by asterisks when displayed) or use an encrypted credential.
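
For illustration, the inline-credential version of the same Linked Service would look something like the sketch below; the userId and password values are placeholders, and the encryptedCredential is simply omitted:

{
    "name": "OnPremisesFileServerLinkedService",
    "properties": {
        "hubName": "adfiaastest_hub",
        "type": "OnPremisesFileServer",
        "typeProperties": {
            "host": "localhost",
            "gatewayName": "IAASTEST",
            "userId": "[DOMAIN\\USERNAME]",
            "password": "[PASSWORD]"
        }
    }
}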

Azure Storage Linked Service

Once you’ve created an Azure storage account, you need to create a container. This can be done directly in the Azure portal or through a number of other tools such as Azure Management Studio, Cloud Portam or indeed Visual Studio.

{
    "name": "StorageLinkedService",
    "properties": {
        "description": "",
        "hubName": "adfiaastest_hub",
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=[STORAGEACCT];AccountKey=[STORAGEKEY]"
        }
    }
}

Datasets

As this is a test, the dataset used is extremely simple.

For the File System file:

{
    "name": "OnPremisesFile",
    "properties": {
        "published": false,
        "type": "FileShare",
        "linkedServiceName": "OnPremisesFileServerLinkedService",
        "typeProperties": {
            "fileName": "people.csv",
            "folderPath": "c:\\adfiaastest"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}

And for Blob Storage:

{
    "name": "AzureBlobDatasetTemplate",
    "properties": {
        "structure": [
            {
                "name": "firstname",
                "type": "String"
            },
            {
                "name": "lastname",
                "type": "String"
            }
        ],
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "fileName": "people.csv",
            "folderPath": "output",
            "format": {
                "type": "TextFormat"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}

Copy Activity Pipeline

Since we are only moving data, the pipeline contains a single Copy activity.

{
    "name": "PipelineTemplate",
    "properties": {
        "description": "Testing IaaS VM",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "FileSystemSource"
                    },
                    "sink": {
                        "type": "BlobSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "OnPremisesFile"
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureBlobDatasetTemplate"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "OnpremisesFileSystemtoBlob",
                "description": "copy activity"
            }
        ],
        "start": "2015-12-26T00:00:00Z",
        "end": "2015-12-28T00:00:00Z",
        "isPaused": false,
        "hubName": "adfiaastest_hub",
        "pipelineMode": "Scheduled"
    }
}

Once completed, a look at the result in the Diagram blade for the data factory should show something similar to the following:

ADF-OnPremTest-Pipeline

Result

Once the Data Management Gateway and Azure Storage have been linked and a file placed in the source folder on the VM, the factory should execute and move the data as expected. This is confirmed by quickly checking the storage container.

ADF-OnPremTest-Storage

After confirming the process was successful, I examined my subscription to see what Data Factory charges had been incurred. NOTE: It takes a few hours for new charges to show.

ADF-OnPremTest-Bill

Conclusion

Looking at the charges incurred during execution of the data movement activity, it can be seen that whilst we are essentially running a cloud service in the form of an Azure Virtual Machine, the data movement activity shows as an on-premises move.

It should be noted that the Azure Virtual Machine in this case was one created in the new portal.

Microsoft Integration Roadmap – My Op-ed

Microsoft has recently published their Integration Roadmap to provide insight into the direction of their key integration technologies:
  • BizTalk Server
  • Microsoft Azure BizTalk Services (MABS)
  • Microsoft Azure Logic Apps and Azure App Service

There are some great summaries out there by Kent Weare, Saravana Kumar and Daniel Toomey, and having read through the document, I thought I’d capture some of my thoughts on what it contains.

BizTalk Server

Having used BizTalk for over 10 years I was keen to see what direction the platform was going to take. There have been naysayers for many years saying the platform is dead, so it is good that Microsoft have announced a new version is coming later in 2016.

Some disappointing news for me, though, is that for the most part this is really yet another platform alignment release.
That said, the addition of more robustness around high availability, and better support for this on Azure IaaS, is welcome and should see a batch of new customers able to leverage it.
There is certainly some indication of future releases, although not as strong a commitment as outlined at the BizTalk Summit 2015 in London.

BizTalk Services

BizTalk Services has been the elephant in the room since Logic Apps went into preview. There are a number of API Apps that encapsulate a lot of the functionality provided by BizTalk Services, so it is quite telling that the roadmap says any new development should target Logic Apps and these API Apps rather than BizTalk Services.

It doesn’t take much reading between the lines to see that at some point MABS will be sunsetted. I hope it is fair to assume that a migration path from MABS to Logic Apps will be provided either directly by Microsoft or via a partner.

Logic Apps, App Service and Azure Stack

Since the preview release, Logic Apps have undergone a number of revisions around functionality and tooling and the roadmap lays out the path ahead for when they come out of preview.

Along with this we are going to see new connectors and general availability of Azure Stack.
Azure Stack provides App Services on-premises, and will provide organisations with at least some of the agility, resilience and scalability that the core Azure iPaaS platform provides.

New Features != Evolution

Taking the point about Azure Stack, Microsoft has announced a convergence between cloud and on-premises solutions for integration.

I’d take this a step further, I think. For a long time the BizTalk community has had to field questions on when, and if, the BizTalk Server platform will move forward technically. Whatever form this discussion has taken, it has long been assumed that over time the platform would evolve.
By converging cloud and on-premises with Azure Stack, thereby providing potential self-service integration solutions in your own hosting environment, and with the recent release of PowerApps, a means of creating just-in-time, data-driven applications is pretty much at hand.
Since there has been little actual evolution of the core BizTalk platform, I wonder if over the next couple of years workloads will move to Azure Stack instead. After all, this would provide the swiftest way to then leverage the core iPaaS platform in future and reduce dependency on hard-to-find BizTalk skills.
One telling comment in the roadmap discussion for me was:

“Alongside our Azure Stack investments, we are actively working on adding more BizTalk Server capabilities to Logic Apps.”

Is this scene setting? After all, the team responsible for BizTalk Server is the same team responsible for Logic Apps, and they have finite resources.

Conclusion

It’s great to see Microsoft continuing to invest in the future of integration, hardly surprising given Hybrid Integration is seen as a key approach to moving workloads to the cloud.

Have they gone far enough?
I think for now the combination of a core robust platform in BizTalk Server that is moving forward, albeit slowly, and a solution that offers parity between cloud and on-premises provides a great springboard for the near future of integration.
And for me that is one key takeaway: the roadmap provides a vision for the near future, but I would like to have seen a longer-term (i.e., beyond 2016) vision, as that is what would allow us as IntegrationProfessionals™ to ensure we don’t take a customer down a dark alley.