Data Movement in Australia using Azure Data Factory

UPDATE: This slipped a little under the radar, but the Data Movement Service was announced as available in Australia on March 8th, so just go ahead and use it as intended!

One of the key activities for enabling data analytics of large scale datasets is the movement of data from one location to another to allow for further processing.

Azure Data Factory has a Copy activity that allows you to specify a source and sink for data to be moved. The (nearly) globally available Data Movement Service performs the move based on the location of the data sink.

So if you have data in East US and need to move it to North Europe, the Data Movement Service in North Europe will perform the move, no matter where your Data Factory itself is located.

The one exception to this is Australia. Currently there is no Data Movement Service in Australia, so if you try to move data from Australia East to Australia East, for instance, the Copy activity will fail.

Since data sovereignty is a real issue for Australian businesses, a solution is required.

Any time one end of the copy is on-premises (including an Azure VM), the Data Management Gateway is used to move the data, irrespective of where the data sink is.

As a test, I wondered whether this behaviour could be used to work around the lack of a Data Movement Service in Australia.

Environment

In order to test the solution the following is required:

  1. Source Storage Account containing a simple CSV file (make sure this file exists before deploying the Pipeline)
  2. Sink Storage Account (can be the same account with a different container)
  3. Azure VM to act as intermediate storage
  4. Simple Azure Data Factory to perform the data movement

Setup

To copy data via an Azure VM, the VM needs to be running the Data Management Gateway. Once that is installed, the rest of the setup is straightforward.

Storage Account

The simplest way of testing the movement is to create a single storage account in one of the Australian datacentres, in our case Australia Southeast. NOTE: in order to use the Australia region you need an Azure subscription registered to an Australian credit card.

The storage account created has 2 containers, one for input and one for output.

[Image: ADF-Move-StorSetup]
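
The pipeline expects a file called people.csv to already exist in the input container (as noted in the environment list above). The datasets below define two string columns, firstname and lastname, so a couple of comma-separated rows are enough; the values here are just placeholders:

Jane,Citizen
John,Sample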

The idea here is to move the file between the two containers. Doing this directly leads to an error, so a staging folder on an Azure VM is required.

Once the VM is running and has the Data Management Gateway installed, the Data Factory can be created and tested.

Linked Services

We need a linked service to represent the storage account:

{
    "name": "StorageLinkedService",
    "properties": {
        "description": "",
        "hubName": "adfmovetest_hub",
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=[ACCOUNT_NAME];AccountKey=[ACCOUNT_KEY]"
        }
    }
}

And one for the storage on the Azure VM, accessed through the Data Management Gateway:

{
    "name": "OnPremisesFileServerLinkedService",
    "properties": {
        "description": "",
        "hubName": "adfmovetest_hub",
        "type": "OnPremisesFileServer",
        "typeProperties": {
            "host": "localhost",
            "gatewayName": "testgateway",
            "userId": "",
            "password": "",
            "encryptedCredential": "[REMOVED]"
        }
    }
}

Datasets

The input and output blob datasets are near-identical, differing mainly in the container folderPath (the input dataset is shown below, followed by a sketch of the output):

{
    "name": "InputBlob",
    "properties": {
        "structure": [
            {
                "name": "firstname",
                "type": "String"
            },
            {
                "name": "lastname",
                "type": "String"
            }
        ],
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "fileName": "people.csv",
            "folderPath": "input",
            "format": {
                "type": "TextFormat"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}
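
The pipeline also references an OutputBlob dataset. As a sketch, assuming the second container is simply named “output”, it is the same definition with a different folderPath and with external set to false, since this dataset is produced by the pipeline rather than supplied from outside the factory:

{
    "name": "OutputBlob",
    "properties": {
        "structure": [
            {
                "name": "firstname",
                "type": "String"
            },
            {
                "name": "lastname",
                "type": "String"
            }
        ],
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "fileName": "people.csv",
            "folderPath": "output",
            "format": {
                "type": "TextFormat"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": false,
        "policy": {}
    }
}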

Likewise, the staging dataset on the Azure VM differs mainly in its type, linkedServiceName and folderPath:

{
    "name": "Staging",
    "properties": {
        "structure": [
            {
                "name": "firstname",
                "type": "String"
            },
            {
                "name": "lastname",
                "type": "String"
            }
        ],
        "published": false,
        "type": "FileShare",
        "linkedServiceName": "OnPremisesFileServerLinkedService",
        "typeProperties": {
            "fileName": "people.csv",
            "folderPath": "c:\\staging"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": false,
        "policy": {}
    }
}

Pipeline

To move the data, we need to create a simple pipeline that contains 2 Copy activities:

  1. Copy dataset from “input” container to staging folder on premises
  2. Copy dataset from staging folder on premises to “output” container

{
    "name": "TestAustraliaMove",
    "properties": {
        "description": "Test if you can use a staging folder to do a file move in Australia",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "FileSystemSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "InputBlob"
                    }
                ],
                "outputs": [
                    {
                        "name": "Staging"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "BlobToFile",
                "description": ""
            },
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "FileSystemSource"
                    },
                    "sink": {
                        "type": "BlobSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "Staging"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputBlob"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "FileToBlob",
                "description": ""
            }
        ],
        "start": "2016-01-18T23:59:00Z",
        "end": "2016-01-19T23:59:59Z",
        "isPaused": false,
        "hubName": "adfmovetest_hub",
        "pipelineMode": "Scheduled"
    }
}

You can see there are 2 activities:

  1. BlobToFile has a BlobSource and FileSystemSink
  2. FileToBlob has a FileSystemSource and BlobSink

Once the pipeline is deployed, the activities will execute based on the start and end dates specified.

Result

When all the activities have run, we would expect all datasets in the Data Factory diagram to show as green, indicating success.

[Image: ADF-Move-Pipeline]

As we can see, all the lights are green.

If we look at the staging folder on the Azure VM we can see the file:

[Image: ADF-Move-LocalDiskFile]

Likewise, when we look at the output container in our storage account, we can see the file has been moved:

[Image: ADF-Move-BlobOutput]

Conclusion

The Data Movement Service is currently unavailable in the Australia region, which limits the ability to move data between platform services within a region where data sovereignty is a real issue.

In order to achieve data movement and still stay within the Australia region, an Azure VM can be used to provide a staging location, with the Data Management Gateway performing the actual data copy.
