Recently I wrote a post about data locality and getting data growth under control by developing a deeper understanding of the demographics of the data being stored. Thankfully, not all data is created equal, and the key is to intelligently understand the constructs of the data and determine where it best belongs. With lower-cost archive cloud storage services like Amazon S3 & Glacier, it is becoming ever more economical to place archives and dormant data (files that haven't been updated, modified, or read in a significant period of time) into repositories optimized for infrequently accessed data.
In this post we will highlight the steps you can take, using DataGravity's Dormant Data functionality, to identify data that can be moved to a cloud-based storage platform with a data synchronization/migration tool like SyncBack Pro. In this series of steps we will be moving the files to Amazon S3, but we could just as easily be moving the data to Google Drive, Dropbox, Microsoft Azure, or any other preferred cloud-based storage platform. During this process we will also leverage DataGravity's data-aware search functionality to validate that the data we are moving doesn't contain sensitive information, including personally identifiable information (PII), or data owned by certain individuals that we would not like to store in a public cloud.
7 Steps to Taking Care of Dormant Data
Step 1: Identify Dormant Data with DataGravity File Analytics
After logging into DataGravity, browse to the File Analytics screen by selecting the Discover Tile >> File Analytics. Select the file share or virtual machine from which you would like to move data - in our example we will be looking at all data on the Marketing share that hasn't been utilized, accessed, or viewed in over 1 year.
In the lower left quadrant of the File Analytics screen is a section entitled Dormant Data. This section, like every part of the File Analytics UI, gives me a quick glance at the demographics of my data and, if desired, lets me drill in for more detail. By doing so I am able to quickly produce a full list of the data at a granular, file level, which can then be searched, exported, or drilled into further.
Drilling into the Dormant Data that hasn't been accessed in over 1 year allows me not only to see the data itself, but also provides a search experience to find items like its owners, file sizes, file types, readers, and writers. I also have the ability to export all of this information into a comma-separated values (CSV) list.
Step 2: Export the List of Files to Be Moved & Exclude Sensitive Tagged Data
After exporting the list of dormant files to a CSV file, we can open the file with a text editor or within Excel. I like to use Excel for CSV files, as it gives me a nice columnar view and the ability to perform some filtering on those columns. As you can see, a great amount of information has been exported, including some key fields that help us better understand our data, namely: Owner, Tags, & File Path.
These columns are important because we can now utilize the intelligent tags provided by the DataGravity export to filter out any files that might contain personally identifiable information (PII) or other sensitive information. In this case we have some files tagged as containing credit card information. We know this by the "CC" tag associated with the file, which DataGravity automatically identified and applied. We will not be uploading those files to Amazon, so we can simply filter them out of our list by deselecting them. (In fact, I may wish to look at those files a little closer and speak to the file owner to understand why credit card data is being saved in clear text on the Marketing share.) Other tags we can filter on include SSN (Social Security Number), Email, IPv4 (IP addresses), and URL, if they are present within our data.
Along with tag filtering for sensitive information, we can also filter out files owned by certain individuals. In this case we don't want to move any files owned by our boss (John Q. Boss) to the public cloud, and DataGravity makes it very easy to highlight exactly which files those are. We can simply filter on the Owner column to exclude all dormant files owned by the boss... Mr. John Q. Boss, that is.
Now we have all of the dormant data identified for movement to the cloud in a nice columnar list of file paths, excluding any data with sensitive information or owned by certain individuals.
Step 3: Configure SyncBack Pro to Move Data to Amazon S3
Now that we have identified which files are ready to move off our primary storage and up into the cloud, let's look at the tool we will be using to make the move. There are many tools that can do this, but the one I have been using recently for a number of data moves and migration exercises is SyncBack Pro. I was introduced to this tool by Michael White, who has done some nice write-ups on many of its features.
The feature of this tool we will be using is its Cloud Support, which makes it easy to copy, move, and/or synchronize file data to cloud storage services including Google, Amazon, Microsoft & Dropbox. The interface offers an incredible number of options, but we will use the cloud configuration for Amazon Web Services. Like most tools used for moving data, we simply need to specify a source and destination as well as a few other options. In our case the source will be the Marketing file share on DataGravity and the destination will be Amazon S3.
There are a plethora of posts that explain how to set up an AWS S3 account, create a bucket, configure the appropriate permissions/access with Identity and Access Management (IAM), and obtain an Access Key ID and Secret Access Key. Chris Wahl has a nice write-up on these steps for his setup with Amazon Glacier, and the steps are very similar for setting up Amazon S3, leaving you with a bucket to which you can upload/move data.
Once we have a bucket of storage up on Amazon, we are ready to start moving our dormant data into it.
Step 4: Using Filters to Specify the Data to Move & Exclude Sensitive Data from the Move
In the Expert view of SyncBack Pro there are a number of items you can modify that give you extreme granularity over the movement of your data. We will keep everything at the defaults, except for a few items that we are specifically calling out.
Move - rather than simply copying the data up to Amazon, I am specifying that we move the data off of primary storage. Of course, if I wanted to use Amazon more as a backup location, I could simply copy the data instead of moving it.
Permissions - I typically like to include any permissions or ACLs along with the copies or moves that I do, so I will maintain the owner, primary group, and DACLs for my file moves. Since the destination is not an NTFS file system, this probably isn't necessary, but it is something I have fallen into the habit of doing.
Filter - SyncBack Pro allows us to filter which files and folders to include in the move. Since we have a nice list of files that we gathered in Step #2, we can use this filtering feature to be sure we are only moving dormant data that hasn't been accessed in over a year, while excluding any files that contain sensitive information or, in our example, are owned by John Q. Boss.
Filters within SyncBack Pro are accessed on the main screen of the profile and by default include all files within the source directory to be moved. We want to modify this filter to include only the files we have identified to be moved; SyncBack Pro also allows you to enter multiple file path filters by separating them with a forward slash (/).
Following this nomenclature, we can quickly consolidate our list of files and their file paths into our filter by placing a forward slash (/) between them. This saves us a significant amount of time over copying and pasting each folder path individually. One of the tools I like to use for any type of text consolidation is Vim, but any good text editor will do. I especially like the regular expressions that Vim offers, which let me consolidate text with the best of them.
In fact, here is what that export of file paths looks like in Vim, along with a couple of quick regular expressions that put these file paths onto a single line and save me a lot of time:
Changing all forward slashes to back slashes with a simple substitution, :%s/\//\\/g - (Find & Replace also works, but I am showing off a little now).
Consolidating all lines by replacing the newline character (\n) with a forward slash (/), e.g. :%s/\n/\//, since the forward slash is the delimiter SyncBack Pro expects when specifying multiple files.
This results in the following, which is easy to copy and paste into the SyncBack Pro Filter dialog.
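If you would rather script this consolidation than drive it through Vim, a short Python sketch produces the same single-line filter string. The paths below are hypothetical examples, and the forward-slash delimiter is the one SyncBack Pro expects as described above:

```python
def build_filter_string(paths):
    """Turn a list of exported file paths into a single SyncBack Pro
    filter string: backslash path separators, with entries joined by
    the forward-slash (/) delimiter used for multiple file paths."""
    return "/".join(p.strip().replace("/", "\\") for p in paths if p.strip())

# Example with hypothetical paths from the dormant-data export:
paths = [
    "Marketing/2009/tradeshow-banner.psd",
    "Marketing/2010/old-price-list.xls",
]
print(build_filter_string(paths))
# Marketing\2009\tradeshow-banner.psd/Marketing\2010\old-price-list.xls
```

Either way - Vim or script - the result is one long line ready to paste into the Filter dialog.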
This brings us back to the home screen of SyncBack Pro, which provides a nice summary of the move job we are about to execute. It highlights all of the essentials, including: Move Type, Source, Destination, Conflict Resolution, and the files to be included and ignored.
Step 5: Perform a Simulated Test
Call me paranoid, but I always like to test things before actually moving forward with any type of big move, so I greatly appreciate the 'Simulated Test' feature that SyncBack Pro provides. You can skip this step if you would like, but I would encourage you to run the simulation of the impending bulk move.
Step 6: Move Dormant Files to Amazon
Once your simulated run works as you would expect, then it is time to move the data, which can be done by selecting the Run option.
A nice summary of each file - its size, extension, and last modified date, as well as the Action to be performed (Move to Amazon S3) - is provided, and if all looks good we can select 'Continue Run'.
Now we can see the job running, and data starting to be moved up to AWS.
Step 7: Validate the Move and Save Report of All Moved Data
The last step is to validate that the move worked as we would expect. There are a couple of ways to do this, and I like to look at two items in particular: i) the job log provided by SyncBack Pro, and ii) File Analytics on the Marketing share in DataGravity, showing the reduction in dormant data now that the files have been moved.
SyncBack Pro Log:
DataGravity File Analytics / Dormant Data View
It looks like both check out, as all of the files were moved up to Amazon and deleted from the primary share. Of course, if we moved something that we shouldn't have, we can move it back from AWS, or leverage DataGravity's data protection and recovery mechanisms to restore any files that were moved off of the array.
There you have it. In 7 simple steps, we leveraged the intelligence provided by DataGravity's data-aware storage to understand, export, & move dormant data. Using the tags and ownership properties provided through the dormant data export, we could be sure to exclude any data that we were not comfortable moving to the cloud. Once the data was identified, SyncBack Pro gave us a simple and easy interface to perform the move to Amazon S3. Of course, we could have used these same steps to move this data to any storage platform of our choosing, highlighting the benefits of data-aware storage: intelligently understand our data, make informed decisions, and facilitate actionable results.