Finding Duplicate Files with PowerShell

Let's explore a script that leverages DataGravity's file fingerprints to identify the top 10 duplicate files on a given department share or virtual machine.

The Workflow

  1. Export fingerprints and file names to a File List (CSV format)
  2. Run the FindDuplicateFiles.ps1 PowerShell script
  3. List the top 10 duplicate files and the space they are consuming

Files and Fingerprints

DataGravity makes it easy to identify files and their unique SHA-1 fingerprints on a share or virtual machine (VMware or Hyper-V).  In this example we are going to gather the file names and fingerprints in the Sales department share.

The Script:

FindDuplicateFiles.ps1 -csvFilePath "c:\temp\sales.csv" -top 10

Script parameters:

-csvFilePath is the path to the CSV file we downloaded in the first step which contains a list of the files and file fingerprints.  This is an export from DataGravity's Search.

-top is an optional parameter that, if specified, limits the output to that number of top duplicate files.
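The core of the duplicate-finding logic can be sketched as follows. This is a hedged sketch, not the full script from GitHub, and it assumes the DataGravity export contains 'File Name' and 'Fingerprint' columns; the actual headers in your export may differ.

```powershell
# Hedged sketch only - the real FindDuplicateFiles.ps1 is on GitHub.
# Assumes the DataGravity export has 'File Name' and 'Fingerprint' columns.
param(
    [Parameter(Mandatory = $true)][string]$csvFilePath,
    [int]$top = 10
)

Import-Csv -Path $csvFilePath |
    Group-Object -Property Fingerprint |              # files with the same SHA-1 are duplicates
    Where-Object { $_.Count -gt 1 } |                 # keep fingerprints that occur more than once
    Sort-Object -Property Count -Descending |
    Select-Object -First $top |
    ForEach-Object {
        [pscustomobject]@{
            Fingerprint = $_.Name
            Copies      = $_.Count
            Files       = ($_.Group.'File Name' -join '; ')
        }
    }
```

Grouping on the fingerprint rather than the file name is what lets identical content surface even when copies were renamed.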

Listing and Validating Duplicates

Let's run the script to return the top 10 duplicate files, and their file size.

These results can of course be validated; the example below returns the duplicate files consuming the most space.

The full PowerShell script is listed below, and is available in my PowerShell repo on GitHub.

Finding Duplicate Files using DataGravity FingerPrints

I love it when community feedback brings an idea to life.  I have had the benefit of seeing this firsthand many times since joining the DataGravity family, first as an Alpha customer, and for the last two years as a Solutions Architect.  The most recent example centers on the topic of duplicate files and stems from a conversation at Tech Field Day Extra - VMworld 2014.  Several of the delegates were discussing the reality of just how many duplicate files exist within a given file system and how valuable it would be to identify those files to provide space and performance savings.  In the words of Hans De Leenheer - 'That is 101, finding what is duplicate'.

Imagine if you will for a minute how many duplicate copies of the exact same file live on a department share, virtual machine or home directory.  Copies of office templates, time reporting spreadsheets, company-wide memos, or department PowerPoint decks.  All the exact same files, saved to different locations by different people on the storage system. Howard Marks proposed a use case to find just how many copies of the same marketing PowerPoint deck have been saved.

File Fingerprinting

DataGravity now creates a file fingerprint for every supported file.  A SHA-1 cryptographic hash of the file's contents provides its "fingerprint" as a 40-character hexadecimal value.  Each file has a SHA-1 value associated with its contents, allowing inspection with far more accuracy than looking only at simple file metadata such as file name and size.

The file fingerprint is unique to the contents of a file to allow the following:

  • Locate a file on any mount point / share / VM based on its unique content.
  • Find all files with identical content, even if the files have different names or reside in different locations.
  • Ensure that a file has not changed over time, by viewing the file fingerprint from different DiscoveryPoints.
  • Ensure that a file containing specific content, as identified by the file SHA-1 value, does not reside on the DataGravity Discovery system.
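You can spot-check a fingerprint yourself outside of DataGravity: PowerShell's built-in Get-FileHash cmdlet (available since PowerShell 4.0) computes the same style of 40-character SHA-1 value. The file path below is a hypothetical example:

```powershell
# Compute a file's SHA-1 fingerprint locally with the built-in Get-FileHash cmdlet.
$file = "$env:TEMP\sample.txt"                     # hypothetical sample file
Set-Content -Path $file -Value 'hello world'
(Get-FileHash -Path $file -Algorithm SHA1).Hash    # 40-character hexadecimal fingerprint
```

Two files with identical contents will produce the same hash regardless of their names or locations, which is exactly the property the fingerprint features above rely on.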

Finding Duplicates

DataGravity's search and discovery extends to finding duplicate files that all share the same fingerprint. Let's search for all duplicates of the recent marketing presentation using the file's fingerprint.

It is easy to see that there are indeed duplicates of the presentation being saved by multiple people, to multiple locations.  In fact, some of these files appear to have been copied by the same user into different directories on their home share, but are the EXACT SAME file.

Using the preview function from the search confirms our duplicates.


There is a growing number of examples of how file fingerprinting is useful, many of which I will continue to share here on the blog.  Identifying duplicate files is one of my favorite uses of the feature, mostly because of how useful it is, but also because it demonstrates how DataGravity listens and incorporates feedback to enhance the product.

Deleting Dormant Data with PowerShell

One of my favorite forms of managing data is to DELETE it.  One of my favorite ways to delete things is with SPEED and CONFIDENCE.

I have been quoted as saying that "DELETE is the best form of de-duplication" - in fact it is 100% dedupe. Some of the best data to DELETE is the stuff that no one is using: dormant data.  So putting my automation hat on, let's explore a script that helps DELETE things quickly but still provides us with the ability to UNDO using DataGravity File Analytics for Dormant Data.

The workflow:

  1. Export Dormant Data to a CSV File List
  2. Run the ArchiveDormantData.ps1 PowerShell script
  3. Optionally create an archive TXT stub noting that each file has been deleted
  4. Validate space savings and recover individual files if required.

Identify Dormant Data:

DataGravity makes it easy to identify and download a list of all the dormant data.  In this example we are going to grab anything that hasn't been updated, read, or touched in a year or more on the Marketing share.

The script:

ArchiveDormantData.ps1 -ShareFilePath "\\CorporateDrive\Marketing" -csvFilePath "c:\temp\Marketing.csv" -logFile "C:\Temp\DormantDataDelete.log" -ArchiveStub

Script parameters:

-ShareFilePath is the path to the data where the dormant data lives to be deleted.  In our example it is the Marketing share.

-csvFilePath is the path to the CSV file we downloaded in the first step, which contains a list of the files to be deleted.  This is an export from DataGravity's Dormant Data report.

-logFile is an optional path where the script logs what has been removed.

-ArchiveStub is an optional switch that, if specified, creates a TXT stub in place of each deleted file.
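The core delete-and-stub loop can be sketched as follows. This is a hedged sketch, not the full script from GitHub, and the 'File Name' column and the stub file naming are assumptions about the Dormant Data export and the script's behavior:

```powershell
# Hedged sketch of the delete-and-stub loop - not the full ArchiveDormantData.ps1.
# Assumes the Dormant Data CSV has a 'File Name' column holding share-relative paths.
foreach ($row in Import-Csv -Path $csvFilePath) {
    $target = Join-Path $ShareFilePath $row.'File Name'
    if (Test-Path $target) {
        Remove-Item -Path $target                                  # delete the dormant file
        if ($ArchiveStub) {
            # leave a TXT breadcrumb so users know the file was removed
            Set-Content -Path "$target.archived.txt" -Value "Deleted by ArchiveDormantData.ps1 on $(Get-Date)"
        }
        if ($logFile) {
            Add-Content -Path $logFile -Value "Deleted $target"    # audit trail
        }
    }
}
```

The Test-Path guard keeps the run idempotent: if the script is re-run against the same CSV, already-deleted files are simply skipped.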

Validate and Recover if necessary (The undo button)

If you get anything wrong or delete the wrong thing, it is always handy to have an UNDO button.  There are several ways to do that using backup/recovery tools, and in this case, since we are already using DataGravity, we can create a manual DiscoveryPoint before making changes and restore any files if required.

The full PowerShell script is listed below, and available on my PowerShell repo on GitHub.  Big thanks to Will Urban for the heavy lifting on this one.  Happy DELETING.

Discovering Private Keys & Certificates in Unsecured File Shares and VMs

Attending an InfraGard event recently, I was made aware of a problem that I never gave much thought to before but probably should have: securing the private keys for SSL and SSH certificates.  Much like usernames and passwords, the public and private keys for certificates that encrypt and authenticate you to various internet services are critically important to manage and protect.  I also learned that many people don't rotate their certificates nearly as frequently as recommended (how many of us have GoDaddy certificates set to expire in 3 years?), and often private keys are saved on simple file and network shares.  Very similar to saving usernames and passwords in an Excel spreadsheet or text file.

Just like a password, to ensure the security of your private key it is best practice to limit access to members of your organization who absolutely need to have control over it. It is also best practice to change your private key (and re-key any associated certificates) if a member of your team who had access to the private key leaves your organization.  The challenge is finding these keys and identifying who is possibly using them.

Identifying Private & Public Keys

There are many samples of private keys and certificates that you can download to see the makeup of a particular crt or key file for user and machine authentication.  Opening these in a common editor shows how they are structured:


Building Intelligence to Identify Keys & Certificates in File Shares

One of the foundational tenets of DataGravity is to utilize intelligence on unstructured file shares and VMs to determine where sensitive data is being saved and accessed.  In this case we will identify and discover private keys/certificates in file shares and VMs with DataGravity's automated detection and intelligence.  This can be done using the Intelligence Management interface, allowing us to create a custom tag, attach it to a Discovery Policy, and then find and be alerted on this information.

We can simply give our new tag a name, 'Certificates Key', then a color indicator for importance (Red is the universal sign for 'Very Important'), and a Description.

The pattern we will look for inside the files to identify certificates or keys is the 'BEGIN' and 'END' lines used by private keys and certificates.  The regex expression that I found useful for this is listed below.

(-----(\bBEGIN\b|\bEND\b) ((\bRSA PRIVATE KEY\b)|(\bCERTIFICATE\b))-----)
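You can verify the pattern from a PowerShell prompt before trusting it in a Discovery Policy; the -match operator is enough for a quick test:

```powershell
# Quick sanity check of the Match Pattern against typical PEM header/footer lines.
$pattern = '(-----(\bBEGIN\b|\bEND\b) ((\bRSA PRIVATE KEY\b)|(\bCERTIFICATE\b))-----)'

'-----BEGIN RSA PRIVATE KEY-----' -match $pattern   # True
'-----END CERTIFICATE-----'       -match $pattern   # True
'-----BEGIN PUBLIC KEY-----'      -match $pattern   # False - not covered by this tag
```

Note the pattern as written covers only the 'RSA PRIVATE KEY' and 'CERTIFICATE' forms; broader headers such as 'PRIVATE KEY' or 'EC PRIVATE KEY' would need additional alternatives if you want to catch them.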

As seen when testing against the beginning and ending lines of these certificates, my Match Pattern is working as expected.

Now that I have the tools in place to identify Private Keys and Certificates, I simply need to update my Intelligence Profile to automatically discover when new certificates and keys are found.

Identify, Discover, & Notify

With the new intelligence tag for Certificate Key created and applied to my Intelligence Profile, I can very easily search and discover with DataGravity all instances of those files.

  1. Search for instances of the newly created tag - Certificates Key
  2. Identify the number of Results
  3. Preview any of the files to confirm that it is a key or certificate
  4. You will notice DataGravity can also identify this information beyond the file extension.  In this case the private key was saved as a text file, but previewing the file still shows that it contains private key information.
  5. Export or Subscribe to the Search to be notified when Private Keys or Certificates are saved.

Parity throughout the System

The newly created tags are now accessible through search as well as in all of the key visuals provided by DataGravity, including File Analytics, File Details, Activity Reporting, and Trending - across file shares and VMs.  This is extremely powerful for understanding where this sensitive data lives and who is accessing it, and for notifying other systems, such as a PKI key management system, for full discovery.