
Cleaning Duplicates from Mailboxes

In certain circumstances, duplicate items are created in mailboxes: for example, after an archive is imported from Enterprise Vault into a mailbox, or after a PST import fails and is restarted.

· 5 min read · Updated June 21, 2024
This article is a continuation of the previous article I published, about importing problematic PST files into user mailboxes.

I implemented the solution using MAPI, which allows working with Office applications through PowerShell. This makes it possible to break objects such as documents, emails, and data tables down into their properties and handle them with all the tools a script has for filtering, grouping, and modifying data.

The Duplicate Problem:

One of the problems encountered by anyone who has tried (and failed) several times to import PST files into a mailbox is the duplicate problem. Every time an import attempt is made, the system starts copying mail from the file into the mailbox itself, until it encounters an error that causes the process to stop. Repeated attempts copy more and more emails, sometimes the same emails multiple times, until another error is encountered.

Another case that can trigger the problem is when the mailbox was linked to an archive vault, such as Enterprise Vault. In this state the mailbox can contain a lot of shortcuts to items in the vault. After the archive is canceled and the actual items are retrieved back to the mailbox, the shortcuts sometimes remain alongside the actual emails, which causes duplicate results in searches and a poor user experience.

Solution:

Whether or not the PST import problem has been resolved (my previous article can help with resolving it), you can use the script presented below to clean the mailbox of duplicates.

The script should be run using PowerShell on a computer where Outlook is installed and the mailbox to be cleaned is fully loaded. A fully loaded mailbox (Full Cache Mode) is required because the script can only clean the items that Outlook has available; with a partially loaded mailbox, some duplicates would be missed.

Before running, update the user line at the beginning of the script according to the email address of the desired mailbox.

Now I will present the script, and after it explain its operation.

The Script:

$user = 'user@domain.suffix'

<# Logs the progress of the process: prints details and a timestamp on screen and into a log file.
Parameters: 
$a, $b: indexes of the strings to print, inside the messages array.
$res: the value described by the message printed above it. #>
function Print-Output {
    param ($a, $b, [Parameter(Mandatory = $false)]$res = $null)

    # Log file on the desktop of the current user.
    $log = "C:\Users\$env:USERNAME\Desktop\log.txt"

    # On screen: separator, message, separator, then the value and a timestamp.
    $global:output[$a, $b, $a]
    $res
    Get-Date

    # The same lines, appended to the log file.
    $global:output[$a, $b, $a] | Add-Content -Path $log -Encoding UTF8
    $res | Add-Content -Path $log -Encoding UTF8
    Get-Date | Add-Content -Path $log -Encoding UTF8
}


<# Goes through all folders recursively and cleans the duplicated items in each folder.
It first builds an index of the duplicates, then deletes the actual items in batches.
Parameters:
$Folder: the folder to work inside. #>
function Clean-Folder {
    param ($Folder)

    if ($Folder.Name -in $global:exlude) {
        # 'return' rather than 'break': a bare 'break' inside a function also ends
        # the caller's foreach loop, which would skip the remaining sibling folders.
        return
    } elseif ($Folder.Name -eq 'Deleted Items') {
        $global:final = $Folder
    }

    if ($Folder.Folders.Count -gt 0) {
        foreach ($x in $Folder.Folders) {
            Clean-Folder -Folder $x
        }
    }

    Print-Output -a 0 -b 1 -res $Folder.FolderPath

    # Lightweight index: only the send time and the subject of every item.
    $list = $Folder.Items | Select-Object SentOn, Subject
    $duplicates = @($list | Group-Object -Property SentOn, Subject | Where-Object { $_.Count -gt 1 })

    # Pack the duplicate groups into batches of roughly 1000 index entries,
    # tagging every entry with the serial number of its batch.
    $allduplicates = @()
    $tempgroup = @()
    $tmpcount = 0
    $counter = 0
    foreach ($group in $duplicates) {
        $tempgroup += $group.Group
        $counter += $group.Count
        if (($counter -gt 1000) -or ($group -eq $duplicates[-1])) {
            $tmpcount += 1
            $tempgroup | Add-Member -MemberType NoteProperty -Name 'count' -Value $tmpcount -Force
            $allduplicates += $tempgroup
            $tempgroup = @()
            $counter = 0
        }
    }

    $allduplicates = $allduplicates | Group-Object -Property count
    foreach ($dup in $allduplicates) {
        # Collect the actual items that match the index entries of this batch.
        $messages = $Folder.Items | Where-Object { $_.SentOn -in $dup.Group.SentOn } | Where-Object { $_.Subject -in $dup.Group.Subject }
        $duplicates = $messages | Group-Object -Property SentOn, Subject

        foreach ($group in $duplicates) {
            # Vault shortcuts are deleted first, so that a real item is the copy that survives.
            $shortcuts = @($group.Group | Where-Object { $_.MessageClass -eq 'IPM.Note.EnterpriseVault.Shortcut' })
            $others = @($group.Group | Where-Object { $_.MessageClass -ne 'IPM.Note.EnterpriseVault.Shortcut' })
            $count = 0
            # Delete all items except one.
            while ($count -lt ($group.Count - 1)) {
                if ($count -lt $shortcuts.Count) {
                    $shortcuts[$count].Delete()
                } else {
                    $others[$count - $shortcuts.Count].Delete()
                }
                $count += 1
            }
        }
    }
}

# Expose the Outlook mailbox as an object the script can work with.
$outlook = New-Object -ComObject Outlook.Application
$namespace = $outlook.GetNamespace("MAPI")
$mailbox = $namespace.Stores | Where-Object { $_.DisplayName -like $user }
# Variable that points to the root folder of the mailbox.
$global:mailboxRoot = $mailbox.GetRootFolder()
# Folders that are irrelevant and are skipped.
$global:exlude = 'RSS Subscriptions', 'Quick Step Settings', 'Sync Issues', 'Conversation Action Settings', 'Yammer Root', 'Recipient Cache'


# Output messages to print on screen and write to the log file, for monitoring purposes.
$global:output = "===================================", "         Cleaning folder"


Clean-Folder -Folder $global:mailboxRoot
Clean-Folder -Folder $global:final

Script Operation by Steps:

Definition and Functions:
  1. Determine which mailbox the process will run on. For this you need to change the first line at the top of the script, and update the mailbox address there.

  2. First function - manages the output. The function displays output on screen showing the current step running in the process, and also outputs the data to a log file on the desktop.

  3. Second function - a recursive function that goes through every folder in the folder tree - essentially in the mailbox. The function checks in each folder whether there are subfolders. If there are, the function calls itself on each subfolder. After that the function runs the cleanup process on the items in that folder.

  4. The function creates an index of all items in the folder. An index that includes only the send time of the item, and the title/subject.
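As an illustration of such an index, here is a minimal sketch that runs without Outlook. The sample objects and the `EntryID` values are made up for the example; real Outlook items expose `SentOn` and `Subject` in the same way. It also shows how grouping the index on both properties reveals the duplicates.

```powershell
# Made-up sample data standing in for $Folder.Items (assumption for illustration).
$items = @(
    [pscustomobject]@{ SentOn = [datetime]'2024-01-05 09:30'; Subject = 'Quarterly report'; EntryID = 'A1' }
    [pscustomobject]@{ SentOn = [datetime]'2024-01-05 09:30'; Subject = 'Quarterly report'; EntryID = 'A2' }
    [pscustomobject]@{ SentOn = [datetime]'2024-01-06 14:00'; Subject = 'Lunch?';           EntryID = 'B1' }
    [pscustomobject]@{ SentOn = [datetime]'2024-01-05 09:30'; Subject = 'Lunch?';           EntryID = 'C1' }
)

# The index keeps only the two properties used for the comparison.
$index = $items | Select-Object SentOn, Subject

# Items count as duplicates only when BOTH properties match.
$duplicateGroups = @($index | Group-Object -Property SentOn, Subject | Where-Object { $_.Count -gt 1 })

$duplicateGroups | ForEach-Object { '{0} copies of "{1}"' -f $_.Count, $_.Group[0].Subject }
# prints: 2 copies of "Quarterly report"
```

The last sample item shares its send time with one message and its subject with another, yet it is not reported, because the grouping key is the combination of both properties.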

Sorting and Segmenting:
  1. Sorting the index to find duplicates. Division into groups is performed, where each group contains items that have the same title and same timestamp. Finally a list is obtained containing only groups that contain more than one item - and these are the duplicates.

  2. After that, several variables are created for the benefit of the process. One of them is a counter, intended to ensure that collecting objects from the folders does not accumulate too large a quantity that would clog the memory and stall the process. Once the counter exceeds one thousand items, the counter resets and moves to a new group.

  3. The goal is, on one hand, to perform as few collections of items from the folder as possible, because each collection takes time, especially in huge folders containing tens of thousands of items. On the other hand, groups of duplicates typically contain only 2-5 items, so running a separate collection for each duplicate group would greatly lengthen the cleanup process.

    Therefore I used groups of groups. Index components of groups are collected into a large group, until the counter exceeds one thousand components. At that moment the group is tagged according to a serial number that turns all the items into one group, the large group joins the array of large groups, and everything resets to collect another group of about one thousand components.

  4. After all the duplicates are sorted and divided into groups of about one thousand components each (or one group if the number of items in the folder is less than one thousand), processing moves from index processing to processing the actual items.

The intention is to start handling the actual items only once it is certain that no more than about one thousand items will need to be handled at any given time.
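The "groups of groups" idea can be sketched on its own, outside Outlook. The function name `Split-DuplicateGroup` and its `-MaxItems` parameter are hypothetical, and instead of tagging index entries with a serial number as the script does, this variation simply returns an array of batches. The batching rule is the same: keep adding whole duplicate groups until the item counter reaches the limit, then start a new batch.

```powershell
function Split-DuplicateGroup {
    param(
        [object[]]$Groups,       # output of Group-Object
        [int]$MaxItems = 1000    # soft limit of index entries per batch
    )

    $batches = [System.Collections.Generic.List[object]]::new()
    $current = [System.Collections.Generic.List[object]]::new()
    $itemCount = 0

    foreach ($g in $Groups) {
        # A duplicate group is never split between two batches.
        $current.Add($g)
        $itemCount += $g.Count
        if ($itemCount -ge $MaxItems) {
            $batches.Add($current.ToArray())
            $current = [System.Collections.Generic.List[object]]::new()
            $itemCount = 0
        }
    }
    # Whatever is left over forms the last, smaller batch.
    if ($current.Count -gt 0) {
        $batches.Add($current.ToArray())
    }

    # The unary comma keeps PowerShell from flattening the array of batches.
    return ,$batches.ToArray()
}

# Made-up sample: five duplicate pairs, batched with a limit of 4 items.
$sample = 1..5 | ForEach-Object {
    $n = $_
    1..2 | ForEach-Object {
        [pscustomobject]@{ SentOn = ([datetime]'2024-01-01').AddDays($n); Subject = "Mail $n" }
    }
}
$groups  = $sample | Group-Object -Property SentOn, Subject
$batches = Split-DuplicateGroup -Groups $groups -MaxItems 4

'{0} batches' -f $batches.Count
# prints: 3 batches
```

With the small limit used here, the five pairs end up as two batches of two groups each plus one leftover batch, which mirrors what happens at the scale of a thousand items.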

Working in Segments:
  1. The process goes through each such thousand-group.

  2. From the folder all the actual items are collected, corresponding to the index components in the thousand-group.

  3. The actual items are sorted by timestamp and title/subject, which divides them into duplicate groups.

  4. The process runs in a loop through each duplicate group, and on each one performs a cleanup process, as follows.
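The collection step above can be tried on sample data. This is only a sketch with made-up objects; the variable names `$batchIndex` and `$candidates` are assumptions for the example. It shows why the actual items are grouped again after collection: filtering by send time and by subject separately can let in an item that matches one entry's time and another entry's subject, and the second grouping puts that item in a group of its own, where nothing is deleted.

```powershell
$t1 = [datetime]'2024-01-05 09:30'
$t2 = [datetime]'2024-01-06 14:00'
$t3 = [datetime]'2024-01-07 08:15'

# Made-up stand-in for the items of a folder.
$folderItems = @(
    [pscustomobject]@{ SentOn = $t1; Subject = 'Quarterly report' }
    [pscustomobject]@{ SentOn = $t1; Subject = 'Quarterly report' }
    [pscustomobject]@{ SentOn = $t2; Subject = 'Lunch?' }
    [pscustomobject]@{ SentOn = $t2; Subject = 'Lunch?' }
    [pscustomobject]@{ SentOn = $t1; Subject = 'Lunch?' }      # cross-match, not a duplicate
    [pscustomobject]@{ SentOn = $t3; Subject = 'Unrelated' }
)

# Index entries of one batch (one entry per duplicate group is enough here).
$batchIndex = @(
    [pscustomobject]@{ SentOn = $t1; Subject = 'Quarterly report' }
    [pscustomobject]@{ SentOn = $t2; Subject = 'Lunch?' }
)

# One pass over the folder collects every item that matches the batch.
$sentOnValues = $batchIndex.SentOn
$subjects     = $batchIndex.Subject
$candidates   = @($folderItems | Where-Object { $_.SentOn -in $sentOnValues -and $_.Subject -in $subjects })

# Grouping again separates the real duplicates from the cross-match.
$groups         = @($candidates | Group-Object -Property SentOn, Subject)
$realDuplicates = @($groups | Where-Object { $_.Count -gt 1 })

'{0} candidates, {1} groups, {2} with duplicates' -f $candidates.Count, $groups.Count, $realDuplicates.Count
# prints: 5 candidates, 3 groups, 2 with duplicates
```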

The Cleanup Process:
  1. Counter reset.

  2. A loop that runs as long as the counter is smaller than the number of items in the group minus one, that is, at least two less than the item count. For example, if there is only one duplicate, meaning the group contains two items, the loop runs only once, while the counter stands at 0. If the group contains 5 items, the loop runs while the counter stands at 0 through 3 and stops when it reaches 4. The reason is that the duplicate items sit in an array, the index of the last item in an array of 5 items is 4, and all items except the last one need to be deleted.

  1. If the group contains vault shortcuts, a check is performed whether there are also items in the group that are not shortcuts.

  2. If there are items that are not shortcuts, shortcuts are deleted first.

  3. It is checked whether there is more than one shortcut; if so, shortcuts are deleted according to an index taken from the counter’s value.

  4. If there is only one shortcut, it is simply deleted.

  5. If there are no shortcuts, simply delete the item in the array whose index is the counter’s value.

  6. The counter value increases by one, and the loop continues until the condition stops being met.
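The cleanup rule, delete everything except one item and prefer to delete shortcuts first, can be tested safely on fake objects before touching a real mailbox. In this sketch the helper `New-FakeItem` and the `Deleted` flag are made up for the example; the fake `Delete()` method only marks the object. The ordering approach (shortcuts placed at the front of the array, then one counter walks the array) is one possible way to express the steps above, not necessarily the exact branching of the script.

```powershell
$shortcutClass = 'IPM.Note.EnterpriseVault.Shortcut'

# Hypothetical helper: a fake item whose Delete() only flags the object.
function New-FakeItem {
    param([string]$Id, [string]$MessageClass)
    $item = [pscustomobject]@{ Id = $Id; MessageClass = $MessageClass; Deleted = $false }
    $item | Add-Member -MemberType ScriptMethod -Name Delete -Value { $this.Deleted = $true }
    $item
}

# One duplicate group: two real emails and one vault shortcut.
$group = @(
    New-FakeItem -Id 'real-1'     -MessageClass 'IPM.Note'
    New-FakeItem -Id 'shortcut-1' -MessageClass $shortcutClass
    New-FakeItem -Id 'real-2'     -MessageClass 'IPM.Note'
)

# Shortcuts go to the front, so they are the first to be deleted.
$shortcuts = @($group | Where-Object { $_.MessageClass -eq $shortcutClass })
$regular   = @($group | Where-Object { $_.MessageClass -ne $shortcutClass })
$ordered   = $shortcuts + $regular

# The counter stops one short of the last index, so exactly one item survives.
for ($count = 0; $count -lt ($ordered.Count - 1); $count++) {
    $ordered[$count].Delete()
}

$survivors = @($group | Where-Object { -not $_.Deleted })
$survivors.Id
# prints: real-2
```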

Final Handling:
  1. After going through all the small groups within all the large groups and deleting all duplicates in all folders, the process returns to the Deleted Items folder, which now holds all the previously deleted items, and runs the same cleanup on it. Items deleted from that folder are removed permanently.

  2. Throughout the process, output is written to the desktop for documentation and control purposes (in the script as shown above, this is the file log.txt). Anyone who wants to check first can comment out the deletion lines themselves, let the process run without deleting anything, and then check the data.
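As an alternative to commenting out the deletion lines, PowerShell's built-in `-WhatIf` mechanism can provide the same dry run. The sketch below is not part of the script: the wrapper `Remove-DuplicateItem` is a hypothetical name, and the fake item stands in for an Outlook item so that the example runs anywhere. A function declared with `SupportsShouldProcess` receives the `-WhatIf` switch automatically, and `$PSCmdlet.ShouldProcess` returns `$false` when that switch is used.

```powershell
# Hypothetical wrapper around the Delete() call (assumption, not in the script).
function Remove-DuplicateItem {
    [CmdletBinding(SupportsShouldProcess = $true)]
    param(
        [Parameter(Mandatory = $true)]
        $Item
    )
    # With -WhatIf, ShouldProcess only reports the action and returns $false.
    if ($PSCmdlet.ShouldProcess($Item.Subject, 'Delete duplicate')) {
        $Item.Delete()
    }
}

# Fake item for the demonstration; Delete() only flags the object.
$fake = [pscustomobject]@{ Subject = 'Quarterly report'; Deleted = $false }
$fake | Add-Member -MemberType ScriptMethod -Name Delete -Value { $this.Deleted = $true }

Remove-DuplicateItem -Item $fake -WhatIf
$afterDryRun = $fake.Deleted      # still $false: nothing was deleted

Remove-DuplicateItem -Item $fake
$afterRealRun = $fake.Deleted     # $true: the item was "deleted"
```

Replacing the direct `.Delete()` calls with such a wrapper would allow the whole cleanup to be rehearsed first and the log reviewed before anything is removed.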
