Removing duplicate emails from an mbsync maildir


I use mbsync to synchronise my emails between my mail provider and my various computers. After messing around migrating emails from one box to another, I found that I had accidentally created ~40,000 duplicate emails in my maildir. Unfortunately, these duplicates were not all textually identical - since each copy was downloaded from the mail server separately, each file has its own unique X-TUID header[1]. As a result, I needed to deduplicate mails that are identical in every way except for this header.

This is a reasonably well-explored problem[2], but the method that worked for me was mostly following this guide, with a couple of tweaks. This post illustrates the process more fully, with some tips to stop the issue recurring.

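The general process is to make a working copy of the maildir, strip the X-TUID headers out of the copy, find the duplicates in the copy with rmlint, and then run the rmlint-generated script against the original maildir.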

But before we do anything, make sure that your maildir is backed up:

$ cp -r /your/maildir /your/maildir_backup

The duplicates made my maildir ~11.5GB larger, so I decided to compress the backup:

$ tar -czf /your/maildir.tar.gz /your/maildir
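
If you want some reassurance that the archive is complete before touching anything, a rough check is to compare file counts. This is only a sketch: backup_file_counts_match is a name I made up, and it assumes the archive was created from the same directory with tar -czf as above. It compares counts only, not contents.

```shell
# Succeeds (exit status 0) when the number of regular files under a directory
# equals the number of non-directory entries in a gzipped tar archive.
# Hypothetical helper: it compares counts only, not file contents.
backup_file_counts_match() {
    dir=$1
    archive=$2
    in_dir=$(find "$dir" -type f | wc -l)
    # tar lists directories with a trailing slash, so skip those entries
    in_tar=$(tar -tzf "$archive" | grep -vc '/$')
    [ "$in_dir" -eq "$in_tar" ]
}
```

For example: backup_file_counts_match /your/maildir /your/maildir.tar.gz && echo "counts match"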

Identifying the problem

Before we begin, it would be wise to make sure that you're experiencing the same problem - namely, that duplicate emails exist that differ only by the X-TUID header. To do this, find a duplicate email in your inbox that has some distinctive text in the subject/header.

Then grep through your mailbox to find at least two copies of the email:

$ cd /your/maildir
$ grep -r "Assistance Request with Error" *
/your/maildir/1665657241.2812_494.henleybeach,U=20442:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
/your/maildir/1648507477.henleybeach.843910446fe3022ed33fb971c75a5,U=608:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
/your/maildir/1665657340.2812_2461.henleybeach,U=22409:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error

Now compare the two files:

$ diff /your/maildir/1665657241.2812_493.henleybeach,U=20441:2,S /your/maildir/1648507477.henleybeach.843910445baf79c0a7d7a39f56623,U=607:2,S
--- /your/maildir/1665657241.2812_493.henleybeach,U=20441:2,S
+++ /your/maildir/1648507477.henleybeach.843910445baf79c0a7d7a39f56623,U=607:2,S
@@ -154,7 +154,7 @@
 Content-Transfer-Encoding: 7bit
 sent-on: 2021-03-18 19:55:46 +0000
 X-MESSAGEID: z3x/2B0tA3mvKAUXMxEH61eIUaI8ECiXYGyZYNcdvq5NYzwxZHXCLJf/fBn6jmBlW9mpPmAQqNgN9q/bFMFFhGxDdwEXe598YVmmA8NfRUU1LmMdizmDSr81WW6UWakdgJY7A8HFtTP+IB/wp4sArRmUwTFP7xJX9x8yT6mEYrw=
+X-TUID: vdYQtbR960ob


Seems about right!
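
If reading diffs by eye gets tedious, the same check can be scripted. This is a minimal sketch, assuming the only difference you care about is lines beginning with "X-TUID:"; same_except_tuid is a hypothetical helper name, not something mbsync provides.

```shell
# Succeeds (exit status 0) when two files are byte-identical once every line
# starting with "X-TUID:" is ignored. Hypothetical helper for spot checks.
same_except_tuid() {
    tmp=$(mktemp) || return 2
    # Filtered copy of the first file (grep exits 1 if nothing is left; ignore that)
    grep -v '^X-TUID:' "$1" > "$tmp" || true
    # Compare the filtered second file (on stdin) against the filtered first file
    if grep -v '^X-TUID:' "$2" | cmp -s - "$tmp"; then
        status=0
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

Call it with the two file names from the grep output, for example: same_except_tuid FILE1 FILE2 && echo "only X-TUID differs"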

Preventing the problem from happening again

Purportedly, setting CopyArrivalDate yes in the mbsync config file prevents this from occurring.

Fixing the maildir

Firstly make a working copy of the mailbox under a unique name, and navigate there. The unique name will allow us to safely modify the rmlint-generated script later on.

$ cp -r /your/maildir /your/maildir_copyUNIQUENAME

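Strip the X-TUID headers out of the working copy with sed. Anchoring the pattern to the start of the line means that only lines beginning with X-TUID: are removed, rather than any line that happens to mention X-TUID:

$ cd /your/maildir_copyUNIQUENAME
$ find ./ -type f -exec sed -i -e '/^X-TUID:/d' {} \;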
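
To confirm that the headers really are gone from the working copy, you can count the files that still contain one. A small sketch; count_files_with_tuid is a made-up name, and it relies on grep -r, which GNU, BSD and BusyBox grep all provide.

```shell
# Print the number of files under a directory that still contain a line
# starting with "X-TUID:". Hypothetical helper; 0 means the strip worked.
count_files_with_tuid() {
    # grep exits non-zero when nothing matches, which is the outcome we want
    { grep -rl '^X-TUID:' "$1" || true; } | wc -l
}
```

Running count_files_with_tuid /your/maildir_copyUNIQUENAME should print 0 after the sed step.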

Run rmlint to check for duplicates. If you don't have rmlint, now would be a good time to install it with your system package manager. I'm packaging it for Alpine[3], and it seems fairly available across various distros. Failing this, you can simply compile it yourself - it's not too hard!

We supply the -g flag to get a nice progress bar. The --types flag selects rmlint's default lint types, minus empty directories (-ed) and duplicate directories (-dd). If we removed empty directories, then integral parts of the mailbox structure (such as empty cur, new, and tmp directories) would be removed and cause errors later.

$ cd /your/maildir_copyUNIQUENAME
$ rmlint -g --types="defaults -ed -dd"
▕░░░░░░░░░░░░░░░░░░░▏                  Traversing (42480 usable files / 8 + 0 ignored files / folders)
▕░░░░░░░░░░░░░░░░░░░▏                      Preprocessing (reduces files to 42450 / found 0 other lint)
▕░░░░░░░░░░░░░░░░░░░▏       Matching (39692 dupes of 2688 originals; 0 B to scan in 0 files, ETA:  1s)

==> In total 42480 files, whereof 39692 are duplicates in 2688 groups.
==> This equals 11.49 GB of duplicates which could be removed.
==> Scanning took in total 13.600s.

Wrote a sh file to: /your/maildir_copyUNIQUENAME/
Wrote a json file to: /your/maildir_copyUNIQUENAME/rmlint.json

That's a lot of duplicates!
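
If you'd like a second opinion on that number, you can do a crude count with checksums. This is only a sketch and not how rmlint works internally: count_duplicate_files is a hypothetical helper, it assumes sha256sum is installed, and its result need not match rmlint's exactly (rmlint may, for instance, treat empty files as a separate kind of lint).

```shell
# Print how many files under a directory are surplus copies, i.e. the total
# number of files minus the number of distinct SHA-256 checksums.
# Hypothetical helper; assumes sha256sum is available.
count_duplicate_files() {
    find "$1" -type f -exec sha256sum {} + |
        awk '{ seen[$1]++ }
             END { n = 0; for (h in seen) n += seen[h] - 1; print n }'
}
```

For example: count_duplicate_files /your/maildir_copyUNIQUENAME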

Open up the generated script to check that everything looks sensible, and in particular that the paths are absolute. If they're relative, move the script into /your/maildir. Otherwise, rewrite the paths so that they point at the original maildir rather than the working copy:

$ sed -i -e 's/_copyUNIQUENAME//g' /your/maildir_copyUNIQUENAME/
$ mv /your/maildir_copyUNIQUENAME/ /your/maildir/ # this may not be strictly necessary

Now it's time to deduplicate! Firstly, do a quick double check by running the script without making any changes. -x makes sure that the script isn't deleted after running. -n configures this as a dry run only.

$ cd /your/maildir
$ ./ -x -n
# ////////////////////////////////////////////////////////////
# ///  This is only a dry run; nothing will be modified! ///
# ////////////////////////////////////////////////////////////
[  0%] Keeping:  /your/maildir/Sent/cur/1665585366.32080_3724.henleybeach,U=2933:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657859.2812_13825.henleybeach,U=8765:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665585044.32080_1537.henleybeach,U=746:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1642635782.5313_1397.henleybeach,U=16:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657827.2812_13096.henleybeach,U=8036:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657888.2812_14554.henleybeach,U=9494:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657788.2812_12367.henleybeach,U=7307:2,S

If it seems to be deleting from the right place, run the script for real (you'll need to provide some input to get it to start):

$ cd /your/maildir
$ ./ -x
usage: ./ OPTIONS


  -h   Show this message.
  -d   Do not ask before running.
  -x   Keep; do not autodelete it.
  -p   Recheck that files are still identical before removing duplicates.
  -r   Allow deduplication of files on read-only btrfs snapshots. (requires sudo)
  -n   Do not perform any modifications, just print what would be done. (implies -d and -x)
  -c   Clean up empty directories while deleting duplicates.
  -q   Do not show progress.
  -k   Keep the timestamp of directories when removing duplicates.
  -i   Ask before deleting each file

This script will delete certain files rmlint found.
It is highly advisable to view the script first!

Rmlint was executed in the following way:

   $ rmlint -g --types=defaults -ed -dd

Execute this script with -d to disable this informational message.
Type any string to continue; CTRL-C, Enter or CTRL-D to abort immediately
[  0%] Keeping:  /your/maildir/Sent/cur/1665585366.32080_3724.henleybeach,U=2933:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657859.2812_13825.henleybeach,U=8765:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665585044.32080_1537.henleybeach,U=746:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665657567.2812_7020.henleybeach,U=26968:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665586645.5084_5180.henleybeach,U=15035:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665586298.5084_1245.henleybeach,U=11100:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665657372.2812_3083.henleybeach,U=23031:2,S
[100%] Done!

Now would be a good time to check your mailbox to ensure it's all okay!
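
One quick sanity check is to count the messages that are left. A rough sketch, assuming messages live in cur and new directories as in a standard maildir; count_messages is a name I made up.

```shell
# Print the number of message files under a maildir, counting only files
# inside cur/ and new/ directories. Hypothetical helper.
count_messages() {
    find "$1" -type f \( -path '*/cur/*' -o -path '*/new/*' \) | wc -l
}
```

In my case rmlint reported 42480 files of which 39692 were duplicates, so roughly 42480 - 39692 = 2788 files should remain. That figure includes any non-message files in the maildir, so expect count_messages /your/maildir to print something a little lower.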

Finally, sync with mbsync. --expunge-far should be enabled (or its equivalent in the mbsync config file) to ensure that your locally deleted emails get deleted on the far side as well[4].

$ mbsync -a --expunge-far
C: 0/3  B: 9/11  F: +0/0 *0/0 #0/0  N: +0/0 *0/0 #0/0

And breathe!

  1. Speaking of which, if anyone knows what this header really does, I'd love to know - there's very little information about it online.

  2. This code snippet claims to help, and there's a discussion on this mailing list.

  3. Edit: rmlint is now available in Alpine Edge.

  4. You might notice that you get some errors about the near side box not existing. If that's the case, you need to recreate those mailboxes in your maildir, and resynchronise the mail into them. I plan to write a future post about this topic.
