Removing duplicate emails from an mbsync maildir

2022-11-09

I use mbsync to synchronise my email between my mail provider and my various computers. After messing around migrating emails from one box to another, I accidentally created ~40,000 duplicate emails in my maildir. Unfortunately, these duplicates were not all textually identical: since each copy was downloaded from the mail server separately, each file has its own unique X-TUID header [1]. As a result, I needed to deduplicate mails that are identical in every way except for this header.

This is a reasonably well-explored problem [2], but the method that worked for me mostly follows this guide, with a couple of tweaks. This post illustrates the process more fully, with some tips to stop the issue recurring.

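The general process is:

  1. Make a working copy of the maildir.
  2. Strip the X-TUID headers out of the working copy.
  3. Run rmlint over the working copy to find the duplicates.
  4. Point the generated rmlint.sh at the real maildir and run it.
  5. Sync with mbsync so the deletions reach the mail server.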

But before we do anything, make sure that your maildir is backed up:

$ cp -r /your/maildir /your/maildir_backup

The duplicates had made my maildir ~11.5GB larger, so I decided to compress the backup instead:

$ tar -czf /your/maildir.tar.gz /your/maildir
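Before going further, it may be worth checking that the archive really contains everything. The sketch below is one rough way to do that, not part of the original process: verify_backup is a hypothetical helper name, and it only compares the number of regular files in the source with the number of non-directory entries in the archive, so it will not notice corrupted file contents.

```shell
# Hypothetical helper (not from the original post): succeed only if the
# archive lists as many non-directory entries as the source directory has
# regular files. Assumes the maildir contains only directories and files.
verify_backup() {
    src=$1
    archive=$2
    src_count=$(find "$src" -type f | wc -l)
    # tar -t prints one member per line; directory members end in "/".
    tar_count=$(tar -tzf "$archive" | grep -vc '/$' || true)
    [ "$src_count" -eq "$tar_count" ]
}
```

Usage would look something like `verify_backup /your/maildir /your/maildir.tar.gz && echo "backup looks complete"`.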

Identifying the problem

Before we begin, it would be wise to make sure that you're experiencing the same problem: namely, that duplicate emails exist that differ only by the X-TUID header. To do this, find a duplicate email in your inbox that has some distinctive text in the subject or another header.

Then grep through your mailbox to find at least two copies of the email:

$ cd /your/maildir
$ grep -r "Assistance Request with Error" *
/your/maildir/1665657241.2812_494.henleybeach,U=20442:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
/your/maildir/1648507477.henleybeach.843910446fe3022ed33fb971c75a5,U=608:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
/your/maildir/1665657340.2812_2461.henleybeach,U=22409:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
...

Now compare the two files:

$ diff -u /your/maildir/1665657241.2812_493.henleybeach,U=20441:2,S /your/maildir/1648507477.henleybeach.843910445baf79c0a7d7a39f56623,U=607:2,S
--- /your/maildir/1665657241.2812_493.henleybeach,U=20441:2,S
+++ /your/maildir/1648507477.henleybeach.843910445baf79c0a7d7a39f56623,U=607:2,S
@@ -154,7 +154,7 @@
 Content-Transfer-Encoding: 7bit
 sent-on: 2021-03-18 19:55:46 +0000
 X-MESSAGEID: z3x/2B0tA3mvKAUXMxEH61eIUaI8ECiXYGyZYNcdvq5NYzwxZHXCLJf/fBn6jmBlW9mpPmAQqNgN9q/bFMFFhGxDdwEXe598YVmmA8NfRUU1LmMdizmDSr81WW6UWakdgJY7A8HFtTP+IB/wp4sArRmUwTFP7xJX9x8yT6mEYrw=
-X-TUID: AiQwG2dVS4Mo
+X-TUID: vdYQtbR960ob


 ----==_mimepart_6053b04260bcd_2442ad0be6055909039010

Seems about right!
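If you'd rather not eyeball the diff, the same check can be scripted. This is a minimal sketch under the assumption that the header always appears as a line starting with "X-TUID: " (as in the diff above); same_except_tuid is a hypothetical helper name, not something mbsync or rmlint provides.

```shell
# Hypothetical helper (not from the original post): succeed if two message
# files are byte-identical once lines starting with "X-TUID: " are removed.
same_except_tuid() {
    a=$(mktemp) || return 2
    b=$(mktemp) || return 2
    sed '/^X-TUID: /d' "$1" > "$a"
    sed '/^X-TUID: /d' "$2" > "$b"
    if cmp -s "$a" "$b"; then status=0; else status=1; fi
    rm -f "$a" "$b"
    return "$status"
}
```

Call it with the two paths from the grep output, for example `same_except_tuid FILE1 FILE2 && echo "identical apart from X-TUID"`.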

Preventing the problem from happening again

Purportedly, setting CopyArrivalDate yes for the relevant channel in your mbsync configuration file prevents this from occurring.
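To see whether a configuration file already has the option switched on, something like the following could be used. It is only a sketch: has_copy_arrival_date is a hypothetical helper, the config path is passed in rather than assumed, and it only recognises the literal value yes (matched case-insensitively) on an uncommented line.

```shell
# Hypothetical helper (not from the original post): succeed if the given
# mbsync config file has an uncommented "CopyArrivalDate yes" line.
has_copy_arrival_date() {
    grep -Eiq '^[[:space:]]*CopyArrivalDate[[:space:]]+yes[[:space:]]*$' "$1"
}
```

For example: `has_copy_arrival_date ~/.mbsyncrc || echo "CopyArrivalDate is not enabled"` (adjust the path to wherever your config lives).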

Fixing the maildir

First, make a working copy of the mailbox under a unique name, and navigate there. The unique name will allow us to safely modify the rmlint-generated script later on.

$ cp -r /your/maildir /your/maildir_copyUNIQUENAME

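Strip the X-TUID headers out of the working copy with sed. The pattern is anchored to the start of the line so that only the header itself is removed, not any other line that happens to mention X-TUID:

$ cd /your/maildir_copyUNIQUENAME
$ find ./ -type f -exec sed -i -e '/^X-TUID: /d' {} \;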
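A quick way to confirm the headers are really gone is to count the files that still contain one. This is a small sketch, assuming your grep supports -r (GNU and BSD grep both do); count_tuid_files is a hypothetical helper name.

```shell
# Hypothetical helper (not from the original post): print the number of
# files under a directory that still contain a line starting "X-TUID: ".
count_tuid_files() {
    grep -rl '^X-TUID: ' "$1" | wc -l
}
```

Running `count_tuid_files /your/maildir_copyUNIQUENAME` should report zero once the sed step has finished.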

Run rmlint to check for duplicates. If you don't have rmlint, now would be a good time to install it with your system package manager. I'm packaging it for Alpine, and it seems fairly available across various distros. Failing this, you can simply compile it yourself - it's not too hard!

We supply the -g flag to get a nice progress bar. The --types flag selects the default types of lint, except duplicate directories (-dd) and empty directories (-ed). If we removed empty directories, then integral parts of the mailbox structure (such as empty cur, new, and tmp directories) would be removed and cause errors later.

$ cd /your/maildir_copyUNIQUENAME
$ ~/dev/rmlint/rmlint -g --types="defaults -ed -dd"
▕░░░░░░░░░░░░░░░░░░░▏                  Traversing (42480 usable files / 8 + 0 ignored files / folders)
▕░░░░░░░░░░░░░░░░░░░▏                      Preprocessing (reduces files to 42450 / found 0 other lint)
▕░░░░░░░░░░░░░░░░░░░▏       Matching (39692 dupes of 2688 originals; 0 B to scan in 0 files, ETA:  1s)

==> In total 42480 files, whereof 39692 are duplicates in 2688 groups.
==> This equals 11.49 GB of duplicates which could be removed.
==> Scanning took in total 13.600s.

Wrote a sh file to: /your/maildir_copyUNIQUENAME/rmlint.sh
Wrote a json file to: /your/maildir_copyUNIQUENAME/rmlint.json

That's a lot of duplicates!

Open up the generated rmlint.sh to check everything looks sensible, and in particular to check that the paths are absolute. If they're relative, move rmlint.sh to /your/maildir. Otherwise, rewrite the paths so that they point at the original maildir rather than the working copy:

$ sed -i -e 's/_copyUNIQUENAME//g' /your/maildir_copyUNIQUENAME/rmlint.sh
$ mv /your/maildir_copyUNIQUENAME/rmlint.sh /your/maildir/rmlint.sh # this may not be strictly necessary
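Since the script is about to delete files, it seems worth confirming that the rewrite caught every path. A minimal sketch, where script_is_retargeted is a hypothetical helper that simply checks the working copy's suffix no longer appears anywhere in the script:

```shell
# Hypothetical helper (not from the original post): succeed only if the
# script no longer mentions the working copy's suffix anywhere.
script_is_retargeted() {
    script=$1
    suffix=$2
    ! grep -qF -- "$suffix" "$script"
}
```

For example: `script_is_retargeted /your/maildir/rmlint.sh _copyUNIQUENAME || echo "script still references the working copy"`.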

Now it's time to deduplicate! First, do a quick double-check by running the script without making any changes. -x makes sure that rmlint.sh isn't deleted after running, and -n makes this a dry run only.

$ cd /your/maildir
$ ./rmlint.sh -x -n
# ////////////////////////////////////////////////////////////
# ///  This is only a dry run; nothing will be modified! ///
# ////////////////////////////////////////////////////////////
[  0%] Keeping:  /your/maildir/Sent/cur/1665585366.32080_3724.henleybeach,U=2933:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657859.2812_13825.henleybeach,U=8765:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665585044.32080_1537.henleybeach,U=746:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1642635782.5313_1397.henleybeach,U=16:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657827.2812_13096.henleybeach,U=8036:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657888.2812_14554.henleybeach,U=9494:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657788.2812_12367.henleybeach,U=7307:2,S
...
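The dry-run output is long, so rather than scrolling through it you can count the deletions and compare the total with the number of duplicates rmlint reported earlier. This sketch assumes each file to be removed produces exactly one "Deleting:" line, as in the output above; count_deletions is a hypothetical helper name.

```shell
# Hypothetical helper (not from the original post): count the "Deleting:"
# lines in dry-run output read from standard input.
count_deletions() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'Deleting:' || true
}
```

For example, `./rmlint.sh -x -n | count_deletions` should print a number close to the duplicate count from the rmlint summary.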

If it seems to be deleting from the right place, run the script for real (you'll need to provide some input to get it to start):

$ cd /your/maildir
$ ./rmlint.sh -x
usage: ./rmlint.sh OPTIONS

OPTIONS:

  -h   Show this message.
  -d   Do not ask before running.
  -x   Keep rmlint.sh; do not autodelete it.
  -p   Recheck that files are still identical before removing duplicates.
  -r   Allow deduplication of files on read-only btrfs snapshots. (requires sudo)
  -n   Do not perform any modifications, just print what would be done. (implies -d and -x)
  -c   Clean up empty directories while deleting duplicates.
  -q   Do not show progress.
  -k   Keep the timestamp of directories when removing duplicates.
  -i   Ask before deleting each file

This script will delete certain files rmlint found.
It is highly advisable to view the script first!

Rmlint was executed in the following way:

   $ rmlint -g --types=defaults -ed -dd

Execute this script with -d to disable this informational message.
Type any string to continue; CTRL-C, Enter or CTRL-D to abort immediately
go!
[  0%] Keeping:  /your/maildir/Sent/cur/1665585366.32080_3724.henleybeach,U=2933:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657859.2812_13825.henleybeach,U=8765:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665585044.32080_1537.henleybeach,U=746:2,S
...
[ 99%] Deleting: /your/maildir/Archive/cur/1665657567.2812_7020.henleybeach,U=26968:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665586645.5084_5180.henleybeach,U=15035:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665586298.5084_1245.henleybeach,U=11100:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665657372.2812_3083.henleybeach,U=23031:2,S
[100%] Done!

Now would be a good time to check your mailbox to ensure it's all okay!
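Beyond opening the mailbox in a mail client, one rough scripted check is to look for Message-ID headers that still occur more than once. This is only a heuristic sketch: repeated_message_ids is a hypothetical helper, it also matches Message-ID lines quoted inside message bodies, and a message legitimately filed in two folders will show up too, so treat any output as a lead rather than proof of remaining duplicates.

```shell
# Hypothetical helper (not from the original post): print each Message-ID
# header line that occurs more than once under the given directory.
repeated_message_ids() {
    grep -rhi '^Message-ID:' "$1" | sort | uniq -d
}
```

For example, `repeated_message_ids /your/maildir | wc -l` gives the number of Message-IDs that are still repeated.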

Finally, sync with mbsync. --expunge-far should be enabled (or its equivalent in the mbsync config file) to ensure that your locally deleted emails get deleted on the far side as well [3].

$ mbsync -a --expunge-far
C: 0/3  B: 9/11  F: +0/0 *0/0 #0/0  N: +0/0 *0/0 #0/0

And breathe!


  1. Speaking of which, if anyone knows what this header really does, I'd love to know - there's very little information about it online.

  2. This code snippet claims to help: https://gist.github.com/lewisthompson/bb0e0399254c90cf36dba03956bd2ff0 and there's a discussion on this mailing list: https://groups.google.com/g/mu-discuss/c/SqMOZSouh0Y

  3. You might notice that you get some errors about the near side box not existing. If that's the case, you need to recreate those mailboxes in your maildir, and resynchronise the mail into them. I plan to write a future post about this topic.
