Removing duplicate emails from an mbsync maildir

2022-11-09

I use mbsync to synchronise my email between my mail provider and my various computers. While migrating mail from one box to another, I accidentally created ~40,000 duplicate emails in my maildir. Unfortunately, these duplicates were not byte-for-byte identical: because each copy was downloaded from the mail server separately, each file has its own X-TUID header [1]. As a result, I needed to deduplicate mails that are identical in every way except for this header.

This is a reasonably well-explored problem [2], but the method that worked for me mostly followed this guide, with a couple of tweaks. This post illustrates the process more fully, with some tips to stop the issue recurring.

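The general process is:

  1. Make a working copy of the maildir.
  2. Strip the X-TUID headers from the working copy.
  3. Run rmlint on the working copy to find the now-identical duplicates.
  4. Point the generated rmlint.sh at the real maildir and run it.
  5. Sync with mbsync so that the deletions reach the server.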

But before we do anything, make sure that your maildir is backed up:

$ cp -r /your/maildir /your/maildir_backup

The duplicates made my maildir ~11.5GB larger, so I decided to compress the backup:

$ tar -czf /your/maildir.tar.gz /your/maildir
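Before touching the real maildir, it may be worth confirming that the plain copy is complete. The following is a minimal sketch, not part of the original process; backup_matches is a hypothetical helper name, and it only covers the cp -r backup, not the tarball.

```shell
# Hypothetical helper: succeeds only if both directory trees contain the
# same files with the same contents. Prints nothing either way.
backup_matches() {
    diff -r -q "$1" "$2" >/dev/null
}
```

It could be used as: backup_matches /your/maildir /your/maildir_backup && echo "backup looks complete"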

Identifying the problem

Before we begin, it would be wise to make sure that you're experiencing the same problem: namely, that duplicate emails exist that differ only by the X-TUID header. To do this, find a duplicate email in your inbox that has some distinctive text in its subject or headers.

Then grep through your mailbox to find at least two copies of the email:

$ cd /your/maildir
$ grep -r "Assistance Request with Error" *
/your/maildir/1665657241.2812_494.henleybeach,U=20442:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
/your/maildir/1648507477.henleybeach.843910446fe3022ed33fb971c75a5,U=608:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
/your/maildir/1665657340.2812_2461.henleybeach,U=22409:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
...

Now compare the two files:

$ diff /your/maildir/1665657241.2812_493.henleybeach,U=20441:2,S /your/maildir/1648507477.henleybeach.843910445baf79c0a7d7a39f56623,U=607:2,S
--- /your/maildir/1665657241.2812_493.henleybeach,U=20441:2,S
+++ /your/maildir/1648507477.henleybeach.843910445baf79c0a7d7a39f56623,U=607:2,S
@@ -154,7 +154,7 @@
 Content-Transfer-Encoding: 7bit
 sent-on: 2021-03-18 19:55:46 +0000
 X-MESSAGEID: z3x/2B0tA3mvKAUXMxEH61eIUaI8ECiXYGyZYNcdvq5NYzwxZHXCLJf/fBn6jmBlW9mpPmAQqNgN9q/bFMFFhGxDdwEXe598YVmmA8NfRUU1LmMdizmDSr81WW6UWakdgJY7A8HFtTP+IB/wp4sArRmUwTFP7xJX9x8yT6mEYrw=
-X-TUID: AiQwG2dVS4Mo
+X-TUID: vdYQtbR960ob


 ----==_mimepart_6053b04260bcd_2442ad0be6055909039010

Seems about right!
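If you would rather not read the diff by eye, the same check can be scripted. This is a rough sketch under the assumption that the header always appears as a line starting with "X-TUID:"; same_except_tuid is a hypothetical helper name. Note that it drops such lines anywhere in the file, not only in the header block.

```shell
# Hypothetical helper: succeeds if the two files are identical once every
# line starting with "X-TUID:" has been dropped from both.
same_except_tuid() {
    tmp=$(mktemp) || return 2
    grep -v '^X-TUID:' "$2" > "$tmp"
    # cmp reads the filtered first file on stdin ("-") and compares it
    # with the filtered second file.
    grep -v '^X-TUID:' "$1" | cmp -s - "$tmp"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Calling it with the paths of two suspected duplicates returns success when X-TUID is the only difference between them.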

Preventing the problem from happening again

Purportedly, setting CopyArrivalDate yes in your mbsync configuration file prevents this from occurring.

Fixing the maildir

Firstly make a working copy of the mailbox under a unique name, and navigate there. The unique name will allow us to safely modify the rmlint-generated script later on.

$ cp -r /your/maildir /your/maildir_copyUNIQUENAME

Strip the X-TUID headers out of the working copy with sed:

$ cd /your/maildir_copyUNIQUENAME
$ find ./ -type f -exec sed -i -e '/^X-TUID:/d' {} \;
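The sed command above looks at every line of each file, message body included. A slightly more conservative variant, sketched here, only deletes the header between the first line and the first blank line, which is where the header block of a message ends. It assumes GNU sed (for -i without a suffix) and LF line endings; strip_tuid_header is a hypothetical helper name.

```shell
# Hypothetical helper: delete "X-TUID:" lines only within the header block,
# i.e. from line 1 up to the first empty line. Edits the file in place.
strip_tuid_header() {
    sed -i -e '1,/^$/{/^X-TUID:/d;}' "$1"
}
```

The same sed expression can be dropped into the find command in place of the simpler pattern.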

Run rmlint to check for duplicates. If you don't have rmlint, now would be a good time to install it with your system package manager. I'm packaging it for Alpine [3], and it seems fairly available across various distros. Failing this, you can simply compile it yourself; it's not too hard!

We supply the -g flag to get a nice progress bar. The --types flag selects the default lint types, minus empty directories (-ed) and duplicate directories (-dd). If we removed empty directories, then integral parts of the mailbox structure (such as empty cur, new, and tmp directories) would be removed and cause errors later.

$ cd /your/maildir_copyUNIQUENAME
$ ~/dev/rmlint/rmlint -g --types="defaults -ed -dd"
▕░░░░░░░░░░░░░░░░░░░▏                  Traversing (42480 usable files / 8 + 0 ignored files / folders)
▕░░░░░░░░░░░░░░░░░░░▏                      Preprocessing (reduces files to 42450 / found 0 other lint)
▕░░░░░░░░░░░░░░░░░░░▏       Matching (39692 dupes of 2688 originals; 0 B to scan in 0 files, ETA:  1s)

==> In total 42480 files, whereof 39692 are duplicates in 2688 groups.
==> This equals 11.49 GB of duplicates which could be removed.
==> Scanning took in total 13.600s.

Wrote a sh file to: /your/maildir_copyUNIQUENAME/rmlint.sh
Wrote a json file to: /your/maildir_copyUNIQUENAME/rmlint.json

That's a lot of duplicates!

Open up the generated rmlint.sh to check that everything looks sensible, and in particular that the paths are absolute. If they're relative, simply move rmlint.sh to /your/maildir. Otherwise, rewrite the paths so that they point at the real maildir rather than the working copy:

$ sed -i -e 's/_copyUNIQUENAME//g' /your/maildir_copyUNIQUENAME/rmlint.sh
$ mv /your/maildir_copyUNIQUENAME/rmlint.sh /your/maildir/rmlint.sh # this may not be strictly necessary
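Since the script is about to delete files, a quick sanity check that no path still refers to the working copy seems worthwhile. This is a small sketch; script_points_at_original is a hypothetical helper name, and it only searches the script for the suffix as a fixed string, without parsing it.

```shell
# Hypothetical helper: succeeds when the script ($1) no longer contains the
# working-copy suffix ($2). Returns 2 if the script cannot be read.
script_points_at_original() {
    [ -r "$1" ] || return 2
    if grep -q -F -- "$2" "$1"; then
        return 1
    fi
    return 0
}
```

For example: script_points_at_original /your/maildir/rmlint.sh _copyUNIQUENAME && echo "paths rewritten"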

Now it's time to deduplicate! Firstly, do a quick double check, running the script without making any changes. -x makes sure that rmlint.sh isn't deleted after running. -n configures this as a dry run only.

$ cd /your/maildir
$ ./rmlint.sh -x -n
# ////////////////////////////////////////////////////////////
# ///  This is only a dry run; nothing will be modified! ///
# ////////////////////////////////////////////////////////////
[  0%] Keeping:  /your/maildir/Sent/cur/1665585366.32080_3724.henleybeach,U=2933:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657859.2812_13825.henleybeach,U=8765:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665585044.32080_1537.henleybeach,U=746:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1642635782.5313_1397.henleybeach,U=16:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657827.2812_13096.henleybeach,U=8036:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657888.2812_14554.henleybeach,U=9494:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657788.2812_12367.henleybeach,U=7307:2,S
...

If it seems to be deleting from the right place, run the script for real (you'll need to provide some input to get it to start):

$ cd /your/maildir
$ ./rmlint.sh -x
usage: ./rmlint.sh OPTIONS

OPTIONS:

  -h   Show this message.
  -d   Do not ask before running.
  -x   Keep rmlint.sh; do not autodelete it.
  -p   Recheck that files are still identical before removing duplicates.
  -r   Allow deduplication of files on read-only btrfs snapshots. (requires sudo)
  -n   Do not perform any modifications, just print what would be done. (implies -d and -x)
  -c   Clean up empty directories while deleting duplicates.
  -q   Do not show progress.
  -k   Keep the timestamp of directories when removing duplicates.
  -i   Ask before deleting each file

This script will delete certain files rmlint found.
It is highly advisable to view the script first!

Rmlint was executed in the following way:

   $ rmlint -g --types=defaults -ed -dd

Execute this script with -d to disable this informational message.
Type any string to continue; CTRL-C, Enter or CTRL-D to abort immediately
go!
[  0%] Keeping:  /your/maildir/Sent/cur/1665585366.32080_3724.henleybeach,U=2933:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665657859.2812_13825.henleybeach,U=8765:2,S
[  0%] Deleting: /your/maildir/Sent/cur/1665585044.32080_1537.henleybeach,U=746:2,S
...
[ 99%] Deleting: /your/maildir/Archive/cur/1665657567.2812_7020.henleybeach,U=26968:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665586645.5084_5180.henleybeach,U=15035:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665586298.5084_1245.henleybeach,U=11100:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665657372.2812_3083.henleybeach,U=23031:2,S
[100%] Done!

Now would be a good time to check your mailbox to ensure it's all okay!

Finally, sync with mbsync. --expunge-far should be enabled (or its equivalent in the mbsync config file) to ensure that your locally deleted emails get deleted on the far side as well [4].
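One thing that can be checked mechanically is that every folder still has its cur, new, and tmp directories, since missing ones cause errors later. The sketch below assumes that any directory containing a cur subdirectory is a mail folder; check_maildir_layout is a hypothetical helper name.

```shell
# Hypothetical helper: print every folder under $1 that has a cur directory
# but lacks new or tmp. Returns 1 if any such folder was found, else 0.
check_maildir_layout() {
    broken=$(find "$1" -type d -name cur -exec sh -c '
        for cur in "$@"; do
            folder=${cur%/cur}
            [ -d "$folder/new" ] && [ -d "$folder/tmp" ] || printf "%s\n" "$folder"
        done' sh {} +)
    [ -z "$broken" ] && return 0
    printf '%s\n' "$broken"
    return 1
}
```

Running check_maildir_layout /your/maildir prints nothing when the layout is intact.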

$ mbsync -a --expunge-far
C: 0/3  B: 9/11  F: +0/0 *0/0 #0/0  N: +0/0 *0/0 #0/0

And breathe!


  1. Speaking of which, if anyone knows what this header really does, I'd love to know - there's very little information about it online.

  2. This code snippet claims to help: https://gist.github.com/lewisthompson/bb0e0399254c90cf36dba03956bd2ff0 and there's a discussion on this mailing list: https://groups.google.com/g/mu-discuss/c/SqMOZSouh0Y

  3. Edit: rmlint is now available in Alpine Edge.

  4. You might notice that you get some errors about the near side box not existing. If that's the case, you need to recreate those mailboxes in your maildir, and resynchronise the mail into them. I plan to write a future post about this topic.
