Removing duplicate emails from an mbsync maildir
2022-11-09
I use mbsync to synchronise my emails between my mail provider and my various computers. After messing around migrating emails from one box to another, I found myself accidentally creating ~40,000 duplicate emails in my maildir. Unfortunately, these duplicates were not all textually identical: since each one was downloaded separately from the mail server, each file has its own unique X-TUID header[1]. As a result, I needed to deduplicate mails that are identical in every way except for this header.
This is a reasonably well-explored problem[2], but the method that worked for me was mostly following this guide, with a couple of tweaks. This post illustrates the process more fully, with some tips to stop the issue recurring.
The general process is:
- Make a copy of the mailbox
- Strip the X-TUID headers out of the copy, so that duplicate emails now have identical contents
- Run rmlint to generate a script to remove the duplicates
- Modify the script to reference files in the original mailbox
- Run the script
But before we do anything, make sure that your maildir is backed up:
$ cp -r /your/maildir /your/maildir_backup
My maildir was made ~11.5GB larger because of the duplicates, so I decided to compress it:
$ tar -czf /your/maildir.tar.gz /your/maildir
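Before going any further, it may be worth confirming that the backup copy is complete. The following is a minimal sketch, not part of the original process; count_files and same_file_count are hypothetical helper names, and the check only compares the number of regular files in each tree rather than their contents.

```shell
# Count the regular files under a directory.
# (count_files is a hypothetical helper, not part of mbsync or rmlint.)
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Succeed only if both trees hold the same number of regular files.
same_file_count() {
    [ "$(count_files "$1")" -eq "$(count_files "$2")" ]
}

# Usage, with the paths from the cp command above:
#   same_file_count /your/maildir /your/maildir_backup && echo "backup looks complete"
```

For a slower but more thorough comparison, diff -rq on the two directories checks file contents as well.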
Identifying the problem
Before we begin, it would be wise to make sure that you're experiencing the same problem - namely, that duplicate emails exist that differ only by the X-TUID header. To do this, find a duplicate email in your inbox that has some distinctive text in the subject/header.
Then grep through your mailbox to find at least two copies of the email:
$ cd /your/maildir
$ grep -r "Assistance Request with Error" *
/your/maildir/1665657241.2812_494.henleybeach,U=20442:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
/your/maildir/1648507477.henleybeach.843910446fe3022ed33fb971c75a5,U=608:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
/your/maildir/1665657340.2812_2461.henleybeach,U=22409:2,S:Subject: Ticket Received - [Chatbot] Assistance Request with Error
...
Now compare the two files:
$ diff -u /your/maildir/1665657241.2812_493.henleybeach,U=20441:2,S /your/maildir/1648507477.henleybeach.843910445baf79c0a7d7a39f56623,U=607:2,S
--- /your/maildir/1665657241.2812_493.henleybeach,U=20441:2,S
+++ /your/maildir/1648507477.henleybeach.843910445baf79c0a7d7a39f56623,U=607:2,S
@@ -154,7 +154,7 @@
Content-Transfer-Encoding: 7bit
sent-on: 2021-03-18 19:55:46 +0000
X-MESSAGEID: z3x/2B0tA3mvKAUXMxEH61eIUaI8ECiXYGyZYNcdvq5NYzwxZHXCLJf/fBn6jmBlW9mpPmAQqNgN9q/bFMFFhGxDdwEXe598YVmmA8NfRUU1LmMdizmDSr81WW6UWakdgJY7A8HFtTP+IB/wp4sArRmUwTFP7xJX9x8yT6mEYrw=
-X-TUID: AiQwG2dVS4Mo
+X-TUID: vdYQtbR960ob
----==_mimepart_6053b04260bcd_2442ad0be6055909039010
Seems about right!
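If you'd rather not eyeball the diff, the same check can be scripted. This is a rough sketch under the assumption that the only difference you care about is the X-TUID header line; same_except_tuid is a hypothetical helper name.

```shell
# Succeed if two message files are identical once lines starting with
# "X-TUID:" are ignored. (same_except_tuid is a hypothetical helper name.)
same_except_tuid() {
    tmp=$(mktemp) || return 2
    # Filter the first file into a temp file, then compare the filtered
    # second file against it on stdin.
    grep -v '^X-TUID:' "$1" > "$tmp"
    grep -v '^X-TUID:' "$2" | cmp -s - "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage: pass the two file names that grep turned up, e.g.
#   same_except_tuid first_copy second_copy && echo "differ only by X-TUID"
```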
Preventing the problem from happening again
Purportedly, running mbsync with CopyArrivalDate yes set in its config file prevents this from occurring.
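To see whether the option is already set, a quick textual check of the config file might look like the sketch below. It assumes the config lives at ~/.mbsyncrc (adjust if yours is elsewhere), and has_copy_arrival_date is a hypothetical helper name. It only looks for the line; it does not tell you which Channel the setting belongs to.

```shell
# Succeed if the given mbsync config file contains a line enabling
# CopyArrivalDate. Purely a text match on "CopyArrivalDate yes".
# (has_copy_arrival_date is a hypothetical helper name.)
has_copy_arrival_date() {
    grep -Eiq '^[[:space:]]*CopyArrivalDate[[:space:]]+yes[[:space:]]*$' "$1"
}

# Usage, assuming the config is at ~/.mbsyncrc:
#   has_copy_arrival_date ~/.mbsyncrc || echo "CopyArrivalDate is not enabled"
```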
Fixing the maildir
Firstly, make a working copy of the mailbox under a unique name, and navigate there.
The unique name will allow us to safely modify the rmlint-generated script later on.
$ cp -r /your/maildir /your/maildir_copyUNIQUENAME
Strip the X-TUID headers out of the working copy with sed:
$ cd /your/maildir_copyUNIQUENAME
$ find ./ -type f -exec sed -i -e '/^X-TUID:/d' {} \;
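A slightly more conservative variant restricts the deletion to the header block of each message (everything up to the first blank line), so that a body line that happens to start with X-TUID: is left alone. This is a sketch assuming GNU sed, where -i processes each file separately; strip_tuid_headers is a hypothetical name. As with the command above, only ever run it on the working copy.

```shell
# Delete X-TUID header lines from every file under a directory, touching
# only the header block (line 1 up to the first blank line) of each file.
# Assumes GNU sed. (strip_tuid_headers is a hypothetical helper name.)
strip_tuid_headers() {
    find "$1" -type f -exec sed -i -e '1,/^$/{ /^X-TUID:/d; }' {} +
}

# Usage, on the working copy only:
#   strip_tuid_headers /your/maildir_copyUNIQUENAME
```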
Run rmlint to check for duplicates.
If you don't have rmlint, now would be a good time to install it with your system package manager.
I'm packaging it for Alpine[3], and it seems fairly available across various distros.
Failing this, you can simply compile it yourself - it's not too hard!
We supply the -g flag to get a nice progress bar.
The --types flag is going to select the default types of files, except duplicate and empty directories.
If we removed empty directories, then integral parts of the mailbox structure (such as empty cur, new, and tmp directories) would be removed and cause errors later.
$ cd /your/maildir_copyUNIQUENAME
$ ~/dev/rmlint/rmlint -g --types="defaults -ed -dd"
▕░░░░░░░░░░░░░░░░░░░▏ Traversing (42480 usable files / 8 + 0 ignored files / folders)
▕░░░░░░░░░░░░░░░░░░░▏ Preprocessing (reduces files to 42450 / found 0 other lint)
▕░░░░░░░░░░░░░░░░░░░▏ Matching (39692 dupes of 2688 originals; 0 B to scan in 0 files, ETA: 1s)
==> In total 42480 files, whereof 39692 are duplicates in 2688 groups.
==> This equals 11.49 GB of duplicates which could be removed.
==> Scanning took in total 13.600s.
Wrote a sh file to: /your/maildir_copyUNIQUENAME/rmlint.sh
Wrote a json file to: /your/maildir_copyUNIQUENAME/rmlint.json
That's a lot of duplicates!
Open up the generated rmlint.sh to check everything's looking sensible, and in particular to check whether the paths are absolute.
If they're relative, move rmlint.sh to /your/maildir.
Otherwise, point the paths at the original mailbox by stripping the unique suffix:
$ sed -i -e 's/_copyUNIQUENAME//g' /your/maildir_copyUNIQUENAME/rmlint.sh
$ mv /your/maildir_copyUNIQUENAME/rmlint.sh /your/maildir/rmlint.sh # this may not be strictly necessary
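It's cheap to confirm that the substitution caught everything before running anything destructive. A minimal sketch, where script_is_retargeted is a hypothetical helper name and the suffix is whatever unique name you chose for the working copy:

```shell
# Succeed if the script exists and no longer mentions the working-copy suffix.
# $1: path to rmlint.sh   $2: the unique suffix used for the working copy
# (script_is_retargeted is a hypothetical helper name.)
script_is_retargeted() {
    [ -f "$1" ] && ! grep -Fq -- "$2" "$1"
}

# Usage:
#   script_is_retargeted /your/maildir/rmlint.sh _copyUNIQUENAME \
#       || echo "rmlint.sh still references the working copy"
```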
Now it's time to deduplicate!
Firstly, do a quick double check, running the script without making any changes.
-x makes sure that rmlint.sh isn't deleted after running.
-n configures this as a dry run only.
$ cd /your/maildir
$ ./rmlint.sh -x -n
# ////////////////////////////////////////////////////////////
# /// This is only a dry run; nothing will be modified! ///
# ////////////////////////////////////////////////////////////
[ 0%] Keeping: /your/maildir/Sent/cur/1665585366.32080_3724.henleybeach,U=2933:2,S
[ 0%] Deleting: /your/maildir/Sent/cur/1665657859.2812_13825.henleybeach,U=8765:2,S
[ 0%] Deleting: /your/maildir/Sent/cur/1665585044.32080_1537.henleybeach,U=746:2,S
[ 0%] Deleting: /your/maildir/Sent/cur/1642635782.5313_1397.henleybeach,U=16:2,S
[ 0%] Deleting: /your/maildir/Sent/cur/1665657827.2812_13096.henleybeach,U=8036:2,S
[ 0%] Deleting: /your/maildir/Sent/cur/1665657888.2812_14554.henleybeach,U=9494:2,S
[ 0%] Deleting: /your/maildir/Sent/cur/1665657788.2812_12367.henleybeach,U=7307:2,S
...
If it seems to be deleting from the right place, run the script for real (you'll need to provide some input to get it to start):
$ cd /your/maildir
$ ./rmlint.sh -x
usage: ./rmlint.sh OPTIONS
OPTIONS:
-h Show this message.
-d Do not ask before running.
-x Keep rmlint.sh; do not autodelete it.
-p Recheck that files are still identical before removing duplicates.
-r Allow deduplication of files on read-only btrfs snapshots. (requires sudo)
-n Do not perform any modifications, just print what would be done. (implies -d and -x)
-c Clean up empty directories while deleting duplicates.
-q Do not show progress.
-k Keep the timestamp of directories when removing duplicates.
-i Ask before deleting each file
This script will delete certain files rmlint found.
It is highly advisable to view the script first!
Rmlint was executed in the following way:
$ rmlint -g --types=defaults -ed -dd
Execute this script with -d to disable this informational message.
Type any string to continue; CTRL-C, Enter or CTRL-D to abort immediately
go!
[ 0%] Keeping: /your/maildir/Sent/cur/1665585366.32080_3724.henleybeach,U=2933:2,S
[ 0%] Deleting: /your/maildir/Sent/cur/1665657859.2812_13825.henleybeach,U=8765:2,S
[ 0%] Deleting: /your/maildir/Sent/cur/1665585044.32080_1537.henleybeach,U=746:2,S
...
[ 99%] Deleting: /your/maildir/Archive/cur/1665657567.2812_7020.henleybeach,U=26968:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665586645.5084_5180.henleybeach,U=15035:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665586298.5084_1245.henleybeach,U=11100:2,S
[ 99%] Deleting: /your/maildir/Archive/cur/1665657372.2812_3083.henleybeach,U=23031:2,S
[100%] Done!
Now would be a good time to check your mailbox to ensure it's all okay!
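One thing that can be checked mechanically is that every folder still has its cur, new, and tmp directories. The sketch below treats any directory containing cur as a mail folder, which is an assumption about the layout; check_maildir_structure is a hypothetical helper name.

```shell
# Print every folder that has a cur directory but is missing new or tmp.
# Prints nothing if the structure looks intact.
# (check_maildir_structure is a hypothetical helper name.)
check_maildir_structure() {
    find "$1" -type d -name cur | while IFS= read -r cur; do
        folder=$(dirname "$cur")
        for sub in new tmp; do
            [ -d "$folder/$sub" ] || echo "missing: $folder/$sub"
        done
    done
}

# Usage:
#   check_maildir_structure /your/maildir
```

Comparing the number of files left against the figures rmlint reported earlier is another quick sanity check.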
Finally, sync with mbsync.
--expunge-far should be enabled (or its equivalent in the mbsync config file) to ensure that your locally deleted emails get deleted on the far side as well[4].
$ mbsync -a --expunge-far
C: 0/3 B: 9/11 F: +0/0 *0/0 #0/0 N: +0/0 *0/0 #0/0
And breathe!
[1] Speaking of which, if anyone knows what this header really does, I'd love to know - there's very little information about it online.
[2] This code snippet claims to help: https://gist.github.com/lewisthompson/bb0e0399254c90cf36dba03956bd2ff0 and there's a discussion on this mailing list: https://groups.google.com/g/mu-discuss/c/SqMOZSouh0Y
[3] Edit: rmlint is now available in Alpine Edge!
[4] You might notice that you get some errors about the near side box not existing. If that's the case, you need to recreate those mailboxes in your maildir, and resynchronise the mail into them. I plan to write a future post about this topic.