[Imap-protocol] BODY.PEEK[section]<origin.size> FETCH response
brong at fastmail.fm
Tue Nov 1 00:56:16 PDT 2011
On Tuesday, November 01, 2011 9:26 AM, "Timo Sirainen" <tss at iki.fi> wrote:
> On 1.11.2011, at 9.00, Bron Gondwana wrote:
> > On Tue, Nov 01, 2011 at 08:40:57AM +0200, Timo Sirainen wrote:
> >> On 1.11.2011, at 8.31, Bron Gondwana wrote:
> >>> On Tue, Nov 01, 2011 at 08:06:29AM +0200, Timo Sirainen wrote:
> >>>> Dovecot also stores messages with LFs and has no trouble exporting them as if they were CRLFs. I think the only actual (performance) problem with it is .. well, actually the topic of this thread :) A partial fetch from a non-zero offset requires some scanning to find out the LF-only-offset. But luckily all clients just fetch the blocks in increasing order from zero offset, so this isn't such an important problem.
> >>> How do you handle a message with a mix of LF and CRLF in the original?
> >> "Correctly." :)
> > Er - by which you mean that you always return the exact bytes you were given?
> I don't think LF vs. CRLF have any special meaning in email data, they're both simply newlines. So Dovecot doesn't try to preserve them. They're both converted to newlines anyway (LFs or CRLFs depending on context). Although I did initially wonder about supporting binary message bodies, but never bothered with it.
The real issue is anything that checksums the email contents,
mainly digital signatures. Luckily, the clients that actually
bother with digital signatures mostly also bother with getting
the other parts right.
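
To illustrate why checksums care, here is a minimal Python sketch. The sample message and the helper name `digest` are made up for illustration. It only shows the raw-bytes case: the same message hashes differently once its bare LFs are rewritten as CRLFs. Real signature schemes such as DKIM define their own canonicalization rules, which this does not model.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw message bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical message as stored with LF-only line endings.
original = b"Subject: test\n\nline one\nline two\n"

# The same message with every line ending rewritten as CRLF.
normalized = original.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

# The text is "the same" to a human, but the bytes (and so the hash) differ.
print(digest(original) == digest(normalized))
```

So a server that does not hand back the exact bytes it was given can invalidate any checksum computed over those bytes.
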
> >> Basically everywhere there are message (part) sizes, I store the "physical size" (exactly as it is stored on disk, with or without CRs) and the "virtual size" (all LFs converted to CRLFs). If the physical size equals the virtual size, I'll do some extra optimizations like being able to seek to the wanted offset immediately or use sendfile() to send the message.
> > Sounds to me like that's enough benefit to store it all CRLFs in itself.
> > 1/65 of storage space vs seek and sendfile.
> Well, that's why it's an option :) But typically I've noticed that I/O is the problem, not CPU, so sendfile isn't all that useful. The seeking is more of a theoretical problem. Normally when clients fetch partial data they start from offset 0, so no seeking needed. The next block starts from where the previous block ended, which Dovecot remembers and continues again without seeking. And so on. So even if LFs save only a little disk space and disk I/O, I figured it's better than nothing.
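
For anyone following along, here is a rough Python sketch of the physical/virtual size idea quoted above. It is not Dovecot's code; the function names and the "landed between CR and LF" flag are assumptions made for illustration. It shows why a partial fetch from a non-zero offset needs a scan when the message is stored with bare LFs, and why no scan is needed when the two sizes match.

```python
from typing import Tuple


def virtual_size(physical: bytes) -> int:
    """Size of the message if every bare LF were sent as CRLF."""
    bare_lfs = physical.count(b"\n") - physical.count(b"\r\n")
    return len(physical) + bare_lfs


def physical_offset(physical: bytes, virtual_offset: int) -> Tuple[int, bool]:
    """Map an offset in the CRLF view to an offset in the stored bytes.

    Returns (offset, mid_crlf). mid_crlf is True when the requested
    offset falls between the implied CR and the stored LF, so the
    caller should send the LF at that offset without a CR before it.
    """
    if virtual_size(physical) == len(physical):
        # Fast path: no bare LFs, so both views are byte-identical.
        return min(virtual_offset, len(physical)), False

    phys = virt = 0
    while phys < len(physical) and virt < virtual_offset:
        is_bare_lf = physical[phys] == 0x0A and (
            phys == 0 or physical[phys - 1] != 0x0D
        )
        width = 2 if is_bare_lf else 1  # a bare LF counts as CR + LF
        if virt + width > virtual_offset:
            return phys, True
        virt += width
        phys += 1
    return phys, False


stored = b"ab\ncd\n"  # 6 bytes on disk, sent as b"ab\r\ncd\r\n"
print(virtual_size(stored), physical_offset(stored, 4))
```

A server that remembers the (physical, virtual) pair where the previous block ended could start the loop from there instead of from zero, which is the sequential-fetch case described in the quoted text.
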
Agreed - IO is our biggest issue by far. Of course, we're not Google. We throw dual CPUs
in our IMAP boxes just because you need two CPUs to drive 48GB of RAM happily, and that's
our current sweet spot: US$13k machines with two SSDs, twelve 2TB SATA hard disks and 48GB
of RAM with a pair of low-end CPUs. Along with battery-backed RAID controllers to take the
edge off the slow disks, it works pretty well.
Certainly anything which only hits the index files is blindingly fast! But if I were going
to store messages cleverly to save disk space, it would be zlib with a pre-optimised
dictionary. We already do that for our backups - they're tar.gz files. I wrote a talk a
little while back about how we use a pure-Perl library which can repack tar files in a
single streaming read/write to get good compression over time.
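
As a rough idea of what "zlib with a preset dictionary" can look like, here is a minimal Python sketch using the standard `zlib` module's `zdict` parameter. It is not anything we actually run; the dictionary contents, the sample message and the function names are made up for illustration. A real dictionary would be built from byte strings that are common across a large sample of mail.

```python
import zlib

# Hypothetical dictionary: byte strings that tend to show up in most messages.
PRESET_DICT = (
    b"Received: from \r\nReturn-Path: <\r\nMIME-Version: 1.0\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
)


def compress_message(message: bytes) -> bytes:
    """Deflate one message, priming the compressor with PRESET_DICT."""
    compressor = zlib.compressobj(level=9, zdict=PRESET_DICT)
    return compressor.compress(message) + compressor.flush()


def decompress_message(blob: bytes) -> bytes:
    """Inflate a message; the same dictionary must be supplied."""
    decompressor = zlib.decompressobj(zdict=PRESET_DICT)
    return decompressor.decompress(blob) + decompressor.flush()


sample = (
    b"Return-Path: <someone@example.com>\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\nhi\r\n"
)
packed = compress_message(sample)
print(len(sample), len(packed), len(zlib.compress(sample, 9)))
```

The dictionary mostly helps small messages, where plain deflate has not yet seen enough data to find repeats. The catch is that the dictionary has to be kept forever, since nothing compressed with it can be read back without it.
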
> >> Although a mix of LFs and CRLFs in the same message shouldn't normally appear in mail files.
> > Most often seen with headers, or between parts. The most ugly cases
> > being differences between the mime-headers of a part, and the content
> > of said part.
> Coming from where? SMTP? IMAP APPENDs? I've never noticed, because Dovecot handles them silently.
Most rubbish comes in via SMTP - we handle it with a bunch of cleanups in our LMTP proxy
before Cyrus sees it - but it's amazing what IMAP clients will try to give you too!
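
For what it's worth, spotting the mixed-line-ending case is easy to sketch. The Python below is only an illustration, not our LMTP proxy code; the function names are made up, and treating a bare CR as a line ending is an assumption.

```python
import re
from typing import Dict


def line_ending_counts(data: bytes) -> Dict[str, int]:
    """Count CRLF pairs, bare LFs and bare CRs in a message."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "bare_lf": data.count(b"\n") - crlf,
        "bare_cr": data.count(b"\r") - crlf,
    }


def normalize_to_crlf(data: bytes) -> bytes:
    """Rewrite every line ending (CRLF, bare CR or bare LF) as CRLF."""
    return re.sub(rb"\r\n|\r|\n", b"\r\n", data)


# Hypothetical message: CRLF in one header, bare LF elsewhere.
mixed = b"A: 1\r\nB: 2\n\r\nbody\n"
print(line_ending_counts(mixed))
print(normalize_to_crlf(mixed))
```

Normalizing like this changes the bytes, so it has the same effect on checksums and signatures as any other line-ending conversion.
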
brong at fastmail.fm