[Imap-uw] Mix filesize
Per Foreby
perf at ddg.lth.se
Sat Sep 15 10:27:14 PDT 2007
On Fri, 14 Sep 2007, Mark Crispin wrote:
> I think that this discussion is very useful, and I hope that it continues.
> I'm getting a lot of good information from all this feedback, and I hope to
> get a lot more.
Here's a little more feedback. I checked all mix folders on my server.
2457 users have mix folders, 4966 still use mbx, 849 have empty mail
folders (never used).
Out of the mix users, 1198 are users with large folders which I have
converted manually. The others are new accounts, created after I made mix
the default.
MIXDATAROLL has been 4M for most of the time.
Average number of messages per mix data file:
$ find /mail -name '.mix4*' |wc -l
53502
$ find /mail -name .mixindex -print0 |xargs -0 cat|wc -l
907134
$ bc -q
scale=1
907134/53502
16.9
The new users probably only have one mix file which isn't full, so I did
a new search only looking at old users (those with large mailboxes):
$ find /mail -wholename '/mail/?/??07???' -prune -o -name '.mix4*' |wc -l
49744
$ find /mail -wholename '/mail/?/??07???' -prune -o -name .mixindex -print0 \
|xargs -0 cat|wc -l
890787
$ bc -q
scale=1
890787/49744
17.9
Not a big difference, so there is obviously no need to exclude the new
users from the further tests.
This is the average size of a mix data file:
$ find /mail -name '.mix4*' -print0 > /tmp/files
$ du -kc --files0-from=/tmp/files|grep total
161020908 total
$ bc -q
161020908/53502
3009
161020908/907134
177
That's 3 MB average file size and 177 kB average message size. So
what's the frequency distribution?
This is looking at the sizes of the mix data files:
$ ./freq.pl
0              10323   19.3%
0-20k           4314    8.1%
20-50k          1555    2.9%
50-100k         1501    2.8%
100-200k        1750    3.3%
200-500k        2599    4.9%
500-1000k       3062    5.7%
1000-2000k      3489    6.5%
2000-5000k     14941   27.9%
5000-10000k     7562   14.1%
> 10000k        2414    4.5%
TOTAL          53510  100.0%
(I don't know why I have so many zero-sized files; maybe the conversion
from mbx with mailutil and later mixcvt created these.)
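A minimal sketch of this kind of size bucketing, written in Python rather than Perl. It is not the script used for the numbers in this post. The bucket boundaries (1 kB = 1024 bytes, upper bound inclusive) and the function names are assumptions made for illustration.

```python
import bisect

# Upper bucket edges in kB, matching the ranges shown in the tables.
# Assumption: 1 kB = 1024 bytes and each upper bound is inclusive.
EDGES_KB = [20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
LABELS = ["0", "0-20k", "20-50k", "50-100k", "100-200k", "200-500k",
          "500-1000k", "1000-2000k", "2000-5000k", "5000-10000k",
          "> 10000k"]


def bucket_label(size_bytes):
    """Return the bucket label for a size given in bytes."""
    if size_bytes == 0:
        return "0"
    size_kb = size_bytes / 1024
    # bisect_left counts the edges strictly below size_kb, so a size
    # equal to an edge stays in the lower bucket.
    return LABELS[bisect.bisect_left(EDGES_KB, size_kb) + 1]


def histogram(sizes):
    """Count how many sizes fall into each bucket."""
    counts = {label: 0 for label in LABELS}
    for size in sizes:
        counts[bucket_label(size)] += 1
    return counts


def format_table(counts):
    """Render the counts as label, count and percentage columns."""
    total = sum(counts.values())
    lines = []
    for label in LABELS:
        n = counts[label]
        pct = 100.0 * n / total if total else 0.0
        lines.append(f"{label:<13}{n:>7}{pct:>7.1f}%")
    lines.append(f"{'TOTAL':<13}{total:>7}{100.0:>7.1f}%")
    return "\n".join(lines)
```

The sizes could come from os.path.getsize() on each data file, or from per-message sizes for the message table.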
The message size frequencies:
$ ./freq2.pl
0                  0    0.0%
0-20k         685004   76.6%
20-50k         64602    7.2%
50-100k        39487    4.4%
100-200k       28513    3.2%
200-500k       28296    3.2%
500-1000k      16658    1.9%
1000-2000k     12824    1.4%
2000-5000k     11541    1.3%
5000-10000k     5291    0.6%
> 10000k        1943    0.2%
TOTAL         894159  100.0%
Now it's beginning to get interesting. If we have so many small emails,
maybe the file size should be kept small, leaving the larger messages in
data files of their own. On the other hand, if small and large messages
are mixed randomly, maybe the file size should be larger. Let's look at
the number of messages per file:
$ ./freq3.pl
1               9726   22.5%
2               3321    7.7%
3-5             6456   15.0%
6-10            6019   13.9%
11-20           6253   14.5%
21-50           7001   16.2%
51-100          3002    7.0%
101-200         1042    2.4%
201-500          277    0.6%
> 500             70    0.2%
TOTAL          43167  100.0%
I don't know how interesting this is, but it shows that most files hold
only a small number of messages, because the larger emails are randomly
spread across the files. (Remember the average of 16.9 messages/file above.)
If anyone wants to check their own servers, the three simple Perl scripts
used above are available here:
http://www.ddg.lth.se/perf/mix/freq.pl
http://www.ddg.lth.se/perf/mix/freq2.pl
http://www.ddg.lth.se/perf/mix/freq3.pl
Conclusion: A relatively small number of large messages increases the
number of mix data files. This will hurt filesystem performance. If the
file size is increased to avoid this, it will instead lead to longer
backup times.
Suggestion: MIXDATAROLL shouldn't be larger than 1 MB. Put all small
messages (less than 20 kB, for example) in separate data files, and use
the current placement algorithm for larger messages. Maybe .mixmeta could
be used to keep track of files that should contain only small messages.
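To make the suggestion concrete, here is a rough Python sketch that compares how messages are distributed over data files with and without a separate stream for small messages. It does not model the real mix driver. The rollover rule (start a new file when a non-empty file would grow past the roll size) is an assumption, and the names pack, pack_split and small_limit are made up for illustration.

```python
KB = 1024


def pack(sizes, roll):
    """Append messages in order; return the message count per data file.

    Assumed rollover rule: start a new file when the current one is
    non-empty and adding the next message would exceed `roll` bytes.
    """
    files = []  # each entry is [bytes_used, message_count]
    for size in sizes:
        if not files or (files[-1][0] > 0 and files[-1][0] + size > roll):
            files.append([0, 0])
        files[-1][0] += size
        files[-1][1] += 1
    return [count for _, count in files]


def pack_split(sizes, roll, small_limit):
    """Pack messages below `small_limit` separately from the rest.

    Returns (counts for small-message files, counts for other files).
    """
    small = [s for s in sizes if s < small_limit]
    large = [s for s in sizes if s >= small_limit]
    return pack(small, roll), pack(large, roll)


if __name__ == "__main__":
    # Toy input: small and large messages arriving alternately.
    sizes = [5 * KB, 900 * KB] * 4
    print(pack(sizes, 1024 * KB))
    print(pack_split(sizes, 1024 * KB, 20 * KB))
```

With this toy input the single stream spreads the small messages over several files, while the split keeps all of them in one file. Feeding in real message sizes in arrival order would show whether the same holds for an actual mailbox.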
/Per