[Imap-uw] Mix filesize

Per Foreby perf at ddg.lth.se
Sat Sep 15 10:27:14 PDT 2007


On Fri, 14 Sep 2007, Mark Crispin wrote:

> I think that this discussion is very useful, and I hope that it continues. 
> I'm getting a lot of good information from all this feedback, and I hope to 
> get a lot more.

Here's a little more feedback. I checked all mix folders on my server.

2457 users have mix folders, 4966 still use mbx, and 849 have empty mail 
folders (never used).

Of the mix users, 1198 are users with large folders which I have 
converted manually. The others are new accounts, created after I made mix 
the default.

MIXDATAROLL has been 4M for most of the time.

Average number of messages per mix data file:

$ find /mail -name '.mix4*' |wc -l
53502
$ find /mail -name .mixindex -print0 |xargs -0 cat|wc -l
907134
$ bc -q
scale=1
907134/53502
16.9

The new users probably only have one mix file which isn't full, so I did 
a new search only looking at old users (those with large mailboxes):

$ find /mail -wholename '/mail/?/??07???' -prune -o -name '.mix4*' |wc -l
49744
$ find /mail -wholename '/mail/?/??07???' -prune -o -name .mixindex -print0 \
|xargs -0 cat|wc -l
890787
$ bc -q
scale=1
890787/49744
17.9

Not a big difference, so there is obviously no need to exclude the new 
users from the further tests.

This is the average size of a mix data file:

$ find /mail -name '.mix4*' -print0 > /tmp/files
$ du -kc --files0-from=/tmp/files|grep total
161020908       total
$ bc -q
161020908/53502
3009
161020908/907134
177
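
For anyone who would rather re-run the arithmetic outside bc, here is a 
small Python sketch of the same divisions. The helper name bc_divide is 
made up for this example; it only mimics the way bc truncates (rather 
than rounds) at the chosen scale. The totals are the ones printed by the 
commands above, with the du total taken in kB.

```python
def bc_divide(numerator, denominator, scale=0):
    """Mimic bc's truncating division for non-negative integers.

    Hypothetical helper, named for this example only.
    """
    scaled = numerator * 10**scale // denominator
    return scaled / 10**scale if scale else scaled


# Totals copied from the find/du output above.
data_files = 53502          # .mix4* files, all users
messages = 907134           # .mixindex lines, all users
old_data_files = 49744      # same counts, old users only
old_messages = 890787
total_kb = 161020908        # "du -kc" total, in kB

print(bc_divide(messages, data_files, scale=1))          # messages per file
print(bc_divide(old_messages, old_data_files, scale=1))  # old users only
print(bc_divide(total_kb, data_files))                   # kB per data file
print(bc_divide(total_kb, messages))                     # kB per message
```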

That's a 3 MB average file size and a 177 kB average message size. So 
what does the size distribution look like?

This is looking at the sizes of the mix data files:

$ ./freq.pl
0             10323   19.3%
0-20k          4314    8.1%
20-50k         1555    2.9%
50-100k        1501    2.8%
100-200k       1750    3.3%
200-500k       2599    4.9%
500-1000k      3062    5.7%
1000-2000k     3489    6.5%
2000-5000k    14941   27.9%
5000-10000k    7562   14.1%
> 10000k       2414    4.5%
TOTAL         53510  100.0%

(I don't know why I have so many zero-sized files; maybe the conversion 
from mbx with mailutil and later mixcvt created them.)
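
The freq.pl script itself is not included in this mail. As a rough idea 
of the kind of bucketing involved, here is a minimal Python sketch (it 
is not the Perl script). The assumptions are mine: sizes are given in 
bytes, 1k means 1024 bytes, the upper bound of each range is inclusive, 
and the apparent file size (st_size) is used rather than the disk usage 
that du reports. All function names are hypothetical.

```python
import bisect
import fnmatch
import os
from collections import Counter

# Upper edges of the size ranges in kB, mirroring the table above.
EDGES_KB = [20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
LABELS = (["0"]
          + [f"{lo}-{hi}k" for lo, hi in zip([0] + EDGES_KB, EDGES_KB)]
          + [f"> {EDGES_KB[-1]}k"])


def bucket_label(size_bytes):
    """Return the range label for one size (upper bound inclusive)."""
    if size_bytes == 0:
        return "0"
    size_kb = size_bytes / 1024
    return LABELS[1 + bisect.bisect_left(EDGES_KB, size_kb)]


def histogram(sizes):
    """Return (label, count, percent) rows in table order."""
    counts = Counter(bucket_label(size) for size in sizes)
    total = sum(counts.values())
    return [(label, counts[label],
             100.0 * counts[label] / total if total else 0.0)
            for label in LABELS]


def mix_data_file_sizes(root):
    """Yield the size of every file under root named like '.mix4*'."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, ".mix4*"):
            yield os.lstat(os.path.join(dirpath, name)).st_size


def print_table(rows):
    """Print the rows plus a TOTAL line, roughly like the table above."""
    for label, count, percent in rows:
        print(f"{label:<12}{count:>8}{percent:>8.1f}%")
    print(f"{'TOTAL':<12}{sum(count for _, count, _ in rows):>8}")
```

Something like print_table(histogram(mix_data_file_sizes("/mail"))) 
would then print one row per range. The same histogram function works 
for message sizes or messages-per-file counts if the edges are changed.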

The message size frequencies:

$ ./freq2.pl
0                  0    0.0%
0-20k         685004   76.6%
20-50k         64602    7.2%
50-100k        39487    4.4%
100-200k       28513    3.2%
200-500k       28296    3.2%
500-1000k      16658    1.9%
1000-2000k     12824    1.4%
2000-5000k     11541    1.3%
5000-10000k     5291    0.6%
> 10000k        1943    0.2%
TOTAL         894159  100.0%

Now it's beginning to get interesting. If we have so many small emails, 
maybe the filesize should be kept small, leaving the larger messages in 
data files on their own. On the other hand, if small and large messages 
are mixed randomly, maybe the filesize should be larger. Let's look at 
the number of messages per file:

$ ./freq3.pl
1               9726   22.5%
2               3321    7.7%
3-5             6456   15.0%
6-10            6019   13.9%
11-20           6253   14.5%
21-50           7001   16.2%
51-100          3002    7.0%
101-200         1042    2.4%
201-500          277    0.6%
> 500             70    0.2%
TOTAL          43167  100.0%

I don't know how interesting this is, but it shows that most files hold 
a small number of messages, because the larger emails are randomly spread 
across the files. (Remember the average of 16.9 messages/file above.)

If anyone wants to check their own servers, the three simple perl scripts 
used above are available here:

   http://www.ddg.lth.se/perf/mix/freq.pl
   http://www.ddg.lth.se/perf/mix/freq2.pl
   http://www.ddg.lth.se/perf/mix/freq3.pl

Conclusion: A relatively small number of large messages increases the 
number of mix data files. This will hurt filesystem performance. If the 
filesize is increased to avoid this, it will instead lead to longer 
backup times.

Suggestion: MIXDATAROLL shouldn't be larger than 1 MB. Put all small 
messages (less than 20 kB, for example) in separate data files, and use 
the current placement algorithm for larger messages. Maybe .mixmeta could 
be used to keep track of files that should contain only small messages.
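
To make the suggestion a bit more concrete, here is a toy Python model of 
the two layouts. It is only a sketch under my own assumptions: a data 
file keeps accepting messages until its size has reached the roll limit 
and the next message then starts a new file, which may not match what 
c-client really does. The names pack and pack_split, and the toy mailbox 
of 10 kB messages with every tenth message being 5 MB, are invented for 
the example.

```python
KB = 1024
MB = 1024 * KB


def pack(sizes, roll_limit):
    """Place messages in arrival order; roll to a new file once the
    current file has reached roll_limit (assumed rollover rule)."""
    files, current, current_size = [], [], 0
    for size in sizes:
        if current and current_size >= roll_limit:
            files.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        files.append(current)
    return files


def pack_split(sizes, roll_limit, small_limit):
    """Suggested layout: messages below small_limit go to their own
    series of data files, everything else is packed as before."""
    small = [size for size in sizes if size < small_limit]
    large = [size for size in sizes if size >= small_limit]
    return pack(small, roll_limit), pack(large, roll_limit)


# Toy mailbox: 100 messages, every tenth one is a 5 MB attachment.
sizes = [5 * MB if i % 10 == 9 else 10 * KB for i in range(100)]

mixed = pack(sizes, 1 * MB)
small_files, large_files = pack_split(sizes, 1 * MB, 20 * KB)

print("mixed:", len(mixed), "files,",
      [len(f) for f in mixed], "messages each")
print("split:", len(small_files), "small-message files,",
      len(large_files), "large-message files")
```

In this toy case the mixed layout gives 10 files of about 5 MB with 10 
messages each, while the split layout keeps all 90 small messages in one 
file of under 1 MB and gives each large message a file of its own. 
Whether that helps on a real server depends on the real message mix.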

/Per



