splitjob test with different block sizes and different compression programs


Quick links to results for different compression programs: xz -9, xz, bzip2, gzip, summary graph

Big block size vs small block size for jobs

To start parallel invocations of a compression program, splitjob has to divide the input data between those invocations. Dividing the data into many small blocks has the advantage that many compression invocations can run simultaneously (if you have access to enough CPU cores). Unfortunately, small blocks also have a disadvantage: the compression algorithm has less data in which to identify and exploit redundancy.
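
As a rough illustration of how the block size determines the number of compression invocations, the sketch below counts the blocks needed for a given input size. It assumes that the input is cut into consecutive blocks of the requested size with a shorter final block; the function name block_count is made up for this example, and 571576320 bytes is the size of the test file used on this page.

```shell
#!/bin/sh
# block_count TOTAL BLOCK: number of blocks needed to cover TOTAL bytes
# with blocks of BLOCK bytes, assuming the last block may be shorter.
block_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

block_count 571576320 285788160   # prints 2
block_count 571576320 10485760    # prints 55
block_count 571576320 1048576     # prints 546
```

With only two jobs running at a time, a higher block count does not add parallelism; it only gives each compressor invocation less data to work with.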

These tests were made to see how the splitjob block size and the choice of compression algorithm affect the resulting size. This kind of data is nice as it is easily reproducible. Timing performance would also be interesting, and splitjob seems to scale rather well, but timing measurements differ between machines and even between runs on the same machine because of differences in CPU load. As the machine used for the tests had only two CPU cores, splitjob was only used to spawn two jobs (-j 2).

The input data is the file linux-3.15.6.tar (the Linux kernel source), which is 571576320 bytes in size. This particular size is interesting because it can be split in half a number of times. I have also tried the splitjob default block size of 1 MB (1048576 bytes) and the 10 MB (10485760 bytes) block size from the splitjob documentation example, where 10 MB is used to reduce output size overhead.
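
The halved block sizes used in the tests can be derived with a few lines of shell arithmetic. This is only a sketch; the function name halvings is invented for the example.

```shell
#!/bin/sh
# halvings SIZE COUNT: print SIZE, then SIZE halved up to COUNT times,
# stopping early if a size is odd and cannot be halved exactly.
halvings() {
    size=$1
    n=0
    echo "$size"
    while [ "$n" -lt "$2" ] && [ $(( size % 2 )) -eq 0 ]; do
        size=$(( size / 2 ))
        n=$(( n + 1 ))
        echo "$size"
    done
}

# Ten halvings of the input size: from 571576320 (one block)
# down to 558180 (1024 blocks).
halvings 571576320 10
```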

xz -9

| Command | Block size (bytes) | Resulting size (bytes) | Relative size | Comment |
|---|---|---|---|---|
| `xz -9 < linux-3.15.6.tar \| wc -c` | NA | 79420204 | NA | Compression without splitjob |
| `splitjob -j 1 -b 571576320 "xz -9" < linux-3.15.6.tar \| wc -c` | 571576320 | 79420204 | 100% | Only one single process as the block size is the same as the entire file size. |
| `splitjob -j 2 -b 285788160 "xz -9" < linux-3.15.6.tar \| wc -c` | 285788160 | 79545192 | 100.15% | File split into two blocks of equal size which are compressed in parallel. |
| `splitjob -j 2 -b 142894080 "xz -9" < linux-3.15.6.tar \| wc -c` | 142894080 | 79834892 | 100.52% | File split into four blocks of equal size which are compressed two and two in parallel. |
| `splitjob -j 2 -b 71447040 "xz -9" < linux-3.15.6.tar \| wc -c` | 71447040 | 80663856 | 101.56% | ...eight blocks... |
| `splitjob -j 2 -b 35723520 "xz -9" < linux-3.15.6.tar \| wc -c` | 35723520 | 81677840 | 102.84% | ...16 blocks... |
| `splitjob -j 2 -b 17861760 "xz -9" < linux-3.15.6.tar \| wc -c` | 17861760 | 82989044 | 104.49% | ...32 blocks... |
| `splitjob -j 2 -b 10M "xz -9" < linux-3.15.6.tar \| wc -c` | 10485760 | 84392984 | 106.26% | Documentation example 10 MB block size to reduce output size overhead |
| `splitjob -j 2 -b 8930880 "xz -9" < linux-3.15.6.tar \| wc -c` | 8930880 | 84872516 | 106.86% | ...64 blocks... |
| `splitjob -j 2 -b 4465440 "xz -9" < linux-3.15.6.tar \| wc -c` | 4465440 | 86690952 | 109.15% | ...128 blocks... |
| `splitjob -j 2 -b 2232720 "xz -9" < linux-3.15.6.tar \| wc -c` | 2232720 | 89145344 | 112.24% | ...256 blocks... |
| `splitjob -j 2 -b 1116360 "xz -9" < linux-3.15.6.tar \| wc -c` | 1116360 | 92317568 | 116.23% | ...512 blocks... |
| `splitjob -j 2 "xz -9" < linux-3.15.6.tar \| wc -c` | 1048576 | 92645396 | 116.65% | Splitjob default block size |
| `splitjob -j 2 -b 558180 "xz -9" < linux-3.15.6.tar \| wc -c` | 558180 | 96290324 | 121.24% | File split into 1024 blocks of equal size which are compressed two and two in parallel. |
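
The relative size column can be recomputed from the resulting sizes: each value is the resulting size divided by the size obtained with a single block. The sketch below uses integer arithmetic, which truncates instead of rounding; judging from the figures in the table, the published percentages appear to be truncated the same way, but that is an inference and not something stated here. The function name relative_size is made up for the example.

```shell
#!/bin/sh
# relative_size RESULT BASELINE: RESULT as a percentage of BASELINE,
# with two decimals, truncated (not rounded) by integer division.
relative_size() {
    p=$(( $1 * 10000 / $2 ))
    printf '%d.%02d%%\n' $(( p / 100 )) $(( p % 100 ))
}

relative_size 79545192 79420204   # two blocks:  prints 100.15%
relative_size 96290324 79420204   # 1024 blocks: prints 121.24%
```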

xz

| Command | Block size (bytes) | Resulting size (bytes) | Relative size | Comment |
|---|---|---|---|---|
| `xz < linux-3.15.6.tar \| wc -c` | NA | 82223868 | 103.53% compared with xz -9 | Compression without splitjob |
| `splitjob -j 1 -b 571576320 xz < linux-3.15.6.tar \| wc -c` | 571576320 | 82223868 | 100% | Only one single process as the block size is the same as the entire file size. |
| `splitjob -j 2 -b 285788160 xz < linux-3.15.6.tar \| wc -c` | 285788160 | 82269680 | 100.05% | File split into two blocks of equal size which are compressed in parallel. |
| `splitjob -j 2 -b 142894080 xz < linux-3.15.6.tar \| wc -c` | 142894080 | 82345544 | 100.14% | File split into four blocks of equal size which are compressed two and two in parallel. |
| `splitjob -j 2 -b 71447040 xz < linux-3.15.6.tar \| wc -c` | 71447040 | 82496700 | 100.33% | ...eight blocks... |
| `splitjob -j 2 -b 35723520 xz < linux-3.15.6.tar \| wc -c` | 35723520 | 82752564 | 100.64% | ...16 blocks... |
| `splitjob -j 2 -b 17861760 xz < linux-3.15.6.tar \| wc -c` | 17861760 | 83411432 | 101.44% | ...32 blocks... |
| `splitjob -j 2 -b 10M xz < linux-3.15.6.tar \| wc -c` | 10485760 | 84431992 | 102.68% | Documentation example 10 MB block size to reduce output size overhead |
| `splitjob -j 2 -b 8930880 xz < linux-3.15.6.tar \| wc -c` | 8930880 | 84874000 | 103.22% | ...64 blocks... |
| `splitjob -j 2 -b 4465440 xz < linux-3.15.6.tar \| wc -c` | 4465440 | 86691036 | 105.43% | ...128 blocks... |
| `splitjob -j 2 -b 2232720 xz < linux-3.15.6.tar \| wc -c` | 2232720 | 89145316 | 108.41% | ...256 blocks... |
| `splitjob -j 2 -b 1116360 xz < linux-3.15.6.tar \| wc -c` | 1116360 | 92317564 | 112.27% | ...512 blocks... |
| `splitjob -j 2 xz < linux-3.15.6.tar \| wc -c` | 1048576 | 92645396 | 112.67% | Splitjob default block size |
| `splitjob -j 2 -b 558180 xz < linux-3.15.6.tar \| wc -c` | 558180 | 96290324 | 117.10% | File split into 1024 blocks of equal size which are compressed two and two in parallel. |

It might be worth noting that with a splitjob block size of 1 MB or lower, "xz -9" gives exactly the same result as "xz", which is the same as "xz -6". This probably reflects how xz uses different dictionary sizes for different compression levels: according to the xz documentation, the -6 and -9 presets differ in dictionary size (8 MiB versus 64 MiB), which should make no difference when every block is far smaller than either dictionary.

Further testing has shown that 384 MB is a rather good block size for "xz -9". In this particular case, where the input data is only about 545 MB, it does not give a big improvement, but this might still be useful information for people compressing gigabytes or terabytes of data, maybe using multiple machines.

| Command | Block size | Resulting size (bytes) | Relative size | Comment |
|---|---|---|---|---|
| `splitjob -j 2 -b 384M "xz -9" < linux-3.15.6.tar \| wc -c` | 384 MB | 79542212 | 100.15% | File split into two blocks of different sizes which are compressed in parallel. |
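
To see why a 384 MB block size gives two blocks of different sizes, the block layout can be worked out as below. This is a sketch under two assumptions: that the input is cut into consecutive blocks with a shorter final block, and that 384M means 384 * 1048576 bytes (by analogy with 10M = 10485760 bytes). The function name block_layout is made up for the example.

```shell
#!/bin/sh
# block_layout TOTAL BLOCK: print the size of each block when TOTAL bytes
# are cut into consecutive blocks of BLOCK bytes (assumed splitting rule).
block_layout() {
    remaining=$1
    while [ "$remaining" -gt 0 ]; do
        if [ "$remaining" -gt "$2" ]; then
            echo "$2"
            remaining=$(( remaining - $2 ))
        else
            echo "$remaining"
            remaining=0
        fi
    done
}

# Prints 402653184 and then 168923136: one full block and one shorter block.
block_layout 571576320 $(( 384 * 1048576 ))
```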

bzip2

| Command | Block size (bytes) | Resulting size (bytes) | Relative size | Comment |
|---|---|---|---|---|
| `bzip2 < linux-3.15.6.tar \| wc -c` | NA | 95116256 | 115.67% compared with xz; 119.76% compared with xz -9 | Compression without splitjob |
| `splitjob -j 1 -b 571576320 bzip2 < linux-3.15.6.tar \| wc -c` | 571576320 | 95116256 | 100% | Only one single process as the block size is the same as the entire file size. |
| `splitjob -j 2 -b 285788160 bzip2 < linux-3.15.6.tar \| wc -c` | 285788160 | 95162155 | 100.04% | File split into two blocks of equal size which are compressed in parallel. |
| `splitjob -j 2 -b 142894080 bzip2 < linux-3.15.6.tar \| wc -c` | 142894080 | 95179351 | 100.06% | File split into four blocks of equal size which are compressed two and two in parallel. |
| `splitjob -j 2 -b 71447040 bzip2 < linux-3.15.6.tar \| wc -c` | 71447040 | 95210226 | 100.09% | ...eight blocks... |
| `splitjob -j 2 -b 35723520 bzip2 < linux-3.15.6.tar \| wc -c` | 35723520 | 95248377 | 100.13% | ...16 blocks... |
| `splitjob -j 2 -b 17861760 bzip2 < linux-3.15.6.tar \| wc -c` | 17861760 | 95247052 | 100.13% | ...32 blocks... |
| `splitjob -j 2 -b 10M bzip2 < linux-3.15.6.tar \| wc -c` | 10485760 | 95272415 | 100.16% | Documentation example 10 MB block size to reduce output size overhead |
| `splitjob -j 2 -b 8930880 bzip2 < linux-3.15.6.tar \| wc -c` | 8930880 | 95393785 | 100.29% | ...64 blocks... |
| `splitjob -j 2 -b 4465440 bzip2 < linux-3.15.6.tar \| wc -c` | 4465440 | 95472544 | 100.37% | ...128 blocks... |
| `splitjob -j 2 -b 2232720 bzip2 < linux-3.15.6.tar \| wc -c` | 2232720 | 96200225 | 101.13% | ...256 blocks... |
| `splitjob -j 2 -b 1116360 bzip2 < linux-3.15.6.tar \| wc -c` | 1116360 | 97019953 | 102.00% | ...512 blocks... |
| `splitjob -j 2 bzip2 < linux-3.15.6.tar \| wc -c` | 1048576 | 96775192 | 101.74% | Splitjob default block size |
| `splitjob -j 2 -b 558180 bzip2 < linux-3.15.6.tar \| wc -c` | 558180 | 98207747 | 103.25% | File split into 1024 blocks of equal size which are compressed two and two in parallel. |

gzip

| Command | Block size (bytes) | Resulting size (bytes) | Relative size | Comment |
|---|---|---|---|---|
| `gzip < linux-3.15.6.tar \| wc -c` | NA | 121474801 | 127.71% compared with bzip2; 147.73% compared with xz; 152.95% compared with xz -9 | Compression without splitjob |
| `splitjob -j 1 -b 571576320 gzip < linux-3.15.6.tar \| wc -c` | 571576320 | 121474801 | 100% | Only one single process as the block size is the same as the entire file size. |
| `splitjob -j 2 -b 285788160 gzip < linux-3.15.6.tar \| wc -c` | 285788160 | 121479493 | 100.00% | File split into two blocks of equal size which are compressed in parallel. |
| `splitjob -j 2 -b 142894080 gzip < linux-3.15.6.tar \| wc -c` | 142894080 | 121477971 | 100.00% | File split into four blocks of equal size which are compressed two and two in parallel. |
| `splitjob -j 2 -b 71447040 gzip < linux-3.15.6.tar \| wc -c` | 71447040 | 121482091 | 100.00% | ...eight blocks... |
| `splitjob -j 2 -b 35723520 gzip < linux-3.15.6.tar \| wc -c` | 35723520 | 121487334 | 100.01% | ...16 blocks... |
| `splitjob -j 2 -b 17861760 gzip < linux-3.15.6.tar \| wc -c` | 17861760 | 121502238 | 100.02% | ...32 blocks... |
| `splitjob -j 2 -b 10M gzip < linux-3.15.6.tar \| wc -c` | 10485760 | 121521501 | 100.03% | Documentation example 10 MB block size to reduce output size overhead |
| `splitjob -j 2 -b 8930880 gzip < linux-3.15.6.tar \| wc -c` | 8930880 | 121532538 | 100.04% | ...64 blocks... |
| `splitjob -j 2 -b 4465440 gzip < linux-3.15.6.tar \| wc -c` | 4465440 | 121592245 | 100.09% | ...128 blocks... |
| `splitjob -j 2 -b 2232720 gzip < linux-3.15.6.tar \| wc -c` | 2232720 | 121699102 | 100.18% | ...256 blocks... |
| `splitjob -j 2 -b 1116360 gzip < linux-3.15.6.tar \| wc -c` | 1116360 | 121932408 | 100.37% | ...512 blocks... |
| `splitjob -j 2 gzip < linux-3.15.6.tar \| wc -c` | 1048576 | 121957733 | 100.39% | Splitjob default block size |
| `splitjob -j 2 -b 558180 gzip < linux-3.15.6.tar \| wc -c` | 558180 | 122405005 | 100.76% | File split into 1024 blocks of equal size which are compressed two and two in parallel. |
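
All the splitjob rows in the tables follow the same command pattern, so a series of measurements could be repeated on another machine with a small loop like the sketch below. It only uses the splitjob options that appear in the commands on this page (-j and -b). The helper name measure and the INPUT environment variable are inventions of this example, and the loop does nothing unless both splitjob and the input file are available.

```shell
#!/bin/sh
# Input file; INPUT is a hypothetical override, not a splitjob feature.
input=${INPUT:-linux-3.15.6.tar}

# measure BLOCKSIZE COMMAND: compressed size in bytes when the input is
# compressed by COMMAND in two parallel jobs with the given block size.
measure() {
    splitjob -j 2 -b "$1" "$2" < "$input" | wc -c
}

if command -v splitjob >/dev/null 2>&1 && [ -r "$input" ]; then
    for bs in 285788160 142894080 71447040 35723520 17861760; do
        printf '%s %s\n' "$bs" "$(measure "$bs" "xz -9")"
    done
else
    echo "splitjob or $input not available; nothing measured" >&2
fi
```

Each output line holds a block size and the resulting size, which correspond to the second and third columns of the tables.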

Graph with all data
