splitjob



About

This is a small utility which splits data read from standard input into blocks of a chosen size, sends those blocks to parallel invocations of some program and concatenates the output of those invocations in order. It was inspired by the description of dbzip2 and implements most of its useful features in a simple way which gives more flexibility.

Its primary purpose is to speed up compression, but the program is generic: feel free to use it to parallelize any other CPU-consuming task where it might help.

License

Splitjob is published under the GNU General Public License v2. For more information on the GPL, visit the GNU web site.

Documentation

From the README:

1. General
----------

This program is used to split up data from stdin into blocks which are sent
as input to parallel invocations of commands. The output from those
invocations is then concatenated in the right order and sent to stdout.

Splitting up and parallelizing jobs like this might be useful to speed up
compression using multiple CPU cores or even multiple computers.

For this approach to be useful, the compressed format needs to allow multiple
compressed files to be concatenated. This is the case for gzip, bzip2 and xz.
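
As a quick sanity check of this property (the file names here are only
illustrative), two separately compressed gzip streams can be concatenated and
decompressed as one:

echo first  | gzip  > both.gz
echo second | gzip >> both.gz
gzip -dc both.gz          # prints "first" and "second" as one stream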

2. Installation
---------------

Step 1, unpack the archive:

tar -xJvf splitjob*.tar.xz

Step 2, compile:

cd splitjob-*
make

Step 3, become root and install:

su (and give password)
make install

3. Examples
-----------

Example 1, use multiple local cores:
splitjob -j 4 bzip2 < bigfile > bigfile.bz2

Example 2, use remote machines:
splitjob "ssh host1 gzip" "ssh host2 gzip" < f > f.gz

The above example assumes that ssh is configured to allow logins without
asking for a password. See the manpage for ssh-keygen or do a Google search
for examples on how to accomplish this.
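
A minimal sketch of one common way to set this up with OpenSSH (the host name
and key type are only examples, details vary between systems):

ssh-keygen -t ed25519     # create a key pair, an empty passphrase avoids prompts
ssh-copy-id host1         # install the public key on the remote machine
ssh host1 gzip --version  # should now run without asking for a password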

Example 3, Use bigger blocks to reduce overhead:
splitjob -j 2 -b 10M gzip < file > file.gz

Example 4, parallel decompression:
splitjob -X -r 10 -j 10 -b 384M "xz -d -" < file.xz > file

4. Documentation
----------------

There is a man-page for splitjob, and you will get some help by typing:

splitjob -h

5. Known problems
-----------------

Splitjob does its best to detect and avoid any problems. If some sub command
fails it will by default retry a few times before giving up and exiting with
a non-zero return value. However, like pbzip2, mpibzip2 and bzip2smp, I would
like to say: use at your own risk! Verify the contents of compressed files
before relying on them. If splitjob exits with any return value other than 0,
its output should be discarded!
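
One simple way to do such a verification, sketched here for bzip2 and assuming
the original file is still available to compare against:

splitjob -j 4 bzip2 < bigfile > bigfile.bz2
bzip2 -dc bigfile.bz2 | cmp - bigfile && echo "archive verified"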

During parallel decompression there is a risk that the compressed data itself
contains the magic bytes used to separate compressed blocks. This could happen
by coincidence, but more likely because compression has been used recursively,
e.g. a compressed tar file or disk image file containing files compressed with
the same algorithm. Since version 3.1, when magic bytes are used to separate
blocks and a failure is detected, splitjob attempts to avoid the failure by
merging in the data from the next job and retrying. These attempts might
still end in failure if:

* A single block of compressed data contains more occurrences of the magic
  bytes than the selected number of retries. This will give the error message
  "Failed again, giving up!" and can be avoided by increasing the number of
  retries with the -r switch.

* A job has already sent some of its data to stdout and no longer keeps it
  in its buffer. This will give the error message "Got too much data and
  failed!" and can be avoided by increasing the block size with the -b switch.
	    

Performance wins and drawbacks

How splitjob is able to reduce compression time has been studied in these splitjob performance tests. The drawback of increased compressed size, caused by splitting the input data into small blocks which are compressed independently, has been studied in this splitjob test with different block sizes and different compression programs.
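
If you want a rough idea of the speed-up on your own data, a simple comparison
can be made with the shell's time built-in (the file name is just a placeholder):

time bzip2 < bigfile > /dev/null
time splitjob -j 4 bzip2 < bigfile > /dev/null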

FAQ

For lack of questions, the FAQ does not exist yet. Questions will be answered at the SourceForge splitjob support page.

Screen-shot

Usage:
        splitjob [options] [commands]
Reads from stdin, splits and sends to multiple parallel invocations
of commands and concatenates their output to stdout.
Options:
  -j <n>    Set number of parallel jobs (default number of commands)
  -b <size> Set block size for each job (default 1 MB)
  -m <hex>  Set magic bytes to separate blocks (default none)
  -B        Use bzip2 magic bytes to separate blocks
  -X        Use xz magic bytes to separate blocks
  -L        Use lzip magic bytes to separate blocks
  -r <n>    Set number of retries for failed jobs (default 3)
  -h        Display this help and exit
  -v        Show program version and copyright
Examples:
Use multiple local cores:  splitjob -j 4 bzip2 < bigfile > bigfile.bz2
Use remote machines:       splitjob "ssh h1 gzip" "ssh h2 gzip" < f > f.gz
Big block reduce overhead: splitjob -j 2 -b 10M gzip < file > file.gz
	    

Change-log

From the CHANGELOG:

28/8  2021  3.2        Improved robustness against many false magic bytes
                       within compressed data, as the number of retries may
                       now be bigger than the number of jobs.

16/7  2021  3.1        Improved handling of false magic bytes within compressed
                       data. This often happens at parallel decompression of
                       a compressed archive or disk image file containing
                       files compressed with the same algorithm.

8/4   2020  3.0        Removed predefined support for gzip parallel
                       decompression, as this might fail without it showing
                       in gzip's return value.

9/3   2017  2.2        Added experimental support for parallel decompression

11/11 2017  2.1        Bugfix: Fixed copy-paste error in code which caused
                       writing outside the allocated buffer when output data
                       from the called program was bigger than the input data.
                       This could also happen during compression if the data
                       is already compressed. In theory bugs like this could
                       cause more or less random behavior. In practice this
                       bug has caused corrupted backup archives. Any users of
                       version 2.0 should upgrade to version 2.1 to avoid
                       this bug!

15/10 2017  2.0        Added support for increasing number of jobs with SIGUSR2
                       and decreasing number of jobs with SIGUSR1.

9/10  2017  1.2        Might be able to recover if a sub process fails, even
                       if some data has already been read out from the sub
                       process.

31/1  2015  1.1        Freeing unused RAM in child processes.

14/12 2014  1.0        First stable version. No changes since 0.9.2beta which
                       has been tested for some months without any problems
                       found.

24/8  2014  0.9.2beta  Bugfix: taking care of short reads which could cause
                       random and non-optimal compression performance when
                       blocks sent to compression were not always as big as
                       intended.

24/7  2014  0.9beta    First public release
	    

Download

Current stable version

Current stable version 3.2 is available from SourceForge download. The md5sum of splitjob-3.2.tar.xz is e11d35fced4b34de1ac5196c257d2b20
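
To check a downloaded archive against the checksum listed here, something like this should work on most systems with GNU coreutils installed:

md5sum splitjob-3.2.tar.xz
(compare the printed checksum with e11d35fced4b34de1ac5196c257d2b20)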

Previous development version

The latest development version was 2.2; it is available from SourceForge download. The md5sum of splitjob-2.2.tar.xz is 192ac1d5062d6fe77129e1b9391774ec

Older versions

Old stable version 3.1 is still available from SourceForge download. The md5sum of splitjob-3.1.tar.xz is c3d0b6779cfe54278299d607e500ac86

Old stable version 3.0 is still available from SourceForge download. The md5sum of splitjob-3.0.tar.xz is 888fc4ca36d6b59117814363a2366d65

Old development version 2.1 is still available from SourceForge download. The md5sum of splitjob-2.1.tar.xz is 13452f670b8294e060a1b0de7aa609b1

Old stable but buggy version 2.0 is still available from SourceForge download, but please do not use it! The md5sum of splitjob-2.0.tar.xz is 09acdbf1a79d60f625a7ea3955964c70

Old stable version 1.2 is available from SourceForge download. The md5sum of splitjob-1.2.tar.xz is fc36de81834244f875221aeb61427c1a

Old stable version 1.1 is available from SourceForge download. The md5sum of splitjob-1.1.tar.bz2 is 524569591836405b9ee13f1ae7b8dde0

Old stable version 1.0 is available from SourceForge download. The md5sum of splitjob-1.0.tar.bz2 is cb3eb993b69dd1821c02fe3bc87d7ab8

Version 0.9.2beta is available from SourceForge download. The md5sum of splitjob-0.9.2beta.tar.bz2 is af7001e9e5680da24a214dafa4ae68e4

Version 0.9beta is available from SourceForge download. The md5sum of splitjob-0.9beta.tar.bz2 is d79cd625e24f7f3d00b1ac726c65b459

Contact

Bug reports

Bugs should be reported to the SourceForge Bug Tracking System.

Feature requests

With my limited time I make no promises, but requests for new features can be posted at the SourceForge splitjob feature request page. Feature requests are welcome, but even more welcome are implemented features contributed as patches at the SourceForge splitjob patches page.

Support

Questions will be answered at the SourceForge splitjob support page.

Email

It was once possible to contact me, Henrik Carlqvist, by my SourceForge email. Unfortunately that email address is no longer usable because of large amounts of spam.