splitjob



About

This is a small utility that splits its input into blocks of a chosen size, sends those blocks to parallel invocations of some program, and concatenates the output of those invocations. It was inspired by the description of dbzip2 and implements most of its useful features in a simple way that gives more flexibility.

Its primary purpose is to speed up compression, but feel free to use this generic program to parallelize any other CPU-intensive task where it might help.

License

Splitjob is published under the GNU General Public License, version 2. For more information on the GPL, visit the GNU website.

Documentation

From the README:

1. General
----------

This program is used to split up data from stdin into blocks which are sent
as input to parallel invocations of commands. The output from those
invocations is then concatenated in the right order and sent to stdout.
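
As an illustration only, the split, run and concatenate idea can be sketched in
plain shell. This is not how splitjob itself is implemented: the function name
parallel_blocks_sketch is hypothetical, every block is written to a temporary
directory first, all commands are started at once instead of being limited to
a number of jobs, and failed commands are not retried.

```shell
# parallel_blocks_sketch BLOCKSIZE COMMAND [ARG...]
# Hypothetical helper, not part of splitjob: cut stdin into BLOCKSIZE-byte
# pieces, run COMMAND on every piece at once, and print the results in order.
parallel_blocks_sketch() (
    blocksize=$1
    shift
    LC_ALL=C; export LC_ALL            # make glob order match split's suffix order
    dir=$(mktemp -d) || exit 1
    # split names the pieces in.aa, in.ab, ... (two-letter suffixes by default,
    # so this sketch only handles a limited number of pieces).
    split -b "$blocksize" - "$dir/in."
    for piece in "$dir"/in.*; do
        [ -e "$piece" ] || continue    # empty input: split produced no pieces
        "$@" < "$piece" > "$dir/out.${piece##*.}" &
    done
    wait                               # let every background command finish
    for piece in "$dir"/out.*; do
        [ -e "$piece" ] && cat "$piece"
    done
    rm -rf "$dir"
)
```

For example, `printf 'abcdefghijklmnopqrstuvwxyz' | parallel_blocks_sketch 10 tr a-z A-Z`
runs three copies of tr, one per block, and prints the upper-case alphabet in
the original order.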

Splitting up and parallelizing jobs like this might be useful to speed up
compression using multiple CPU cores or even multiple computers.

For this approach to be useful, the compressed format needs to allow multiple
compressed files to be concatenated. This is the case for gzip, bzip2 and xz.
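
The concatenation property can be checked with nothing but gzip and cat. The
sketch below uses a hypothetical function name, gzip_concat_demo, and two tiny
made-up blocks; bzip2 and xz should behave the same way if they are installed.

```shell
# gzip_concat_demo: compress two blocks separately, join the compressed
# files with cat, and print what gzip recovers from the joined file.
gzip_concat_demo() (
    dir=$(mktemp -d) || exit 1
    printf 'first block\n'  | gzip -c > "$dir/1.gz"
    printf 'second block\n' | gzip -c > "$dir/2.gz"
    cat "$dir/1.gz" "$dir/2.gz" > "$dir/joined.gz"
    gzip -dc "$dir/joined.gz"
    rm -rf "$dir"
)

gzip_concat_demo
# prints:
#   first block
#   second block
```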

2. Installation
---------------

Step 1, unpack the archive:

tar -xzvf splitjob*.tgz

Step 2, compile:

cd splitjob-*
make

Step 3, become root and install:

su (and give password)
make install

3. Examples
-----------

Example 1, use multiple local cores:
splitjob -j 4 bzip2 < bigfile > bigfile.bz2

Example 2, use remote machines:
splitjob "ssh host1 gzip" "ssh host2 gzip" < f > f.gz

The above example assumes that ssh is configured to allow logins without asking
for a password. See the manpage for ssh-keygen, or search the web for examples
of how to accomplish this.

Example 3, use bigger blocks to reduce overhead:
splitjob -j 2 -b 10M gzip < file > file.gz

4. Documentation
----------------

There is a man-page for splitjob, and you will get some help by typing:

splitjob -h

5. Known problems
-----------------

Splitjob does its best to detect and avoid any problems. If a subcommand fails,
it will by default retry a few times before giving up and exiting with a
non-zero return value. However, like pbzip2, mpibzip2 and bzip2smp, I would
like to say: Use at your own risk! Verify the contents of compressed files
before relying on them. If splitjob exits with any return value other than 0,
its output should be discarded!
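
One way to follow this advice is a small wrapper that keeps the output only
when the command exits with 0 and the result decompresses back to the original
data. This is a sketch under assumptions: the name compress_checked is
hypothetical, and it assumes gzip-format output (swap in bzip2 -dc or xz -dc
for other formats).

```shell
# compress_checked SRC DST COMMAND [ARG...]
# Hypothetical helper: run COMMAND with SRC on stdin and keep DST only if
# COMMAND exits 0 and DST decompresses (as gzip) to exactly the bytes in SRC.
compress_checked() {
    src=$1
    dst=$2
    shift 2
    if "$@" < "$src" > "$dst.tmp" &&
       gzip -dc "$dst.tmp" 2>/dev/null | cmp -s - "$src"; then
        mv "$dst.tmp" "$dst"
    else
        rm -f "$dst.tmp"       # discard output from a failed or corrupt run
        return 1
    fi
}
```

It could be used as `compress_checked file file.gz splitjob -j 2 -b 10M gzip`.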
	    

Performance wins and drawbacks

How much splitjob can reduce compression time has been studied in these splitjob performance tests. The drawback is a larger compressed size, because the input is split into small blocks that are compressed independently. This has been studied in this splitjob test with different block sizes and different compression programs.
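
The size penalty can be estimated without splitjob. The sketch below is only
an approximation of what happens to each block: the function name
blockwise_gzip_size is hypothetical, and it uses the standard split and gzip
tools rather than splitjob itself.

```shell
# blockwise_gzip_size FILE BLOCKSIZE
# Hypothetical helper: print the total number of bytes produced when FILE is
# cut into BLOCKSIZE-byte pieces and every piece is gzipped on its own.
blockwise_gzip_size() (
    dir=$(mktemp -d) || exit 1
    # split uses two-letter suffixes by default, so keep the number of
    # pieces moderate (a very small BLOCKSIZE on a big FILE will fail).
    split -b "$2" "$1" "$dir/blk."
    for piece in "$dir"/blk.*; do
        [ -e "$piece" ] && gzip -c "$piece"
    done | wc -c | tr -d ' '
    rm -rf "$dir"
)
```

Calling it with a few different block sizes on the same file, for example
`blockwise_gzip_size file 100000` and `blockwise_gzip_size file 10000000`,
gives a rough idea of how much smaller blocks cost for that kind of data.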

FAQ

For lack of questions, the FAQ doesn't exist yet. Questions will be answered at the SourceForge splitjob support page.

Screen-shot

Usage:
        splitjob [options] [commands]
Reads from stdin, splits and sends to multiple parallel invocations
of commands and concatenates their output to stdout.
Options:
  -j  Set number of parallel jobs (default number of commands)
  -b  Set block size for each job (default 1 MB)
  -r  Set number of retries for failed jobs (default 3)
  -h  Display this help and exit
  -v  Show program version and copyright
Examples:
Use multiple local cores:  splitjob -j 4 bzip2 < bigfile > bigfile.bz2
Use remote machines:       splitjob "ssh h1 gzip" "ssh h2 gzip" < f > f.gz
Big block reduce overhead: splitjob -j 2 -b 10M gzip < file > file.gz
	    

Change-log

From the CHANGELOG:

31/1  2015  1.1        Freeing unused RAM in child processes.

14/12 2014  1.0        First stable version. No changes since 0.9.2beta which
                       has been tested for some months without any problems
                       found.
24/8  2014  0.9.2beta  Bugfix: taking care of short reads, which could cause
                       random and suboptimal compression performance when
                       blocks sent to compression were not always as big as
                       intended.
24/7  2014  0.9beta    First public release
	    

Download

Current stable version

The current stable version is 1.1. It is available from SourceForge download. The md5sum of splitjob-1.1.tar.bz2 is 524569591836405b9ee13f1ae7b8dde0

Older versions

Old stable version 1.0 is available from SourceForge download. The md5sum of splitjob-1.0.tar.bz2 is cb3eb993b69dd1821c02fe3bc87d7ab8

Version 0.9.2beta is available from SourceForge download. The md5sum of splitjob-0.9.2beta.tar.bz2 is af7001e9e5680da24a214dafa4ae68e4

Version 0.9beta is available from SourceForge download. The md5sum of splitjob-0.9beta.tar.bz2 is d79cd625e24f7f3d00b1ac726c65b459
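
A downloaded archive can be compared against the checksums listed above before
it is unpacked. The following is a minimal sketch, assuming the md5sum tool
from GNU coreutils is installed; the function name verify_md5 is hypothetical.

```shell
# verify_md5 FILE EXPECTED
# Hypothetical helper: succeed only if the MD5 digest of FILE equals EXPECTED.
verify_md5() {
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ -n "$actual" ] && [ "$actual" = "$2" ]
}
```

For the current stable version this would be
`verify_md5 splitjob-1.1.tar.bz2 524569591836405b9ee13f1ae7b8dde0 && echo OK`.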

Contact

Bug reports

Bugs should be reported to the SourceForge Bug Tracking System.

Feature requests

With my limited time I make no promises, but requests for new features can be posted at the SourceForge splitjob feature request page. Feature requests are welcome, but even more welcome are newly implemented features contributed as patches at the SourceForge splitjob patches page.

Support

Questions will be answered at SourceForge splitjob support page.

Email

It was once possible to contact me, Henrik Carlqvist, by my SourceForge email. Unfortunately that email address is no longer usable because of large amounts of spam.