Parallelising plink (or anything else) the easy way

plink is the Swiss Army knife of genome association studies. Its impressive tool set can be seen here. I am currently running some experiments for which I need to compute associations between 30,000 SNPs and 130 assays. This calculation is only the first step of the experiments, which I want to run as many times as possible. So to save time, the most direct approach is to parallelise the whole process.

Enter GNU parallel, an amazing Unix command that makes parallelisation a piece of cake. The best way to learn it is to go through the numerous examples. As you can see, it is used much like a normal xargs (see this recent post about xargs on Getting Genetics Done).

To compute the associations, I created a .phen file which contains all the assays, as well as each subject’s family ID and individual ID. This is just a long tab-separated text file. Its header starts with FID IID nameAssay1 nameAssay2 etc.

A normal use of plink would look like this:

plink --manyOptions --pheno allAssays.phen --all-pheno --linear --out analyses

This runs a linear regression for each (SNP, assay) pair and writes the results to files whose names start with analyses (--out sets the output file prefix). In my case, it took about 80 minutes.

With GNU parallel, however, the change is minimal. I just need:

  • to parse the header in order to extract all the assays’ names.
  • to tell plink which phenotype I want to process. This is done with --pheno-name

The first bit is done with a simple combination of usual unix tools:

head -n1 allAssays.phen |cut -f 3- |sed 's/\t/\n/g'

This will produce the list of assays.
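To see what this step produces without touching real data, here is a minimal sketch on a throwaway file. The assay names (assayA, assayB, assayC), the sample row, and the helper name list_assays are made up for illustration; tr is used in place of sed only because it handles tabs the same way on every platform.

```shell
# Build a tiny stand-in for allAssays.phen (hypothetical names and values).
tmpdir=$(mktemp -d)
printf 'FID\tIID\tassayA\tassayB\tassayC\n' > "$tmpdir/allAssays.phen"
printf 'fam1\tind1\t0.5\t1.2\t3.4\n' >> "$tmpdir/allAssays.phen"

# List the assay names: take the header line, keep columns 3 onwards,
# and turn the tab-separated names into one name per line.
list_assays() {
    head -n1 "$1" | cut -f 3- | tr '\t' '\n'
}

assays=$(list_assays "$tmpdir/allAssays.phen")
printf '%s\n' "$assays"

# Clean up the throwaway file.
rm -rf "$tmpdir"
```

Each printed line is one phenotype name, which is exactly the shape of input that parallel expects on its standard input.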

Now combine this with --pheno-name and parallel:

head -n1 allAssays.phen |cut -f 3- |sed 's/\t/\n/g'|parallel plink --manyOptions --pheno allAssays.phen --pheno-name {} --linear --out analyses/experimentID.{}

And that is it! I’ve just piped the list of assays to parallel plink. By default, this runs as many copies of plink at once as there are cores, each processing one phenotype. Each instance of {} is replaced by one line of the piped input, in this case the name of a phenotype. You really can’t make it easier. How satisfying it is to run htop and watch all processors being used!
The whole thing is now done in 10 to 15 minutes, with very little extra effort to make it work.
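Before launching a long run, it can help to check which command lines will be generated. The following is only a sketch: preview_commands is a hypothetical helper, the assay names are invented, and --manyOptions is the same placeholder as above. It prints the commands with a plain shell loop and runs nothing.

```shell
# Print the plink command that would be run for each assay name read from
# stdin, without executing anything. --manyOptions is a placeholder.
preview_commands() {
    while IFS= read -r assay; do
        printf 'plink --manyOptions --pheno allAssays.phen --pheno-name %s --linear --out analyses/experimentID.%s\n' "$assay" "$assay"
    done
}

# Hypothetical assay names standing in for the real header.
preview=$(printf 'assayA\nassayB\n' | preview_commands)
printf '%s\n' "$preview"

# With GNU parallel installed, its --dry-run option gives a similar preview:
#   ... | parallel --dry-run plink ... --pheno-name {} ... --out analyses/experimentID.{}
# Assuming plink does not create the output directory itself, make it first:
#   mkdir -p analyses
```

The loop is sequential and is there only to inspect the substitution; the speed-up comes from handing the same list to parallel.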

The official website provides the sources and some binaries for it. If you use Ubuntu, there’s a PPA available here and it’s straightforward to install. Note that there’s an Ubuntu package called ‘moreutils’ which contains a parallel command, but it is a different tool from GNU parallel.


2 Responses to Parallelising plink (or anything else) the easy way

  1. Ole Tange says:

    If your CPUs are not 100% but only 90% used, consider spawning more than one job per CPU, e.g. 1.3 per CPU core: -j130%

    Also remember --version/--bibtex if it is used for a publication.

    • CL says:

      Thank you Ole, that’s good to know. From what I saw in htop, plink seems to use up 100% of the processor, but I’ll give it a go and see if it makes a difference.

      And thanks a lot for this great tool by the way!