Simple and scalable scripting for large sequencing data sets in Hadoop
André Schumacher, Luca Pireddu, Matti Niemenmaa, Aleksi Kallio, Eija Korpelainen, Gianluigi Zanetti and Keijo Heljanko
We have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing data sets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig’s scalability over many computing nodes and illustrate its use with example scripts.