An evaluation of the genome alignment landscape Alexandre Fonseca KTH Royal Institute of Technology December 16, 2013
Table of Contents
1 Introduction
    Genetic Research
    Motivation
    Objective
2 Evaluation Setup
    Hardware
    Software
    Inputs
3 Results
    Accuracy
    Duration
    Scalability
4 Conclusion
What is genetic research?
- The study of an individual's genome
    - DNA & RNA
- Allows identification of:
    - Particular traits and characteristics
    - Anomalous mutations
- Applicability: agriculture, medicine, ...
[Figure: Genome analysis pipeline]
Motivation
- Next Generation Sequencing (NGS):
    - Parallelization of the sequencing process
    - Each run produces thousands of small reads
    - >400x more throughput than 1st generation
- Alignment:
    - Matching reads to correct segments of a reference genome
    - Most complex and time-consuming task
- Most implementations still very centralized
    - Parallelization at the thread/processor level
    - Hard to scale to handle increasing amounts of data
- Could we leverage the power of the cloud?
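To make "matching reads to correct segments of a reference genome" concrete, here is a minimal brute-force sketch. It is not how Bowtie, BWA, or any of the evaluated aligners work (those rely on indexes rather than a linear scan); the function names, the toy reference string, and the `max_mismatches` threshold are all assumptions made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read: str, reference: str, max_mismatches: int = 2):
    """Return (position, mismatches) of the best placement, or None if unmapped.

    Illustrative only: tries every offset in the reference and keeps the
    first placement with the fewest mismatches.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


# Hypothetical toy data, not taken from the evaluation.
reference = "ACGTTGCAAGGCTTACGATC"
reads = ["TGCAAGG", "GCTTACG", "TGCTAGG", "TTTTTTT"]
results = {read: align_read(read, reference) for read in reads}
for read, hit in results.items():
    print(read, hit)
```

The scan costs time proportional to reference length times read length for every read, which hints at why alignment is the most time-consuming step of the pipeline and why splitting reads across machines is attractive.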
Objective of the project
- Evaluate and compare different sequence aligners:
    - Alignment duration
    - Alignment accuracy
    - Scalability
- Centralized aligners:
    - Bowtie1-100 (April 9th, 2013)
    - BWA - 0510 (November 13th, 2013)
    - Bowtie2-210 (February 21st, 2013)
- Distributed aligners:
    - Crossbow - 121 (May 30th, 2013)
    - SEAL - 032 (February 7th, 2013)
Hardware
- Evaluated on the 7-node SICS cluster
- Each node:
    - 2x 6-core AMD Opteron 24355 CPUs
    - 32 GB of RAM
    - 1 TB of disk
- Node interconnection: 1 Gbps full-duplex Ethernet
Software
- Shared Hadoop 220 installation:
    - 5 NodeManagers on nodes 1-5
    - ResourceManager on node 6 and NameNode on node 7
    - 16 GB of RAM and 12 cores available to each NodeManager
    - Maximum memory usage per container: 8 GB
- Additional software:
    - FastXToolkit - 0013
    - PicardTools - 1101
    - SAMTools - 0119
    - SRAToolkit - 233
    - WGSim - https://github.com/lh3/wgsim (a12da33)
Inputs
- hg-19 reference human genome
- Single-ended and paired-ended sets of reads
    - Bowtie1 and Crossbow only support single-ended
    - SEAL only supports paired-ended
- Two categories of read sets:
    - Simulated - sampled from hg-19
        - 900k 100 base-pair reads
        - 2%, 5% and 10% error rates
        - 009% SNP mutation rate
        - 001% indel mutation rate
        - 200MB (x2)
    - Real read sets - from the NA12878 individual
        - Chromosome 4 - 156M 101bp reads - 355GB (x2)
        - Chromosome 20 - 51M 101bp reads - 12GB (x2)
        - Chromosome 21 - 31M 101bp reads - 7GB (x2)
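As a rough illustration of what "simulated reads sampled from a reference with a given error rate" means, the sketch below draws fixed-length reads at random positions and substitutes bases with a fixed per-base probability. It is not WGSim's algorithm (it ignores SNP and indel mutations, quality scores, and paired ends); `simulate_reads`, its parameters, and the toy reference are assumptions made for this example.

```python
import random

BASES = "ACGT"


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Sample reads from `reference`, substituting bases at `error_rate`.

    Returns a list of (true_position, read_sequence) tuples. Keeping the
    true position is what later allows alignment accuracy to be checked.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    reads = []
    for _ in range(n_reads):
        pos = rng.randrange(len(reference) - read_len + 1)
        bases = []
        for base in reference[pos:pos + read_len]:
            if rng.random() < error_rate:
                # Substitution error: replace with a different base.
                base = rng.choice([b for b in BASES if b != base])
            bases.append(base)
        reads.append((pos, "".join(bases)))
    return reads


# Hypothetical toy reference; the evaluation sampled from hg-19.
toy_reference = "ACGTTGCAAGGCTTACGATC" * 10
clean = simulate_reads(toy_reference, n_reads=5, read_len=20, error_rate=0.0)
noisy = simulate_reads(toy_reference, n_reads=5, read_len=20, error_rate=0.05)
```

Because each read carries the position it was sampled from, an aligner's reported position can be compared against the truth, which is not possible with the real NA12878 read sets.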
Mapped Reads
[Figure: Mapped reads with single-end read sets (Bowtie1, Crossbow, BWA, Bowtie2). X axis: read sets (100-002, 100-005, 100-010); Y axis: mapped reads (%).]
[Figure: Mapped reads with paired-end read sets (BWA, Seal, Bowtie2). X axis: read sets (100-002-pair, 100-005-pair, 100-010-pair); Y axis: mapped reads (%).]
Error
[Figure: Alignment error for single-end read sets (Bowtie1, Crossbow, BWA, Bowtie2). X axis: read sets (100-002, 100-005, 100-010); Y axis: error (%).]
[Figure: Alignment error for paired-end read sets (BWA, Seal, Bowtie2). X axis: read sets (100-002-pair, 100-005-pair, 100-010-pair); Y axis: error (%).]
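The slides do not spell out how the two accuracy metrics were computed. One plausible reading, sketched here purely as an assumption, is that "mapped reads" is the share of all reads the aligner placed anywhere, and "error" is the share of mapped reads placed away from their true origin. The function name, the `tolerance` parameter, and the sample data are hypothetical.

```python
def accuracy_metrics(true_pos, reported_pos, tolerance=0):
    """Return (mapped_pct, error_pct) for one aligner run.

    true_pos:     read id -> position the read was sampled from
    reported_pos: read id -> aligner position, or None if unmapped
    Assumed definitions (not confirmed by the slides):
      mapped_pct = mapped reads / all reads
      error_pct  = mapped reads off by more than `tolerance` / mapped reads
    """
    total = len(true_pos)
    mapped = {r: p for r, p in reported_pos.items() if p is not None}
    wrong = sum(1 for r, p in mapped.items() if abs(p - true_pos[r]) > tolerance)
    mapped_pct = 100.0 * len(mapped) / total if total else 0.0
    error_pct = 100.0 * wrong / len(mapped) if mapped else 0.0
    return mapped_pct, error_pct


# Hypothetical data: four simulated reads, one unmapped, one misplaced.
truth = {"r1": 10, "r2": 50, "r3": 90, "r4": 120}
reported = {"r1": 10, "r2": 57, "r3": None, "r4": 120}
mapped_pct, error_pct = accuracy_metrics(truth, reported)
print(f"mapped: {mapped_pct:.2f}%  error: {error_pct:.2f}%")
```

Under these definitions an aligner can trade one metric against the other: mapping more aggressively raises the mapped percentage but may also raise the error, so the two charts are best read together.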
Duration
[Figure: Alignment duration for single-end read sets (Bowtie1, Crossbow, BWA, Bowtie2). X axis: read sets (chr4, chr20, chr21); Y axis: duration (minutes).]
[Figure: Alignment duration for paired-end read sets (BWA, Seal, Bowtie2). X axis: read sets (chr4-pair, chr20-pair, chr21-pair); Y axis: duration (minutes).]
Scalability
[Figure: Duration based on number of nodes for chromosome 20 (Crossbow (single-end), Seal (paired-end), Bowtie1, BWA). X axis: # of nodes; Y axis: duration (minutes).]
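A duration-versus-nodes chart is usually summarised as speedup and parallel efficiency. The sketch below shows that arithmetic; the durations in it are made-up placeholders, not the measurements from the chart, and `scaling_table` is a name chosen for this example.

```python
def scaling_table(durations):
    """Compute speedup and efficiency relative to the smallest node count.

    durations: node count -> wall-clock duration (any consistent unit)
    Returns a list of (nodes, speedup, efficiency) tuples, where
      speedup    = baseline duration / duration
      efficiency = speedup / (nodes / baseline nodes)
    """
    base_nodes = min(durations)
    base = durations[base_nodes]
    rows = []
    for nodes in sorted(durations):
        speedup = base / durations[nodes]
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


# Hypothetical durations in minutes, for illustration only.
hypothetical = {1: 400.0, 3: 200.0, 5: 125.0}
rows = scaling_table(hypothetical)
for nodes, speedup, efficiency in rows:
    print(f"{nodes} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency well below 100% means added nodes are partly spent on overhead such as job scheduling and data movement rather than alignment itself.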
Conclusion
- Distributing the alignment is feasible
    - More than 3x speedup with no accuracy impact
- Different aligners target different optimization areas
- Chance to improve with newer algorithms
Questions?