An evaluation of the genome alignment landscape

Similar documents
GRAMMATICAL EVOLUTION. Peter Černo

Assessing Solvency by Brute Force is Computationally Tractable

High Performance Risk Aggregation: Addressing the Data Processing Challenge the Hadoop MapReduce Way

Monoxide Scale out Blockchains with Asynchronous Consensus Zones. Jiaping Wang, Hao Wang Sinovation Ventures ICT/CAS The Ohio State University

CHEP An update on the scalability limits of the Condor batch system

A MapReduce Framework for Analysing Portfolios of Catastrophic Risk with Secondary Uncertainty

Financial Analysis Using a Distributed System

Unparalleled Performance, Agility and Security for NSE

CONTENTS DISCLAIMER... 3 EXECUTIVE SUMMARY... 4 INTRO... 4 ICECHAIN... 5 ICE CHAIN TECH... 5 ICE CHAIN POSITIONING... 6 SHARDING... 7 SCALABILITY...

CHEP An update on the scalability limits of the Condor batch system. by Igor Sfiligoi for the UWisc and Red Hat Condor teams doing the real work

Genium INET. Trading Workstation Installation Guide BIST. Version:

Tampere University of Technology. Kanniainen, Juho; Piché, Robert; Mikkonen, Tommi. Use of distributed computing in derivative pricing

Barrier Option. 2 of 33 3/13/2014

Expert4x NoWorries EA. November 21, 2017

VPLS Reseller Rent2Own Program Terms and Conditions

Clouds for HPC Potential? Challenges?

High Performance Risk Aggregation:

The following answers are provided in response to questions received:

Building the Healthcare System of the Future O R A C L E W H I T E P A P E R F E B R U A R Y

Accelerating Financial Computation

Role of Research in Industry Assurant, Inc James R. Grana, Ph.D.

Asset Liability Management An Integrated Approach to Managing Liquidity, Capital, and Earnings

HONG KONG FUTURES EXCHANGE LIMITED HKATS TRADING PROCEDURES

Dynamic Resource Allocation for Spot Markets in Cloud Computi

Reconfigurable Acceleration for Monte Carlo based Financial Simulation

INVITATION FOR QUOTATION. TEQIP-III/2018/mecj/Shopping/4

Efficient Algorithms for Flood Risk Analysis

Operational Risk Quantification System

Modernization of the CNSS Information System: SI-CNSS A case of the National Social Security Fund

BETTER BUDGETING 3RD ANNUAL WEB-STREAM SERIES. NEW! Three-Part Series

Analytics in 10 Micro-Seconds Using FPGAs. David B. Thomas Imperial College London

Fighting Fraud in Financial Services: three success stories

Handout 4: Deterministic Systems and the Shortest Path Problem

Why know about performance

Aleator: Random Beacon via Scalable Threshold Signatures

Genetic Algorithms Overview and Examples

Scaling SGD Batch Size to 32K for ImageNet Training

Rules for Rules: Bringing Order and Efficiency to the Modern Insurance Enterprise ORACLE STRATEGY BRIEF FEBRUARY 2016

Online Algorithms SS 2013

2) What is algorithm?

A Branch-and-Price method for the Multiple-depot Vehicle and Crew Scheduling Problem

Tender for. Supply and Installation of DNA/RNA Fragment Analyzer System

Oracle. Financials Cloud Using Financials for EMEA. Release 13 (update 17D)

Amazon Elastic Compute Cloud

Financial Risk Modeling on Low-power Accelerators: Experimental Performance Evaluation of TK1 with FPGA

What is HomeBlockCoin

ECCSD Requirements And Application Instructions

Stochastic Market Clearing: Advances in Computation and Economic Impacts

Intra European Sales Reporting

Distributed Computing in Finance: Case Model Calibration

QIAGEN reports results for third quarter and first nine months of 2018

Ultimate Control. Maxeler RiskAnalytics

HPC IN THE POST 2008 CRISIS WORLD

Tax Reporting for Germany

A contribution model for funding of the national e-infrastructure

COS 318: Operating Systems. CPU Scheduling. Today s Topics. CPU Scheduler. Preemptive and Non-Preemptive Scheduling

A Particle Swarm Optimization Algorithm for Agent-Based Artificial Markets

Mark Redekopp, All rights reserved. EE 357 Unit 12. Performance Modeling

KULICKE & SOFFA INDUSTRIES NASDAQ: KLIC JUNE QUARTER 2018 INVESTOR PRESENTATION

Stock Portfolio Selection using Genetic Algorithm

The Dynamic Cross-sectional Microsimulation Model MOSART

Cabcharge Taxi Management System (CTMS) User Guide

Welcome to Redefining Perspectives

Impact of Risk Based Supervision Technology Departments. Arun Pingaley, Industry Director FSI, SAP India

Innovation in the global credit

Project planning and creating a WBS

HKUST CSE FYP , TEAM RO4 OPTIMAL INVESTMENT STRATEGY USING SCALABLE MACHINE LEARNING AND DATA ANALYTICS FOR SMALL-CAP STOCKS

Oracle Global Human Resources Cloud Using Absence Management 19A

Single-step GBLUP. Integrates all available information. ssgblup vs. BayesX methods. Phenotypes Genotypes Pedigree

Portfolio Optimization with Gurobi. Gurobi Anwendertage 2017

SALESFORCE LIGHTNING

Proxy Function Fitting: Some Implementation Topics

Data based stock portfolio construction using Computational Intelligence

How to Bid the Cloud

Computational Finance Improving Monte Carlo

Chapter 3. Dynamic discrete games and auctions: an introduction

Hadoop Capacity Scheduler

Oracle Financial Services Service Descriptions and Metrics

Sincerely, Peter J. Ungaro President and Chief Executive Officer

Report for Prediction Processor Graduate Computer Architecture I

Private Wealth Management. Understanding Blockchain as a Potential Disruptor

Investing in the Blockchain Ecosystem

OCZ Technology Group Reports Fiscal 2012 Second Quarter Results

PRESS RELEASE. Mellanox Technologies, Ltd.

User Guide 2015 Physician Quality Reporting System (PQRS) Payment Adjustment Feedback Report

Optimization Methods. Lecture 16: Dynamic Programming

Denisova Admixture and the First Modern Human. Dispersals into Southeast Asia and Oceania

JAMMU AND KASHMIR STATE FINANCIAL CORPORATION (Incorporated under the State Financial Corporation s Act 1951)

INTEGRA COMPARATIVE PROFILER

6/7/2018. Overview PERT / CPM PERT/CPM. Project Scheduling PERT/CPM PERT/CPM

A new PDE-based approach for construction scheduling and resource allocation. Paul Gabet, Julien Nachef CE 291F Project Presentation Spring 2014

1.1 Capitalised words are either defined in the Standard Terms and Conditions or in this Agreement. Unless the context otherwise requires:

INVITATION FOR TENDER FOR SUPPLY OF EQUIPMENT

Wastewater Asset Management Unique Perspectives from the Engineer & Municipality

Oracle. Project Portfolio Management Cloud Using Project Performance Reporting. Release 13 (update 17D)

TEPZZ 858Z 5A_T EP A1 (19) (11) EP A1 (12) EUROPEAN PATENT APPLICATION. (43) Date of publication: Bulletin 2015/15

Oracle. Project Portfolio Management Cloud Using Project Performance Reporting. Release 13 (update 18B)

BCN1043. By Dr. Mritha Ramalingam. Faculty of Computer Systems & Software Engineering

Tightening the Bounds on Cache-Related Preemption Delay in Fixed Preemption Point Scheduling

UNIT-II Project Organization and Scheduling Project Element

Transcription:

An evaluation of the genome alignment landscape Alexandre Fonseca KTH Royal Institute of Technology December 16, 2013

Introduction Evaluation Setup Results Conclusion Genetic Research Motivation Objective Table of Contents 1 Introduction Genetic Research Motivation Objective 2 Evaluation Setup Hardware Software Inputs 3 Results Accuracy Duration Scalability 4 Conclusion Alexandre Fonseca An evaluation of the genome alignment landscape 2 / 16

Introduction Evaluation Setup Results Conclusion Genetic Research Motivation Objective What is genetic research? The study of an individual s genome DNA & RNA Allows identification of: Particular traits and characteristics Anomalous mutations Applicability: agriculture, medicine, Genome analysis pipeline Alexandre Fonseca An evaluation of the genome alignment landscape 3 / 16

Introduction Evaluation Setup Results Conclusion Genetic Research Motivation Objective Motivation Next Generation Sequencing (NGS): Parallelization of the sequencing process Each run produces thousands of small reads >400x more throughput than 1 st generation Alignment: Matching reads to correct segments of a reference genome Most complex and time-consuming task Most implementations still very centralized Parallelization at the thread/processor level Hard to scale to handle increasing amounts of data Could we leverage the power of the cloud? Alexandre Fonseca An evaluation of the genome alignment landscape 4 / 16

Introduction Evaluation Setup Results Conclusion Genetic Research Motivation Objective Objective of the project Evaluate and compare different sequence aligners: Alignment duration Alignment accuracy Scalability Centralized Aligners: Bowtie1-100 (April 9 th, 2013) BWA - 0510 (November 13 th, 2013) Bowtie2-210 (February 21 st, 2013) Distributed Aligners: Crossbow - 121 (May 30 th, 2013) SEAL - 032 (February 7 th, 2013) Alexandre Fonseca An evaluation of the genome alignment landscape 5 / 16

Introduction Evaluation Setup Results Conclusion Hardware Software Inputs Table of Contents 1 Introduction Genetic Research Motivation Objective 2 Evaluation Setup Hardware Software Inputs 3 Results Accuracy Duration Scalability 4 Conclusion Alexandre Fonseca An evaluation of the genome alignment landscape 6 / 16

Introduction Evaluation Setup Results Conclusion Hardware Software Inputs Hardware Evaluated on the 7-node SICS cluster Each node: 2x 6-core AMD Opteron 24355 CPUs 32 GB of RAM 1TB of disk Node interconnection: 1Gbps full duplex Ethernet Alexandre Fonseca An evaluation of the genome alignment landscape 7 / 16

Introduction Evaluation Setup Results Conclusion Hardware Software Inputs Software Shared Hadoop 220 installation: 5 NodeManagers on nodes 1-5 ResourceManager on node 6 and NameNode on node 7 16GB of RAM and 12 cores available to each NodeManager Maximum memory usage per container: 8GB Additional software: FastXToolkit - 0013 PicardTools - 1101 SAMTools - 0119 SRAToolkit - 233 WGSim - https://githubcom/lh3/wgsim (a12da33) Alexandre Fonseca An evaluation of the genome alignment landscape 8 / 16

Introduction Evaluation Setup Results Conclusion Hardware Software Inputs Inputs hg-19 reference human genome Single-ended and paired-ended sets of reads Bowtie1 and Crossbow only suppport single-ended SEAL only supports paired-ended Two categories of read sets: Simulated - Sampled from hg-19 900k 100 base-pair reads 2%, 5% and 10% error rates 009% SNP mutation rate 001% indel mutation rate 200MB (x2) Real read sets - From the NA12878 individual Chromosome 4-156M 101bp reads - 355GB (x2) Chromosome 20-51M 101bp reads - 12GB (x2) Chromosome 21-31M 101bp reads - 7GB (x2) Alexandre Fonseca An evaluation of the genome alignment landscape 9 / 16

Introduction Evaluation Setup Results Conclusion Accuracy Duration Scalability Table of Contents 1 Introduction Genetic Research Motivation Objective 2 Evaluation Setup Hardware Software Inputs 3 Results Accuracy Duration Scalability 4 Conclusion Alexandre Fonseca An evaluation of the genome alignment landscape 10 / 16

Introduction Evaluation Setup Results Conclusion Accuracy Duration Scalability Mapped Reads Mapped reads with single-end read sets Bowtie1 Crossbow BWA Bowtie2 Mapped reads with paired-end read sets BWA Seal Bowtie2 Mapped Reads (%) 120 100 80 8254 8254 9639 9603 60 40 20 0 100-002 3364 3364 8132 8984 100-005 Read sets 325 325 2937 6607 100-010 9663 9663 9937 8936 8937 9395 4698 4699 688 120 100 80 60 40 20 0 100-002-pair 100-005-pair 100-010-pair Read sets Mapped Reads (%) Alexandre Fonseca An evaluation of the genome alignment landscape 11 / 16

Introduction Evaluation Setup Results Conclusion Accuracy Duration Scalability Error Alignment error for single-end read sets Bowtie1 Crossbow BWA Bowtie2 Alignment error for paired-end read sets BWA Seal Bowtie2 8 672 8 Error (%) 6 4 275 275 269 337 2 0 100-002 318 318 304 471 100-005 Read sets 43 43 367 523 100-010 325 325 381 37 37 523 481 48 6 4 2 0 100-002-pair 100-005-pair 100-010-pair Read sets Error (%) Alexandre Fonseca An evaluation of the genome alignment landscape 12 / 16

Introduction Evaluation Setup Results Conclusion Accuracy Duration Scalability Duration Alignment duration for single-end read sets Bowtie1 Crossbow BWA Bowtie2 Alignment duration for paired-end read sets BWA Seal Bowtie2 400 374 1,382 1,500 Duration (minutes) 300 200 282 138 13915 100 0 chr4 8767 40 127 429 chr20 Read sets 515 30 132 25 chr21 367 352 377 1305 9776 273 785 57 0 chr4-pair chr20-pair chr21-pair Read sets 1,000 500 Duration (minutes) Alexandre Fonseca An evaluation of the genome alignment landscape 13 / 16

Introduction Evaluation Setup Results Conclusion Accuracy Duration Scalability Scalability Duration based on number of nodes for chromosome 20 Crossbow (single-end) Seal (paired-end) Bowtie1 BWA 500 521 Duration (minutes) 400 300 200 100 377 292 88 0 1 257 1305 655 40 3 5 # of nodes 108 29 7 Alexandre Fonseca An evaluation of the genome alignment landscape 14 / 16

Introduction Evaluation Setup Results Conclusion Table of Contents 1 Introduction Genetic Research Motivation Objective 2 Evaluation Setup Hardware Software Inputs 3 Results Accuracy Duration Scalability 4 Conclusion Alexandre Fonseca An evaluation of the genome alignment landscape 15 / 16

Introduction Evaluation Setup Results Conclusion Conclusion Distributing the alignment is feasible More than 3x speedup with no accuracy impact Different aligners target different optimization areas Chance to improve with newer algorithms Alexandre Fonseca An evaluation of the genome alignment landscape 16 / 16

Introduction Evaluation Setup Results Conclusion Conclusion Distributing the alignment is feasible More than 3x speedup with no accuracy impact Different aligners target different optimization areas Chance to improve with newer algorithms Questions? Alexandre Fonseca An evaluation of the genome alignment landscape 16 / 16