Why know about performance


Performance
Today we'll discuss issues related to performance:
- Latency/Response Time/Execution Time vs. Throughput
- How do you make a reasonable performance comparison?
- The 3 components of CPU performance
- The 2 laws of performance

Why know about performance
Purchasing Perspective: Given a collection of machines, which has the
- Best Performance?
- Lowest Price?
- Best Performance/Price?
Design Perspective: Faced with design options, which has the
- Best Performance Improvement?
- Lowest Cost?
- Best Performance/Cost?
Both require
- Basis for comparison
- Metric for evaluation

Many possible definitions of performance
Every computer vendor will select one that makes them look good. How do you make sense of conflicting claims?
Q: Why do end users need a new performance metric?
A: End users who rely only on megahertz as an indicator for performance do not have a complete picture of PC processor performance and may pay the price of missed expectations.

Two notions of performance

Plane      DC to Paris   Speed      Passengers   Throughput (pmph)
747        6.5 hours     610 mph    470          286,700
Concorde   3 hours       1350 mph   132          178,200

Which has higher performance? It depends on the metric:
- Time to do the task (Execution Time, Latency, Response Time)
- Tasks per unit time (Throughput, Bandwidth)
Response time and throughput are often in opposition.

Some Definitions
Performance is in units of things/unit time, e.g., Hamburgers/hour. Bigger is better.
If we are primarily concerned with response time:
$$\text{Performance}(X) = \frac{1}{\text{execution\_time}(X)}$$
Relative performance: "X is N times faster than Y" means
$$N = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{execution\_time}(Y)}{\text{execution\_time}(X)}$$

Some Examples

Plane      DC to Paris   Speed      Passengers   Throughput (pmph)
747        6.5 hours     610 mph    470          286,700
Concorde   3 hours       1350 mph   132          178,200

Time of Concorde vs. 747?
Throughput of Concorde vs. 747?
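The two questions can be worked through with a short sketch. It uses only the figures from the table; the function and variable names are illustrative, and throughput is taken to be passengers times speed (passenger-miles per hour), which matches the table's pmph column.

```python
# Figures from the table: trip time, cruising speed, passenger capacity.
planes = {
    "747": {"hours": 6.5, "speed_mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "speed_mph": 1350, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour = passengers * speed."""
    return plane["passengers"] * plane["speed_mph"]


def times_faster(time_y, time_x):
    """N such that X is N times faster than Y: execution_time(Y) / execution_time(X)."""
    return time_y / time_x


# Latency: how many times faster is the Concorde than the 747?
latency_ratio = times_faster(planes["747"]["hours"], planes["Concorde"]["hours"])

# Throughput: how many times more passenger-miles per hour does the 747 deliver?
throughput_ratio = throughput_pmph(planes["747"]) / throughput_pmph(planes["Concorde"])

print(f"Concorde is {latency_ratio:.2f}x faster by trip time")
print(f"747 has {throughput_ratio:.2f}x the throughput")
```

By trip time the Concorde wins by a factor of roughly 2.2; by throughput the 747 wins by a factor of roughly 1.6.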

Basis of Comparison
When comparing systems, we need to fix the workload. Which workload?

Actual Target Workload
  Pros: Representative
  Cons: Very specific; non-portable; difficult to run/measure
Full Application Benchmarks
  Pros: Portable; widely used; realistic
  Cons: Less representative
Small Kernel or Synthetic Benchmarks
  Pros: Easy to run; useful early in design
  Cons: Easy to fool
Microbenchmarks
  Pros: Identify peak capability and potential bottlenecks
  Cons: Real application performance may be much below peak

Benchmarking
Some common benchmarks include:
- Adobe Photoshop for image processing
- BAPCo SYSmark for office applications
- Unreal Tournament 2003 for 3D games
- SPEC2000 for CPU performance
The best way to see how a system performs for a variety of programs is to just show the execution times of all of the programs.
Here are execution times for several different Photoshop 5.5 tasks, from http://www.tech-report.com

Summarizing performance
Summarizing performance with a single number can be misleading, just like summarizing four years of school with a single GPA!
If you must have a single number, you could sum the execution times. This example graph displays the total execution time of the individual tests from the previous page.
A similar option is to find the average of all the execution times. For example, the 800MHz Pentium III (in yellow) needed 227.3 seconds to run 21 programs, so its average execution time is 227.3/21 = 10.82 seconds.
A weighted sum or average is also possible, and lets you emphasize some benchmarks more than others.
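The three summaries (sum, average, weighted average) can be sketched as follows. The 227.3 seconds over 21 programs comes from the slide; the three-benchmark `times` and `weights` lists are made-up values used only to show the weighted case.

```python
def total_time(times):
    """Sum of the individual execution times."""
    return sum(times)


def mean_time(times):
    """Unweighted average execution time."""
    return sum(times) / len(times)


def weighted_mean_time(times, weights):
    """Weighted average: benchmarks with larger weights count for more."""
    return sum(t * w for t, w in zip(times, weights)) / sum(weights)


# From the slide: 227.3 seconds for 21 programs on the 800MHz Pentium III.
pentium_average = 227.3 / 21
print(f"average = {pentium_average:.2f} s")

# Hypothetical execution times (seconds) and weights, for illustration only.
times = [12.0, 4.0, 20.0]
weights = [1, 1, 2]  # the third benchmark is counted twice as heavily
print(total_time(times), mean_time(times), weighted_mean_time(times, weights))
```

With these made-up values, the weighting moves the summary from 12.0 to 14.0 seconds. The choice of weights can change the conclusion, which is one reason a single number can mislead.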

The components of execution time
Execution time can be divided into two parts.
- User time is spent running the application program itself.
- System time is when the application calls operating system code.
The distinction between user and system time is not always clear, especially under different operating systems.
The Unix time command shows both.

    salary.125 > time distill 05-examples.ps
    Distilling 05-examples.ps (449,119 bytes)
    10.8 seconds (0:11)
    449,119 bytes PS => 94,999 bytes PDF (21%)
    10.61u 0.98s 0:15.15 76.5%

In the last line, 10.61u is the user time, 0.98s is the system time, 0:15.15 is the wall clock time (including other processes), and 76.5% is the CPU usage = (User + System) / Total.
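The CPU usage figure can be checked from the other three numbers in the `time` output. This is a small sketch; the function name is illustrative.

```python
def cpu_usage(user_s, system_s, wall_clock_s):
    """CPU usage = (User + System) / Total, as a fraction of wall clock time."""
    return (user_s + system_s) / wall_clock_s


# Values from the time output: 10.61u 0.98s 0:15.15
usage = cpu_usage(10.61, 0.98, 15.15)
print(f"CPU usage = {usage * 100:.1f}%")
```

The remaining wall clock time was spent on other processes or waiting, for example on I/O.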

Three Components of CPU Performance
For a program P running on a machine X:
$$\text{CPU time}_{X,P} = \text{Instructions executed}_P \times \text{CPI}_{X,P} \times \text{Clock cycle time}_X$$
CPI stands for Cycles Per Instruction.

Instructions Executed
Instructions executed: We are not interested in the static instruction count, or how many lines of code are in a program. Instead we care about the dynamic instruction count, or how many instructions are actually executed when the program runs.
There are three lines of code below, but the number of instructions executed would be XXXX?

             li   $a0, 1000
    Ostrich: sub  $a0, $a0, 1
             bne  $a0, $0, Ostrich
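One way to see the difference between static and dynamic counts is to step through the loop and tally each instruction as it runs. The sketch below mirrors the three-line program in Python; it assumes each assembly line counts as exactly one executed instruction (the assembler may expand pseudo-instructions differently) and that the initial value is positive.

```python
def count_dynamic_instructions(initial):
    """Tally the instructions executed by the li/sub/bne loop."""
    if initial <= 0:
        raise ValueError("this sketch assumes a positive initial count")
    executed = 0

    a0 = initial          # li $a0, initial
    executed += 1

    while True:
        a0 -= 1           # Ostrich: sub $a0, $a0, 1
        executed += 1
        executed += 1     # bne $a0, $0, Ostrich (executed whether or not taken)
        if a0 == 0:       # branch falls through once $a0 reaches zero
            break
    return executed


print(count_dynamic_instructions(1000))
```

The static count is 3, while the dynamic count under these assumptions is 1 + 2 * 1000 = 2001.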

CPI
The average number of clock cycles per instruction, or CPI, is a function of the machine and program.
- The CPI depends on the actual instructions appearing in the program: a floating-point intensive application might have a higher CPI than an integer-based program.
- It also depends on the CPU implementation. For example, a Pentium can execute the same instructions as an older 80486, but faster.
It is common to assume each instruction takes one cycle, making CPI = 1.
- The CPI can be >1 due to memory stalls and slow instructions.
- The CPI can be <1 on machines that execute more than 1 instruction per cycle (superscalar).

Clock cycle time
One cycle is the minimum time it takes the CPU to do any work.
- The clock cycle time or clock period is just the length of a cycle.
- The clock rate, or frequency, is the reciprocal of the cycle time.
Generally, a higher frequency is better.
Some examples illustrate some typical frequencies.
- A 500MHz processor has a cycle time of 2ns.
- A 2GHz (2000MHz) CPU has a cycle time of just 0.5ns (500ps).
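The reciprocal relationship between clock rate and cycle time is easy to check numerically. A minimal sketch, with an illustrative function name:

```python
def cycle_time_ns(frequency_hz):
    """Clock cycle time in nanoseconds: the reciprocal of the clock rate."""
    return 1e9 / frequency_hz


print(cycle_time_ns(500e6))  # 500MHz
print(cycle_time_ns(2e9))    # 2GHz
```

These reproduce the 2ns and 0.5ns figures above.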

Execution time, again
$$\text{CPU time}_{X,P} = \text{Instructions executed}_P \times \text{CPI}_{X,P} \times \text{Clock cycle time}_X$$
The easiest way to remember this is to match up the units:
$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
Make things faster by making any component smaller!
Factors: Program, Compiler, ISA, Organization, Technology
Components: Instructions Executed, CPI, Clock Cycle Time
It is often easy to reduce one component by increasing another.

Example 1: ISA-compatible processors
Let's compare the performance of two 8086-based processors.
- An 800MHz AMD Duron, with a CPI of 1.2 for an MP3 compressor.
- A 1GHz Pentium III with a CPI of 1.5 for the same program.
Compatible processors implement identical instruction sets and will use the same executable files, with the same number of instructions.
But they implement the ISA differently, which leads to different CPIs.
$$\text{CPU time}_{AMD,P} = \text{Instructions}_P \times \text{CPI}_{AMD,P} \times \text{Cycle time}_{AMD} = \underline{\qquad} = \underline{\qquad}$$
$$\text{CPU time}_{P3,P} = \text{Instructions}_P \times \text{CPI}_{P3,P} \times \text{Cycle time}_{P3} = \underline{\qquad} = \underline{\qquad}$$
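The comparison can be worked through with the CPU time equation. In this sketch the clock rates and CPIs come from the example, while the instruction count of one billion is a made-up placeholder; because both processors run the same executable, the count cancels out of the ratio.

```python
def cpu_time_s(instructions, cpi, clock_hz):
    """CPU time = instructions * CPI * clock cycle time, with cycle time = 1 / clock rate."""
    return instructions * cpi / clock_hz


INSTRUCTIONS = 1_000_000_000  # hypothetical; identical for both processors

amd_time = cpu_time_s(INSTRUCTIONS, cpi=1.2, clock_hz=800e6)  # 800MHz AMD Duron
p3_time = cpu_time_s(INSTRUCTIONS, cpi=1.5, clock_hz=1e9)     # 1GHz Pentium III

print(f"AMD Duron:   {amd_time:.3f} s")
print(f"Pentium III: {p3_time:.3f} s")
print(f"ratio:       {amd_time / p3_time:.3f}")
```

Per instruction, the Duron needs 1.2 cycles of 1.25ns and the Pentium III needs 1.5 cycles of 1ns, so both come to 1.5ns: the higher clock rate is exactly offset by the higher CPI for this program.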

Example 2: Comparing across ISAs
Intel's Itanium (IA-64) ISA is designed to facilitate executing multiple instructions per cycle. If an Itanium processor achieves an average CPI of 0.3 (about 3 instructions per cycle), how much faster is it than a Pentium 4 (which uses the x86 ISA) with an average CPI of 1?
a) Itanium is three times faster
b) Itanium is one third as fast
c) Not enough information

Improving CPI
Many processor design techniques we'll see improve CPI. Often they only improve CPI for certain types of instructions.
$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i, \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}}$$
F_i = fraction of instructions of type i
First Law of Performance: Make the common case fast

Example: CPI improvements
Base Machine:

Op Type   Freq (f_i)   Cycles   CPI_i
ALU       50%          3
Load      20%          5
Store     10%          3
Branch    20%          2

How much faster would the machine be if:
- we added a cache to reduce average load time to 3 cycles?
- we added a branch predictor to reduce branch time by 1 cycle?
- we could do two ALU operations in parallel?
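A sketch of how the three questions could be worked out, using the weighted-sum formula for CPI. It assumes the instruction count and clock cycle time stay the same, so speedup is simply old CPI divided by new CPI. Modeling "two ALU operations in parallel" as halving the effective ALU cycles (3 to 1.5) is an interpretation, not something the slide states.

```python
# Base machine from the table: op type -> (frequency, cycles).
BASE_MIX = {
    "ALU": (0.50, 3),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """CPI = sum over instruction types of frequency * cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op_type, new_cycles):
    """Return a copy of the mix with one op type's cycle count changed."""
    changed = dict(mix)
    changed[op_type] = (mix[op_type][0], new_cycles)
    return changed


def speedup(old_mix, new_mix):
    """Speedup assuming unchanged instruction count and clock cycle time."""
    return average_cpi(old_mix) / average_cpi(new_mix)


cache = with_cycles(BASE_MIX, "Load", 3)         # loads: 5 -> 3 cycles
predictor = with_cycles(BASE_MIX, "Branch", 1)   # branches: 2 -> 1 cycle
dual_alu = with_cycles(BASE_MIX, "ALU", 1.5)     # assumed: effective ALU cycles halved

print(f"base CPI = {average_cpi(BASE_MIX):.2f}")
for name, mix in [("cache", cache), ("branch predictor", predictor), ("dual ALU", dual_alu)]:
    print(f"{name}: CPI = {average_cpi(mix):.2f}, speedup = {speedup(BASE_MIX, mix):.3f}")
```

Under these assumptions the base CPI is 3.2, and the largest gain comes from the ALU change, because ALU operations are the most common case.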

Amdahl's Law
Amdahl's Law states that optimizations are limited in their effectiveness.
$$\text{Execution time after improvement} = \frac{\text{Time affected by improvement}}{\text{Amount of improvement}} + \text{Time unaffected by improvement}$$
For example, doubling the speed of floating-point operations sounds like a great idea. But if only 10% of the program execution time T involves floating-point code, then the overall performance improves by just 5%.
$$\text{Execution time after improvement} = \frac{0.10\,T}{2} + 0.90\,T = 0.95\,T$$
What is the maximum speedup from improving floating point?
Second Law of Performance: Make the fast case common
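The floating-point example, and the question about the maximum speedup, can be explored with a short sketch of the formula. Function names are illustrative; the maximum is taken as the limit where the affected time shrinks to zero.

```python
def time_after_improvement(fraction_affected, improvement, total_time=1.0):
    """Amdahl's Law: affected time / improvement + unaffected time."""
    affected = fraction_affected * total_time
    unaffected = (1.0 - fraction_affected) * total_time
    return affected / improvement + unaffected


def speedup(fraction_affected, improvement):
    """Overall speedup = old execution time / new execution time."""
    return 1.0 / time_after_improvement(fraction_affected, improvement)


def max_speedup(fraction_affected):
    """Limit as the improvement grows without bound: only unaffected time remains."""
    return 1.0 / (1.0 - fraction_affected)


# 10% of execution time is floating point, made 2x faster.
print(f"new time = {time_after_improvement(0.10, 2):.2f} T")
print(f"speedup  = {speedup(0.10, 2):.3f}")
print(f"maximum  = {max_speedup(0.10):.3f}")
```

Even if floating-point operations took no time at all, the program would run only about 1.11 times faster, because the other 90% of the execution time is untouched.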

Summary
Performance is one of the most important criteria in judging systems.
There are two main measurements of performance.
- Execution time is what we'll focus on.
- Throughput is important for servers and operating systems.
Our main performance equation explains how performance depends on several factors related to both hardware and software.
$$\text{CPU time}_{X,P} = \text{Instructions executed}_P \times \text{CPI}_{X,P} \times \text{Clock cycle time}_X$$
It can be hard to measure these factors in real life, but this is a useful guide for comparing systems and designs.
Amdahl's Law tells us how much improvement we can expect from specific enhancements.
The best benchmarks are real programs, which are more likely to reflect common instruction mixes.