Hardware benchmarking for HASH 3 (for non Hardware designers)

1 Hardware benchmarking for HASH 3 (for non Hardware designers)
Ingrid Verbauwhede, ingrid.verbauwhede-at-esat.kuleuven.be
K.U.Leuven, COSIC (Computer Security and Industrial Cryptography)
With input from: Junfeng Fan, Miroslav Knezevic, Patrick Schaumont
Slides from: own course notes, Rabaey's Digital Integrated Circuits
KULeuven - COSIC, Tenerife, Hash 3, Nov 2009

2 Outline
- Goal of hardware design
- What is hardware design?
- What are the different options?
- What are the different contexts?
- How to compare hardware designs: benchmarks
- Where are we now?

3 HW-SW continuum: when hardware design?

4 When hardware design?
- Fast
- Small
- Low power
- Security
- (Analog, RF)
[Figure: HW on the HW-SW continuum]

5 HW-SW continuum
From HW through HW-SW to SW: ASIC, FPGA, domain specific, DSP, VLIW, general purpose. Example marked on the figure: Intel AES-NI (Westmere).
Area efficiency and performance per energy unit run from high to low; programmability runs from low to high.

6 Design parameters
- Speed or throughput: Gbits/sec or Mbits/sec/slice; cycles/byte (see D. Bernstein)
- Area: mm² (gate or transistor count); memory
- Power or energy consumption: power (Watts) for cooling or transmission (RFID); energy for battery-operated devices
- Security: side-channel resistance, special circuit styles
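The two speed metrics above can be converted into each other once a clock frequency is fixed. The following is a minimal sketch of that conversion; the function names and the example numbers are illustrative assumptions, not values from the slides.

```python
def throughput_gbps(clock_hz: float, cycles_per_byte: float) -> float:
    """Throughput in Gbit/s for a core clocked at clock_hz that needs cycles_per_byte."""
    bytes_per_second = clock_hz / cycles_per_byte
    return bytes_per_second * 8 / 1e9


def cycles_per_byte(clock_hz: float, gbps: float) -> float:
    """Inverse conversion: cycles/byte implied by a throughput at a given clock."""
    return clock_hz * 8 / (gbps * 1e9)


# Hypothetical example: a 2 GHz processor at 16 cycles/byte.
print(throughput_gbps(2e9, 16))
# Hypothetical example: a 200 MHz hardware core delivering 1.6 Gbit/s.
print(cycles_per_byte(200e6, 1.6))
```

Cycles/byte hides the clock frequency, while Gbits/sec hides how many cycles were spent, so a fair comparison needs both numbers.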

7 Power density problem
[Figure: power density problem. Author: S. Borkar, Intel]

8 Power, energy
(Include picture of Intel cooling issues)
- Immediate need to add 8 MW to prepare for 2007 installs of new systems
- Need total of 40-50 MW for projected systems by 2011
- Numbers are just for computers; add 75% for cooling
- Cooling will require tons of chiller capacity
Source: ORNL (Oak Ridge National Lab), US Dept. of Energy

9 Heat and parallelism
Reduce power = reduce WASTE!!
- Single core (memory M, processor P, capacitance C): power (heat) $P_{\text{mono}} = C V^2 f$ (Watt)
- Four cores (each M/4, P/4, C/4) running at f/4: $4 \cdot (C/4) \cdot V^2 \cdot (f/4) = P_{\text{mono}}/4$
- But since $f \sim V$, it can be even $P_{\text{mono}}/4^3$
TREND: MULTI-CORE!!
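The arithmetic on this slide can be checked with a few lines of code. This is a minimal sketch in normalized units (C = V = f = 1); the function names are illustrative, and lowering the supply voltage in proportion to the clock is the idealized f ~ V assumption stated on the slide, not a property of any specific technology.

```python
def dynamic_power(c: float, v: float, f: float) -> float:
    """Dynamic switching power P = C * V^2 * f."""
    return c * v**2 * f


def multicore_power(c, v, f, n, scale_voltage=False):
    """Total power of n cores, each with capacitance c/n and clock f/n.

    With scale_voltage=True the supply is lowered to v/n as well,
    which is the idealized f ~ V assumption from the slide.
    """
    v_core = v / n if scale_voltage else v
    return n * dynamic_power(c / n, v_core, f / n)


# Normalized units, so the results read as fractions of P_mono.
p_mono = dynamic_power(1.0, 1.0, 1.0)
print(multicore_power(1.0, 1.0, 1.0, 4) / p_mono)
print(multicore_power(1.0, 1.0, 1.0, 4, scale_voltage=True) / p_mono)
```

The first ratio is 1/4 and the second is 1/64 = 1/4^3, matching the two expressions on the slide.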

10 Low energy: battery capacity
[Figure: battery capacity, from Rabaey's slides]

11 What is hardware design?

12 Skiing down a mountain
Translation from spec into RTL (Register Transfer Level, e.g. VHDL, Verilog).
- Specification: HASHX (C, C++, block diagram)
- Algorithm transformations: pipelining, unrolling
- Memory transformations and optimizations: loop merging, compaction
- 40-bit accumulator, multi-precision arithmetic
- Targets: ASIC, FPGA, retargetable coprocessor, DSP processor, DSP-RISC, GPU

13 From RTL to tape-out or FPGA
Back-end: VHDL, Verilog, synthesis, FPGA.
- Targets: ASIC, FPGA, retargetable coprocessor, DSP, DSP-RISC, VLIW, extensions, GPU, CPU, RISC
- Hardware: Verilog/VHDL, Synopsys synthesis, Cadence place & route, FPGA download
- Software: C compilation, assembly optimization
- System-on-a-chip, system in package

14 Context 1: ASIC design. Standard-cell-based design.

15 Semicustom Design Flow
- Levels: behavioral, structural, physical
- Steps: design capture (HDL), pre-layout simulation, logic synthesis, floorplanning, placement, routing, circuit extraction, post-layout simulation, tape-out
- Design iteration: timing closure!
- Input: technology/library/manufacturer

16 Cell-based Design (or standard cells)
Figure labels: feedthrough cell, logic cell, routing channel, functional module (RAM, multiplier, ...).
Routing channel requirements are reduced by the presence of more interconnect layers.

17 Standard Cell Example [Brodersen92]

18 Standard Cell: The New Generation
Cell structure hidden under interconnect layers.

19 The Design Closure Problem
Iterative removal of timing violations (white lines). Courtesy Synopsys.

20 Synthesis together with Physical Design
- Inputs: RTL, (timing) constraints, macromodules (fixed netlists), technology/library manufacturer input
- Flow: physical synthesis, netlist with place-and-route info, place-and-route optimization, artwork

21 Benchmark on gate count??
Gate count (GE) depends on library and tools! Definition of one GATE?
Example: PRESENT [20] contains
- 1,000 GE in 0.35 µm technology: 53,974 µm²
- 1,169 GE in 0.25 µm technology: 32,987 µm²
- 1,075 GE in 0.18 µm technology: 10,403 µm²
Comparison is fair ONLY if the SAME library, SAME tools, and SAME settings are used.
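To make the dependence on the library visible, the implied area of one gate equivalent can be computed from the three PRESENT data points on the slide. This is a minimal sketch; the helper name is illustrative, and the remark that a GE is usually the area of the library's 2-input NAND gate is the common convention, not something the slide states.

```python
# (technology in um, gate equivalents, area in um^2), as listed on the slide.
PRESENT_RESULTS = [
    (0.35, 1000, 53974),
    (0.25, 1169, 32987),
    (0.18, 1075, 10403),
]


def area_per_ge(area_um2: float, gate_equivalents: float) -> float:
    """Implied area of one gate equivalent in um^2 (commonly one 2-input NAND)."""
    return area_um2 / gate_equivalents


for tech, ge, area in PRESENT_RESULTS:
    print(f"{tech} um: {ge} GE, {area_per_ge(area, ge):.2f} um^2 per GE")
```

The silicon area shrinks by roughly a factor of five from 0.35 µm to 0.18 µm, while the GE count of the same design moves up and then down again, which is why GE numbers from different libraries are hard to compare.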

22 Benchmark on synthesis settings??
The same VHDL design synthesized with different constraints will result in different performance. Benchmark on area-time product??
Note: the 2.7 GHz figure is taken from the synthesis report: NOT FEASIBLE in practice! [source: M. Knezevic]

23 Context 2: FPGA design

24 Late-Binding Implementation
Array-based: pre-diffused (gate arrays) and pre-wired (FPGAs).

25 Look-up Table Based Logic Cell
[Figure: look-up table with inputs In1 and In2 and output Out]

26 LUT-Based Logic Cell
[Figure: Xilinx 4000 Series logic cell (not the most up to date). Inputs C1...C4, D1...D4 and F1...F4 feed logic function blocks with control bits; multiplexers are controlled by the configuration program. Courtesy Xilinx]

27 RAM-based FPGA: Xilinx XC4000ex. Courtesy Xilinx.

28 Xilinx Virtex-II Pro FPGA
Figure labels: IBM PowerPC RISC CPU, synchronous dual-port RAM, Conexant 3.125 Gb serial, XtremeDSP, SelectIO-Ultra, SystemIO & XCITE.

29 Multi-Pass Place-and-Route Analysis
GMU SHA-512 on Xilinx Virtex: runs for different placement starting points.
[Figure: minimum clock per run, the smaller the better; about 20% between best and worst] [courtesy: Kris Gaj]

30 Dependence of Results on Requested Clock Frequency [courtesy: Kris Gaj]

31 Saar Drimer, Ph.D. thesis, Figure 5.2
Distribution of the maximum achievable clock frequency for place & route with 100 different PAR seeds.
- 1 & 2: for 1 or 4 AES instances
- 3 & 4: same on a different platform
- 5: different speed grade

32 FPGA benchmarks??
- Easier than ASIC: tools are (almost) free (at least at universities)
- Options: similar to software
- Trend getting worse: the FPGA becomes a heterogeneous machine
- Report with/without block RAMs
- Report with/without DSP multipliers
- Report with/without high-speed I/O

33 FPGA benchmarks??
- Area numbers: slices, LUTs, CLBs, ... Xilinx application engineer: "The number of CLBs inside LUTs changes from generation to generation." (Or was it LUTs inside CLBs?)
- Speed: accurately reported by tools
- Power: poorly reported by tools; hard to measure on board

34 Context 3: HW-SW interface. Dan would call this the API?

35 Intro: SHA3-ZOO
3 types of hardware reporting, but no interface!
- Fully autonomous
- Fully autonomous with external memory (SHA3 + Mem)
- Core functionality
Integration of the hash module??

36 Integration of the hash module: options for HW/SW co-design
Option 1: instruction set extension
- Tightly coupled: reuse of busses, reuse of registers
- Define the instruction; usually a C intrinsic or pragma
- Example: AES-NI of Intel (see Shay's presentation!)
- Example: build your own extension to an embedded processor, see e.g. Xtensa or Target Compiler Technologies

37 Option 2: memory mapped
[Figure: main processor, SHA3, local memory]
- Memory-mapped coprocessor, loosely coupled
- Typical for DSP and other embedded processors
- No need to change the compiler
- Check latency of coprocessor & memory consistency!

38 Option 3: novel forms of co-operation
[Figure: SHA3 attached to routers]
- Custom HW or Network on Chip (NoC)
- Loosely coupled, flexible interconnect
- Popular for large multicore designs (80 or 100 cores)
- One of many other cores

39 Can have different forms on a System-on-Chip (SoC)
Figure labels: external memory, CPU, memory, custom dp, I$, D$, memory controller, timer, parallel I/O, local bus, high-speed bus, bridge, peripheral bus, custom HW, DMA, bus master, UART, custom HW, direct I/O.

40 AES acceleration for SH3-DSP
- AES co-processor for 128-bit key, using GEZEL
- Communicates with the SH3-DSP ISS via the memory-mapped interface
- Software side: KVM on SH3-DSP ISS, in the GEZEL-SH co-simulator
- { volatile char *ins = (volatile char *)0x2f000; volatile int *dout = (volatile int *)0x2f004; volatile int *din = (volatile int *)0x2f008; }
- Memory-mapped interface: address 0x2f000: ins (8 bits); 0x2f004: dout (32 bits); 0x2f008: din (32 bits)
- Co-processor in GEZEL simulation kernel: aes_encoder and aes_top, with signals load, reset, key, text_in, done, text_out (128 bits)
[Ref: Y. Matsuoka et al, CASES04]

41 AES Optimization results
Number of clock cycles per AES encryption (key scheduling + block encryption), starting from the Java function call in the user application. KNI overhead limits the overall performance gain.
- Configurations: (a) Java, (b) Java+C, (c) Java+C+GEZEL
- Columns: Java API I/F, KNI I/F, Acceleration I/F, Mem-Mapped I/F, Total Cycles
- Reported speedups: (6.8x), (10.4x)
[Ref: Y. Matsuoka et al, CASES04]

42 Context 4: Bandwidth

43 Adapt HW platform to application
Simple example: key schedule for a secret key. Two options:
- On the fly = just-in-time processing (key schedule feeds the BC directly). Typical for hardware.
- Pre-compute and store in memory (key schedule, then memory, then BC). Typical for software.

44 Key schedule on the fly
The cost of fast key context switching in SW. Example for an IPSEC router:
- One 128-bit key = 1408 bits of round keys (10 rounds + initial key)
- Half of internet packets are only 64 bytes in length (512 bits)
[Figure: context bandwidth (Gbps) versus record size (bytes) for ARC4, AES and 3DES, with data at 1 Gbps] [source: J. Goodman]
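The two numbers on this slide already show why pre-computed round keys are expensive to switch. Below is a minimal sketch of that arithmetic; the function names are illustrative, and it assumes a hypothetical worst case in which every record uses a different key, so the whole expanded key is reloaded once per record.

```python
def round_key_bits(rounds: int = 10, key_bits: int = 128) -> int:
    """Round-key material for AES-128: the initial key plus one key per round."""
    return (rounds + 1) * key_bits


def context_bandwidth_gbps(record_bytes: int, data_gbps: float = 1.0,
                           context_bits: int = 1408) -> float:
    """Bandwidth needed to reload the key context once per record.

    Assumes (hypothetically) that every record uses a different key,
    so the full expanded key is fetched for each record.
    """
    records_per_second = data_gbps * 1e9 / (record_bytes * 8)
    return records_per_second * context_bits / 1e9


print(round_key_bits())
for size in (64, 256, 1500):
    print(size, context_bandwidth_gbps(size))
```

Under this assumption, 64-byte records at 1 Gbps need more bandwidth for key context than for the data itself, whereas an on-the-fly key schedule only has to load the 128-bit key.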

45 Benchmark??
Cost of HW module (minimum minimorum):
- Key storage: assume sub-keys on the fly
- State storage: does all state need to be alive all the time? Wide pipe vs. narrow pipe. Windowing? Think context switching.
- Input block / output block: can I start processing input before the complete input block and/or padding is present? Same for output: can I send output, or do I have to wait for the complete output block?

46 Context 5: gap between application and architecture

47 Match between algorithm & architecture
Close the gap between application and platform:
- Dedicated HW: ASIC
- Programmable HW: FPGA
- Custom instructions, hand-coded assembly
- Compiled code
- JAVA on virtual machine, compiled on a real machine
Figure labels: power, cost; ASIC, fixed platform???, general purpose.

48 Throughput and energy numbers: AES, 128-bit key, 128-bit data
Platform | Throughput | Power | Figure of merit (Gb/s/W = Gb/J)
0.18 µm CMOS | 3.84 Gbits/sec | 350 mW | 11 (1/1)
FPGA [1] | 1.32 Gbits/sec | 490 mW | 2.7 (1/4)
ASM StrongARM [2] | 31 Mbits/sec | 240 mW | 0.13 (1/85)
ASM Pentium III [3] | 648 Mbits/sec | 41.4 W | (1/800)
C Emb. Sparc [4] | 133 Kbits/sec | 120 mW | (1/10.000)
Java Emb. Sparc [5] | 450 bits/sec | 120 mW | (1/ )
[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544 cycles AES ECB on StrongARM SA-1110
[3] Helger Lipmaa: PIII assembly, hand-coded + Intel Pentium III (1.13 GHz) datasheet
[4] gcc, MHz Sparc, assumes 0.25 µm CMOS
[5] Java on KVM (Sun J2ME, non-JIT) on MHz Sparc, assumes 0.25 µm CMOS
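The figure of merit is simply throughput divided by power, so it can be recomputed from the throughput and power columns of the slide. This is a minimal sketch; the row labels are shortened and the function name is illustrative, and the computed ratios may differ slightly from the rounded values printed on the slide.

```python
# (label, throughput in bit/s, power in W), taken from the table on the slide.
ROWS = [
    ("0.18 um CMOS", 3.84e9, 0.350),
    ("FPGA", 1.32e9, 0.490),
    ("ASM StrongARM", 31e6, 0.240),
    ("ASM Pentium III", 648e6, 41.4),
    ("C Emb. Sparc", 133e3, 0.120),
    ("Java Emb. Sparc", 450.0, 0.120),
]


def figure_of_merit(bits_per_second: float, watts: float) -> float:
    """Energy efficiency in Gb/s/W, which equals Gb/J."""
    return bits_per_second / watts / 1e9


best = figure_of_merit(ROWS[0][1], ROWS[0][2])
for label, bps, watts in ROWS:
    fom = figure_of_merit(bps, watts)
    print(f"{label:16s} {fom:.3g} Gb/J  (1/{best / fom:.0f})")
```

Because Gb/s/W equals Gb/J, the same number answers both "how hot does it run at a given speed" and "how much data can a battery process".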

49 Context 6: transformations

50 Data Flow Graph representation
Illustrated with RIPEMD: indicate loops, operations, and delays.
[Figure: RIPEMD data flow graph with state words A, B, C, D, E, the function F, rotations rol(10) and rol(s), inputs Ki and Xi, and delay elements T_D]

51 Iteration Bound
$T_\infty = \max_l (t_l / w_l)$, where $t_l$ is the loop calculation time and $w_l$ is the number of algorithmic delays (marked with $T_D$) in the $l$-th loop.
[Figure: RIPEMD data flow graph]
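The iteration bound is commonly defined as the maximum, over all loops of the data flow graph, of the loop calculation time divided by the number of delays in that loop. A minimal sketch of that calculation follows; the loop values are hypothetical and are not taken from the RIPEMD graph.

```python
from fractions import Fraction


def iteration_bound(loops):
    """Return the maximum over all loops of t_l / w_l.

    loops: iterable of (t_l, w_l) pairs, where t_l is the loop calculation
    time and w_l the number of algorithmic delays in that loop.
    """
    bounds = []
    for t_l, w_l in loops:
        if w_l <= 0:
            raise ValueError("every loop needs at least one delay element")
        bounds.append(Fraction(t_l, w_l))
    return max(bounds)


# Hypothetical loops (time units are arbitrary, not taken from RIPEMD).
loops = [(4, 1), (6, 2), (3, 1)]
print(iteration_bound(loops))
```

No amount of pipelining or retiming can make one iteration faster than this bound, which is why it is a useful property of the algorithm itself rather than of one implementation.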

52 Iteration Bound
[Figure: RIPEMD data flow graph]

53 Critical path
The longest path between any two storage elements. It determines the clock frequency!
Problem: critical path > iteration bound!
[Figure: RIPEMD data flow graph]

54 Retiming transformation
A transformation technique that changes the locations of unit-delay elements in a circuit without affecting the input/output characteristics.
After retiming: critical path = iteration bound!
[Figure: retimed RIPEMD data flow graph, with inputs Ki+1 and Xi+1]
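The effect of retiming can be shown on a toy graph. The sketch below uses the standard retiming rule, in which an edge u->v with w delays ends up with w + r(v) - r(u) delays; the four-node loop, its unit delays, and the chosen retiming values are hypothetical and much simpler than the RIPEMD graph on the slide.

```python
def retime(edges, r):
    """Apply retiming r: each edge u->v gets w + r[v] - r[u] delays."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]


def critical_path(delay, edges):
    """Longest combinational path: node delays summed along delay-free edges.

    Assumes no delay-free cycle exists (otherwise the circuit is not a
    valid synchronous design and this recursion would not terminate).
    """
    succ = {}
    for u, v, w in edges:
        if w < 0:
            raise ValueError("illegal retiming: negative delay count")
        if w == 0:
            succ.setdefault(u, []).append(v)
    memo = {}

    def longest_from(u):
        if u not in memo:
            memo[u] = delay[u] + max(
                (longest_from(v) for v in succ.get(u, [])), default=0)
        return memo[u]

    return max(longest_from(u) for u in delay)


# Hypothetical loop of four unit-delay operations with two delay elements,
# so the iteration bound is 4 / 2 = 2.
delay = {"a": 1, "b": 1, "c": 1, "d": 1}
edges = [("a", "b", 0), ("b", "c", 0), ("c", "d", 0), ("d", "a", 2)]

before = critical_path(delay, edges)
after = critical_path(delay, retime(edges, {"a": 0, "b": 0, "c": 1, "d": 1}))
print(before, after)
```

Moving one of the two delays from the edge d->a to the edge b->c halves the critical path from 4 to 2 time units, which equals the iteration bound of this loop, while the number of delays in the loop stays the same.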

55 Hardware tricks
- For speed: parallelism, pipelining, loop unrolling; on FPGA: block RAM instead of logic
- For area: multiplexing; composite field instead of S-box
- For power/energy: parallelism, pipelining

56 Algorithm properties
As they affect HW realization:
- Internal state
- Block size
- Initialization cost
- Iterative, sequential, ...
- Parallelism

57 Benchmark efforts
- Benchmarks on FPGA, ASIC
- API efforts
- Open questions

58 Stefan Tillich: see his presentation for the context.

59 Brian Baldwin
FPGA: CubeHash, Grøstl, Shabal, SIMD, JH, Hamsi and Fugue. Core functionality & compression function. See his presentation for context.

60 Christian Wenzel-Benner: external Benchmarking extension

61 Miroslav Knezevic
Illustration of transformations, applied to Luffa and others. More observations.

62 Patrick Schaumont: API for HW
- INIT & GETCONFIG: initialization, type of I/O, etc.
- IDATA & ODATA: parameter of 16, 32 bit for low-end processors; 64, 128 (256) for high-end processors

63 Kris Gaj: ATHENa
[Figure: flow between the ATHENa server, the designer, and the user]
0. Interfaces + testbenches
1. Download scripts and configuration files
2. FPGA synthesis and implementation (HDL + scripts + configuration files; HDL + FPGA tools)
3. Result summary + database entries
4. Database entries
5. Database query
6. Ranking of designs

64 ATHENa Major Features
- Synthesis, implementation, and timing analysis in batch mode
- Support for devices and tools of multiple FPGA vendors
- Generation of results for multiple families of FPGAs of a given vendor
- Automated choice of a best-matching device within a given family

65 Open questions
- Area comparisons
- Throughput comparisons
- Power/energy comparisons
- Sets of environments

66 Conclusions
Results depend on:
- ASIC set-up
- FPGA set-up
- Hardware API
- Bandwidth
- Transformations
Need: a set of contexts; area and speed, but also POWER and ENERGY!
