#### ARCHITECTURE OF COMPUTER SYSTEMS

LECTURE 15: VECTOR COMPUTERS

#### LAST TIME LECTURE 14:



#### SUPERCOMPUTERS

- Definition of a supercomputer:
  - Fastest machine in world at given task
  - A device to turn a compute-bound problem into an I/O bound problem
  - Any machine costing \$30M+
  - Any machine designed by Seymour Cray
- CDC6600 (Cray, 1964) regarded as first supercomputer

#### CDC 6600 SEYMOUR CRAY, 1963






- 128 Kword main memory capacity, 32 banks
- Ten functional units (parallel, unpipelined)
  - Floating Point: adder, 2 multipliers, divider
  - Integer: adder, 2 incrementers, ...



- Scoreboard for dynamic scheduling of instructions
- Ten Peripheral Processors for Input/Output
  - a fast multi-threaded 12-bit integer ALU
- Very fast clock, 10 MHz (FP add in 4 clocks)
- >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
- Fastest machine in world for 5 years (until 7600)



#### IBM MEMO ON CDC6600

Thomas Watson Jr., IBM CEO, August 1963: "Last week, Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer."

To which Cray replied: "It seems like Mr. Watson has answered his own question."



#### CDC 6600: A LOAD/STORE ARCHITECTURE

- Separate instructions to manipulate three types of registers:
  - 8 60-bit data registers (X)
  - 8 18-bit address registers (A)
  - 8 18-bit index registers (B)

All arithmetic and logic instructions are reg-to-reg

Only Load and Store instructions refer to memory!

| Field        | opcode | i | j | disp | Operation                      |
|--------------|--------|---|---|------|--------------------------------|
| Width (bits) | 6      | 3 | 3 | 18   | $Ri \leftarrow M[(Rj) + disp]$ |

Touching address registers 1 to 5 initiates a load; touching 6 or 7 initiates a store. This is very useful for vector operations.



#### CDC6600 ISA DESIGNED TO SIMPLIFY HIGH-PERFORMANCE IMPLEMENTATION

- Use of three-address, register-register ALU instructions simplifies pipelined implementation
  - No implicit dependencies between inputs and outputs
- Decoupling setting of address register (Ar) from retrieving value from data register (Xr) simplifies providing multiple outstanding memory accesses
  - Software can schedule load of address register before use of value
  - Can interleave independent instructions in between
- CDC6600 has multiple parallel but unpipelined functional units
  - E.g., 2 separate multipliers
- Follow-on machine CDC7600 used pipelined functional units
  - Foreshadows later RISC designs

#### CDC6600: VECTOR ADDITION

[Code listing not recovered; surviving fragments: `B0 ← -n`, `... B0, exit`]

Ai = address register, Bi = index register, Xi = data register

#### SUPERCOMPUTER APPLICATIONS

- Typical application areas
  - Military research (nuclear weapons, cryptography)
  - Scientific research
  - Weather forecasting
  - Oil exploration
  - Industrial design (car crash simulation)
  - Bioinformatics
  - Cryptography
- All involve huge computations on large data sets
- In 70s-80s, Supercomputer  $\equiv$  Vector Machine

#### **VECTOR PROGRAMMING MODEL**

[Figure: vector programming model, including the Vector Length Register (VLR)]

#### VECTOR CODE EXAMPLE

```
# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
  LI R4, 64
loop:
  L.D F0, 0(R1)
  L.D F2, 0(R2)
  ADD.D F4, F2, F0
  S.D F4, 0(R3)
  DADDIU R1, 8
  DADDIU R2, 8
  DADDIU R3, 8
  DSUBIU R4, 1
  BNEZ R4, loop

# Vector Code
  LI VLR, 64
  LV V1, R1
  LV V2, R2
  ADDV.D V3, V1, V2
  SV V3, R3
```

#### VECTOR SUPERCOMPUTERS

Epitomized by Cray-1, 1976:

- Scalar Unit
  - Load/Store Architecture
- Vector Extension
  - Vector Registers
  - Vector Instructions
- Implementation
  - Hardwired Control
  - Highly Pipelined Functional Units
  - Interleaved Memory System
  - No Data Caches
  - No Virtual Memory





Cray-1: memory bank cycle 50 ns, processor cycle 12.5 ns (80 MHz)

#### VECTOR INSTRUCTION SET ADVANTAGES

- Compact
  - one short instruction encodes N operations

- Expressive, tells hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
- Scalable
  - can run same code on more parallel pipelines (lanes)

#### VECTOR ARITHMETIC EXECUTION

- Use deep pipeline (=> fast clock) to execute element operations
- Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

Six stage multiply pipeline



#### VECTOR INSTRUCTION EXECUTION

Execution using one pipelined functional unit

Execution using four pipelined functional units





#### INTERLEAVED VECTOR MEMORY SYSTEM

Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency

- Bank busy time: Time before bank ready to accept next request
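The bank busy time determines which strides can be sustained at one access per cycle. The following is a minimal sketch, not taken from the slides. It assumes word-interleaved banks (bank = word address mod number of banks), one request issued per cycle, and a bank that can accept a new request `busy` cycles after the previous one. The function names are hypothetical.

```c
#include <assert.h>

/* Greatest common divisor (Euclid). */
static unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks touched by a stream with the given word stride,
   assuming bank = word_address % nbanks (an assumption of this sketch). */
unsigned banks_visited(unsigned stride, unsigned nbanks) {
    assert(nbanks > 0);
    unsigned s = stride % nbanks;
    return (s == 0) ? 1 : nbanks / gcd_u(s, nbanks);
}

/* Returns 1 if issuing one access per cycle never finds a bank still busy:
   the stream returns to the same bank only after banks_visited() cycles. */
int conflict_free(unsigned stride, unsigned nbanks, unsigned busy) {
    return banks_visited(stride, nbanks) >= busy;
}
```

With the Cray-1 figures quoted above (16 banks, 4 cycle busy time), `conflict_free(1, 16, 4)` holds because a unit-stride stream cycles through all 16 banks, while a stride of 8 touches only 2 banks and would stall under this model.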





#### T0 VECTOR MICROPROCESSOR (UCB/ICSI, 1995)

Vector register elements striped over lanes

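#### VECTOR INSTRUCTION PARALLELISM

Can overlap execution of multiple vector instructions

- example machine has 32 elements per vector register and 8 lanes

[Figure: instruction issue over time, with load, mul, and add instructions overlapped on the Load Unit, Multiply Unit, and Add Unit]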

Complete 24 operations/cycle while issuing 1 short instruction/cycle

#### VECTOR CHAINING

- Introduced with Cray-1
- Without chaining, must wait for last element of result to be written before starting dependent instruction
- With chaining, can start dependent instruction as soon as first result appears



#### Two components of vector startup penalty

- functional unit latency (time through pipeline)
- dead time or recovery time (time before another vector instruction can start down pipeline)



#### DEAD TIME AND SHORT VECTORS
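As a rough sketch of why dead time matters most for short vectors: if a vector instruction keeps a pipeline active for ceil(VL / lanes) cycles and is then followed by a fixed dead time, the fraction of cycles doing useful work is active / (active + dead). The model, the function name `pipeline_efficiency`, and the numbers in the usage note are illustrative assumptions, not figures taken from the slides.

```c
#include <assert.h>

/* Fraction of cycles in which the pipeline accepts element operations,
   under a simplified model: each vector instruction is active for
   ceil(vl / lanes) cycles, then the pipeline is unusable for `dead` cycles. */
double pipeline_efficiency(unsigned vl, unsigned lanes, unsigned dead) {
    assert(vl > 0 && lanes > 0);
    unsigned active = (vl + lanes - 1) / lanes;   /* ceiling division */
    return (double)active / (double)(active + dead);
}
```

Under this model, 128-element vectors on 2 lanes with 4 cycles of dead time give 64 / 68, roughly 94% efficiency, whereas 8-element vectors on the same machine give 4 / 8, only 50%. With zero dead time the efficiency is 1 regardless of vector length.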



#### VECTOR MEMORY-MEMORY VERSUS VECTOR REGISTER MACHINES

- Vector memory-memory instructions hold all vector operands in main memory
- The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
- Cray-1 ('76) was first vector register machine

Example Source Code
for (i=0; i<N; i++)
{
 C[i] = A[i] + B[i];
 D[i] = A[i] - B[i];
}

Vector Memory-Memory Code

ADDV C, A, B
SUBV D, A, B

Vector Register Code

LV V1, A
LV V2, B
ADDV V3, V1, V2
SV V3, C
SUBV V4, V1, V2
SV V4, D

## VECTOR MEMORY-MEMORY VS. VECTOR REGISTER

- Vector memory-memory architectures (VMMA) require greater main memory bandwidth. Why?
  - All operands must be read from memory and all results written back to memory
- VMMAs make it difficult to overlap execution of multiple vector operations. Why?
  - Must check dependencies on memory addresses
- VMMAs incur greater startup latency
  - Scalar code was faster on CDC Star-100 for vectors < 100 elements
  - For Cray-1, vector/scalar breakeven point was around 2 elements
- Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures
- (we ignore vector memory-memory from now on)
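
A rough count for the two-statement example loop illustrates the bandwidth point. It assumes vectors of $N$ elements and counts only vector element transfers to or from main memory; the symbols $T_{\text{mem-mem}}$ and $T_{\text{reg}}$ are introduced here for illustration.

$$
\begin{aligned}
T_{\text{mem-mem}} &= \underbrace{(2+1)N}_{\texttt{ADDV C, A, B}} + \underbrace{(2+1)N}_{\texttt{SUBV D, A, B}} = 6N \\
T_{\text{reg}} &= \underbrace{2N}_{\text{two } \texttt{LV}} + \underbrace{2N}_{\text{two } \texttt{SV}} = 4N
\end{aligned}
$$

The register version reads A and B once and reuses them from V1 and V2, so the gap widens as more operations share the same operands.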



#### **VECTOR STRIPMINING**

**Problem:** Vector registers have finite length

**Solution:** Break loops into pieces that fit in registers ("Stripmining")
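
A minimal C sketch of stripmining, assuming a maximum vector length `MVL` of 64 elements (the constant and the function name are assumptions for illustration). The inner loop stands in for one vector instruction sequence executed with the vector length register set to `vl`. Handling the odd-sized remainder in the first strip is one common arrangement, not the only one.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length, in elements */

/* C[i] = A[i] + B[i] for i in [0, n), processed in strips of at most MVL.
   Returns the number of strips, i.e. how many times the vector
   instruction sequence would be issued. */
size_t stripmined_add(double *c, const double *a, const double *b, size_t n) {
    assert(n == 0 || (c != NULL && a != NULL && b != NULL));
    size_t strips = 0;
    size_t vl = n % MVL;                 /* first strip takes the remainder */
    for (size_t i = 0; i < n; i += vl, vl = MVL) {
        if (vl == 0) vl = MVL;           /* n is a multiple of MVL */
        for (size_t j = 0; j < vl; j++)  /* one vector op with VLR = vl */
            c[i + j] = a[i + j] + b[i + j];
        strips++;
    }
    return strips;
}
```

For `n = 130` the sketch runs three strips of lengths 2, 64, and 64; for `n = 128` it runs two full strips.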



#### VECTOR CONDITIONAL EXECUTION

Problem: Want to vectorize loops with conditional code:

`for (i=0; i<N; i++) if (A[i] > 0) A[i] = B[i];`

Solution: Add vector *mask* (or *flag*) registers

- vector version of predicate registers, 1 bit per element
- ...and *maskable* vector instructions
  - vector operation becomes bubble ("NOP") at elements where mask bit is clear

#### Code example:

```
CVM              # Turn on all elements
LV vA, rA        # Load entire A vector
SGTVS.D vA, F0   # Set bits in mask register where A>0
LV vA, rB        # Load B vector into A under mask
SV vA, rA        # Store A back to memory under mask
```
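
The semantics of the masked sequence above can be sketched in scalar C. This is an illustration only: every element position is visited and the mask merely gates the write. The function name and the 64-element limit are assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VL 64  /* assumed vector register length */

/* Emulates: for (i = 0; i < n; i++) if (A[i] > 0) A[i] = B[i];
   Returns the number of elements whose mask bit was set. */
size_t masked_copy(double *a, const double *b, size_t n) {
    assert(n <= MAX_VL);
    bool mask[MAX_VL];
    size_t set = 0;

    /* SGTVS.D: compare every element against 0 and record one bit each. */
    for (size_t i = 0; i < n; i++) {
        mask[i] = a[i] > 0.0;
        set += mask[i];
    }
    /* LV/SV under mask: elements with a clear bit are left untouched. */
    for (size_t i = 0; i < n; i++) {
        if (mask[i]) a[i] = b[i];
    }
    return set;
}
```

For `A = {1, -2, 3, 0}` and `B = {10, 20, 30, 40}`, the mask is set at positions 0 and 2, so A becomes `{10, -2, 30, 0}`.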

#### MASKED VECTOR INSTRUCTIONS

Simple Implementation

- execute all N operations, turn off result writeback according to mask

Density-Time Implementation

- scan mask vector and only execute elements with non-zero masks





#### VECTOR REDUCTIONS

**Problem:** Loop-carried dependence on reduction variables

for (i=0; i<N; i++)
  sum += A[i]; # Loop-carried dependence on sum

**Solution**: Re-associate operations if possible, use binary tree to perform reduction
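
One way to sketch the re-associated reduction in C, assuming a vector length `VL` of 8 (a power of two chosen for illustration) and a hypothetical function name. Each inner loop over `j` has no loop-carried dependence, so it corresponds to one vector operation. Re-association can change floating-point rounding, which is why the transformation applies only "if possible".

```c
#include <assert.h>
#include <stddef.h>

#define VL 8  /* assumed vector length; must be a power of two here */

double tree_sum(const double *a, size_t n) {
    assert(n == 0 || a != NULL);
    double partial[VL] = {0.0};

    /* Phase 1: VL independent running sums, one per element position. */
    size_t i = 0;
    for (; i + VL <= n; i += VL)
        for (size_t j = 0; j < VL; j++)
            partial[j] += a[i + j];
    for (size_t j = 0; i < n; i++, j++)   /* leftover elements (short vector) */
        partial[j] += a[i];

    /* Phase 2: binary tree, halving the active length at each step. */
    for (size_t len = VL / 2; len >= 1; len /= 2)
        for (size_t j = 0; j < len; j++)
            partial[j] += partial[j + len];
    return partial[0];
}
```

Phase 1 takes about N / VL vector adds and phase 2 takes log2(VL) more, instead of N dependent scalar adds.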

#### VECTOR SCATTER/GATHER

Want to vectorize loops with indirect accesses:

```
for (i=0; i<N; i++)
  A[i] = B[i] + C[D[i]];
```

Indexed load instruction (*Gather*)

```
LV vD, rD        # Load indices in D vector
LVI vC, rC, vD   # Load indirect from rC base
LV vB, rB        # Load B vector
ADDV.D vA,vB,vC  # Do add
SV vA, rA        # Store result
```

#### VECTOR SCATTER/GATHER

Histogram example:

```
for (i=0; i<N; i++)
  A[B[i]]++;
```

Is the following a correct translation?

```
LV vB, rB        # Load indices in B vector
LVI vA, rA, vB   # Gather initial A values
ADDV vA, vA, 1   # Increment
SVI vA, rA, vB   # Scatter incremented values
```
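
One way to probe the question is to emulate the gather, add, scatter sequence in C and compare it with the scalar loop. This is a sketch under the assumption that the whole sequence operates on a single vector of at most 64 indices; the function names are hypothetical. When `B` contains a repeated index, each gathered copy of the same counter is incremented once and the scatter writes the same value twice, so the two versions can disagree.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VL 64  /* assumed vector register length */

/* Reference semantics: the original scalar loop. */
void histogram_scalar(int *a, const size_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[b[i]]++;
}

/* Emulation of LVI / ADDV / SVI on a single vector of n <= MAX_VL indices. */
void histogram_gather_scatter(int *a, const size_t *b, size_t n) {
    assert(n <= MAX_VL);
    int v[MAX_VL];
    for (size_t i = 0; i < n; i++) v[i] = a[b[i]];   /* gather  */
    for (size_t i = 0; i < n; i++) v[i] += 1;        /* add     */
    for (size_t i = 0; i < n; i++) a[b[i]] = v[i];   /* scatter */
}
```

With `B = {0, 1, 1, 2}` and A initially zero, the scalar loop yields `{1, 2, 1}` while the emulated vector sequence yields `{1, 1, 1}`: the translation is only safe when the indices within a vector are distinct.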

#### A MODERN VECTOR SUPER: NEC SX-9 (2008)

- 65nm CMOS technology
- Vector unit (3.2 GHz)
  - 8 foreground VRegs + 64 background VRegs (256x64-bit elements/VReg)
  - 64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit
  - 8 lanes (32+ FLOPS/cycle, 100+ GFLOPS peak per CPU)
  - 1 load or store unit (8 x 8-byte accesses/cycle)
- Scalar unit (1.6 GHz)
  - 4-way superscalar with out-of-order and speculative execution
  - 64KB I-cache and 64KB data cache
- Memory system provides 256GB/s DRAM bandwidth per CPU
- Up to 16 CPUs and up to 1TB DRAM form shared-memory node
  - total of 4TB/s bandwidth to shared DRAM memory
- Up to 512 nodes connected via 128GB/s network links (message passing between nodes)

#### MULTIMEDIA EXTENSIONS (AKA SIMD EXTENSIONS)

- Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
  - Lincoln Labs TX-2 from 1957 had 36b datapath split into 2x18b or 4x9b
  - Newer designs have wider registers
    - 128b for PowerPC Altivec, Intel SSE2/3/4
    - 256b for Intel AVX
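
The idea of splitting one wide register into independent narrow elements can be sketched in portable C. The example below treats a 64-bit integer as 4x16b and adds lane by lane without letting carries cross lane boundaries, which is what a packed-add instruction does in hardware. The function name `padd16` is hypothetical, and this is not any particular ISA's intrinsic.

```c
#include <stdint.h>

/* Packed add of four 16-bit lanes held in one 64-bit word.
   Each lane wraps around modulo 2^16 independently of its neighbours. */
uint64_t padd16(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8000800080008000ULL;  /* top bit of every lane */
    /* Add the low 15 bits of each lane; the cleared top bits absorb
       any carry, so nothing propagates into the next lane. */
    uint64_t low = (a & ~H) + (b & ~H);
    /* Restore the top bit of each lane: sum bit = a ^ b ^ carry-in. */
    return low ^ ((a ^ b) & H);
}
```

For example, `padd16(0x0001FFFF00020003, 0x0001000100010001)` gives `0x0002000000030004`: the lane holding `0xFFFF` wraps to `0x0000` without disturbing the lane above it.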



#### MULTIMEDIA EXTENSIONS VERSUS VECTORS

- Limited instruction set:
  - no vector length control
  - no strided load/store or scatter/gather
  - unit-stride loads must be aligned to 64/128-bit boundary
- Limited vector register length:
  - requires superscalar dispatch to keep multiply/add/load units busy
  - loop unrolling to hide latencies increases register pressure
- Trend towards fuller vector support in microprocessors
  - Better support for misaligned memory accesses
  - Support of double-precision (64-bit floating-point)
  - New Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b)

#### ACKNOWLEDGEMENTS

- These slides contain material developed and copyright by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
- MIT material derived from course 6.823
- UCB material derived from course CS252