Architecture of Computer Systems

Lecture 2 - Simple Machine Implementations

### Last Time in Lecture 1

- Computer Architecture >> ISAs and RTL
  - CS152 is about interaction of hardware and software, and design of appropriate abstraction layers
- Technology and Applications shape Computer Architecture
  - History provides lessons for the future
- First 130 years of CompArch, from Babbage to IBM 360
  - Move from calculators (no conditionals) to fully programmable machines
  - Rapid change started in WWII (mid-1940s), move from electromechanical to pure electronic processors
- Cost of software development becomes a large constraint on architecture (need compatibility)
- IBM 360 introduces notion of "family of machines" running same ISA but very different implementations
  - Six different machines released on same day (April 7, 1964)
  - "Future-proofing" for subsequent generations of machine

### IBM 360: Initial Implementations

| | Model 30 | ... | Model 70 |
|---|---|---|---|
| Memory | 8K - 64 KB | | 256K - 512 KB |
| Datapath | 8-bit | | 64-bit |
| Circuit Delay | 30 nsec/level | | 5 nsec/level |
| Local Store | Main Store | | Transistor Registers |
| Control Store | Read only 1 μsec | | Conventional circuits |

IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models.

Milestone: The first true ISA designed as a portable hardware-software interface!

### IBM 360 Survives Today:

- 6 cores
- Special-purpose coprocessors on each core
- 48MB of Level-3 cache on chip
- 32nm SOI technology
- 2.75 billion transistors
- 23.7mm x 25.2mm
- 15 layers of metal
- 7.68 miles of wiring!
- 10,000 power pins (!)
- 1,071 I/O pins



[From IBM HotChips24 presentation, August 28, 2012]

### Instruction Set Architecture (ISA)

- The contract between software and hardware
- Typically described by giving all the programmer-visible state (registers + memory) plus the semantics of the instructions that operate on that state
- IBM 360 was the first line of machines to separate ISA from implementation (a.k.a. *microarchitecture*)
- Many implementations possible for a given ISA
  - E.g., the Soviets built code-compatible clones of the IBM 360, as did Amdahl after he left IBM.
  - E.g., today you can buy AMD or Intel processors that run the x86-64 ISA.
  - E.g., many cellphones use the ARM ISA with implementations from many different companies including TI, Qualcomm, Samsung, Marvell, etc.

### ISA to Microarchitecture Mapping

- ISA often designed with particular microarchitectural style in mind, e.g.,
  - Accumulator ⇒ hardwired, unpipelined
  - CISC ⇒ microcoded
  - RISC ⇒ hardwired, pipelined
  - VLIW ⇒ fixed-latency in-order parallel pipelines
  - JVM ⇒ software interpretation
- But can be implemented with any microarchitectural style
  - Intel Ivy Bridge: hardwired pipelined CISC (x86) machine (with some microcode support)
  - Simics: Software-interpreted SPARC RISC machine
  - ARM Jazelle: A hardware JVM processor
  - This lecture: a microcoded RISC-V machine

### Today, Microprogramming

- To show how to build very small processors with complex ISAs
- To help you understand where CISC\* machines came from
- Because it is still used in common machines (IBM 360, x86, PowerPC)
- As a gentle introduction into machine structures
- To help understand how technology drove the move to RISC\*

\* "CISC"/"RISC" names are much newer than the style of machines they refer to.

### Microarchitecture: Implementation of an ISA



- Structure: how components are connected (static)
- Behavior: how data moves between components (dynamic)

### Microcontrol Unit (Maurice Wilkes, 1954)

First used in EDSAC-2, completed 1958

Embed the control logic state table in a memory array



### Microcoded Microarchitecture



### RISC-V: New RISC design from UC Berkeley

- Realistic & complete ISA, but open & small
- Not over-architected for a certain implementation style
- Both 32-bit and 64-bit address space variants
  - RV32 and RV64
- Designed for multiprocessing
- Efficient instruction encoding
- Easy to subset/extend for education/research
- Techreport with RISC-V spec available on class website
- We'll be using 32-bit RISC-V this semester in lectures and labs, very similar to MIPS you saw in CS61C

### **RV32 Processor State**

- Program counter (**pc**)
- 32 × 32-bit integer registers (**x0-x31**)
  - x0 always contains a 0
- 32 floating-point (FP) registers (**f0-f31**)
  - each can contain a single- or double-precision FP value (32-bit or 64-bit IEEE FP)
- FP status register (**fsr**), used for FP rounding mode & exception reporting

| Registers | Width | Notes |
|---|---|---|
| x0 / zero, x1 / ra, x2 ... x31 | XPRLEN bits | integer registers |
| pc | XPRLEN bits | program counter |
| f0 ... f31 | 64 bits | floating-point registers |
| fsr | 32 bits | FP status register |
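The programmer-visible state listed above can be sketched as a small Python class. This is only an illustration: the class name `RV32State` and its methods are hypothetical, and FP registers are held as raw integers rather than decoded IEEE values.

```python
from dataclasses import dataclass, field

XPRLEN = 32                    # integer register width for RV32
MASK = (1 << XPRLEN) - 1


@dataclass
class RV32State:
    """Programmer-visible RV32 state: pc, x0-x31, f0-f31, and fsr."""
    pc: int = 0
    x: list = field(default_factory=lambda: [0] * 32)  # integer registers
    f: list = field(default_factory=lambda: [0] * 32)  # raw 64-bit FP register contents
    fsr: int = 0                                        # FP status register (32 bits)

    def read_x(self, n: int) -> int:
        # x0 always reads as zero
        return 0 if n == 0 else self.x[n]

    def write_x(self, n: int, value: int) -> None:
        # Writes to x0 are discarded; other values are truncated to XPRLEN bits
        if n != 0:
            self.x[n] = value & MASK
```

Routing every access through `read_x`/`write_x` is one simple way to keep the "x0 always contains a 0" rule in a single place.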

### RISC-V Instruction Encoding



- Can support variable-length instructions.
- Base instruction set (RV32) always has fixed 32-bit instructions, whose lowest two bits are 11<sub>2</sub>
- All branches and jumps have targets at 16-bit granularity (even in base ISA where all instructions are fixed 32 bits)
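As a minimal sketch of the two encoding rules above, the helpers below test the low two bits of an instruction word and scale a branch or jump immediate to a byte offset. The function names are hypothetical, and only the properties stated on this slide are checked.

```python
def has_base_32bit_marker(instr: int) -> bool:
    """True if the lowest two bits are 11 (binary), as for every base RV32 instruction."""
    return (instr & 0b11) == 0b11


def byte_offset(imm: int) -> int:
    """Branch/jump immediates count 16-bit units, so the byte offset is imm << 1."""
    return imm << 1
```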

### RISC-V Instruction Formats



### R-Type/I-Type/R4-Type Formats



- R-type: Reg-Reg ALU operations
- I-type: Reg-Imm ALU operations (12-bit signed immediate); load instructions, (rs1 + immediate) addressing
- R4-type: only used for floating-point fused multiply-add (adds a Reg. Source 3 field)

### B-Type



- Branches: compare two registers, PC+(immediate<<1) target (branches do not have a delay slot)
- Store instructions: (rs1 + immediate) addressing, rs2 data

### L-Type



- Writes 20-bit immediate to top of destination register.
- Used to build large immediates.
- 12-bit immediates are signed, so the sign must be accounted for when building 32-bit immediates in a 2-instruction sequence (LUI high-20b, ADDI low-12b).
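The sign adjustment can be sketched as follows: when the low 12 bits would sign-extend to a negative number, the high 20 bits are bumped up by one to compensate. The helper names `split_imm32` and `rebuild` are hypothetical, and the code models only the arithmetic, not the instruction encodings.

```python
def split_imm32(value: int) -> tuple:
    """Split a 32-bit constant into (hi20, lo12) for a LUI + ADDI pair.

    lo12 is returned as a signed value in [-2048, 2047];
    hi20 is an unsigned 20-bit field.
    """
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:            # ADDI would sign-extend this to a negative number
        lo12 -= 0x1000
    hi20 = ((value - lo12) >> 12) & 0xFFFFF  # compensates for a negative lo12
    return hi20, lo12


def rebuild(hi20: int, lo12: int) -> int:
    """Value left in the register after LUI hi20 then ADDI lo12 (mod 2**32)."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF
```

For example, `split_imm32(0x12345FFF)` yields a high part of `0x12346` and a low part of `-1`, not `0x12345` and `0xFFF`.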



### J-Type

- "J": unconditional jump, PC+offset target
- "JAL": jump and link, also writes PC+4 to **x1**
- Offset scaled by 1-bit left shift, so can jump to a 16-bit instruction boundary (same for branches)

### Data Formats and Memory Addresses

#### Data formats:

8-bit bytes, 16-bit half words, 32-bit words, and 64-bit double words

#### Some issues





Microinstruction: register-to-register transfer (17 control signals)

- MA ← PC means RegSel = PC; enReg = yes; ldMA = yes
- B ← Reg[rs2] means RegSel = rs2; enReg = yes; ldB = yes
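The two examples above follow one pattern: select a register, enable it onto the bus, and assert the load signal of the destination latch. A minimal sketch of that pattern is shown below. The function name is hypothetical, and forming the load signal as `"ld" + dest` is an assumption generalized from the two examples (ldMA, ldB).

```python
def reg_transfer_signals(dest: str, src: str) -> dict:
    """Control signals for `dest <- register src` over the shared bus.

    dest: destination latch, e.g. "MA" or "B"
    src:  register chosen by RegSel, e.g. "PC" or "rs2"
    """
    # Assumed naming: the load enable of latch X is called "ldX"
    return {"RegSel": src, "enReg": True, "ld" + dest: True}
```

Any of the 17 control signals not listed in the returned dictionary is taken to be deasserted for that microinstruction.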

### Memory Module



Assumption: Memory operates independently and is slow as compared to Reg-to-Reg transfers (multiple CPU clock cycles per access)

### Instruction Execution

#### Execution of a RISC-V instruction involves:

- 1. instruction fetch
- 2. decode and register fetch
- 3. ALU operation
- 4. memory operation (optional)
- 5. write back to register file (optional)
  - + the computation of the next instruction address

### Microprogram Fragments

- **instr fetch:** (can be treated as a macro)
  - MA, A ← PC
  - PC ← A + 4
  - IR ← Memory
  - dispatch on Opcode
- **ALU:**
  - A ← Reg[rs1]
  - B ← Reg[rs2]
  - Reg[rd] ← func(A,B)
  - do instruction fetch
- **ALUi:**
  - A ← Reg[rs1]
  - B ← Imm (sign extension)
  - Reg[rd] ← Opcode(A,B)
  - do instruction fetch

### Microprogram Fragments (cont.)

- **LW:**
  - A ← Reg[rs1]
  - B ← Imm
  - MA ← A + B
  - Reg[rd] ← Memory
  - do instruction fetch
- **J:**
  - A ← A - 4 (get original PC back in A)
  - B ← IR
  - PC ← JumpTarg(A,B), where $\text{JumpTarg}(A,B) = A + (B[31:7] \ll 1)$
  - do instruction fetch
- **beq:**
  - A ← Reg[rs1]
  - B ← Reg[rs2]
  - if A==B then go to bz-taken
  - do instruction fetch
- **bz-taken:**
  - A ← PC
  - A ← A - 4 (get original PC back in A)
  - B ← BImm << 1, where BImm = IR[31:27,16:10]
  - PC ← A + B
  - do instruction fetch
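The JumpTarg(A,B) = A + (B[31:7] << 1) computation used by the J fragment can be sketched in Python. The helper names are hypothetical, and treating the 25-bit field as signed (so that backward jumps work) is an assumption not spelled out in the formula.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Extract value[hi:lo] (inclusive) as an unsigned integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value: int, width: int) -> int:
    """Interpret a `width`-bit field as a two's-complement number."""
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit


def jump_targ(a: int, b: int) -> int:
    """A holds the jump's own PC, B holds the instruction register."""
    offset = sign_extend(bits(b, 31, 7), 25) << 1  # assumed signed, in 16-bit units
    return (a + offset) & 0xFFFFFFFF
```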

### RISC-V Microcontroller: first attempt

pure ROM implementation



### Microprogram in the ROM worksheet

| State | Op | zero? | busy | Control points | next-state |
|-------|-----|-------|------|----------------|------------|
| fetch<sub>0</sub> | * | * | * | MA, A ← PC | fetch<sub>1</sub> |
| fetch<sub>1</sub> | * | * | yes | | fetch<sub>1</sub> |
| fetch<sub>1</sub> | * | * | no | IR ← Memory | fetch<sub>2</sub> |
| fetch<sub>2</sub> | * | * | * | PC ← A + 4 | ? |
| fetch<sub>2</sub> | ALU | * | * | PC ← A + 4 | ALU<sub>0</sub> |
| ALU<sub>0</sub> | * | * | * | A ← Reg[rs1] | ALU<sub>1</sub> |
| ALU<sub>1</sub> | * | * | * | B ← Reg[rs2] | ALU<sub>2</sub> |
| ALU<sub>2</sub> | * | * | * | Reg[rd] ← func(A,B) | fetch<sub>0</sub> |

### Microprogram in the ROM

| State | Op | zero? | busy | Control points | next-state |
|-------|-----|-------|------|----------------|------------|
| fetch<sub>0</sub> | * | * | * | MA, A ← PC | fetch<sub>1</sub> |
| fetch<sub>1</sub> | * | * | yes | ... | fetch<sub>1</sub> |
| fetch<sub>1</sub> | * | * | no | IR ← Memory | fetch<sub>2</sub> |
| fetch<sub>2</sub> | ALU | * | * | PC ← A + 4 | ALU<sub>0</sub> |
| fetch<sub>2</sub> | ALUi | * | * | PC ← A + 4 | ALUi<sub>0</sub> |
| fetch<sub>2</sub> | LW | * | * | PC ← A + 4 | LW<sub>0</sub> |
| fetch<sub>2</sub> | SW | * | * | PC ← A + 4 | SW<sub>0</sub> |
| fetch<sub>2</sub> | J | * | * | PC ← A + 4 | J<sub>0</sub> |
| fetch<sub>2</sub> | JAL | * | * | PC ← A + 4 | JAL<sub>0</sub> |
| fetch<sub>2</sub> | JR | * | * | PC ← A + 4 | JR<sub>0</sub> |
| fetch<sub>2</sub> | JALR | * | * | PC ← A + 4 | JALR<sub>0</sub> |
| fetch<sub>2</sub> | beq | * | * | PC ← A + 4 | beq<sub>0</sub> |
| ... | | | | | |
| ALU<sub>0</sub> | * | * | * | A ← Reg[rs1] | ALU<sub>1</sub> |
| ALU<sub>1</sub> | * | * | * | B ← Reg[rs2] | ALU<sub>2</sub> |
| ALU<sub>2</sub> | * | * | * | Reg[rd] ← func(A,B) | fetch<sub>0</sub> |

### Microprogram in the ROM cont.

| State | Op | zero? | busy | Control points | next-state |
|-------|-----|-------|------|----------------|------------|
| ALUi<sub>0</sub> | * | * | * | A ← Reg[rs1] | ALUi<sub>1</sub> |
| ALUi<sub>1</sub> | * | * | * | B ← Imm | ALUi<sub>2</sub> |
| ALUi<sub>2</sub> | * | * | * | Reg[rd] ← Op(A,B) | fetch<sub>0</sub> |
| ... | | | | | |
| J<sub>0</sub> | * | * | * | A ← A - 4 | J<sub>1</sub> |
| J<sub>1</sub> | * | * | * | B ← IR | J<sub>2</sub> |
| J<sub>2</sub> | * | * | * | PC ← JumpTarg(A,B) | fetch<sub>0</sub> |
| ... | | | | | |
| beq<sub>0</sub> | * | * | * | A ← Reg[rs1] | beq<sub>1</sub> |
| beq<sub>1</sub> | * | * | * | B ← Reg[rs2] | beq<sub>2</sub> |
| beq<sub>2</sub> | * | yes | * | A ← PC | beq<sub>3</sub> |
| beq<sub>2</sub> | * | no | * | | fetch<sub>0</sub> |
| beq<sub>3</sub> | * | * | * | A ← A - 4 | beq<sub>4</sub> |
| beq<sub>4</sub> | * | * | * | B ← BImm | beq<sub>5</sub> |
| beq<sub>5</sub> | * | * | * | PC ← A+B | fetch<sub>0</sub> |
| ... | | | | | |
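A pure-ROM controller is a table lookup on (state, op, zero?, busy). The sketch below encodes the fetch and ALU rows from the tables above and steps through one ALU instruction. The row format, the first-match wildcard handling, and the `busy_cycles` memory model are all assumptions made for illustration; a real ROM would expand each `*` into every matching address.

```python
ANY = "*"

# (state, op, zero?, busy) -> (control points, next state); "*" matches anything.
ROM = [
    ("fetch0", ANY,   ANY, ANY,   "MA,A <- PC",           "fetch1"),
    ("fetch1", ANY,   ANY, True,  "",                     "fetch1"),
    ("fetch1", ANY,   ANY, False, "IR <- Memory",         "fetch2"),
    ("fetch2", "ALU", ANY, ANY,   "PC <- A + 4",          "ALU0"),
    ("ALU0",   ANY,   ANY, ANY,   "A <- Reg[rs1]",        "ALU1"),
    ("ALU1",   ANY,   ANY, ANY,   "B <- Reg[rs2]",        "ALU2"),
    ("ALU2",   ANY,   ANY, ANY,   "Reg[rd] <- func(A,B)", "fetch0"),
]


def lookup(state, op, zero, busy):
    """Return (control, next_state) for the first ROM row matching the inputs."""
    for s, o, z, b, control, nxt in ROM:
        if s == state and o in (ANY, op) and z in (ANY, zero) and b in (ANY, busy):
            return control, nxt
    raise KeyError((state, op, zero, busy))


def run_one_instruction(op, busy_cycles):
    """Trace control points from fetch0 until the microprogram returns to fetch0."""
    state, trace = "fetch0", []
    while True:
        # Hypothetical memory model: busy for the first `busy_cycles` visits to fetch1
        busy = state == "fetch1" and busy_cycles > 0
        if busy:
            busy_cycles -= 1
        control, state = lookup(state, op, False, busy)
        trace.append(control)
        if state == "fetch0":
            return trace
```

Calling `run_one_instruction("ALU", 2)` returns one entry per microcycle, including the empty control words issued while `fetch1` waits on memory.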

### Size of Control Store



### Reducing Control Store Size

Control store has to be fast ⇒ expensive

- Reduce the ROM height (= address bits)
  - reduce inputs by extra external logic
    - each input bit doubles the size of the control store
  - reduce states by grouping opcodes
    - find common sequences of actions
  - condense input status bits
    - combine all exceptions into one, i.e., exception/no-exception
- Reduce the ROM width
  - restrict the next-state encoding
    - Next, Dispatch on opcode, Wait for memory, ...
  - encode control signals (vertical microcode)
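The effect of height and width on ROM size can be sketched with a small calculation. It assumes each ROM word stores the control signals plus a next-state field, and that the address is formed from the state bits and the input bits (opcode and status). The function name and the bit counts used in any call are hypothetical.

```python
def control_store_bits(state_bits: int, input_bits: int, control_bits: int) -> int:
    """Total ROM size in bits under the assumptions stated above."""
    height = 2 ** (state_bits + input_bits)   # one word per address
    width = control_bits + state_bits         # control signals + next-state field
    return height * width
```

Because the height is exponential in the address bits, removing a single input bit halves the ROM, while removing a control bit only trims the width by one column.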

### RISC-V Controller V2



### Jump Logic

μPCSrc = Case μJumpTypes

- next ⇒ μPC+1
- spin ⇒ if (busy) then μPC else μPC+1
- fetch ⇒ absolute
- dispatch ⇒ op-group
- ftrue ⇒ if (zero) then absolute else μPC+1
- ffalse ⇒ if (zero) then μPC+1 else absolute
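The jump logic cases (next, spin, fetch, dispatch, ftrue, ffalse) can be sketched as a single next-μPC function. The function signature is hypothetical: `absolute` stands for the absolute target named in the microinstruction and `op_group` for the dispatch address derived from the opcode.

```python
def next_upc(jump_type: str, upc: int, busy: bool, zero: bool,
             absolute: int, op_group: int) -> int:
    """Select the next micro-PC according to the jump type."""
    if jump_type == "next":
        return upc + 1
    if jump_type == "spin":          # wait in place while memory is busy
        return upc if busy else upc + 1
    if jump_type == "fetch":         # jump to the absolute address
        return absolute
    if jump_type == "dispatch":      # jump to the opcode group's entry point
        return op_group
    if jump_type == "ftrue":
        return absolute if zero else upc + 1
    if jump_type == "ffalse":
        return upc + 1 if zero else absolute
    raise ValueError("unknown jump type: " + jump_type)
```

With this encoding the next-state field only needs to name one of six jump types rather than a full state number, which is what narrows the ROM.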

### Instruction Fetch & ALU: RISC-V-Controller-2

| State | Control points | next-state |
|-------|----------------|------------|
| fetch<sub>0</sub> | MA, A ← PC | next |
| fetch<sub>1</sub> | IR ← Memory | spin |
| fetch<sub>2</sub> | PC ← A + 4 | dispatch |
| ALU<sub>0</sub> | A ← Reg[rs1] | next |
| ALU<sub>1</sub> | B ← Reg[rs2] | next |
| ALU<sub>2</sub> | Reg[rd] ← func(A,B) | fetch |
| ALUi<sub>0</sub> | A ← Reg[rs1] | next |
| ALUi<sub>1</sub> | B ← Imm | next |
| ALUi<sub>2</sub> | Reg[rd] ← Op(A,B) | fetch |

### Load & Store: RISC-V-Controller-2

| State | Control points | next-state |
|-------|----------------|------------|
| LW<sub>0</sub> | A ← Reg[rs1] | next |
| LW<sub>1</sub> | B ← Imm | next |
| LW<sub>2</sub> | MA ← A+B | next |
| LW<sub>3</sub> | Reg[rd] ← Memory | spin |
| LW<sub>4</sub> | | fetch |
| SW<sub>0</sub> | A ← Reg[rs1] | next |
| SW<sub>1</sub> | B ← BImm | next |

### Branches: RISC-V-Controller-2

| State | Control points | next-state |
|-------|----------------|------------|
| beq<sub>0</sub> | A ← Reg[rs1] | next |
| beq<sub>1</sub> | B ← Reg[rs2] | next |
| beq<sub>2</sub> | A ← PC | ffalse |
| beq<sub>3</sub> | A ← A - 4 | next |
| beq<sub>4</sub> | B ← BImm<<1 | next |
| beq<sub>5</sub> | PC ← A+B | fetch |

### Jumps: RISC-V-Controller-2

| State | Control points | next-state |
|-------|----------------|------------|
| J<sub>0</sub> | A ← A-4 | next |
| J<sub>1</sub> | B ← IR | next |
| J<sub>2</sub> | PC ← JumpTarg(A,B) | fetch |
| JR<sub>0</sub> | A ← Reg[rs1] | next |
| JR<sub>1</sub> | PC ← A | fetch |
| JAL<sub>0</sub> | A ← PC | next |
| JAL<sub>1</sub> | Reg[1] ← A | next |
| JAL<sub>2</sub> | A ← A-4 | next |
| JAL<sub>3</sub> | B ← IR | next |
| JAL<sub>4</sub> | PC ← JumpTarg(A,B) | fetch |



### Mem-Mem ALU Instructions: RISC-V-Controller-2

```
Mem-Mem ALU op:  M[(rd)] ← M[(rs1)] op M[(rs2)]

State     Control points        next-state
ALUMM0    MA ← Reg[rs1]         next
ALUMM1    A ← Memory            spin
ALUMM2    MA ← Reg[rs2]         next
ALUMM3    B ← Memory            spin
ALUMM4    MA ← Reg[rd]          next
ALUMM5    Memory ← func(A,B)    spin
ALUMM6                          fetch
```

Complex instructions usually do not require datapath modifications in a microprogrammed implementation; only extra space for the control program is needed

Implementing these instructions using a hardwired controller is difficult without datapath modifications

## Performance Issues

Microprogrammed control ⇒ multiple cycles per instruction

Cycle time? $t_C > \max(t_{\text{reg-reg}}, t_{\text{ALU}}, t_{\mu\text{ROM}})$

Good performance, relative to a single-cycle hardwired implementation, can be achieved even with a CPI of 10



- Horizontal μcode has wider μinstructions
  - Multiple parallel operations per µinstruction
  - Fewer microcode steps per macroinstruction
  - Sparser encoding ⇒ more bits
- Vertical μcode has narrower μinstructions
  - Typically a single datapath operation per µinstruction
    - separate μinstruction for branches
  - More microcode steps per macroinstruction
  - More compact ⇒ fewer bits
- Nanocoding
  - Tries to combine best of horizontal and vertical μcode

## Nanocoding

Exploits recurring control signal patterns in µcode, e.g.,

ALU<sub>0</sub>: A ← Reg[rs1]

...

ALUi<sub>0</sub>: A ← Reg[rs1]

...



- MC68000 had 17-bit μcode containing either 10-bit μjump or 9-bit nanoinstruction pointer
  - Nanoinstructions were 68 bits wide, decoded to give 196 control signals

## Microprogramming in IBM 360

|                        | M30   | M40   | M50   | M65   |
|------------------------|-------|-------|-------|-------|
| Datapath width (bits)  | 8     | 16    | 32    | 64    |
| μinst width (bits)     | 50    | 52    | 85    | 87    |
| μcode size (K μinsts)  | 4     | 4     | 2.75  | 2.75  |
| μstore technology      | CCROS | TCROS | BCROS | BCROS |
| μstore cycle (ns)      | 750   | 625   | 500   | 200   |
| memory cycle (ns)      | 1500  | 2500  | 2000  | 750   |
| Rental fee (\$K/month) | 4     | 7     | 15    | 35    |

Only the fastest models (75 and 95) were hardwired

## IBM Card Capacitor Read-Only Storage



#### Microcode Emulation

- IBM initially miscalculated the importance of software compatibility with earlier models when introducing the 360 series
- Honeywell stole some IBM 1401 customers by offering translation software ("Liberator") for Honeywell H200 series machine
- IBM retaliated with optional additional microcode for 360 series that could emulate IBM 1401 ISA, later extended for IBM 7000 series
  - one popular program on 1401 was a 650 simulator, so some customers ran many 650 programs on emulated 1401s
    - (650 simulated on 1401 emulated on 360)

# Microprogramming thrived in the Seventies

- Significantly faster ROMs than DRAMs were available
- For complex instruction sets, datapath and controller were cheaper and simpler
- New instructions, e.g., floating point, could be supported without datapath modifications
- Fixing bugs in the controller was easier
- ISA compatibility across various models could be achieved easily and cheaply

Except for the cheapest and fastest machines, all computers were microprogrammed

#### Writable Control Store (WCS)

- Implement control store in RAM not ROM
  - MOS SRAM memories now almost as fast as control store (core memories/DRAMs were 2-10x slower)
  - Bug-free microprograms difficult to write
- User-WCS provided as option on several minicomputers
  - Allowed users to change microcode for each processor
- User-WCS failed
  - Little or no programming tools support
  - Difficult to fit software into small space
  - Microcode control tailored to original ISA, less useful for others
  - Large WCS part of processor state ⇒ expensive context switches
  - Protection difficult if user can change microcode
  - Virtual memory required restartable microcode

## Microprogramming is far from extinct

- Played a crucial role in micros of the Eighties
  - DEC uVAX, Motorola 68K series, Intel 286/386
- Plays an assisting role in most modern micros
  - e.g., AMD Bulldozer, Intel Ivy Bridge, Intel Atom, IBM PowerPC, ...
  - Most instructions executed directly, i.e., with hardwired control
  - Infrequently-used and/or complicated instructions invoke microcode
- Patchable microcode common for post-fabrication bug fixes, e.g. Intel processors load μcode patches at bootup

## Acknowledgements

- These slides contain material developed and copyright by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
- MIT material derived from course 6.823
- UCB material derived from course CS252