Microarchitecture

Design of Digital Circuits 2017
Srdjan Capkun
Onur Mutlu
(Guest starring: Frank K. Gürkaynak and Aanjhan Ranganathan)

http://www.syssec.ethz.ch/education/Digitaltechnik_17
In This Lecture

- Overview of the MIPS Microprocessor
- RTL Design Revisited
- Single-Cycle Processor
- Performance Analysis
## Introduction

### Microarchitecture:
- How to implement an architecture in hardware

### Processor:
- Datapath: functional blocks
- Control: control signals

<table>
<thead>
<tr>
<th>Physics</th>
<th>Devices</th>
<th>Analog Circuits</th>
<th>Digital Circuits</th>
<th>Logic</th>
<th>Micro-architecture</th>
<th>Architecture</th>
<th>Operating Systems</th>
<th>Application Software</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>programs</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>device drivers</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>instructions</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>registers</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>datapaths</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>controllers</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>adders</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>memories</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>AND gates</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>NOT gates</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>amplifiers</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>filters</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>transistors</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>diodes</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>electrons</td>
</tr>
</tbody>
</table>
Microarchitecture

- Multiple implementations for a single architecture:
  - **Single-cycle** (this week)
    - Each instruction executes in a single cycle
  - **Multicycle** (in a few weeks)
    - Each instruction is broken up into a series of shorter steps
  - **Pipelined** (even later)
    - Each instruction is broken up into a series of steps
    - Multiple instructions execute at once.
What does our Microprocessor do?

- Program is in memory, stored as a series of instructions.

- Basic execution is:
  - Read one instruction
  - Decode what it wants
  - Find the operands either from registers or from memory
  - Perform the operation as dictated by the instruction
  - Write the result (if necessary)
  - Go to the next instruction
What are the Basic Components of MIPS?

- The main pieces
  - Memory to store program
  - Registers to access data fast
  - Data memory to store more data
  - An ALU to perform the core function
What are the Basic Components of MIPS?

- **The main pieces**
  - Memory to store program
  - Registers to access data fast
  - Data memory to store more data
  - An ALU to perform the core function

- **Some auxiliary pieces**
  - A program counter to tell where to find next instruction
  - Some logic to decode instructions
  - A way to manipulate the program counter for branches
What are the Basic Components of MIPS?

- The main pieces
  - Memory to store program
  - Registers to access data fast
  - Data memory to store more data
  - An ALU to perform the core function

- Some auxiliary pieces
  - A program counter to tell where to find next instruction
  - Some logic to decode instructions
  - A way to manipulate the program counter for branches

- Today we will see how it all comes together
RTL Design Revisited

- Registers store the current state

- Combinational logic determines next state
  - Data moves from one register to another (datapath)
  - Control signals move data through the datapath
  - Control signals depend on state and input

- Clock moves us from state to another
Let us Start with Building the Processor

- Program is stored in a (read-only) memory as 32-bit binary values.

- Memory addresses are also 32-bit wide, allowing (in theory) $2^{32} = 4$ Gigabytes of addressed memory.

- The actual address is stored in a register called PC.

<table>
<thead>
<tr>
<th>Address</th>
<th>Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>0040000C</td>
<td>016D4022</td>
</tr>
<tr>
<td>00400008</td>
<td>2268FFF4</td>
</tr>
<tr>
<td>00400004</td>
<td>02328020</td>
</tr>
<tr>
<td>00400000</td>
<td>8C0A0020</td>
</tr>
</tbody>
</table>

Assembly Code | Machine Code
-------------|-------------
lw $t2, 32($0)   | 0x8C0A0020
add $s0, $s1, $s2 | 0x02328020
addi $t0, $s3, -12 | 0x2268FFF4
sub $t0, $t3, $t5 | 0x016D4022

Main Memory

PC
What to do with the Program Counter?

- The PC needs to be increment by 4 during each cycle (for the time being).
- Initial PC value (after reset) is 0x00400000

```verilog
reg [31:0] PC_p, PC_n; // Present and next state of PC

// [...]
assign PC_n <= PC_p + 4; // Increment by 4;

always @ (posedge clk, negedge rst)
begin
  if (rst == '0') PC_p <= 32'h00400000; // default
  else PC_p <= PC_n; // when clk
end
```
We have an Instruction Now What?

- There are different types of instructions
  - **R type** (three registers)
  - **I type** (the one with immediate values)
  - **J type** (for changing the flow)

- First only a subset of MIPS instructions:
  - R type instructions: and, or, add, sub, slt
  - Memory instructions: lw, sw
  - Branch instructions: beq

- Later we will add addi and j
R-Type MIPS instructions

- **Register-type: 3 register operands:**
  - rs, rt: source registers
  - rd: destination register

- **Other fields:**
  - op: the operation code or opcode (0 for R-type instructions)
  - funct: the function together, the opcode and function tell the computer what operation to perform
  - shamt: the shift amount for shift instructions, otherwise it’s 0

### R-Type

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>rd</th>
<th>shamt</th>
<th>funct</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>6 bits</td>
</tr>
</tbody>
</table>
R-Type Examples

Assembly Code

add $s0, $s1, $s2
sub $t0, $t3, $t5

Field Values

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>rd</th>
<th>shamt</th>
<th>funct</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>17</td>
<td>18</td>
<td>16</td>
<td>0</td>
<td>32</td>
</tr>
<tr>
<td>0</td>
<td>11</td>
<td>13</td>
<td>8</td>
<td>0</td>
<td>34</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>rd</th>
<th>shamt</th>
<th>funct</th>
</tr>
</thead>
<tbody>
<tr>
<td>000000</td>
<td>10001</td>
<td>10010</td>
<td>10000</td>
<td>00000</td>
<td>100000</td>
</tr>
<tr>
<td>000000</td>
<td>01011</td>
<td>01101</td>
<td>01000</td>
<td>00000</td>
<td>100010</td>
</tr>
</tbody>
</table>

Machine Code

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>rd</th>
<th>shamt</th>
<th>funct</th>
</tr>
</thead>
<tbody>
<tr>
<td>000000</td>
<td>10001</td>
<td>10010</td>
<td>10000</td>
<td>00000</td>
<td>100000</td>
</tr>
<tr>
<td>000000</td>
<td>01011</td>
<td>01101</td>
<td>01000</td>
<td>00000</td>
<td>100010</td>
</tr>
</tbody>
</table>

(0x02328020)
(0x016D4022)

Note the order of registers in the assembly code:

add rd, rs, rt
We Need a Register File

- **Store 32 registers, each 32-bit**
  - $2^5 = 32$, we need 5 bits to address each

- **Every R-type instruction uses 3 register**
  - Two for reading (RS, RT)
  - One for writing (RD)

- **We need a special memory with:**
  - 2 read ports (address x2, data out x2)
  - 1 write port (address, data in)
Register File

```verilog
input [4:0] a_rs, a_rt, a_rd;
input [31:0] di_rd;
input we_rd;
output [31:0] do_rs, do_rt;

reg [31:0] R_arr [31:0]; // Array that stores regs

// Circuit description
assign do_rs = R_arr[a_rs]; // Read RS
assign do_rt = R_arr[a_rt]; // Read RT

always @(posedge clk)
    if (we_rd) R_arr[a_rd] <= di_rd; // write RD
```
Register File

```verilog
input [4:0] a_rs, a_rt, a_rd;
input [31:0] di_rd;
input we_rd;
output [31:0] do_rs, do_rt;

reg [31:0] R_arr [31:0]; // Array that stores regs

// Circuit description; add the trick with $0
assign do_rs = (a_rs != 5'b00000)? // is address 0?
    R_arr[a_rs] : 0; // Read RS or 0

assign do_rt = (a_rt != 5'b00000)? // is address 0?
    R_arr[a_rt] : 0; // Read RT or 0

always @(posedge clk)
    if (we_rd) R_arr[a_rd] <= di_rd; // write RD
```
Arithmetic Logic Unit

*The reason why we study digital circuits: the part of the CPU that does something (other than copying data)*

- Defines the basic operations that the CPU can perform directly
  - Other functions can be realized using the existing ones iteratively. (i.e. multiplication can be realized by shifting and adding)

- Mostly, a collection of resources that work in parallel.
  - Depending on the operation one of the outputs is selected
Example: Arithmetic Logic Unit (ALU), pg243

<table>
<thead>
<tr>
<th>$F_{2:0}$</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>000</td>
<td>A &amp; B</td>
</tr>
<tr>
<td>001</td>
<td>A</td>
</tr>
<tr>
<td>010</td>
<td>A + B</td>
</tr>
<tr>
<td>011</td>
<td>not used</td>
</tr>
<tr>
<td>100</td>
<td>A &amp; ~B</td>
</tr>
<tr>
<td>101</td>
<td>A</td>
</tr>
<tr>
<td>110</td>
<td>A - B</td>
</tr>
<tr>
<td>111</td>
<td>SLT</td>
</tr>
</tbody>
</table>
Example: ALU Design

<table>
<thead>
<tr>
<th>$F_{2:0}$</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>000</td>
<td>A &amp; B</td>
</tr>
<tr>
<td>001</td>
<td>A</td>
</tr>
<tr>
<td>010</td>
<td>A + B</td>
</tr>
<tr>
<td>011</td>
<td>not used</td>
</tr>
<tr>
<td>100</td>
<td>A &amp; ~B</td>
</tr>
<tr>
<td>101</td>
<td>A</td>
</tr>
<tr>
<td>110</td>
<td>A - B</td>
</tr>
<tr>
<td>111</td>
<td>SLT</td>
</tr>
</tbody>
</table>
ALU Does the Real Work in a Processor

<table>
<thead>
<tr>
<th>$F_{2:0}$</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>000</td>
<td>$A &amp; B$</td>
</tr>
<tr>
<td>001</td>
<td>$A \mid B$</td>
</tr>
<tr>
<td>010</td>
<td>$A + B$</td>
</tr>
<tr>
<td>011</td>
<td>not used</td>
</tr>
<tr>
<td>100</td>
<td>$A &amp; \neg B$</td>
</tr>
<tr>
<td>101</td>
<td>$A \mid \neg B$</td>
</tr>
<tr>
<td>110</td>
<td>$A - B$</td>
</tr>
<tr>
<td>111</td>
<td>SLT</td>
</tr>
</tbody>
</table>
So far so good

- We have most of the components for MIPS
  - We still do not have data memory
So far so good

- **We have most of the components for MIPS**
  - We still do not have data memory

- **We can implement all R-type instructions**
  - Assuming our ALU can handle all of the functions
  - At the moment we do not have the shifter
  - We know what to do for more functions.
So far so good

- **We have most of the components for MIPS**
  - We still do not have data memory

- **We can implement all R-type instructions**
  - Assuming our ALU can handle all of the functions
  - At the moment we do not have the shifter
  - We know what to do for more functions.

- **...but not much else**
  - *No memory access*
  - *No immediate values*
  - *No conditional branches, no absolute jumps*
Data Memory

- Will be used to store the bulk of data

```vhdl
input [15:0] addr;  // Only 16 bits in this example
input [31:0] di;
input we;
output [31:0] do;

reg [65535:0] M_arr [31:0];  // Array for Memory

// Circuit description
assign do = M_arr[addr];  // Read memory

always @(posedge clk)
  if (we) M_arr[addr] <= di;  // write memory
```
Let us Read (then Write) to Memory

- The main memory can have up to $2^{32}$ bytes
  - It needs an 32-bit address for the entire memory.
  - In practice much smaller memories are used

- We need the `lw` instruction to read from memory
  - I-type instruction
  - We need to calculate the address from an immediate value and a register.
  - Maybe we can use the ALU to add immediate to the register?
I-Type

- Immediate-type: 3 operands:
  - rs, rt: register operands
  - imm: 16-bit two’s complement immediate

- Other fields:
  - op: the opcode
  - Simplicity favors regularity: all instructions have opcode
  - Operation is completely determined by the opcode

I-Type

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>imm</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>16 bits</td>
</tr>
</tbody>
</table>
Differences between R-type and I-type

- **Changes to Register file**
  - $RS + \text{sign extended immediate}$ is the address
  - RS is always read.
  - For some instructions RT is read as well
  - If one register is written it is RT, not RD
  - The write\_data\_in can come from Memory or ALU
  - Not all I-type instructions write to register file

- **Changes to ALU**
  - Used to calculate memory address, add function needed during non R-type operations
  - Result is also used as address for the data memory
  - One input is no longer RT, but the immediate value
Now Writing to Memory is Similar

- We need the \texttt{sw} instruction to write to memory
  - I-type instruction
  - Address calculated the same way
  - Unlike, previous step, there is no writing back to register.
  - But we need to enable the write enable of the memory
Our MIPS Datapath has Several Options

- **ALU inputs**
  - Either RT or Immediate

- **Write Address of Register File**
  - Either RD or RT

- **Write Data In of Register File**
  - Either ALU out or Data Memory Out

- **Write enable of Register File**
  - Not always a register write

- **Write enable of Memory**
  - Only when writing to memory (sw)
Our MIPS Datapath has Several Options

- **ALU inputs**
  - Either RT or Immediate \((MUX)\)

- **Write Address of Register File**
  - Either RD or RT \((MUX)\)

- **Write Data In of Register File**
  - Either ALU out or Data Memory Out \((MUX)\)

- **Write enable of Register File**
  - Not always a register write \((MUX)\)

- **Write enable of Memory**
  - Only when writing to memory (sw) \((MUX)\)

All these options are our control signals
Let us Develop our Control Table

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Op_{5:0}</th>
<th>RegWrite</th>
<th>RegDst</th>
<th>AluSrc</th>
<th>MemWrite</th>
<th>MemtoReg</th>
<th>ALUOp</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

- **RegWrite**: Write enable for the register file
- **RegDst**: Write to register RD or RT
- **AluSrc**: ALU input RT or immediate
- **MemWrite**: Write Enable
- **MemtoReg**: Register data in from Memory or ALU
- **ALUOp**: What operation does ALU do
Let us Develop our Control Table

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Op&lt;sub&gt;5:0&lt;/sub&gt;</th>
<th>RegWrite</th>
<th>RegDst</th>
<th>AluSrc</th>
<th>MemWrite</th>
<th>MemtoReg</th>
<th>ALUOp</th>
</tr>
</thead>
<tbody>
<tr>
<td>R-type</td>
<td>000000</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>funct</td>
</tr>
</tbody>
</table>

- **RegWrite**: Write enable for the register file
- **RegDst**: Write to register RD or RT
- **AluSrc**: ALU input RT or immediate
- **MemWrite**: Write Enable
- **MemtoReg**: Register data in from Memory or ALU
- **ALUOp**: What operation does ALU do
Let us Develop our Control Table

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Op&lt;sub&gt;5:0&lt;/sub&gt;</th>
<th>RegWrite</th>
<th>RegDst</th>
<th>AluSrc</th>
<th>MemWrite</th>
<th>MemtoReg</th>
<th>ALUOp</th>
</tr>
</thead>
<tbody>
<tr>
<td>R-type</td>
<td>000000</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>funct</td>
</tr>
<tr>
<td>lw</td>
<td>100011</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>add</td>
</tr>
</tbody>
</table>

- **RegWrite**: Write enable for the register file
- **RegDst**: Write to register RD or RT
- **AluSrc**: ALU input RT or immediate
- **MemWrite**: Write Enable
- **MemtoReg**: Register data in from Memory or ALU
- **ALUOp**: What operation does ALU do
Let us Develop our Control Table

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Op_{5:0}</th>
<th>RegWrite</th>
<th>RegDst</th>
<th>AluSrc</th>
<th>MemWrite</th>
<th>MemtoReg</th>
<th>ALUOp</th>
</tr>
</thead>
<tbody>
<tr>
<td>R-type</td>
<td>000000</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>funct</td>
</tr>
<tr>
<td>lw</td>
<td>100011</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>add</td>
</tr>
<tr>
<td>sw</td>
<td>101011</td>
<td>0</td>
<td>X</td>
<td>1</td>
<td>1</td>
<td>X</td>
<td>add</td>
</tr>
</tbody>
</table>

- **RegWrite:** Write enable for the register file
- **RegDst:** Write to register RD or RT
- **AluSrc:** ALU input RT or immediate
- **MemWrite:** Write Enable
- **MemtoReg:** Register data in from Memory or ALU
- **ALUOp:** What operation does ALU do
Now we are much better

- We have almost all components for MIPS

- We already had (nearly) all R-type instructions
  - Our ALU can implement some functions
  - We know what to do for more functions.

- Now we can also Read and Write to memory

- Conditional Branches are still missing
  - Until know, PC just increases by 4 every time.
  - We need to somehow be able to influence this.
Branch if equal: BEQ

- From Appendix B:
  
  Opcode: 000100
  
  If ([rs] == [rt]) PC = BTA
  
  BTA = Branch Target Address
  = PC+4 +(SignImm <<2)
Branch if equal: BEQ

- **From Appendix B:**
  - Opcode: \texttt{000100}
  - If \([rs] == [rt]\) PC = BTA
  - BTA = Branch Target Address
    = PC+4 + (SignImm \ll 2)

- **We need ALU for comparison**
  - Subtract RS – RT
  - We need a ‘zero’ flag, if RS equals RT
Branch if equal: BEQ

- From Appendix B:
  Opcode: 000100
  If ([rs]==[rt]) PC= BTA
  BTA = Branch Target Address
  = PC+4 +(SignImm <<2)

- We need ALU for comparison
  - Subtract RS – RT
  - We need a ‘zero’ flag, if RS equals RT

- We also need a Second Adder
  - The immediate value has to be added to PC+4
Branch if equal: BEQ

- **From Appendix B:**
  - Opcode: 000100
  - If ([rs] == [rt]) PC = BTA
  - BTA = Branch Target Address
    = PC+4 + (SignImm << 2)

- **We need ALU for comparison**
  - Subtract RS – RT
  - We need a ‘zero’ flag, if RS equals RT

- **We also need a Second Adder**
  - The immediate value has to be added to PC+4

- **PC has now two options**
  - Either PC+4 or the new BTA
**More Control Signals**

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Op&lt;sub&gt;5:0&lt;/sub&gt;</th>
<th>RegWrite</th>
<th>RegDst</th>
<th>AluSrc</th>
<th>Branch</th>
<th>MemWrite</th>
<th>MemtoReg</th>
<th>ALUOp</th>
</tr>
</thead>
<tbody>
<tr>
<td>R-type</td>
<td>0000000</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>funct</td>
</tr>
<tr>
<td>lw</td>
<td>100011</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>add</td>
</tr>
<tr>
<td>sw</td>
<td>101011</td>
<td>0</td>
<td>X</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>X</td>
<td>add</td>
</tr>
<tr>
<td>beq</td>
<td>000100</td>
<td>0</td>
<td>X</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>X</td>
<td>sub</td>
</tr>
</tbody>
</table>

- **New Control Signal**
  - **Branch**: Are we jumping or not?
We are almost done

- We have almost all components for MIPS

- We already had (nearly) all R-type instructions
  - Our ALU can implement some functions
  - We know what to do for more functions.

- Now we can also Read and Write to memory

- We also have conditional branches

- Absolute jumps are still missing
  - J-type instructions
  - The next PC address is directly determined
Machine Language: J-Type

- **Jump-type**
  - 26-bit address operand (addr)
  - Used for jump instructions (j)

**J-Type**

<table>
<thead>
<tr>
<th>op</th>
<th>addr</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 bits</td>
<td>26 bits</td>
</tr>
</tbody>
</table>

JTA: 0000 0000 0100 0000 0000 0000 1010 0000 (0x004000A0)

26-bit addr: 0000 0000 0100 0000 0000 0000 0000 1010 0000 (0x0100028)

Field Values

<table>
<thead>
<tr>
<th>op</th>
<th>imm</th>
<th>Machine Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>6 bits</td>
<td>0000 0000 0100 0000 0000 0000 0000 1010 0000 (0x0C100028)</td>
</tr>
<tr>
<td>6 bits</td>
<td>26 bits</td>
<td>000011 00 0001 0000 0000 0010 1000 (0x0C100028)</td>
</tr>
</tbody>
</table>
Jump: J

- **From Appendix B:**
  Opcode: 000010
  PC = JTA
  JTA = Jump Target Address
  = {(PC+4)[31:28], addr, 2'b0}

- **No need for ALU**
  - Assemble the JTA from PC and 26-bit immediate

- **No need for memory, register file**

- **One more option (mux + control signal) for PC**
Even More Control Signals

<table>
<thead>
<tr>
<th>Instruction</th>
<th>$Op_{5:0}$</th>
<th>RegWrite</th>
<th>RegDst</th>
<th>AluSrc</th>
<th>Branch</th>
<th>MemWrite</th>
<th>MemtoReg</th>
<th>ALUOp</th>
<th>Jump</th>
</tr>
</thead>
<tbody>
<tr>
<td>R-type</td>
<td>000000</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>funct</td>
<td>0</td>
</tr>
<tr>
<td>lw</td>
<td>100011</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>add</td>
<td>0</td>
</tr>
<tr>
<td>sw</td>
<td>101011</td>
<td>0</td>
<td>X</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>X</td>
<td>add</td>
<td>0</td>
</tr>
<tr>
<td>beq</td>
<td>000100</td>
<td>0</td>
<td>X</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>X</td>
<td>sub</td>
<td>0</td>
</tr>
<tr>
<td>j</td>
<td>000010</td>
<td>0</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>0</td>
<td>X</td>
<td>1</td>
</tr>
</tbody>
</table>

- **New Control Signal**
  - **Jump**: Direct jump yes or no
Now we are Almost Done

- We have all major parts
  - We can implement, R, I, J type instructions
  - We can calculate, branch, and jump
  - We would be able to run most programs
Now we are Almost Done

- **We have all major parts**
  - We can implement, R, I, J type instructions
  - We can calculate, branch, and jump
  - We would be able to run most programs

- **Still we can improve:**
  - ALU can be improved for more ‘funct’
  - We do not have a shifter.
  - Or a multiplier.
  - More I-type functions (only lw and sw supported)
  - More J-type functions (jal, jr)
Now the original slides once again

- The following slides will repeat what we just did
- Consistent with the text book
MIPS State Elements

- **Program counter:**
  32-bit register

- **Instruction memory:**
  Takes input 32-bit address A and reads the 32-bit data (i.e., instruction) from that address to the read data output RD.

- **Register file:**
  The 32-element, 32-bit register file has 2 read ports and 1 write port

- **Data memory:**
  Has a single read/write port. If the write enable, WE, is 1, it writes data WD into address A on the rising edge of the clock. If the write enable is 0, it reads address A onto RD.
Single-Cycle Datapath: lw fetch

**STEP 1:** Fetch instruction

```plaintext
lw $s3, 1($0)  # read memory word 1 into $s3
```

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>imm</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>16 bits</td>
</tr>
</tbody>
</table>
Single-Cycle Datapath: lw register read

**STEP 2:** Read source operands from register file

\[ \text{lw} \ $s3, \ 1($0) \quad \# \ read \ memory \ word \ 1 \ into \ $s3 \]

### I-Type

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>imm</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>16 bits</td>
</tr>
</tbody>
</table>
Single-Cycle Datapath: `lw` immediate

**STEP 3:** Sign-extend the immediate

```
lw $s3, 1($0)  # read memory word 1 into $s3
```

I-Type

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>imm</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>16 bits</td>
</tr>
</tbody>
</table>
Single-Cycle Datapath: lw address

**STEP 4:** Compute the memory address

```
lw $s3, 1($0)  # read memory word 1 into $s3
```

**I-Type**

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>imm</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>16 bits</td>
</tr>
</tbody>
</table>
Single-Cycle Datapath: `lw` memory read

**STEP 5:** Read from memory and write back to register file

```
lw $s3, 1($0)  # read memory word 1 into $s3
```

**I-Type**

<p>| | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>op</td>
<td>rs</td>
<td>rt</td>
<td>imm</td>
</tr>
<tr>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>16 bits</td>
</tr>
</tbody>
</table>
Single-Cycle Datapath: lw PC increment

**STEP 6:** Determine address of next instruction

```
lw $s3, 1($0)  # read memory word 1 into $s3
```

**I-Type**

```
<table>
<thead>
<tr>
<th></th>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>imm</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>16 bits</td>
</tr>
</tbody>
</table>
```
Single-Cycle Datapath: sw

- Write data in rt to memory

sw $t7, 44($0)  # write t7 into memory address 44

I-Type

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>imm</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>16 bits</td>
</tr>
</tbody>
</table>
Single-Cycle Datapath: R-type Instructions

- Read from rs and rt, write ALUResult to register file

\[ \text{add } t, b, c \quad \# t = b + c \]

**R-Type**

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>rd</th>
<th>shamt</th>
<th>funct</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>5 bits</td>
<td>6 bits</td>
</tr>
</tbody>
</table>
Single-Cycle Datapath: beq

Determine whether values in rs and rt are equal
Calculate BTA = (sign-extended immediate \(\ll 2\)) + (PC+4)

beq $s0, $s1, target  # branch is taken
Complete Single-Cycle Processor

[Diagram of a complete single-cycle processor.]
Control Unit

<table>
<thead>
<tr>
<th>ALUOp</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>add</td>
</tr>
<tr>
<td>01</td>
<td>subtract</td>
</tr>
<tr>
<td>10</td>
<td>look at funct field</td>
</tr>
<tr>
<td>11</td>
<td>n/a</td>
</tr>
</tbody>
</table>
ALU Does the Real Work in a Processor

<table>
<thead>
<tr>
<th>$F_{2:0}$</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>000</td>
<td>A &amp; B</td>
</tr>
<tr>
<td>001</td>
<td>A</td>
</tr>
<tr>
<td>010</td>
<td>A + B</td>
</tr>
<tr>
<td>011</td>
<td>not used</td>
</tr>
<tr>
<td>100</td>
<td>A &amp; ~B</td>
</tr>
<tr>
<td>101</td>
<td>A</td>
</tr>
<tr>
<td>110</td>
<td>A - B</td>
</tr>
<tr>
<td>111</td>
<td>SLT</td>
</tr>
</tbody>
</table>
**Review: ALU**

### ALU Function Table

<table>
<thead>
<tr>
<th>$F_{2:0}$</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>000</td>
<td>$A &amp; B$</td>
</tr>
<tr>
<td>001</td>
<td>$A \mid B$</td>
</tr>
<tr>
<td>010</td>
<td>$A + B$</td>
</tr>
<tr>
<td>011</td>
<td>not used</td>
</tr>
<tr>
<td>100</td>
<td>$A &amp; \neg B$</td>
</tr>
<tr>
<td>101</td>
<td>$A \mid \neg B$</td>
</tr>
<tr>
<td>110</td>
<td>$A - B$</td>
</tr>
<tr>
<td>111</td>
<td>SLT</td>
</tr>
</tbody>
</table>
## Control Unit: ALU Decoder

### ALUOp<sub>1:0</sub> Meaning

<table>
<thead>
<tr>
<th>ALUOp&lt;sub&gt;1:0&lt;/sub&gt;</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>Add</td>
</tr>
<tr>
<td>01</td>
<td>Subtract</td>
</tr>
<tr>
<td>10</td>
<td>Look at Funct</td>
</tr>
<tr>
<td>11</td>
<td>Not Used</td>
</tr>
</tbody>
</table>

### ALUOp<sub>1:0</sub> Funct ALUControl<sub>2:0</sub>

<table>
<thead>
<tr>
<th>ALUOp&lt;sub&gt;1:0&lt;/sub&gt;</th>
<th>Funct</th>
<th>ALUControl&lt;sub&gt;2:0&lt;/sub&gt;</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>X</td>
<td>010 (Add)</td>
</tr>
<tr>
<td>X1</td>
<td>X</td>
<td>110 (Subtract)</td>
</tr>
<tr>
<td>1X</td>
<td>100000 (add)</td>
<td>010 (Add)</td>
</tr>
<tr>
<td>1X</td>
<td>100010 (sub)</td>
<td>110 (Subtract)</td>
</tr>
<tr>
<td>1X</td>
<td>100100 (and)</td>
<td>000 (And)</td>
</tr>
<tr>
<td>1X</td>
<td>100101 (or)</td>
<td>001 (Or)</td>
</tr>
<tr>
<td>1X</td>
<td>101010 (slt)</td>
<td>111 (SLT)</td>
</tr>
</tbody>
</table>
# Control Unit: Main Decoder

<table>
<thead>
<tr>
<th>Instruction</th>
<th>(\text{Op}_{5:0})</th>
<th>(\text{RegWrite})</th>
<th>(\text{RegDst})</th>
<th>(\text{AluSrc})</th>
<th>(\text{Branch})</th>
<th>(\text{MemWrite})</th>
<th>(\text{MemtoReg})</th>
<th>(\text{ALUOp}_{1:0})</th>
</tr>
</thead>
<tbody>
<tr>
<td>R-type</td>
<td>000000</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>lw</td>
<td>100011</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>sw</td>
<td>101011</td>
<td>0</td>
<td>X</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>beq</td>
<td>000100</td>
<td>0</td>
<td>X</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>X</td>
</tr>
</tbody>
</table>
Single-Cycle Datapath Example: or
Extended Functionality: `addi`

- No change to datapath
## Control Unit: addi

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Op&lt;sub&gt;5:0&lt;/sub&gt;</th>
<th>RegWrite</th>
<th>RegDst</th>
<th>AluSrc</th>
<th>Branch</th>
<th>MemWrite</th>
<th>MemtoReg</th>
<th>ALUOp&lt;sub&gt;1:0&lt;/sub&gt;</th>
</tr>
</thead>
<tbody>
<tr>
<td>R-type</td>
<td>000000</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>10</td>
</tr>
<tr>
<td>lw</td>
<td>100011</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>00</td>
</tr>
<tr>
<td>sw</td>
<td>101011</td>
<td>0</td>
<td>X</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>X</td>
<td>00</td>
</tr>
<tr>
<td>beq</td>
<td>000100</td>
<td>0</td>
<td>X</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>X</td>
<td>01</td>
</tr>
<tr>
<td>addi</td>
<td>001000</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>00</td>
</tr>
</tbody>
</table>
Extended Functionality: }
# Control Unit: Main Decoder

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Op&lt;sub&gt;5:0&lt;/sub&gt;</th>
<th>RegWrite</th>
<th>RegDst</th>
<th>AluSrc</th>
<th>Branch</th>
<th>MemWrite</th>
<th>MemtoReg</th>
<th>ALUOp&lt;sub&gt;1:0&lt;/sub&gt;</th>
<th>Jump</th>
</tr>
</thead>
<tbody>
<tr>
<td>R-type</td>
<td>0000000</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>10</td>
<td>0</td>
</tr>
<tr>
<td>lw</td>
<td>100011</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>00</td>
<td>0</td>
</tr>
<tr>
<td>sw</td>
<td>101011</td>
<td>0</td>
<td>X</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>X</td>
<td>00</td>
<td>0</td>
</tr>
<tr>
<td>beq</td>
<td>000100</td>
<td>0</td>
<td>X</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>X</td>
<td>01</td>
<td>0</td>
</tr>
<tr>
<td>j</td>
<td>000100</td>
<td>0</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>XX</td>
<td>1</td>
</tr>
</tbody>
</table>
Processor Performance

- How fast is my program?
  - Every program consists of a series of instructions
  - Each instruction needs to be executed.
Processor Performance

- **How fast is my program?**
  - Every program consists of a series of instructions
  - Each instruction needs to be executed.

- **So how fast are my instructions?**
  - Instructions are realized on the hardware
  - They can take one or more clock cycles to complete
  - \textit{Cycles per Instruction} = \textit{CPI}
Processor Performance

■ How fast is my program?
  ▪ Every program consists of a series of instructions
  ▪ Each instruction needs to be executed.

■ So how fast are my instructions?
  ▪ Instructions are realized on the hardware
  ▪ They can take one or more clock cycles to complete
  ▪ \textit{Cycles per Instruction} = \textit{CPI}

■ How much time is one clock cycle?
  ▪ The critical path determines how much time one cycle requires = \textit{clock period}.
  ▪ \(1/\text{clock period} = \textit{clock frequency}\) = how many cycles can be done each second.
Processor Performance

Now as a general formula

- Our program consists of executing $N$ instructions.
- Our processor needs $CPI$ cycles for each instruction.
- The maximum clock speed of the processor is $f$, and the clock period is therefore $T = 1/f$
Now as a general formula

- Our program consists of executing $N$ instructions.
- Our processor needs $CPI$ cycles for each instruction.
- The maximum clock speed of the processor is $f$, and the clock period is therefore $T = 1/f$

Our program will execute in

$$N \times CPI \times (1/f) = N \times CPI \times T \text{ seconds}$$
How can I Make the Program Run Faster?

\[ N \times \text{CPI} \times (1/f) \]
How can I Make the Program Run Faster?

\[ N \times CPI \times (1/f) \]

- Reduce the number of instructions
  - Make instructions that ‘do’ more (CISC)
  - Use better compilers
How can I Make the Program Run Faster?

\[ N \times CPI \times (1/f) \]

- **Reduce the number of instructions**
  - Make instructions that ‘do’ more (CISC)
  - Use better compilers

- **Use less cycles to perform the instruction**
  - Simpler instructions (RISC)
  - Use multiple units/ALUs/cores in parallel
How can I Make the Program Run Faster?

\[ N \times CPI \times (1/f) \]

- **Reduce the number of instructions**
  - Make instructions that ‘do’ more (CISC)
  - Use better compilers

- **Use less cycles to perform the instruction**
  - Simpler instructions (RISC)
  - Use multiple units/ALUs/cores in parallel

- **Increase the clock frequency**
  - Find a ‘newer’ technology to manufacture
  - Redesign time critical components
  - Adopt pipelining
Single-Cycle Performance

- $T_C$ is limited by the critical path ($lw$)
Single-Cycle Performance

- Single-cycle critical path:
  - $T_c = t_{pcq\_PC} + t_{mem} + \max(t_{RFread}, t_{sext} + t_{mux}) + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup}$

- In most implementations, limiting paths are:
  - memory, ALU, register file.
  - $T_c = t_{pcq\_PC} + 2t_{mem} + t_{RFread} + t_{mux} + t_{ALU} + t_{RFsetup}$
## Single-Cycle Performance Example

<table>
<thead>
<tr>
<th>Element</th>
<th>Parameter</th>
<th>Delay (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Register clock-to-Q</td>
<td>$t_{\text{pcq_PC}}$</td>
<td>30</td>
</tr>
<tr>
<td>Register setup</td>
<td>$t_{\text{setup}}$</td>
<td>20</td>
</tr>
<tr>
<td>Multiplexer</td>
<td>$t_{\text{mux}}$</td>
<td>25</td>
</tr>
<tr>
<td>ALU</td>
<td>$t_{\text{ALU}}$</td>
<td>200</td>
</tr>
<tr>
<td>Memory read</td>
<td>$t_{\text{mem}}$</td>
<td>250</td>
</tr>
<tr>
<td>Register file read</td>
<td>$t_{\text{RFread}}$</td>
<td>150</td>
</tr>
<tr>
<td>Register file setup</td>
<td>$t_{\text{RFsetup}}$</td>
<td>20</td>
</tr>
</tbody>
</table>

$$T_c =$$
### Single-Cycle Performance Example

<table>
<thead>
<tr>
<th>Element</th>
<th>Parameter</th>
<th>Delay (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Register clock-to-Q</td>
<td>$t_{pcq_PC}$</td>
<td>30</td>
</tr>
<tr>
<td>Register setup</td>
<td>$t_{setup}$</td>
<td>20</td>
</tr>
<tr>
<td>Multiplexer</td>
<td>$t_{mux}$</td>
<td>25</td>
</tr>
<tr>
<td>ALU</td>
<td>$t_{ALU}$</td>
<td>200</td>
</tr>
<tr>
<td>Memory read</td>
<td>$t_{mem}$</td>
<td>250</td>
</tr>
<tr>
<td>Register file read</td>
<td>$t_{RF_read}$</td>
<td>150</td>
</tr>
<tr>
<td>Register file setup</td>
<td>$t_{RF_setup}$</td>
<td>20</td>
</tr>
</tbody>
</table>

\[
T_c = t_{pcq\_PC} + 2t_{mem} + t_{RF\_read} + t_{mux} + t_{ALU} + t_{RF\_setup}
\]

\[
= [30 + 2(250) + 150 + 25 + 200 + 20] \text{ ps}
\]

\[
= 925 \text{ ps}
\]
Example:

For a program with 100 billion instructions executing on a single-cycle MIPS processor:
Single-Cycle Performance Example

Example:
For a program with 100 billion instructions executing on a single-cycle MIPS processor:

\[
\text{Execution Time} = \# \text{ instructions} \times \text{CPI} \times \text{TC} \\
= (100 \times 10^9)(1)(925 \times 10^{-12} \text{ s}) \\
= 92.5 \text{ seconds}
\]
What Did We Learn?

- How to determine the performance of processors
  - CPI
  - Clock speed
  - Number of instructions

- Single-cycle MIPS architecture
  - Learned individual components
    - ALU
    - Data Memory
    - Instruction Memory
    - Register File
  - Discussed implementation of some instructions
  - Explained the control flow