## The Future of Microprocessor Architecture

### Donald Alpert, Stanford University, Intel Corporation




## Outline

- Where are we?
- How did we get here?
- Where are we going?

Tokyo, April 15, 1998

# Today: Alpha 21264

- 64-bit Address/Data
- Superscalar
- Out-of-Order Execution
- 256 TLB entries
- 128KB Cache
- Adaptive Branch Prediction
- 0.35 μm CMOS Process
- 15.2M Transistors
- 600 MHz



Source: Digital

**COOL** Chips I

# History

### Technology

### Functionality

### Partitioning


# In the Beginning: Intel 4004

- 4-bit Data
- 12-bit Address
- 8 μm PMOS
- 2300 Transistors
- 750 KHz
- 1971



Source: Intel

# Lithography



Source: A. Yu, IEEE Micro 12/96

### **Die Size**




Source: Intel

### **Transistor Count**





Source: A. Yu, IEEE Micro 12/96

# Intel 8080

- 8-bit Data
- 16-bit Address
- 6 μm NMOS
- 6K Transistors
- 2 MHz
- 1974

Source: Intel


#### Issues

- Segmented vs. Linear Memory Addresses
- Registers
- Addressing Modes
- Floating-Point



#### Motorola 68000

Photograph: Computer Museum


|                 | Intel<br>8086 | Zilog<br>Z8000 | Motorola<br>68000 |
|-----------------|---------------|----------------|-------------------|
| Integer Path    | 16-bit        | 16-bit         | 16-bit            |
| Floating-Point  | 8087          | No             | No                |
| Addresses       | Segment (16)  | Segment (16)   | Linear (24)       |
| OS Protection   | No            | Yes            | Yes               |
| Memory Mgt.     | No            | Segmented      | No                |
| Cache           | No            | No             | No                |
| Technology      | 3 μm NMOS     | 4-6(?) μm NMOS | 4 μm NMOS         |
| No. Transistors | 29K           | 17.5K          | 68K               |
| Frequency       | 5 MHz         | 4 MHz          | 8 MHz             |
| Year            | 1978          | 1979           | 1979              |


#### Issues

- Cache
- TLB
- RISC vs. CISC





#### Intel386 CPU

Source: Intel


|                 | Intel<br>80386  | Motorola<br>68020 | MIPS<br>R2000 |
|-----------------|-----------------|-------------------|---------------|
| Integer Path    | 32-bit          | 32-bit            | 32-bit        |
| Floating-Point  | 80387           | 68881             | R2010         |
| Addresses       | Seg/Linear (32) | Linear (32)       | Linear (32)   |
| OS Protection   | Yes             | Yes               | Yes           |
| Memory Mgt.     | 32-entry TLB    | 68851             | 64-entry TLB  |
| Cache           | 82385           | 256B              | Controller    |
| Technology      | 1.5 μm CMOS     | 2 μm CMOS         | 2 μm CMOS     |
| No. Transistors | 275K            | 200K              | 100K          |
| Frequency       | 16 MHz          | 16 MHz            | 16.7 MHz      |
| Year            | 1985            | 1984              | 1986          |


## **Baseline Microprocessor**

#### Full Functionality

- 32-bit Integer
- 64-bit Floating-Point
- Paged Virtual Memory (TLB)

#### Performance

- Full-Width Datapaths
- Pipelined Function Units
- 8-16KB Cache

#### Technology

- ~1.0 μm CMOS
- ~1M Transistors





Source: SGI MIPS

### **Since Baseline Microprocessor**

#### Technology

- 1.0 μm → 0.25 μm
- 1M Tx → 10M Tx

#### Addresses/Integers

- 32b → 64b

#### Superscalar

- In-Order Execution
- Out-of-Order Execution
- Branch Prediction
- Cache

- Underestimate Technology Improvement Rate
- Underestimate Complexity
- Underestimate Software Development Effort
- Underestimate Market Size



### Where Are We Going?

#### What we know

#### What we know that we don't know

#### What we don't know that we don't know


# Semiconductor Technology Roadmap

|                                  | 1997    | 1999    | 2001    | 2003    | 2006    | 2009    | 2012    |
|----------------------------------|---------|---------|---------|---------|---------|---------|---------|
| Lithography (nm)                 | 250     | 180     | 150     | 130     | 100     | 70      | 50      |
| Die Size (mm²)                   | 300     | 340     | 385     | 430     | 520     | 620     | 750     |
| Transistors (M)                  | 11      | 21      | 40      | 76      | 200     | 520     | 1400    |
| Local Clock Frequency (MHz)      | 750     | 1250    | 1500    | 2100    | 3500    | 6000    | 10000   |
| Cross-Chip Clock Frequency (MHz) | 750     | 1200    | 1400    | 1600    | 2000    | 2500    | 3000    |
| Power (W)                        | 70      | 90      | 110     | 130     | 160     | 170     | 175     |
| Voltage (V)                      | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6 |
| I/O Pins                         | 1450    | 2000    | 2400    | 3000    | 4000    | 5400    | 7300    |
| Wiring Levels                    | 6       | 6-7     | 7       | 7       | 7-8     | 8-9     | 9       |


Source: Semiconductor Industry Association
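The roadmap endpoints imply compound growth rates that are easy to check. The following is a minimal sketch using only the 1997 and 2012 columns of the roadmap table; the names `ROADMAP`, `annual_growth`, and `doubling_time_years` are illustrative and do not come from the talk.

```python
import math

# Endpoints copied from the 1997 and 2012 columns of the roadmap table.
ROADMAP = {
    "transistors_M": (11, 1400),
    "local_clock_MHz": (750, 10000),
    "cross_chip_clock_MHz": (750, 3000),
    "power_W": (70, 175),
}
YEARS = 2012 - 1997


def annual_growth(start, end, years=YEARS):
    """Compound annual growth rate between two roadmap endpoints."""
    return (end / start) ** (1.0 / years) - 1.0


def doubling_time_years(start, end, years=YEARS):
    """Years needed to double at the implied compound rate."""
    return years * math.log(2) / math.log(end / start)


if __name__ == "__main__":
    for name, (start, end) in ROADMAP.items():
        rate = annual_growth(start, end)
        dbl = doubling_time_years(start, end)
        print(f"{name:22s} {rate:6.1%} per year, doubles every {dbl:.1f} years")
```

Under these assumptions the transistor count grows about 38% per year (doubling roughly every 2.1 years), while the cross-chip clock grows under 10% per year, which is the gap the interconnect slides address.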

# IA-64 and Merced<sup>™</sup> CPU

#### IA-64

- Joint 64-bit architecture definition by Intel and H-P
- Explicitly Parallel Instruction Computing (EPIC)
  - Encode independent instructions
  - 128 registers
  - Predication
  - Speculation

#### Merced CPU

- First IA-64 implementation
- 0.18 µm technology
- 1999 Production

## **Fabrication Facility Costs**



Moore's Second Law: Fab Costs Grow 40% Per Year


Source: A. Yu, IEEE Micro 12/96
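The 40% figure can be turned into a doubling time and a multi-year multiplier. This is a small arithmetic sketch; the only input taken from the slide is the 40% annual growth rate, and the function names are hypothetical.

```python
import math

FAB_COST_GROWTH = 0.40  # 40% per year, as stated on the slide


def cost_multiplier(years, growth=FAB_COST_GROWTH):
    """Factor by which fab cost grows after `years` of compound growth."""
    return (1.0 + growth) ** years


def doubling_time(growth=FAB_COST_GROWTH):
    """Years for fab cost to double at the given annual growth rate."""
    return math.log(2) / math.log(1.0 + growth)


if __name__ == "__main__":
    print(f"Doubling time: {doubling_time():.2f} years")
    print(f"Growth over 10 years: {cost_multiplier(10):.1f}x")
```

At 40% per year, cost doubles about every two years and grows nearly 29-fold over a decade.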

## **Known Challenges**

- Interconnect
- Power
- Reliability
- Verification
- Mixed-Signal


# Wire Delay Is Increasing

- Gate delay decreasing 25% per generation
- Wire delay increasing 100% per generation
- Communicate across a chip
  - 1 clock at 400 MHz in 0.35 μm
  - 12.4 clocks at 1 GHz in 0.1 μm
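To compare the two cross-chip figures directly, it helps to convert clock counts to absolute time. The sketch below takes the slide's numbers as given (it does not attempt to derive 12.4 clocks from the per-generation rates); the function names are illustrative.

```python
GATE_DELAY_CHANGE = -0.25  # gate delay decreases 25% per generation (slide)
WIRE_DELAY_CHANGE = 1.00   # wire delay increases 100% per generation (slide)


def cycles_to_ns(cycles, freq_mhz):
    """Convert a clock count at a given frequency to nanoseconds."""
    return cycles * 1000.0 / freq_mhz


def cross_chip_slowdown():
    """Ratio of cross-chip time at 0.1 um to cross-chip time at 0.35 um."""
    t_035 = cycles_to_ns(1, 400)      # 1 clock at 400 MHz
    t_010 = cycles_to_ns(12.4, 1000)  # 12.4 clocks at 1 GHz
    return t_010 / t_035


def wire_to_gate_growth(generations):
    """Growth of the wire-delay / gate-delay ratio after some generations."""
    per_generation = (1.0 + WIRE_DELAY_CHANGE) / (1.0 + GATE_DELAY_CHANGE)
    return per_generation ** generations


if __name__ == "__main__":
    print(f"0.35 um: {cycles_to_ns(1, 400):.1f} ns across the chip")
    print(f"0.10 um: {cycles_to_ns(12.4, 1000):.1f} ns across the chip")
    print(f"Absolute slowdown: {cross_chip_slowdown():.2f}x")
    print(f"Wire/gate ratio per generation: {wire_to_gate_growth(1):.2f}x")
```

In absolute terms the cross-chip trip goes from 2.5 ns to 12.4 ns, about 5 times slower, even though the clock itself is 2.5 times faster.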




# **Off-Chip Data Bandwidth Is Scaling**

- Achievable bit times scale with circuit speed
- Transceiver fits in the area of a (large) pad driver
- Still may need to increase number of I/O signals each generation to match logic integration




Source: M. Horowitz, IEEE Micro 1/98 and B. Dally


## **Scalable Architecture for ULSI**

#### Processor

- Core cluster of computational units and registers
- Memory
- Inter-processor communication unit

#### Technology Properties

- Local interconnect for highest-frequency cluster
- Shrink and replicate processors for higher integration

#### Programming Properties

- Replicate chips of multiprocessors for higher performance
- Consistent latencies in clocks across generations

### **Scalable Architecture for ULSI**



Board-Level Multiprocessor


### **Microprocessor Architecture Research**

- Wave Pipelining
- Multithreaded Processors
- Single-Chip Multiprocessors
- Vector/Stream Processors
- Intelligent RAM
- Reconfigurable Computing

# Wave Pipelining

#### Sub-Nanosecond Arithmetic Processor (SNAP)

- Prof. Mike Flynn at Stanford

#### Wave Pipelining

- Uses minimum propagation delay (T<sub>min</sub>) to store data in combinational logic paths
- Conventional pipeline limited by maximum delay path (T<sub>max</sub>)
- Wave pipeline limited by difference in delay (T<sub>max</sub> - T<sub>min</sub>)
- Potential 2-3X performance improvement in CMOS with comparable cost to conventional pipelining
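The two clock-period limits above can be compared numerically. This is a simplified sketch, not the SNAP timing model: the `overhead` parameter (register setup, clock skew) is an assumption added for illustration, and the delay values in the example are hypothetical.

```python
def conventional_period(t_max, overhead=0.0):
    """Minimum clock period when limited by the longest path (T_max)."""
    return t_max + overhead


def wave_period(t_max, t_min, overhead=0.0):
    """Minimum clock period when limited by the delay spread (T_max - T_min)."""
    if not 0.0 <= t_min <= t_max:
        raise ValueError("require 0 <= t_min <= t_max")
    return (t_max - t_min) + overhead


def wave_speedup(t_max, t_min, overhead=0.0):
    """Clock-rate gain of wave pipelining over a conventional pipeline stage."""
    period = wave_period(t_max, t_min, overhead)
    if period <= 0.0:
        raise ValueError("delay spread and overhead cannot both be zero")
    return conventional_period(t_max, overhead) / period


if __name__ == "__main__":
    # Hypothetical stage: 10 ns longest path, 6 ns shortest path, 1 ns overhead.
    print(f"Speedup: {wave_speedup(10.0, 6.0, 1.0):.2f}x")
```

With no overhead, a 2X gain requires T<sub>min</sub> to be half of T<sub>max</sub> and a 3X gain requires two thirds, which is why balancing path delays is the central design problem.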



#### CMOS Wave-Pipelined Vector Unit


### **Multithreaded Processors**

### Simultaneous Multithreading (SMT) Processors

- Prof. Susan Eggers et al. at University of Washington
- Targets fine-grain multithreaded applications/workloads

### Based on a Dynamic, Superscalar Processor

- Add IDs for multiple (8) threads to registers/structures
- Function units are scheduled dynamically with data-ready instructions from multiple threads

### Multi-ported Instruction Cache

- Fetch from two threads simultaneously
- Priority to threads with fewest instructions in pipe
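The fetch priority rule can be sketched as a selection over per-thread instruction counts. This is an illustrative model only, not the University of Washington design: the function name, the dictionary representation, and the tie-break on thread ID are all assumptions.

```python
def select_fetch_threads(in_flight, fetch_width=2):
    """Pick the threads to fetch from this cycle.

    in_flight maps thread ID -> number of instructions currently in the pipe.
    Threads with the fewest in-flight instructions get priority; ties are
    broken by lower thread ID (an arbitrary choice for this sketch).
    """
    ranked = sorted(in_flight, key=lambda tid: (in_flight[tid], tid))
    return ranked[:fetch_width]


if __name__ == "__main__":
    # Hypothetical snapshot of four hardware threads.
    counts = {0: 12, 1: 3, 2: 7, 3: 3}
    print(select_fetch_threads(counts))
```

Favoring threads with few instructions in the pipe steers fetch bandwidth away from threads that are already backed up.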

### Potential 2X Performance Improvement

- For incremental cost vs. conventional superscalar

# **Single-Chip Multiprocessors**

#### Hydra Project

- Prof. Kunle Olukotun at Stanford
- Targets thread-level parallelism
- 4 CPUs on a Chip
- 3-Level Cache Hierarchy
- Parallelizing Compiler Technology




## **Vector/Stream Processors**

### Imagine Project

- Prof. Bill Dally at Stanford
- Targets Graphics and Signal Processing

### Arithmetic Clusters (8)

- Multiple interconnected ALUs
- Local registers
- Statically scheduled operations and bus usage

### Memory Streams

- Arrays of multimedia structures
- Multiple SDRAM banks
- Vector register file
  - 16K words
  - 18 streams



# Intelligent RAM

#### **IRAM**

- Prof. Dave Patterson at U.C. Berkeley
- Targets latency/bandwidth gap between CPU and DRAM
- Integrate DRAM with Conventional CPU
- **Specialized Processor Exploits On-chip DRAM Bandwidth**



# **Reconfigurable Computing**

### Adaptive Computing Systems

- DARPA program
- Targets FPGAs for high-performance programmable HW

### Improved Performance Over Programming SW

- 10X over DSP
- 100X over general-purpose microprocessor

### Cost 2X Over ASIC at Comparable Performance

### Potential Applications

- Pattern matching (image recognition)
- Encryption
- Signal processing

# **Predictions for Cool Chips X**

### No "Cool" Technology

- But power managed at all design levels

### Systems on Chips

- Integrated application solutions
- Majority of transistors for memory
- Multiple, heterogeneous processors
- Mixed-signal applications
- On-chip bus standards
- On-chip interconnection networks

### Unlikely

- Optical interconnect
- Reconfigurable computing