National Semiconductor NS32532
National Semiconductor NS32764 (Swordfish)
Intel
Itanium Processor (Merced)
Intel 815 Chipset (Solano)
Other Research and Publications
Z80,000 Microprocessor
Zilog
1986
I began working at Zilog in June 1980. I had hoped to get a job designing "real" computers at Amdahl, but they were not hiring. By chance, I had met Ross Freeman, who was Zilog's Director of R&D, at a Peace Corps event in San Francisco. With Ross's introduction, I was hired to work for John Banning as an architect of the Z80,000 microprocessor. The Z80,000 was a 32-bit extension to the 16-bit Z8000 that Zilog introduced in 1979. [CHM 2007] Bernard Peuto, who had been the architect for the Z8000, was Zilog's VP of Engineering at that time.
My work on the Z80,000 was an extraordinary introduction to VLSI technology. I became fascinated by the range of cost, performance, and power tradeoffs for custom circuit design. In particular, I became interested in the tradeoffs among different types of memory (ROM, RAM, CAM) and their relation to computer architecture. This interest followed through to my Ph.D. research. [Alpert 1984] The project was also an opportunity to apply what I had studied at Stanford about mainframes and minicomputers to microprocessors, especially in techniques for instruction pipelines, cache memory, and performance evaluation.

91K transistors, 10 mm x 10 mm, 2.0 µm single-metal nMOS process

[Zilog 1984, p. E-2]
32-Bit Integer Pipeline

[Alpert 1983b, p. 117]
Paged Memory Management

[Alpert 1983b, p. 116]
256B Sectored, Unified Instruction/Data Cache

[Zilog 1984, p. C-1]
[Alpert 1983a] Donald Alpert, “Powerful 32-Bit Micro Includes Memory Management,” Computer Design, October 1983, pp. 213-220.
[Alpert 1983b] Don Alpert, Dean Carberry, Mike Yamamura, et al., “32-Bit Processor Chip Integrates Major System Functions,” Electronics, July 14, 1983, pp. 113-119.
[Alpert 1984] Donald Alpert, "Memory hierarchies for directly executed language microprocessors," Technical Report 84-260, Computer Systems Laboratory, Stanford University, Stanford, California 94305, June 1984.
[CHM 2007] “Oral History Panel on the Development and Promotion of the Zilog Z8000 Microprocessor,” Computer History Museum, April 27, 2007.
[Zilog 1984] "Z80,000 CPU Preliminary Technical Manual," September 1984, Zilog, Inc.
NS32532 Microprocessor
National Semiconductor
1987
I joined National Semiconductor’s design center in Herzlia, Israel, in May 1985, working for Uri Weiser. NSC had introduced the NS16000 family of microprocessors in 1981, including the NS16032 CPU, NS16082 MMU, and NS16081 FPU, to implement a complete 32-bit architecture supporting 32-bit integers, IEEE double-precision floating-point data, and paged memory management. At the time, other merchant semiconductor vendors offered chips that implemented only a subset of these functions. [Bal] Consequently, the NSC architecture was adopted by a number of computer system companies for high-end workstations and minicomputer-class multiprocessors. Unfortunately, the design and validation techniques used by NSC and other chip makers were inadequate to debug a complete computer system. For the next-generation NS32132 CPU, NSC had developed much more effective validation techniques to create a very reliable product, but the market opportunity for workstations had passed. The NS32000 line still had design wins for embedded applications, in particular for laser printers that required substantial software to interpret PostScript files.

370,000 transistors, 11.5 mm x 14 mm, 1.25 µm double-metal CMOS technology

[NS32532DS, p. 1]
Hardware Cache Coherence
“The microprocessor's hardware-based coherence mechanisms include a bus of eight pins that controls total or partial invalidation of the on-chip instruction and data caches. Figure 4 shows the organization of the two-way set-associative data cache and its connection with the invalidation bus. Each of the cache lines within a set has an address tag, 16 bytes of data, and four dual-ported validity bits. Both lines of a data cache set can be invalidated using the invalidation bus. Because the validity bits are dual-ported, invalidation of the on-chip caches occurs without interfering with ongoing cache accesses or bus transactions.”
[Maytal, p. 73]
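As a rough illustration of the data cache organization quoted above, the following Python sketch models a two-way set-associative cache whose lines each hold a tag, 16 bytes of data, and four validity bits, and in which invalidating a set clears both of its lines. It is a behavioral model only, not the NS32532 logic. The number of sets, the mapping of one validity bit to each 4-byte chunk, and all class and method names (`DataCacheModel`, `fill`, `hit`, `invalidate_set`, `invalidate_all`) are assumptions made for the example; the encoding of the eight-pin invalidation bus is not modeled.

```python
from dataclasses import dataclass, field

WAYS = 2                          # two-way set-associative (from the quoted text)
LINE_BYTES = 16                   # bytes of data per line (from the quoted text)
VALID_BITS = 4                    # validity bits per line (from the quoted text)
CHUNK = LINE_BYTES // VALID_BITS  # assumption: each validity bit covers 4 bytes
NUM_SETS = 32                     # assumption, chosen only for illustration


@dataclass
class CacheLine:
    tag: int = 0
    data: bytearray = field(default_factory=lambda: bytearray(LINE_BYTES))
    valid: list = field(default_factory=lambda: [False] * VALID_BITS)


class DataCacheModel:
    """Hypothetical behavioral model; names are not taken from the source."""

    def __init__(self, num_sets: int = NUM_SETS):
        self.sets = [[CacheLine() for _ in range(WAYS)] for _ in range(num_sets)]

    def split(self, addr: int):
        """Return (tag, set index, chunk number) for a byte address."""
        line_number, offset = divmod(addr, LINE_BYTES)
        tag, index = divmod(line_number, len(self.sets))
        return tag, index, offset // CHUNK

    def fill(self, addr: int, way: int, chunk_bytes: bytes) -> None:
        """Install one chunk in the given way and set only its validity bit."""
        if len(chunk_bytes) != CHUNK:
            raise ValueError("chunk_bytes must be exactly CHUNK bytes long")
        tag, index, chunk = self.split(addr)
        line = self.sets[index][way]
        if line.tag != tag:
            # A new tag replaces the line, so stale chunks become invalid.
            line.tag = tag
            line.valid = [False] * VALID_BITS
        line.data[chunk * CHUNK:(chunk + 1) * CHUNK] = chunk_bytes
        line.valid[chunk] = True

    def hit(self, addr: int) -> bool:
        """True if either way holds a valid chunk with a matching tag."""
        tag, index, chunk = self.split(addr)
        return any(l.tag == tag and l.valid[chunk] for l in self.sets[index])

    def invalidate_set(self, index: int) -> None:
        """Partial invalidation: clear the validity bits of both lines in a set."""
        for line in self.sets[index]:
            line.valid = [False] * VALID_BITS

    def invalidate_all(self) -> None:
        """Total invalidation: clear every validity bit in the cache."""
        for index in range(len(self.sets)):
            self.invalidate_set(index)
```

In this model an invalidation touches only the validity bits and leaves tags and data alone, which is the property that lets the real design, with dual-ported validity bits, invalidate without disturbing ongoing accesses.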
Serialize Memory-Mapped I/O

[U.S. Patent 4,802,085]
[Bal] S. Bal, A. Kaminker, Y. Lavi, A. Menachem, and Z. Soha. 1982. The NS16000 Family-Advances in Architecture and Hardware. Computer 15, 6 (June 1982), 58-67. DOI=10.1109/MC.1982.1654051 http://dx.doi.org/10.1109/MC.1982.1654051
[Maytal] Benjamin Maytal, Sorin Iacobovici, Donald Alpert, et al., “Design Considerations for a General-Purpose Microprocessor,” Computer, January 1989, pp. 66-76. http://dx.doi.org/10.1109/2.19824
[Alpert ICCD] D. Alpert, J. Levy, and B. Maytal, "Architecture of the NS32532 Microprocessor," Proceedings ICCD, October 1987, pp. 168-172.
[Alpert CompEuro] D. Alpert, D. Biran, L. Epstein, et al., “Trends in VLSI Microprocessor Design,” Proceedings, First Annual Conference on Computer Technology, Systems and Applications (CompEuro ‘87), May 1987, pp. 564-567.
[NS32532DS] NS32532-20/NS32532-25/NS32532-30 High-Performance 32-Bit Microprocessor Datasheet, National Semiconductor Corporation, May 1991.
U.S. Patent 4,802,085, Apparatus and method for detecting and handling memory-mapped I/O by a pipelined microprocessor.
[CHM 2008] “National Semiconductor 32000 Microprocessor Oral History Panel,” Computer History Museum, February 26, 2008. http://archive.computerhistory.org/resources/access/text/Oral_History/102658246.05.01.acc.pdf
Swordfish Microprocessor
National Semiconductor
1991
DIE PHOTO:

The die is 13 mm x 13 mm, using a 0.8 µm double-metal CMOS process.
[Talmudi 1991, p. 100]

[U.S. Patent 5,481,751, Fig. 1]
Superscalar Execution
“We were trying to figure out how to get parallelism out of multiple functional units, and adopted a microarchitecture that was like VLIW: each FU was assigned to fixed slots in a 2-wide instruction word fetched from the cache. We had the HW detect dependencies as instructions were placed in the cache slots, so it was a superscalar architecture with a VLIW machine organization. To improve icache efficiency we allowed dependent instructions to be packed together with a bit per pair of instructions that indicated whether or not they were dependent. Independent instructions could be executed in parallel, dependent instructions had to be executed sequentially, but still on the pipeline assigned to that slot. Just about the only wasted cache slots were for FP instructions that could not be paired with a load or integer op.
... Overall it was a very efficient architecture. With little extra cost for a second integer pipe and a simple control structure, it was possible to derive a lot of parallelism on many embedded loops.”
[Smotherman Swordfish]

As shown in FIG. 3, each instruction cache entry includes two slots, i.e. Slot A and Slot B. Thus, each entry can contain one or two partially-decoded instructions that are represented with fixed fields for opcode (Opc), source and destination register numbers (R1 and R2, respectively), and immediate values (32b IMM). The entry also includes auxiliary information used to control the sequence of instruction execution, including a bit P that indicates whether the entry contains two consecutive instructions that can be executed in parallel and a bit G that indicates whether the entry is for a complex instruction that is emulated, and additional information representing the length of the instruction(s) in a form that allows fast calculation of the next instruction's address.
[U.S. Patent 5,481,751, Fig. 3, 6:1-14]
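The entry format and pairing rule described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the field widths are ignored, the length information is reduced to a single byte count (`length`), the emulation path selected by the G bit is not modeled, and the names `Slot`, `ICacheEntry`, `issue_schedule`, and `next_address` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    opc: str      # partially decoded opcode (Opc)
    r1: int       # source register number (R1)
    r2: int       # destination register number (R2)
    imm: int = 0  # 32-bit immediate value (IMM)


@dataclass
class ICacheEntry:
    slot_a: Optional[Slot]
    slot_b: Optional[Slot]
    p: bool       # P bit: the two instructions can execute in parallel
    g: bool       # G bit: complex instruction that is emulated (not modeled here)
    length: int   # assumption: total byte length of the instruction(s) in the entry


def issue_schedule(entry: ICacheEntry) -> List[List[str]]:
    """Return the slots issued in each step, in program order.

    With P set and both slots filled, A and B issue together. Otherwise
    each instruction issues alone, still on the pipeline tied to its slot.
    """
    filled = [name for name, slot in (("A", entry.slot_a), ("B", entry.slot_b))
              if slot is not None]
    if entry.p and len(filled) == 2:
        return [filled]
    return [[name] for name in filled]


def next_address(pc: int, entry: ICacheEntry) -> int:
    """Address of the instruction that follows this entry."""
    return pc + entry.length


if __name__ == "__main__":
    independent = ICacheEntry(Slot("add", 1, 2), Slot("load", 3, 4), p=True, g=False, length=6)
    dependent = ICacheEntry(Slot("add", 1, 2), Slot("sub", 2, 5), p=False, g=False, length=4)
    print(issue_schedule(independent))  # both slots issue in one step
    print(issue_schedule(dependent))    # slot A issues first, then slot B
```

Because the dependency check is recorded once, when the entry is written into the cache, the issue logic at fetch time only has to read the P bit.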
Decoded Instruction Cache

[U.S. Patent 5,481,751 Abstract]
Debugging Support

[Intrater 1991, p. 269]
[Danieli 1990] D. Alpert, A. Averbuch, and O. Danieli. "Performance comparison of load/store and symmetric instruction set architectures." In Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA '90). ACM, New York, NY, USA, 172-181. http://doi.acm.org/10.1145/325164.325137
[Talmudi 1991] Ran Talmudi, et al., "A 100MIPS, 64b superscalar microprocessor with DSP enhancements," IEEE Intl. Solid-State Circuits Conference, San Francisco, Feb. 1991, pp. 100-101. http://dx.doi.org/10.1109/ISSCC.1991.689081
[Intrater 1991] G. Intrater and R. Talmudi, "A superscalar microprocessor," Proceedings of the 17th Convention of Electrical and Electronics Engineers in Israel, 5-7 March 1991, pp. 267-270. http://dx.doi.org/10.1109/EEIS.1991.217646
[Intrater 1994] G. D. Intrater and I. Y. Spillinger. 1994. Performance Evaluation of a Decoded Instruction Cache for Variable Instruction Length Computers. IEEE Trans. Comput. 43, 10 (October 1994), http://dx.doi.org/10.1109/12.324540
Swordfish Microprocessor Architecture Specification, Revision 2.0, February 1990. Appears as Appendix A to U.S. Patent 5,249,286.
[Smotherman Swordfish] http://www.cs.clemson.edu/~mark/swordfish.html
[CHM 2008] “National Semiconductor 32000 Microprocessor Oral History Panel,” Computer History Museum, February 26, 2008. http://archive.computerhistory.org/resources/access/text/Oral_History/102658246.05.01.acc.pdf
U.S. Patent 5,249,286, Selectively locking memory locations within a microprocessor's on-chip cache
U.S. Patent 5,481,751, Apparatus and method for storing partially-decoded instructions in the instruction cache of a CPU having multiple execution units
U.S. Patent 5,669,011, Partially decoded instruction cache
After my experience designing three microprocessors in the 1980s, I was satisfied that the designs had received some technical recognition, but disappointed that none of the hard work had led to commercial success. With NSC I had the opportunity to visit a number of computer system companies to promote the Swordfish product during its development. What I heard consistently was that they liked the technology of maintaining architecture compatibility while using semiconductor improvements to deliver high performance. The system vendors were frustrated that both Intel (with the x86 and 860 product lines) and Motorola (with the 68k and 88K product lines) were forcing their customers to choose between CISC compatibility and higher-performance RISC architectures. They told me that if they had to choose a new architecture, the most important criterion was the installed base of application software. I listened, and I knew that the IBM PC had given Intel's x86 architecture a growing lead in application software, so in January 1989 I took the opportunity to join Intel as the first engineer working on what became the P5 project, which developed the first Pentium™ microprocessor product.
At that time, Intel was only a month away from completing the 486 design, which could execute basic integer instructions with 1-clock throughput and double-precision floating-point instructions in about 20 clocks. Several areas were identified for improving the performance of the next microprocessor, including techniques aimed specifically at limitations of the x86 architecture compared with the newer RISC architectures.