In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch) is a description of the electrical circuitry of a computer, central processing unit, or digital signal processor that is sufficient for completely describing the operation of the hardware.
In academic circles, the term computer organization is used, while in the computer industry, the term microarchitecture is more often used. Microarchitecture and instruction set architecture (ISA) together constitute the field of computer architecture.
Since the 1950s, many computers used microprogramming to implement their control logic which decoded the program instructions and executed them. The bits within the microprogram words controlled the processor at the level of electrical signals.
The term microarchitecture was used to describe the units that were controlled by the microprogram words, as opposed to architecture that was visible and documented for programmers. While architecture usually had to be compatible between hardware generations, the underlying microarchitecture could be easily changed.
The microarchitecture is related to, but not the same as, the instruction set architecture. The instruction set architecture is roughly the same as the programming model of a processor as seen by an assembly language programmer or compiler writer, which includes the execution model, processor registers, address and data formats etc. The microarchitecture (or computer organization) is mainly a lower level structure and therefore governs a large number of details that are hidden in the programming model. It describes the constituent parts of the processor and how they are interconnected and interoperate in order to implement the architectural specification. [1][2][3]
The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various microarchitectural elements of the machine. The actual electronic circuitry that implements these elements is called the implementation of that microarchitecture (which also includes layout, packaging, and other physical details). Microarchitectural elements may be everything from single gates, via registers, LUTs, multiplexers, counters, etc., to complete ALUs and even larger elements. The electronic circuitry level can, in turn, be subdivided into transistor-level details, such as which basic gate-building structures are used and what logic implementation types (static/dynamic, number of phases, etc.) are chosen, in addition to the actual logic design built on the foundation of these choices, mainly at the gate level and up.
A very simplified high level description, which is common in marketing, may simply show characteristics such as bus-widths, the number of (so called) execution units and their types, along with blocks such as branch prediction, cache memories etc. Some details regarding pipeline structure (like fetch, decode, assign, execute, write-back) may also be included.
A few important points:
The pipelined datapath is the most commonly used datapath design in microarchitecture today. This technique is used in most modern microprocessors, microcontrollers, and DSPs. The pipelined architecture allows multiple instructions to overlap in execution, much like an assembly line. The pipeline includes several different stages which are fundamental in microarchitecture designs. [4] Some of these stages include instruction fetch, instruction decode, execute, and write back. Some architectures include other stages such as memory access. The design of pipelines is one of the central microarchitectural tasks.
Execution units are also essential to microarchitecture. Execution units include arithmetic logic units (ALU), floating point units (FPU), load/store units, branch prediction, and SIMD. These units perform the operations or calculations of the processor. The choice of the number of execution units, their latency and throughput is a central microarchitectural design task. The size, latency, throughput and connectivity of memories within the system are also microarchitectural decisions.
System-level design decisions such as whether or not to include peripherals, such as memory controllers, can be considered part of the microarchitectural design process. This includes decisions on the performance-level and connectivity of these peripherals.
Unlike architectural design, where achieving a specific performance level is the main goal, microarchitectural design pays closer attention to other constraints. Since microarchitecture design decisions directly affect what goes into a system, attention must be paid to such issues as:
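In general, all CPUs, single-chip microprocessors or multi-chip implementations run programs by performing the following steps:
1. Read an instruction and decode it
2. Find any associated data that is needed to process the instruction
3. Process the instruction
4. Write the results out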
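As a rough illustration of this cycle, the sketch below runs a toy program on a made-up three-operand machine. The instruction format, the opcode names and the `run` function are assumptions invented for the example; they do not correspond to any real instruction set.

```python
# Hypothetical toy machine: each instruction is (opcode, dest, src_a, src_b).
def run(program, registers):
    """Execute the program one instruction at a time, with no overlap."""
    pc = 0  # program counter
    while pc < len(program):
        # Step 1: read the instruction and decode it.
        opcode, dest, src_a, src_b = program[pc]
        # Step 2: find the data the instruction needs.
        a, b = registers[src_a], registers[src_b]
        # Step 3: process the instruction.
        if opcode == "add":
            result = a + b
        elif opcode == "sub":
            result = a - b
        elif opcode == "mul":
            result = a * b
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        # Step 4: write the result out.
        registers[dest] = result
        pc += 1
    return registers


if __name__ == "__main__":
    regs = {"r0": 2, "r1": 3, "r2": 0, "r3": 0}
    prog = [("add", "r2", "r0", "r1"), ("mul", "r3", "r2", "r2")]
    print(run(prog, regs))
```

Each pass through the loop finishes all four steps before the next instruction starts, which is the behaviour that later techniques such as pipelining try to improve on.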
Complicating this simple-looking series of steps is the fact that the memory hierarchy, which includes caching, main memory and non-volatile storage like hard disks (where the program instructions and data reside), has always been slower than the processor itself. Step (2) often introduces a lengthy (in CPU terms) delay while the data arrives over the computer bus. A considerable amount of research has been put into designs that avoid these delays as much as possible. Over the years, a central goal was to execute more instructions in parallel, thus increasing the effective execution speed of a program. These efforts introduced complicated logic and circuit structures. Initially these techniques could only be implemented on expensive mainframes or supercomputers due to the amount of circuitry they needed. As semiconductor manufacturing progressed, more and more of these techniques could be implemented on a single semiconductor chip.
See Article Central Processing Unit for a more detailed discussion on operation basics.
See Article History of general purpose CPUs for a more detailed discussion on the development history of CPUs.
What follows is a survey of micro-architectural techniques that are common in modern CPUs.
The choice of which Instruction Set Architecture to use greatly affects the complexity of implementing high performance devices. Over the years, computer architects have strived to simplify instruction sets, which enables higher performance implementations by allowing designers to spend effort and time on features which improve performance as opposed to spending their energies on the complexity inherent in the instruction set.
Instruction set design has progressed through CISC, RISC, VLIW and EPIC types. Architectures that deal with data parallelism include SIMD and vector processors.
One of the first, and most powerful, techniques to improve performance is the use of the instruction pipeline. Early processor designs would carry out all of the steps above for one instruction before moving onto the next. Large portions of the circuitry were left idle at any one step; for instance, the instruction decoding circuitry would be idle during execution and so on.
Pipelines improve performance by allowing a number of instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode (step 1) a new instruction while the last one was waiting for results. This would allow up to four instructions to be "in flight" at one time, making the processor look four times as fast. Although any one instruction takes just as long to complete (there are still four steps) the CPU as a whole "retires" instructions much faster and can be run at a much higher clock speed.
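The arithmetic behind this claim can be sketched in a few lines. The model below is idealized: it assumes every stage takes exactly one cycle and that nothing ever stalls, and the function names and the stage letters F, D, E, W are chosen only for illustration.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction fills the pipeline; after that, one finishes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def pipeline_diagram(n_instructions, stages=("F", "D", "E", "W")):
    """One text row per instruction, one column per cycle; '.' marks a cycle
    in which that instruction is not in the pipeline."""
    rows = []
    for i in range(n_instructions):
        cells = ["."] * i + list(stages) + ["."] * (n_instructions - 1 - i)
        rows.append(" ".join(cells))
    return rows


if __name__ == "__main__":
    for row in pipeline_diagram(4):
        print(row)
    n, k = 100, 4
    print(unpipelined_cycles(n, k), "cycles without pipelining")
    print(pipelined_cycles(n, k), "cycles with pipelining")
```

For 100 instructions and four stages this gives 400 cycles against 103, so the speedup approaches, but never quite reaches, the factor of four. Each individual instruction still spends four cycles in the machine.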
RISC makes pipelines smaller and much easier to construct by cleanly separating each stage of the instruction process and making them take the same amount of time: one cycle. The processor as a whole operates in an assembly line fashion, with instructions coming in one side and results out the other. Due to the reduced complexity of the Classic RISC pipeline, the pipelined core and an instruction cache could be placed on the same size die that would otherwise fit the core alone on a CISC design. This was the real reason that RISC was faster. Early designs like the SPARC and MIPS often ran over 10 times as fast as Intel and Motorola CISC solutions at the same clock speed and price.
Pipelines are by no means limited to RISC designs. By 1986 the top-of-the-line VAX (the 8800) was a heavily pipelined design, slightly predating the first commercial MIPS and SPARC designs. Most modern CPUs (even embedded CPUs) are now pipelined, and microcoded CPUs with no pipelining are seen only in the most area-constrained embedded processors. Large CISC machines, from the VAX 8800 to the modern Pentium 4 and Athlon, are implemented with both microcode and pipelines. Improvements in pipelining and caching are the two major microarchitectural advances that have enabled processor performance to keep pace with the circuit technology on which they are based.
It was not long before improvements in chip manufacturing allowed for even more circuitry to be placed on the die, and designers started looking for ways to use it. One of the most common was to add an ever-increasing amount of cache memory on-die. Cache is simply very fast memory, memory that can be accessed in a few cycles as opposed to the many cycles needed to talk to main memory. The CPU includes a cache controller which automates reading and writing from the cache. If the data is already in the cache it simply "appears," whereas if it is not, the processor is "stalled" while the cache controller reads it in.
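The hit-or-stall behaviour of a cache controller can be illustrated with a small model. The sketch below assumes a direct-mapped organization, which is only one of several possible designs, and the line count, line size and cycle costs are invented numbers rather than figures for any real processor.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks tags only, not the data itself."""

    def __init__(self, n_lines=4, line_size=16, hit_cycles=2, miss_cycles=50):
        self.n_lines = n_lines
        self.line_size = line_size
        self.hit_cycles = hit_cycles      # assumed cost of a hit
        self.miss_cycles = miss_cycles    # assumed cost of going to main memory
        self.tags = [None] * n_lines      # which memory block each line holds
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return the number of cycles the processor waits for this access."""
        block = address // self.line_size
        index = block % self.n_lines      # the one line this block may occupy
        tag = block // self.n_lines
        if self.tags[index] == tag:
            self.hits += 1                # the data simply "appears"
            return self.hit_cycles
        self.tags[index] = tag            # stall while the line is read in
        self.misses += 1
        return self.miss_cycles


if __name__ == "__main__":
    cache = DirectMappedCache()
    total = sum(cache.access(a) for a in [0, 4, 8, 64, 0])
    print(cache.hits, "hits,", cache.misses, "misses,", total, "cycles")
```

In the sample trace, addresses 0, 4 and 8 share a line, so only the first access misses. Address 64 maps to the same line and evicts it, which makes the final access to address 0 miss again.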
RISC designs started adding cache in the mid-to-late 1980s, often only 4 KB in total. This number grew over time, and typical CPUs now have about 512 KB, while more powerful CPUs come with 1 or 2 or even 4, 6, 8 or 12 MB, organized in multiple levels of a memory hierarchy. Generally speaking, more cache means more speed.
Caches and pipelines were a perfect match for each other. Previously, it didn't make much sense to build a pipeline that could run faster than the access latency of off-chip memory. Using on-chip cache memory instead meant that a pipeline could run at the speed of the cache access latency, a much smaller length of time. This allowed the operating frequencies of processors to increase at a much faster rate than that of off-chip memory.
One of the barriers to achieving higher performance through instruction-level parallelism is pipeline stalls and flushes due to branches. Normally, whether a conditional branch will be taken isn't known until late in the pipeline, as conditional branches depend on results coming from a register. From the time that the processor's instruction decoder has figured out that it has encountered a conditional branch instruction to the time that the deciding register value can be read out, the pipeline might be stalled for several cycles. On average, every fifth instruction executed is a branch, so that's a high amount of stalling. If the branch is taken, it's even worse, as then all of the subsequent instructions which were in the pipeline need to be flushed.
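A back-of-the-envelope calculation shows how costly this is. The sketch below uses the one-in-five branch frequency quoted above; the base cost of one cycle per instruction and the stall lengths are assumptions picked for illustration, and the function names are hypothetical.

```python
def effective_cpi(base_cpi, branch_fraction, stall_cycles):
    """Average cycles per instruction when every branch stalls the pipeline."""
    return base_cpi + branch_fraction * stall_cycles


def slowdown(base_cpi, branch_fraction, stall_cycles):
    """How many times slower the machine runs compared with no branch stalls."""
    return effective_cpi(base_cpi, branch_fraction, stall_cycles) / base_cpi


if __name__ == "__main__":
    # Assumed: an ideal pipeline retiring one instruction per cycle,
    # with 20% of the executed instructions being branches.
    for stall in (1, 3, 10):
        cpi = effective_cpi(1.0, 0.2, stall)
        print(f"stall of {stall} cycles: {cpi:.2f} cycles per instruction")
```

With a three-cycle stall the average cost rises from 1 to about 1.6 cycles per instruction, and the penalty grows linearly with the length of the stall.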
Techniques such as branch prediction and speculative execution are used to lessen these branch penalties. Branch prediction is where the hardware makes educated guesses on whether a particular branch will be taken. The guess allows the hardware to prefetch instructions without waiting for the register read. Speculative execution is a further enhancement in which the code along the predicted path is executed before it is known whether the branch should be taken or not.
Even with all of the added complexity and gates needed to support the concepts outlined above, improvements in semiconductor manufacturing soon allowed even more logic gates to be used.
In the outline above the processor processes parts of a single instruction at a time. Computer programs could be executed faster if multiple instructions were processed simultaneously. This is what superscalar processors achieve, by replicating functional units such as ALUs. The replication of functional units was only made possible when the die area of a single-issue processor no longer stretched the limits of what could be reliably manufactured. By the late 1980s, superscalar designs started to enter the marketplace.
In modern designs it is common to find two load units, one store (many instructions have no results to store), two or more integer math units, two or more floating point units, and often a SIMD unit of some sort. The instruction issue logic grows in complexity by reading in a huge list of instructions from memory and handing them off to the different execution units that are idle at that point. The results are then collected and re-ordered at the end.
The addition of caches reduces the frequency or duration of stalls due to waiting for data to be fetched from the memory hierarchy, but does not get rid of these stalls entirely. In early designs a cache miss would force the cache controller to stall the processor and wait. Of course there may be some other instruction in the program whose data is available in the cache at that point. Out-of-order execution allows that ready instruction to be processed while an older instruction waits on the cache, then re-orders the results to make it appear that everything happened in the programmed order.
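The benefit can be seen in a small scheduling model. The sketch below is heavily simplified and rests on several assumptions: at most one instruction issues per cycle, each register is written only once (as if registers had already been renamed), and the final re-ordering of results is not modelled. The `simulate` function and the instruction format are hypothetical.

```python
def simulate(program, out_of_order):
    """Return the cycle at which the last result becomes available.

    program is a list of (dest, sources, latency) tuples in program order.
    """
    # A register written by some instruction is unavailable until that
    # instruction has issued and its latency has elapsed.
    ready_at = {dest: float("inf") for dest, _, _ in program}
    pending = list(range(len(program)))
    cycle = 0
    finish = 0
    while pending:
        # In-order issue may only look at the oldest waiting instruction;
        # out-of-order issue may pick the oldest one whose operands are ready.
        window = pending if out_of_order else pending[:1]
        chosen = next(
            (i for i in window
             if all(ready_at.get(src, 0) <= cycle for src in program[i][1])),
            None,
        )
        if chosen is not None:
            dest, _, latency = program[chosen]
            ready_at[dest] = cycle + latency
            finish = max(finish, ready_at[dest])
            pending.remove(chosen)
        cycle += 1
    return finish


if __name__ == "__main__":
    program = [
        ("r1", [], 10),      # load that misses the cache (assumed 10 cycles)
        ("r2", ["r1"], 1),   # needs the loaded value
        ("r3", [], 1),       # independent work
        ("r4", ["r3"], 1),   # depends only on the independent work
    ]
    print("in-order:", simulate(program, out_of_order=False))
    print("out-of-order:", simulate(program, out_of_order=True))
```

In the sample program the two independent instructions run while the load is outstanding, so the out-of-order run finishes in 11 cycles instead of 13.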
One problem with an instruction pipeline is that there is a class of instructions that must make their way entirely through the pipeline before execution can continue. In particular, conditional branches need to know the result of some prior instruction before "which side" of the branch to run is known. For instance, an instruction that says "if x is larger than 5 then do this, otherwise do that" will have to wait for the results of x to be known before it knows if the instructions for this or that can be fetched.
For a small four-deep pipeline this means a delay of up to three cycles — the decode can still happen. But as clock speeds increase the depth of the pipeline increases with it, and some modern processors may have 20 stages or more. In this case the CPU is being stalled for the vast majority of its cycles every time one of these instructions is encountered.
The solution, or one of them, is speculative execution guided by branch prediction. In reality one side or the other of the branch will be called much more often than the other, so it is often correct to simply go ahead and say "x will likely be smaller than five, start processing that". If the prediction turns out to be correct, a huge amount of time will be saved. Modern designs have rather complex prediction systems, which watch the results of past branches to predict the future with greater accuracy.
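One classic textbook way of watching past branches is a two-bit saturating counter per branch. The sketch below shows that scheme only as an example; it is not claimed to be what any particular processor uses, and the class name, the starting counter value and the branch address 0x40 are assumptions.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address.

    Counter values 0 and 1 predict "not taken"; 2 and 3 predict "taken".
    """

    def __init__(self):
        self.counters = {}  # branch address -> counter in the range 0..3

    def predict(self, address):
        # Unseen branches start at 1 (weakly not taken), an arbitrary choice.
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        counter = self.counters.get(address, 1)
        if taken:
            self.counters[address] = min(counter + 1, 3)
        else:
            self.counters[address] = max(counter - 1, 0)


def accuracy(predictor, outcomes, address=0x40):
    """Fraction of correct predictions for one branch over a list of outcomes."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    # A loop branch: taken nine times, then falls through once.
    loop = [True] * 9 + [False]
    print(accuracy(TwoBitPredictor(), loop))
    print(accuracy(TwoBitPredictor(), loop * 2))
```

Because the counter needs two wrong guesses in a row to change its mind, the single fall-through at the end of a loop does not spoil the prediction for the next time the loop is entered.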
Computer architects have become stymied by the growing mismatch in CPU operating frequencies and DRAM access times. None of the techniques that exploited instruction-level parallelism within one program could make up for the long stalls that occurred when data had to be fetched from main memory. Additionally, the large transistor counts and high operating frequencies needed for the more advanced ILP techniques required power dissipation levels that could no longer be cheaply cooled. For these reasons, newer generations of computers have started to exploit higher levels of parallelism that exist outside of a single program or program thread.
This trend is sometimes known as throughput computing. This idea originated in the mainframe market where online transaction processing emphasized not just the execution speed of one transaction, but the capacity to deal with massive numbers of transactions. With transaction-based applications such as network routing and web-site serving greatly increasing in the last decade, the computer industry has re-emphasized capacity and throughput issues.
One way this parallelism is achieved is through multiprocessing systems, computer systems with multiple CPUs. Once reserved for high-end mainframes and supercomputers, small scale (2-8) multiprocessor servers have become commonplace for the small business market. For large corporations, large scale (16-256) multiprocessors are common. Even personal computers with multiple CPUs have appeared since the 1990s.
With further transistor size reductions made available with semiconductor technology advances, multicore CPUs have appeared where multiple CPUs are implemented on the same silicon chip. They were initially used in chips targeting embedded markets, where simpler and smaller CPUs would allow multiple instantiations to fit on one piece of silicon. By 2005, semiconductor technology allowed dual high-end desktop CPU CMP chips to be manufactured in volume. Some designs, such as Sun Microsystems' UltraSPARC T1, have reverted to simpler (scalar, in-order) designs in order to fit more processors on one piece of silicon.
Another technique that has become more popular recently is multithreading. Multithreading computers have hardware support to efficiently execute multiple threads. In multithreading, when the processor has to fetch data from slow system memory, instead of stalling for the data to arrive, the processor switches to another program or program thread which is ready to execute. Though this does not speed up a particular program/thread, it increases the overall system throughput by reducing the time the CPU is idle.
Conceptually, multithreading is equivalent to a context switch at the operating system level. The difference is that a multithreaded CPU can do a thread switch in one CPU cycle instead of the hundreds or thousands of CPU cycles a context switch normally requires. This is achieved by replicating the state hardware (such as the register file and program counter) for each active thread.
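A toy model makes the throughput gain visible. The sketch below assumes a single-issue core that switches to another ready thread at no cost whenever the current one is waiting on memory. The operation codes "c" and "m", the five-cycle memory latency and the `run_threads` function are all invented for the example.

```python
def run_threads(threads, miss_latency=5):
    """Return the total cycles needed to finish every thread.

    Each thread is a list of operations: "c" is one cycle of computation,
    "m" is a memory access whose data arrives miss_latency cycles later.
    """
    pcs = [0] * len(threads)             # replicated per-thread program counters
    blocked_until = [0] * len(threads)   # cycle at which each thread's data arrives
    cycle = 0
    while any(pc < len(ops) for pc, ops in zip(pcs, threads)):
        # Issue from the first thread that has work left and is not waiting.
        for tid, ops in enumerate(threads):
            if pcs[tid] < len(ops) and blocked_until[tid] <= cycle:
                if ops[pcs[tid]] == "m":
                    blocked_until[tid] = cycle + miss_latency
                pcs[tid] += 1
                break
        cycle += 1
    # A thread that ends on a memory access still has to wait for its data.
    return max([cycle] + blocked_until)


if __name__ == "__main__":
    work = ["m", "c", "m", "c"]
    alone = run_threads([work])
    together = run_threads([work, work])
    print("one thread:", alone, "cycles")
    print("two threads back to back:", 2 * alone, "cycles")
    print("two threads interleaved:", together, "cycles")
```

Neither thread finishes sooner than it would alone, but the pair completes in 14 cycles instead of 24 because one thread's memory waits are filled with the other thread's work.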
A further enhancement is simultaneous multithreading. This technique allows superscalar CPUs to execute instructions from different programs/threads simultaneously in the same cycle.
See Article History of general purpose CPUs for other research topics affecting CPU design.