Superscalar

Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed.
Processor board of a CRAY T3e parallel computer with four superscalar Alpha processors

A superscalar CPU architecture implements a form of parallel computing called instruction-level parallelism inside a single processor, which allows higher CPU throughput at the same clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units built into the processor. Each functional unit is not a separate CPU core but an execution resource inside the CPU, such as an arithmetic logic unit (ALU), a floating-point unit (FPU), a bit shifter, or a multiplier.

While most superscalar CPUs are also pipelined, it is possible to have a non-pipelined superscalar CPU or a pipelined non-superscalar CPU.

The superscalar technique is associated with several identifying characteristics of the CPU core:

  1. Instructions are issued from a sequential instruction stream.
  2. CPU hardware dynamically checks for data dependencies between instructions at run time.
  3. The CPU accepts multiple instructions per clock cycle.
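
The run-time dependency check in item 2 can be illustrated with a small sketch. The `Instr` type and the `hazards` function are hypothetical names made up for this example, and the three-operand register format is an assumption; real hardware performs these comparisons with dedicated comparator circuits, not software.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: dest = op(srcs)."""
    op: str
    dest: str
    srcs: tuple


def hazards(earlier: Instr, later: Instr) -> set:
    """Return the data hazards that `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads a register that earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites a register that earlier reads
    if later.dest == earlier.dest:
        found.add("WAW")  # both instructions write the same register
    return found


a = Instr("add", "r1", ("r2", "r3"))
b = Instr("mul", "r4", ("r1", "r5"))  # reads r1, which a writes
c = Instr("sub", "r6", ("r7", "r8"))  # shares no registers with a

print(hazards(a, b))  # {'RAW'}
print(hazards(a, c))  # set()
```

In this sketch, `a` and `c` could be issued in the same clock cycle, while `b` has to wait for the result of `a`.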

Each instruction executed by a scalar processor manipulates one or two data items at a time, while each instruction executed by a vector processor operates on many data items simultaneously. A superscalar processor is a mixture of the two:

  1. Each instruction processes one data item.
  2. There are multiple redundant functional units inside each CPU core, so that multiple instructions can manipulate separate data items concurrently.

In a superscalar CPU, an instruction dispatcher reads instructions from memory, decides which ones can run in parallel, and dispatches them to the redundant functional units available inside the CPU.
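
A minimal sketch of such a dispatcher, under simplifying assumptions that are not from the article: instructions issue strictly in program order, every instruction takes one cycle, any functional unit can run any instruction, and an instruction is represented only by a hypothetical `(dest, srcs)` pair of register names. The function name `dispatch_in_order` is made up for illustration.

```python
def dispatch_in_order(program, width=2):
    """Group instructions into per-cycle issue packets.

    program: list of (dest, srcs) tuples in program order.
    An instruction joins the current packet only if the packet has room
    and it neither reads nor writes a register written by an instruction
    already in that packet. Otherwise it starts the next packet (cycle).
    """
    packets = []
    current, written = [], set()
    for index, (dest, srcs) in enumerate(program):
        conflict = dest in written or any(s in written for s in srcs)
        if current and (len(current) == width or conflict):
            packets.append(current)
            current, written = [], set()
        current.append(index)
        written.add(dest)
    if current:
        packets.append(current)
    return packets


program = [
    ("r1", ("r2", "r3")),  # 0
    ("r4", ("r5", "r6")),  # 1: independent of 0
    ("r7", ("r1", "r4")),  # 2: needs the results of 0 and 1
    ("r8", ("r7", "r2")),  # 3: needs the result of 2
    ("r9", ("r5", "r6")),  # 4: independent of 3
]

print(dispatch_in_order(program, width=2))  # [[0, 1], [2], [3, 4]]
print(dispatch_in_order(program, width=1))  # [[0], [1], [2], [3], [4]]
```

With two-wide dispatch the five instructions take three cycles instead of five. The middle packet holds only one instruction because instruction 3 depends on instruction 2, so one functional unit sits idle in that cycle.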

Superscalar CPU design is concerned with improving the accuracy of the instruction dispatcher and allowing it to keep the multiple functional units in use at all times. As of 2008, all general-purpose CPUs are superscalar; a typical superscalar CPU may include up to four ALUs, two FPUs, and two SIMD units. If the dispatcher fails to keep all of the units busy, the performance of the CPU suffers.

Limitations

The performance improvement available from superscalar CPU design is limited in two key areas:

  1. The degree of intrinsic parallelism in the instruction stream, i.e. the limited amount of instruction-level parallelism.
  2. The complexity and time cost of the dispatcher and its associated dependency-checking logic.

Even with infinitely fast dependency-checking logic in an otherwise conventional superscalar CPU, an instruction stream with many dependencies limits the possible speedup. This is why the degree of intrinsic parallelism in the code stream is a limitation in its own right.
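
One way to make this limit concrete is to compare the number of instructions with the length of the longest chain of dependent instructions. The sketch below assumes unit latency, counts only true (read-after-write) dependencies, and uses a hypothetical `(dest, srcs)` instruction representation; `ilp_limit` is an illustrative name, not an established API.

```python
def ilp_limit(program):
    """Return (instruction_count, critical_path_length).

    program: list of (dest, srcs) tuples in program order.
    The critical path is the longest chain of instructions in which each
    one reads a register written by the previous one.
    """
    depth_of = {}  # register -> chain depth of its most recent writer
    longest = 0
    for dest, srcs in program:
        depth = 1 + max((depth_of.get(s, 0) for s in srcs), default=0)
        depth_of[dest] = depth
        longest = max(longest, depth)
    return len(program), longest


# Every instruction needs the result of the one before it.
chain = [("r1", ("r0",)), ("r1", ("r1",)), ("r1", ("r1",)), ("r1", ("r1",))]
# No instruction needs the result of another.
independent = [("r1", ("r0",)), ("r2", ("r0",)), ("r3", ("r0",)), ("r4", ("r0",))]

for name, prog in (("chain", chain), ("independent", independent)):
    count, path = ilp_limit(prog)
    print(name, count / path)
# chain 1.0
# independent 4.0
```

Under these assumptions the dependent chain cannot finish in fewer than four cycles however many functional units are available, while the independent sequence could in principle complete in one cycle on a four-wide machine.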

No matter how fast the dispatcher is, there is a practical limit on how many instructions can be dispatched simultaneously. While hardware advances will allow greater numbers of functional units (e.g., ALUs), the cost of checking instruction dependencies grows quickly with the number of instructions examined together, so the achievable dispatch width stays fairly small, likely on the order of five to six simultaneously dispatched instructions.
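
A rough way to see why the checking cost grows: if every instruction in a group has its source registers compared against the destination of every earlier instruction in the same group, the number of comparisons grows with the square of the group size. The sketch below is a simplified counting model, assuming two source registers per instruction; it is not a description of any particular processor's issue logic, and `intra_group_comparisons` is a made-up name.

```python
def intra_group_comparisons(width, sources_per_instr=2):
    """Count source-versus-destination comparisons within one issue group."""
    ordered_pairs = width * (width - 1) // 2  # (earlier, later) pairs
    return ordered_pairs * sources_per_instr


for width in (1, 2, 4, 6, 8):
    print(width, intra_group_comparisons(width))
# 1 0
# 2 2
# 4 12
# 6 30
# 8 56
```

Doubling the width from four to eight more than quadruples the comparison count in this model, which is one reason wide dispatch is expensive.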

Alternatives

  • Simultaneous multithreading: often abbreviated as SMT, this is a technique for improving the overall efficiency of superscalar CPUs. SMT permits multiple independent threads of execution to make better use of the resources available inside a modern superscalar processor.
  • Multi-core processors: superscalar processors differ from multi-core processors in that the multiple redundant functional units are not entire processors. A single superscalar processor is composed of advanced functional units such as the ALU, integer multiplier, integer shifter, floating-point unit (FPU), etc. There may be multiple copies of each functional unit to enable execution of many instructions in parallel. This differs from a multi-core processor, which concurrently processes instructions from multiple threads, one thread per core.
  • Pipelined processors: superscalar processors also differ from pipelined CPUs, in which multiple instructions can concurrently be in various stages of execution.

The various alternative techniques are not mutually exclusive; they can be (and frequently are) combined in a single processor. Thus it is possible to design a multicore CPU in which each core is an independent processor containing multiple parallel superscalar pipelines. Some multicore processors also include vector capability.
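
The difference between pipelining and superscalar dispatch, and the effect of combining them, can be sketched with an idealized cycle count. The model assumes no stalls, no dependencies, and at least one instruction; the function `ideal_cycles` and its parameters are illustrative assumptions, and real programs will take longer.

```python
import math


def ideal_cycles(n_instr, stages, width=1, pipelined=True):
    """Best-case cycle count for n_instr instructions (n_instr >= 1).

    stages: cycles one instruction needs from start to finish.
    width:  instructions that can be started in the same cycle.
    """
    groups = math.ceil(n_instr / width)
    if not pipelined:
        # Each group must finish completely before the next one starts.
        return groups * stages
    # A new group enters the pipeline every cycle once it is filled.
    return stages + groups - 1


print(ideal_cycles(100, 5, width=1, pipelined=False))  # 500
print(ideal_cycles(100, 5, width=1, pipelined=True))   # 104
print(ideal_cycles(100, 5, width=2, pipelined=False))  # 250
print(ideal_cycles(100, 5, width=2, pipelined=True))   # 54
```

The four calls correspond to a plain scalar CPU, a pipelined scalar CPU, a non-pipelined superscalar CPU, and a pipelined superscalar CPU.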


