`
`
`
`
`
`
`
`
`
`
`
`
ASSOCIATIVE AND PARALLEL PROCESSORS

KENNETH J. THURBER
Product Development Group, Sperry Univac Defense Systems, St. Paul, Minnesota 55165, and Computer, Information and Control Sciences Department, University of Minnesota, Minneapolis, Minnesota 55455

LEON D. WALD
Test Equipment Engineering, Government and Aeronautical Products Division, Honeywell, Inc., Minneapolis, Minnesota
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
This paper is a tutorial survey of the area of parallel and associative processors. The paper covers the main design tradeoffs and major architectures of SIMD (Single Instruction Stream Multiple Data Stream) systems. Summaries of ILLIAC IV, STARAN, OMEN, and PEPE, the major SIMD processors, are included.

Keywords and Phrases: associative processor, parallel processor, OMEN, STARAN, PEPE, ILLIAC IV, architecture, large-scale systems, SIMD processors, array processors, ensemble

CR Categories: 3.80, 4.20, 6.22
`
`
`
`
`
`
`
`
`
`
`
`
`
INTRODUCTION

Purpose and Scope

The purpose of this paper is to present a tutorial survey on the subject of parallel and associative processors. The paper covers the topics of system categorizations, applications, main tradeoff issues, historically important architectures, and the architectures of systems that are currently available.

Currently, the microprocessor/computer-on-a-chip revolution is providing the potential for production of very cost-effective high-performance computers through utilization of a large number of these processors in parallel or in a network. The parallel and associative processors provide a class of architectures which can be readily used to take immediate advantage of the microprocessor technology.

Surveys

A number of good surveys on the subject (or bordering on the subject) of parallel and associative processors have been published [1-10]. Some of these surveys are only of historical interest due to their age. Several conferences in this area may also be of interest [11-14].

Classification of Computer Architectures

Many approaches to classification of computer architectures are possible. Most techniques use global architecture properties and are thus valid only within limited ranges. The main classification techniques discussed below provide a framework within which to view associative and parallel processors.

Flynn [15] proposed a classification scheme that divides systems into categories based upon their procedure and data streams. The four categories are:

1) SISD (Single Instruction Stream Single Data Stream) — a uniprocessor such as a single processor IBM 360.

2) MISD (Multiple Instruction Stream Single Data Stream) — a pipeline processor such as CDC STAR.
`
`
`
`
`
`
`
`
`
`
`
Copyright © 1976, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted, provided that ACM's copyright notice is given and that reference is made to this publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.

Computing Surveys, Vol. 7, No. 4, December 1975
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
CONTENTS

INTRODUCTION
  Purpose and Scope
  Surveys
  Classification of Computer Architectures
  Definitions
  Reasons for Use of SIMD Processors
  Applications
  Matrix Multiplication on a Parallel Processor
SIMD GENEALOGY
MAIN TRADEOFF ISSUES
  Associative Addressing
  The State Transition Concept and Associative Processor Algorithms
  Processing Element (PE) Interconnections
  Input/Output
  Host Interaction
  Memory Distribution
  Software
  Activity Control and Match Resolution
  Balance Between Logic and Memory
ASSOCIATIVE PROCESSORS
  Introduction
  DLM (Distributed Logic Memory)
  Highly Parallel Associative Processors
  Mixed Mode and Multidimensional Memories
  Implemented Associative Processors
  Associative Processor Simulators
PARALLEL PROCESSORS AND ENSEMBLES
  Introduction
  Unger's Machine
  SOLOMON I, SOLOMON II, and ILLIAC IV
  Machines With Other Interconnection Structures
  Orthogonal Computer
  PEPE
  Comment
ACTUAL MACHINES
  Introduction
  STARAN
  OMEN
  PEPE
  ILLIAC IV
CONCLUSION
REFERENCES

3) SIMD (Single Instruction Stream Multiple Data Stream) — an associative or parallel processor such as ILLIAC IV.

4) MIMD (Multiple Instruction Stream Multiple Data Stream) — a multiprocessor such as a Univac 1108 multiprocessor system.

Murtha and Beadles [16] proposed a classification technique based upon parallelism properties. Their categories are:

1) General-purpose network computer;

2) Special-purpose network with global parallelism; and

3) Non-global, semi-independent network with local parallelism — this category is a catchall for machines which do not fit into the first two categories.

Categories 1 and 2 have the following subcases:

1.a) General-purpose network with centralized common control;

1.b) General-purpose network with identical processors but independent instruction execution actions;

2.a) Pattern processors;

2.b) Associative processors.

The purpose of this classification technique was to differentiate between multiprocessors and highly parallel organizations.

Another possible classification view, suggested by Hobbs, et al. [1], consists of the following categories:

1) Multiprocessor;
2) Associative processor;
3) Network or array processor; and
4) Functional machines.

Further, Hobbs, et al. suggested that architectures could be classified based upon the amount of parallelism in:

1) Control;
2) Processing units; and
3) Data streams.

However, it was noted that these parameters were present in all highly parallel machines and were therefore not adequate to define a machine architecture.

In his article against parallel processors, Shore [17] presents a unique classification technique which derives the machine descriptions from the description of a uniprocessor. The machine categories considered are summarized as follows:

1) Machine I — a uniprocessor.

2) Machine II — a bit-slice associative processor built from Machine I by adding bit-slice processing and access capability (e.g., STARAN).

3) Machine III — an orthogonal computer derived from Machine II by adding parallel word processing and access capability (e.g., OMEN).
`
`
`
`
`
`0
`
`
`
`
`
`
`
`
`
`217
`
`
`
`
`
`Associative and Parallel Processors
`
`
`
`
`
`
`
`
`sistent with those used by Thurber [10]. The
`
`
`
`definitions are'
`
`
`
`
`
`
`l) SIMD machine: any computer with a
`
`
`
`
`
`
`single global control unit which drives
`
`
`
`
`
`multiple processing units, all of which
`
`
`
`
`
`either execute or
`ignore the current
`
`instruction.
`
`
`
`
`
`
`2) Associative Processor: any SIMD machine
`
`
`
`
`
`
`
`in which the processing units (or proc—
`
`
`
`
`
`
`essmg memory)
`are addressed by a
`
`
`
`
`
`
`
`property of the data contents rather than
`
`
`
`
`
`
`
`
`by address (i.e., multiply A and B to-
`
`
`
`
`
`
`
`gether in all elements where A < B).
`
`
`
`
`
`
`
`3) Parallel Processor: any SIMD machine in
`
`
`
`
`
`
`which the processing elements are the
`
`
`
`
`
`
`
`order of complexity of current day small
`
`
`
`
`
`
`computers and which has, typically, high
`
`
`
`
`
`level of interconnectivity between proc—
`
`
`essing elements.
`
`
`
`
`
`
`
`4) Ensemble
`a parallel processor in which
`
`
`
`
`
`the interconnection level between proc-
`
`
`
`
`
`
`
`essing elements is very low or nonex-
`
`istent.
`
`
`
`
`
`
`Clearly, the intersection of these definitions
`
`
`
`
`
`
`
`
`
`
`will not be null due to the overlap in claSSI-
`
`
`
`fying machine architectures.
`
`
`
`
`
`
`
`4) Machine IV — a machine derived from
`
`
`
`
`
`Machine I by rephcating the processing
`
`
`
`units (e.g., PEPE).
`
`
`
`
`
`5) Machine V — a machine derived from
`
`
`
`
`
`Machine IV by adding interconnections
`
`
`
`
`
`between processors (e.g., ILLIAC IV).
`
`
`
`
`
`
`
`6) Machine VI — a machine derived from
`
`
`
`
`
`
`Machine I by integrating processing logic
`
`
`
`
`
`
`into every memory element (e.g., Kautz’s
`
`
`
`
`logic-in-memory computers [85, 86] ).
`
`
`
`
`
`classifies computers using
`[67]
`Higbie
`
`
`
`
`
`
`four
`baSlC categories. However, he
`Flynn’s
`
`
`
`
`
`
`
`expands the SIMD category into the following
`
`
`four subcategories:
`
`
`
`
`
`
`
`1) Array Processor — a processor that proc-
`
`
`
`
`
`
`
`esses data in parallel and addresses the
`
`
`
`
`
`
`
`
`
`data by address instead of by tag or value.
`
`
`
`
`2) Associative Memory Processor — a
`
`
`
`
`
`processor‘that operates on data addressed
`
`
`
`
`
`
`
`
`by tag or value rather than by address
`
`
`
`
`
`
`(Note:
`this definition does not require
`
`
`
`
`
`parallel operation; however,
`the defini-
`
`
`
`
`
`
`
`tion does allow for machines that operate
`
`
`
`
`on data in parallel).
`
`
`
`
`
`
`3) Associative Array Processor — a processor
`
`
`
`
`
`
`assocxative and also operates
`that
`is
`
`
`
`
`
`
`
`on arrays of data (typically. the opera-
`
`
`
`
`
`
`
`
`
`tions are on a b1t-slice basis, i.e., a single
`
`
`
`
`bit of many words).
`
`
`
`
`
`
`4) Orthogonal Processor — a processor w1th
`
`
`
`
`two subsystems — an associative array
`
`
`
`
`
`
`processor subsystem and a serial (SISD)
`
`
`
`
`processor
`subsystem — which share
`a
`
`
`
`common memory array.
`
`
`
`
`
`
`Higbie’s categories provide for the identifica-
`
`
`
`
`
`
`
`
`tion of the ability to perform parallel, asso-
`
`
`
`
`
`
`
`ciative, and serial processing. Higbie defines a
`
`
`
`
`
`
`
`
`parallel processor to be any computer that con-
`
`
`
`
`
`
`
`tains multiple arithmetic units and operates on
`
`
`
`
`
`
`
`multiple data streams. Clearly, all of Higbie’s
`
`
`
`
`
`
`
`four subcategories of SIMD processors fit hls
`
`
`
`
`
`definltion of a parallel processor.
`
`
`
`
`
`
`Although none of the classification schemes
`
`
`
`
`
`
`presented here are mathematically precrse, they
`
`
`
`
`
`
`
`are necessary to place assocrative and parallel
`
`
`
`
`
`
`
`processors in perspective. In the next section
`
`
`
`
`
`
`the terminology of assoc1ative and parallel
`
`
`
`
`
`
`
`
`
`processors will be defined for use in the re-
`
`
`
`
`mainder of this paper.
`
`Definitions
`
`
`
`
`
`
`
`The definitions used for the remainder of this
`
`
`
`
`
`
`
`
`
`
`paper are based upon Flynn [15] , and are con-
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Reasons for Use of SIMD Processors
`
`
`
`
`
`
`
`
`
`There are many reasons for the use of SIMD
`
`
`
`
`
`
`
`architectures. Some of the most important are:
`
`
`1) Functional
`
`
`
`
`
`
`
`a) Large problems such as weather data
`
`processing.
`
`
`
`
`
`b) Problems with inherent data structure
`
`
`and parallelism.
`
`
`
`
`c) Reliability and graceful degradation.
`
`
`
`(1) System complexity.
`
`
`
`e) Computational load.
`
`
`2) Hardware
`
`
`
`
`
`
`a) Better use of hardware on problems
`
`
`
`
`
`With large amounts of parallelism.
`
`
`
`
`
`b) Advent of LSI microprocessors.
`
`
`
`
`
`c) Economy of duplicate structures.
`
`
`
`
`(1) Lower nonrecurring and recurring
`
`costs.
`
`
`3) Software
`
`
`
`
`
`a) Simpler than for multiprocessor.
`
`
`
`
`
`
`b) Easier to construct large systems.
`
`
`
`
`0) Less executive function requirements.
`
`
`
`
`
`
`However, the reader is cautioned to remem-
`
`
`
`
`
`
`ber that these are special-purpose machines and
`
`
`
`
`
`
`
`any attempt
`to apply them to an incorrectly
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Computing Surveys, Vol. 7, No. 4, December 1975
`INTEL - 1010
`
`INTEL - 1010
`
`
`
`Kenneth J. Thurber and Leon D. Wald
`
`
`
`
`
`
`
`
`
`
`
`
`
`218
`
`
`
`.
`
`
`
`
`
`
`
`
`sized, or designed, problem is an exercise in
`futility.
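The execute-or-ignore behavior that defines a SIMD machine above can be sketched in a few lines. This is an illustration of the concept only, not any particular machine; the function name and the choice to leave non-matching elements unchanged are our assumptions.

```python
# One broadcast instruction; each processing element executes it or
# ignores it according to a locally evaluated activity bit.

def simd_step(A, B, predicate, op):
    activity = [predicate(a, b) for a, b in zip(A, B)]   # set by a prior search
    return [op(a, b) if act else b
            for a, b, act in zip(A, B, activity)]

# "Multiply A and B together in all elements where A < B":
A = [2, 5, 3]
B = [4, 1, 9]
result = simd_step(A, B, lambda a, b: a < b, lambda a, b: a * b)
```

All processing elements receive the same instruction; the activity bits alone decide where it takes effect.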
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Applications

Numerous applications have been proposed for associative and parallel processors. Whether an application is ever implemented is based upon the economics of the problem. It is not enough to be able to describe a problem solution. To date, considerable energy has been expended searching for cost-effective parallel solutions, but little has been done to actually implement them. Examples of such applications are: matrix manipulation, differential equations, and linear programming.

Several areas of application appear quite well suited to associative and parallel processing. In some cases, SIMD processors provide cost-effective system augmentation (e.g., air traffic control and associative head-per-track disks). In others, SIMD processors are very close to functional analogs of the physical system (e.g., data compression, concentration, and multiplexing).

The use of associative and parallel processors appears promising in the elimination of critical bottlenecks in current general-purpose computer systems; however, only very small associative memories are required, and cost is really not a critical factor in the solution of these problems. Such problems may be encountered in the management of computer resources, and might involve such items as protection mechanisms, resource allocation, and memory management.

The application trend for associative processors is well defined. Due to cost factors, applications will be limited (in the near future) to problems such as resource management, virtual memory mechanisms, and some augmentation of current systems, rather than to data-base management or large file searching and processing. The application trend for parallel processors and ensembles seems to be in the area of large data computation problems such as weather data processing, nuclear data processing, and ballistic missile defense. Researchers must concentrate on systems aspects of the problem, not on developing solutions to problems which are ill-defined or nonexistent. The only truly useful data-processing applications of parallel and associative processors have been developed through extensive work at the system level, followed by parallel systems design, rather than vice versa.

SIMD processors have been proposed and appear well suited for a number of applications; these are listed here along with their primary references. Factors to be considered in applying SIMD processors have been discussed elsewhere [18]. To be useful for solving the listed problems, the highly parallel processors must become more cost-effective. The suggested application areas are:

1) Applications in which associative processors appear to be cost-effective:
   a) Virtual memory mechanisms [19]
   b) Resource allocation [20]
   c) Hardware executives [21]
   d) Interrupt processing [22, 23]
   e) Protection mechanisms [19]
   f) Scheduling [20]

2) Applications in which parallel processors appear cost-effective and in which associative processors may be cost-effective:
   a) Bulk filtering [24]
   b) Tracking [25]
   c) Air traffic control [26]
   d) Data compression [27]
   e) Communications multiplexing [28, 29]
   f) Signal processing [30]

3) Applications in which associative processors would be useful if they were more cost-effective and in which parallel processors are probably cost-effective:
   a) Information retrieval [31]
   b) Sorting [32]
   c) Symbol manipulation [33]
   d) Pattern recognition [34]
   e) Picture processing [35]
   f) Sonar processing [36]
   g) Sea surveillance [37]
   h) Graph processing [38]
   i) Dynamic programming [39]
   j) Differential equations [40]
   k) Eigenvector [41]
   l) Matrix operations [42]
   m) Network flow analysis [43]
   n) Document retrieval [44]
   o) Data file manipulation [45]
   p) Machine document translation [46]
   q) Data file searching [47]
   r) Compilation [48]
   s) Formatted files [49]
   t) Automatic abstracting [50, 51]
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
   u) Dictionary-look-up translations [52]
   v) Data management [53]
   w) Theorem proving [54]
   x) Computer graphics [55]
   y) Weather data processing [56]

Hollander's paper [2] presents many experts' opinions about the applicability of associative and parallel processors. Slotnick [57] and Fuller [58] have compared associative and parallel processors. They concluded that parallel processors appear more useful than associative processors, but the machines would always be special purpose. Shore [17] has presented a very eloquent case against SIMD machines and parallel processors.

Matrix Multiplication on a Parallel Processor

Matrix manipulations can be used to illustrate the potential of a parallel processor. One of the major computations in many of the possible applications cited previously is the matrix multiply. Assume that a parallel processor such as that shown in Figure 1 is used to perform this operation. Each processing element will be assumed to have three registers, AREG, BREG, and CREG. Each cell is interconnected to its four nearest neighbors. Cannon [59] derived the following algorithm to multiply two nxn matrices together in n stages using this processor.

[Figure 1. Parallel Processor with PE Connections to the Four Nearest Neighbors. A control unit drives the PE array, with memory transfers to and from all PEs.]

Algorithm:
  Set:      CREG = 0
            BREG [PE I,J] = B[I,J] for all I,J ≤ N
            AREG [PE I,J] = A[I,J] for all I,J ≤ N
  Shift:    Jth column of B up J-1 rows, for all J ≤ N
            Ith row of A left I-1 columns, for all I ≤ N
  Multiply: TREG = AREG times BREG, in parallel in all PEs
  Add:      CREG = CREG + TREG, in parallel in all PEs
  Shift:    AREG right one column
            BREG down one row
  Jump:     If not Nth pass, jump to Multiply

As an example let:

  A = | a1 a2 a3 |      B = | b1 b2 b3 |      and C = A x B.
      | a4 a5 a6 |          | b4 b5 b6 |
      | a7 a8 a9 |          | b7 b8 b9 |

After initialization, the memory map for CREG is:

  | 0 0 0 |
  | 0 0 0 |
  | 0 0 0 |

For AREG, the memory map is:

  | a1 a2 a3 |
  | a5 a6 a4 |
  | a9 a7 a8 |

And the BREG map is:

  | b1 b5 b9 |
  | b4 b8 b3 |
  | b7 b2 b6 |

After the multiply, add, shift, and jump, the memory maps appear as follows:

  CREG = | a1b1 a2b5 a3b9 |      AREG = | a3 a1 a2 |      BREG = | b7 b2 b6 |
         | a5b4 a6b8 a4b3 |             | a4 a5 a6 |             | b1 b5 b9 |
         | a9b7 a7b2 a8b6 |             | a8 a9 a7 |             | b4 b8 b3 |

After two more iterations the multiply will be finished and CREG [PE I,J] will contain C[I,J].
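Cannon's procedure can be sketched in software. The sketch below is an illustration, not the hardware: the PE grid becomes a list of lists, the initial skew and the per-pass end-around shifts mirror the Shift steps above, and the function name is ours.

```python
# A software sketch of Cannon's algorithm: multiply two n x n matrices
# in n multiply-add passes over a simulated grid of processing elements.

def cannon_multiply(A, B):
    n = len(A)
    # Initial skew: row I of A left I-1 columns, column J of B up J-1 rows
    # (0-based here, so row i shifts left i and column j shifts up j).
    AREG = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    BREG = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    CREG = [[0] * n for _ in range(n)]
    for _ in range(n):                      # N passes of Multiply, Add, Shift
        for i in range(n):
            for j in range(n):              # "in parallel in all PEs"
                CREG[i][j] += AREG[i][j] * BREG[i][j]
        # Shift: AREG right one column, BREG down one row (end-around).
        AREG = [[AREG[i][(j - 1) % n] for j in range(n)] for i in range(n)]
        BREG = [[BREG[(i - 1) % n][j] for j in range(n)] for i in range(n)]
    return CREG
```

Each PE multiplies only the pair of operands it currently holds, so one pass contributes one term to every C[I,J] simultaneously; the end-around shifts rotate the remaining terms into place.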
`
`
`
SIMD GENEALOGY

Figure 2 indicates the major parallel and associative architectures considered in this paper, together with their generic relationships. These may differ substantially, yet they are related by virtue of their global associative or parallel properties.

[Figure 2. SIMD Genealogy. A tree dividing SIMD machines into associative processors (e.g., MEMEX, the cryogenic catalog memory, distributed logic memories, the association-storing processor, and the implemented associative processors, including STARAN), parallel processors (e.g., Unger's machine, SOLOMON I and II, ILLIAC IV, the orthogonal computer, and OMEN), and ensembles (e.g., PEPE).]
`
`
`
`
`
`
MAIN TRADEOFF ISSUES

To allow the reader to place the presented architectures in perspective, the major system tradeoff issues in SIMD processors will be discussed next. It should be understood that SIMD processors are special purpose. They are useful for a limited set of applications which may be characterized as having [18]:

1) A large number of independent data sets;
2) No relationships between data sets that prevent them from being processed in parallel;
3) The requirement for quick throughput response; and
4) The potential to exploit associative addressing selection techniques.
`
`
Associative Addressing

Figure 3 indicates the main features required to implement an associative address process. The designer must include facilities to address or query the processor associatively, i.e., to address the memory array by content. In Figure 3, the application may require that all employees with a salary of $35 per day be located. This would be accomplished by performing an equality search on the daily salary field of the file. This search is performed in parallel. To set up the search, a Data Register must be loaded with the daily salary ($35) for comparison; a Mask Register is included to mask the data register so that only the desired fields may be searched; a Word Select (activity select) Register specifies the subset of words to be addressed; a Results Register is required to collect the results of the search; a Match Indicator is used to indicate the number of matches; and a Multiple Match Resolver must be provided to generate a pointer to the "topmost" matched word. An eight-word example of a personnel file stored in a simple associative memory is illustrated in Figure 3; it assumes that the search for a salary of $35 per day was initiated on the entire file. All necessary registers and features are illustrated.

[Figure 3. Associative Processor Storing a Personnel File. The control unit holds the Data Register, Mask Register, and Match Indicator; each word of the file (fields: name, daily salary, employee number) has Word Select Register and Search Results Register bits; the Multiple Match Resolver and external processing logic act on the search results.]
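The Figure 3 search path can also be sketched in software. This model is illustrative only: the word layout (employee number in the low bits, daily salary in bits 8-15) and the sample salaries are our assumptions; the register roles come from the text.

```python
# Illustrative model of an associative equality search: Data Register,
# Mask Register, Word Select Register, Search Results Register,
# Match Indicator, and Multiple Match Resolver.

SALARY_SHIFT = 8          # assumed position of the daily-salary field
SALARY_MASK = 0xFF00      # masks in only that field

def associative_search(memory, data_reg, mask_reg, word_select):
    # Every enabled word is compared against the masked Data Register
    # "in parallel"; the comprehension only simulates that parallel compare.
    results = [bool(sel) and (word & mask_reg) == (data_reg & mask_reg)
               for word, sel in zip(memory, word_select)]
    match_indicator = sum(results)          # number of matches
    # Multiple Match Resolver: pointer to the topmost matched word.
    topmost = results.index(True) if match_indicator else None
    return results, match_indicator, topmost

# Eight-word personnel file: find all employees earning $35 per day.
salaries = [35, 40, 35, 20, 35, 50, 60, 35]
memory = [(s << SALARY_SHIFT) | emp for emp, s in enumerate(salaries)]
results, count, top = associative_search(
    memory, data_reg=35 << SALARY_SHIFT, mask_reg=SALARY_MASK,
    word_select=[1] * 8)    # entire file enabled
```

Clearing Word Select bits restricts the same search to a subset of the file, which is how activity control is exercised.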
`
`
`
`222
`
`.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`PRESENT
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Kenneth J. Thurber and Leon D. Wald
`
`
`
`
`
`
The State Transition Concept and Associative Processor Algorithms

In 1965, Fuller and Bird [35] described a bit-slice associative processor architecture. Fuller and Bird discussed their algorithms in terms of state transitions. To perform parallel processing of data in an associative processor, two basic operations are required. These are a content search and a "multiwrite" operation. The multiwrite operation consists of the ability to simultaneously write into selected bits of several selected words. With a content search and multiwrite capability, an associative processor can be used to process data in highly parallel fashion.

For example, consider the problem of adding field A to field B. This can be accomplished as a sequence of bit pair additions (Ai added to Bi), with suitable attention paid to the carry bit. If bit 1 is the least significant and bit n the most, Figure 4 indicates the state table for the bit-serial addition of A to B with the result accumulating into B. As shown, Bi (result) and Ci+1 differ from Bi (operand) and Ci in the four states 1, 3, 4, and 6. One possible addition method would be to search for these states and multiwrite the correct Bi and Ci+1 into the appropriate fields, i.e., set up a "tag field" designated as Ci, and set C1 = 0. Then operations would proceed from the least significant bit to the most significant bit of the pair of operands. At each bit position i, Ai, Bi, and Ci would be searched for state 1. Then Ci+1 = 0, Bi = 1 would be multiwritten into all matching words. After a search for state 3, Ci+1 = 1, Bi = 0 would be multiwritten into all matching words. This same procedure would be carried out for states 4 and 6. If a word is matched on any search, it would be removed from the list of active cells until operations moved from bit position i to i+1, so that writing the resultant Bi and the carry-in for bit position i+1 into the memory does not cause the data already processed to be reprocessed.

[Figure 4. Bit-Slice Addition State Transition Diagram. The states enumerate the combinations of Ai, Bi, and Ci, where Ci+1 is the carry into bit position i+1 from bit position i.]

Obviously, utilizing the sequential state transformation approach, any desired transformations can be processed in an associative processor with simple content search and multiwrite capability using very simple word logic.

Operating on the state transition concept, six general requirements of the architecture of an associative processor were deduced. These are:

1) The basic processing primitive is the identification of all words meeting some binary state variable configuration;
2) No explicit addressing is necessary, since processing can be performed simultaneously over all data sets of interest;
3) Data is processed "column serially" to limit the number of state transformations to be remembered;
4) Tag bits (temporary bit-slice storage) are very useful (for example, the carry bit);
5) A search results/word select register and multiword write capability are desirable; and
6) Stored program control of the associative processor is desirable.
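The addition procedure above can be simulated with nothing but the two primitives. The sketch below is an illustration (the data layout and names are ours); numbering each state by the binary value of (Ai, Bi, Ci) makes the four states in which Bi or the carry must be rewritten come out to 1, 3, 4, and 6, as in the text.

```python
# Bit-serial field addition B := B + A by content search and multiwrite.
# Each word's A and B fields are bit lists (index 0 = least significant);
# the carry tag C is one bit per word.

# state -> (A_i, B_i, C_i) pattern searched, and (new B_i, new carry) written
TRANSITIONS = {
    1: ((0, 0, 1), (1, 0)),
    3: ((0, 1, 1), (0, 1)),
    4: ((1, 0, 0), (1, 0)),
    6: ((1, 1, 0), (0, 1)),
}

def add_fields(A, B, n_bits):
    carry = [0] * len(A)                     # tag field, C1 = 0
    for i in range(n_bits):                  # least to most significant bit
        active = set(range(len(A)))          # cells still active at this bit
        for pattern, (new_b, new_c) in TRANSITIONS.values():
            # content search over the active words, "in parallel"
            matched = [w for w in active
                       if (A[w][i], B[w][i], carry[w]) == pattern]
            for w in matched:                # multiwrite into all matches
                B[w][i], carry[w] = new_b, new_c
            # matched words leave the active list until the next bit slice,
            # so a freshly written state is not matched again at bit i
            active -= set(matched)
    return B
```

The deactivation step mirrors the paper's removal of matched words: writing state 4's result (Bi = 1, carry 0) produces the bit pattern of state 6, and without deactivation the state 6 search would wrongly reprocess it.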
`
`
`
`
`
`
Fuller and Bird recognized an important correlation between the way in which data is stored in the processor's memory and its processing speed. As examples of this phenomenon, they described both word per cell (W/C) and bit per cell (B/C) data organizations for their associative processor. Figures 5 and 6 show the W/C and B/C storage organizations for nine four-bit words X11, ..., X33, respectively. The word per cell organization has become an industry standard, and it is, in fact, the way most associative processors store data. The desire to perform bit-slice processing on many data sets simultaneously without much interword communication led to the widespread use of this type of organization. W/C is very efficient if a large number of cells are involved;
`
`
`
`
`
`
operands can be stored in the same cell, and most operands are being processed by each bit-slice command. For example, in Figure 5, this type of organization would be very efficient if,

[Figure 5. Word Per Cell Organization. Each cell holds one complete word; bits 0 to 3 of word X13 lie in a single cell.]
`
`
`
`
`