"ZISC" Zero Instruction Set Computer Architecture Copyright (C) 1989 by Kevin Dowd Introduction: ZISC is a general purpose computer architecture unique for its lack of a cpu or instruction set. Computation is distributed to address mapped functional units which pair off and share data under the direction of a simple sequencer. Customary traits of other general purpose architectures; branching, interrupt handling (and multitasking), are present. In addition, ZISC features on the fly configurability to the task at hand. There is no single ZISC configuration. A requisite minimal set of functional units provide for branching, interrupt handling, etc., but the remainder can be chosen as desired. This means that adders, multipliers, transcendental functions, etc., can be added when an important application needs more performance. The problem of designing a computer without an instruction set seemed interesting. We took the concept to one possible conclusion and found that there are benefits to a processor without an instruction set. This paper first describes our architecture and its capabilities and then weighs the benefits and costs relative to a microprocessor solution. Motivation: A typical cpu design incorporates the register storage and functional units necessary to implement the instruction set. Control hardware oversees instruction fetch, decode and execution. External to the processor are memory and I/O devices, possibly addressed by a single address bus and sharing data with the processor via a data bus. Each instruction is processed in stages: Fetch, Decode, Operand Fetch, Execute and Operand Writeback. If the operands are not in registers already, they have to be loaded, either explicitly or implicitly, depending on the architecture. Consider for illustration the process of summing of two integers which are register resident: ADD RA,RB,RC o fetch instruction o decode instruction o get register A and register B o perform an addition o write result back to register C With respect to ZISC, a whole instruction cycle has been wasted. Had A and B been pre-associated with the adder, the addition could have been implicit. This would have reduced the process to 4 cycles. o fetch instruction o decode instruction o get register A and register B o write the result back to register C Furthermore, if operations can be made implicit as a function of the registers chosen, then you can eliminate a large portion of the instruction set. All operations can be accomplished by choosing the correct registers as the destinations of data moves. And if all functional units, registers and data memory can be mapped into the machine address space, then you can simplify the model further, and make all operations implicit, based upon location. This would have the effect of eliminating the instruction set altogether. Imagine what happens if every component of the computer is memory-mapped. Moving data from memory to registers or from memory to functional units, etc., requires two addresses be generated: one for the source and one for the destination. If you were restricted to a single address bus, it would take two address cycles to move a value from one place to another. But since both the source and destination are known, there isn't a compelling reason to wait two cycles. The source and destination addresses can be accessed simultaneously using two address busses. The net result is a processor of increased efficiency and performance. left address space right address space --- mem --- mem --- ... 
[Figure: the fundamental ZISC structure. A "left" address space and a "right" address space, each populated with memory and functional units (FU), share a common data bus. A left address bus and a right address bus are driven each cycle by address pairs fetched from program memory.]

This diagram illustrates the fundamental structure of the ZISC architecture. It consists of two address busses and a data bus. Every cycle, an "address pair" is presented to the "Left" and "Right" address busses and the chosen functional units are mated.

All functions are address-mapped, including pedestrian units like adders, multipliers, etc., and also hardware associated with the operation of the machine itself, like the program counters, the Program Status Word, etc. Each functional unit occupies a number of addresses dictated by the complement of operands taken in and results returned. For example, consider an integer adder, requiring four addresses:

 ______________________________________________
| operand 1 | operand 2 |    sum    |  carry   |
|___________|___________|___________|__________|
addr:   A        A+1         A+2        A+3

Addition is performed by mating the adder with its operands. One bit of the address pair has been borrowed in order to designate the direction of the transfer. Here, represented for ZISC, is the addition discussed in the examples above:

ins ~     left address bus       right address bus      dir
_________|______________________|______________________|_______
 1        mem addr of operand1   A   (operand1)          ->
 2        mem addr of operand2   A+1 (operand2)          ->
 3        address for result     A+2 (sum)               <-
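The adder can be rendered in the same sketch style (hypothetical C, for illustration only): the operand ports hold state, and the act of reading the sum or carry port performs the addition implicitly. No add instruction exists anywhere.

/* Address-mapped integer adder, offsets A..A+3. */
#include <stdint.h>
#include <stdio.h>

enum { A_OP1, A_OP2, A_SUM, A_CARRY };   /* offsets from base A */

struct adder { uint32_t op1, op2; };

static void adder_write(struct adder *a, int off, uint32_t v)
{
    if (off == A_OP1) a->op1 = v;
    if (off == A_OP2) a->op2 = v;
}

static uint32_t adder_read(const struct adder *a, int off)
{
    uint64_t wide = (uint64_t)a->op1 + a->op2;
    switch (off) {
    case A_OP1:   return a->op1;   /* operand ports are readable */
    case A_OP2:   return a->op2;   /* too, for the context save  */
                                   /* described below            */
    case A_SUM:   return (uint32_t)wide;         /* implicit add */
    case A_CARRY: return (uint32_t)(wide >> 32);
    }
    return 0;
}

int main(void)
{
    struct adder a = {0};
    adder_write(&a, A_OP1, 40);                      /* ins 1 */
    adder_write(&a, A_OP2, 2);                       /* ins 2 */
    printf("%u\n", (unsigned)adder_read(&a, A_SUM)); /* ins 3: 42 */
    return 0;
}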
Functional units may reside on one or both of the two address busses. Multiportedness is not implied, since there is still a single data bus. However, two sets of address decode logic may service each grouping of functional units.

The sequencer, or program memory system, presents fixed length address pairs at regular intervals to the two address busses. It operates continuously, at full bandwidth, and is completely independent of the data memory. A multibank scheme allows for zero delay branching; that is, there is no pause in processing when a branch occurs. Address pairs are fetched from the program memory under the direction of several program counters, one of which will be active, depending on the machine state. A pipelined fetch and address scheme could be used for greater performance.

Several multibanked, non-cached data memory systems have been considered. A data cache is very attractive because of its easy applicability, and is currently the memory interface of choice. The ZISC design is multitasking and general purpose.

=/=

Functional Units

The layout and use of a ZISC integer adder was detailed above. Several more functional units will be considered here, from a virtually unlimited set of possibilities. Those which are used in direct computation of results have been loosely termed "external" functional units. Those whose operations direct the computer (i.e. the program counter) are called "internal".

The input operands of all external functional units are readable as well as writable. This is necessary in order to be able to preserve state in case of a context switch. It yields an interesting side effect: the more functional units a given ZISC configuration contains, the more expensive a context switch becomes.

External Functional Units

Registers: There are actually few registers, since operations can proceed cache-to-functional-unit and back again. Registers must be readable and writable in order that context may be saved. Since all functional unit input operands share that property, they can be borrowed for temporary storage as well.

 ____________
|  register  |
|____________|
addr:   R

Index Registers: Index registers have a number of possibilities in this architecture, including numerical use. Like any other functional unit, there may be an arbitrary number of them.

 _______________________
| i-registr | increment |
|___________|___________|
addr:   I        I+1

A value is written into address "i-registr" and a constant increment (which can be negative) can be written to "increment". Each time "i-registr" is read, the value is bumped: i-registr := i-registr + increment. If "increment" is 0, this becomes a general purpose register. The most obvious use is as a memory reference index register. Other uses include multiplication of short integers by small numbers.
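A minimal C sketch of the unit, assuming that a read returns the current value and that the bump happens afterward (the convention that makes the countdown in Example 1, later in this paper, work out):

/* Index register at I, I+1: every read of the value port bumps
 * the stored value by the increment. */
#include <stdio.h>

struct ireg { int value, increment; };

static int ireg_read(struct ireg *r)
{
    int v = r->value;
    r->value += r->increment;   /* bump on every read           */
    return v;                   /* increment of 0 -> plain reg  */
}

int main(void)
{
    struct ireg idx = { .value = 999, .increment = -1 };
    printf("%d %d %d\n",
           ireg_read(&idx), ireg_read(&idx), ireg_read(&idx));
    /* prints: 999 998 997 */
    return 0;
}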
Logical operations:

logical:
 _________________________________________________________________
| operand 1 | operand 2 | op1 && op2 | op1 || op2 | op1 >> 1 | ...
|___________|___________|____________|____________|__________|____
addr:   L        L+1          L+2          L+3         L+4

Misc operations: ebcdic -> ascii (for example)

 ________________________
| ebcdic in | ascii out  |
|___________|____________|
addr:   E        E+1

Comparators: These functional units serve both as "select" units and for generating targets for branches.

relop: (>, =, etc.)
 __________________________________________________________________
| operand 1 | operand 2 | target1 in | target2 in | target out | ...
|___________|___________|____________|____________|____________|____
addr:   C        C+1          C+2          C+3          C+4

Comparators can be used for selecting one of two target values based on the result of the test. If the test is true, "target1 in" will appear at "target out"; otherwise "target2 in" will appear. The functional unit is also employed for generating branch targets, in which case the values in "target1" and "target2" will be program addresses. The address presented at "target out" is imposed on the program counter by a subsequent instruction.
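A minimal C sketch of the select behavior, assuming a ">=" relop and the true-selects-target1 convention just described:

/* Comparator at C..C+4: the test picks which stored target
 * appears at the "target out" port. */
#include <stdio.h>

struct cmp_ge { int op1, op2, target1, target2; };

static int cmp_target_out(const struct cmp_ge *c)
{
    /* test true -> target1, test false -> target2 */
    return (c->op1 >= c->op2) ? c->target1 : c->target2;
}

int main(void)
{
    struct cmp_ge c = { .op1 = 0, .op2 = 5,
                        .target1 = -1,   /* ignored by PC stuff  */
                        .target2 = 7 };  /* branch-back address  */
    printf("%d\n", cmp_target_out(&c));  /* 0 >= 5 false: prints 7 */
    return 0;
}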
Floating Point Functions: To date, our prototypes have not employed any floating point hardware. Generally, such operations will take a number of cycles to complete and will be pipelined. ZISC need not wait for completion of a single floating point calculation, but may focus activity elsewhere and return when the result is ready. Where considerable parallelism exists, these functions are bandwidth limited.

Internal Functional Units

There is a minimal fixed set of internal functional units that must be present:

Program Counters (2)
Program Counter "Stuffs" (2)
Indirect Memory Address "Stuff"
Data and Instruction base registers
PSW

Program Counters:

[Figure: the program counter system. Two program memory banks (Mem Bk #1 and Mem Bk #2) are driven through a latch by one of two program counters, "PC system" and "PC normal", each selected by an enable line. Each PC is loaded from a "PC stuff" unit attached to the data and address busses; a base register feeds an adder on the "PC normal" path.]

The program memory for the prototype is made of two banks. While one bank is driving the address busses, the other is being accessed. Branching occurs without a processing delay. Since the address pair being presented lags the program counter by one cycle, there exists a one instruction window after branch initiation in which to do something which may be of benefit to either of the two possible branch targets, depending on which way the branch is likely to go at runtime.

The program memory always operates at full bandwidth. Our first design includes fast static RAM. A full scale design would require some thought about an instruction memory cache backed by a larger, slower memory store.

In use, the program counter appears to be held steady as each of the two banks is presented, in turn, on the address busses. This organization requires that instructions travel in pairs, and that branches be initiated to the first instruction of a pair, or "first half". For some operations, the two-bank instruction memory scheme necessitates that the two halves of the instruction be non-symmetric.

Example 1: Program Counter Stuff functional unit.

In order to branch, the PC must be writable (there are in fact two PCs, one for "system" mode and one for "normal" mode), and in order to save context it must be readable. To jump, the "PC stuff" is paired with a functional unit or memory containing the target address.

The "PC stuff" will ignore negative addresses, and branch to positive ones. This feature facilitates use of the comparator detailed above as a source for branch targets by easing the number of values that need to be supplied. A negative value presented to the "PC stuff" by the comparator will result in a fallthrough without change of program counter.

The "PC stuff" has a soft restriction in that a target can only be stuffed during the first half of the instruction cycle. The branch target will not take effect until the beginning of the next whole cycle (the clock tick after next). This means that after the jump is initiated, there is a place for an instruction whose result may be available for either the fallthrough or the branch, depending on whether the branch was taken. Reading from the PC stuff returns the current value of the Program Counter.

Example 2: Indirect Stuff

Consider data memory access. On the Left address bus might appear the memory address of the data item, and on the right, the address of the functional unit where it is to be utilized.

ins ~     left address bus       right address bus      dir
_________|______________________|______________________|_______
 1        memory addr of value   address of FU           ->

Assume now that the item loaded from memory was not the desired data, but rather a pointer to the data. How on Earth could this design dereference a pointer? This is a problem, since the program and data memories are separate, and the contents of the program memory are being shelled out at full bandwidth all the time; i.e., there is no bandwidth available with which the program memory can modify what it is going to present to the address busses in the future. (Actually, it is possible, though expensive, through I/O.)

What is needed is a method for forcing the value of the pointer onto one of the address busses at some fixed, predictable time later on. This functional unit is called the "Indirect Stuff". It overrides the Left Address bus one cycle in the future. It operates a pipeline two deep. The last value stuffed and the state of the pipeline are readable and writable in order to be able to save context.

During the first instruction of use, the "Indirect Stuff" receives the value of the pointer. In the second, the data at that address is retrieved.

ins ~     left address bus        right address bus      dir
_________|_______________________|______________________|_______
 1        addr of indirect stuff  address of FU w/ptr     <-
 2        [indirect stuff (ptr)]  address of dest. FU     ->
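Folded into the transfer-cycle sketch from the beginning of the paper, the mechanism might look like the following. This is a minimal sketch only: the pipeline is one deep instead of two, and the direction bit is again omitted, so the pointer here is fetched from the left bus rather than the right.

/* "Indirect Stuff": a value written to the unit overrides the
 * Left address bus on the following cycle, which is how a pointer
 * fetched from data memory can itself be used as an address. */
#include <stdint.h>
#include <stdio.h>

#define SPACE 256
#define STUFF 200                 /* address of the unit          */
static uint32_t space[SPACE];
static uint32_t pending;          /* one deep for brevity; the    */
static int      pending_valid;    /* real unit is two deep        */

static void cycle(uint16_t left, uint16_t right)
{
    if (pending_valid) {          /* override left this cycle     */
        left = (uint16_t)pending;
        pending_valid = 0;
    }
    uint32_t v = space[left];
    if (right == STUFF) {         /* writing the unit arms it     */
        pending = v;
        pending_valid = 1;
    } else {
        space[right] = v;
    }
}

int main(void)
{
    space[10] = 30;               /* a pointer ...                */
    space[30] = 99;               /* ... to the actual datum      */
    cycle(10, STUFF);             /* ins 1: load the pointer      */
    cycle(0, 50);                 /* ins 2: left bus forced to 30;
                                     the datum lands at 50        */
    printf("%u\n", (unsigned)space[50]);   /* prints 99 */
    return 0;
}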
Example 3: Base registers

Part of a process's context is a pair of base registers, one for data memory addressing and one for program memory addressing. When in "normal" mode, the values in these base registers are added to memory references and branch targets automatically. This allows programs which were compiled for absolute addresses to run concurrently.

Example 4: PSW

The prototype has a primitive interrupt mechanism. Three lines are available: INT1, INT2 and SWI. Activation of any of the above places the machine in "system" mode immediately following the completion of the second half of the current instruction cycle, provided the interrupt occurred before the first half completed. In addition there is a SLEEP line for halting the processor internally or externally, and an external WAKE line.

Reading the PSW in "system" mode returns the information needed to determine which line caused the interrupt (there are no vectors). Writing resets lines. Any access to the PSW in "normal" mode causes an SWI. This is the mechanism by which system calls are made.

There are a number of improvements which could be made in the handling of interrupts. The fastest response would be obtained with a set of prioritized program counters (in lieu of vectors), the highest overriding all the lesser.

===========

I/O system:

I/O is all performed via a simple channel protocol involving the trading of interrupts and the setting of reserved status registers. The I/O processor is intended to be a microcomputer with a set of peripherals already attached, such as a PC or clone. It may seem ironic that an I/O channel protocol has been implemented in a machine where all cpu functions are memory-mapped. We have a couple of reasons: First, PCs have a rich set of peripherals and controllers already available. Secondly, some DMA-like protocol was required, since the program memory was not designed with the spare bandwidth needed to supervise its own downloading.

To initiate I/O, the processor sets the status registers, toggles the interrupt line (for the PC), and puts itself to sleep. The PC grabs the status information and wakes the processor back up. When the I/O is complete, the PC toggles the processor's interrupt line, the processor finds a convenient spot and goes to sleep, and the PC completes the transfer.

Performance:

Absolute performance is a function of more than architectural efficiency. However, talking about instruction latency as the number of clock ticks required for completion normalizes the speed of two processors so that they can be compared. Certainly, with semiconductor features approaching the sub-micron range, an external bus oriented architecture will not be able to compete in terms of raw speed with a complete single chip processor. However, the next few years should see a number of optical backplane designs, capable of withstanding nearly unlimited loading while maintaining a very high bandwidth.

The RISC/CISC argument has been made repeatedly, so it makes sense to describe ZISC's relative performance in terms of a RISC architecture. Since the arrangement of a given ZISC implementation is arbitrary, and since this paper advocates ZISC, we'll choose to make comparisons employing the optimal selection of functional units.

Some Examples:

Here are several narrated examples of code fragments mapped onto a ZISC configuration. For comparison, each is also optimally coded for a SPARC machine, following the SPARC architecture.

Example 1 - null loop body.

In C:

for (i=0; i<1000; i++);

In SPARC assembler code:

        mov  0x0,%o0
L1:     cmp  %o0,0x3e7
        blt  L1
        add  %o0,0x1,%o0
L2:     ...

Register o0 is zeroed for use as a loop index. When the limit is reached, control is passed to the instruction at L2.

note: A SPARC machine will execute the instruction immediately following a branch instruction regardless of which way the branch goes (like ZISC). It is for that reason that the add instruction appears following the branch, making optimal use of the cycles.

On a ZISC configuration employing:

a) an index register

 _______________________
| i-registr | increment |
|___________|___________|
addr:   I        I+1

b) a comparator which tests for >=

relop: >=
 __________________________________________________________________
| operand 1 | operand 2 | target1 in | target2 in | target out | ...
|___________|___________|____________|____________|____________|____
addr:   C        C+1          C+2          C+3          C+4

ins ~     left address bus        right address bus      dir
_________|_______________________|______________________|_______
 1        mem addr const. 0x3e7   I   (i-registr)         ->
 2        mem addr const. -1      I+1 (increment)         ->
 3        mem addr const. 0       C   (comparator op1)    ->
 4        mem addr const. -1      C+2 (target 1)          ->
 5        mem addr const. (ins 7) C+3 (target 2)          ->
 6        I (i-registr)           C+1 (comparator op2)    ->
 7        C+4 (target out)        P   (prog. ctr)         ->
 8        I (i-registr)           C+1 (comparator op2)    ->
 9        .....

First the index register is loaded with the loop limit and an increment of -1. Recall that each time the value at 'I' is read, the increment is added to the register value. This will cause the index register to count down from 999 to 0. Next the comparator is set up: while the test 'operand 1' >= 'operand 2' (that is, 0 >= i-registr) is false, the address of instruction 7, loaded into 'target 2', will appear at 'target out' and the loop will repeat. When the test becomes true, the -1 loaded into 'target 1' will appear at 'target out' instead, causing a fallthrough to instruction 9. Recall that the PC Stuff will accept positive addresses and ignore negative ones.

Note that there is a considerable amount of set-up before the loop can be executed. It may be argued that all the steps detailed may not be needed every time, but for the sake of argument they will be counted here. How many cycles will it take for each architecture to complete 1000 iterations of an empty loop (including startup)?

SPARC requires 1 + 3000 = 3001 cycles.
ZISC requires 6 + 2000 = 2006 cycles.
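The ZISC count can be checked with a small C walk-through of the instruction stream (hypothetical code; the "ins" numbers refer to the table above, and the loop body executes as a 7-8 pair every iteration because the second half runs whether or not the branch is taken):

#include <stdio.h>

int main(void)
{
    int i_reg = 0x3e7;    /* ins 1: limit 999 into i-registr    */
    int incr  = -1;       /* ins 2: increment                   */
    int op1   = 0;        /* ins 3: comparator operand 1        */
                          /* ins 4, 5: targets -1 and "ins 7"   */
    int cycles = 6;       /* the six set-up instructions        */
    int op2 = i_reg; i_reg += incr;   /* ins 6: read bumps      */

    int looping = 1;
    while (looping) {
        cycles++;         /* ins 7: target out -> PC.  A true
                             test (0 >= op2) selects -1, which
                             the PC stuff ignores: fallthrough.
                             A false test selects ins 7.        */
        looping = !(op1 >= op2);
        cycles++;         /* ins 8, the second half of the pair:
                             read i-registr into operand 2      */
        op2 = i_reg; i_reg += incr;
    }
    printf("%d cycles\n", cycles);    /* prints 2006 cycles     */
    return 0;
}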
Example 2 - simple integer arithmetic through pointers

In C:

foo (a,b,c)
int *a, *b, *c;
{
    int x;
    x = *a + *b;
    if (x > *c)
        *c = x;
}

In SPARC assembler:

        ...
        ld   [%fp+0x0],%o0
        ld   [%o0],%o1
        ld   [%fp+0x4],%o0
        ld   [%o0],%o2
        add  %o1,%o2,%o3
        ld   [%fp+0x8],%o0
        ld   [%o0],%o4
        cmp  %o3,%o4
        bgt  L1
        nop
        b    L2
        nop
L1:     st   %o3,[%o0]
L2:     ...

The frame pointer is assumed to point to an argument list consisting of the addresses of a, b and c. Each load/store operation requires 2 cycles to complete.

On a ZISC configuration employing:

a) an index register

 _______________________
| i-registr | increment |
|___________|___________|
addr:   I        I+1

b) a comparator which tests for >=

relop: >=
 __________________________________________________________________
| operand 1 | operand 2 | target1 in | target2 in | target out | ...
|___________|___________|____________|____________|____________|____
addr:   C        C+1          C+2          C+3          C+4

c) one adder

 ____________________________________________________
| operand 1 | operand 2 |    sum     |   carry    | ...
|___________|___________|____________|____________|___
addr:   A        A+1          A+2          A+3

d) two g.p. registers

 ____________
|    reg     |
|____________|
addr:   R
 ____________
|    reg     |
|____________|
addr:   R+1

e) the indirect stuff (standard in any configuration)

ins ~     left address bus          right address bus      dir
_________|_________________________|______________________|_______
 1        mem addr const. 0x4       I+1 (i-reg incr)        ->
 2        fp (see below)            I   (i-register)        ->
 3        I (i-register)            indirect stuff          ->
 4        [indirect stuff *(*a)]    indirect stuff          ->
 5        [indirect stuff (*a)]     A   (adder1, op1)       ->
 6        I (i-register)            indirect stuff          ->
 7        [indirect stuff *(*b)]    indirect stuff          ->
 8        [indirect stuff (*b)]     A+1 (adder1, op2)       ->
 9        A+2 (adder1, sum)         C+2 (target 1)          ->
10        I (i-register)            indirect stuff          ->
11        [indirect stuff *(*c)]    R   (g.p. reg1)         ->
12        R (g.p. reg1)             indirect stuff          ->
13        [indirect stuff (*c)]     C+3 (target 2)          ->
14        C+3 (target 2)            R+1 (g.p. reg2)         ->
15        R+1 (g.p. reg2)           C+1 (comparator op2)    ->
16        A+2 (adder1, sum)         C   (comparator op1)    ->
17        R (g.p. reg1)             indirect stuff          ->
18        [indirect stuff (*c)]     C+4 (target out)        <-

The designation "fp" in the above example is made for clarity in comparison with the SPARC code fragment. It is assumed to be a general purpose register whose use is designated as a frame pointer.

The index register is employed to step through the argument list. The increment is loaded with the value 4, the presumption being that the addresses are separated by 4 bytes. Recall that each time the index register is read, its value is bumped. The pointer to c is temporarily squirreled away in reg1 so that it will be available later, should the value of *c need to be modified.

Notice the use of the "indirect stuff". The index register first supplies the address of the address of the desired variable, which is turned around through the indirect stuff to obtain the address of the variable, and finally the value of the variable itself.

No branch is made in this example; rather, the comparator is used to 'select' either the value of 'x' or the old value of '*c' for return to storage. Each architecture, ZISC and SPARC, requires 18 cycles to complete the operation and return the result to memory.
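In C terms, the comparator has turned the conditional store into a branchless select. A minimal sketch of the equivalent (the >= test stores the same value as the original > test when x equals *c):

#include <stdio.h>

static void foo(const int *a, const int *b, int *c)
{
    int x = *a + *b;
    /* comparator: op1 = x, op2 = *c, target1 = x, target2 = *c;
       the selected target is stored back unconditionally       */
    *c = (x >= *c) ? x : *c;
}

int main(void)
{
    int a = 3, b = 4, c = 5;
    foo(&a, &b, &c);
    printf("%d\n", c);   /* prints 7 */
    return 0;
}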
Example 3 - a configurable win.

Here is a code fragment where ZISC's configurability allows one configuration to drastically outperform a standard processor design: compute a^2 + 2ab + b^2, where a and b are halfword integers, on a ZISC configuration employing three multipliers and two adders. Assume that the integer multipliers are not pipelined and require 6 cycles to produce a product.

The configuration contains:

a) two adders

 ____________________________________________________
| operand 1 | operand 2 |    sum     |   carry    | ...
|___________|___________|____________|____________|___
addr:   A        A+1          A+2          A+3
 ____________________________________________________
| operand 1 | operand 2 |    sum     |   carry    | ...
|___________|___________|____________|____________|___
addr:   A+4      A+5          A+6          A+7

b) three halfword multipliers

 ________________________________________
| operand 1 | operand 2 |  product   | ...
|___________|___________|____________|___
addr:   M        M+1          M+2
 ________________________________________
| operand 1 | operand 2 |  product   |
|___________|___________|____________|___
addr:   M+3      M+4          M+5
 ________________________________________
| operand 1 | operand 2 |  product   |
|___________|___________|____________|___
addr:   M+6      M+7          M+8

ins ~     left address bus       right address bus      dir
_________|______________________|______________________|_______
 1        address of 'a'         M   (mult 1, op 1)      ->
 2        address of 'b'         M+1 (mult 1, op 2)      ->
 3        M   (mult 1, op 1)     M+3 (mult 2, op 1)      ->
 4        M   (mult 1, op 1)     M+4 (mult 2, op 2)      ->
 5        M+1 (mult 1, op 2)     M+6 (mult 3, op 1)      ->
 6        M+1 (mult 1, op 2)     M+7 (mult 3, op 2)      ->
 7        ----                   (nop)                   ----
 8        ----                   (nop)                   ----
 9        M+2 (product 1)        A   (adder 1, op 1)     ->
10        M+2 (product 1)        A+1 (adder 1, op 2)     ->
11        A+2 (sum 1)            A+4 (adder 2, op 1)     ->
12        M+5 (product 2)        A+5 (adder 2, op 2)     ->
13        A+6 (sum 2)            A   (adder 1, op 1)     ->
14        M+8 (product 3)        A+1 (adder 1, op 2)     ->

The total time to complete the operation is 14 cycles. Multiplier 1 computes ab while its operand ports are read back to feed a to multiplier 2 and b to multiplier 3, which compute a^2 and b^2. Adder 1 doubles ab, adder 2 folds in a^2, and adder 1 (free again) folds in b^2, leaving the final result readable at A+2.

Observations:

ZISC may be most aptly suited for special purpose, embedded applications. Because of its on-the-fly configurability, problems requiring additional horsepower of a given type (like multiplication in the above example) can benefit greatly. The ZISC abstraction can also be extended to more complex distributions, where the functional units represent processes and ZISC an executive.

The architecture is expensive in terms of the number of gates and supporting memory required. Each instruction, regardless of how simple, requires two full addresses.

Compilation for an arbitrary configuration would be non-trivial. Traditional front-ends could be used. At code generation time each atomic operation could be parcelled out, according to a configuration table, to the functional units. In the absence of the appropriate hardware, instructions could be recursively replaced with macros until simulated by the available functional resources. An operating system would be able to depend on the presence of a minimal set of functional units. Other applications could be stored and traded in intermediate language only, to be compiled for the target machine after reaching their destination.

What next?

A single ZISC configuration is memory and processor bandwidth limited, in a world where many applications exhibit a considerable amount of fine-grained parallelism. The next step would be to string multiple ZISCs together into a Very Long Instruction Word (VLIW) arrangement:

[Figure: several ZISCs side by side, each with its own data bus and its own stacks of functional units, their banks of address busses driven in parallel from a common source.]

Since little control hardware is required to make ZISC work in the first place, an extension to a VLIW configuration would only require that each bank of address busses be driven from the same program counter. Additionally, when desired, the ZISCs could break ranks and operate independently as multiprocessors, or re-rendezvous into a VLIW again. Memory access would be made through a write-through cache, and results in the stacks of functional units could be readable on either side. A new sort of functional unit, whose only purpose is to become a transparent data window, could be included for the purpose of moving data to stacks more than one hop away.
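To close, a sketch of the lockstep idea in the same illustrative C style, assuming two lanes and a shared program counter (lane count and all names are invented):

/* VLIW extension: one very long instruction is one address pair
 * per ZISC lane, all stepped by a single program counter. */
#include <stdint.h>
#include <stdio.h>

#define LANES 2
#define SPACE 64

struct pair { uint16_t left, right; };
static uint32_t space[LANES][SPACE];    /* one space per lane    */

static void vliw_cycle(const struct pair word[LANES])
{
    for (int l = 0; l < LANES; l++)     /* lockstep, shared PC   */
        space[l][word[l].right] = space[l][word[l].left];
}

int main(void)
{
    space[0][1] = 11;
    space[1][2] = 22;
    struct pair word[LANES] = { {1, 5}, {2, 6} };
    vliw_cycle(word);                   /* both moves, one tick  */
    printf("%u %u\n", (unsigned)space[0][5],
                      (unsigned)space[1][6]);   /* prints 11 22  */
    return 0;
}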