"ZISC" Zero Instruction Set Computer Architecture Copyright (C) 1989 by Kevin Dowd Introduction: ZISC is a general purpose computer architecture unique for its lack of a cpu or instruction set. Computation is distributed to address mapped functional units which pair off and share data under the direction of a simple sequencer. Customary traits of other general purpose architectures; branching, interrupt handling (and multitasking), are present. In addition, ZISC features on the fly configurability to the task at hand. There is no single ZISC configuration. A requisite minimal set of functional units provide for branching, interrupt handling, etc., but the remainder can be chosen as desired. This means that adders, multipliers, transcendental functions, etc., can be added when an important application needs more performance. The problem of designing a computer without an instruction set seemed interesting. We took the concept to one possible conclusion and found that there are benefits to a processor without an instruction set. This paper first describes our architecture and its capabilities and then weighs the benefits and costs relative to a microprocessor solution. Motivation: A typical cpu design incorporates the register storage and functional units necessary to implement the instruction set. Control hardware oversees instruction fetch, decode and execution. External to the processor are memory and I/O devices, possibly addressed by a single address bus and sharing data with the processor via a data bus. Each instruction is processed in stages: Fetch, Decode, Operand Fetch, Execute and Operand Writeback. If the operands are not in registers already, they have to be loaded, either explicitly or implicitly, depending on the architecture. Consider for illustration the process of summing of two integers which are register resident: ADD RA,RB,RC o fetch instruction o decode instruction o get register A and register B o perform an addition o write result back to register C With respect to ZISC, a whole instruction cycle has been wasted. Had A and B been pre-associated with the adder, the addition could have been implicit. This would have reduced the process to 4 cycles. o fetch instruction o decode instruction o get register A and register B o write the result back to register C Furthermore, if operations can be made implicit as a function of the registers chosen, then you can eliminate a large portion of the instruction set. All operations can be accomplished by choosing the correct registers as the destinations of data moves. And if all functional units, registers and data memory can be mapped into the machine address space, then you can simplify the model further, and make all operations implicit, based upon location. This would have the effect of eliminating the instruction set altogether. Imagine what happens if every component of the computer is memory-mapped. Moving data from memory to registers or from memory to functional units, etc., requires two addresses be generated: one for the source and one for the destination. If you were restricted to a single address bus, it would take two address cycles to move a value from one place to another. But since both the source and destination are known, there isn't a compelling reason to wait two cycles. The source and destination addresses can be accessed simultaneously using two address busses. The net result is a processor of increased efficiency and performance. left address space right address space --- mem --- mem --- ... 
[Figure: the fundamental ZISC structure. A "left" address space and a "right" address space, each populated with memory and functional units (FU), share a common data bus. A left address bus and a right address bus are driven each cycle by address pairs fetched from program memory.]

This diagram illustrates the fundamental structure of the ZISC architecture. It consists of two address busses and a data bus. Every cycle, an "address pair" is presented to the "Left" and "Right" address busses and the chosen functional units are mated.

All functions are address-mapped, including pedestrian units like adders, multipliers, etc., and also hardware associated with the operation of the machine itself, like the program counters, the Program Status Word, etc. Each functional unit occupies a number of addresses dictated by the complement of operands taken in and results returned. For example, consider an integer adder, requiring four addresses:

 ______________________________________________
| operand 1 | operand 2 |    sum    |  carry   |
|___________|___________|___________|__________|
addr:   A        A+1         A+2        A+3

Addition is performed by mating the adder with its operands. One bit of the address pair has been borrowed in order to designate the direction of the transfer. Here, represented for ZISC, is the addition discussed in the examples above:

ins ~     left address bus       right address bus      dir
_________|______________________|______________________|_______
 1        mem addr of operand1   A   (operand1)          ->
 2        mem addr of operand2   A+1 (operand2)          ->
 3        address for result     A+2 (sum)               <-
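The adder can be rendered in the same sketch style (hypothetical C, for illustration only): the operand ports hold state, and the act of reading the sum or carry port performs the addition implicitly. No add instruction exists anywhere.

/* Address-mapped integer adder, offsets A..A+3. */
#include <stdint.h>
#include <stdio.h>

enum { A_OP1, A_OP2, A_SUM, A_CARRY };   /* offsets from base A */

struct adder { uint32_t op1, op2; };

static void adder_write(struct adder *a, int off, uint32_t v)
{
    if (off == A_OP1) a->op1 = v;
    if (off == A_OP2) a->op2 = v;
}

static uint32_t adder_read(const struct adder *a, int off)
{
    uint64_t wide = (uint64_t)a->op1 + a->op2;
    switch (off) {
    case A_OP1:   return a->op1;   /* operand ports are readable */
    case A_OP2:   return a->op2;   /* too, for the context save  */
                                   /* described below            */
    case A_SUM:   return (uint32_t)wide;         /* implicit add */
    case A_CARRY: return (uint32_t)(wide >> 32);
    }
    return 0;
}

int main(void)
{
    struct adder a = {0};
    adder_write(&a, A_OP1, 40);                      /* ins 1 */
    adder_write(&a, A_OP2, 2);                       /* ins 2 */
    printf("%u\n", (unsigned)adder_read(&a, A_SUM)); /* ins 3: 42 */
    return 0;
}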
Functional units may reside on one or both of the two address busses. Multiportedness is not implied, since there is still a single data bus. However, two sets of address decode logic may service each grouping of functional units.

The sequencer, or program memory system, presents fixed length address pairs at regular intervals to the two address busses. It operates continuously, at full bandwidth, and is completely independent of the data memory. A multibank scheme allows for zero delay branching; that is, there is no pause in processing when a branch occurs. Address pairs are fetched from the program memory under the direction of several program counters, one of which will be active, depending on the machine state. A pipelined fetch and address scheme could be used for greater performance.

Several multibanked, non-cached data memory systems have been considered. A data cache is very attractive because of its easy applicability, and is currently the memory interface of choice. The ZISC design is multitasking and general purpose.

=/=

Functional Units

The layout and use of a ZISC integer adder was detailed above. Several more functional units will be considered here, from a virtually unlimited set of possibilities. Those which are used in direct computation of results have been loosely termed "external" functional units. Those whose operations direct the computer (i.e. the program counter) are called "internal".

The input operands of all external functional units are readable as well as writable. This is necessary in order to be able to preserve state in case of a context switch. It yields an interesting side effect: the more functional units a given ZISC configuration contains, the more expensive a context switch becomes.

External Functional Units

Registers: There are actually few registers, since operations can proceed cache-to-functional-unit and back again. Registers must be readable and writable in order that context may be saved. Since all functional unit input operands share that property, they can be borrowed for temporary storage as well.

 ____________
|  register  |
|____________|
addr:   R

Index Registers: Index registers have a number of possibilities in this architecture, including numerical use. Like any other functional unit, there may be an arbitrary number of them.

 _______________________
| i-registr | increment |
|___________|___________|
addr:   I        I+1

A value is written into address "i-registr" and a constant increment (which can be negative) can be written to "increment". Each time "i-registr" is read, the value is bumped: i-registr := i-registr + increment. If "increment" is 0, this becomes a general purpose register. The most obvious use is as a memory reference index register. Other uses include multiplication of short integers by small numbers.
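A minimal C sketch of the unit, assuming that a read returns the current value and that the bump happens afterward (the convention that makes the countdown in Example 1, later in this paper, work out):

/* Index register at I, I+1: every read of the value port bumps
 * the stored value by the increment. */
#include <stdio.h>

struct ireg { int value, increment; };

static int ireg_read(struct ireg *r)
{
    int v = r->value;
    r->value += r->increment;   /* bump on every read           */
    return v;                   /* increment of 0 -> plain reg  */
}

int main(void)
{
    struct ireg idx = { .value = 999, .increment = -1 };
    printf("%d %d %d\n",
           ireg_read(&idx), ireg_read(&idx), ireg_read(&idx));
    /* prints: 999 998 997 */
    return 0;
}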
Logical operations:

logical:
 _________________________________________________________________
| operand 1 | operand 2 | op1 && op2 | op1 || op2 | op1 >> 1 | ...
|___________|___________|____________|____________|__________|____
addr:   L        L+1          L+2          L+3         L+4

Misc operations: ebcdic -> ascii (for example)

 ________________________
| ebcdic in | ascii out  |
|___________|____________|
addr:   E        E+1

Comparators: These functional units serve both as "select" units and for generating targets for branches.

relop: (>, =, etc.)
 __________________________________________________________________
| operand 1 | operand 2 | target1 in | target2 in | target out | ...
|___________|___________|____________|____________|____________|____
addr:   C        C+1          C+2          C+3          C+4

Comparators can be used for selecting one of two target values based on the result of the test. If the test is true, "target1 in" will appear at "target out"; otherwise "target2 in" will appear. The functional unit is also employed for generating branch targets, in which case the values in "target1" and "target2" will be program addresses. The address presented at "target out" is imposed on the program counter by a subsequent instruction.
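A minimal C sketch of the select behavior, assuming a ">=" relop and the true-selects-target1 convention just described:

/* Comparator at C..C+4: the test picks which stored target
 * appears at the "target out" port. */
#include <stdio.h>

struct cmp_ge { int op1, op2, target1, target2; };

static int cmp_target_out(const struct cmp_ge *c)
{
    /* test true -> target1, test false -> target2 */
    return (c->op1 >= c->op2) ? c->target1 : c->target2;
}

int main(void)
{
    struct cmp_ge c = { .op1 = 0, .op2 = 5,
                        .target1 = -1,   /* ignored by PC stuff  */
                        .target2 = 7 };  /* branch-back address  */
    printf("%d\n", cmp_target_out(&c));  /* 0 >= 5 false: prints 7 */
    return 0;
}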
Floating Point Functions: To date, our prototypes have not employed any floating point hardware. Generally, such operations will take a number of cycles to complete and will be pipelined. ZISC need not wait for completion of a single floating point calculation, but may focus activity elsewhere and return when the result is ready. Where considerable parallelism exists, these functions are bandwidth limited.

Internal Functional Units

There is a minimal fixed set of internal functional units that must be present:

Program Counters (2)
Program Counter "Stuffs" (2)
Indirect Memory Address "Stuff"
Data and Instruction base registers
PSW

Program Counters:

[Figure: the program counter system. Two program memory banks (Mem Bk #1 and Mem Bk #2) are driven through a latch by one of two program counters, "PC system" and "PC normal", each selected by an enable line. Each PC is loaded from a "PC stuff" unit attached to the data and address busses; a base register feeds an adder on the "PC normal" path.]

The program memory for the prototype is made of two banks. While one bank is driving the address busses, the other is being accessed. Branching occurs without a processing delay. Since the address pair being presented lags the program counter by one cycle, there exists a one instruction window after branch initiation in which to do something which may be of benefit to either of the two possible branch targets, depending on which way the branch is likely to go at runtime.

The program memory always operates at full bandwidth. Our first design includes fast static RAM. A full scale design would require some thought about an instruction memory cache backed by a larger, slower memory store.

In use, the program counter appears to be held steady as each of the two banks is presented, in turn, on the address busses. This organization requires that instructions travel in pairs, and that branches be initiated to the first instruction of a pair, or "first half". For some operations, the two-bank instruction memory scheme necessitates that the two halves of the instruction be non-symmetric.

Example 1: Program Counter Stuff functional unit.

In order to branch, the PC must be writable (there are in fact two PCs, one for "system" mode and one for "normal" mode), and in order to save context it must be readable. To jump, the "PC stuff" is paired with a functional unit or memory containing the target address.

The "PC stuff" will ignore negative addresses, and branch to positive ones. This feature facilitates use of the comparator detailed above as a source for branch targets by easing the number of values that need to be supplied. A negative value presented to the "PC stuff" by the comparator will result in a fallthrough without change of program counter.

The "PC stuff" has a soft restriction in that a target can only be stuffed during the first half of the instruction cycle. The branch target will not take effect until the beginning of the next whole cycle (the clock tick after next). This means that after the jump is initiated, there is a place for an instruction whose result may be available for either the fallthrough or the branch, depending on whether the branch was taken. Reading from the PC stuff returns the current value of the Program Counter.

Example 2: Indirect Stuff

Consider data memory access. On the Left address bus might appear the memory address of the data item, and on the right, the address of the functional unit where it is to be utilized.

ins ~     left address bus       right address bus      dir
_________|______________________|______________________|_______
 1        memory addr of value   address of FU           ->

Assume now that the item loaded from memory was not the desired data, but rather a pointer to the data. How on Earth could this design dereference a pointer? This is a problem, since the program and data memories are separate, and the contents of the program memory are being shelled out at full bandwidth all the time; i.e., there is no bandwidth available with which the program memory can modify what it is going to present to the address busses in the future. (Actually, it is possible, though expensive, through I/O.)

What is needed is a method for forcing the value of the pointer onto one of the address busses at some fixed, predictable time later on. This functional unit is called the "Indirect Stuff". It overrides the Left Address bus one cycle in the future. It operates a pipeline two deep. The last value stuffed and the state of the pipeline are readable and writable in order to be able to save context.

During the first instruction of use, the "Indirect Stuff" receives the value of the pointer. In the second, the data at that address is retrieved.

ins ~     left address bus        right address bus      dir
_________|_______________________|______________________|_______
 1        addr of indirect stuff  address of FU w/ptr     <-
 2        [indirect stuff (ptr)]  address of dest. FU     ->
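Folded into the transfer-cycle sketch from the beginning of the paper, the mechanism might look like the following. This is a minimal sketch only: the pipeline is one deep instead of two, and the direction bit is again omitted, so the pointer here is fetched from the left bus rather than the right.

/* "Indirect Stuff": a value written to the unit overrides the
 * Left address bus on the following cycle, which is how a pointer
 * fetched from data memory can itself be used as an address. */
#include <stdint.h>
#include <stdio.h>

#define SPACE 256
#define STUFF 200                 /* address of the unit          */
static uint32_t space[SPACE];
static uint32_t pending;          /* one deep for brevity; the    */
static int      pending_valid;    /* real unit is two deep        */

static void cycle(uint16_t left, uint16_t right)
{
    if (pending_valid) {          /* override left this cycle     */
        left = (uint16_t)pending;
        pending_valid = 0;
    }
    uint32_t v = space[left];
    if (right == STUFF) {         /* writing the unit arms it     */
        pending = v;
        pending_valid = 1;
    } else {
        space[right] = v;
    }
}

int main(void)
{
    space[10] = 30;               /* a pointer ...                */
    space[30] = 99;               /* ... to the actual datum      */
    cycle(10, STUFF);             /* ins 1: load the pointer      */
    cycle(0, 50);                 /* ins 2: left bus forced to 30;
                                     the datum lands at 50        */
    printf("%u\n", (unsigned)space[50]);   /* prints 99 */
    return 0;
}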
Example 3: Base registers

Part of a process's context is a pair of base registers, one for data memory addressing and one for program memory addressing. When in "normal" mode, the values in these base registers are added to memory references and branch targets automatically. This allows programs which were compiled for absolute addresses to run concurrently.

Example 4: PSW

The prototype has a primitive interrupt mechanism. Three lines are available: INT1, INT2 and SWI. Activation of any of the above places the machine in "system" mode immediately following the completion of the second half of the current instruction cycle, provided the interrupt occurred before the first half completed. In addition there is a SLEEP line for halting the processor internally or externally, and an external WAKE line.

Reading the PSW in "system" mode returns the information needed to determine which line caused the interrupt (there are no vectors). Writing resets lines. Any access to the PSW in "normal" mode causes an SWI. This is the mechanism by which system calls are made.

There are a number of improvements which could be made in the handling of interrupts. The fastest response would be obtained with a set of prioritized program counters (in lieu of vectors), the highest overriding all the lesser.

===========

I/O system:

I/O is all performed via a simple channel protocol involving the trading of interrupts and the setting of reserved status registers. The I/O processor is intended to be a microcomputer with a set of peripherals already attached, such as a PC or clone. It may seem ironic that an I/O channel protocol has been implemented in a machine where all cpu functions are memory-mapped. We have a couple of reasons: First, PCs have a rich set of peripherals and controllers already available. Secondly, some DMA-like protocol was required, since the program memory was not designed with the spare bandwidth needed to supervise its own downloading.

To initiate I/O, the processor sets the status registers, toggles the interrupt line (for the PC), and puts itself to sleep. The PC grabs the status information and wakes the processor back up. When the I/O is complete, the PC toggles the processor's interrupt line, the processor finds a convenient spot and goes to sleep, and the PC completes the transfer.

Performance:

Absolute performance is a function of more than architectural efficiency. However, talking about instruction latency as the number of clock ticks required for completion normalizes the speed of two processors so that they can be compared. Certainly, with semiconductor features approaching the sub-micron range, an external bus oriented architecture will not be able to compete in terms of raw speed with a complete single chip processor. However, the next few years should see a number of optical backplane designs, capable of withstanding nearly unlimited loading while maintaining a very high bandwidth.

The RISC/CISC argument has been made repeatedly, so it makes sense to describe ZISC's relative performance in terms of a RISC architecture. Since the arrangement of a given ZISC implementation is arbitrary, and since this paper advocates ZISC, we'll choose to make comparisons employing the optimal selection of functional units.

Some Examples:

Here are several narrated examples of code fragments mapped onto a ZISC configuration. For comparison, each is also optimally coded for a SPARC machine, following the SPARC architecture.

Example 1 - null loop body.

In C:

for (i=0; i<1000; i++);

In SPARC assembler code:

        mov  0x0,%o0
L1:     cmp  %o0,0x3e7
        blt  L1
        add  %o0,0x1,%o0
L2:     ...

Register o0 is zeroed for use as a loop index. When the limit is reached, control is passed to the instruction at L2.

note: A SPARC machine will execute the instruction immediately following a branch instruction regardless of which way the branch goes (like ZISC). It is for that reason that the add instruction appears following the branch, making optimal use of the cycles.

On a ZISC configuration employing:

a) an index register

 _______________________
| i-registr | increment |
|___________|___________|
addr:   I        I+1

b) a comparator which tests for >=

relop: >=
 __________________________________________________________________
| operand 1 | operand 2 | target1 in | target2 in | target out | ...
|___________|___________|____________|____________|____________|____
addr:   C        C+1          C+2          C+3          C+4

ins ~     left address bus        right address bus      dir
_________|_______________________|______________________|_______
 1        mem addr const. 0x3e7   I   (i-registr)         ->
 2        mem addr const. -1      I+1 (increment)         ->
 3        mem addr const. 0       C   (comparator op1)    ->
 4        mem addr const. -1      C+2 (target 1)          ->
 5        mem addr const. (ins 7) C+3 (target 2)          ->
 6        I (i-registr)           C+1 (comparator op2)    ->
 7        C+4 (target out)        P   (prog. ctr)         ->
 8        I (i-registr)           C+1 (comparator op2)    ->
 9        .....

First the index register is loaded with the loop limit and an increment of -1. Recall that each time the value at 'I' is read, the increment is added to the register value. This will cause the index register to count down from 999 to 0. Next the comparator is set up: while the test 'operand 1' >= 'operand 2' (that is, 0 >= i-registr) is false, the address of instruction 7, loaded into 'target 2', will appear at 'target out' and the loop will repeat. When the test becomes true, the -1 loaded into 'target 1' will appear at 'target out' instead, causing a fallthrough to instruction 9. Recall that the PC Stuff will accept positive addresses and ignore negative ones.

Note that there is a considerable amount of set-up before the loop can be executed. It may be argued that all the steps detailed may not be needed every time, but for the sake of argument they will be counted here. How many cycles will it take for each architecture to complete 1000 iterations of an empty loop (including startup)?

SPARC requires 1 + 3000 = 3001 cycles.
ZISC requires 6 + 2000 = 2006 cycles.
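The ZISC count can be checked with a small C walk-through of the instruction stream (hypothetical code; the "ins" numbers refer to the table above, and the loop body executes as a 7-8 pair every iteration because the second half runs whether or not the branch is taken):

#include <stdio.h>

int main(void)
{
    int i_reg = 0x3e7;    /* ins 1: limit 999 into i-registr    */
    int incr  = -1;       /* ins 2: increment                   */
    int op1   = 0;        /* ins 3: comparator operand 1        */
                          /* ins 4, 5: targets -1 and "ins 7"   */
    int cycles = 6;       /* the six set-up instructions        */
    int op2 = i_reg; i_reg += incr;   /* ins 6: read bumps      */

    int looping = 1;
    while (looping) {
        cycles++;         /* ins 7: target out -> PC.  A true
                             test (0 >= op2) selects -1, which
                             the PC stuff ignores: fallthrough.
                             A false test selects ins 7.        */
        looping = !(op1 >= op2);
        cycles++;         /* ins 8, the second half of the pair:
                             read i-registr into operand 2      */
        op2 = i_reg; i_reg += incr;
    }
    printf("%d cycles\n", cycles);    /* prints 2006 cycles     */
    return 0;
}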
Example 2 - simple integer arithmetic through pointers

In C:

foo (a,b,c)
int *a, *b, *c;
{
    int x;
    x = *a + *b;
    if (x > *c)
        *c = x;
}

In SPARC assembler:

        ...
        ld   [%fp+0x0],%o0
        ld   [%o0],%o1
        ld   [%fp+0x4],%o0
        ld   [%o0],%o2
        add  %o1,%o2,%o3
        ld   [%fp+0x8],%o0
        ld   [%o0],%o4
        cmp  %o3,%o4
        bgt  L1
        nop
        b    L2
        nop
L1:     st   %o3,[%o0]
L2:     ...

The frame pointer is assumed to point to an argument list consisting of the addresses of a, b and c. Each load/store operation requires 2 cycles to complete.

On a ZISC configuration employing:

a) an index register

 _______________________
| i-registr | increment |
|___________|___________|
addr:   I        I+1

b) a comparator which tests for >=

relop: >=
 __________________________________________________________________
| operand 1 | operand 2 | target1 in | target2 in | target out | ...
|___________|___________|____________|____________|____________|____
addr:   C        C+1          C+2          C+3          C+4

c) one adder

 ____________________________________________________
| operand 1 | operand 2 |    sum     |   carry    | ...
|___________|___________|____________|____________|___
addr:   A        A+1          A+2          A+3

d) two g.p. registers

 ____________
|    reg     |
|____________|
addr:   R
 ____________
|    reg     |
|____________|
addr:   R+1

e) the indirect stuff (standard in any configuration)

ins ~     left address bus          right address bus      dir
_________|_________________________|______________________|_______
 1        mem addr const. 0x4       I+1 (i-reg incr)        ->
 2        fp (see below)            I   (i-register)        ->
 3        I (i-register)            indirect stuff          ->
 4        [indirect stuff *(*a)]    indirect stuff          ->
 5        [indirect stuff (*a)]     A   (adder1, op1)       ->
 6        I (i-register)            indirect stuff          ->
 7        [indirect stuff *(*b)]    indirect stuff          ->
 8        [indirect stuff (*b)]     A+1 (adder1, op2)       ->
 9        A+2 (adder1, sum)         C+2 (target 1)          ->
10        I (i-register)            indirect stuff          ->
11        [indirect stuff *(*c)]    R   (g.p. reg1)         ->
12        R (g.p. reg1)             indirect stuff          ->
13        [indirect stuff (*c)]     C+3 (target 2)          ->
14        C+3 (target 2)            R+1 (g.p. reg2)         ->
15        R+1 (g.p. reg2)           C+1 (comparator op2)    ->
16        A+2 (adder1, sum)         C   (comparator op1)    ->
17        R (g.p. reg1)             indirect stuff          ->
18        [indirect stuff (*c)]     C+4 (target out)        <-

The designation "fp" in the above example is made for clarity in comparison with the SPARC code fragment. It is assumed to be a general purpose register whose use is designated as a frame pointer.

The index register is employed to step through the argument list. The increment is loaded with the value 4, the presumption being that the addresses are separated by 4 bytes. Recall that each time the index register is read, its value is bumped. The pointer to c is temporarily squirreled away in reg1 so that it will be available later, should the value of *c need to be modified.

Notice the use of the "indirect stuff". The index register first supplies the address of the address of the desired variable, which is turned around through the indirect stuff to obtain the address of the variable, and finally the value of the variable itself.

No branch is made in this example; rather, the comparator is used to 'select' either the value of 'x' or the old value of '*c' for return to storage. Each architecture, ZISC and SPARC, requires 18 cycles to complete the operation and return the result to memory.
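In C terms, the comparator has turned the conditional store into a branchless select. A minimal sketch of the equivalent (the >= test stores the same value as the original > test when x equals *c):

#include <stdio.h>

static void foo(const int *a, const int *b, int *c)
{
    int x = *a + *b;
    /* comparator: op1 = x, op2 = *c, target1 = x, target2 = *c;
       the selected target is stored back unconditionally       */
    *c = (x >= *c) ? x : *c;
}

int main(void)
{
    int a = 3, b = 4, c = 5;
    foo(&a, &b, &c);
    printf("%d\n", c);   /* prints 7 */
    return 0;
}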
Example 3 - a configurable win.

Here is a code fragment where ZISC's configurability allows one configuration to drastically outperform a standard processor design: compute a^2 + 2ab + b^2, where a and b are halfword integers, on a ZISC configuration employing three multipliers and two adders. Assume that the integer multipliers are not pipelined and require 6 cycles to produce a product.

The configuration contains:

a) two adders

 ____________________________________________________
| operand 1 | operand 2 |    sum     |   carry    | ...
|___________|___________|____________|____________|___
addr:   A        A+1          A+2          A+3
 ____________________________________________________
| operand 1 | operand 2 |    sum     |   carry    | ...
|___________|___________|____________|____________|___
addr:   A+4      A+5          A+6          A+7

b) three halfword multipliers

 ________________________________________
| operand 1 | operand 2 |  product   | ...
|___________|___________|____________|___
addr:   M        M+1          M+2
 ________________________________________
| operand 1 | operand 2 |  product   |
|___________|___________|____________|___
addr:   M+3      M+4          M+5
 ________________________________________
| operand 1 | operand 2 |  product   |
|___________|___________|____________|___
addr:   M+6      M+7          M+8

ins ~     left address bus       right address bus      dir
_________|______________________|______________________|_______
 1        address of 'a'         M   (mult 1, op 1)      ->
 2        address of 'b'         M+1 (mult 1, op 2)      ->
 3        M   (mult 1, op 1)     M+3 (mult 2, op 1)      ->
 4        M   (mult 1, op 1)     M+4 (mult 2, op 2)      ->
 5        M+1 (mult 1, op 2)     M+6 (mult 3, op 1)      ->
 6        M+1 (mult 1, op 2)     M+7 (mult 3, op 2)      ->
 7        ----                   (nop)                   ----
 8        ----                   (nop)                   ----
 9        M+2 (product 1)        A   (adder 1, op 1)     ->
10        M+2 (product 1)        A+1 (adder 1, op 2)     ->
11        A+2 (sum 1)            A+4 (adder 2, op 1)     ->
12        M+5 (product 2)        A+5 (adder 2, op 2)     ->
13        A+6 (sum 2)            A   (adder 1, op 1)     ->
14        M+8 (product 3)        A+1 (adder 1, op 2)     ->

The total time to complete the operation is 14 cycles. Multiplier 1 computes ab while its operand ports are read back to feed a to multiplier 2 and b to multiplier 3, which compute a^2 and b^2. Adder 1 doubles ab, adder 2 folds in a^2, and adder 1 (free again) folds in b^2, leaving the final result readable at A+2.

Observations:

ZISC may be most aptly suited for special purpose, embedded applications. Because of its on-the-fly configurability, problems requiring additional horsepower of a given type (like multiplication in the above example) can benefit greatly. The ZISC abstraction can also be extended to more complex distributions, where the functional units represent processes and ZISC an executive.

The architecture is expensive in terms of the number of gates and supporting memory required. Each instruction, regardless of how simple, requires two full addresses.

Compilation for an arbitrary configuration would be non-trivial. Traditional front-ends could be used. At code generation time each atomic operation could be parcelled out, according to a configuration table, to the functional units. In the absence of the appropriate hardware, instructions could be recursively replaced with macros until simulated by the available functional resources. An operating system would be able to depend on the presence of a minimal set of functional units. Other applications could be stored and traded in intermediate language only, to be compiled for the target machine after reaching their destination.

What next?

A single ZISC configuration is memory and processor bandwidth limited, in a world where many applications exhibit a considerable amount of fine-grained parallelism. The next step would be to string multiple ZISCs together into a Very Long Instruction Word (VLIW) arrangement:

[Figure: several ZISCs side by side, each with its own data bus and its own stacks of functional units, their banks of address busses driven in parallel from a common source.]

Since little control hardware is required to make ZISC work in the first place, an extension to a VLIW configuration would only require that each bank of address busses be driven from the same program counter. Additionally, when desired, the ZISCs could break ranks and operate independently as multiprocessors, or re-rendezvous into a VLIW again. Memory access would be made through a write-through cache, and results in the stacks of functional units could be readable on either side. A new sort of functional unit, whose only purpose is to become a transparent data window, could be included for the purpose of moving data to stacks more than one hop away.
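To close, a sketch of the lockstep idea in the same illustrative C style, assuming two lanes and a shared program counter (lane count and all names are invented):

/* VLIW extension: one very long instruction is one address pair
 * per ZISC lane, all stepped by a single program counter. */
#include <stdint.h>
#include <stdio.h>

#define LANES 2
#define SPACE 64

struct pair { uint16_t left, right; };
static uint32_t space[LANES][SPACE];    /* one space per lane    */

static void vliw_cycle(const struct pair word[LANES])
{
    for (int l = 0; l < LANES; l++)     /* lockstep, shared PC   */
        space[l][word[l].right] = space[l][word[l].left];
}

int main(void)
{
    space[0][1] = 11;
    space[1][2] = 22;
    struct pair word[LANES] = { {1, 5}, {2, 6} };
    vliw_cycle(word);                   /* both moves, one tick  */
    printf("%u %u\n", (unsigned)space[0][5],
                      (unsigned)space[1][6]);   /* prints 11 22  */
    return 0;
}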