Introduction to the Pipelined Processor

This part of the lesson finally introduces the pipeline, as a new processor paradigm. We will explain, how the solution for the analogy of the laundry generalizes to the problem of the simple RTL circuit, from the last part of the lesson, and what all of this has to do with processors and how to make programs run faster on them.

Mapping the laundry analogy to the RTL case

How the analogy maps to the simple RTL circuit is rather straightforward: the clients map to the inputs of the circuit, and the machines map to the operations the that circuit executes on them. Letting each client use a different machine corresponds to dividing the circuit in two, with one further register, as shown in the circuit below, yielding a two-stage circuit which is able to effectively run faster, as the clock only has to wait for the propagation delay of the slowest of the two operations instead of waiting for the sum of their propagation delays. Just as each single client takes more overall time to process its own business, each input now takes longer to come out the other side, twice the slower component time, but the overall throughput of the circuit increases. Feel free to try for yourself for a given input to the circuit the version with only one stage, and two registers, and the new version below with two stages and three registers,and notice, that you can allow a faster clock with the latter.

Graphing the dynamic behaviour of the two-stage RTL circuit

Let us therefore see how to draw the diagram for this case. Let us use the same inputs as last time, therefore 0, 5, and 12. Let us write them down.

Inspect the clock and all the chain of signals, set the input, output, and all monitors to decimal, and pause the simulation.
Set the input to zero and advance the simulation by one clock cycle. We will call this, cycle one. You can see that the input zero has propagated in all the exponentiation stage of the circuit. Therefore, we can say that in cycle one the input zero was processed with the exponentiation.
Set the input to five and advance the simulation by one clock cycle. Now the input zero has passed to the increment stage, while the input five is in the exponentiation stage.
Set the input to twelve and advance the simulation by one clock cycle. We can see now that the input zero has finished processing, and the output shows the result one. Both the other inputs have behaved similarly, shifting to the right, with the 5 input now being incremented and the 12 input now being exponentiated.
Let us do one more cycle. Set the input to 7 and advance the simulation by one clock cycle. The output now shows twenty six, the result of input 5, while the incrementer is processing the input 12 and the exponentiator is handling the input 7.

We could go on indefinitely in this fashion.

Input	Cycle 1	Cycle 2	Cycle 3	Cycle 4	Cycle 5	Cycle 6	Cycle 7	Cycle 8

Graphing the dynamic behaviour of a single cycle processor

The question is now, what does this have to do with the processor? As a matter of fact, the circuit that exponentiates and increments its input was also an analogy for a processor: a series of operations (read and write the memory, alu operations et cetera) that are executed on an input (an instruction) and that are synchronized by a clock signal, in a Register Transfer fashion. The very same diagrams that we compiled for the laundry and for the circuit, we can compile for a processor.

We made this whole journey, that started with remembering that logic gate operate with a propagation delay, to reach this point: processors are made of a very large number of those ports, and their cumulative delay becomes very noticeable with so many of them in series. It is the objective of the hardware engineer, to explore the ways to make it process more instructions in the same amount of time, preserving funcionality. Having already seen the solution in other analogic domains, it is just a matter of understanding how the principle generalizes to the processor.

The diagram for the single cycle processor is very simple, as only one instruction can fit inside it at any given cycle. Below is an example of a simple program. Let us use the acronym "SC" for cingle cycle to indicate all the operations, that the single cycle processor executes on the input instruction. In the first cycle, we expect the first instruction to be processed, in the second cycle the second instruction and so on until the end of execution. In case the branch is taken once, cycle six will contain the load, and the execution continues as before.

Here is a program to use in order to write a simple diagram for the single cycle processor:

Program:

sw t1 0(t2)         # M[0xbc] = 0x123
LOOP:
   lw a1 0(t2)      # a1 = M[0xbc] 
   add a2 a1 a1     # a2 = a1+a1
   sw a2 0(t2)      # M[0xbc] = a2
   blt a2 a0 LOOP   # jump if a2<a0

Input (instruction)	Cycle 1	Cycle 2	Cycle 3	Cycle 4	Cycle 5	Cycle 6	Cycle 7	Cycle 8
sw t1 0(t2)
lw a1 0(t2)
add a2 a1 a1
sw a2 0(t2)
blt a2 a0 LOOP

Intuition for the multi-stage pipelined processor

Now, the main question that we are going to answer in this lesson is, how to modify the hardware in order to process more instructions in the same amount of time?

Just as in the two previous analogies, the objective is to divide the single cycle processor in multiple so called stages, separated by interstage registers, so that each contains one instruction completing one well defined part of the total execution. This way, just as the laundry allowed for three clients at any given time working in parallel, on different stages of the cleaning process, the so called "pipelined" processor will allow multiple instructions being processed, at any given time, each completing one different stage of its processing. Both in the case of the laundry and in the case of the RTL circuit however, we did not have to identify the various discrete operations in which to divide the process, as they were trivial. How to sort the components of the single cycle into functionally defined clusters, and therefore how to divide the processor with registers, how the stages interact with one another and with each type of instruction, are all topics of the next parts of the lesson.

Here is a cheatsheet that compares the case of the laundry to the processor. This is helpful to remember the core concepts.

CONTEXT	24/7 Laundry	Single Cycle Processor
THING TO PROCESS	CLIENT	INSTRUCTION
STAGES OF PROCESSING (fake timings)	Wash 1.45h, Dry 1.30h, Iron 2h	Fetch 4ns, Decode 1ns, Reg read 4ns, Execution 5ns, Memory 6ns, Writeback 3ns
WHO ARE WE	The Owners Evelyn and Waymond Wang	R&D Engineers, with the task to make the processor faster
CONSTRAINT	Only one client in the room at any given time. The laundry serves one client every 5.15h -> 4.36 clients a day	Only one instruction in the processor at any given time. The processor serves one instruction every 23ns -> 43.478.261 instructions per second (~43.5 MHz)
SOLUTION	Let three clients in the room at any given time, each using one machine and waiting for the next machine to free before using it (every two hours, the time of the longest task).	Divide the hardware of the processor in stages, one for each sub-operation, with interstate registers, which are clocked every 6 ns (time of the longest task)
CONS	Clients waits 15 minutes after washing before drying, and 30 minutes after drying before ironing, since ironing is the longest task. Each client takes 6 hours instead of 5.15 h to was the clothes.	Every stage except the slowest one will stall for some time after propagating, since they have to wait for the memory stage, the slowest, to finish so they can all proceed. Each instruction takes more to process five times the longest delay (5x6=30ns) instead of the sum of delays (23ns)
PROS	The laundry processes one client every 2 hours, time of the slowest stage, instead of 5.15, sum of all stages, as three clients at a time are in the laundry IN PARALLEL. The laundry serves 12 clients a day instead of 4.36	The processor serves one instruction every 6ns, time of the slowest stage, instead of 23ns, sum of all the stages, as five instructions at a tmie are in the processor IN PARALLEL. The processor serves 166.666.667 instructions per second (167MHz instead of 43.5MHz)
TRICK	No three clients are processed in a block, then 3 others in a block and so on. 3 Clients are there in any given moment, but they rotate (first in is first out)	No five instructions are processed in a block, then 5 others in a block and so on. 5 Instruction are there in any given moment, but they rotate (first in is first out)

Wrap Up Exercises

You found out during previous exercises a qualitative estimate of the maximum theoretical propagation delay of the Incrementer circuit. You may have noticed that the circuit is actually a 8 bit adder with one "hidden" input set to one. Did this influence your estimate of the total propagation delate? Discuss the difference in expected actual propagation delay, if any, between a normal adder and the incrementer.

Please write your answers on a document, to then submit as PDF in the "Assignments" section. Mind that all answers from the whole Circuit Timing section are to be written on the same document. Feel free to add images to your answers, if it helps to explain something more concisely.

Introduction to Pipelining