Introduction
Welcome to "Introduction to Pipelining": an experimental, simulation-tool-assisted lesson that introduces the basics of parallel execution and pipelining.
Who is this lesson for
"Introduction to Pipelining" caters to undergraduate computer science students who understand the basics of computer science and computer architecture and want to take one step closer to understanding how modern processors are designed and built.
Scope
The goals of the lesson are:
- Introducing the user to the most rudimentary hardware-level technique for parallel execution.
- Presenting two software tools for simulating simple RISC-V processors at multiple abstraction levels.
- Letting the user practice the acquired knowledge and evaluate the lesson itself.
Above all, the lesson aims to show the user that, thanks to the simulators presented, they can get their hands dirty and experiment with bits, clocks and instructions, and maybe also have fun while cementing their knowledge.
Assumptions and Prerequisites
The lesson assumes that:
- You have worked before with simple logic circuits and gates.
- You are comfortable with basic RISC-V assembly programming and its binary and hex representations.
- You know what the main components of a simple processor are and how assembly runs on them.
If at any moment you need to review any of these concepts, we have left some resources in the Appendix, especially:
- A RISC-V Instruction Set Cheat Sheet HERE
- Interactive models of the most important components of a processor HERE. Use these models whenever, during the lesson, you feel the need to see how one of the components works in isolation. For example, if you want to see what is happening inside the ALU at a specific time of a Ripes program simulation, just copy the values you see in Ripes into the interactive ALU model found in the appendix and inspect the ALU from there.
Setup
The lesson uses two simulation tools; we recommend you verify that both work for you before starting:
- S.H.E.A.S., a simple logic gate simulator, also available standalone but mostly embedded in the lesson pages. A tutorial is available in the appendices HERE.
- Ripes, a visual RISC-V processor simulator. The appendices contain HERE a virtual machine, to make sure it works on any system, and a tutorial for the program.
Overview
The lesson consists of videos, interactive visualization tools and, in places, exercises. Most sections with a video include an almost identical transcript below it. Expect to dedicate two to four hours to completing the lesson, depending on your own pace (you could take even less). We will ask you to fill in two forms:
- An Answers Form, in which we ask you to submit the answers to the various questions and exercises found along the lesson. Open it now and keep it open in another tab for easier access during the lesson.
- A Final Evaluation Form, which you will also find in the Assignments Section in embedded form for easier access. This has to be filled in after you have finished the lesson and the assignments.
What follows is an overview of the next chapters.
- Chapter 2 - Timing: this chapter deals with the problem of propagation delay. We will see how gates computing their output in a non-negligible amount of time impacts the single cycle processor, and we will construct, by analogy, an intuitive idea of the solution: a new processor paradigm called the pipeline.
- Chapter 3 - Building the Pipeline: in this chapter we will build the new processor paradigm lego-like, starting from the components of the single cycle processor. We will also see how each instruction activates each stage of the pipeline and understand the main advantages of the pipelined processor.
- Chapter 4 - Demos: this chapter finally covers the execution of a couple of programs on the pipelined processor, from start to finish. We will encounter the problems that the pipeline introduces, which were of no concern before, and develop an intuition for how to resolve them. At the end of this chapter you will find the EVALUATION FORM embedded below the assignments.
- Appendix: as said above, this contains corollary material for the lesson, among which are the setup instructions and tutorials for the two software tools we are going to use, a reference for the RISC-V instruction set and a catalogue of interactive processor components.
Transcript
In this lesson we are going to modify the hardware of a single cycle processor, building a new computer architecture paradigm that allows more instructions to be processed in the same amount of time, without loss of functionality.
With the support of two hardware simulation tools, we will start with simple logic gates and their propagation delay and study their consequences on the single cycle processor; then we will proceed to build a new execution paradigm called the pipeline and, with the help of a complete RISC-V reference, run programs on it.
Before beginning the lesson, make sure you read the goals, prerequisites and overview written below.
Circuit Timing
In your studies so far you have treated logic gates and circuits as instantaneous, meaning they yield their result immediately after receiving their inputs. However, this is not the case in the real world, where circuits always have a non-negligible propagation delay. What consequences does this have on the processor? In this chapter we will find out.
Make sure you have read THIS to know how to use SHEAS properly.
Index
- Gates and Chips Operate with a Propagation Delay
- Propagation Delay Poses an Upper Limit to the Speed of RTL Circuits
- Keeping Track of the Dynamic Behaviour of RTL Circuits with Diagrams
- Introduction to the Pipelined Processor
Consequences of a Non-Negligible Propagation Delay
This part of the lesson deals with the delay with which logic gates compute their output, which is not zero. This propagation delay, though often ignored, constrains the maximum clock frequency of any processor.
When logic gates are studied, there are usually two different ways to communicate their functionality.
- One is the truth table, the complete list of outputs for all the possible combinations of inputs. Examples are the nand gate, whose truth table shows the output deactivating only when both inputs are on, or the xnor gate, whose output is active only when the two inputs have the same value.
- The other way of communicating functionality is the one we just used to describe the truth tables of those two gates: a functional description in plain English. This is often preferred, for the sake of brevity, with gates that have many possible inputs but a simple rule for calculating the output: this is the case for the incrementer chip here, which has 256 possible inputs, since its input port is 8 bits wide, but whose output can simply be described as "the input plus one" for any given input.
Whichever way a gate is described, one further detail is often omitted or only quickly touched upon: the output of each gate is always calculated with a small but measurable delay with respect to the input variation. This is called propagation delay. Thankfully, our tool is granular enough to show it.
Propagation delay of a simple gate
To observe this delay in action, fiddle with the nand gate in the following simulator: inspect the inputs and the output of the nand gate, then pause the simulation. Zoom in on the monitors so that you can comfortably see each step. Now change the inputs and try to work out how many steps it takes to calculate the output after any input variation. It takes one step, and you can verify all the combinations by yourself. How long one step is is rather unimportant: it might be one millisecond or one nanosecond. It is the smallest time division of our simulator's engine, and might as well be the smallest time interval we could measure with a logic analyzer in the real world.
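If you want to play with the idea outside the simulator, here is a minimal Python sketch of ours (not S.H.E.A.S. code) of a one-step-delayed nand: the value monitored at a given step reflects the inputs of the previous step.

```python
# One-step-delayed nand: the output seen at step t reflects the inputs at t-1.
def simulate(inputs):
    """inputs: list of (a, b) pairs, one per step; returns the monitored
    output per step (None before anything has propagated)."""
    trace, pending = [], None
    for a, b in inputs:
        trace.append(pending)          # value that finished propagating
        pending = int(not (a and b))   # will become visible next step
    return trace

print(simulate([(1, 1), (1, 1), (0, 1), (0, 1)]))  # -> [None, 0, 0, 1]
```

Note how the output only reacts to the 0,1 input one step late, exactly as on the simulator's monitors.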
Propagation delay in a simple boolean circuit
We can do the same with the xnor gate. We notice that sometimes the output takes five steps to propagate, as is the case with inputs changing from zero zero to zero one, and sometimes six steps, as with inputs changing from zero zero to one one. We can therefore say that the worst-case propagation delay of the xnor is six steps. Why is this value higher than the propagation delay of the nand gate? Remember that the nand gate is what is called a universal gate, meaning it can be used to synthesize any other logic gate, like this xnor, which happens to be synthesized out of nands. The longer propagation delay of the xnor is therefore a direct consequence of the complexity of its implementation in nand gates, and we have also verified that it varies with the previous input.
Let us spend two minutes on a qualitative analysis of how the inner structure of the xnor chip results in the propagation delay we observed. We see the inner structure is made of 'and', 'not' and 'or' gates. Internally, the not is a single nand gate with short-circuited inputs, which means its propagation delay is the same: one step. The and gate is made of two nand gates in series, so it is rather intuitive that its propagation delay is two steps. The same can be said about the or gate, which has at most two nand gates in series between any input and the output. The structure of the xnor gate is slightly more complex, but we only have to generalize the concept. Each input has two routes towards the output, one five steps long and one six steps long: both pass through a not gate, an or gate and an and gate, but only one of the two passes through the first not gate. So, without assumptions about the inputs, we can say that the maximum delay of the xnor gate is six steps.
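The reasoning above amounts to finding the longest path through the gate network. As a sketch, here is a tiny Python model of ours: the xnor structure below (a not-of-xor built with input inverters) and the per-gate delays (not = 1 step, and = or = 2 steps) are assumptions taken from the qualitative analysis, not an export of the actual chip.

```python
# Worst-case propagation delay as the longest path through a gate DAG.
# Assumed per-gate delays, in simulator steps (each 'and'/'or' hides two
# nands in series; a 'not' is a single nand with short-circuited inputs).
DELAYS = {"not": 1, "and": 2, "or": 2, "in": 0}

# Hypothetical xnor: not(xor), with xor = (a and not b) or (not a and b).
CIRCUIT = {
    "a": ("in",), "b": ("in",),
    "na": ("not", "a"), "nb": ("not", "b"),
    "g1": ("and", "a", "nb"), "g2": ("and", "na", "b"),
    "x": ("or", "g1", "g2"),
    "out": ("not", "x"),
}

def max_delay(node, circuit=CIRCUIT):
    """Longest input-to-node path, weighted by each gate's delay."""
    kind, *inputs = circuit[node]
    return DELAYS[kind] + max((max_delay(i, circuit) for i in inputs), default=0)

print(max_delay("out"))  # -> 6, matching the worst case observed above
```

The shorter five-step route is simply the same path minus the input inverter.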
Propagation delay in a complex circuit
Let us now consider the incrementer. Inspecting it, you can see that it is a rather complex circuit, so we expect it to have a rather high maximum propagation delay. How much exactly? Let us set the input radix to decimal and try some values. Zero to one takes 12 steps. 254 to 255 takes 4 steps. 127 to 128 takes many steps.
You will be asked for the exact propagation delay of this chip as an exercise after this part of the lesson. Notice that the carry of each full adder is fed as an input to the following full adder. You will be asked to reason about what that means for the total propagation delay and whether, given this structure, the fact that one input of the adders is always one somehow changes the delay estimate.
Wrap Up Exercises
- Conduct a qualitative analysis of the full adder circuit similar to that done for the xnor gate in the previous lesson. What propagation delay do you estimate? Here is an isolated full adder circuit, of which there are many inside the incrementer.
- Does the overall propagation delay of the incrementer chip change for different inputs? If so, according to which criterion?
Please write your answers on a document, to then submit as PDF in the "Assignments" section. Mind that all answers from the whole Circuit Timing section are to be written on the same document. Feel free to add images to your answers, if it helps to explain something more concisely.
Review of Register Transfer Level (RTL) abstraction
This part of the lesson covers the limitations that propagation delay imposes on the RTL versions of the same circuits used in the previous section.
In the previous lesson we learned about propagation delay, an intrinsic characteristic of each logic gate. In this lesson we are going to cover the same topic at a slightly different abstraction level, the so-called Register Transfer Level, or RTL for short. This is a design abstraction that focuses on sampling the inputs and outputs of a circuit with registers, as you can see in the simulations below, which contain the same chips as the last ones but with sampled inputs and outputs. You will hopefully come to see how this abstract structure pervades the hardware architecture of processors.
Now let us explore how this new abstraction level influences how the chips work.
RTL of a simple gate
Focusing on the nand gate: pause the simulation and inspect its clock signal and the inputs and outputs on both sides of the registers. When an input changes, it is not passed to the inner circuit, in this case the nand gate, until the next rising edge of the clock signal, which updates the input registers' outputs after one step.
After that, the nand gate is free to propagate and calculate its output in one step. That value, however, is not immediately transferred to the actual circuit output: it is transferred only at the next rising edge of the clock signal, which updates the output register, which after one further step updates the actual circuit output.
Given that the input is detected at a given rising edge of the clock signal, the ideal case is when the correct output comes out of the output register at the very next rising edge of the clock signal, that is when the propagation delay of the circuit contained within the input and output registers is shorter than the clock period.
This is critically important: if we want, for example, to put more than one of these cells in series, each one must compute its own output within one clock cycle, that is, by the rising edge immediately after the one that detected the new inputs, in order to sample the correct output. If this does not happen, the wrong value is passed on to the next section of the circuit and the computation fails.
With this in mind, we can now see the propagation delay of a chip from a different perspective: it becomes a limit on the maximum clock frequency that the corresponding RTL representation can handle without error, where by error we mean the output being sampled at any point later than the rising edge immediately after the one that detected the changed input. We can call this limitation a "sampling constraint".
Let us try then to find the maximum clock frequency of each of the circuits in the simulation panel. The clock component is controlled by the number above it, which represents how many steps pass before the clock changes value, that is, half the clock period measured in steps. So, for example, two consecutive rising edges of a clock set to 5 are 10 steps apart, and that value, the clock period of 10 steps, is the maximum allowed propagation delay of the chip contained between the registers clocked by that signal.
Let us then tackle each circuit. We found out earlier that the propagation delay of the nand gate is just one step. Therefore even a clock set to 1, which has a clock period of two steps, should present the correct output after just one clock cycle. If we verify that, we see that no matter when we change the input, the output changes exactly one step after the rising edge following the one that detected the new input. The extra step is due to the propagation delay of the output register itself.
RTL of a simple boolean circuit
We can do the same thing with the xnor, expecting the minimum clock period to be 6, that is, the clock component set to 3. The way to test the worst-case scenario is to use a known input with the maximum expected propagation delay, in our case zero zero to one one, and change it at the same time as a rising edge of the clock. We see that even in that case the circuit still functions properly.
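The sampling constraint boils down to a one-line rule. In this Python sketch of ours, `min_clock_setting` (a name of our choosing) returns the smallest clock "number" whose full period covers a given worst-case propagation delay:

```python
import math

# Sampling constraint: the clock period (2 x the clock's "number")
# must be at least the chip's worst-case propagation delay in steps.
def min_clock_setting(prop_delay_steps):
    return math.ceil(prop_delay_steps / 2)

print(min_clock_setting(1))  # nand, 1 step  -> setting 1 (period of 2 steps)
print(min_clock_setting(6))  # xnor, 6 steps -> setting 3 (period of 6 steps)
```

Odd delays round up: a 7-step chip also needs setting 4, since a period of 7 steps cannot be expressed with this clock.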
RTL of a complex circuit
Regarding the incrementer, it was your task to find its exact propagation delay in the last exercise, so we are not going to spoil the solution. We are instead going to use this chip as an example of what happens if your estimate is wrong. If you overestimated the delay you will not see the error on the signal monitors, as the correct output will just wait a few steps to be sampled at the output register. If you underestimated the delay, however, it will be very evident from the signals, and that is the case we are covering now.
Let us say that you estimated the propagation delay of the incrementer to be 20 steps (we are taking a particularly low estimate just to be sure it is very, very wrong). That means we expect a clock set to 10 to sample the correct output at the rising edge immediately following the one that detected the new input.
Set the input radix and the monitors to decimal and the clock to 10. If we set the input to zero, and the change is detected at this rising edge, we can see the circuit comfortably outputting one just one step after the next rising edge. Remember that the extra step is due to the propagation delay of the output register.
However, if we input for example 63, we see that the next rising edge yields 56 instead of 64. What is more, the rising edge after that samples the value 32, again instead of 64. Finally, on the third rising edge after the one that sampled the new input, we get 64. This happens because the propagation delay of the incrementer with input 63 is between 40 and 60 steps (between the second and third rising edge from the one that sampled the input), and at each rising edge before the result is ready, the output register samples whatever intermediate value is coming out of the incrementer chip while it is still propagating. This is, in a nutshell, why in general the clock period must be long enough to allow the circuit to fully propagate for any input.
Wrap Up Exercises
- Reformulate the result of the previous exercise, that is the estimate of the propagation delay for the incrementer, as a constraint to the maximum clock speed of its RTL circuit form.
Please write your answers on a document, to then submit as PDF in the "Assignments" section. Mind that all answers from the whole Circuit Timing section are to be written on the same document. Feel free to add images to your answers, if it helps to explain something more concisely.
A Diagram for Graphing the Dynamic Behaviour of RTL Circuits
This part of the lesson introduces a simple diagram for representing the time evolution of the execution of a circuit at Register Transfer Level.
A multi-operation RTL circuit
The only difference between the circuit below and the previous ones is that this one executes two operations between the two registers: raising to the power of two, and an increment. Therefore, given input 'x', we expect the output to be 'x' to the power of two, plus one. Note that the multiply chip is not implemented all the way down to nand gates, as that would be extremely complex: it 'cheats' with an ad hoc component and an artificial fixed propagation delay.
Graphing the dynamic behaviour of RTL circuits
The graph has the cycles on the x axis, read from left to right, and the inputs given to the circuit on the y axis, read from top to bottom. In a nutshell, it keeps track of which input was being processed by the circuit at each cycle. At the intersection of the cycle and the input we are considering, we write the operation that was carried out on that input in that cycle.
| Input | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|---|---|---|---|---|---|---|---|---|
Let us now fill in the table while manually advancing the simulation. Pause it, inspect the clock signal, the input and the output, and enable the registers. Let us set the first input to zero, write it in the table, and advance by one clock cycle. We will see that though the input has been sampled by the rising edge of the clock, the output has not updated yet; it will at the next rising edge. Remember that we consider the beginning of a clock cycle to be its rising edge, and its end the step just before the next rising edge.
At this point the operation of the circuit, exponentiation and increment, has been carried out on the first input, so we write it down under cycle 1. We can then write in the new input, let us say 5, and advance by one cycle.
We can see the result for the first input being sampled, and we can fill in the diagram with the operation at cycle 2. Then we start over again: new input, write the input in the table, advance one cycle and write the operation in the table.
After a few steps we have a completed table, which we can read in the following manner. What operation was done to the input of value 5 in cycle 2? We read exponentiation and increment. What operation was done on the input of value 12 at cycle 4? Here we read none. The reason we have values only on the diagonal is that the present circuit can only process one input at a time, which means that each input is processed in one clock cycle from beginning to end. For now, let us keep in mind how this diagram is constructed, but let us not forget the original topic: propagation delays and the constraint they set on the clock speed.
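If you prefer code to hand-filled tables, here is a small Python sketch of ours that builds the diagonal diagram; "exp+inc" is just our shorthand for the circuit's whole operation (exponentiation followed by increment).

```python
# Build the cycle diagram for a circuit that processes one input per cycle:
# input i is handled entirely in cycle i, so only the diagonal is filled.
def single_stage_diagram(inputs, op="exp+inc"):
    return [(x, [op if c == i else "" for c in range(len(inputs))])
            for i, x in enumerate(inputs)]

for value, row in single_stage_diagram([0, 5, 12]):
    print(f"{value:>3} | " + " | ".join(cell or "  -  " for cell in row))
```

Reading it back gives exactly the answers above: input 5 shows "exp+inc" in cycle 2, input 12 shows nothing outside cycle 3.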
The 24/7 Laundry Analogy
Imagine being the engineer who designed this circuit and wanting to process inputs faster. How can we modify the RTL circuit so that it maintains functionality but speeds up calculations? Let us try to get to an answer with an analogy.
Imagine being the manager of a laundry business open 24/7, which offers a washing machine that takes 1 hour 45 minutes to run, a dryer that takes 1 hour 30 minutes and an ironer that takes 2 hours. One client comes in, uses the room with the three machines and then leaves, much like one input of the circuit above, which enters through the input register, is processed, and exits through the output register. With this organization each client keeps the room busy for 5 hours 15 minutes, and we can draw a diagram similar to the one we filled in before, in which each time frame contains the entire cleaning process of only one client. How then can we optimize the profits of the laundry business?
Daily activity of a 24/7 laundry.
| Hours | 8:00-13:15 | 13:15-18:30 | 18:30-23:45 | 23:45-5:00 | 5:00-10:15 |
|---|---|---|---|---|---|
| Client 1 | Wash+Dry+Iron | | | | |
| Client 2 | | Wash+Dry+Iron | | | |
| Client 3 | | | Wash+Dry+Iron | | |
| Client 4 | | | | Wash+Dry+Iron | |
| Client 5 | | | | | Wash+Dry+Iron |
The solution, for a laundry business.
The simplest solution is the most intuitive: letting three clients use the three machines at the same time in a linear fashion, that is, a client starts using the next machine in the cleaning process when the previous client finishes with it. This means that the clients proceed to the next machine at the pace of the slowest one, in this case every two hours, because of the ironer. Each client therefore completes the cleaning process in 6 hours instead of 5 hours 15 minutes, but the laundry processes one client every two hours instead of one every 5 hours 15 minutes. You can see how the diagram transforms: the time frames are now much smaller, each contains up to three different clients at the three different stages of the cleaning process, and processing the same number of clients therefore takes far less time.
| Hours | 8:00-10:00 | 10:00-12:00 | 12:00-14:00 | 14:00-16:00 | 16:00-18:00 | 18:00-20:00 | 20:00-22:00 |
|---|---|---|---|---|---|---|---|
| Client 1 | Washes (and waits 15 min) | Dries (and waits 30 min) | Irons | | | | |
| Client 2 | | Washes (and waits 15 min) | Dries (and waits 30 min) | Irons | | | |
| Client 3 | | | Washes (and waits 15 min) | Dries (and waits 30 min) | Irons | | |
| Client 4 | | | | Washes (and waits 15 min) | Dries (and waits 30 min) | Irons | |
| Client 5 | | | | | Washes (and waits 15 min) | Dries (and waits 30 min) | Irons |
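The numbers behind the two tables can be checked with a few lines of Python (our sketch, times in minutes):

```python
# Stage durations from the text: wash 1 h 45, dry 1 h 30, iron 2 h.
stages = {"wash": 105, "dry": 90, "iron": 120}
n_clients = 5

# One client at a time: the room is busy 5 h 15 min per client.
sequential_total = n_clients * sum(stages.values())

# Pipelined: everyone moves at the pace of the slowest machine (the ironer),
# so a new client starts every slot and the last one needs three full slots.
slot = max(stages.values())
pipelined_total = (n_clients - 1) * slot + len(stages) * slot

print(sequential_total / 60)  # -> 26.25 hours for 5 clients, one at a time
print(pipelined_total / 60)   # -> 14.0 hours with three machines in parallel
```

The 14 hours match the table above exactly: client 1 enters at 8:00 and client 5 leaves at 22:00.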
Introduction to the Pipelined Processor
This part of the lesson finally introduces the pipeline as a new processor paradigm. We will explain how the solution to the laundry analogy generalizes to the problem of the simple RTL circuit from the last part of the lesson, what all of this has to do with processors, and how to make programs run faster on them.
Mapping the laundry analogy to the RTL case
How the analogy maps to the simple RTL circuit is rather straightforward: the clients map to the inputs of the circuit, and the machines map to the operations that the circuit executes on them. Letting each client use a different machine corresponds to dividing the circuit in two with one further register, as shown in the circuit below, yielding a two-stage circuit that can effectively run faster, as the clock only has to wait for the propagation delay of the slower of the two operations instead of the sum of their propagation delays. Just as each single client takes more overall time to finish their own business, each input now takes longer to come out the other side, twice the slower stage's delay, but the overall throughput of the circuit increases. Feel free to try, for a given input, the version with only one stage and two registers against the new version below with two stages and three registers, and notice that you can allow a faster clock with the latter.
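The clock-period argument can be stated in two lines. The delays below are hypothetical placeholders of ours, not values measured in the simulator:

```python
# Clock-period constraint before and after splitting the circuit in two.
exp_delay, inc_delay = 70, 60  # steps; illustrative assumptions, not measured

one_stage_period = exp_delay + inc_delay      # both ops settle in one cycle
two_stage_period = max(exp_delay, inc_delay)  # each stage settles on its own

print(one_stage_period, two_stage_period)  # -> 130 70
```

With these numbers the clock runs almost twice as fast, while each individual input now spends two (shorter) cycles in flight instead of one long one.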
Graphing the dynamic behaviour of the two-stage RTL circuit
Let us therefore see how to draw the diagram for this case. Let us use the same inputs as last time: 0, 5 and 12. Let us write them down.
- Inspect the clock and the whole chain of signals, set the input, output, and all monitors to decimal, and pause the simulation.
- Set the input to zero and advance the simulation by one clock cycle. We will call this cycle one. You can see that the input zero has propagated through the entire exponentiation stage of the circuit. Therefore, we can say that in cycle one the input zero was processed by the exponentiation.
- Set the input to five and advance the simulation by one clock cycle. Now the input zero has passed to the increment stage, while the input five is in the exponentiation stage.
- Set the input to twelve and advance the simulation by one clock cycle. We can see now that the input zero has finished processing, and the output shows the result one. Both the other inputs have behaved similarly, shifting to the right, with the input 5 now being incremented and the input 12 now being exponentiated.
- Let us do one more cycle. Set the input to 7 and advance the simulation by one clock cycle. The output now shows twenty-six, the result of input 5, while the incrementer is processing the input 12 and the exponentiator is handling the input 7.
We could go on indefinitely in this fashion.
| Input | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|---|---|---|---|---|---|---|---|---|
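The register shifting we just stepped through by hand can be mimicked with a short Python sketch of ours: stage one squares, stage two increments, and two extra cycles drain the pipe.

```python
# Step the two-stage circuit cycle by cycle. 'mid' plays the role of the
# inter-stage register, 'out' the output register.
def run_pipeline(inputs):
    mid = out = None
    outputs = []
    for x in inputs + [None, None]:  # two extra cycles to drain the pipe
        out = mid + 1 if mid is not None else None   # stage 2: increment
        mid = x * x if x is not None else None       # stage 1: square
        if out is not None:
            outputs.append(out)
    return outputs

print(run_pipeline([0, 5, 12, 7]))  # -> [1, 26, 145, 50]
```

Each result appears two cycles after its input entered, exactly as in the walkthrough: 0 yields 1 in cycle three, 5 yields 26 in cycle four, and so on.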
Graphing the dynamic behaviour of a single cycle processor
The question is now: what does this have to do with the processor? As a matter of fact, the circuit that exponentiates and increments its input was also an analogy for a processor: a series of operations (reading and writing memory, ALU operations et cetera) that are executed on an input (an instruction) and synchronized by a clock signal, in a Register Transfer fashion. The very same diagrams that we filled in for the laundry and for the circuit can be filled in for a processor.
We made this whole journey, which started by remembering that logic gates operate with a propagation delay, to reach this point: processors are made of a very large number of those gates, and with so many of them in series their cumulative delay becomes very noticeable. It is the hardware engineer's job to explore ways to make the processor handle more instructions in the same amount of time while preserving functionality. Having already seen the solution in other analogous domains, it is just a matter of understanding how the principle generalizes to the processor.
The diagram for the single cycle processor is very simple, as only one instruction can fit inside it in any given cycle. Below is an example of a simple program. Let us use the acronym "SC", for single cycle, to indicate all the operations that the single cycle processor executes on the input instruction. In the first cycle we expect the first instruction to be processed, in the second cycle the second instruction, and so on until the end of execution. In case the branch is taken once, cycle six will contain the load, and execution continues as before.
Here is a program to use in order to write a simple diagram for the single cycle processor:
- Program:
sw t1, 0(t2)      # M[0xbc] = 0x123
LOOP:
lw a1, 0(t2)      # a1 = M[0xbc]
add a2, a1, a1    # a2 = a1+a1
sw a2, 0(t2)      # M[0xbc] = a2
blt a2, a0, LOOP  # jump if a2<a0
| Input (instruction) | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|---|---|---|---|---|---|---|---|---|
| sw t1, 0(t2) | | | | | | | | |
| lw a1, 0(t2) | | | | | | | | |
| add a2, a1, a1 | | | | | | | | |
| sw a2, 0(t2) | | | | | | | | |
| blt a2, a0, LOOP | | | | | | | | |
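A small Python sketch of ours fills in this diagram automatically, including the branch taken once (so cycle six repeats the load):

```python
# Each cycle, the single cycle processor applies the whole "SC" set of
# operations to exactly one instruction.
program = ["sw t1, 0(t2)", "lw a1, 0(t2)", "add a2, a1, a1",
           "sw a2, 0(t2)", "blt a2, a0, LOOP"]

def trace(program, taken_once=True):
    order = list(range(len(program)))
    if taken_once:            # branch taken once: the loop body runs again
        order += [1, 2, 3, 4]
    return [(cycle + 1, program[i]) for cycle, i in enumerate(order)]

for cycle, instr in trace(program):
    print(f"cycle {cycle}: SC on '{instr}'")
```

Cycle 6 indeed shows the load again, matching the description above.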
Intuition for the multi-stage pipelined processor
Now, the main question that we are going to answer in this lesson is: how do we modify the hardware in order to process more instructions in the same amount of time?
Just as in the two previous analogies, the objective is to divide the single cycle processor into multiple so-called stages, separated by inter-stage registers, so that each contains one instruction completing one well-defined part of the total execution. This way, just as the laundry allowed three clients at any given time to work in parallel on different stages of the cleaning process, the so-called "pipelined" processor allows multiple instructions to be processed at any given time, each completing a different stage of its execution. In the case of the laundry and in the case of the RTL circuit, however, we did not have to identify the discrete operations into which to divide the process, as they were trivial. How to sort the components of the single cycle processor into functionally defined clusters, and therefore how to divide the processor with registers, and how the stages interact with one another and with each type of instruction, are all topics of the next parts of the lesson.
Here is a cheatsheet that compares the case of the laundry to the processor. This is helpful to remember the core concepts.
| CONTEXT | 24/7 Laundry | Single Cycle Processor |
|---|---|---|
| THING TO PROCESS | CLIENT | INSTRUCTION |
| STAGES OF PROCESSING (fake timings) | Wash 1 h 45 min, Dry 1 h 30 min, Iron 2 h | Fetch 4 ns, Decode 1 ns, Reg read 4 ns, Execute 5 ns, Memory 6 ns, Writeback 3 ns (Decode and Reg read share a pipeline stage, giving five stages) |
| WHO ARE WE | The Owners Evelyn and Waymond Wang | R&D Engineers, with the task to make the processor faster |
| CONSTRAINT | Only one client in the room at any given time. The laundry serves one client every 5 h 15 min -> about 4.57 clients a day | Only one instruction in the processor at any given time. The processor serves one instruction every 23 ns -> 43,478,261 instructions per second (a ~43.5 MHz clock) |
| SOLUTION | Let three clients into the room at any given time, each using one machine and waiting for the next machine to free up before using it (moving every two hours, the duration of the longest task). | Divide the hardware of the processor into stages, one for each sub-operation, with inter-stage registers, clocked every 6 ns (the duration of the longest task) |
| CONS | Clients wait 15 minutes after washing before drying, and 30 minutes after drying before ironing, since ironing is the longest task. Each client takes 6 hours instead of 5 h 15 min to wash their clothes. | Every stage except the slowest one stalls for some time after propagating, since all stages must wait for the memory stage, the slowest, to finish before they can all proceed. Each instruction takes longer to process: five times the longest delay (5x6=30 ns) instead of the sum of the delays (23 ns) |
| PROS | The laundry processes one client every 2 hours, the duration of the slowest stage, instead of every 5 h 15 min, the sum of all stages, as three clients at a time are in the laundry IN PARALLEL. The laundry serves 12 clients a day instead of about 4.57 | The processor serves one instruction every 6 ns, the duration of the slowest stage, instead of every 23 ns, the sum of all stages, as five instructions at a time are in the processor IN PARALLEL. The processor serves 166,666,667 instructions per second (a 167 MHz clock instead of 43.5 MHz) |
| TRICK | It is not that three clients are processed as a block, then three others as a block, and so on. Three clients are there at any given moment, but they rotate (first in, first out) | It is not that five instructions are processed as a block, then five others as a block, and so on. Five instructions are there at any given moment, but they rotate (first in, first out) |
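The two clock-rate figures in the right-hand column can be checked with a few lines of Python (a sketch using the fake stage timings from the table, not part of the lesson tools):

```python
# Fake stage delays from the table above (in nanoseconds).
stage_delays_ns = {"Fetch": 4, "Decode": 1, "RegRead": 4,
                   "Execute": 5, "Memory": 6, "Writeback": 3}

# Single cycle: the clock period is the sum of all delays.
single_cycle_period = sum(stage_delays_ns.values())   # 23 ns
# Pipelined: the clock period is the delay of the slowest stage.
pipelined_period = max(stage_delays_ns.values())      # 6 ns

print(f"Single cycle: {1e3 / single_cycle_period:.1f} MHz")  # ~43.5 MHz
print(f"Pipelined:    {1e3 / pipelined_period:.1f} MHz")     # ~166.7 MHz
```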
Wrap Up Exercises
- You worked out during previous exercises a qualitative estimate of the maximum theoretical propagation delay of the Incrementer circuit. You may have noticed that the circuit is actually an 8-bit adder with one "hidden" input set to one. Did this influence your estimate of the total propagation delay? Discuss the difference in expected actual propagation delay, if any, between a normal adder and the incrementer.
Please write your answers on a document, to then submit as PDF in the "Assignments" section. Mind that all answers from the whole Circuit Timing section are to be written on the same document. Feel free to add images to your answers, if it helps to explain something more concisely.
Building the Pipelined Processor
We are now going to build the pipeline lego-like, starting from the components of the single cycle processor. We will study the advantages as well as the compromises of this new paradigm.
Make sure you have read THIS in order to setup Ripes properly and know how to use it.
Index
- Grouping the Components of the Single-Cycle Processor into Stages
- An Overview of how Each Instruction Activates Each Stage
Grouping the Components of the Single-Cycle Processor
This part of the lesson deals with re-arranging the components of the single cycle processor, together with all the modifications to the existing signals that we have to make in order to maintain the same exact functionality and avoid runtime execution errors.
Recap of the computation flow of a processor
Let us build the actual pipelined processor. As we anticipated during the last part of the lesson, the point is to divide the components of the processor by function, and to divide the clusters with registers. Let us then start with a quick high level recap of the flow of the instruction in the processor.
- The processor calculates the new program counter value, gives it as the address input to the instruction memory, and retrieves the corresponding instruction. We will call this function "Instruction Fetch", or 'IF'.
- The processor decodes the values of all the instruction's fields, which includes retrieving the values of registers from the register file. We will call this function "Instruction Decode", or 'ID'.
- The processor selects the correct inputs and the correct alu function, to make the calculation associated with the current instruction. The branch instructions will also use the Branch Comparison chip. We will call this function "Execution", or 'EX'.
- If applicable to the current instruction, the processor interacts with the data memory to either read it or write it (think of the load and store instructions). We will call this function "Memory", or 'MEM'.
- Again if applicable, the processor writes the results of the Execution or of the Memory operations to the register file. We will call this function "Write Back", or 'WB'.
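As a toy model (hypothetical and heavily simplified, with invented names), the five functions above can be sketched as one Python routine that a single-cycle processor would run, in full, every clock cycle:

```python
def step(state):
    """One single-cycle step: all five functions happen in the same cycle."""
    # IF: fetch the instruction at the current program counter.
    instr = state["imem"][state["pc"]]
    # ID: decode the fields and read the source register (pre-decoded here).
    op, rd, rs1, imm = instr
    src1 = state["regs"][rs1]
    # EX: perform the calculation (this toy only knows 'addi').
    assert op == "addi"
    result = src1 + imm
    # MEM: no loads or stores in this toy, so nothing to do.
    # WB: write the result back to the register file.
    state["regs"][rd] = result
    state["pc"] += 4  # instructions are 4 bytes wide

state = {"pc": 0, "imem": {0: ("addi", 2, 8, 4)}, "regs": [0] * 32}
state["regs"][8] = 10          # pretend s0 holds 10
step(state)                    # addi sp, s0, 4
print(state["regs"][2])        # 14
```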
Building the pipeline piece by piece
Now let us build each stage, that is each of the new "areas" of the processor, constituted by the components grouped by function and divided by registers. In doing so, we can compile the table below, which keeps track of all components and connections for each stage.
| Stage | Which components are in this stage? | Which signals come from the following stages? | Which signals go towards previous stages? | Which signals go into the interstage register towards the next stage? |
|---|---|---|---|---|
| IF | ||||
| ID | ||||
| EX | ||||
| MEM | ||||
| WB |
Let us follow the same points we laid out in the recap above.
- IF Stage: The first area contains the components that perform the "Instruction Fetch" function: the program counter register, the mux that selects its next value, the plus-four adder and, of course, the instruction memory. The other chips, like the decode chip, are part of the next stages, as they decode the instruction rather than just fetch it. Now that we have the components, we ought to recognize which signals go into the interstage register towards the next stage (all the signals that are needed in the following stages), and which signals come from one of those following stages (usually needed when a signal that controls one of the components in this stage is calculated in a later one). Since this is the first stage, no signal goes towards previous stages; such signals will appear in the following stages, which send back the signals that are calculated after they are needed. This will become clearer as we go along.
  - Of the outputs of the components of this stage, all three are needed by the following chips: the instruction signal by the decode chip, the program counter signal by one of the alu operator muxes, and the program counter plus four signal by the mux that selects the register file write data. They therefore all go into the interstage register, which, since they are all 32-bit signals, will be 96 bits wide.
  - Of the inputs of the components, only one is covered neither by the components themselves nor by signals coming from the previous stage: the input of the mux that selects the next program counter value, linked to the alu result. For example, a branch instruction whose branch is taken needs the alu to calculate its branch address, to feed the program counter. This is an example of a signal that is calculated in a later stage and sent back to a previous one.
- ID Stage: The second area contains the components that perform the "Instruction Decode" function. This means firstly extracting the opcode and the source register indexes, for which we need the decode chip; secondly retrieving the values of said registers, so we need the whole register file; and finally extracting the immediates, if any, so we need the immediate chip.
  - Of the outputs of these components, all three are needed by the following chips: the two source register values are needed as inputs of the alu and of the branch comparison chip (the value of source register 2 also by the data memory), while the immediate is only needed as an alu input. Note that since the program counter signal is not used here, it just has to be passed through towards the next stage. The same goes for the program counter plus four signal.
  - Of the inputs of the components, again only one is not covered by the components themselves: the register file data in, calculated all the way back at the last mux. It is calculated at the end of all other operations because many different signals can be written to a register: arithmetic instructions write the alu result, loads write the memory output, jumps write the program counter plus four signal. One observation has to be made, however: when writing the register file, the data in signal and the write index signal must refer to the same instruction. This is not an issue in the single cycle processor, as the whole processor processes a single instruction at any given time. But we are dividing the processor into stages so that each stage can contain an instruction, processing multiple instructions at the same time. With the register file and the mux that selects the data in placed in two different stages, when the register file is written the instruction sitting in its stage will be different from the one the data in signal refers to. How do we make sure that the write index signal also refers to that same instruction, and not to the one currently in the decode stage? The solution is to cut the write index signal and pass it on to the interstage register, towards the next stage, and then through all the stages until the stage where the data in signal is calculated, to be finally 'passed back' together with it. This ensures that the two signals refer to the same instruction when writing the register. It also means that, although we are placing the register file in this decode stage, functionally it is also part of the stage that calculates the data in signal, which we have yet to build. We can say that the register file chip is part of the decode stage when reading from it, and part of that other stage when writing to it.
- EX Stage: The third area contains the components that perform the "Execution" function: the alu, the two muxes that select its inputs, and the branch comparison chip.
  - Of the outputs of these components, the alu result and the value of source register 2 go into the interstage register towards the next stage, together with the write index signal and the program counter plus four signal, which are not used here. The alu result is also passed back to the fetch stage, where it is used as a possible next program counter value, as we already said.
  - Of the inputs of the components, all are covered by the components themselves or by signals passed from the previous stage, so no external ones are needed.
- MEM Stage: The fourth area contains the components that perform the "Memory" function, which is just the data memory.
  - Of its outputs, the only one, the data out signal, is passed to the next stage, together with the alu result signal (which, besides being used here as the address input of the data memory, is also used as an input of the register file write data selector mux) and together with the write index signal and the program counter plus four signal, which are again not used here.
  - Of its inputs, both are covered by the signals coming from the previous stage, so no external ones are needed.
- WB Stage: Finally we get to the last stage, consisting of the only remaining component, the register file write data selector mux, which simply passes its output and the write index to the decode stage to perform the write operation. Remember that functionally the register file is also part of this stage, as its write function is controlled from here.
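The write-index bookkeeping described for the ID stage can be sketched like this (invented names; the 'value' is precomputed here, standing in for whatever the EX or MEM stage would produce):

```python
def run(instrs, regs):
    # Contents of the EX, MEM and WB pipeline registers: (rd, value) or None.
    ex, mem, wb = None, None, None
    for incoming in instrs + [None, None, None]:  # bubbles drain the pipeline
        if wb is not None:
            rd, value = wb       # index and data travel together, so they
            regs[rd] = value     # always refer to the SAME instruction
        ex, mem, wb = incoming, ex, mem  # advance all pipeline registers
    return regs

regs = [0] * 8
run([(1, 10), (2, 20), (3, 30)], regs)  # three back-to-back writes
print(regs)  # [0, 10, 20, 30, 0, 0, 0, 0]
```

Because the (rd, value) pair advances as one unit through every pipeline register, the write index can never drift apart from the data it belongs to.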
We now have a rudimentary pipelined processor. In the next part of the lesson, we will briefly see how each type of instruction activates each of its stages.
As an assignment you will be asked to load the same processor on Ripes but with the Extended layout, notice all the chips and signals that were previously hidden, and place them in the correct spot in the various stages, completing the table.
Wrap Up Exercises
- Where, in the groups we just defined, do the components that were hidden in the standard visualization of the processor (but shown in the extended visualization) belong? Complete the table above with all the new components and signals.
If needed, here you can find a sketchable screenshot of the extended single cycle processor.
- Synthesize a circuit in SHEAS that receives in input any RV32I instruction in hex form and outputs a single bit flag that indicates whether or not the input is a jump instruction.
Please write your answers on a document, to then submit as PDF in the "Assignments" section. Feel free to add images to your answers, if it helps to explain something more concisely.
An Overview of how Each Instruction Activates Each Stage
This part of the lesson explains how every stage is activated by each type of instruction. In order to do that we will use a simple list of unconnected instructions:
Program
lui a5 30
addi sp, s0, 4
srai sp, s0, 4
sub ra, sp, gp
blt x12 x10 -12
lw a1 4(t2)
sw a1 8(t2)
jal a5 30
Ripes will be used for the simulation.
Video Transcript
We will now focus on how each type of instruction activates each stage. Please load the 5 stage pipelined processor without Hazard Detection and without Forwarding, in Extended mode, with no initializations, and copy the program in the editor tab. Let us now start the simulation and consider the fetch stage.
-
The fetch stage contains the circuit that calculates the next Program counter value, choosing between the current value plus four (which means fetching the instruction immediately following the current one, as instructions are 32 bits, or 4 bytes, wide) and the alu result, which for example holds the destination address of branch and jump instructions. Mind however that the signal that actually makes the choice is calculated in the execution stage. Therefore we can say that although topologically this circuit belongs to the Fetch stage, it functionally belongs to the execution stage, as that is where it is controlled from. As for the instruction, in this stage it has not been decoded yet, so it has no semantic meaning for the processor. This means that each instruction is treated the same way: in this stage it is just a hex number. For each instruction we can verify that its hex code is matched, at the output of the instruction memory.
-
Starting with the decode stage, not all the signals will be used by each instruction. For example, the load upper immediate instruction here does not feature source register fields, and yet we see some values nonetheless. That is because the very bits that in other instructions decode into the two source registers are used here for something else: the immediate. The processor ignores the values that have nothing to do with the current instruction; it just does not use them. Therefore, we will ignore them too, but you have to know the Instruction Set to know what to ignore and what to check. This is particularly important because it is here, in the decode stage, that all the signals that control the other stages are calculated. Here on the right you see the cheat sheet that you can find at the end of the lesson menu.
-
For the 'load upper immediate' instruction we can check the address of the destination register, the immediate, which is shifted by 12 positions as per instruction specification, and the following flags of the Control Unit from top to bottom:
- The register file write enable which is enabled, since this instruction will eventually write to a register.
- The write data select signal, which selects the alu result, which in this case will just contain a copy of the immediate. Remember that the register file is written later, from the write back stage, through the write enable signal, the write address signal, and the write data signal. This means that the Register File chip, although topologically part of the decode stage, is functionally part both of the decode stage itself for its read function, and of the write back stage for the write function.
- The next five signals are only for memory and branch or jump instructions, so we ignore them.
- The next three signals select the first source register and the immediate as alu inputs, and 'load upper immediate' as alu function.
- We also ignore the last signal as it is used only by branch instructions.
-
Next cycle. -
For the 'add immediate' instruction we can check the addresses of source register 1 and of the destination register, the immediate, and the following flags of the Control Unit from top to bottom:
- The register file write enable which is enabled, since this instruction will eventually write to a register.
- The write data select signal, which selects the alu result, which in this case will just contain the sum of the immediate and the content of source register 1.
- These three signals again select the first source register and the immediate as alu inputs and 'add' as alu function.
-
Next cycle. -
For the 'shift right arithmetic' immediate instruction, the analysis is mostly similar to the last instruction, except for the alu function, which is 'shift right arithmetic'.
-
Next cycle. -
For the subtraction instruction we can check the address of the source registers, and destination register. The analysis of the control unit is the same as the shift right immediate instruction, except the second alu input is selected to be the content of source register 2, instead of immediate, and the alu function is 'subtract'.
-
Next cycle. -
For the 'branch less than' instruction we can check the address of the source registers, the immediate, and the following flags of the Control Unit from top to bottom:
- The register file write enable which is disabled, since this instruction will not eventually write to a register, or do anything really in the memory and write back stages.
- We ignore the write data select and the three memory signals for this very reason.
- We ignore the do jump signal, but we check the do branch signal being enabled.
- The next three signals select the program counter and the immediate as alu inputs, and 'add' as alu function, as the alu will contain the destination address as a relative offset, from the current value of program counter.
- The last signal selects 'less than' as branch comparison.
-
Next cycle. -
For the load instruction we can check the address of the source and destination registers, the immediate, and the following flags of the Control Unit from top to bottom:
- The register file write enable, which is enabled since this instruction will eventually write to a register.
- The write data select which is set to the memory output, as the value to be written to the register file will be the one retrieved from memory.
- The next three signals instruct the memory not to write, but to read an entire word from memory.
- We ignore the do jump and do branch flag.
- The next three signals select the first source register and the immediate as alu inputs, and add as alu function, as the alu will contain the memory load address as a relative offset, from the value of the source register.
- We also ignore the last signal as it is used only by branch instructions.
-
Next cycle. -
For the store instruction the analysis is similar to the load, except the memory is instructed to write a word, instead of reading one.
-
Next cycle. -
Finally for the 'jump and link' instruction we can check the destination register and the immediate, and the following flags of the Control Unit from top to bottom:
- The register file write enable, which is enabled, since this instruction will eventually write to a register.
- The write data select which is set to the Program counter plus four signal, as that is the value to be saved to the register file when a jump is done. The reasons have to do with how calls in operating systems are implemented, but this is a matter for another course.
- We ignore all the next signals except the do jump flag, which is lit, and the three flags that select the program counter and immediate as alu inputs, and 'add' as alu function, as this is how the jump address is calculated, as an offset from the program counter.
This was all of the decode cycle. Onto the next stages, we are going to go faster as most of the functionality has already been decided.
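As a compact summary (with invented flag names, not Ripes' exact signal labels), the decode-stage flags checked above for each instruction type can be written down as a table:

```python
# Hypothetical control table: one entry per instruction type from the program.
CONTROL = {
    "lui":  dict(reg_write=True,  wb_select="alu",  mem_op=None,    alu_op="lui"),
    "addi": dict(reg_write=True,  wb_select="alu",  mem_op=None,    alu_op="add"),
    "srai": dict(reg_write=True,  wb_select="alu",  mem_op=None,    alu_op="sra"),
    "sub":  dict(reg_write=True,  wb_select="alu",  mem_op=None,    alu_op="sub"),
    "blt":  dict(reg_write=False, wb_select=None,   mem_op=None,    alu_op="add", do_branch=True),
    "lw":   dict(reg_write=True,  wb_select="mem",  mem_op="read",  alu_op="add"),
    "sw":   dict(reg_write=False, wb_select=None,   mem_op="write", alu_op="add"),
    "jal":  dict(reg_write=True,  wb_select="pc+4", mem_op=None,    alu_op="add", do_jump=True),
}

# Branches and stores are the only types that never write the register file.
print([op for op, c in CONTROL.items() if not c["reg_write"]])  # ['blt', 'sw']
```

Note how 'add' dominates the alu_op column: branches, jumps, loads and stores all reuse the adder for address arithmetic.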
-
-
The execution stage is where all actual calculations take place. This means either alu functions or branch comparisons. Note that here is also where the Program counter select signal is calculated, for controlling the program counter mux in the fetch stage. Now for each instruction we will see what these operations mean. Mind that these are not the only ones, but there is at least one representative of each major category.
-
The 'load upper immediate' instruction uses the alu, which has as inputs zero and the already shifted immediate. The alu function is 'load upper immediate', which simply passes the second alu operator, the immediate, to the output. That is the value that will be written to the destination register, x15 in this case.
-
Next cycle. -
The 'add immediate' instruction also uses the alu, with inputs the content of source register 1, which is x8 in this case, and the immediate, in this case 4. The alu function sums the two, and the result is the value being written to the destination register, x2 in this case.
-
Next cycle. -
The 'shift right arithmetic immediate' instruction uses the execution stage, just like the 'add immediate' instruction, besides the alu function which is 'arithmetic shift right', which shifts right with most significant bit extension, thus maintaining the sign of the operand.
-
Next cycle. -
The subtraction instruction uses the alu, with inputs the two source registers, x2 and x3 in this case. The alu function is 'subtract' and the result is the value being written to the destination register, x1 in this case.
-
Next cycle. -
The 'branch less than' instruction uses the alu, the branch comparison chip and the program counter select circuit. The alu inputs are the program counter and the immediate, which represent respectively the value of the program counter from when the branch instruction was in the fetch stage, which is 0x10 instead of the current 0x18, and the offset from it that the branch is targeting, that is -12, or three instructions back. Note that saying "the value of the program counter from when the instruction was in the fetch stage" is the same as saying "the instruction address". The alu function is 'add' and the result is the branch destination address. The branch comparison chip takes as inputs the comparison type, in this case 'less than', and the values of the two source registers. Its result is a flag that signals whether the comparison is satisfied or not. Its name is 'branch taken' because when the comparison is satisfied the branch is considered taken, and the mux selector in the fetch stage must select the destination address. Let us then study up close the program counter select circuit, which is made of one 'and' and one 'or' gate. If we look at the names of the signals, we see that the resulting flag, the program counter select, is activated only if, case 1, the instruction is a jump instruction, or, case 2, it is a branch instruction and at the same time the branch is taken. If you initialized no registers at the beginning of the simulation, the branch should not be taken right now, as the values of both source registers in input to the branch comparison chip are zero, thus the first one, x12, is not less than the second one, x10. Just for fun we can rewind the simulation by one cycle, and create a case in which the branch is taken, for example by setting the x10 register to the value 2.
Now, back at cycle 6 we can see that the program counter selector is activated, and the next program counter value is going to be the current alu result, which is 4. Now go back again one cycle and reset x10 to zero, so we can see the next instructions passing by the stage.
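The program counter select circuit just described (one 'and' gate feeding one 'or' gate) boils down to this boolean expression:

```python
def pc_select(do_jump, do_branch, branch_taken):
    # The 'or' gate, with the 'and' gate combining do_branch and branch_taken.
    return do_jump or (do_branch and branch_taken)

print(pc_select(do_jump=False, do_branch=True, branch_taken=False))   # False: keep PC+4
print(pc_select(do_jump=False, do_branch=True, branch_taken=True))    # True: take the branch
print(pc_select(do_jump=True,  do_branch=False, branch_taken=False))  # True: jumps are unconditional
```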
-
Next two cycles. -
Both the load and store instructions use the alu only, with inputs the content of the first source register and an immediate. The function is 'add' and the result is the address of the data memory that they target, just one for reading from it and the other for writing to it.
-
Next cycle. -
Finally the 'jump and link' instruction uses the alu much like the branch instruction, therefore calculating the sum of program counter and an offset, but bypasses the branch taken signal, activating the program counter select flag, in order to jump with no conditions.
This was the execution stage.
-
-
The memory stage is only used by the load and store instructions, which are the only ones to interact with the data memory. You can recognize the data memory not being used when the value of the memory operation flag is 'nop'. With the load instruction its value is 'load word', and with the store instruction it is 'store word'. For both we can use the Memory tab to check that memory is being used correctly: in the case of the load, we can just check the value of the memory cell corresponding to the address, in this case 2, whereas for the store, we have to check that the input data, in this case zero, is written to the memory cell of the input address, in this case 8, but at the next cycle. This is because, as with any register-based circuit, a memory shows the effect of an input and a write flag one cycle after receiving them.
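The "write shows up one cycle later" behaviour can be modeled with a small sketch of a clocked memory (an invented class, not Ripes code):

```python
class ClockedMemory:
    """A memory whose writes only become visible on the next clock edge."""
    def __init__(self):
        self.cells = {}
        self.pending = None   # write captured this cycle, applied on the edge

    def access(self, addr, data=None, write=False):
        out = self.cells.get(addr, 0)  # a read always sees the OLD contents
        if write:
            self.pending = (addr, data)
        return out

    def clock_edge(self):
        if self.pending is not None:
            addr, data = self.pending
            self.cells[addr] = data
            self.pending = None

mem = ClockedMemory()
mem.access(0xBC, data=5, write=True)   # store happening in the MEM stage...
print(mem.cells.get(0xBC, 0))          # 0: not visible yet this cycle
mem.clock_edge()
print(mem.cells.get(0xBC, 0))          # 5: visible at the next cycle
```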
-
Finally we come to the write back stage, which is only used by the instructions that write one of the registers, which we can recognize by the activated register file write signal. The exact register address is given by the write index signal.
- The 'load upper immediate' instruction has the signal lit, so it writes the alu result, which in this case represents the shifted immediate. The write address is in this case 15.
- In the next three cycles, the 'add immediate' instruction, the 'shift right arithmetic immediate' instruction and the subtraction instruction all do the same as the 'load upper immediate' instruction, with the alu result being the result of the respective operations.
Next cycle.- The branch doesn't do anything in the write back stage, and we can see the disabled register file write signal.
Next cycle.- The load instruction writes to the x11 register the output of the data memory.
Next cycle.- The store doesn't do anything in the write back stage, and we can see the disabled register file write signal.
Next cycle.- Finally the 'jump and link' instruction writes the program counter plus four signal to register 15.
With this, the overview of the impact of each instruction on each stage is concluded.
Finally, let us see how to graph the execution of the program we just followed. Ripes does it for us with this button, but we have to know how to read the diagram.
- Rewind the whole program. If we press the pipeline diagram button now, we can verify that the pipeline, at cycle zero, only contains the load upper immediate instruction, in the fetch stage.
- If we do the same at the next cycle, we can verify that the pipeline, at cycle one, contains the load upper immediate instruction, in the decode stage, and the add immediate instruction, in the fetch stage.
- If we do that one last time at the next cycle, we can verify that at cycle two, the load upper immediate instruction is in the execution stage and the other two instructions are in the two following stages.
The diagram is constructed like this for all the following cycles.
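The reading rule for the diagram can be sketched in a few lines: with no stalls, instruction i sits in stage (c - i) at cycle c.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
program = ["lui", "addi", "srai", "sub", "blt", "lw", "sw", "jal"]

def stage_at(instr_index, cycle):
    """Which stage instruction instr_index occupies at the given cycle."""
    s = cycle - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else "--"

for c in range(3):
    row = [stage_at(i, c) for i in range(len(program))]
    print(f"cycle {c}: {row}")
# cycle 0: lui in IF; cycle 1: lui in ID, addi in IF; and so on.
```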
Towards Hazards
Finally, in the next two sessions we will see the dynamic execution of two simple assembly programs on Ripes.
Make sure you have read THIS in order to setup Ripes properly and know how to use it.
Index
DEMO - Simple Meaningless Program
This part of the lesson will deal with the execution of a meaningless program. Since it is meaningless, don't try to understand what it does; focus on how instructions communicate with one another when they are together inside the pipeline.
Program
- Ripes Processor: 5-stage processor (w/o forwarding or hazard detection), Extended
- Registers to Initialize: a0 = 0x5
- Program:
sw a0 0xbc(x0) # M[0xbc]=a0
li a1 0x7 # -> a1 = 7
li a2 0x3 # -> a2 = 3
lw a3 0xbc(x0) # a3=M[0xbc]
LOOP:
add a4 a0 a1 # -> a4=a0+a1
sub a5 a1 a2 # -> a5=a1-a2
add a4 a3 a3 # -> a4=a3+a3
slli a3 a3 1 # -> a3=a3<<1
bne a5 x0 LOOP # Branch if a5 is different from zero
Video Transcript
The first program actually does nothing meaningful, which helps us focus on how it flows through the processor. We will comment on each cycle of the execution. I suggest pausing the video whenever you feel like you are ready to analyse the next step yourself, and then let the video continue to verify your work. For the Fetch stage we will focus on the choice of the next PC value, for the Decode stage we will focus on the flags that influence the behaviour of the other stages, for the execution stage we will focus on abstracting the alu function, in relation to the instruction function, while in the last two stages we will focus on the completion of the processing of each instruction.
Side note: it is very important that you know hex values and register ABI names by heart, as they will be used interchangeably for naming registers. For example, values in the visualisations will often be in hex, while the register file on the right will have both decimal and ABI representations, and the assembly code will use ABI register names. Make sure to know them well, so as not to get confused.
First of all:
- Load the 5 stage processor without forwarding or hazard detection with the extended layout.
- Initialise register a0 to the value 5.
- Copy and paste the program below in Ripes' Editor tab.
Let us first understand what the program does, even if it is meaningless. The first instruction stores the value of a0 into the memory cell with address "bc". The second and the third instructions load 7 and 3 into a1 and a2 respectively. The fourth loads the content of the memory cell "bc" into a3. At this point we expect a3 to hold the same value as a0. The next four instructions are simple arithmetic operations. At the end of each we should see, respectively, a4 equal to 12, a5 equal to 4, a4 equal to 10, and a3 equal to 10, because shifting left by 1 position is equivalent to multiplying by 2. The last instruction branches back to the label if a5 is different from zero. Since neither a1 nor a2 is ever written, a5 will remain their difference, so we expect the branch to keep being taken forever, while both a3 and a4 keep being doubled at each loop iteration.
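These naive expectations (each instruction fully completing before the next starts) can be double-checked in plain Python; the pipeline without forwarding will not actually produce these values, which is exactly what the upcoming hazard discussion is about:

```python
a0 = 5
mem = {0xBC: a0}       # sw a0 0xbc(x0)
a1, a2 = 7, 3          # li a1 0x7 ; li a2 0x3
a3 = mem[0xBC]         # lw a3 0xbc(x0) -> same value as a0
a4 = a0 + a1           # add a4 a0 a1   -> 12
a5 = a1 - a2           # sub a5 a1 a2   -> 4
a4 = a3 + a3           # add a4 a3 a3   -> 10
a3 = a3 << 1           # slli a3 a3 1   -> 10 (shift left by 1 doubles it)
print(a3, a4, a5)      # 10 10 4
```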
Let us now comment the execution, cycle by cycle.
Cycle Zero
The first instruction is in the Fetch Stage. We can see that the instruction in hexadecimal coming out of the instruction memory matches that of the first instruction. This can be done with all the following instructions. As we can see from this mux, the next program counter value will be the current one plus four, which means that the instruction immediately following the store will be loaded, in this case the load immediate, or addi. This will not always be the case: we will see what happens when the branch reaches the ex stage, which is where this mux that chooses the next program counter is controlled from.
Cycle One
For the Fetch stage the same considerations apply as before. Meanwhile, the Decode stage now contains the store instruction. I remind you that not all the signals in this stage are used by each instruction. For example, the store instruction does not have a destination register field, as it uses that same bit range to encode part of the immediate, but we can see some value here nonetheless. The processor ignores the values that have nothing to do with the current instruction, by just not using them. Therefore, we will ignore them too, but you have to know the Instruction Set to know what to ignore and what to check. For the store instruction, we can verify that source register 1, which is x0, and source register 2, which is x10, are indeed retrieved from the register file. At the beginning of the program we expect x10 to contain the value 5, and we can verify it here. The immediate output also corresponds to the expected value of 188. We can also verify that the register file write enable is not activated, as the store instruction does not have to write to the register file later in the write back stage. On the other hand, the data memory write enable is active, as the store instruction will write to data memory in the memory stage. The other lit signal is the second operator selector for the alu, which will be used in the next cycle to select the immediate. Finally, one of the signals that is always worth checking is the alu control, which now has the value "add". The store instruction uses the addition function of the alu because it calculates the address for writing to data memory by adding an offset, in this case 188, to the value of a register, in this case x0.
Cycle Two
The execution stage now contains the store instruction and, as expected, the alu is summing the immediate 188 with the content of x0, which is always zero, resulting in 188. This value will be used to address the data memory during the next cycle.
The decode stage contains the load immediate instruction: as before, we only have to check the relevant signals. In this case: the register file write enable, as this instruction will write 7 into register x11 in the write back stage; the immediate 7 and the flag that lets the alu use it; and the alu control, which is once again set to "add". Finally, the write back select signal is set to "alu result". That is because in the write back stage the value that has to be written, in this case to x11, is the result of the sum of 0 and the immediate 7.
As an example of why we ought to ignore the non-relevant signals, the source register 2 index signal contains the value 7. This has nothing to do with the immediate, only with its encoding: that very same subset of bits, which in other instructions decodes to the source register 2 index field, is used by the addi instruction to encode the immediate 7.
Cycle Three
The store instruction is now in the Memory stage, and it is here that it will complete its execution: unlike most other instructions, it will do nothing in the next stage. Here we can see the memory write enable being used, after being calculated in the decode stage two cycles ago. From the decode stage also comes the data in, which is just the content of source register 2, and we can confirm it is the value 5, while, as already said, the address comes from the execution stage, as it is just the alu result.
However, if we look in the memory tab for the memory cell 0xbc, we see that it does not contain the value 5 just yet. This is because, as with any register (and memory is built from registers), the written value only becomes visible at the next cycle.
We can verify in the execution stage that the alu is indeed summing the immediate with the value zero.
We are not going to repeat anything about the fetch stage, as it contains the very same type of instruction as last cycle. You can use it as exercise also for the next three cycles.
Cycle Four
Here in the write back stage the instruction does nothing, but we can now verify in the Memory tab that it has actually written the value 5 into memory cell 0xbc.
The instruction in the memory stage also does nothing, as the addi instruction has nothing to do with the data memory.
We are not going to repeat anything about the execution stage, as it contains the very same type of instruction as last cycle. You can use it as exercise.
Finally, the decode stage contains the load: we can see that it will write to the register file in a few cycles, that it uses the immediate 0xbc in the next stage and that it reads from data memory.
Cycle Five
Here in the write back stage the add immediate instruction finally completes its processing, by selecting the alu result, which two cycles ago was calculated as the sum of 0 and 7, and writing it to the register file at index 0xb. Mind that at the same time another instruction is in the decode stage, doing something else in parallel: since it is an add instruction, it is reading two registers from the register file. This instruction is the first of a streak of 4 arithmetic operations, two examples of which you have already seen, so we will focus less on the details from now on.
Analysing the assembly, we found that we expect this add instruction to retrieve register a0, with value 5, and register a1, with value 7. However, on the right we can see that register a1 is still zero, since the value is being written in this cycle and will only be visible from the next one. Fortunately, the implementation of the register file of this specific processor allows a register to be read in the same cycle it is being written, so with input 0xb, which corresponds to a1, we can still see the value 7. But what if this were not the case, or worse, what if the previous instruction, the load, had needed the value of a1? The value was not ready at the previous cycle. We will cover this case in the next part of the lesson, so remember it.
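The read-during-write behaviour described above can be sketched in a few lines (a toy model of our own, not Ripes' actual implementation):

```python
# Toy register file with "read-during-write" forwarding: a value being
# written in this cycle is already visible to a read of the same index.

class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32

    def read(self, idx, write_idx=None, write_data=None, write_enable=False):
        # If the same register is being written this cycle, return the
        # incoming value instead of the stale stored one (x0 stays zero).
        if write_enable and write_idx == idx and idx != 0:
            return write_data
        return self.regs[idx]

    def clock(self, write_idx, write_data, write_enable):
        # The stored value only updates at the clock edge.
        if write_enable and write_idx != 0:
            self.regs[write_idx] = write_data

rf = RegisterFile()
# Cycle 5 of the transcript: the addi is writing 7 to x11 (a1)
# while the add is reading x11 in the decode stage.
print(rf.read(11, write_idx=11, write_data=7, write_enable=True))  # 7
```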
So the add instruction retrieves the correct values of 5 and 7, and will write their sum in register 14 in the write back stage.
Cycle Six
The next sub instruction needs the values of registers a1 and a2, which are 7 and 3 respectively. Again, while a1 is ready from the previous cycle, a2 is being written to the register file right now, as you can see, but thanks to the implementation of this specific register file it can still be used by the sub instruction.
Cycle Seven
The same happens now, when the load instruction is writing 5 to register a3 from the write back stage, and the instruction in the decode stage needs to use it.
Cycle Eight
None of the add and sub instructions do anything in the memory stage, and they complete their execution in the write back stage. Now, the first instruction of the streak is here in the write back, and all the others will complete here in the next cycles in the same way.
Cycle Nine
Now the branch instruction is in the decode stage and needs the value of a5, which again is being written at this same time with the value 4. Since the branch is the last instruction of the program, the fetch stage remains empty. For confirmation, we can see the program counter value in the fetch stage being 24, which is the address of the (empty) slot just after the branch. If you manage to sneak a malicious instruction there, congratulations, you have created a very rudimentary virus.
Anyway, let us focus on the decoding of the branch instruction. The three lit flags are the do branch flag, the flag that switches the first alu input to the program counter, and the flag that switches the second alu input to the immediate. The alu operation is again "add", as in the next cycle the branch instruction will sum the program counter and the negative immediate, to obtain the address of the instruction to jump backwards to.
Remember that the difference between jump and branch is that the jump requires no condition, while the branch does.
Cycle Ten
The first two stages are empty. The next instruction to fetch is being calculated in this cycle in the execution stage, and depends on whether the branch is taken. You can see the 'branch taken' flag being activated, and ultimately selecting the result of the alu as the next program counter, back in the fetch stage. The new value of the program counter is the branch instruction's address minus 16, which is the distance in bytes between the branch instruction and the targeted label. Note that the address of any instruction is the same value as the program counter taken from when the instruction was in the decode stage: they are the same thing. The branch instruction will do nothing in the next two stages. At the next cycle, the instruction whose address is the new program counter value will be retrieved: in this case, the first instruction of the loop.
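The target calculation is just an addition of the branch's own address and its signed immediate. A minimal sketch (the function name and the example address are ours, chosen for illustration):

```python
# Sketch of the EX-stage branch target calculation described above.

def branch_target(pc: int, imm: int) -> int:
    """Branch/jump address: PC of the branch plus its signed immediate."""
    return (pc + imm) & 0xFFFFFFFF

# A backwards branch with immediate -16, as in the transcript
# (0x20 as the branch's own address is a made-up example).
print(hex(branch_target(0x20, -16)))  # 0x10
```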
From now on the execution will continue in the same way indefinitely.
DEMO - Program Towards Hazards
This part of the lesson will deal with the execution of an actual meaningful program: try to always keep in mind what the instruction is doing and what configuration it expects when it starts executing.
Program
- Ripes Processor: 5-stage processor (w/o forwarding or hazard detection), Extended layout
- Registers to Initialize:
  t1 = 0x123
  t2 = 0xbc
  a0 = 0x00000fff
- Program:
  sw t1 0(t2)     # M[0xbc] = 0x123
  LOOP:
  lw a1 0(t2)     # a1 = M[0xbc]
  add a2 a1 a1    # a2 = a1+a1
  sw a2 0(t2)     # M[0xbc] = a2
  blt a2 a0 LOOP  # jump if a2<a0
Video Transcript
Now onto the next demo:
- Load the 5-stage processor without forwarding or hazard detection, with the extended layout. This is the same processor as in the last demo.
- Initialise register t1 to the value 0x123, t2 to the value 0xbc, and a0 to the value 0x00000fff.
- Copy and paste the program below in Ripes' Editor tab.
Let us first understand what the program does. The first instruction stores the value of t1 into the memory cell with address 0xbc, which, unlike in the last program, is taken from a register instead of from an offset. The second instruction loads that value into register a1. The next instruction sets a2 to two times a1, and then the program stores this value back into the same memory cell 0xbc. Then the program jumps to the start of the loop if a2 is less than a0. Basically, the program doubles the value inside the memory cell at each loop iteration, until it becomes greater than or equal to a0. By changing the initialization of a0 we can control how long the program will run (a bigger a0 means it runs for more cycles).
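If it helps, here is a plain-Python model of what the loop computes (the variable names follow the assembly comments; the function itself is our own sketch):

```python
# Plain-Python model of the demo program's semantics, line by line.

def run_program(t1: int, a0: int) -> int:
    mem = {0xbc: t1}            # sw t1 0(t2): M[0xbc] = t1  (t2 = 0xbc)
    while True:
        a1 = mem[0xbc]          # lw a1 0(t2)
        a2 = a1 + a1            # add a2 a1 a1
        mem[0xbc] = a2          # sw a2 0(t2)
        if not (a2 < a0):       # blt a2 a0 LOOP
            return mem[0xbc]

# With the lesson's initialization, the value doubles until it reaches a0:
print(hex(run_program(0x123, 0x00000fff)))  # 0x1230
```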
Onto the program, let us focus mainly on the decode stage.
Cycle Zero to Cycle Two
In the decode stage, both the first and second instructions need registers which were initialised, x6 and x7, or t1 and t2. We can verify their values. Let us also keep in mind that the load instruction will write register x11 in the write back stage, which means in cycle five.
Cycle Three
From the analysis of the assembly code, we saw that the add instruction, which is now in the decode stage, uses the value of a1, or x11, which is supposed to be set by the load instruction to 0x123. However, we can see that the value being retrieved for register index 0x0b, which is x11, is zero. That is because, as we said for the previous cycle, the load instruction will write the a1 register in cycle five, that is, in two cycles. By then the add instruction will be in the memory stage, and can no longer use the alu to calculate its sum.
Towards Hazards
Imagine being the engineers who invented the pipeline. You have just subdivided the processor execution into 5 stages, each of which executes faster than the full processor. This lets you process more instructions in the same amount of time, because the execution of each instruction happens in parallel with that of four other instructions at any given time. This also means that each instruction begins its execution when two or three other instructions are still not completely executed, which is the same as saying that any register modified by a given instruction is not actually modified until at least three cycles after that instruction starts executing. So what if, as in this case with register a1, the instruction that immediately follows the load needs the loaded value? More in general, given any instruction that modifies the value of a register, what if an instruction immediately after it needs to use the new value of that register? If such errors propagate, the execution produces wrong results and cannot be meaningfully continued.
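The timing argument can be made concrete with a toy schedule: in an ideal 5-stage pipeline with no stalls, one instruction is issued per cycle, so instruction i (0-based) occupies stage s at cycle i + s. This is our own sketch of the timing model, not of Ripes itself:

```python
# Idealized no-stall pipeline schedule: instruction i is in stage s at cycle i+s.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def cycle_of(instr_index: int, stage: str) -> int:
    return instr_index + STAGES.index(stage)

# Demo program: instruction 1 is the lw (writes a1 in WB),
# instruction 2 is the add (reads a1 in ID).
lw_writes_a1 = cycle_of(1, "WB")   # cycle 5
add_reads_a1 = cycle_of(2, "ID")   # cycle 3
print(lw_writes_a1, add_reads_a1)  # 5 3 -> the add reads a1 two cycles too early
```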
In cycle four we see that the result of the alu is zero and not the expected 0x246, because the values of the registers were retrieved as zero instead of the expected 0x123. This means that the value that will be written to the a2 register will be zero, which is wrong.
Therefore no, the execution cannot be continued, as the current implementation of the pipeline leads to errors.
How to solve problems like this will be the topic of the next lesson. As an assignment, you will be asked to analyse another assembly program, which contains a slightly different type of error, and write a small report detailing how you found it, much in the same way in which we have commented on the last two programs.
So, you will also be asked to speculate on a solution, based on the following part of the lesson (and, if you want some spoilers, by trying to understand how the other three pipelined processors available in Ripes solve the problem. Mind that this is not strictly necessary, as those processors will be the main topic of the next lesson).
Back to our problem, hopefully you have now understood the need for some tweaks to the current pipeline hardware. Let us then try to understand intuitively how this can be solved, which again will be the topic of the next lesson.
In this particular case, the load instruction writes the value 0x123 back to the register file in cycle five, while the following instruction, the add, retrieves it in cycle three. This kind of occurrence is called a hazard (or, more precisely, a dependence, which is a potential hazard). There are two ways to make sure that the add can retrieve the correct value.
The first way is to make sure that the add instruction is still in the decode stage at cycle five. This implies stalling it there for two cycles, by disabling the IDEX interstage register, while the load continues executing. This also implies that all instructions after the add will be stalled, as if in a queue. Naturally, we would need a circuit that detects such occurrences, called a Hazard Detection Unit.
Another way is to pinpoint where exactly the value that the add instruction needs becomes ready, and where it is really needed. In our case, the value is ready as soon as it is retrieved from memory, in the memory stage, which is one cycle earlier than the original case. At the same time, the value is retrieved in the decode stage but it is not needed there: it is only needed by the alu in the execution stage, that is, one cycle later than the original case. It is possible to forward the needed value from one point of the pipeline to another, in order to minimise the stall cycles of an instruction that is waiting for a value to be ready. In our case, this means forwarding the value 0x123, for example, from the write back stage directly to the execution stage, so that the add instruction only has to wait for one cycle instead of two. Again, we would need a circuit that handles such occurrences, called a Forwarding Unit.
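The stall counts for the two approaches can be sanity-checked with a little arithmetic (the stage numbering and the helper function are our own sketch):

```python
# Back-of-the-envelope stall arithmetic for the load-use case above.
IF, ID, EX, MEM, WB = range(5)

def stalls(dist: int, produce_stage: int, consume_stage: int) -> int:
    """Bubbles needed so that, 'dist' instructions later, the consumer
    reaches 'consume_stage' no earlier than the producer is in 'produce_stage'."""
    return max(0, produce_stage - consume_stage - dist)

# Without forwarding: lw writes in WB, add must read in ID (read-during-write).
print(stalls(1, WB, ID))   # 2 stalls
# With a WB->EX forward: the value only has to be ready when the add is in EX.
print(stalls(1, WB, EX))   # 1 stall
```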
In the next lesson we will see exactly how to generalise this.
Assignments
Please write your report in a document, to then submit as a PDF in the form below. Feel free to add images if they help to explain something more concisely.
- Simulate the program below in a similar way to the previous lesson. Write a small report (~1 page) in which you describe what execution error you found and how you recognized it, together with its consequences on the execution of the whole program. Comment on what the assembly does before simulating it, point out any signal that you reckon holds a wrong value, and explain why it should hold the value you think is correct. Try to describe what is similar and what differs with respect to the execution error found in the program of the previous lesson. Finally, describe how this error could be solved by using stalls or by forwarding values between stages. Do not be afraid to speculate or get this last point wrong: the topic of the next lesson will be exactly this.
- Ripes Processor: 5-stage processor (w/o forwarding or hazard detection), Extended layout
- Registers to Initialize:
  t1 = 0x123
  a0 = 0x00000fff
- Program:
  LOOP:
  add t1 t1 t1   # t1 = t1+t1
  bgt t1 a0 20   # jump if t1>a0
  add a1 t1 t1   # a1 = t1+t1
  sub a2 t1 t1   # a2 = t1-t1
  j LOOP
- [Optional "game" to play in pairs. Highly recommended though, as it is an effective way to prepare for one of the course's exam exercise types.] Without the second person knowing, the first person chooses a program among the ones below, or writes a simple assembly program of their own. With the help of Ripes, he/she studies the program and, if necessary, comes up with a proper register initialization scheme to make it do something interesting. Then, the roles of the "game" are chosen: the first person chooses a stage of the pipeline to play as, and communicates it to the second person, while the second person, who cannot use the simulator (nor knows what program was chosen), plays as one adjacent stage. For example, if the first person chooses to play as the EX stage, the second can choose to play either as the MEM stage or as the ID stage. Mind that playing as the previous stage is always a bit more difficult than playing as the following stage. The first person chooses a random cycle of the execution of the program, which he/she can keep on Ripes as reference, and the objective of the second person is to find out the exact cycle number the first person chose. The way the second person finds out is by asking the first person to tell the value of some signal. Not any signal, just the signals that the two stages have in common: for example, if the second person is playing as the ID stage and the first as the EX stage, the second person cannot ask "What is the result of the alu?" or "Is the branch taken?", because those signals have nothing to do with the ID stage, but he/she can ask "What is the value of the first source register in the EX stage?" or "Is the alu selecting the PC or the first register as first operand?" or again "What is the ALU function?", as those are signals that the ID stage calculated in the previous clock cycle and passed to the EX stage. If the second person is playing as the MEM stage (and the first always as EX), he/she can ask "What is the alu result?", as the MEM stage will receive it in the next cycle, but not "Is the branch taken?", because that signal is never seen by the MEM stage. When the second person feels he/she has understood which cycle the simulation is at, a guess is made. A correct guess is a win for the second player, a wrong guess is a win for the first. If the players wish, they can set up a system of points: the second player gets more points for using fewer questions to make a right guess, and makes the first player gain more points if the guess is wrong after many questions. For example, starting from 10: if the guess is right, the second player gets (10 minus the number of questions asked) points; if the guess is wrong, the first player gets as many points as the questions asked.
In order to increase the game difficulty, the two players can play as non-consecutive stages.
Examples of programs from the internet
addi x2, x0, 1
loop:
sub x1, x1, x2
sw x1, 4(x0)
blt x0, x1, loop
addi x5,x0,0x11 # set x5 to 0x11
sw x5, 0x100(x0) # store at address 0x100
lw x6, 0x100(x0) # get from mem
addi x6,x6,1
sw x6, 0x104(x0) # store to mem 0x104
addi x3,x0,0 # i = 0
addi x4,x0,10 # const 10
loop:
bge x3,x4, exit
addi x3,x3,1
j loop
exit:
addi x3,x0,0 # s = 0
addi x4,x0,0 # i = 0
addi x5,x0,5 # const 5
addi x6,x0,0x100 # base address of ax[]
addi x8,x0,0 # offset = 0
loop:
bge x4, x5, exit
add x7, x6, x8 # compute effective address
lw x9, 0(x7) # get ax[i]
add x3, x3, x9 # s = s + ax[i]
addi x8, x8, 4 # next element
addi x4, x4, 1 # increment index
j loop
exit:
Post-Test
Please fill in this form after having finished the lesson and its assignments (remember to submit your answers in the Answers Form):
YOU HAVE REACHED THE END OF THE LESSON. THANK YOU FOR YOUR TIME :)
Appendix
This part of the lesson contains complementary material to the rest of the lesson.
Index
RV32IMAC Cheat Sheet
Embedded below is the pdf summarizing the entire RISC-V instruction set (RV32IMAC), which is useful during most lessons. The pdf is structured as follows:
p.1 - Instructions on how to read the tables.
p.2 - Registers Table.
p.3-5 - Encoder Table (to translate assembly into binary machine code manually).
p.6-7 - Decoder Table (to translate binary machine code into assembly manually).
Below the embedded pdf is a set of exercises to train your ability in encoding and decoding instructions manually.
Encode and Decode Exercises
- Decode the following instructions:
  0x01e007ef
  0x0001e7b7
  0x00440113
  0x40445113
  0x403100b3
  0xfea64ae3
  0x0003a583
  0x00b3a023
- Encode the following instructions:
  j -12
  auipc sp 8
  andi sp, s0, 4
  sltiu sp, s0, 4
  or ra, sp, gp
- Encode only the branch instruction of the following program:
  <LABEL>
  lw a1 0(t2)
  add a2 a1 a1
  sw a1 0(t2)
  bgtz sp <LABEL>
- Encode also:
  lw a7 8(t2)
  sw a7 8(t2)
Examples
Decoding instruction 0x01e007ef
First, convert hex into binary:
| hex | 0 | 1 | e | 0 | 0 | 7 | e | f |
|---|---|---|---|---|---|---|---|---|
| bin | 0000 | 0001 | 1110 | 0000 | 0000 | 0111 | 1110 | 1111 |
Then identify the instruction at hand using the tables from the RV32IMAC Cheat Sheet. Instructions are best decoded reading them from right to left, so that the first things you find are the quadrant and the opcode.
| RV32I | 00000001111000000000 | 01111 | 11011 | 11 |
|---|---|---|---|---|
| Meaning | <imm> (encoded) | rd | opcode | quadrant |
| Value | <imm> | 15 | Jump and Link | 4th |
We therefore obtain jal x15 <imm>, where <imm> is the still to be decoded immediate.
For this instruction, all the fields except <imm> are decodable just from the tables. For <imm> we have to further unscramble some bits to decode it.
Identify the subfields of the immediate:
| <imm> (encoded) | 0 | 0000001111 | 0 | 00000000 |
|---|---|---|---|---|
| Encoding | m2 | imm2 | m1 | imm1 |
Re-order the subfields as written on the tables. Usually the immediate encoding is just below the corresponding instruction's encoding.
| Decoding of <imm> | -m2- | imm1 | m1 | imm2 | 0 |
|---|---|---|---|---|---|
| <imm> (bin) | 000000000000 | 00000000 | 0 | 0000001111 | 0 |
We obtain the 2's complement
<imm> (bin)=00000000000000000000000000011110
that is the decimal
<imm> (dec)=30
The complete disassembled instruction is therefore jal x15 30, or jal a5 30 using the ABI register aliases.
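The whole decoding procedure above can be condensed into a few lines of Python, using the JAL immediate layout imm[20|10:1|11|19:12] from the RISC-V specification (the function name is ours):

```python
# Small JAL decoder following the manual procedure above.

def decode_jal(word: int):
    assert word & 0x7F == 0b1101111, "not a JAL opcode"
    rd = (word >> 7) & 0x1F
    # Re-assemble imm[20|10:1|11|19:12] into a plain byte offset.
    imm = (((word >> 31) & 0x1)   << 20) \
        | (((word >> 12) & 0xFF)  << 12) \
        | (((word >> 20) & 0x1)   << 11) \
        | (((word >> 21) & 0x3FF) << 1)
    if imm & (1 << 20):            # sign-extend the 21-bit immediate
        imm -= 1 << 21
    return rd, imm

print(decode_jal(0x01e007ef))  # (15, 30)  ->  jal x15 30
```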
Encoding instruction j -12
Find the instruction j in the encoding part of the RV32IMAC Cheat Sheet.
| Instruction Encoding j | <imm> (encoded) | 00000 | 11011 | 11 |
|---|---|---|---|---|
We therefore know the least significant bits: xxxxxxxxxxxxxxxxxxxx000001101111. The rest is just the immediate, <imm> = -12, which we now have to encode.
First convert <imm> = -12 to 2's complement, <imm> (bin) = 10100, and then sign-extend it to cover all 32 bits of its width: <imm> (bin) = 11111111111111111111111111110100.
Divide this number into its subfields as per the RV32IMAC Cheat Sheet, Encoder section.
| <imm> (bin) | 111111111111 | 11111111 | 1 | 1111111010 | 0 |
|---|---|---|---|---|---|
| Decoding <imm> | -m2- | imm1 | m1 | imm2 | 0 |
Re-order the fields to encode the immediate:
| Encoding | m2 | imm2 | m1 | imm1 |
|---|---|---|---|---|
| <imm> (encoded) | 1 | 1111111010 | 1 | 11111111 |
We obtain <imm> (encoded) = 11111111010111111111xxxxxxxxxxxx, that is, the missing most significant bits.
The complete assembled instruction is therefore 11111111010111111111000001101111 (0xff5ff06f), after joining the two half-results above.
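The inverse procedure, assembling a jal from rd and a signed immediate, can be written the same way (again, the function name is ours):

```python
# Small JAL encoder following the manual procedure above (j = jal x0, imm).

def encode_jal(rd: int, imm: int) -> int:
    imm &= (1 << 21) - 1           # two's complement, 21 bits
    return (((imm >> 20) & 0x1)   << 31) \
         | (((imm >> 1)  & 0x3FF) << 21) \
         | (((imm >> 11) & 0x1)   << 20) \
         | (((imm >> 12) & 0xFF)  << 12) \
         | (rd << 7) | 0b1101111

print(hex(encode_jal(0, -12)))  # 0xff5ff06f -> the bit string derived above
```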
Review of Single Cycle Processor Components
The following pages contain interactive examples of these circuits:
The circuits are implemented in S.H.E.A.S., a logic gate simulation tool for which we provide a Tutorial.
Single Register (32 bits)
On the rising edge of the clock signal, the register updates its output with the value of its input.
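A minimal Python model of this behaviour (our own sketch, not the embedded circuit), which also explains why memory writes only become visible one cycle later:

```python
# Toy clocked register: the output changes only on the rising clock edge.

class Register:
    def __init__(self):
        self.q = 0       # current output
        self.d = 0       # input, sampled at the next rising edge

    def rising_edge(self):
        self.q = self.d

r = Register()
r.d = 5
print(r.q)           # 0 -> still the old value this cycle
r.rising_edge()
print(r.q)           # 5 -> visible from the next cycle on
```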
ROM
A memory is nothing but a bank of registers: given an address input, the memory retrieves the content of the register with that index as output. Remember that a non-writable memory is a good model for the processor's instruction memory.
The memory in this circuit contains the program:
<LABEL>
lw a1 0(t2)
add a2 a1 a1
sw a1 0(t2)
bgtz sp <LABEL>
Register File (RF)
Design of the 32-bit RF by Marek Materzok. This design is purely functional: it does not go down to the level of detail of logic gates. The register file contains the 32 registers that are referred to in the assembly code as x0-x31, or with the corresponding ABI names. What distinguishes the register file from a normal memory is that it can read two registers at once, while simultaneously writing another one.
PC
The program counter is a register that contains the address of the next instruction to execute, which is decided either by incrementing it by 4 (which means selecting the next instruction in memory) or by receiving the alu result from the branch/jump address calculation.
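The next-PC selection can be sketched as a two-way mux (the function name and example addresses are ours):

```python
# Sketch of the next-PC mux: fall through (pc+4) or take the branch/jump target.

def next_pc(pc: int, take_branch: bool, branch_target: int) -> int:
    return branch_target if take_branch else pc + 4

print(hex(next_pc(0x14, False, 0x04)))  # 0x18 -> next instruction in memory
print(hex(next_pc(0x14, True,  0x04)))  # 0x4  -> the ALU's branch target
```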
ALU
32-bit ALU design by Marek Materzok. This design is purely functional: it does not go down to the level of detail of logic gates.
The alu receives as input two values and a function to execute on them. The result is given as output, together with a flag that indicates whether the result is zero.
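A functional model matching this description might look like the following (our own sketch, not the embedded design; only a few ALU functions are shown):

```python
# Functional ALU sketch: two operands, a function selector, the result,
# and a zero flag used by branch instructions.

def alu(op: str, a: int, b: int):
    funcs = {
        "add": lambda: (a + b) & 0xFFFFFFFF,
        "sub": lambda: (a - b) & 0xFFFFFFFF,
        "and": lambda: a & b,
        "or":  lambda: a | b,
    }
    result = funcs[op]()
    return result, result == 0   # (result, zero flag)

print(alu("add", 0, 188))   # (188, False) -> a store's address calculation
print(alu("sub", 7, 7))     # (0, True)    -> zero flag raised
```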
Simulation Tools
What follows is all the information you need to use the two tools we are using in this lesson:
S.H.E.A.S. Tutorial
S.H.E.A.S. is a logic gate simulator, also available standalone. It will however be mostly embedded in the lesson pages. Note that if the current version of S.H.E.A.S. is higher than the one shown in the tutorial and you don't understand something, you will find below the transcript updated to the most recent version.
TRANSCRIPT (v1.1)
This is a basic tutorial for SHEAS one point one, a tool based on the DigitalJS open source logic simulation engine, originally created for visualizing and simulating Verilog code by Marek Materzok.
SHEAS stands for Simple Hardware Editor And Simulator, and it provides DigitalJS with a basic interface that allows for completely code-free hardware modification and simulation, and in a near future, testing too.
SHEAS is open source too, and a live running page can be found at sheas.magiwanders.com.
Components are chosen from the dropdown and added to the simulation. The base components are buttons for inputs, lamps for outputs, and nand gates for synthesizing logic. Each of these has a 'bits' attribute which describes the width of their ports. More components are available in the dropdown, and new ones are constantly added.
In the Visualization Window, components are moved by drag and drop and connected by dragging and dropping, from the output of one to the input of the other. If any glitch happens, the visualization can at any point be reloaded.
The simulation is already running by default, and at any point, clicking the blue magnifying glass that appears upon hovering over each wire will activate a signal monitor below. To see the details, like the exact delay between signal activations, the simulation can be paused and forwarded step by step, while the signal monitors can be zoomed in and out, and navigated horizontally to the left or to the right.
Any component can be renamed, and any component can also be removed by name. No two components can have the same name: the page will output an error if you try to rename one to an already existing name.
At any point, the circuit can be saved for later retrieval, and the simulation can be reset.
Saved circuits are retrieved either as they were saved, with the load button, or as black boxes, with the add button. A black box can be used as any other component, and its internal ports can be accessed with the magnifying glass icon that appears upon hovering over it.
At any point the current circuit can be shared by clicking the 'share circuit as link' button and pasting the link into another window. Mind that, due to limitations on the URL length, this works for all but the biggest circuits. Alternatively, clicking the 'share only the chip' button will copy the chip to the clipboard. In another window, the chip can be retrieved with the dropdown option 'Circuit from Clipboard'. This works with much bigger circuits.
That is all, thank you.
Ripes Tutorial
Ripes is a visual RISC-V processor simulator, available open source on GitHub. Note that if the version of Ripes in the virtual machine is higher than the one shown in the tutorial and you don't understand something, you will find below the transcript updated to the most recent version.
Ripes Setup
Virtual Machine Setup
- Download the virtual machine containing Ripes HERE (the file comparch.ova is relatively big at ~2.7GB).
- Download and install VirtualBox for your system.
- Open VirtualBox and import the file comparch.ova: File > Import Appliance.
- Select the file comparch.ova.
- Click Next, then click Import.
- A new Virtual Machine should appear in the main window. Double click to boot it up.
Important Info on the Virtual Machine
The information that follows is not directly useful for the lesson but might be needed when operating the virtual machine:
- Some Ripes sub-windows, like the Pipeline Diagram one, DO NOT have the 'x' button to close them. Use the 'Esc' key.
- The virtual machine only shows the Ripes program, which opens at startup. It will be needed in the second part of the lesson.
- Verify that the copy and paste functionality works between the host and the guest (it is simply easier to copy programs from this site into Ripes).
- Login credentials:
  - user: student
  - psw: student
- System: Manjaro Linux i3 minimal, updated September 2022. Any modification or update is strongly advised against, to preserve functionality.
- How to open a terminal: Alt+2 or Alt+Enter.
- The terminal is a standard Linux terminal. If one desires to modify the i3 configuration, one has to modify the file ~/.config/i3/config, for example with the nano editor: sudo nano ~/.config/i3/config
- To return to Ripes at any moment: Alt+1.
- If for any reason Ripes closes, reboot the virtual machine.
Ripes Tutorial Video
Transcript v2.2.3
This tutorial is an unofficial Introduction to Ripes two point two point three.
RIPES is a graphical processor simulator and assembly code editor for educational purposes. It was written by Morten Petersen. The program itself is theoretically architecture agnostic, but we can see in the processor selector that it currently only offers RISC-V single stage, pipelined and dual issue processors. Each processor has an extended layout for more visualization detail and settings for extensions or register initialization.
Onto the program itself: the EDITOR TAB contains the tools to program in assembly. Programs are loaded via File > Load Program. It is currently biased towards RV32I assembly (and, depending on the processor, the M/C extensions). On the left-hand side actual assembly can be written (or C, if you link a RISC-V compiler in the settings). On the right-hand side is the machine code in hex, disassembled or binary form.
The execution of a program can be controlled either step by step (forwards, and also backwards in time), or by setting a clock time and then pressing play. The execution can be reset, or played without updating the UI for faster execution. If executing on a pipelined processor, the right-hand side of the editor tab displays what stage each instruction is currently in.
The PROCESSOR TAB contains the hardware visualizations. The visualization itself is very intuitive: wires can be highlighted to read their path better, and their value can be accessed by hovering (in any radix). If they are 1 bit wide, they are colored green when the value is 1 and grey when it is 0; if they are multibit, they flash yellow when updated at a given clock and remain grey if not updated. The selected option of muxes is colored green, and a similar visualization is used for useful indicators, like branch taken. The other components tend to be self-explanatory, by naming I/O explicitly. CTRL+Scroll is for zooming. On the right, the register file is displayed; values can be edited manually. Pipelined processors show the currently processed instruction above each stage.
The layout can be changed if needed: just unlock it from outside the visualization square and modify it as you wish. At any point you can save the layout as a json file by right-clicking on the visualization square. Saved layouts can be loaded back, only while the layout is unlocked, by right-clicking on the visualization square.
On the bottom right is the instruction memory, which for each instruction offers the possibility to insert breakpoints, in addition to giving information about its address and current stage. Here some statistics are given: number of cycles, retired instructions, CPI, IPC, clock rate. Here is the console, which prints the output of ecall print functions. Finally, the table button brings up the pipeline diagram at the current cycle.
The visualization is easier on the eyes in View > Dark Mode, and all values can be shown or hidden under View > Show processor signal values.
Under the MEMORY tab the whole Data Memory can be accessed, either by scrolling or with the goto options below.
Cache and I/O tabs are used for, well, studying cache and Input/Output but the detail for those is for another video.
All of the processor designs can be modified in the source: for example, to modify the hazard unit of the 5-stage pipelined 32-bit processor, one can edit src/processors/RISC-V/rv5s/rv5s_hazardunit.h. The hardware description language is VSRTL, a domain-specific language embedded in C++ by the same author. After modifying, Ripes has to be re-compiled. See the instructions in the GitHub repositories, links in the description.
If you are not familiar with the pipeline, feel free to skip the rest of the video and come back if you need to :)
Taking the visualization of a pipelined single-issue processor design as a reference, the student can adopt three overlapping but distinct "perspectives of focus" when studying and playing with Ripes.

The stage view, or vertical view, focuses on checking whether a given stage behaves as the student expects for different kinds of instructions. For the simplest example, consider the IF stage: we expect specific hex values out of the instruction memory, and we can verify them as the corresponding instructions pass through IF. For the ID stage, we expect, for example, the add instruction to enable the write-back flag and the store not to; we can again verify this. For the EX stage and the add instruction, we expect the first input of the ALU to be the content of source register 1, and we can verify that too. This was the stage, or vertical, view.

The instruction view, or cross view, focuses on following one instruction along the pipeline to understand its impact. This is the same thing as the previous focus, just under another perspective. For example, the student should notice that whatever stage the instruction is in, it "drags" with it the values it needs, as they were when the instruction was in the stage where they were created. For example, the PC here is the current value, here is the previous one, which was current when this instruction was in the IF stage, et cetera. This was the instruction, or cross, view.

The pipeline view, or horizontal/global view, focuses on the evolution and global impact of signals within the pipeline: this is a combination of the previous two focuses.
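The same stage and instruction views can be read off a classic pipeline diagram, like the one Ripes shows under the table button: a column is the stage view at one cycle, a row is the instruction view. A minimal sketch that prints such a diagram (assuming an ideal 5-stage pipeline with no stalls or flushes; the instruction names are made up):

```python
# Print a textbook pipeline diagram for an ideal 5-stage pipeline:
# instruction i enters IF at cycle i and advances one stage per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return (name, cells) rows; cells[c] is the stage occupied at cycle c."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, instr in enumerate(instructions):
        cells = []
        for cycle in range(n_cycles):
            stage = cycle - i  # stage index occupied by instruction i at this cycle
            cells.append(STAGES[stage] if 0 <= stage < len(STAGES) else "")
        rows.append((instr, cells))
    return rows

for name, cells in pipeline_diagram(["add", "sw", "lw", "beq"]):
    print(f"{name:>4} | " + " | ".join(f"{c:>3}" for c in cells))
```

Reading down a column gives the vertical (stage) view; reading along a row gives the cross (instruction) view.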
For example, in the ID stage the write-register flag is produced here on the control unit and used here on the register file, but there is no direct connection between them: the flag has to travel all the way through the EX, MEM, and WB stages and come back to the ID stage, because it has to wait for the write data, which you also see coming from the WB stage, to be retrieved in the MEM stage in the case of a load, or to be calculated in the EX stage in the case of many other instructions. In each subsequent stage, the value of the write-register flag is the one produced by the control unit when the corresponding instruction was in the ID stage. So the value currently being used by the register file in the ID stage, which loops back from the WB stage, corresponds not to the instruction currently in the ID stage but to the instruction currently in the WB stage. That instruction is writing back to the register file only now, even though topologically the register file sits in the ID stage and even though that same instruction already passed through it three cycles ago. The write-register flag was ready then, but the data to write was not, so the flag had to be "dragged" along until the stage in which the actual write-back happens. This was the pipeline, or horizontal/global, focus.
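This "dragging" of a control signal through the pipeline registers can be mimicked with a small shift register. A minimal Python sketch (the signal and register names are simplified placeholders, not the exact Ripes/VSRTL names):

```python
# Each pipeline register copies its input on the clock edge, so a control
# signal produced in ID reappears in EX, MEM, and WB on successive cycles.
from collections import deque

class WriteFlagPath:
    def __init__(self):
        # The flag travels through three registers: ID/EX, EX/MEM, MEM/WB.
        self.regs = deque([0, 0, 0])

    def clock(self, wr_en_from_id):
        """Advance one cycle; return the flag as seen by the WB stage."""
        self.regs.appendleft(wr_en_from_id)
        return self.regs.pop()  # this value loops back to the register file

p = WriteFlagPath()
# An 'add' (which writes a register) followed by instructions that do not:
seen_in_wb = [p.clock(flag) for flag in [1, 0, 0, 0]]
print(seen_in_wb)  # → [0, 0, 0, 1]: the flag arrives at WB three cycles later
```

The flag produced in ID at cycle 0 only reaches the register-file write port three cycles later, exactly the delay described above.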
That is all, thank you.