Assembly language is convenient for humans to read. However, digital circuits understand only 1’s and 0’s. Therefore, a program written in assembly language is translated from mnemonics to a representation using only 1’s and 0’s called machine language. This section describes ARM machine language and the tedious process of converting between assembly and machine language.

ARM uses 32-bit instructions. Again, regularity supports simplicity, and the most regular choice is to encode all instructions as words that can be stored in memory. Even though some instructions may not require all 32 bits of encoding, variable-length instructions would add complexity. Simplicity would also encourage a single instruction format, but that would be too restrictive. This tension introduces the last design principle:

Design Principle 4: Good design demands good compromises.

ARM makes the compromise of defining three main instruction formats: Data-processing, Memory, and Branch. This small number of formats allows for some regularity among instructions, and thus simpler decoder hardware, while also accommodating different instruction needs. Data-processing instructions have a first source register, a second source that is either an immediate or a register, possibly shifted, and a destination register. The Data-processing format has several variations for these second sources. Memory instructions have three operands: a base register, an offset that is either an immediate or an optionally shifted register, and a register that is the destination on an LDR and another source on an STR. Branch instructions take one 24-bit immediate branch offset. This section discusses these ARM instruction formats and shows how they are encoded into binary. Appendix B provides a quick reference for all the ARMv4 instructions.

6.4.1 Data-processing Instructions

The data-processing instruction format is the most common. The first source operand is a register. The second source operand can be an immediate or an optionally shifted register. A third register is the destination. Figure 6.16 shows the data-processing instruction format. The 32-bit instruction has six fields: cond, op, funct, Rn, Rd, and Src2.

Figure 6.16. Data-processing instruction format

The operation the instruction performs is encoded in the fields highlighted in blue: op (also called the opcode or operation code) and funct, the function code; the cond field encodes conditional execution based on the flags described in Section 6.3.2. Recall that cond = 1110₂ for unconditional instructions. op is 00₂ for data-processing instructions.

The operands are encoded in the three fields: Rn, Rd, and Src2. Rn is the first source register and Src2 is the second source; Rd is the destination register.

Figure 6.17 shows the format of the funct field and the three variations of Src2 for data-processing instructions. funct has three subfields: I, cmd, and S. The I-bit is 1 when Src2 is an immediate. The S-bit is 1 when the instruction sets the condition flags. For example, SUBS R1, R9, #11 has S = 1. cmd indicates the specific data-processing instruction, as given in Table B.1 in Appendix B. For example, cmd is 4 (0100₂) for ADD and 2 (0010₂) for SUB.

Figure 6.17. Data-processing instruction format showing the funct field and Src2 variations

Rd is short for “register destination.” Rn and Rm unintuitively indicate the first and second register sources.

Three variations of Src2 encoding allow the second source operand to be (1) an immediate, (2) a register (Rm) optionally shifted by a constant (shamt5), or (3) a register (Rm) shifted by another register (Rs). For the latter two encodings of Src2, sh encodes the type of shift to perform, as will be shown in Table 6.8.

Data-processing instructions have an unusual immediate representation involving an 8-bit unsigned immediate, imm8, and a 4-bit rotation, rot. imm8 is rotated right by 2 × rot to create a 32-bit constant. Table 6.7 gives example rotations and resulting 32-bit constants for the 8-bit immediate 0xFF. This representation is valuable because it permits many useful constants, including small multiples of any power of two, to be packed into a small number of bits. Section 6.6.1 describes how to generate arbitrary 32-bit constants.

Table 6.7. Immediate rotations and resulting 32-bit constant for imm8 = 0xFF

rot     32-bit Constant
0000    0000 0000 0000 0000 0000 0000 1111 1111
0001    1100 0000 0000 0000 0000 0000 0011 1111
0010    1111 0000 0000 0000 0000 0000 0000 1111
...     ...
1111    0000 0000 0000 0000 0000 0011 1111 1100

If an immediate has multiple possible encodings, the representation with the smallest rotation value rot is used. For example, #12 would be represented as (rot, imm8) = (0000, 00001100), not (0001, 00110000).
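The smallest-rotation rule lends itself to a short sketch in code. The Python function below (an illustration, not part of the text) tries rotation values in ascending order and returns the first (rot, imm8) pair that fits in 8 bits, or None if the constant is not encodable:

```python
def encode_imm(value):
    """Find the canonical (rot, imm8) encoding of a 32-bit constant, if any.

    The hardware computes imm8 rotated right by 2*rot, so a candidate imm8
    is the constant rotated LEFT by 2*rot. Trying rot in ascending order
    yields the smallest-rotation encoding required by the architecture.
    """
    value &= 0xFFFFFFFF
    for rot in range(16):
        imm8 = ((value << (2 * rot)) | (value >> (32 - 2 * rot))) & 0xFFFFFFFF
        if imm8 < 256:
            return rot, imm8
    return None  # not representable as a data-processing immediate
```

For example, encode_imm(12) returns (0, 12), matching the rule above, and encode_imm(0xFF0) returns (14, 0xFF).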

Figure 6.18 shows the machine code for ADD and SUB when Src2 is a register. The easiest way to translate from assembly to machine code is to write out the values of each field and then convert these values to binary. Group the bits into blocks of four to convert to hexadecimal to make the machine language representation more compact. Beware that the destination is the first register in an assembly language instruction, but it is the second register field (Rd) in the machine language instruction. Rn and Rm are the first and second source operands, respectively. For example, the assembly instruction ADD R5, R6, R7 has Rn = 6, Rd = 5, and Rm = 7.

Figure 6.18. Data-processing instructions with three register operands

Figure 6.19 shows the machine code for ADD and SUB with an immediate and two register operands. Again, the destination is the first register in an assembly language instruction, but it is the second register field (Rd) in the machine language instruction. The immediate of the ADD instruction (42) can be encoded in 8 bits, so no rotation is needed (imm8 = 42, rot = 0). However, the immediate of SUB R2, R3, #0xFF0 cannot be encoded directly using the 8 bits of imm8. Instead, imm8 is 255 (0xFF), and it is rotated right by 28 bits (rot = 14). This is easiest to interpret by remembering that a right rotation by 28 bits is equivalent to a left rotation by 32 − 28 = 4 bits.
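This field packing can be sketched in Python. The cmd codes below are the ADD/SUB values quoted in the text; other mnemonics would need their Table B.1 codes:

```python
CMD = {'ADD': 0b0100, 'SUB': 0b0010}  # cmd codes from the text (Table B.1)

def dp_reg(mnem, rd, rn, rm, cond=0b1110, s=0):
    """Data-processing instruction with an unshifted register Src2 (I = 0)."""
    return (cond << 28) | (0b00 << 26) | (0 << 25) | (CMD[mnem] << 21) | \
           (s << 20) | (rn << 16) | (rd << 12) | rm

def dp_imm(mnem, rd, rn, rot, imm8, cond=0b1110, s=0):
    """Data-processing instruction with an immediate Src2 (I = 1)."""
    return (cond << 28) | (0b00 << 26) | (1 << 25) | (CMD[mnem] << 21) | \
           (s << 20) | (rn << 16) | (rd << 12) | (rot << 8) | imm8
```

With these, dp_reg('ADD', rd=5, rn=6, rm=7) packs ADD R5, R6, R7 into 0xE0865007, and dp_imm('SUB', rd=2, rn=3, rot=14, imm8=0xFF) packs SUB R2, R3, #0xFF0 into 0xE2432EFF, both following directly from the field positions in Figure 6.16.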

Figure 6.19. Data-processing instructions with an immediate and two register operands

Shifts are also data-processing instructions. Recall from Section 6.3.1 that the amount by which to shift can be encoded using either a 5-bit immediate or a register.

Figure 6.20 shows the machine code for logical shift left (LSL) and rotate right (ROR) with immediate shift amounts. The cmd field is 13 (1101₂) for all shift instructions, and the shift field (sh) encodes the type of shift to perform, as given in Table 6.8. Rm (i.e., R5) holds the 32-bit value to be shifted, and shamt5 gives the number of bits to shift. The shifted result is placed in Rd. Rn is not used and should be 0.

Figure 6.20. Shift instructions with immediate shift amounts

Table 6.8. sh field encodings

Instruction   sh    Operation
LSL           00₂   Logical shift left
LSR           01₂   Logical shift right
ASR           10₂   Arithmetic shift right
ROR           11₂   Rotate right

Figure 6.21 shows the machine code for LSR and ASR with the shift amount encoded in the least significant 8 bits of Rs (R6 and R12). As before, cmd is 13 (1101₂), sh encodes the type of shift, Rm holds the value to be shifted, and the shifted result is placed in Rd. This instruction uses the register-shifted register addressing mode, where one register (Rm) is shifted by the amount held in a second register (Rs). Because the least significant 8 bits of Rs are used, Rm can be shifted by up to 255 positions. For example, if Rs holds the value 0xF001001C, the shift amount is 0x1C (28). A logical shift by more than 31 bits pushes all the bits off the end and produces all 0's. Rotate is cyclical, so a rotate by 50 bits is equivalent to a rotate by 18 bits.
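These shift-amount edge cases are easy to model with plain 32-bit arithmetic. The Python sketch below illustrates why an oversized logical shift clears the register while an oversized rotate wraps around:

```python
def lsr(x, n):
    """32-bit logical shift right; n >= 32 pushes every bit off the end."""
    return ((x & 0xFFFFFFFF) >> n) if n < 32 else 0

def ror(x, n):
    """32-bit rotate right; rotation is cyclic, so only n mod 32 matters."""
    n %= 32
    x &= 0xFFFFFFFF
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF
```

With Rs = 0xF001001C, the shift amount is Rs & 0xFF = 0x1C = 28, and ror(x, 50) equals ror(x, 18) for any x, as the text states.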

Figure 6.21. Shift instructions with register shift amounts

6.4.2 Memory Instructions

Memory instructions use a format similar to that of data-processing instructions, with the same six overall fields: cond, op, funct, Rn, Rd, and Src2, as shown in Figure 6.22. However, memory instructions use a different funct field encoding, have two variations of Src2, and use an op of 01₂. Rn is the base register, Src2 holds the offset, and Rd is the destination register in a load or the source register in a store. The offset is either a 12-bit unsigned immediate (imm12) or a register (Rm) that is optionally shifted by a constant (shamt5). funct is composed of six control bits: I¯, P, U, B, W, and L. The I¯ (immediate) and U (add) bits determine whether the offset is an immediate or register and whether it should be added or subtracted, according to Table 6.9. The P (pre-index) and W (writeback) bits specify the index mode according to Table 6.10. The L (load) and B (byte) bits specify the type of memory operation according to Table 6.11.

Figure 6.22. Memory instruction format for LDR, STR, LDRB, and STRB

Table 6.9. Offset type control bits for memory instructions

Bit   I¯ Meaning                  U Meaning
0     Immediate offset in Src2    Subtract offset from base
1     Register offset in Src2     Add offset to base

Table 6.10. Index mode control bits for memory instructions

P   W   Index Mode
0   0   Post-index
0   1   Not supported
1   0   Offset
1   1   Pre-index

Table 6.11. Memory operation type control bits for memory instructions

L   B   Instruction
0   0   STR
0   1   STRB
1   0   LDR
1   1   LDRB

Example 6.3

Translating Memory Instructions into Machine Language

Translate the following assembly language statement into machine language.

STR R11, [R5], #-26

Solution

STR is a memory instruction, so it has an op of 01₂. According to Table 6.11, L = 0 and B = 0 for STR. The instruction uses post-indexing, so according to Table 6.10, P = 0 and W = 0. The immediate offset is subtracted from the base, so I¯ = 0 and U = 0. Figure 6.23 shows each field and the machine code. Hence, the machine language instruction is 0xE405B01A.
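The same packing can be sketched for the memory format. The Python function below handles only the immediate-offset variant (I¯ = 0), with the funct bits ordered I¯, P, U, B, W, L as in Figure 6.22:

```python
def mem_imm(l, b, p, w, u, rn, rd, imm12, cond=0b1110):
    """Memory instruction with an immediate offset (I-bar = 0), op = 01."""
    return (cond << 28) | (0b01 << 26) | (0 << 25) | (p << 24) | (u << 23) | \
           (b << 22) | (w << 21) | (l << 20) | (rn << 16) | (rd << 12) | imm12
```

Calling mem_imm(l=0, b=0, p=0, w=0, u=0, rn=5, rd=11, imm12=26) reproduces 0xE405B01A for STR R11, [R5], #-26, matching the solution above.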

Figure 6.23. Machine code for the memory instruction of Example 6.3

Notice the counterintuitive encoding of post-indexing mode.

6.4.3 Branch Instructions

Branch instructions use a single 24-bit signed immediate operand, imm24, as shown in Figure 6.24. As with data-processing and memory instructions, branch instructions begin with a 4-bit condition field and a 2-bit op, which is 10₂. The funct field is only 2 bits. The upper bit of funct is always 1 for branches. The lower bit, L, indicates the type of branch operation: 1 for BL and 0 for B. The remaining 24-bit two's complement imm24 field is used to specify an instruction address relative to PC + 8.

Figure 6.24. Branch instruction format

Code Example 6.28 shows the use of the branch if less than (BLT) instruction and Figure 6.25 shows the machine code for that instruction. The branch target address (BTA) is the address of the next instruction to execute if the branch is taken. The BLT instruction in Figure 6.25 has a BTA of 0x80B4, the instruction address of the THERE label.

Code Example 6.28

Calculating the Branch Target Address

ARM Assembly Code

0x80A0     BLT THERE 

0x80A4     ADD R0, R1, R2

0x80A8     SUB R0, R0, R9

0x80AC     ADD SP, SP, #8

0x80B0     MOV PC, LR

0x80B4 THERE  SUB R0, R0, #1

0x80B8     ADD R3, R3, #0x5

The 24-bit immediate field gives the number of instructions between the BTA and PC + 8 (two instructions past the branch). In this case, the value in the immediate field (imm24) of BLT is 3 because the BTA (0x80B4) is three instructions past PC + 8 (0x80A8).

Figure 6.25. Machine code for branch if less than (BLT)

The processor calculates the BTA from the instruction by sign-extending the 24-bit immediate, shifting it left by 2 (to convert words to bytes), and adding it to PC + 8.
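Both directions of this calculation can be sketched in Python (a model of the arithmetic just described, not a full assembler):

```python
def branch_imm24(pc, bta):
    """imm24 = (BTA - (PC + 8)) / 4, stored as 24-bit two's complement."""
    return ((bta - (pc + 8)) >> 2) & 0xFFFFFF

def branch_target(pc, imm24):
    """Recover the BTA: sign-extend imm24, shift left by 2, add to PC + 8."""
    offset = imm24 - (1 << 24) if imm24 & 0x800000 else imm24
    return pc + 8 + (offset << 2)
```

For the BLT in Code Example 6.28, branch_imm24(0x80A0, 0x80B4) gives 3; for the backward BL in Example 6.4, branch_imm24(0x8050, 0x8040) gives 0xFFFFFA, the 24-bit two's complement representation of −6.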

Example 6.4

Calculating The Immediate Field For PC-Relative Addressing

Calculate the immediate field and show the machine code for the branch instruction in the following assembly program.

0x8040 TEST  LDRB R5, [R0, R3]

0x8044   STRB R5, [R1, R3]

0x8048  ADD  R3, R3, #1

0x804C  MOV PC, LR

0x8050  BL  TEST

0x8054  LDR R3, [R1], #4

0x8058  SUB R4, R3, #9

Solution

Figure 6.26 shows the machine code for the branch and link instruction (BL). Its branch target address (0x8040) is six instructions behind PC + 8 (0x8058), so the immediate field is -6.

Figure 6.26. BL machine code

6.4.4 Addressing Modes

This section summarizes the modes used for addressing instruction operands. ARM uses four main modes: register, immediate, base, and PC-relative addressing. Most other architectures provide similar addressing modes, so understanding these modes helps you easily learn other assembly languages. Register and base addressing have several submodes described below. The first three modes (register, immediate, and base addressing) define modes of reading and writing operands. The last mode (PC-relative addressing) defines a mode of writing the program counter (PC). Table 6.12 summarizes and gives examples of each addressing mode.

Table 6.12. ARM operand addressing modes

Operand Addressing Mode              Example                     Description
Register
  Register-only                      ADD R3, R2, R1              R3 ← R2 + R1
  Immediate-shifted register         SUB R4, R5, R9, LSR #2      R4 ← R5 − (R9 >> 2)
  Register-shifted register          ORR R0, R10, R2, ROR R7     R0 ← R10 | (R2 ROR R7)
Immediate                            SUB R3, R2, #25             R3 ← R2 − 25
Base
  Immediate offset                   STR R6, [R11, #77]          mem[R11 + 77] ← R6
  Register offset                    LDR R12, [R1, −R5]          R12 ← mem[R1 − R5]
  Immediate-shifted register offset  LDR R8, [R9, R2, LSL #2]    R8 ← mem[R9 + (R2 << 2)]
PC-Relative                          B LABEL1                    Branch to LABEL1

Data-processing instructions use register or immediate addressing, in which the first source operand is a register and the second is a register or immediate, respectively. ARM allows the second register to be optionally shifted by an amount specified in an immediate or a third register. Memory instructions use base addressing, in which the base address comes from a register and the offset comes from an immediate, a register, or a register shifted by an immediate. Branches use PC-relative addressing in which the branch target address is computed by adding an offset to PC + 8.

ARM is unusual among RISC architectures in that it allows the second source operand to be shifted in register and base addressing modes. This requires a shifter in series with the ALU in the hardware implementation but significantly reduces code length in common programs, especially array accesses. For example, in an array of 32-bit data elements, the array index must be left-shifted by 2 to compute the byte offset into the array. Any type of shift is permitted, but left shifts for multiplication are most common.
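To make the array-access benefit concrete: the byte offset of element i in an array of 32-bit words is i << 2 (that is, i × 4), which the immediate-shifted register offset mode computes in a single instruction. The base address and index below are hypothetical values for illustration, not from the text:

```python
base = 0x20000000  # hypothetical array base address (held in, say, R9)
i = 5              # hypothetical array index (held in, say, R2)

# LDR R8, [R9, R2, LSL #2] would compute this address in one instruction:
addr = base + (i << 2)  # byte address of element i in a 4-byte word array
```

Without the in-line shift, a separate LSL instruction would be needed before every such load, which is exactly the code-length saving the text describes.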

6.4.5 Interpreting Machine Language Code

To interpret machine language, one must decipher the fields of each 32-bit instruction word. Different instructions use different formats, but all formats start with a 4-bit condition field and a 2-bit op. The best place to begin is to look at the op. If it is 00₂, then the instruction is a data-processing instruction; if it is 01₂, then the instruction is a memory instruction; if it is 10₂, then it is a branch instruction. Based on that, the rest of the fields can be interpreted.

Example 6.5

Translating Machine Language to Assembly Language

Translate the following machine language code into assembly language.

0xE0475001

0xE5949010

Solution

First, we represent each instruction in binary and look at bits 27:26 to find the op for each instruction, as shown in Figure 6.27. The op fields are 00₂ and 01₂, indicating a data-processing and memory instruction, respectively. Next, we look at the funct field of each instruction.

The cmd field of the data-processing instruction is 2 (0010₂) and the I-bit (bit 25) is 0, indicating that it is a SUB instruction with a register Src2. Rd is 5, Rn is 7, and Rm is 1.

The funct field for the memory instruction is 011001₂. B = 0 and L = 1, so this is an LDR instruction. P = 1 and W = 0, indicating offset addressing. I¯ = 0, so the offset is an immediate. U = 1, so the offset is added. Thus, it is a load register instruction with an immediate offset that is added to the base register. Rd is 9, Rn is 4, and imm12 is 16. Figure 6.27 shows the assembly code equivalent of the two machine instructions.
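The decoding walkthrough above can be sketched as a small Python dispatcher (field extraction only; a real disassembler would also decode cond, S, and the Src2 variants):

```python
def classify(word):
    """Dispatch on bits 27:26 (op), as the text describes."""
    op = (word >> 26) & 0b11
    return {0b00: 'data-processing', 0b01: 'memory', 0b10: 'branch'}[op]

def dp_fields(word):
    """Register-Src2 data-processing fields: (cmd, Rn, Rd, Rm)."""
    return ((word >> 21) & 0xF, (word >> 16) & 0xF,
            (word >> 12) & 0xF, word & 0xF)
```

Here classify(0xE0475001) returns 'data-processing' and dp_fields(0xE0475001) returns (2, 7, 5, 1), i.e., SUB R5, R7, R1 as derived above.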

Figure 6.27. Machine code to assembly code translation

6.4.6 The Power of the Stored Program

A program written in machine language is a series of 32-bit numbers representing the instructions. Like other binary numbers, these instructions can be stored in memory. This is called the stored program concept, and it is a key reason why computers are so powerful. Running a different program does not require large amounts of time and effort to reconfigure or rewire hardware; it only requires writing the new program to memory. In contrast to dedicated hardware, the stored program offers general-purpose computing. In this way, a computer can execute applications ranging from a calculator to a word processor to a video player simply by changing the stored program.

Instructions in a stored program are retrieved, or fetched, from memory and executed by the processor. Even large, complex programs are simply a series of memory reads and instruction executions.

Figure 6.28 shows how machine instructions are stored in memory. In ARM programs, the instructions are normally stored starting at low addresses, in this case 0x00008000. Remember that ARM memory is byte-addressable, so 32-bit (4-byte) instruction addresses advance by 4 bytes, not 1.

Figure 6.28. Stored program

To run or execute the stored program, the processor fetches the instructions from memory sequentially. The fetched instructions are then decoded and executed by the digital hardware. The address of the current instruction is kept in a 32-bit register called the program counter (PC), which is register R15. For historical reasons, a read to the PC returns the address of the current instruction plus 8.

To execute the code in Figure 6.28, the PC is initialized to address 0x00008000. The processor fetches the instruction at that memory address and executes the instruction, 0xE3A01064 (MOV R1, #100). The processor then increments the PC by 4 to 0x00008004, fetches and executes that instruction, and repeats.
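This fetch loop can be modeled in a few lines of Python. Only the first instruction word below comes from Figure 6.28; the second is a placeholder added for illustration:

```python
# Instruction memory modeled as a map from byte address to 32-bit word.
imem = {
    0x00008000: 0xE3A01064,  # MOV R1, #100 (from Figure 6.28)
    0x00008004: 0x00000000,  # placeholder word for illustration
}

pc = 0x00008000
fetched = []
while pc in imem:
    fetched.append(imem[pc])  # fetch (decode and execute would follow)
    pc += 4                   # byte-addressable memory: next word is 4 bytes on
```

The loop captures the essence of the stored program concept: execution is nothing more than repeated memory reads at successive PC values.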

The architectural state of a microprocessor holds the state of a program. For ARM, the architectural state includes the register file and status registers. If the operating system (OS) saves the architectural state at some point in the program, it can interrupt the program, do something else, and then restore the state such that the program continues properly, unaware that it was ever interrupted. The architectural state is also of great importance when we build a microprocessor in Chapter 7.

Ada Lovelace, 1815–1852.

A British mathematician who wrote the first computer program. It calculated the Bernoulli numbers using Charles Babbage’s Analytical Engine. She was the daughter of the poet Lord Byron.

URL: https://www.sciencedirect.com/science/article/pii/B9780128000564000066

Architecture

Sarah L. Harris, David Harris, in Digital Design and Computer Architecture, 2022

6.4.7 Interpreting Machine Language Code

To interpret machine language, one must decipher the fields of each 32-bit instruction word. Different instructions use different formats, but all formats share a 7-bit opcode field. Thus, the best place to begin is to look at the opcode to determine if it is an R-, I-, S/B-, or U/J-type instruction.

Example 6.6

Translating Machine Language to Assembly Language

Translate the following machine language code into assembly language.

 0x41FE83B3

 0xFDA48293

Solution

First, we represent each instruction in binary and look at the seven least significant bits to find the opcode for each instruction.

  0100 0001 1111 1110 1000 0011 1011 0011 (0x41FE83B3)

  1111 1101 1010 0100 1000 0010 1001 0011 (0xFDA48293)

The opcode determines how to interpret the rest of the bits. The first instruction's opcode is 0110011₂; so, according to Table B.1 in Appendix B, it is an R-type instruction and we can divide the rest of the bits into the R-type fields, as shown at the top of Figure 6.28. The second instruction's opcode is 0010011₂, which means it is an I-type instruction. We group the remaining bits into the I-type format, as seen in Figure 6.28, which shows the assembly code equivalent of the two machine instructions.
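The opcode extraction itself is a single mask (Python sketch):

```python
def opcode(word):
    """RISC-V places the opcode in the 7 least significant bits."""
    return word & 0x7F
```

Thus opcode(0x41FE83B3) is 0b0110011 (R-type) and opcode(0xFDA48293) is 0b0010011 (I-type), matching the binary expansions above.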

Figure 6.28. Machine code to assembly code translation

URL: https://www.sciencedirect.com/science/article/pii/B9780128200643000064

Instruction Sets

Joseph Yiu, in The Definitive Guide to the ARM Cortex-M3 (Second Edition), 2009

4.1.3 Assembler Language: Unified Assembler Language

To support and get the best out of the Thumb®-2 instruction set, the Unified Assembler Language (UAL) was developed. It allows selection of 16-bit and 32-bit instructions and makes it easier to port applications between ARM code and Thumb code by using the same syntax for both. (With UAL, the syntax of Thumb instructions is now the same as for ARM instructions.)

ADD R0, R1 ; R0 = R0 + R1, using Traditional Thumb syntax

ADD R0, R0, R1 ; Equivalent instruction using UAL syntax

The traditional Thumb syntax can still be used. Whether the instructions are interpreted as traditional Thumb code or the new UAL syntax is normally defined by a directive in the assembly file. For example, with the ARM assembler tool, a program code header with the "CODE16" directive implies that the code is in the traditional Thumb syntax, and the "THUMB" directive implies that the code is in the new UAL syntax.

One thing you need to be careful with when reusing traditional Thumb code is that some instructions change the flags in the APSR even if the S suffix is not used. When the UAL syntax is used, however, whether an instruction changes the flags depends on the S suffix. For example,

AND R0, R1 ; Traditional Thumb syntax

ANDS R0, R0, R1 ; Equivalent UAL syntax (S suffix is added)

With the new instructions in Thumb-2 technology, some of the operations can be handled by either a Thumb instruction or a Thumb-2 instruction. For example, R0 = R0 + 1 can be implemented as a 16-bit Thumb instruction or a 32-bit Thumb-2 instruction. With UAL, you can specify which instruction you want by adding suffixes:

ADDS R0, #1 ; Use 16-bit Thumb instruction by default

 ; for smaller size

ADDS.N R0, #1 ; Use 16-bit Thumb instruction (N=Narrow)

ADDS.W R0, #1 ; Use 32-bit Thumb-2 instruction (W=wide)

The .W (wide) suffix specifies a 32-bit instruction. If no suffix is given, the assembler tool can choose either instruction but usually defaults to 16-bit Thumb code to get a smaller size. Depending on tool support, you may also use the .N (narrow) suffix to specify a 16-bit Thumb instruction.

Again, this syntax is for ARM assembler tools; other assemblers might have slightly different syntax.

In most cases, applications will be coded in C, and the C compilers will use 16-bit instructions if possible due to smaller code size. However, when the immediate data exceed a certain range or when the operation can be better handled with a 32-bit Thumb-2 instruction, the 32-bit instruction will be used.

The 32-bit Thumb-2 instructions can be halfword aligned. For example, you can have a 32-bit instruction located at a halfword location:

0x1000 : LDR r0,[r1]  ; a 16-bit instruction (occupies 0x1000-0x1001)

0x1002 : RBIT.W r0    ; a 32-bit Thumb-2 instruction (occupies 0x1002-0x1005)

Most of the 16-bit instructions can only access registers R0–R7; 32-bit Thumb-2 instructions do not have this limitation. However, use of PC (R15) might not be allowed in some of the instructions. Refer to the ARM v7-M Architecture Application Level Reference Manual [Ref. 2] (section A4.6) if you need to find out more detail in this area.

URL: https://www.sciencedirect.com/science/article/pii/B9781856179638000077

Cloud Resource Virtualization

Dan C. Marinescu, in Cloud Computing, 2013

5.13 Software fault isolation

Software fault isolation (SFI) offers a technical solution for sandboxing binary code of questionable provenance, a capability that matters for security in cloud computing. Insecure and tampered VM images are one such threat; binary code of questionable provenance, such as native plug-ins for a Web browser, can also pose a security threat when Web browsers are used to access cloud services.

A recent paper [322] discusses the application of sandboxing technology to two modern CPU architectures, ARM and 64-bit x86. ARM is a load/store architecture with 32-bit instructions and 16 general-purpose registers. It tends to avoid multicycle instructions, and it shares many RISC architecture features, but (a) it supports a "thumb" mode with 16-bit instruction extensions; (b) it has complex addressing modes and a complex barrel shifter; and (c) condition codes can be used to predicate most instructions. In the x86-64 architecture, general-purpose registers are extended to 64 bits, with an r replacing the e to identify the 64-bit versus 32-bit registers (e.g., rax instead of eax). There are eight new general-purpose registers, named r8–r15. To allow legacy instructions to use these additional registers, x86-64 defines a set of new prefix bytes for register selection.

This SFI implementation is based on the previous work of the same authors on Google Native Client (NC) and assumes an execution model in which a trusted run-time shares a process with an untrusted multithreaded plug-in. The rules for binary code generation of the untrusted plug-in are: (i) the code section is read-only and is statically linked; (ii) the code is divided into 32-byte bundles, and no instruction or pseudo-instruction crosses the bundle boundary; (iii) the disassembly starting at the bundle boundary reaches all valid instructions; and (iv) all indirect flow-control instructions are replaced by pseudo-instructions that ensure address alignment to bundle boundaries.
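Rule (iv) amounts to masking each indirect control-flow target down to a bundle boundary. A sketch of that masking in Python (the 32-byte bundle size follows the x86 rules quoted here; Table 5.4 gives 16 bytes for ARM):

```python
BUNDLE = 32  # bundle size in bytes for the x86 Native Client rules above

def sandbox_target(addr, bundle=BUNDLE):
    """Force an indirect control-flow target onto a bundle boundary.

    Clearing the low bits guarantees the jump lands at the start of a
    bundle, where disassembly is known to reach only valid instructions.
    """
    return addr & ~(bundle - 1)
```

A target of 0x1007 is forced back to 0x1000, while a target already on a bundle boundary passes through unchanged.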

The features of the SFI for the Native Client on the x86-32, x86-64, and ARM architectures are summarized in Table 5.4 [322]. The control-flow and store sandboxing for the ARM SFI incur less than 5% average overhead, and those for the x86-64 SFI incur less than 7% average overhead.

Table 5.4. The features of the SFI for the native client on the x86-32, x86-64, and ARM. ILP stands for instruction-level parallelism.

Feature/Architecture    x86-32                x86-64                     ARM
Addressable memory      1 GB                  4 GB                       1 GB
Virtual base address    Any                   44 GB                      0
Data model              ILP 32                ILP 32                     ILP 32
Reserved registers      0 of 8                1 of 16                    0 of 16
Data address mask       None                  Implicit in result width   Explicit instruction
Control address mask    Explicit instruction  Explicit instruction       Explicit instruction
Bundle size (bytes)     32                    32                         16
Data in text segment    Forbidden             Forbidden                  Allowed
Safe address registers  All                   RSP, RBP                   SP
Out-of-sandbox store    Trap                  Wraps mod 4 GB             No effect
Out-of-sandbox jump     Trap                  Wraps mod 4 GB             Wraps mod 1 GB

URL: https://www.sciencedirect.com/science/article/pii/B9780124046276000051

CPUs

Marilyn Wolf, in High-Performance Embedded Computing (Second Edition), 2014

2.4 Parallel execution mechanisms

In this section we will look at various ways that processors perform operations in parallel. We will consider very long instruction word and superscalar processing, subword parallelism, vector processing, thread level parallelism, and graphic processing units (GPUs). We will end this section with a brief consideration of the available parallelism in some embedded applications.

2.4.1 Very long instruction word processors

Very long instruction word (VLIW) architectures were originally developed as general-purpose processors but have seen widespread use in embedded systems. VLIW architectures provide instruction-level parallelism with relatively low hardware overhead.

VLIW basics

Figure 2.3 shows a simplified version of a VLIW processor to introduce the basic principles of the technique. The execution unit includes a pool of function units connected to a large register file. Using today’s terminology for VLIW machines, the execution unit reads a packet of instructions—each instruction in the packet can control one of the function units in the machine. In an ideal VLIW machine, all instructions in the packet are executed simultaneously; in modern machines, it may take several cycles to retire all the instructions in the packet. Unlike a superscalar processor, the order of execution is determined by the structure of the code and how instructions are grouped into packets; the next packet will not begin execution until all the instructions in the current packet have finished.

FIGURE 2.3. Structure of a generic VLIW processor.

Because the organization of instructions into packets determines the schedule of execution, VLIW machines rely on powerful compilers to identify parallelism and schedule instructions. The compiler is responsible for enforcing resource limitations and their associated scheduling policies. In compensation, the execution unit is simpler because it does not have to check for many resource interdependencies.

The ideal VLIW is relatively easy to program because of its large, uniform register file. The register file provides a communication mechanism between the function units since each function unit can read operands from and write results to any register in the register file.

Split register files

Unfortunately, it is difficult to build large, fast register files with many ports. As a result, many modern VLIW machines use partitioned register files as shown in Figure 2.4. In the example, the registers have been split into two register files, each of which is connected to two function units. The combination of a register file and its associated function units is sometimes called a cluster. A cluster bus can be used to move values between the register files. Register file to register file movement is performed under program control using explicit instructions. As a result, partitioned register files make the compiler’s job more difficult. The compiler must partition values among the register files, determine when a value needs to be copied from one register file to another, generate the required move instructions, and adjust the schedules of the other operations to wait for the values to appear. However, the characteristics of VLIW circuits often require us to design partitioned register file architectures.

FIGURE 2.4. Split register files in a VLIW machine.

Uses of VLIW

VLIW machines have been used in applications with a great deal of data parallelism. The Trimedia family of processors, for example, was designed for use in video systems. Video algorithms often perform similar operations on several pixels at a time, making it relatively easy to generate parallel code. VLIW machines have also been used for signal processing and networking. Cell phone baseband systems, for example, must perform the same signal processing on many channels in parallel; the same instructions can be performed on separate data streams using VLIW architectures. Similarly, networking systems must perform the same or similar operations on several packets at the same time.

The next example describes a VLIW digital signal processor.

Example 2.3

Texas Instruments C6000 VLIW DSP

The TI C6000 family [Tex11] is a VLIW architecture designed for digital signal processing. The architecture is designed around a pair of data paths, each with its own 32-word register file (known as register files A and B). Each datapath has a .D unit for data load/store operations, a .L unit for logic and arithmetic, a .S unit for shift/branch/compare operations, and a .M unit for multiplication operations. These function units can all operate independently. They are supported by a program bus that can fetch eight 32-bit instructions on every cycle and two data buses that allow the .D1 and .D2 units to both fetch from the level 1 data memory on every cycle.

2.4.2 Superscalar processors

Superscalar processors issue more than one instruction per clock cycle. Unlike VLIW processors, they check for resource conflicts on the fly to determine what combinations of instructions can be issued at each step. Superscalar architectures dominate desktop and server architectures. Superscalar processors are not as common in the embedded world as in the desktop/server world. Embedded computing architectures are more likely to be judged by metrics such as operations per watt rather than raw performance.

A surprising number of embedded processors do, however, make use of superscalar instruction issue, though not as aggressively as do high-end servers. The embedded Pentium processor is a two-issue, in-order processor. It has two pipes: one for any integer operation and another for simple integer operations. We saw in Section 2.3.1 that other embedded processors also use superscalar techniques.

2.4.3 SIMD and vector processors

Many applications present data-level parallelism that lends itself to efficient computing structures. Furthermore, much of this data is relatively small, which allows us to build more parallel processing units to soak up more of that available parallelism.

Data operand sizes

A variety of studies have shown that many of the variables used in most programs have small dynamic ranges. Figure 2.5 shows the results of one such study by Fritts [Fri00]. He analyzed the data types of programs in the MediaBench benchmark suite [Lee97]. The results show that 8-bit (byte) and 16-bit (half-word) operands dominate this suite of programs. If we match the function unit widths to the operand sizes, we can put more function units in the available silicon than if we simply used wide-word function units to perform all operations.


FIGURE 2.5. Operand sizes in MediaBench benchmarks [Fri00].

Subword parallelism

One technique that exploits small operand sizes is subword parallelism [Lee94]. The processor’s ALU can either operate in normal mode or it can be split into several smaller ALUs. An ALU can easily be split by breaking the carry chain so that bit slices operate independently. Each subword can operate on independent data; the operations are all controlled by the same opcode. Because the same instruction is performed on several data values, this technique is often referred to as a form of SIMD.
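The carry-chain split described above can be mimicked in software. The following C sketch (not from the text; a standard SWAR illustration) performs four independent 8-bit additions packed into one 32-bit word, masking off bit 7 of each lane so that no carry crosses a lane boundary:

```c
#include <stdint.h>

/* Add four independent 8-bit lanes packed into 32-bit words.
 * The carry chain is "broken" by masking off the top bit of each
 * lane before adding, then recombining the top bits with XOR, so
 * no carry can propagate from one subword into the next. */
uint32_t add_bytes_swar(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}
```

Each byte lane wraps around independently, exactly as four separate 8-bit ALUs controlled by a single opcode would.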

Vectorization

Another technique for data parallelism is vector processing. Vector processors have been used in scientific computers for decades; they use specialized instructions that are designed to efficiently perform operations such as dot products on vectors of values. Vector processing does not rely on small data values, but vectors of smaller data types can perform more operations in parallel on available hardware, particularly when subword parallelism methods are used to manage datapath resources.

The next example describes a widely used vector processing architecture.

Example 2.4

AltiVec Vector Architecture

The AltiVec vector architecture [Ful98, Fre13] was defined by Motorola (now Freescale Semiconductor) for the PowerPC architecture. AltiVec provides a 128-bit vector unit that can be divided into operands of several sizes: 4 operands of 32 bits, 8 operands of 16 bits, or 16 operands of 8 bits. A register file provides 32 128-bit vectors to the vector unit. The architecture defines a number of operations, including logical and arithmetic operations within an element as well as interelement operations such as permutations.

2.4.4 Thread-level parallelism

Processors can also exploit thread- or task-level parallelism. It may be easier to find thread-level parallelism, particularly in embedded applications. The behavior of threads may be more predictable than instruction-level parallelism.

Varieties of multithreading

Multithreading architectures must provide separate registers for each thread. But because switching between threads is stylized, the control required for multithreading is relatively straightforward. Hardware multithreading alternately fetches instructions from separate threads. On one cycle, it will fetch several instructions from one thread, fetching enough instructions to be able to keep the pipelines full in the absence of interlocks. On the next cycle, it fetches instructions from another thread. Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle rather than alternating between threads.

Multithreading in Atom

The Intel Atom S1200 [Int12] provides hyper-threading that allows the core to act as two logical processors. Each logical processor has its own set of general purpose and control registers. The underlying physical resources—execution units, buses, and caches—are shared.

2.4.5 GPUs

Graphic processing units (GPUs) are widely used in PCs to perform graphics operations. The most basic mode of operation in a GPU is SIMD. As illustrated in Figure 2.6, the graphics frame buffer holds the pixel values to be written onto the screen. Many graphics algorithms perform identical operations on each section of the screen, with only the data changing by position in the frame buffer. The processing elements (PEs) for the GPU can be mapped onto sections of the screen. Each PE can execute the same graphics code on its own data stream. The sections of the screen are therefore rendered in parallel.


FIGURE 2.6. SIMD processing for graphics.

As mobile multimedia devices have proliferated, GPUs have migrated onto embedded systems-on-chips. For example, the BCM2835 includes both an ARM11 CPU and two VideoCore IV GPUs [Bro13]. The BCM2835 is used in the Raspberry Pi embedded computer [Ras13].

The NVIDIA Fermi [NVI09] illustrates some important aspects of modern GPUs. Although it is not deployed on embedded processors at the time of this writing, we can expect embedded GPUs to embody more of these features as Moore’s Law advances. Figure 2.7 illustrates the overall Fermi architecture. At the center are three types of processing units: cores, load/store units, and special function units that provide transcendental mathematical functions. The operation of all three units is controlled by the two warp schedulers and dispatch units. A warp is a group of 32 parallel threads. One warp scheduler and dispatch unit controls the execution of its warp’s 32 parallel threads across the cores, load/store, and special function units. Each warp scheduler’s warp is independent, so the two active warps execute independently. Physically, the system provides a register file, shared memory and L1 cache, and a uniform cache. Figure 2.8 shows the architecture of a single core. Each core includes floating-point and integer units. The dispatch port, operand collector, and result queue manage the retrieval of operands and storage of results.


FIGURE 2.7. The Fermi architecture.


FIGURE 2.8. Architecture of a CUDA core.

The programming model provides a hierarchy of programming units. The most basic is the thread, identified by a thread ID. Each thread has its own program counter, registers, private memory, and inputs and outputs. A thread block, identified by its block ID, is a set of threads that share memory and can coordinate using barrier synchronization. A grid is an array of thread blocks that execute the same kernel. The thread blocks in a grid can share results using global memory.
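The index arithmetic behind this hierarchy can be illustrated with a plain-C sketch that sequentially emulates a one-dimensional grid of thread blocks computing a vector sum. The function and variable names mirror the model described above but are assumptions for illustration, not the CUDA runtime API:

```c
/* Emulate one "thread" of a 1-D kernel launch: each thread computes
 * one element of c = a + b, addressed by its global thread index. */
static void kernel(int block_id, int thread_id, int block_dim,
                   const float *a, const float *b, float *c, int n)
{
    int i = block_id * block_dim + thread_id;  /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

/* Sequentially run every thread of every block in the grid.  On a
 * real GPU the blocks and threads would execute in parallel. */
void launch_grid(int grid_dim, int block_dim,
                 const float *a, const float *b, float *c, int n)
{
    for (int blk = 0; blk < grid_dim; blk++)
        for (int t = 0; t < block_dim; t++)
            kernel(blk, t, block_dim, a, b, c, n);
}
```

The guard `i < n` handles the common case where the grid provides slightly more threads than there are data elements.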

2.4.6 Processor resource utilization

The choice of processor architecture depends in part on the characteristics of the programs to be run on the processor. In many embedded applications we can leverage our knowledge of the core algorithms to choose effective CPU architectures. However, we must be careful to understand the characteristics of those applications. As an example, many researchers assume that multimedia algorithms exhibit embarrassing levels of parallelism. Experiments show that this is not necessarily the case.

Measurements on multimedia benchmarks

Talla et al. [Tal03] evaluated the instruction-level parallelism available in multimedia applications. As shown in Figure 2.9, they evaluated several different processor configurations using SimpleScalar. They measured nine benchmark programs on the various architectures. The bar graphs show the instructions per cycle for each application; most applications exhibit fewer than four instructions per cycle.


FIGURE 2.9. An evaluation of the available parallelism in multimedia applications [Tal03] ©2003 IEEE.

Fritts [Fri00] studied the characteristics of loops in the MediaBench suite [Lee97]. Figure 2.10 shows two measurements; in each case, results are shown with the benchmark programs grouped into categories based on their primary function. The first measurement shows the average number of iterations of a loop; fortunately, loops on average are executed many times. The second measurement shows path ratio, which is defined as


FIGURE 2.10. Dynamic behavior of loops in MediaBench [Fri00].

(EQ 2.1)  PR = (number of loop body instructions executed / total number of instructions in loop body) × 100

Path ratio measures the percentage of a loop’s instructions that are actually executed. The average path ratio over all the MediaBench benchmarks was 78%, which means that 22% of the loop instructions were not executed.
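The path-ratio computation itself is simple arithmetic; a small C helper (hypothetical, for illustration) makes the definition concrete:

```c
/* Path ratio (EQ 2.1): the percentage of a loop body's instructions
 * that are actually executed on an average iteration. */
double path_ratio(double body_instructions_executed,
                  double total_body_instructions)
{
    return body_instructions_executed / total_body_instructions * 100.0;
}
```

For example, a loop body of 50 instructions in which 39 execute on an average iteration has a path ratio of 78%, matching the MediaBench average.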

Multimedia algorithms

These results should not be surprising given the nature of modern embedded algorithms. Modern signal processing algorithms have moved well beyond filtering. Many algorithms use control to improve performance. The large specifications for multimedia standards will naturally result in complex programs.

Implications for CPUs

To take advantage of the available parallelism in multimedia and other embedded applications, we need to match the processor architecture to the application characteristics. These experiments suggest that processor architectures must exploit parallelism at several levels of abstraction.


URL: https://www.sciencedirect.com/science/article/pii/B9780124105119000022

Hardware and Software for Digital Signal Processors

Lizhe Tan, Jean Jiang, in Digital Signal Processing (Third Edition), 2019

14.6 Digital Signal Processing Programming Examples

In this section, we first review the TMS320C67x DSK (DSP Starter Kit), which offers floating-point and fixed-point arithmetic. We will then investigate the real-time implementation of digital filters.

14.6.1 Overview of TMS320C67x DSK

In this section, a TI TMS320C6713 DSK (DSP Starter Kit), shown in Fig. 14.18, is chosen for demonstration. The DSK board measures approximately 5 × 8 inches, runs at a clock rate of 225 MHz, and carries a 16-bit stereo codec, the TLV320AIC23 (AIC23), which handles the analog inputs and outputs. The onboard AIC23 codec applies sigma-delta technology for its ADC and DAC functions. The codec runs from a 12 MHz system clock, and the sampling rate can be selected over a range of 8–96 kHz for speech and audio processing. Other boards, such as the TI TMS320C6711 DSK, are described in the references (Kehtaranavaz and Simsek, 2000; TMS320C6x CPU and Instruction Set Reference Guide, 1999). The onboard daughter-card connections accommodate external units for advanced applications, such as external peripherals and external memory interfaces (EMIFs). The TMS320C6713 DSK board provides 16 MB (megabytes) of synchronous dynamic RAM (SDRAM) and 512 kB (kilobytes) of flash memory. There are four onboard audio connections: MIC IN for microphone input, LINE IN for line input, LINE OUT for line output, and HEADPHONE for headphone output (multiplexed with LINE OUT). The four user DIP switches can be read by a program running on the DSK board and provide a simple user-feedback interface. The four LEDs (light-emitting diodes) on the board can be controlled by a running DSP program. The onboard voltage regulators supply 1.26 V for the DSP core and 3.3 V for the memory and peripherals. The USB port connects the DSK board to the host computer, where the user program is developed, compiled, and downloaded to the DSK for real-time applications using the user-friendly Code Composer Studio (CCS) software, which we discuss later.


Fig. 14.18. C6713 DSK board and block diagram. (A) TMS320C6713 DSK board. (B) TMS320C6713 DSK block diagram.

Courtesy of Texas Instruments.

In general, the TMS320C67x operates at a high clock rate of 300 MHz. Combining high speed with multiple units operating at the same time pushes its performance up to 2400 MIPS at 300 MHz. At this rate, the C67x can execute 0.3 million instructions between two speech samples at a sampling rate of 8 kHz, and over 54,000 instructions between two audio samples at a sampling rate of 44.1 kHz. Hence, the C67x offers great flexibility for real-time applications programmed in a high-level language such as C.

Fig. 14.19 shows a C67x architecture overview, while Fig. 14.20 displays a more detailed block diagram. The C67x contains three main parts: the CPU, the memories, and the peripherals. As shown in Fig. 14.19, these three main parts are joined by internal buses and an EMIF, which facilitates interfacing with common memory devices, DMA, a serial port, and a host-port interface (HPI).


Fig. 14.19. Block diagram of TMS320C67x floating-point DSP.


Fig. 14.20. Registers of TMS320C67x floating-point DSP.

Since this section is devoted to showing DSP coding examples, the C67x key features and references are only briefly listed here:

(1)

Architecture: The system uses the Texas Instruments VelociTI architecture, an enhancement of the VLIW (very long instruction word) architecture (Dahnoun, 2000; Ifeachor and Jervis, 2002; Kehtaranavaz and Simsek, 2000).

(2)

CPU: As shown in Fig. 14.20, the CPU has eight functional units divided into two sides, A and B, each consisting of units .D, .M, .L, and .S. For each side, the .M unit is used for multiplication operations, the .L unit for logical and arithmetic operations, the .S unit for shift, branch, and compare operations, and the .D unit for loading/storing and arithmetic operations. Each side of the C67x CPU has sixteen 32-bit registers that the CPU must go through for interfacing. More detail can be found in Appendix D (Texas Instruments, 1991) as well as in Kehtaranavaz and Simsek (2000) and Texas Instruments (1998).

(3)

Memory and internal buses: Memory space is divided into internal program memory, internal data memory, and internal peripheral and external memory space. The internal buses include a 32-bit program address bus, a 256-bit program data bus carrying eight 32-bit instructions (VLIW), two 32-bit data address buses, two 64-bit load data buses, two 64-bit store data buses, two 32-bit DMA buses, and two 32-bit DMA address buses responsible for reading and writing. There also exist a 22-bit address bus and a 32-bit data bus for accessing off-chip or external memory.

(4)

Peripherals:

(a)

EMIF, which provides the required timing for accessing external memory

(b)

DMA, which moves data from one memory location to another without interfering with the CPU operations

(c)

Multichannel buffered serial port (McBSP) with a high-speed multi-channel serial communication link

(d)

HPI, which lets a host access internal memory

(e)

Boot loader for loading code from off-chip memory or the HPI to internal memory

(f)

Timers (two 32-bit counters)

(g)

Power-down units for saving power for periods when the CPU is inactive.

The software tool for the C67x is the CCS provided by TI. It allows the user to build and debug programs from a user-friendly graphical user interface (GUI) and extends the capabilities of code development tools to include real-time analysis. Installation, tutorial, coding, and debugging information can be found in the CCS Getting Started Guide (Texas Instruments, 2001) and in Kehtaranavaz and Simsek (2000).

The TMS320C6713 DSK in particular, with a clock rate of 225 MHz, can fetch eight 32-bit instructions every 4.4 ns (1/(225 MHz)). The functional block diagram is shown in Fig. 14.21. The detailed description can be found in Chassaing and Reay (2008).


Fig. 14.21. Functional block diagram and registers of TMS320C6713.

Courtesy of Texas Instruments.

14.6.2 Concept of Real-Time Processing

We illustrate the real-time implementation shown in Fig. 14.22, where the sampling rate is 8000 samples per second; that is, the sampling period T = 1/fs = 125 μs, which is the time between two samples.


Fig. 14.22. Concept of real-time processing.

As shown in Fig. 14.22, the required timing includes an input sample clock and an output sample clock. The input sample clock maintains the accuracy of the sampling time for each ADC operation, while the output sample clock maintains the accuracy of the time instant for each DAC operation. The time between input sample clock n and output sample clock n consists of the ADC operation, the algorithm processing, and the wait for the next ADC operation. The number of instructions for the ADC and the DSP algorithm must be estimated and verified to ensure that all of them complete before the DAC begins. Similarly, the number of instructions for the DAC must be verified so that the DAC instructions finish between output sample clock n and the next input sample clock n + 1. Timing is usually set up using the DSP interrupts (we will not pursue interrupt setup here).
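The instruction budget implied by this timing can be sketched with a one-line calculation; the helper below is illustrative, not from the text:

```c
#include <stdint.h>

/* Clock cycles available between two samples: the ADC read, the DSP
 * algorithm, and the DAC write must all complete within this budget. */
uint32_t cycles_per_sample(uint32_t clock_hz, uint32_t fs_hz)
{
    return clock_hz / fs_hz;
}
```

At the C6713 DSK's 225 MHz clock and an 8 kHz sampling rate, the budget is 225,000,000 / 8000 = 28,125 cycles per sample; all processing for one sample must fit inside it.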

Next, we focus on the implementation of the DSP algorithm in the floating-point system for simplicity. A DSK setup example (Tan and Jiang, 2010) is depicted in Fig. 14.23, while a skeleton code for the verification of the input and output is depicted in Fig. 14.24.


Fig. 14.23. TMS320C6713 DSK setup example.


Fig. 14.24. Program segment for verifying input and output.

14.6.3 Linear Buffering

During DSP operations such as digital filtering, past inputs and past outputs must be buffered and updated in order to process the next input sample. Let us first study the FIR filter implementation.

FIR Filtering:

Consider the implementation of the following 3-tap FIR filter:

y(n) = 0.5x(n) + 0.2x(n − 1) + 0.5x(n − 2).

The buffer requirements are shown in Fig. 14.25. The coefficient buffer b[3] contains the three FIR coefficients and remains fixed during processing. The input buffer x[3], which holds the current and past inputs, must be updated for each new sample. A FIFO update is adopted here: beginning at the end of the data buffer, the oldest sample is discarded, and each location is overwritten with the value from the location above it. When the FIFO update completes, the first memory location x[0] is free to store the current input sample. The segment of code in Fig. 14.25 illustrates the implementation.


Fig. 14.25. Example of FIR filtering with linear buffer update.

Note that in the code segment, x[0] holds the current input sample x(n), while b[0] is the corresponding coefficient; x[1] and x[2] hold the past input samples x(n − 1) and x(n − 2), respectively; similarly, b[1] and b[2] are the corresponding coefficients.
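The FIR processing just described can be sketched in C. This is a hedged reconstruction in the spirit of the Fig. 14.25 listing (which this text does not reproduce), using the same buffer names b[3] and x[3]:

```c
#define NTAP 3

static const float b[NTAP] = {0.5f, 0.2f, 0.5f}; /* coefficient buffer (fixed) */
static float x[NTAP];                            /* current and past inputs    */

/* Produce one output sample of y(n) = 0.5x(n) + 0.2x(n-1) + 0.5x(n-2). */
float fir_filter(float input)
{
    float y = 0.0f;
    int i;

    /* FIFO update: begin at the end of the buffer, discard the oldest
     * sample, shift the rest down, then store the new input in x[0]. */
    for (i = NTAP - 1; i > 0; i--)
        x[i] = x[i - 1];
    x[0] = input;

    for (i = 0; i < NTAP; i++)
        y += b[i] * x[i];
    return y;
}
```

Feeding an impulse (1, 0, 0, …) through the filter returns the coefficients 0.5, 0.2, 0.5 in sequence, a quick way to verify the buffer update.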

Again, note that the array and loop structures are used in the code segment for notational simplicity, assuming that the reader may not be familiar with pointers in the C language; the emphasis here is on the DSP algorithm itself. More coding efficiency can be achieved using C pointers and a circular buffer. DSP-oriented coding implementations can be found in Kehtaranavaz and Simsek (2000) and Chassaing and Reay (2008).

IIR Filtering:

Similarly, we can implement an IIR filter. It requires an input buffer, which holds the current and past inputs; an output buffer, which holds the past outputs; a numerator coefficient buffer; and a denominator coefficient buffer. Considering the following IIR filter for implementation,

y(n) = 0.5x(n) + 0.7x(n − 1) − 0.5x(n − 2) − 0.4y(n − 1) + 0.6y(n − 2),

we accommodate the numerator coefficient buffer b[3], the denominator coefficient buffer a[3], the input buffer x[3], and the output buffer y[3] shown in Fig. 14.26. The buffer updates for input x[3] and output y[3] are FIFO. The implementation is illustrated in the segment of code listed in Fig. 14.26.


Fig. 14.26. Example of IIR filtering using linear buffer update.

Again, note that in the code segment, x[0] holds the current input sample, while y[0] holds the current processed output, which will be sent to the DAC unit for conversion. The coefficient a[0] is never modified in the code; we keep it for notational simplicity and consistency during the programming process.
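In the same spirit, a hedged C sketch of the direct-form I IIR filter of Fig. 14.26 (the figure holds the actual listing; buffer names b, a, x, and y follow the text):

```c
#define NB 3   /* numerator (b) length   */
#define NA 3   /* denominator (a) length */

static const float b[NB] = {0.5f, 0.7f, -0.5f}; /* numerator coefficients         */
static const float a[NA] = {1.0f, 0.4f, -0.6f}; /* denominator; a[0] not modified */
static float x[NB];                             /* current and past inputs        */
static float y[NA];                             /* current and past outputs       */

/* One sample of y(n) = 0.5x(n) + 0.7x(n-1) - 0.5x(n-2)
 *                      - 0.4y(n-1) + 0.6y(n-2).            */
float iir_filter(float input)
{
    int i;

    /* FIFO updates for the input and output buffers */
    for (i = NB - 1; i > 0; i--) x[i] = x[i - 1];
    for (i = NA - 1; i > 0; i--) y[i] = y[i - 1];
    x[0] = input;

    y[0] = 0.0f;
    for (i = 0; i < NB; i++) y[0] += b[i] * x[i];
    for (i = 1; i < NA; i++) y[0] -= a[i] * y[i];
    return y[0];
}
```

Note the sign convention: the a[1] and a[2] entries are the negated feedback coefficients of the difference equation, so they are subtracted in the feedback loop.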

Digital Oscillation with IIR Filtering:

The principle for generating digital oscillation is described in Chapter 8, where the input to the digital filter is the impulse sequence, and the transfer function is obtained by applying the z-transform of the digital sinusoid function. Applications can be found in dual-tone multifrequency (DTMF) tone generation, digital carrier generation for communications, and so on. Hence, we can modify the implementation of IIR filtering for tone generation with the input generated internally instead of using the ADC channel.

Let us generate an 800 Hz tone with the digital amplitude of 5000. According to the section in Chapter 8 (“Applications: Generation and Detection of DTMF Tones Using the Goertzel Algorithm”), the transfer function, difference equation, and the impulse input sequence are found to be, respectively,

H(z) = 0.587785z^(−1) / (1 − 1.618034z^(−1) + z^(−2))

y(n) = 0.587785x(n − 1) + 1.618034y(n − 1) − y(n − 2)

x(n) = 5000δ(n).

We define the numerator coefficient buffer b[2], the denominator coefficient buffer a[3], the input buffer x[2], and the output buffer y[3], shown in Fig. 14.27, which also shows the modified implementation for tone generation.


Fig. 14.27. Example of IIR filtering using linear buffer update and the impulse sequence input.

Initially, we set x[0] = 5000. It is then reset to x[0] = 0 after each processed output sample y[0].
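A hedged C sketch of the tone generator of Fig. 14.27: the impulse input x(n) = 5000δ(n) is produced internally rather than read from the ADC, and each output sample follows the difference equation above (function name is an assumption for illustration):

```c
/* Generate n samples of an 800-Hz tone at fs = 8 kHz by driving the
 * IIR resonator y(n) = 0.587785x(n-1) + 1.618034y(n-1) - y(n-2)
 * with an impulse of amplitude 5000. */
void tone_800hz(float *out, int n)
{
    float x[2] = {5000.0f, 0.0f};   /* x[0] = current, x[1] = past input */
    float y[3] = {0.0f, 0.0f, 0.0f};
    int k, i;

    for (k = 0; k < n; k++) {
        for (i = 2; i > 0; i--)      /* FIFO update of the output buffer */
            y[i] = y[i - 1];
        y[0] = 0.587785f * x[1] + 1.618034f * y[1] - y[2];
        out[k] = y[0];
        x[1] = x[0];                 /* shift the impulse into the delay */
        x[0] = 0.0f;                 /* input is zero after sample 0     */
    }
}
```

The output settles into 5000·sin(0.2πn): the first few samples are 0, about 2938.9, then about 4755.3, matching the sampled 800 Hz sinusoid.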

14.6.4 Sample C Programs

Floating-Point Implementation Example:

Real-time DSP implementation using a floating-point processor is easy to program. Overflow problems hardly ever occur, so we do not need to consider the scaling factors described in the last section. The code segment shown in Fig. 14.28 demonstrates the simplicity of coding the floating-point IIR filter using the direct-form I structure.


Fig. 14.28. Sample C code for IIR filtering (floating-point implementation).

Fixed-Point Implementation Example:

When execution time is critical, a fixed-point implementation is preferred even on a floating-point processor. We implement the following IIR filter, with a unit passband gain, in direct-form II:

H(z) = (0.0201 − 0.0402z^(−2) + 0.0201z^(−4)) / (1 − 2.1192z^(−1) + 2.6952z^(−2) − 1.6924z^(−3) + 0.6414z^(−4))

w(n) = x(n) + 2.1192w(n − 1) − 2.6952w(n − 2) + 1.6924w(n − 3) − 0.6414w(n − 4)

y(n) = 0.0201w(n) − 0.0402w(n − 2) + 0.0201w(n − 4).

Using MATLAB to calculate the scale factor S, it follows that:

» h = impz([1], [1 -2.1192 2.6952 -1.6924 0.6414]);

» sf = sum(abs(h))

 sf = 28.2196

Hence we choose S = 32. To scale the filter coefficients in the Q-15 format, we use the factors A = 4 and B = 1. Then the developed DSP equations are

xs(n) = x(n)/32

ws(n) = 0.25xs(n) + 0.5298ws(n − 1) − 0.6738ws(n − 2) + 0.4231ws(n − 3) − 0.16035ws(n − 4)

w(n) = 4ws(n)

ys(n) = 0.0201w(n) − 0.0402w(n − 2) + 0.0201w(n − 4)

y(n) = 32ys(n).

Using the method described in Section 14.5, we can convert filter coefficients into the Q-15 format; each coefficient is listed in Table 14.4.

Table 14.4. Filter Coefficients in Q-15 Format

IIR Filter    Filter Coefficient    Q-15 Format (Hex)
−a1           0.5298                0x43D0
−a2           −0.6738               0xA9C1
−a3           0.4230                0x3628
−a4           −0.16035              0xEB7A
b0            0.0201                0x0293
b1            0.0000                0x0000
b2            −0.0402               0xFADB
b3            0.0000                0x0000
b4            0.0201                0x0293

The list of codes for the fixed-point implementation is displayed in Fig. 14.29, and some coding notations are given in Fig. 14.30.


Fig. 14.29. Sample C code for IIR filtering (fixed-point implementation).


Fig. 14.30. Some coding notations for the Q-15 fixed-point implementation.

Note that this chapter has provided only basic concepts and an introduction to real-time DSP implementation. The coding detail and real-time DSP applications will be treated in a separate DSP course, which deals with real-time implementations.


URL: https://www.sciencedirect.com/science/article/pii/B9780128150719000142

Microarchitecture

David Money Harris, Sarah L. Harris, in Digital Design and Computer Architecture (Second Edition), 2013

7.3.1 Single-Cycle Datapath

This section gradually develops the single-cycle datapath, adding one piece at a time to the state elements from Figure 7.1. The new connections are emphasized in black (or blue, for new control signals), while the hardware that has already been studied is shown in gray.

The program counter (PC) register contains the address of the instruction to execute. The first step is to read this instruction from instruction memory. Figure 7.2 shows that the PC is simply connected to the address input of the instruction memory. The instruction memory reads out, or fetches, the 32-bit instruction, labeled Instr.


Figure 7.2. Fetch instruction from memory

The processor's actions depend on the specific instruction that was fetched. First we will work out the datapath connections for the lw instruction. Then we will consider how to generalize the datapath to handle the other instructions.

For a lw instruction, the next step is to read the source register containing the base address. This register is specified in the rs field of the instruction, Instr[25:21]. These bits of the instruction are connected to the address input of one of the register file read ports, A1, as shown in Figure 7.3. The register file reads the register value onto RD1.


Figure 7.3. Read source operand from register file

The lw instruction also requires an offset. The offset is stored in the immediate field of the instruction, Instr[15:0]. Because the 16-bit immediate might be either positive or negative, it must be sign-extended to 32 bits, as shown in Figure 7.4. The 32-bit sign-extended value is called SignImm. Recall from Section 1.4.6 that sign extension simply copies the sign bit (most significant bit) of a short input into all of the upper bits of the longer output. Specifically, SignImm[15:0] = Instr[15:0] and SignImm[31:16] = Instr[15].


Figure 7.4. Sign-extend the immediate
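The sign extension just described maps directly to a few lines of C; a small sketch (names are illustrative):

```c
#include <stdint.h>

/* Sign-extend the 16-bit immediate field Instr[15:0] to 32 bits:
 * copy bit 15 of the immediate into all of the upper bits, so
 * SignImm[31:16] = Instr[15]. */
uint32_t sign_extend_imm(uint32_t instr)
{
    uint32_t imm = instr & 0xFFFFu;
    return (imm & 0x8000u) ? (imm | 0xFFFF0000u) : imm;
}
```

A negative immediate such as 0xFFFC (−4) becomes 0xFFFFFFFC, while a positive one such as 0x0004 is simply zero-padded.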

The processor must add the base address to the offset to find the address to read from memory. Figure 7.5 introduces an ALU to perform this addition. The ALU receives two operands, SrcA and SrcB. SrcA comes from the register file, and SrcB comes from the sign-extended immediate. The ALU can perform many operations, as was described in Section 5.2.4. The 3-bit ALUControl signal specifies the operation. The ALU generates a 32-bit ALUResult and a Zero flag that indicates whether ALUResult == 0. For a lw instruction, the ALUControl signal should be set to 010 to add the base address and offset. ALUResult is sent to the data memory as the address for the load instruction, as shown in Figure 7.5.


Figure 7.5. Compute memory address

The data is read from the data memory onto the ReadData bus, then written back to the destination register in the register file at the end of the cycle, as shown in Figure 7.6. Port 3 of the register file is the write port. The destination register for the lw instruction is specified in the rt field, Instr[20:16], which is connected to the port 3 address input, A3, of the register file. The ReadData bus is connected to the port 3 write data input, WD3, of the register file. A control signal called RegWrite is connected to the port 3 write enable input, WE3, and is asserted during a lw instruction so that the data value is written into the register file. The write takes place on the rising edge of the clock at the end of the cycle.


Figure 7.6. Write data back to register file

While the instruction is being executed, the processor must compute the address of the next instruction, PC′. Because instructions are 32 bits = 4 bytes, the next instruction is at PC + 4. Figure 7.7 uses another adder to increment the PC by 4. The new address is written into the program counter on the next rising edge of the clock. This completes the datapath for the lw instruction.


Figure 7.7. Determine address of next instruction for PC
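The complete lw datapath can be mimicked by a short C routine. This is an illustrative software model of the hardware just described (array sizes and names are assumptions), not a full MIPS simulator:

```c
#include <stdint.h>

#define MEMWORDS 64

uint32_t pc;               /* program counter    */
uint32_t imem[MEMWORDS];   /* instruction memory (word-addressed) */
uint32_t dmem[MEMWORDS];   /* data memory        (word-addressed) */
uint32_t reg[32];          /* register file      */

/* One cycle of the lw datapath: fetch, read base register rs,
 * sign-extend the immediate, add to form the address, read data
 * memory, write register rt, and advance the PC by 4. */
void lw_cycle(void)
{
    uint32_t instr = imem[pc / 4];             /* fetch instruction  */
    uint32_t rs = (instr >> 21) & 0x1Fu;       /* Instr[25:21]       */
    uint32_t rt = (instr >> 16) & 0x1Fu;       /* Instr[20:16]       */
    uint32_t imm = instr & 0xFFFFu;            /* Instr[15:0]        */
    uint32_t signimm = (imm & 0x8000u) ? (imm | 0xFFFF0000u) : imm;
    uint32_t addr = reg[rs] + signimm;         /* ALUResult          */
    reg[rt] = dmem[addr / 4];                  /* ReadData -> RF     */
    pc += 4;                                   /* PC' = PC + 4       */
}
```

For example, executing the encoding of lw $2, 8($1) with reg[1] = 4 loads the word at data address 12 into register 2 and leaves the PC at the next instruction.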

Next, let us extend the datapath to also handle the sw instruction. Like the lw instruction, the sw instruction reads a base address from port 1 of the register file and sign-extends an immediate. The ALU adds the base address to the immediate to find the memory address. All of these functions are already supported by the datapath.

The sw instruction also reads a second register from the register file and writes it to the data memory. Figure 7.8 shows the new connections for this function. The register is specified in the rt field, Instr20:16. These bits of the instruction are connected to the second register file read port, A2. The register value is read onto the RD2 port. It is connected to the write data port of the data memory. The write enable port of the data memory, WE, is controlled by MemWrite. For a sw instruction, MemWrite = 1, to write the data to memory; ALUControl = 010, to add the base address and offset; and RegWrite = 0, because nothing should be written to the register file. Note that data is still read from the address given to the data memory, but this ReadData is ignored because RegWrite = 0.
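The data flow just described can be sketched as a short Python model. The function name and register/memory dictionaries are illustrative conventions, not part of the hardware; the signal values in the comments follow the text above.

```python
def sw_step(regs, mem, base_reg, rt_reg, imm16):
    """Store word: mem[regs[base_reg] + SignImm] = regs[rt_reg]."""
    sign_imm = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    addr = (regs[base_reg] + sign_imm) & 0xFFFFFFFF   # ALUControl = 010 (add)
    mem[addr] = regs[rt_reg]                          # MemWrite = 1
    # RegWrite = 0: the register file is untouched; ReadData is ignored.

regs = {9: 0x10000000, 10: 0xDEADBEEF}
mem = {}
sw_step(regs, mem, base_reg=9, rt_reg=10, imm16=4)
assert mem[0x10000004] == 0xDEADBEEF
```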


Figure 7.8. Write data to memory for sw instruction

Next, consider extending the datapath to handle the R-type instructions add, sub, and, or, and slt. All of these instructions read two registers from the register file, perform some ALU operation on them, and write the result back to a third register. They differ only in the specific ALU operation. Hence, they can all be handled with the same hardware, using different ALUControl signals.
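A minimal software model of this idea follows, using the 3-bit ALUControl encodings from Section 5.2.4 (010 = add, 110 = subtract, 000 = AND, 001 = OR, 111 = set less than). The `alu` function is a sketch operating on 32-bit unsigned values, not a hardware description.

```python
ALU_CONTROL = {"add": 0b010, "sub": 0b110, "and": 0b000,
               "or": 0b001, "slt": 0b111}

def alu(a, b, control):
    """Model of the ALU operations the R-type instructions need.

    a and b are 32-bit values in unsigned representation."""
    if control == 0b010:
        return (a + b) & 0xFFFFFFFF
    if control == 0b110:
        return (a - b) & 0xFFFFFFFF
    if control == 0b000:
        return a & b
    if control == 0b001:
        return a | b
    if control == 0b111:
        # Signed compare: flipping the sign bit maps signed order
        # onto unsigned order.
        return 1 if (a ^ 0x80000000) < (b ^ 0x80000000) else 0

assert alu(6, 7, ALU_CONTROL["add"]) == 13
assert alu(6, 7, ALU_CONTROL["and"]) == 6
assert alu(0xFFFFFFFF, 1, ALU_CONTROL["slt"]) == 1   # -1 < 1 (signed)
```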

Figure 7.9 shows the enhanced datapath handling R-type instructions. The register file reads two registers. The ALU performs an operation on these two registers. In Figure 7.8, the ALU always received its SrcB operand from the sign-extended immediate (SignImm). Now, we add a multiplexer to choose SrcB from either the register file RD2 port or SignImm.


Figure 7.9. Datapath enhancements for R-type instruction

The multiplexer is controlled by a new signal, ALUSrc. ALUSrc is 0 for R-type instructions to choose SrcB from the register file; it is 1 for lw and sw to choose SignImm. This principle of enhancing the datapath's capabilities by adding a multiplexer to choose inputs from several possibilities is extremely useful. Indeed, we will apply it twice more to complete the handling of R-type instructions.

In Figure 7.8, the register file always got its write data from the data memory. However, R-type instructions write the ALUResult to the register file. Therefore, we add another multiplexer to choose between ReadData and ALUResult. We call its output Result. This multiplexer is controlled by another new signal, MemtoReg. MemtoReg is 0 for R-type instructions to choose Result from the ALUResult; it is 1 for lw to choose ReadData. We don't care about the value of MemtoReg for sw, because sw does not write to the register file.

Similarly, in Figure 7.8, the register to write was specified by the rt field of the instruction, Instr20:16. However, for R-type instructions, the register is specified by the rd field, Instr15:11. Thus, we add a third multiplexer to choose WriteReg from the appropriate field of the instruction. The multiplexer is controlled by RegDst. RegDst is 1 for R-type instructions to choose WriteReg from the rd field, Instr15:11; it is 0 for lw to choose the rt field, Instr20:16. We don't care about the value of RegDst for sw, because sw does not write to the register file.
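All three of these multiplexers share the same two-input structure. The small Python sketch below models their select behavior; `mux2` is an illustrative name, and the example values stand in for bus contents.

```python
def mux2(d0, d1, select):
    """Two-input multiplexer: output d1 when select = 1, else d0."""
    return d1 if select else d0

# ALUSrc chooses SrcB: register file RD2 (0) or SignImm (1)
rd2, sign_imm = 42, 8
assert mux2(rd2, sign_imm, 0) == 42     # R-type
assert mux2(rd2, sign_imm, 1) == 8      # lw / sw

# MemtoReg chooses Result: ALUResult (0) or ReadData (1)
alu_result, read_data = 0x100, 0x200
assert mux2(alu_result, read_data, 0) == 0x100   # R-type
assert mux2(alu_result, read_data, 1) == 0x200   # lw

# RegDst chooses WriteReg: rt field (0) or rd field (1)
rt, rd = 16, 8
assert mux2(rt, rd, 1) == 8             # R-type writes rd
assert mux2(rt, rd, 0) == 16            # lw writes rt
```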

Finally, let us extend the datapath to handle beq. beq compares two registers. If they are equal, it takes the branch by adding the branch offset to the program counter. Recall that the offset is a positive or negative number, stored in the imm field of the instruction, Instr15:0. The offset indicates the number of instructions to branch past. Hence, the immediate must be sign-extended and multiplied by 4 to get the new program counter value: PC′ = PC + 4 + SignImm × 4.
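The branch-target arithmetic can be checked with a short Python sketch; the function name is illustrative.

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a Python integer."""
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc, imm16):
    """PC' for a taken beq: PC + 4 + SignImm * 4 (shift left by 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

# Branch forward past 3 instructions from PC = 0x00400000:
assert branch_target(0x00400000, 3) == 0x00400010
# Branch backward 2 instructions (imm = -2, encoded 0xFFFE):
assert branch_target(0x00400000, 0xFFFE) == 0x003FFFFC
```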

Figure 7.10 shows the datapath modifications. The next PC value for a taken branch, PCBranch, is computed by shifting SignImm left by 2 bits, then adding it to PCPlus4. The left shift by 2 is an easy way to multiply by 4, because a shift by a constant amount involves just wires. The two registers are compared by computing SrcA – SrcB using the ALU. If ALUResult is 0, as indicated by the Zero flag from the ALU, the registers are equal. We add a multiplexer to choose PC′ from either PCPlus4 or PCBranch. PCBranch is selected if the instruction is a branch and the Zero flag is asserted. Hence, Branch is 1 for beq and 0 for other instructions. For beq, ALUControl = 110, so the ALU performs a subtraction. ALUSrc = 0 to choose SrcB from the register file. RegWrite and MemWrite are 0, because a branch does not write to the register file or memory. We don't care about the values of RegDst and MemtoReg, because the register file is not written.
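The PC selection logic reduces to one AND gate feeding a multiplexer. A hedged Python model (the function name is illustrative):

```python
def next_pc(pc, pc_branch, branch, zero):
    """Select PC': PCBranch when Branch AND Zero, else PC + 4."""
    pc_plus4 = (pc + 4) & 0xFFFFFFFF
    return pc_branch if (branch and zero) else pc_plus4

pc, target = 0x00400000, 0x00400100
assert next_pc(pc, target, branch=1, zero=1) == 0x00400100  # beq taken
assert next_pc(pc, target, branch=1, zero=0) == 0x00400004  # beq not taken
assert next_pc(pc, target, branch=0, zero=1) == 0x00400004  # not a branch
```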


Figure 7.10. Datapath enhancements for beq instruction

This completes the design of the single-cycle MIPS processor datapath. We have illustrated not only the design itself, but also the design process in which the state elements are identified and the combinational logic connecting the state elements is systematically added. In the next section, we consider how to compute the control signals that direct the operation of our datapath.
