

# Intro to Modern Programmable Architectures Focus on Single-core CPU

Sorbonne Université – Master SESI – MU5IN60 – Parallel Programming

Adrien Cassagne

September 18, 2023



# Table of Contents 1 Introduction

- ▶ Introduction
- ▶ Programmable Architectures
- ► Single-core CPU Architecture
- ► Single-core CPU Optimizations

# Presentation 1 Introduction

- 6 ECTS, 11 sessions of 4 hours:
  - 6 sessions of classroom lectures  $(6 \times 2h)$
  - 6 sessions of hands-on works  $(6 \times 2h)$
  - 4 sessions of a **project**  $(4 \times 4h)$
  - 1 session for the **presentations**  $(1 \times 4h)$
- Main objectives:
  - Understand modern general purpose processor architectures (CPU & GPU)
  - **Program efficiently** the CPUs and GPUs
  - Learn the theory to upper-bound the expected performance benefit

- Take away:
  - 1. CPU programming
    - $\circ$  Scalar optimizations
    - o SIMD instructions
    - Multi-threading
  - 2. GPU programming
    - o Concepts
    - o OpenCL language
  - 3. Performance analysis
    - o Speedup
    - Efficiency
    - o Roofline model
- Sessions in **English**

# Prerequisites 1 Introduction

- Strong basis in C programming
  - Pointers
  - Structures
  - Compilation and Makefile
  - ...
- Few basis in C++ programming
  - Standard Template Library
  - Ex.: std::vector<T>
  - Templating over the data types

- Some knowledge about computers architecture
  - Central Unit Processor (CPU)
  - RAM, Caches, Bus
  - Main principles of assembly prog.
- Some knowledge about Linux OS
  - Command line
  - Memory pagination and space
  - Unix processes and threads



### Target Parallel & Embedded Architecture



- **NVIDIA Jetson TX2** development kit
  - CPU: 6 cores— GPU: 256 cores
  - RAM: 8 GB shared between CPU & GPU
  - Embedded in cars, satellites, video game consoles (Switch), medical equipment, ...
- 15 available boards (max. 2 students per board)
- Hands-on and project sessions on the same target!



## Application Framework 1 Introduction



- Hands-on sessions with EasyPAP framework
- C and C++ languages
- Visualization of the results
- Monitoring of the performance
- Focus on what really matters





- Surprise!
- Choose ProgPar to find out more...



### Mysterious Project



- Surprise!
- Choose ProgPar to find out more...
- Oh but you already chose ProgPar, be patient you will see later ;-)



#### Table of Contents

- ▶ Introduction
- ➤ Programmable Architectures
- ► Single-core CPU Architecture
- ► Single-core CPU Optimizations



## $rac{ ext{Simplified CPU}}{ ext{2 Programmable Architectures}}$ architecture - Part 1

- Compute units and memory
  - Memory is used to load and store data
  - Compute units are used to transform data





## $rac{ ext{Simplified CPU}}{ ext{2 Programmable Architectures}}$ architecture - Part 2

- More precisely, in a CPU there are:
  - Control units (if, goto, instructions placement, ...)
  - Logic units, LU (==, ! =, >, <, ...)
  - Arithmetic units, AU (+, \*, -, /, ...)





# CPU Architecture: Cycle, Freq. and Throughput

- A tick of the clock = a cycle
- In each cycle, the CPU performs an elementary task
- Frequency = number of cycles per second (in Hertz)



# CPU Architecture: Cycle, Freq. and Throughput

- A tick of the clock = a cycle
- In each cycle, the CPU performs an elementary task
- Frequency = number of cycles per second (in Hertz)
- Modern CPU/RAM frequency: between 1 GHz and 4 GHz (1 GHz =  $10^9$  Hz)
- Memory throughput (RAM)  $\approx 50 \text{ GB/s} \text{ (DDR5)}$
- CPU capable of consuming/producing data at  $\approx 5$  TB/s (i9-9900K)
  - The CPU is much faster ( $\times 100$ )!
  - How can we solve this problem?



#### CPU Architecture: Memory Hierarchy – Part 1

2 Programmable Architectures

Observation: most applications often reuse the same data



## CPU Architecture: Memory Hierarchy – Part 1 <sup>2</sup> Programmable Architectures

#### Observation: most applications often reuse the same data

- Faster memory between CPU and RAM = Cache memory
  - Faster memory is more expensive and takes up physical space
  - This memory (= cache) is much smaller than RAM
  - 3 levels of cache (in the processor):
    - $\circ$  L1, the fastest and smallest (32 KB), access time  $\approx$  2 cycles
    - $\circ$  L2, slower than L1 but larger (1 MB), access time  $\approx 10$  cycles
    - $\circ$  L3, slower than L2 but larger (4 MB), access time  $\approx 30$  cycles
  - RAM access latency  $\approx 100$  cycles



# CPU Architecture: Memory Hierarchy – Part 2 Programmable Architectures

• Scenario: first load of a data from memory



Loading data from RAM:  $\approx 100$  cycles



# CPU Architecture: Memory Hierarchy – Part 2 2 Programmable Architectures

• Scenario: first load of a data from memory



Loading data from RAM:  $\approx 100$  cycles



# CPU Architecture: Memory Hierarchy – Part 3 <sup>2</sup> Programmable Architectures

• Scenario: load the same data (= second load) from memory



Loading data from L1 cache: 1-2 cycles



# CPU Architecture: Memory Hierarchy – Part 3 <sup>2</sup> Programmable Architectures

• Scenario: load the same data (= second load) from memory



Loading data from L1 cache: 1-2 cycles

• This is the principle of **temporal locality** 



#### CPU Architecture: Memory Hierarchy – Part 4

 ${\small 2\ \ Programmable\ Architectures}$ 

- Scenario: *store* data to memory
- Two possible modes:
  - Write-through
    - o "Possibly simpler" to implement in hardware
    - Write to both RAM and cache
    - o Choice made in some embedded architectures
  - Write-back
    - o "More complicated" to implement in hardware (cache coherence protocol)
    - o Write to cache and write to RAM only if data is invalidated from smaller/faster cache
    - $\circ$  Writing to RAM less often = consuming less energy
    - The choice made by the majority of architectures (computer, HPC, embedded)



#### CPU Architecture: Spatial Locality

- One cache line = a certain amount of data (e.g. 128–512 bits)
- The smallest packet of data moving between RAM and CPU is a cache line
- It's more interesting to access data contiguously!
- Otherwise, part of the data coming from RAM will be unused (= loss of bandwidth).



#### **CPU Architecture: Spatial Locality**

- One cache line = a certain amount of data (e.g. 128–512 bits)
- The smallest packet of data moving between RAM and CPU is a cache line
- It's more interesting to access data contiguously!
- Otherwise, part of the data coming from RAM will be unused (= loss of bandwidth).
- Using data in the same cache line is called **spatial locality!**



#### CPU Architecture: SIMD instructions

2 Programmable Architectures

• Scalar instruction: produces data during 1 cycle



• SIMD instruction: produces n data during 1 cycle



• SIMD instructions operate on so-called "vector registers".



#### CPU Architecture: Multi-core

- Modern CPUs are now all multi-core
- Generally, L1 and L2 caches are dedicated to a single core.
- Cores share the Last Level Cache (LLC) or L3





- Hardware physically separated from the CPU (or not on today's SoCs...).
- Often connected to the CPU via the PCI-Express bus
- Has its own RAM (global memory) (but not always...)
- Examples of co-processors
  - Graphics Processing Unit (GPU)
  - Field-Programmable Gate Array (FPGA)
  - Many Integrated Cores (MIC)



# GPU Architecture 2 Programmable Architectures

- Originally designed for image processing
- Massively parallel architecture
- Fewer control units than CPUs, but more compute units
- Global memory faster than CPU RAM ( $\approx 500 \text{ GB/s}$ )
- Performance/power consumption ratio generally better than CPUs
- Often suitable for scientific computing
- Hardware acceleration units for AI (tensor cores)



#### GPU Architecture: Nvidia Ampere







#### Supercomputer architecture



- A very simplistic supercomputer
- Interconnecting computers with a star network (switch)
- Theoretical maximum performance is 8 times that of a node



#### Table of Contents

- ▶ Introduction
- ▶ Programmable Architectures
- ► Single-core CPU Architecture
- ► Single-core CPU Optimizations



#### Pipeless Processor

3 Single-core CPU Architecture

- Generally, an instruction can be divided into several sub-steps:
  - 1. Fetch (IF): the instruction is copied from memory
  - 2. Decode (ID): the instruction is interpreted by the CPU
  - 3. Execute (EX): the instruction is executed
- On a processor WITHOUT pipeline, this corresponds to:



• Cycle 1 and cycle 2 take 6 units of time to complete



#### Pipelined Processor – Part 1

- From what we've just seen, it's possible to divide an instruction into 3 sub-instructions  $(\mu op)$
- We assume that each sub-instruction takes the same amount of time
- On a pipelined processor, this corresponds to:



- Cycle 1, cycle 2 and cycle 3 take 5 units of time to complete
- 3-stage pipeline (IF, ID and EX can run in parallel)



#### Pipelined Processor – Part 2

- It takes some time before the pipeline is optimal
  - This is called pipeline latency (2 cycles here)
- If latency is not taken into account, a 3-stage pipeline is 3 times faster
- Yellow: the time point at which the pipeline is optimal





#### Pipelined Processor – Part 3

- Today's processors have pipelines of 10 to 20 stages, but the principle remains the same
- There is a mechanism so that the preceding instruction can directly pass its result to the following instruction (register latch + forwarding)
- Pipeline seems effective, but what happens if you have conditions (e.g. if)?
  - In this case, it's problematic: the processor can't know what the next instruction will be, which is bad to pipeline efficiency (it creates "bubbles").
  - There are branch prediction mechanisms to limit this effect...



#### Super-scalar Processor

- The ability of a processor to execute several instructions in parallel
  - Instruction Level Parallelism (ILP)
  - Processors are generally 3- to 10-wide super-scalar
- This means that a CPU is capable of executing more than one instruction per cycle (3 to 10 instructions)
  - The example on the right shows a processor with an ILP of 2 and a 3-stage pipeline
  - For instance, a processor can compute 1 multiplication and load data from memory in 1 cycle





#### Out-of-Order (OoO) Execution

- Most processors are now capable of executing instructions out of order (different order than the one given by the assembler code)
  - This maximizes the use of CPU units during a cycle (e.g. super-scalar processor).
  - It is difficult to predict the order in which the processor will execute instructions
- Here's a code example to illustrate:

```
int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8;
c = a + b; // 1st instruction to be executed, it depends on nothing
s e = c * d; // 3rd instruction to be executed, it depends on 'c'
4 h = f * g; // 2nd instruction to be executed, it depends on nothing
```



#### Out-of-Order (OoO) Execution

- Most processors are now capable of executing instructions out of order (different order than the one given by the assembler code)
  - This maximizes the use of CPU units during a cycle (e.g. super-scalar processor).
  - It is difficult to predict the order in which the processor will execute instructions
- Here's a code example to illustrate:

```
1 int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8;
2 c = a + b; // 1st instruction to be executed, it depends on nothing
3 e = c * d; // 3rd instruction to be executed, it depends on 'c'
4 h = f * g; // 2nd instruction to be executed, it depends on nothing
```



# Out-of-Order (OoO) Execution

- Most processors are now capable of executing instructions out of order (different order than the one given by the assembler code)
  - This maximizes the use of CPU units during a cycle (e.g. super-scalar processor).
  - It is difficult to predict the order in which the processor will execute instructions
- Here's a code example to illustrate:

```
1 int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8;
2 c = a + b; // 1st instruction to be executed, it depends on nothing
3 e = c * d; // 3rd instruction to be executed, it depends on 'c'
4 h = f * g; // 2nd instruction to be executed, it depends on nothing
```



## Apple Silicon M1 Micro-architecture (2020)





# Intel Alder Lake Micro-architecture (2021)

3 Single-core CPU Architecture Predict I-TI B + I-Cache **Performance** MSROM Decode uop Cache uop Queue x86 Core Allocate / Rename / Move Elimination / Zero Idiom A Step Function in CPU Architecture Performance For the Next Decade of Compute A significant IPC boost at high power efficiency ALU ALU

STA Load

AGU AGU

STA

48KB Data Cache

1.25MB/2MB ML Cache

LEA

Store Data

LEA

Shift

Shift

FADD FADD





# Intel Alder Lake Micro-architecture (2021)





# Intel Alder Lake Micro-architecture (2021)





### What About our New Toy?



- Nvidia Jetson TX2 development kit
  - 2× very powerful Nvidia "Denver 2" cores @ 2.04 GHz
     Specific for performance applications
  - $4\times$  powerful ARM "Cortex A57" (Big) cores @ 2.04 GHz
    - o Generally used in smartphones
  - Weird heterogeneous design
    - $\circ~$  Classic ARM architectures combine A57 (Big) with A53 (LITTLE)
  - No energy efficient cores (LITTLE)



Cortex A57 (2015)

- 18-stage pipeline
- 3-way decoder
- 8-wide super-scalar ( $\mu$ ops)
- Out-of-order execution
- Source: Anandtech





# Nvidia Denver 2 (2016)

- 15-stage pipeline
- 2-way decoder
- 7-wide super-scalar ( $\mu$ ops)
- In-order execution
  - ARM code is translated by a hardware translator
  - Can be considered as out-of-order...
- Sources:
  - wccftech
  - WikiChip
  - Wikipedia





# Nvidia Jetson TX2: Cores and Caches Topology

| Machine (7860MB of RAM)                                |                                    |
|--------------------------------------------------------|------------------------------------|
| Package P#1(4x Cortex A57)                             | Package P#0 (2x Denver 2)          |
| L2 (2048KB)                                            | L2 (2048 KB)                       |
| L1d (32KB) L1d (32KB) L1d (32KB)                       | L1d (64KB) L1d (64KB)              |
| L1i (48KB) L1i (48KB) L1i (48KB)                       | L1i (128KB) L1i (128KB)            |
| Core P#0 PU P#0 Core P#1 PU P#3 Core P#2 PU P#4 PU P#5 | Core P#0  PU P#1  Core P#1  PU P#2 |
| Host: tegrax2c                                         |                                    |
| Indexes: physical                                      |                                    |
| Date: Fri Aug 25 11:54:52 2023                         |                                    |

Topology from hwloc-ls



# Table of Contents

- ▶ Introduction
- ▶ Programmable Architectures
- ► Single-core CPU Architecture
- ▶ Single-core CPU Optimizations

# Working with the Compiler 4 Single-core CPU Optimizations

- Compilers come with a number of options that allow you to optimize your program automatically, thus reducing execution time
- In this course, we'll be using the GNU C/C++ compiler (gcc, g++) or Clang (clang, clang++), but similar options are available in other compilers
- It's important to understand the optimizations that can and cannot be made by the compiler!
  - This maximizes the readability of the source code while maintaining efficiency
  - This allows the compiler to apply some "dirty" optimizations on our behalf when generating the binary code



## Compiler: Optimization Options

- The most famous are (-0[level]):
  - -00 no optimization
  - -01 enables a series of optimizations to reduce binary size and execution time, while keeping compilation time relatively low
  - -02 enables all possible optimizations (except those requiring a compromise between efficiency and binary size), this option requires a longer compilation time than -01
  - -03 optimizes even further, enables the following options: -finline-functions,
    -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload,
    -ftree-loop-vectorize, -ftree-loop-distribute-patterns,
    -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and
    -fipa-cp-clone
- Source: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html



# Compilateur : quelques options spécifiques 4 Single-core CPU Optimizations

- -finline-functions: enables automatic inlining, the compiler can choose whether or not to inline (this option is not a guarantee)
- -ftree-vectorize: enables automatic code vectorization
- -ffast-math: does not take IEEE 754 specifications into account in float calculations (risk of loss of precision = risk of bugs)
- -funroll-loops: unrolls loops whose bounds are known at compile time.
   This option makes the binary code larger, and does not necessarily improve execution time
- -march=native: enables instructions specific to the micro-architecture on which the compiler is running, often necessary for code vectorization



## Instruction Latency and Throughput

- Here are the costs of the main arithmetic instructions on the *Intel Skylake* micro-architecture
  - add: latency of 4 cycles, throughput of 0.5 cycles per instruction (CPI)
  - sub: latency of 4 cycles, throughput of 0.5 cycles per instruction (CPI)
  - mul: latency of 4 cycles, throughput of 0.5 cycles per instruction (CPI)
  - div: latency of 14 cycles, throughput of 4 cycles per instruction (CPI)
  - Source: Intel Intrinsics Guide
- For add, sub and mul, the CPU can achieve 2 instructions per cycle!
- As you can see, division is much less efficient than multiplication (throughput 8 times lower)
  - It is therefore interesting to compute the inverse and then multiply by the inverse
  - Be careful, however, as this operation leads to a loss of precision in the computations



# Division Example

4 Single-core CPU Optimizations

```
int div1(const float *A, float *B, const int n) {
  for (int i = 0; i < n; i++)
   B[i] = A[i] / 3.f;
4 }</pre>
```

• Number of theoretical cycles:  $n \times 4$ 

```
int div2(const float *A, float *B, const int n) {
  float inv3 = 1.f / 3.f;
  for (int i = 0; i < n; i++)
        B[i] = A[i] * inv3;
}</pre>
```

• Number of theoretical cycles:  $4 + n \times 0.5$ 



## Division Example – Assembly Code

4 Single-core CPU Optimizations

• Compiler: Clang 10.0.0 - Options: -01 -march=armv8-a+nosimd

```
1 div1:
         cmp w2, #1
        b.lt .LBB0 3
        mov w8. w2
        fmov s0, #3.00000000
  .LBBO_2: // =>This Inner Loop Header: D=1
         ldr s1. [x0], #4
         subs x8, x8, #1
  fdiv
               s1, s1, s0
         str s1. [x1], #4
10
11
         b.ne .LBBO 2
12 .LBB0_3:
13
         ret
```

```
1 div2:
         cmp w2, #1
        b.lt .LBB0 3
         mov w9, #43691
         mov w8, w2
         movk w9, #16042, lsl #16
7 .LBBO 2: // =>This Inner Loop Header: D=1
         ldr s0. [x0]. #4
8
         fmov
               s1. w9
         subs
               x8. x8. #1
         fmul s0, s0, s1
11
         str s0, [x1], #4
12
13
         b.ne
               .LBBO 2
14 .LBB0_3:
15
         ret
```



## Special Functions

- Here are the costs of some common mathematical functions (*Intel Skylake*):
  - sqrt: latency of 12 cycles, throughput of 3 CPI
  - rsqrt: latency of 4 cycles, throughput of 1 CPI
  - pow, cos, sin, tan: very expensive, depends on software implementation, no dedicated hardware unit
  - Source: Intel Intrinsics Guide
- rsqrt has a throughput of one instruction per cycle!
  - Surprisingly, this function is implemented in hardware
  - Accuracy is generally lower than for other instructions
  - Widely used for 2D/3D distance calculations
- pow, cos, sin and tan are very expensive, we must try to limit their use in our codes
  - Approximate functions are often available at lower cost



- A function call comes with an additional cost (passing parameters through registers and/or the stack, jump instructions)
- Does this mean we shouldn't make function calls?
  - Sometimes it's better to avoid
  - It depends on where the function call is located

```
void stencil_1d_core(const float *A, float *B, const int i) {
    B[i] = A[i - 1] + A[i ] + A[i + 1];
}

void stencil_1d(const float *A, float *B, const int size) {
    for (int i = 1; i < size -1; i++)
    stencil_1d_core(A, B, i); // ici on appelle la fonction très souvent
}</pre>
```



- Inlining a function means replacing the function call with the function code itself
  - This eliminates the extra cost of calling the function
- You can do it manually, but it's not a good idea...
- It's a better idea to have the compiler do it!
  - Use of the keyword inline in the C/C++ language
  - Compiler optimization option (-finline-function)
  - In the code before the function declaration: \_\_attribute\_\_((inline))

```
stencil 1d core:
                     x8, x2, #2, #32
             sbfiz
 3
             add
             1dp
                     s0, s1, [x9, #-4]
                     s2. [x9, #4]
             ldr
             fadd
                     s0, s0, s1
             fadd
                     s0, s0, s2
                     s0, [x1, x8]
             str
             ret
    stencil 1d:
11
                     x29, x30, [sp, #-48]!
             stp
12
                     x22, x21, [sp, #16]
             stp
13
                     x20, x19, [sp, #32]
             stp
14
             mov
15
             CMD
                     w2. #3
16
             b.lt
17
                     w19. w2
             mov
18
             mov
19
                     x21, x0
             mov
20
                     w22, #2
             mov
21
     .LBB1 2: // =>This Inner Loop Header: D=1
22
             sub
                     w2. w22. #1
23
             mov
                     x0. x21
24
             mov
25
             bl
26
                     w22, w22, #1
             add
27
                     w19. w22
             CMD
28
             b.ne
    .LBB1 3:
30
                     x20, x19, [sp, #32]
             ldp
31
             ldp
                     x22, x21, [sp, #16]
32
                     x29, x30, [sp], #48
             ldp
33
             ret
```

```
stencil 1d core:
                    x8, x2, #2, #32
             sbfiz
             add
                    s0, s1, [x9, #-4]
            ldp
                    s2. [x9, #4]
            ldr
            fadd
                     s0. s0. s1
            fadd
                    s0, s0, s2
                    s0, [x1, x8]
             str
            ret
    stencil 1d:
11
                    w2. #3
             cmp
12
             b.lt
13
             sub
                    w10, w2, #1
14
             add
                    x8, x0, #8
15
             add
                    x9, x1, #4
16
             sub
                    x10, x10, #1
    .LBB1 2: // =>This Inner Loop Header: D=1
18
             1dp
                    s0, s1, [x8, #-8]
19
            ldr
                    s2, [x8], #4
20
             subs
                    x10 x10 #1
21
            fadd
22
            fadd
                    s0. s0. s2
23
                    s0. [x9], #4
             str
24
             b.ne
25
    .LBB1 3:
            ret
```

• Compiler: Clang 10.0.0 - Options:

-02 -march=armv8-a+nosimd



- Loop unwinding consists in increasing the loop pitch and adapting the loop body to the loop pitch
- Can sometimes be performed by the compiler (but not always...)
- Several benefits
  - Reduces time spent in loop control
  - Reduces the risk of branch prediction error
  - Increases optimization opportunities, potentially exposes more parallelism for ILP, masks instruction latency
- Some drawbacks
  - Reduces code readability and increases the risk of bugs (not good for maintainability)
  - An epilogue (code after the loop) is often required



# $\begin{array}{c} \textbf{Loop Unrolling} - \textbf{Example} \\ \text{4 Single-core CPU Optimizations} \end{array}$

#### Initial code without loop unrolling

#### Code with 2<sup>nd</sup>-order loop unrolling



# Loop Unrolling – Example 4 Single-core CPU Optimizations

#### Initial code without loop unrolling

#### Code with 2<sup>nd</sup>-order loop unrolling

• If we assume that n value is 3, then the code with unrolling is wrong!



# Loop Unrolling – Example 4 Single-core CPU Optimizations

#### Initial code without loop unrolling

#### Code with 2<sup>nd</sup>-order loop unrolling

- If we assume that n value is 3, then the code with unrolling is wrong!
- We nee to add an epilogue L10-11





# Loop Unrolling 4 Single-core CPU Optimizations

#### Initial code without loop unrolling

• Compiler: Clang 10.0.0 - Options:

-03 -march=armv8-a+nosimd -funroll-loops

```
1 basic loop1:
                    w4, #1
            cmp
            b.lt
                    w4, #1
            CMD
            b.ne
                    x8. xzr
            mov
    .LBB0 3:
                    w9. w4. #0x1
            and
                    x8. xzr
            mov
            add
                    x10, x0, #4
            add
                    x11, x1, #4
13
            add
                    x12, x2, #4
                    w13, w4, w9
            sub
            add
                    x14, x3, #4
    .LBBO_4: // =>This Inner Loop Header: Depth=1
17
            ldur
                    s0. [x10. #-4]
            ldur
                    s1, [x11, #-4]
            ldur
                    s2. [x12. #-4]
20
            add
                    x8. x8. #2
                    w13. w8
            cmp
            fadd
                    s0, s0, s1
            fadd
                    s0, s0, s2
24
                    s0. [x14. #-4]
            stur
            ldr
                    s0. [x10], #8
            ldr
                    s1, [x11], #8
            fadd
                    s0. s0. s1
            ldr
                    s1. [x12], #8
29
            fadd
                    s0, s0, s1
30
                    s0, [x14], #8
            str
            b.ne
            cbz
    .LBB0 6:
34
            lsl
                    x8. x8. #2
            ldr
                    s0. [x0. x8]
            ldr
                    s1, [x1, x8]
```



#### Loop Unrolling 4 Single-core CPU Optimizations

#### Initial code without loop unrolling

```
1 void basic loop1(const float *A.
                  const float *B.
                  const float *C.
                  const int n)
• Compiler: Clarg 10.0.0 - Options:

varch=alm a+nosimd -funrol?
 5
10 }
```

```
1 basic loop1:
                  w4, #1
          cmp
          b.lt
          CMD
                  w4, #1
          b.ne
                  x8. xzr
          mov
                                 bigger!
  .LBB0 3:
                  w9. w4. #0x1
          and
                  x8. xzr
          mov
                  x10, x0, #4
          add
          add
          add
          sub
          add
  .LBB0_4:
                   is Inner Loop Header: Depth=1
                  s0. [x10. #-4]
                  s1, [x11, #-4]
           ldur
                  s2. [x12. #-4]
          add
                  x8. x8. #2
                  w13. w8
          cmp
          fadd
                  s0, s0, s1
          fadd
                  s0, s0, s2
                  s0. [x14. #-4]
          stur
          ldr
                  s0. [x10], #8
          ldr
                  s1, [x11], #8
          fadd
                  s0. s0. s1
          ldr
                  s1. [x12], #8
          fadd
                  s0, s0, s1
          str
                  s0, [x14], #8
          b.ne
          cbz
          lsl
                  x8. x8. #2
          ldr
                  s0. [x0. x8]
          ldr
                  s1, [x1, x8]
```



#### Code with loop unrolling

```
1 for (i = 0; i < n; i += 2) {
2   D[i + 0] = A[i + 0] + B[i + 0] * C[i + 0];
3   D[i + 1] = A[i + 1] + B[i + 1] * C[i + 1];
4 }</pre>
```

#### Code with loop unrolling and jam

```
1 for (i = 0; i < n; i += 2) {
2    d0 = A[i + 0] + B[i + 0];
3    d1 = A[i + 1] + B[i + 1];
4    D[i + 0] = d0 * C[i + 0];
5    D[i + 1] = d1 * C[i + 1];
6 }
```

- Breaks data dependencies
- We can start calculating a part of D[i+1] (in the variable d1) while D[i+0], has not yet been completely computed
- This optimization can sometimes be performed by the compiler
- Requires more registers (or memory)



#### Variables Rotation

4 Single-core CPU Optimizations

#### Sliding sum of 3 points:

```
1 void sum(const float *A, float *B, const int n) {
2  for (int i = 1; i < n; i++) {
3   B[i] = A[i - 1] + A[i + 0] + A[i + 1];
4  }
5 }</pre>
```

Sliding sum of 3 points with 3<sup>rd</sup>-order unrolling:

```
1 void sum_u3(const float *A, float *B, const int n) {
2   for (int i = 1; i < n; i += 3) {
3     B[i + 0] = A[i - 1] + A[i + 0] + A[i + 1];
4     B[i + 1] = A[i + 0] + A[i + 1] + A[i + 2];
5     B[i + 2] = A[i + 1] + A[i + 2] + A[i + 3];
6   }
7   // pas d'épilogue pour simplifier
8 }</pre>
```

Sliding sum of 3 points with 3<sup>rd</sup>-order unrolling and variables rotation:

```
sum_u3:
 2
                     w2, #3
             cmp
 3
             b.lt
                     w10, w2, #1
             sub
 5
             add
                     x8, x0, #8
             add
                     x9. x1. #8
                     x10 w10
             sxtw
 8
             mov
                     w11. #1
     .LBBO_2: // =>This Inner Loop Header: Depth=1
10
             ldp
                     s0. s1. [x8. #-8] // <= 2 loads
11
             ldr
                     s2, [x8]
12
             add
                     x11, x11, #3
13
                     x11. x10
             CMD
14
             fadd
                     s0, s0, s1
15
             fadd
                     s0, s0, s2
16
                     s0. [x9, #-4]
             stur
17
             ldp
                     s0, s1, [x8, #-4] // <= 2 loads
18
             ldr
                     s2. [x8. #4]
19
             fadd
20
             fadd
                     s0, s0, s2
21
             str
                     s0. [x9]
22
                     s0, s1, [x8]
             ldp
23
             fadd
24
             ldr
                     s1. [x8. #8]
25
             add
                     x8. x8. #12
26
             fadd
                     s0, s0, s1
27
                     s0. [x9, #4]
             str
28
             add
                     x9, x9, #12
29
             b.lt
     .LBB0 3:
31
             ret
```

```
1 sum_u3_rot:
                   w2, #2
           cmp
           b.lt
           ldp
                   s1. s0. [x0]
                   w8. w2
           mov
           add
                   x9. x0. #16
           add
                   x10. x1. #8
           mov
                   w11. #1
    .LBBO_2: // =>This Inner Loop Header: Depth=1
10
           ldp
                   s2, s3, [x9, #-8] // <= 2 loads
11
           ldr
                   s4, [x9], #12 // <= 1 load
12
           fadd
                   s1, s0, s1
13
           add
                   x11, x11, #3
14
           fadd
                   s0, s0, s2
15
           fadd
                   s1, s1, s2
16
           fadd
                   s2, s2, s3
17
           stur
                   s1, [x10, #-4]
18
           fadd
                   s0. s0. s3
19
           fadd
                   s1, s2, s4
20
           cmp
                   x11, x8
21
                   s0, s1, [x10], #12
           stp
22
           fmov
                   s1, s3
23
           fmov
                   s0. s4
24
           b.10
   .LBB0 3:
26
           ret
```



#### Variables Rotation & Reduction

- This code compute 6 additions, can we compute less?
  - Yes! Let us use a reduction!

```
woid sum u3 rot_red(const float *A, float *B, const int n)
     float a0 = A[0], a1 = A[1], a2 = A[2], a3 = A[3]:
     // sums s0, s1, s2 = reductions
     float s0 = a0 + a1 + a2:
     float s1 = a1 + a2 + a3:
     for (int i = 1; i < n; i += 3) {
      // only one read into A
       float a4 = A[i + 4]:
       // compute only 2 additions
       float s2 = a2 + a3 + a4:
       B[i + 0] = s0:
       B[i + 1] = s1:
       B[i + 2] = s2;
       // rotation on a0. a1. a2. a3 and a4 variables
       a0 = a1; a1 = a2; a2 = a3; a3 = a4;
       // rotation on the s0, s1, s2 variables
18
       s0 = s1: s1 = s2:
19
20 F
```



#### Rot. & Red – Asm

```
1 void sum u3 rot_red(const float *A, float *B, const int n)
 2 {
     float a0 = A[0], a1 = A[1], a2 = A[2], a3 = A[3];
     // sums s0. s1. s2 = reductions
     float s0 = a0 + a1 + a2:
     float s1 = a1 + a2 + a3;
     for (int i = 1; i < n; i += 3) {
      // only one read into A
 8
      float a4 = A[i + 4]
       // compute only Tad
10
       float s2 = a2 + a a
11
12
       B[i + 1] 80
13
14
15
       rotation on a0, a1, a2, a3 and a4 variables
16
       a0 = a1; a1 = a2; a2 = a3; a3 = a4;
17
       // rotation on the s0, s1, s2 variables
18
       s0 = s1; s1 = s2;
19
20 }
```

```
1 sum u3 rot red:
                   w2. #3
           cmp
           b.lt
           ldp
                   s2, s0, [x0, #4]
                 (s1, [x0, #12]
                   3, [x0], #20
                   w9, w2, #1
                   s4. s2. s0
           mov
           fadd
                   s2. s3. s2
           fadd
           fadd
           sxtw
                   x9. w9
   .LBBO 2: // =>This Inner Loop Header: Depth=1
           add
                   x10, x8, #4
                   x10, x9
           cmp
17
           lsl
                   x10, x8, #2
18
           fadd
                   s4. s1. s0 // <= 1 addition
19
           fmov
                   s1. [x0. x10]
20
           ldr
21
           add
                   x10, x1, x10
22
           stp
                   s2. s3. [x10. #4]
23
           fmov
                   s2. s3 // <= register rotation
24
           fadd
                   s3. s4. s1 // <= 1 addition
25
           add
                   x8. x8. #3
26
                   s3, [x10, #12]
           str
           b.lt
   .LBBO 3:
29
           ret
```



#### Two independent loops:

```
1 for (int i = 0; i < n; i++)
2  D[i] = A[i] + B[i];
3 for (int i = 0; i < n; i++)
4  E[i] = A[i] * C[i];</pre>
```

#### Merging the two loops into one:

```
1 for (int i = 0; i < n; i++) {
2   D[i] = A[i] + B[i];
3   E[i] = A[i] * C[i];
4 }</pre>
```

- This improves data reuse
- In the example, the second reading of A[i] (line 2) will necessarily be in the caches → Temporal locality



- Split a loop in multiple loops
- The reverse operation of loop fusion
- For special reasons (ex.: multi-threading)
- Simplifies or eliminates a dependency by cutting the loop into two parts
- Sometimes pressure on the registers makes it more interesting to have separate loops



# **Conditional Branching Instructions**

- Conditional Branching Instructions (e.g. if, switch, etc) create bubbles in the processor pipeline
- The pipeline cannot operate at full efficiency
- As far as possible, we should therefore avoid these instructions in the hotspot of the code
  - However, if the branch is mispredicted, we need to wait the pipeline latency ( $\approx$  15 cycles)



# $egin{array}{c} ext{Conditional Branching Instructions} - ext{Example} \ ext{4 Single-core CPU Optimizations} \end{array}$

Example of bad code on the left and better code on the right:

```
1 for (int i = 0; i < n; i++) {
2   if (i >= 1 && i < n - 1) {
3     switch (i % 4) {
4     case 0: B[i] = A[i] * 0.3333f;
5     case 1: B[i] = A[i] + 1.3333f;
6     case 2: B[i] = A[i] - 0.7555f;
7     case 3: B[i] = A[i] * 1.1111f;
8     default: break;
9   }
10 }
11 }</pre>
```

```
1 for (int i = 1; i < n - 1; i += 4) {
2   B[i + 0] = A[i + 0] + 1.3333f;
3   B[i + 1] = A[i + 1] - 0.7555f;
4   B[i + 2] = A[i + 2] * 1.1111f;
5   B[i + 3] = A[i + 3] * 0.3333f;
6 }</pre>
```

- if (line 2) can be removed by modifying the start and end of the loop
- switch (line 3) can be avoided by unrolling the loop at the 4<sup>th</sup> order

# Memory Accesses 4 Single-core CPU Optimizations

- When code is limited by memory throughput, you have to be very careful about how we access the data
- Memory bandwidth is a major limiting factor in modern architectures
  - There are mechanisms to reduce this problem: use of *pre-fetching* instructions
  - Memory is accessed by line of words (one word = 32-bit)
  - It's interesting to access data that follow one another (spatial locality)
  - Reduce the number of RAM accesses and favor cache accesses (temporal locality)



4 Single-core CPU Optimizations

```
1 for (int i = 0; i < n; i++) // column
2  for (int j = 0; j < n; j++) // row
3  C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
1 for (int i = 0; i < n; i++) // column
2  for (int j = 0; j < n; j++) // row
3  C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
1 for (int i = 0; i < n; i++) // column
2  for (int j = 0; j < n; j++) // row
3  C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
1 for (int i = 0; i < n; i++) // column
2  for (int j = 0; j < n; j++) // row
3  C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
1 for (int i = 0; i < n; i++) // column
2  for (int j = 0; j < n; j++) // row
3  C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
1 for (int i = 0; i < n; i++) // column
2  for (int j = 0; j < n; j++) // row
3  C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
1 for (int i = 0; i < n; i++) // column
2  for (int j = 0; j < n; j++) // row
3  C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations



- In this implementation, data accesses are not contiguous in memory
- There is a stride of 4 words between each access (not good for spatial locality)



4 Single-core CPU Optimizations

```
for (int j = 0; j < n; j++) // row
for (int i = 0; i < n; i++) // column
C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
for (int j = 0; j < n; j++) // row
for (int i = 0; i < n; i++) // column
C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
for (int j = 0; j < n; j++) // row
for (int i = 0; i < n; i++) // column
C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
for (int j = 0; j < n; j++) // row
for (int i = 0; i < n; i++) // column
C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
for (int j = 0; j < n; j++) // row
for (int i = 0; i < n; i++) // column
C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
for (int j = 0; j < n; j++) // row
for (int i = 0; i < n; i++) // column
C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```





4 Single-core CPU Optimizations

```
for (int j = 0; j < n; j++) // row
for (int i = 0; i < n; i++) // column
C[j * n + i] = A[j * n + i] + B[j * n + i];</pre>
```







Logical and hardware view of memory accesses

- In this implementation, accesses are contiguous in memory
  - Cache lines are fully used
  - Memory throughput is maximized
- i-loop and j-loop have simply been switched



- In many cases, data can be reused (spatial locality)
- Let us take the example of a *stencil* code operating on a 2D grid





Logical view of the 2D grid





Logical view of the 2D grid





Logical view of the 2D grid





Logical view of the 2D grid





Logical 2D grid memory view

- For each increment of i there are 3 new RAM accesses and 2 cache accesses (we neglect cache lines)
- Can we reduce the number of RAM accesses?
  - Yes, using a so-called *cache blocking* technique (also called *tiling* in the literature)
  - The idea is to modify the data path to maximize reuses





Logical 2D grid memory view





Logical 2D grid memory view





Logical 2D grid memory view





Logical 2D grid memory view





Logical 2D grid memory view



#### Cache Blocking - Part 2



Logical 2D grid memory view

- Cache blocking reduces the number of RAM accesses
  - On average, only one RAM access per point remains!
  - We have divided the grid into several blocks (here 2), vertical blocks



### Cache Blocking – Blocks Size 4 Single-core CPU Optimizations

- The block size depends on the problem and on the CPU architecture
- For the previous stencil code, the size of a block can be defined as follows:

$$blockSize = \frac{sizeOfCache}{2 \times 3 \times sizeOfData},$$

with sizeOfCache the size of the L3 cache in bytes and sizeOfData the size of the data (single precision = 4 bytes, double precision = 8 bytes).

- We divide by 2 because caches generally work better when half is used (grandma's recipe...)
- We divide by 3 because we need to keep 3 lines cached in our stencil
- Note that if  $blockSize \geq cols$  then the cache blocking technique is useless



#### Cache Blocking – Implementation





### Q&A

Thank you for listening!
Do you have any questions?