GPU Programming &
Shader Authoring



GPUs excel at executing thousands of threads simultaneously, making them ideal for problems that can be decomposed into many independent calculations. Shaders are small programs executed on different stages of the graphics pipeline. Shaders can be extended to perform general purpose computation, allowing for CPU and GPU parallelism. This composition will cover the basics of GPU programming and shader authoring.

The rendering pipeline is a sequence of stages that transform 3D scene data into a 2D image. Fixed-function stages perform specific, predefined tasks, while programmable stages (shaders) allow developers to write custom code for vertex processing, fragment processing, and general-purpose computing.

Vertex Shader
This programmable stage processes each vertex of the input geometry. Vertex shaders can transform vertex positions, calculate per-vertex lighting, and perform other vertex-level operations. The output of this stage is a transformed vertex position and any additional vertex attributes.
Tessellation (optional)
If enabled, this stage subdivides input primitives into smaller primitives, allowing for the creation of more detailed geometry. The tessellation stage consists of two programmable shaders: the Tessellation Control Shader (TCS) and the Tessellation Evaluation Shader (TES).
Geometry Shader (optional)
This programmable stage operates on primitives (triangles, lines, points) and can modify, create, or discard them. Geometry shaders are useful for techniques like particle systems, fur rendering, and shadow volume generation, however they are less commonly used due to performance considerations. They should only be considered if a feature cannot be implemented efficiently in other stages.
Primitive Assembly
In this fixed-function stage, vertices are grouped into primitives (points, lines, or triangles) based on the primitive topology specified by the application.
This fixed-function stage converts primitives into fragments (pixel candidates). It determines which pixels are covered by each primitive and interpolates vertex attributes across the primitive's surface.
Fragment Shader
Also known as the pixel shader, this programmable stage processes each fragment generated by the rasterizer. Fragment shaders calculate the final color and depth of each pixel, apply textures, perform per-pixel lighting, and can even discard fragments.
Output Merger
In this final fixed-function stage, the pipeline combines the shaded fragments with the existing data in the color and depth buffers. It performs depth testing, stencil testing, and blending operations to determine the final pixel color.

The CPU plays a role in the rendering process by preparing and feeding data to the GPU. The CPU is responsible for tasks such as scene graph management, animation updates, physics simulations, and issuing draw calls to the GPU. Efficient CPU-GPU communication and minimizing driver overhead are essential for maximizing throughput and efficiency.

A CPU is a Single Instruction, Single Data (SISD) processor, executing one instruction on one data element per clock cycle. In contrast, a GPU is a Single Instruction, Multiple Data (SIMD) processor, capable of applying the same instruction to multiple data elements concurrently. SIMD instructions load multiple values into a special SIMD register and perform operations on all of them simultaneously, enabling parallel processing. GPUs, with their thousands of cores, are optimized for SIMD operations and can process vast numbers of threads in parallel.

use std::arch::aarch64::*; // Specific to ARM processors
use std::time::Instant;

const N: usize = 1_000_000_000;

fn main() {
    let mut a = vec![0.0; N];
    let b = vec![1.0; N];
    let c = vec![2.0; N];

    // Warm up the cache
    // With or without this, doesn't seem to make a difference
    for i in 0..N {
        a[i] = b[i] + c[i];

    // SISD
    let start_sisd = Instant::now();
    for i in 0..N {
        a[i] = b[i] + c[i];
    let elapsed_sisd = start_sisd.elapsed();

    // SIMD
    let start_simd = Instant::now();
    unsafe {
        for i in (0..N).step_by(4) {
            // Same operation as SISD but using SIMD
            let b_ptr = b.as_ptr().add(i);
            let c_ptr = c.as_ptr().add(i);
            let a_ptr = a.as_mut_ptr().add(i);

            // First load 4 elements from b and c
            // vld1q_f32 = Vector Load 1 Quadword
            let b_vec = vld1q_f32(b_ptr);
            let c_vec = vld1q_f32(c_ptr);

            // Then add them together
            // vaddq_f32 = Vector Add 1 Quadword
            let a_vec = vaddq_f32(b_vec, c_vec);

            // Finally store the result back to a
            vst1q_f32(a_ptr, a_vec);
    let elapsed_simd = start_simd.elapsed();

    // Tested on M1 Pro - SIMD is 2.53x faster!
    println!("SISD time: {:?}", elapsed_sisd); // 36.742599458s
    println!("SIMD time: {:?}", elapsed_simd); // 14.531914s

SIMD instructions offer significant performance benefits and energy savings. For instance, a SIMD3 instruction can load three values into a register and perform an operation on all of them in a single instruction. This cheap parallelism is valuable for accelerating tasks such as image processing, data analysis, procedural generation, and linear algebra.

CPUs have fixed-size SIMD registers, limiting the number of elements that can be processed in parallel. A CPU with 128-bit SIMD registers can process 4 floats, 2 doubles, 4 ints, or any combination that fits within 128 bits. Compilers often generate SIMD instructions automatically through a process called autovectorization. Developers can facilitate effective vectorization by optimizing code structure and memory layout, as discussed in Memory Alignment and Memory Layout sections.

SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) are SIMD instruction sets for x86 and x86-64 architectures, providing instructions for floating-point and integer operations. ARM processors offer similar functionality through the ARM NEON SIMD instruction set.

SIMD vectors are typically multiples of 2 in size. To create a SIMD vector of size 3, you can use a SIMD4 vector and ignore the last element, or use a SIMD2 vector and a scalar value. Careful structuring of data and algorithms is necessary to fully harness the power of SIMD instructions.

Draw calls are commands issued by the CPU to the GPU to render a set of primitives (points, lines, or triangles). Every draw call incurs an overhead, as the CPU is required to configure the GPU's pipeline state and submit work to be executed, per draw call, per frame. Minimizing the number of draw calls is crucial for optimizing rendering performance.

In modern graphics APIs like Direct3D 12, Vulkan, and Metal, a typical draw call workflow involves:

  1. Setting up the rendering pipeline state (shaders, blend state, depth/stencil state, etc.)
  2. Binding resources (buffers, textures) to the pipeline
  3. Specifying the primitive topology (point list, line list, triangle list, etc.)
  4. Issuing the draw command with parameters like vertex count and instance count

Each change to the pipeline state or bound resources requires a new draw call. Techniques to reduce draw calls include:

// Pseudocode for a draw call using Direct3D 12

// Set pipeline state

// Bind vertex and index buffers  

// Bind shader resources
SetGraphicsRootDescriptorTable(rootParam, descriptorTable);

// Issue draw call
DrawIndexedInstanced(indexCount, instanceCount, startIndex, baseVertex, startInstance);

Careful management of draw calls and GPU resources is essential for achieving high-performance rendering. Techniques like GPU-driven rendering and bindless architectures aim to further reduce CPU overhead and enable more efficient usage of GPU resources.

GPUs have a complex memory hierarchy designed to balance high bandwidth, low latency, and power efficiency. Understanding and optimizing for this hierarchy is crucial for achieving optimal performance in GPU-accelerated applications. The memory hierarchy typically consists of the following levels:

Registers: The fastest and most efficient memory, registers are private to each thread and are used for storing frequently accessed variables and intermediate results.

L1 Cache: A small, fast cache that is private to each Streaming Multiprocessor (SM) or Compute Unit (CU). L1 cache is used to store frequently accessed data and can significantly reduce memory latency.

Shared Memory: A low-latency, high-bandwidth memory shared among threads within the same workgroup (CUDA block or OpenCL workgroup). Shared memory is useful for inter-thread communication and can be used as a programmable cache.

L2 Cache: A larger, slower cache that is shared among all SMs or CUs. L2 cache is used to cache accesses to global memory and can reduce the impact of high-latency memory operations.

Global Memory: The largest and slowest memory, global memory is accessible by all threads and is used to store input, output, and intermediate data that does not fit in faster memory levels.

GPU Memory Hierarchy

To optimize GPU memory usage and performance, consider the following techniques:

Coalesced Memory Access: Organize data in global memory to enable coalesced memory accesses. When threads in a warp (CUDA) or wavefront (OpenCL) access contiguous memory locations, the GPU can combine these accesses into a single memory transaction, improving bandwidth utilization.

Shared Memory Utilization: Leverage shared memory to store frequently accessed data or to perform inter-thread communication. By preloading data into shared memory and reusing it multiple times, you can reduce global memory accesses and improve performance.

Data Locality: Exploit spatial and temporal locality by accessing data that is close together in memory and by reusing data as much as possible. This helps to maximize cache utilization and minimize cache misses.

Texture Memory: For read-only data with spatial locality, consider using texture memory. Texture memory is cached and optimized for 2D spatial access patterns, making it efficient for operations like image processing and sampling.

Constant Memory: For small, read-only data that is accessed uniformly by all threads, use constant memory. Constant memory is cached and can provide high-bandwidth access for read-only data.

Memory Access Patterns: Minimize divergent memory accesses within a warp or wavefront. When threads in a warp access different memory locations, the GPU serializes these accesses, reducing performance. Strive for uniform memory access patterns to maximize memory bandwidth utilization.

Optimizing GPU memory usage requires careful consideration of data layout, access patterns, and the utilization of different memory types. By aligning data structures, using memory padding to ensure coalesced accesses, and leveraging faster memory types like shared memory and caches, you can significantly improve GPU performance.

Tools like NVIDIA Nsight and AMD Radeon GPU Profiler can help identify memory bottlenecks and optimize memory usage. These tools provide insights into memory bandwidth utilization, cache hit rates, and memory access patterns, enabling developers to make informed optimization decisions.

Remember that optimal memory usage patterns may vary depending on the specific GPU architecture and the nature of the workload. It's essential to profile and benchmark your code on target hardware to identify and address memory-related performance issues.

By understanding and optimizing for the GPU memory hierarchy, developers can unlock the full potential of GPU acceleration and achieve significant performance improvements in their applications.

Vertex shaders perform basic processing of each individual vertex. Vertex shaders receive the attribute inputs from the vertex rendering and converts each incoming vertex into a single outgoing vertex based on an arbitrary, user-defined program.

Textures are a fundamental aspect of graphical rendering, providing the visual details necessary to bring realism and depth to digital scenes. In GPU programming, textures are not just images but a versatile tool for storing a variety of data types that shaders can sample during rendering.

Texture Types: The most common texture types are 2D textures, used for standard image data; 3D textures, which contain volumetric data; cube maps for environment mapping; and texture arrays which are sequences of textures of the same size and format, treated as a single unit.

Texture Sampling: Shaders access texture data through a process called sampling, which interpolates the texture data based on texture coordinates provided per vertex or generated procedurally. This process involves various filtering methods to determine how texture pixels (texels) are blended when they map to a smaller or larger area on the screen than their native resolution.

Filtering Techniques: The most common filtering techniques include:

Mipmaps: Mipmapping is a technique to enhance texture rendering quality and performance. It involves creating multiple scaled-down versions of a texture (mipmaps), each typically half the resolution of the previous. GPUs use mipmaps to efficiently render textures based on their distance from the viewer, minimizing artifacts like moiré patterns.

Usage in Shaders: In shader programs, textures can be used for a variety of tasks beyond simple surface decoration. They can store displacement maps to modify vertex positions, normal maps to alter surface normals, specular maps to control reflectivity, and more. Each of these applications can dramatically enhance the realism of a scene by adding depth, detail, and dynamic lighting effects.

GPU Memory Considerations: Textures can consume a significant amount of GPU memory, and effective management is crucial. Techniques like compression, which reduces the memory footprint of textures at the cost of some quality, are commonly used. Formats like BCn (Block Compression), ASTC (Adaptive Scalable Texture Compression), and ETC (Ericsson Texture Compression) provide different balances of compression rate and quality.

Dynamic Updates: While static textures are common, dynamic textures updated by the CPU or GPU during runtime can be used for effects like video playback, dynamic lighting, or procedural content generation. Care must be taken to manage the synchronization and performance impacts of such updates.

Understanding and utilizing textures effectively requires a mastery of both their implementation in graphics APIs and their conceptual role in rendering. With the right techniques, textures not only increase the visual fidelity of scenes but also open up new possibilities for creative and dynamic content.

Buffers are a fundamental aspect of GPU programming, serving as the main method for storing and manipulating data directly on the GPU. They are essentially contiguous blocks of memory allocated on the GPU, designed to store an array of arbitrary data such as vertices, colors, normals, indices, or any user-defined per-vertex data. Vertex Buffers Vertex buffers are one of the most common types of buffers used in GPU programming. They store the vertex data required for rendering objects. Each vertex can contain multiple attributes such as position, color, texture coordinates, and normals. Efficiently managing vertex buffers is crucial as they directly influence the rendering performance. By properly structuring vertex data, you can minimize the bandwidth used during rendering operations. Index Buffers Index buffers, another vital type of buffer, allow developers to reuse vertices. They hold indices into the vertex buffer, defining how vertices are assembled into primitives (triangles, lines, etc.). Using index buffers can significantly reduce the amount of vertex data passed to the GPU, which is especially beneficial when rendering complex meshes. Uniform Buffers Uniform buffers are used to store uniform data that does not change frequently during the draw calls, such as transformation matrices, lighting information, or shader constants. They are shared across different shaders and draw calls, making it easier to manage shader parameters consistently. Storage Buffers Storage buffers offer a flexible way to read and write data within a shader. Unlike uniform buffers, storage buffers allow for large amounts of data that can be dynamically indexed in shaders. This makes them particularly useful for tasks such as particle systems, complex simulations, or compute shaders where shaders need to perform operations on large datasets. Buffer Usage and Best Practices Buffer Creation: Creating buffers is a resource-intensive operation. It's advised to create buffers at initialization or during low-intensity periods of your application to avoid stutters or frame drops. Data Streaming: For dynamic data that changes often, like transformations in a highly interactive environment, consider using techniques like buffer streaming. OpenGL offers mechanisms like buffer orphaning or double-buffering buffers to efficiently stream data. Memory Alignment: Aligning buffer data to the GPU's memory alignment requirements ensures optimal data access speed. Misaligned data can cause slower memory fetches and lead to a decrease in performance. Buffer Mapping: Instead of frequently updating buffer content with new data, mapping a buffer can provide a direct pointer to its memory. This technique allows for efficient data updates but requires careful synchronization to avoid reading from or writing to the buffer while it's used by the GPU

In GPU programming, both buffers and textures are used to store data that can be accessed by shaders. However, they serve distinct purposes and have different memory access patterns and optimization techniques.

Buffers: Buffers are linear blocks of memory used primarily for storing vertex data, indices, or other structured data. They are well-suited for handling large volumes of data where each element is accessed sequentially or via a known pattern. Buffers are also used for more general-purpose computing tasks, such as storing computational outputs or transfer data between the CPU and GPU.

Buffers provide:

Textures: Textures, on the other hand, are grid-based and optimized for 2D or 3D data access. They are used not only for storing image data but also for any data that benefits from spatial locality in access patterns. Textures are typically accessed through sampling operations, which can interpolate between data points and access data non-linearly.

Textures provide:

Choosing Between Buffers and Textures: The choice between using a buffer or a texture can significantly affect the performance and flexibility of a GPU program.

Understanding the underlying hardware implications of each type of storage is key to optimizing performance and memory usage in GPU-accelerated applications.

Memory alignment is the process of aligning data structures in memory so that their starting address is a multiple of a certain byte size. CPUs and GPUs read and write memory in chunks, determined by the architecture's memory bus width and cache line size. If data is well aligned in a way that fully fills the cache line, it can be read and written in fewer transactions. Otherwise, the processor will have to perform multiple memory transactions to read or write the data. Accessing data sequentially will also make better use of the memory bus width.

typedef struct {
    char a; // 1 byte
    // 3 bytes padding needed here for alignment
    int b;  // 4 bytes
    char c; // 1 byte
    // 3 bytes padding needed here to align
} UnalignedStruct;

// Offset |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  | etc
// -------|-----|-----|-----|-----|-----|-----|-----|-----|-----
// Data   |  a  | pad | pad | pad |  b  |  b  |  b  |  b  |

typedef struct {
    int b;  // 4 bytes
    char a; // 1 byte
    char c; // 1 byte
    // 2 bytes padding added here to make the total size a multiple of 4 bytes
} AlignedStruct;

// Offset |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7
// -------|-----|-----|-----|-----|-----|-----|-----|-----
// Data   |  b  |  b  |  b  |  b  |  a  |  c  | pad | pad


A struct's alignment will be determined by the largest alignment requirement of its members. In the example above, the UnalignedStruct has a size of 12 bytes because of the padding needed for alignment. The AlignedStruct has a size of 8 bytes, which is a multiple of 4 bytes. The memory layout of the AlignedStruct is more efficient because it is aligned to the size of the int, which is 4 bytes. The UnalignedStruct is less efficient because it is not aligned to the size of the int.

Compilers automatically add padding to structures to ensure each member is correctly aligned according to the target architecture's requirements. You can manually control the alignment of your structures by carefully designing the layout of your data and explicitly specifying the alignment of your data structures through compiler-specific attributes or padding bytes.

class student {
    std::string name;
    double average_mark;

double calculate_average_mark(const std::vector<student>& students) {
    double sum = 0.0;
    for (int i = 0; i < students.size(); i++) {
        sum += students[i].average_mark;
    return sum / students.size();

The code consists of a class student with two fields: name and an average_mark. It also consists of a function calculate_average_mark which takes the array of student instances and calculate the average mark of all students. Let’s assume that the sizeof(std::string) is eight, and size(double) is also eight. In a vector of class student, float values containing the average mark are non-consecutive, i.e. they are located at memory offsets 8, 24, 40, etc. Even though it is possible to use SIMD in this example, most compilers will switch to scalar code because doing SIMD on a non-consecutive memory layout is slow.2 When your program is actually processing data that is important to be processed fast, you can rearrange your data in memory so that it can be processed more efficiently. Game developers have been doing it for years in a paradigm called data-oriented design, that which focuses more on data and operations, and less on abstractions and encapsulations. The basic idea revolves around using struct-of-arrays instead of array-of-structs. We define a class called students that has two arrays, one for names and one for average marks

class students {
    std::vector<std::name> names;
    std::vector<double> average_marks;
    double calculate_average_mark() {
        double sum = 0;
        for (int i = 0; i < average_marks.size(); i++) {
            sum += average_marks[i];
        return sum / average_marks.size();

The compiler with the right switches will generate SIMD instructions to calculate the average mark since the doubles containing the average marks are laid out sequentially in memory. "An uber shader is an example of a long and complex shader that can be used to render any possible material. These types of shaders have lots of branches for any possible combination. When artists create material, the material parameters are stored in a Metal buffer, used by the material shader. This buffer gets updated when you change the parameters, but there is no recompilation required. An uber shader would query material parameters from a buffer doing conditional branches at runtime to enable and disable features.Developers sometimes make uber shaders that read material parameters from a buffer and then a material shader chooses different control parts at runtime based on the buffer's contents. This approach lets the shader render a new material effect without recompiling because the only changes are parameters in the buffer."
"The Metal function constants features specializes the shader efficiently and removes the code that isn't reachable at runtime. For example, uber shaders typically benefit from function constants. This responsive approach is great during development, however, the shader has to account for several possibilities and read from additional buffers which may affect an app's performance. Another approach is to specialize the shader at compile time instead of at runtime. By building the shader variance offline with preprocessor macros. This approach means you have to compile all the possible variant combinations offline. For example, a glossy variant could be the combination of enabling both is_glossy and use_shadows macros, by disabling the remaining macros. Even if you compile them offline ahead of time, each variant adds up which can significantly increase the size of your Metal library. It can also increase compile time because each shade of variant has to be compiled starting from Metal source."
"Function constants can provide another way to specialize the shaders. Compared to using macros, it can reduce both compile time and the size of the Metal library. With function constants, you compile an uber shader one time from source to an intermediate Metal function. From that function, you only create the variance your app needs based on function constants you define."