GPU Programming &
Shader Authoring
Overview
GPUs excel at executing thousands of threads simultaneously, making them ideal for problems that can be decomposed into many independent calculations. Shaders are small programs that run at specific stages of the graphics pipeline, and they can also be used for general-purpose computation, allowing work to be split between the CPU and GPU. This section covers the basics of GPU programming and shader authoring.
Rendering Pipeline
The rendering pipeline is a sequence of stages that transform 3D scene data into a 2D image. Fixed-function stages perform specific, predefined tasks, while programmable stages (shaders) allow developers to write custom code for vertex processing, fragment processing, and general-purpose computing.
The CPU plays a role in the rendering process by preparing and feeding data to the GPU. The CPU is responsible for tasks such as scene graph management, animation updates, physics simulations, and issuing draw calls to the GPU. Efficient CPU-GPU communication and minimizing driver overhead are essential for maximizing throughput and efficiency.
SIMD, SISD, and MIMD
A scalar CPU core is a Single Instruction, Single Data (SISD) processor, executing one instruction on one data element at a time. In contrast, a GPU is built around the Single Instruction, Multiple Data (SIMD) model, applying the same instruction to many data elements concurrently. Modern CPUs also provide SIMD extensions: instructions that load multiple values into a special SIMD register and operate on all of them simultaneously. GPUs take this much further and, with their thousands of cores, can process vast numbers of threads in parallel.
```rust
use std::arch::aarch64::*; // Specific to ARM processors
use std::time::Instant;

const N: usize = 1_000_000_000;

fn main() {
    let mut a = vec![0.0; N];
    let b = vec![1.0; N];
    let c = vec![2.0; N];

    // Warm up the cache
    // With or without this, doesn't seem to make a difference
    for i in 0..N {
        a[i] = b[i] + c[i];
    }

    // SISD
    let start_sisd = Instant::now();
    for i in 0..N {
        a[i] = b[i] + c[i];
    }
    let elapsed_sisd = start_sisd.elapsed();

    // SIMD
    let start_simd = Instant::now();
    unsafe {
        for i in (0..N).step_by(4) {
            // Same operation as SISD but using SIMD
            let b_ptr = b.as_ptr().add(i);
            let c_ptr = c.as_ptr().add(i);
            let a_ptr = a.as_mut_ptr().add(i);
            // First load 4 elements from b and c
            // vld1q_f32 = Vector Load 1 Quadword
            let b_vec = vld1q_f32(b_ptr);
            let c_vec = vld1q_f32(c_ptr);
            // Then add them together
            // vaddq_f32 = Vector Add 1 Quadword
            let a_vec = vaddq_f32(b_vec, c_vec);
            // Finally store the result back to a
            vst1q_f32(a_ptr, a_vec);
        }
    }
    let elapsed_simd = start_simd.elapsed();

    // Tested on M1 Pro - SIMD is 2.53x faster!
    println!("SISD time: {:?}", elapsed_sisd); // 36.742599458s
    println!("SIMD time: {:?}", elapsed_simd); // 14.531914s
}
```
SIMD instructions offer significant performance benefits and energy
savings. For instance, a SIMD3 instruction can load three values
into a register and perform an operation on all of them in a single
instruction. This cheap parallelism is valuable for accelerating
tasks such as image processing, data analysis, procedural
generation, and linear algebra.
CPUs have fixed-size SIMD registers, limiting the number of elements
that can be processed in parallel. A CPU with 128-bit SIMD registers
can process 4 floats, 2 doubles, 4 ints, or any combination that
fits within 128 bits. Compilers often generate SIMD instructions
automatically through a process called autovectorization. Developers can facilitate effective vectorization by optimizing
code structure and memory layout, as discussed in Memory Alignment and Memory Layout sections.
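As a sketch of what autovectorization-friendly code looks like, the loop below has the shape compilers typically vectorize well: a flat array, an independent per-element operation, and no aliasing between input and output (function and parameter names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// A loop shape that optimizing compilers (e.g. with -O2/-O3) can
// usually turn into SIMD instructions automatically: each iteration
// is independent and touches contiguous memory.
std::vector<float> scale_and_offset(const std::vector<float>& in,
                                    float scale, float offset) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * scale + offset; // same operation on every element
    }
    return out;
}
```

Whether vectorization actually happens depends on the compiler and flags; tools like `-Rpass=loop-vectorize` (Clang) or `-fopt-info-vec` (GCC) report what was vectorized.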
SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions)
are SIMD instruction sets for x86 and x86-64 architectures,
providing instructions for floating-point and integer operations.
ARM processors offer similar functionality through the ARM NEON SIMD
instruction set.
SIMD vectors typically come in power-of-two sizes. To create a SIMD
vector of size 3, you can use a SIMD4 vector and ignore the last
element, or use a SIMD2 vector plus a scalar value. Careful
structuring of data and algorithms is necessary to fully harness the
power of SIMD instructions.
Draw Calls
Draw calls are commands issued by the CPU to the GPU to render a set
of primitives (points, lines, or triangles). Every draw call incurs
overhead: for each one, each frame, the CPU must configure the GPU's
pipeline state and submit the work to be executed. Minimizing the
number of draw calls is therefore crucial for rendering performance.
In modern graphics APIs like Direct3D 12, Vulkan, and Metal, a
typical draw call workflow involves:
- Setting up the rendering pipeline state (shaders, blend state, depth/stencil state, etc.)
- Binding resources (buffers, textures) to the pipeline
- Specifying the primitive topology (point list, line list, triangle list, etc.)
- Issuing the draw command with parameters like vertex count and instance count
Each change to the pipeline state or bound resources requires a new draw call. Techniques to reduce draw calls include:
- Batching: Grouping similar objects that share the same material or shader into a single draw call.
- Instancing: Rendering multiple copies of the same mesh with different properties (position, rotation, scale) in a single draw call.
- Texture atlasing: Combining multiple textures into a single larger texture to reduce texture binding changes.
- Command lists: Recording a sequence of commands (shader configurations, textures, draw calls) once and executing them multiple times to amortize CPU overhead.
// Pseudocode for a draw call using Direct3D 12
// Set pipeline state
SetPipelineState(pipelineState);
// Bind vertex and index buffers
SetVertexBuffer(vertexBuffer);
SetIndexBuffer(indexBuffer);
// Bind shader resources
SetDescriptorHeaps(descriptorHeaps);
SetGraphicsRootDescriptorTable(rootParam, descriptorTable);
// Issue draw call
DrawIndexedInstanced(indexCount, instanceCount, startIndex, baseVertex, startInstance);
Careful management of draw calls and GPU resources is essential for achieving high-performance rendering. Techniques like GPU-driven rendering and bindless architectures aim to further reduce CPU overhead and enable more efficient usage of GPU resources.
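To make instancing concrete, the CPU side typically fills one buffer with per-instance data and issues a single instanced draw for all copies. A hedged sketch (the `InstanceData` layout is illustrative; the actual structure depends on your shader's input):

```cpp
#include <vector>

// Illustrative per-instance payload: one mesh, many placements,
// a single draw call.
struct InstanceData {
    float position[3];
    float scale;
};

// Build a row of n instances. The resulting array would be uploaded
// to a GPU buffer and drawn with one instanced draw call, e.g.
// DrawIndexedInstanced(indexCount, instances.size(), 0, 0, 0).
std::vector<InstanceData> make_row(int n, float spacing) {
    std::vector<InstanceData> instances;
    instances.reserve(n);
    for (int i = 0; i < n; ++i) {
        instances.push_back({{i * spacing, 0.0f, 0.0f}, 1.0f});
    }
    return instances;
}
```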
GPU Memory Hierarchy Optimization
GPUs have a complex memory hierarchy designed to balance high bandwidth, low latency, and power efficiency. Understanding and optimizing for this hierarchy is crucial for achieving optimal performance in GPU-accelerated applications. The memory hierarchy typically consists of the following levels:
• Registers: The fastest and most efficient memory. Registers are private to each thread and store frequently accessed variables and intermediate results.
• L1 Cache: A small, fast cache private to each Streaming Multiprocessor (SM) or Compute Unit (CU). L1 cache stores frequently accessed data and can significantly reduce memory latency.
• Shared Memory: A low-latency, high-bandwidth memory shared among threads within the same workgroup (CUDA block or OpenCL workgroup). Shared memory is useful for inter-thread communication and can serve as a programmable cache.
• L2 Cache: A larger, slower cache shared among all SMs or CUs. L2 cache caches accesses to global memory and can reduce the impact of high-latency memory operations.
• Global Memory: The largest and slowest memory. Global memory is accessible by all threads and stores input, output, and intermediate data that does not fit in faster memory levels.
![GPU Memory Hierarchy](/gpu_memory_hierarchy.png)
To optimize GPU memory usage and performance, consider the following techniques:
• Coalesced Memory Access: Organize data in global memory to enable coalesced memory accesses. When threads in a warp (NVIDIA) or wavefront (AMD) access contiguous memory locations, the GPU can combine these accesses into a single memory transaction, improving bandwidth utilization.
• Shared Memory Utilization: Leverage shared memory to store frequently accessed data or to perform inter-thread communication. By preloading data into shared memory and reusing it multiple times, you can reduce global memory accesses and improve performance.
• Data Locality: Exploit spatial and temporal locality by accessing data that is close together in memory and by reusing data as much as possible. This helps to maximize cache utilization and minimize cache misses.
• Texture Memory: For read-only data with spatial locality, consider using texture memory. Texture memory is cached and optimized for 2D spatial access patterns, making it efficient for operations like image processing and sampling.
• Constant Memory: For small, read-only data that is accessed uniformly by all threads, use constant memory. Constant memory is cached and can provide high-bandwidth access for read-only data.
• Memory Access Patterns: Minimize divergent memory accesses within a warp or wavefront. When threads in a warp access different memory locations, the GPU serializes these accesses, reducing performance. Strive for uniform memory access patterns to maximize memory bandwidth utilization.
Optimizing GPU memory usage requires careful consideration of data layout, access patterns, and the utilization of different memory types. By aligning data structures, using memory padding to ensure coalesced accesses, and leveraging faster memory types like shared memory and caches, you can significantly improve GPU performance.
Tools like NVIDIA Nsight and AMD Radeon GPU Profiler can help identify memory bottlenecks and optimize memory usage. These tools provide insights into memory bandwidth utilization, cache hit rates, and memory access patterns, enabling developers to make informed optimization decisions.
Remember that optimal memory usage patterns may vary depending on the specific GPU architecture and the nature of the workload. It's essential to profile and benchmark your code on target hardware to identify and address memory-related performance issues.
By understanding and optimizing for the GPU memory hierarchy, developers can unlock the full potential of GPU acceleration and achieve significant performance improvements in their applications.
Vertex Shaders
Vertex shaders perform basic processing of each individual vertex. They receive attribute inputs from the vertex stream and convert each incoming vertex into a single outgoing vertex according to an arbitrary, user-defined program.
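Conceptually, a vertex shader is a pure function from one input vertex to one output vertex, most commonly a transform into clip space. A CPU-side model of that idea (real shaders run on the GPU and are written in HLSL, GLSL, or MSL; the names here are illustrative):

```cpp
#include <array>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>; // row-major 4x4 matrix

// Model of a vertex shader's core job: multiply the vertex position
// by a model-view-projection matrix to produce a clip-space position.
// One vertex in, one vertex out; no vertex can see another's data.
Vec4 vertex_shader(const Mat4& mvp, const Vec4& position) {
    Vec4 out{};
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            out[row] += mvp[row][col] * position[col];
        }
    }
    return out;
}
```

The GPU runs this function for every vertex of a draw call in parallel, which is exactly why the per-vertex independence matters.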
Fragment Shaders
Compute Shaders
Textures
Textures are a fundamental aspect of graphical rendering, providing the visual details necessary to bring realism and depth to digital scenes. In GPU programming, textures are not just images but a versatile tool for storing a variety of data types that shaders can sample during rendering.
Texture Types: The most common texture types are 2D textures, used for standard image data; 3D textures, which contain volumetric data; cube maps for environment mapping; and texture arrays which are sequences of textures of the same size and format, treated as a single unit.
Texture Sampling: Shaders access texture data through a process called sampling, which interpolates the texture data based on texture coordinates provided per vertex or generated procedurally. This process involves various filtering methods to determine how texture pixels (texels) are blended when they map to a smaller or larger area on the screen than their native resolution.
Filtering Techniques: The most common filtering techniques include:
- Nearest-Neighbor: The simplest form of filtering that selects the texel nearest to the specified coordinate. While fast, it can result in blocky images when upscaling.
- Bilinear: A method that takes an average of the four closest texels to the specified coordinate, smoothing transitions but potentially blurring details when viewed up close.
- Trilinear: An extension of bilinear filtering that also interpolates between mipmaps, which are precomputed, downscaled versions of the texture used to improve performance and reduce aliasing at distant or oblique angles.
- Anisotropic Filtering: An advanced technique that varies the sample area shape based on the angle of texture to the view direction, reducing distortion and preserving detail, especially on surfaces seen in oblique angles.
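Bilinear filtering is simple enough to sketch directly. The sample below works in texel space on a single-channel image; edge clamping is omitted for brevity, so callers must stay in range (names are illustrative):

```cpp
#include <cmath>
#include <vector>

// Bilinear sample of a single-channel image stored row-major.
// (u, v) are texel-space coordinates in [0, width-1] x [0, height-1].
float sample_bilinear(const std::vector<float>& img, int width,
                      float u, float v) {
    int x0 = static_cast<int>(std::floor(u));
    int y0 = static_cast<int>(std::floor(v));
    float fx = u - x0; // fractional distance to the next texel column
    float fy = v - y0; // fractional distance to the next texel row
    auto at = [&](int x, int y) { return img[y * width + x]; };
    // Blend the four nearest texels, weighted by distance to (u, v).
    float top    = at(x0, y0)     * (1 - fx) + at(x0 + 1, y0)     * fx;
    float bottom = at(x0, y0 + 1) * (1 - fx) + at(x0 + 1, y0 + 1) * fx;
    return top * (1 - fy) + bottom * fy;
}
```

GPUs implement this blend in fixed-function sampler hardware, which is why sampled texture reads are so much cheaper than doing the same interpolation in shader code.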
Mipmaps: Mipmapping is a technique to enhance texture rendering quality and performance. It involves creating multiple scaled-down versions of a texture (mipmaps), each typically half the resolution of the previous. GPUs use mipmaps to efficiently render textures based on their distance from the viewer, minimizing artifacts like moiré patterns.
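Because each level halves the previous one until reaching 1x1, the length of a full mip chain follows directly from the texture dimensions. A small helper (name is illustrative):

```cpp
#include <algorithm>
#include <cmath>

// Number of levels in a full mip chain: floor(log2(max dimension)) + 1.
// E.g. a 1024x512 texture has levels 1024x512, 512x256, ..., 1x1.
int mip_level_count(int width, int height) {
    return static_cast<int>(std::floor(std::log2(std::max(width, height)))) + 1;
}
```

Note that a full chain adds only about one third more memory on top of the base level, since each level is a quarter the size of the one above it.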
Usage in Shaders: In shader programs, textures can be used for a variety of tasks beyond simple surface decoration. They can store displacement maps to modify vertex positions, normal maps to alter surface normals, specular maps to control reflectivity, and more. Each of these applications can dramatically enhance the realism of a scene by adding depth, detail, and dynamic lighting effects.
GPU Memory Considerations: Textures can consume a significant amount of GPU memory, and effective management is crucial. Techniques like compression, which reduces the memory footprint of textures at the cost of some quality, are commonly used. Formats like BCn (Block Compression), ASTC (Adaptive Scalable Texture Compression), and ETC (Ericsson Texture Compression) provide different balances of compression rate and quality.
Dynamic Updates: While static textures are common, dynamic textures updated by the CPU or GPU during runtime can be used for effects like video playback, dynamic lighting, or procedural content generation. Care must be taken to manage the synchronization and performance impacts of such updates.
Understanding and utilizing textures effectively requires a mastery of both their implementation in graphics APIs and their conceptual role in rendering. With the right techniques, textures not only increase the visual fidelity of scenes but also open up new possibilities for creative and dynamic content.
Buffers
Buffers are a fundamental aspect of GPU programming, serving as the main method for storing and manipulating data directly on the GPU. They are essentially contiguous blocks of memory allocated on the GPU, designed to store an array of arbitrary data such as vertices, colors, normals, indices, or any user-defined per-vertex data.
Vertex Buffers: Vertex buffers are one of the most common types of buffers used in GPU programming. They store the vertex data required for rendering objects. Each vertex can contain multiple attributes such as position, color, texture coordinates, and normals. Efficiently managing vertex buffers is crucial, as they directly influence rendering performance. By properly structuring vertex data, you can minimize the bandwidth used during rendering operations.
Index Buffers: Index buffers, another vital type of buffer, allow developers to reuse vertices. They hold indices into the vertex buffer, defining how vertices are assembled into primitives (triangles, lines, etc.). Using index buffers can significantly reduce the amount of vertex data passed to the GPU, which is especially beneficial when rendering complex meshes.
Uniform Buffers: Uniform buffers store uniform data that does not change frequently between draw calls, such as transformation matrices, lighting information, or shader constants. They are shared across different shaders and draw calls, making it easier to manage shader parameters consistently.
Storage Buffers: Storage buffers offer a flexible way to read and write data within a shader. Unlike uniform buffers, storage buffers allow for large amounts of data that can be dynamically indexed in shaders. This makes them particularly useful for tasks such as particle systems, complex simulations, or compute shaders where shaders need to perform operations on large datasets.
Buffer Usage and Best Practices:
- Buffer Creation: Creating buffers is a resource-intensive operation. It's advised to create buffers at initialization or during low-intensity periods of your application to avoid stutters or frame drops.
- Data Streaming: For dynamic data that changes often, like transformations in a highly interactive environment, consider using techniques like buffer streaming. OpenGL offers mechanisms like buffer orphaning or double-buffered buffers to efficiently stream data.
- Memory Alignment: Aligning buffer data to the GPU's memory alignment requirements ensures optimal data access speed. Misaligned data can cause slower memory fetches and decrease performance.
- Buffer Mapping: Instead of frequently updating buffer content with new data, mapping a buffer can provide a direct pointer to its memory. This technique allows for efficient data updates but requires careful synchronization to avoid reading from or writing to the buffer while it is in use by the GPU.
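Index buffers make vertex reuse concrete: a quad drawn as two triangles needs only four unique vertices when six indices describe how the triangles are assembled. A minimal sketch (the vertex layout is illustrative):

```cpp
#include <array>
#include <cstdint>

struct Vertex { float x, y; };

// Four unique vertices of a unit quad...
const std::array<Vertex, 4> vertices = {{{0, 0}, {1, 0}, {1, 1}, {0, 1}}};

// ...assembled into two triangles by six indices. Without an index
// buffer, the two corners shared between the triangles would have to
// be duplicated, giving six vertices instead of four.
const std::array<std::uint16_t, 6> indices = {0, 1, 2, 2, 3, 0};
```

For large meshes, where most vertices are shared by several triangles, the savings are far bigger than in this toy example.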
Buffer vs Texture
In GPU programming, both buffers and textures are used to store data that can be accessed by shaders. However, they serve distinct purposes and have different memory access patterns and optimization techniques.
Buffers: Buffers are linear blocks of memory used primarily for storing vertex data, indices, or other structured data. They are well-suited for handling large volumes of data where each element is accessed sequentially or via a known pattern. Buffers are also used for more general-purpose computing tasks, such as storing computational outputs or transferring data between the CPU and GPU.
Buffers provide:
- Flexibility: They can hold any form of data from simple arrays to complex structured data.
- Direct Access: Data in buffers can be accessed directly using indices, making them ideal for operations like vertex fetching where the GPU accesses vertices in a non-sequential order based on index buffers.
Textures: Textures, on the other hand, are grid-based and optimized for 2D or 3D data access. They are used not only for storing image data but also for any data that benefits from spatial locality in access patterns. Textures are typically accessed through sampling operations, which can interpolate between data points and access data non-linearly.
Textures provide:
- Sampling: They support hardware-accelerated filtering and mip-mapping, which are crucial for rendering operations where textures are sampled at varying resolutions.
- Specialized Formats: Textures can be stored in formats optimized for specific types of data, such as depth, stencil, or various compressed formats, enhancing memory efficiency and performance.
Choosing Between Buffers and Textures: The choice between using a buffer or a texture can significantly affect the performance and flexibility of a GPU program.
- Use buffers for structured data where access is typically predictable and non-spatial, such as physics calculations or vertex data.
- Use textures when you need to leverage complex access patterns, such as random access or interpolation, and for operations that benefit from hardware-accelerated filtering, like rendering detailed images or handling data with spatial coherence.
Understanding the underlying hardware implications of each type of storage is key to optimizing performance and memory usage in GPU-accelerated applications.
Memory Alignment
Memory alignment is the process of aligning data structures in memory so that their starting address is a multiple of a certain byte size. CPUs and GPUs read and write memory in chunks, determined by the architecture's memory bus width and cache line size. If data is well aligned in a way that fully fills the cache line, it can be read and written in fewer transactions. Otherwise, the processor will have to perform multiple memory transactions to read or write the data. Accessing data sequentially will also make better use of the memory bus width.
typedef struct {
char a; // 1 byte
// 3 bytes padding needed here for alignment
int b; // 4 bytes
char c; // 1 byte
// 3 bytes padding needed here to align
} UnalignedStruct;
// Offset | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
// -------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----
// Data | a | pad | pad | pad | b | b | b | b | c | pad | pad | pad
typedef struct {
int b; // 4 bytes
char a; // 1 byte
char c; // 1 byte
// 2 bytes padding added here to make the total size a multiple of 4 bytes
} AlignedStruct;
// Offset | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
// -------|-----|-----|-----|-----|-----|-----|-----|-----
// Data | b | b | b | b | a | c | pad | pad
A struct's alignment is determined by the largest alignment
requirement of its members. In the example above, UnalignedStruct
has a size of 12 bytes because of the padding needed for alignment,
while AlignedStruct has a size of 8 bytes. The AlignedStruct layout
is more efficient because ordering the members from largest to
smallest eliminates the interior padding that the 4-byte int would
otherwise require.
Compilers automatically add padding to structures to ensure each member
is correctly aligned according to the target architecture's requirements.
You can manually control the alignment of your structures by carefully
designing the layout of your data and explicitly specifying the alignment
of your data structures through compiler-specific attributes or padding
bytes.
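The sizes discussed above can be checked at compile time. The snippet below reproduces the two structs and asserts their layout, assuming a typical ABI where int is 4 bytes and char is 1:

```cpp
// The structs from the example above. Exact sizes assume a common
// ABI (4-byte int, 1-byte char); static_assert makes the compiler
// verify the padding arithmetic for the target platform.
struct UnalignedStruct {
    char a; // 1 byte, followed by 3 bytes of padding
    int b;  // 4 bytes
    char c; // 1 byte, followed by 3 bytes of trailing padding
};

struct AlignedStruct {
    int b;  // 4 bytes
    char a; // 1 byte
    char c; // 1 byte, followed by 2 bytes of trailing padding
};

static_assert(sizeof(UnalignedStruct) == 12, "1 + 3 pad + 4 + 1 + 3 pad");
static_assert(sizeof(AlignedStruct) == 8, "4 + 1 + 1 + 2 pad");
static_assert(alignof(UnalignedStruct) == alignof(int),
              "struct alignment follows the largest member");
```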
Memory Layout
class student {
std::string name;
double average_mark;
};
double calculate_average_mark(const std::vector<student>& students) {
double sum = 0.0;
for (int i = 0; i < students.size(); i++) {
sum += students[i].average_mark;
}
return sum / students.size();
}
The code consists of a class student with two fields, name and average_mark, and a function calculate_average_mark which takes an array of student instances and calculates the average mark of all students. Let's assume that sizeof(std::string) is eight bytes and sizeof(double) is also eight. In a vector of student, the double values containing the average marks are non-consecutive: they are located at memory offsets 8, 24, 40, and so on. Even though it is possible to use SIMD in this example, most compilers will fall back to scalar code because SIMD on a non-consecutive memory layout is slow.
When your program is processing data that must be processed fast, you can rearrange that data in memory so that it can be processed more efficiently. Game developers have been doing this for years in a paradigm called data-oriented design, which focuses more on data and operations, and less on abstractions and encapsulation. The basic idea revolves around using struct-of-arrays instead of array-of-structs. We define a class called students that has two arrays, one for names and one for average marks:
class students {
std::vector<std::string> names;
std::vector<double> average_marks;
double calculate_average_mark() {
double sum = 0;
for (int i = 0; i < average_marks.size(); i++) {
sum += average_marks[i];
}
return sum / average_marks.size();
}
};
The compiler with the right switches will generate SIMD instructions to calculate the average mark since the doubles containing the average marks are laid out sequentially in memory.
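A self-contained version of the struct-of-arrays sketch, with the includes it needs, might look like this (whether the sum loop is actually vectorized depends on the compiler and optimization flags, e.g. -O3):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Struct-of-arrays layout: the average marks are contiguous in
// memory, so the summation loop touches one dense array, which is
// the access pattern autovectorizers handle best.
class students {
public:
    std::vector<std::string> names;
    std::vector<double> average_marks;

    double calculate_average_mark() const {
        double sum = 0.0;
        for (std::size_t i = 0; i < average_marks.size(); ++i) {
            sum += average_marks[i];
        }
        return sum / average_marks.size();
    }
};
```

The trade-off is that a logical "record" is now split across two arrays, so code that needs a whole student at once must index both.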
Descriptors
Synchronization
Double/Triple Buffering
Pipeline State Changes
Ubershaders
https://developer.apple.com/videos/play/tech-talks/111373/
"An uber shader is an example of a long and complex shader that can
be used to render any possible material. These types of shaders have
lots of branches for any possible combination. When artists create a
material, the material parameters are stored in a Metal buffer, used
by the material shader. This buffer gets updated when you change the
parameters, but no recompilation is required. An uber shader queries
material parameters from a buffer, doing conditional branches at
runtime to enable and disable features. Developers sometimes make
uber shaders that read material parameters from a buffer, and then
the material shader chooses different code paths at runtime based on
the buffer's contents. This approach lets the shader render a new
material effect without recompiling, because the only changes are
parameters in the buffer."
"The Metal function constants features specializes the shader efficiently
and removes the code that isn't reachable at runtime. For example, uber
shaders typically benefit from function constants. This responsive approach
is great during development, however, the shader has to account for several
possibilities and read from additional buffers which may affect an app's
performance. Another approach is to specialize the shader at compile
time instead of at runtime. By building the shader variance offline with
preprocessor macros. This approach means you have to compile all the
possible variant combinations offline. For example, a glossy variant
could be the combination of enabling both is_glossy and use_shadows macros,
by disabling the remaining macros. Even if you compile them offline ahead
of time, each variant adds up which can significantly increase the size
of your Metal library. It can also increase compile time because each
shade of variant has to be compiled starting from Metal source."
"Function constants can provide another way to specialize the
shaders. Compared to using macros, it can reduce both compile time
and the size of the Metal library. With function constants, you
compile an uber shader one time from source to an intermediate Metal
function. From that function, you only create the variance your app
needs based on function constants you define."