1. Compute Shader Overview

A compute shader is a type of shader that performs general-purpose parallel computation on the GPU. Unlike the render pipeline, a compute shader does not draw anything; instead it harnesses the GPU's massive parallelism to process data.[^1]

Compute Shaders vs. the Render Pipeline

| Feature | Compute shader | Render pipeline |
| --- | --- | --- |
| Purpose | General-purpose computation | Graphics rendering |
| Input | Buffers/Textures | Vertices/Textures |
| Output | Buffers/Textures | Framebuffer |
| Thread model | Workgroups | Vertices/Fragments |
| Programming model | CSP / data-parallel | Graphics pipeline |

Application Scenarios

  • Scientific computing: matrix operations, linear algebra
  • Image processing: convolution, filters, Fourier transforms
  • Physics simulation: particle systems, fluid dynamics
  • Machine learning: tensor operations, neural-network inference

2. WGSL Compute Shader Basics

Basic Syntax

@compute @workgroup_size(64)
fn computeMain(
  @builtin(global_invocation_id) global_id: vec3<u32>
) {
  let index = global_id.x;
  // compute logic here
}

Key Attributes

| Attribute | Meaning |
| --- | --- |
| @compute | Marks the function as a compute shader entry point |
| @workgroup_size(N) | Number of invocations per workgroup |
| @builtin(global_invocation_id) | Globally unique invocation ID |
| @builtin(workgroup_id) | Position of the workgroup within the dispatch |
| @builtin(local_invocation_id) | Position of the invocation within its workgroup |

3. Workgroups and the Thread Model

Conceptual Hierarchy

┌─────────────────────────────────────────────────────────┐
│              dispatchWorkgroups(4, 3, 2)                 │
│                                                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐       │
│  │ Workgroup   │  │ Workgroup   │  │ Workgroup   │       │
│  │ (0,0,0)     │  │ (1,0,0)     │  │ (2,0,0)     │       │
│  │ ┌───┬───┐  │  │ ┌───┬───┐  │  │ ┌───┬───┐  │       │
│  │ │ T │ T │  │  │ │ T │ T │  │  │ │ T │ T │  │       │
│  │ ├───┼───┤  │  │ ├───┼───┤  │  │ ├───┼───┤  │       │
│  │ │ T │ T │  │  │ │ T │ T │  │  │ │ T │ T │  │       │
│  │ └───┴───┘  │  │ └───┴───┘  │  │ └───┴───┘  │       │
│  └─────────────┘  └─────────────┘  └─────────────┘       │
│        ...              ...              ...              │
└─────────────────────────────────────────────────────────┘

T = thread (invocation)
Every invocation runs the same compute logic

Built-in Values

@compute @workgroup_size(2, 4, 2)
fn main(
  @builtin(global_invocation_id) global_id: vec3<u32>,
  @builtin(workgroup_id) group_id: vec3<u32>,
  @builtin(local_invocation_id) local_id: vec3<u32>,
  @builtin(num_workgroups) num_groups: vec3<u32>
) {
  // global_id: global invocation ID; component-wise this is
  //   group_id * workgroup_size + local_id, so x in 0..7, y in 0..11, z in 0..3
  // group_id: position of this workgroup within the dispatch
  // local_id: position of this invocation within its workgroup
  // num_groups: the dispatchWorkgroups counts, here (4, 3, 2)
}
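
These built-ins are related by simple index arithmetic. A minimal JavaScript sketch (`globalInvocationId` is a hypothetical helper, not part of the WebGPU API) makes the relationship concrete:

```javascript
// global_invocation_id = workgroup_id * workgroup_size + local_invocation_id,
// computed component-wise, mirroring how WGSL derives it.
function globalInvocationId(groupId, localId, workgroupSize) {
  return groupId.map((g, axis) => g * workgroupSize[axis] + localId[axis]);
}

// Workgroup (1, 0, 0), local invocation (1, 3, 0), workgroup_size(2, 4, 2):
globalInvocationId([1, 0, 0], [1, 3, 0], [2, 4, 2]); // → [3, 3, 0]
```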

Workgroup Size Limits

| Limit | Typical value |
| --- | --- |
| maxComputeInvocationsPerWorkgroup | 256 |
| maxComputeWorkgroupSizeX | 256 |
| maxComputeWorkgroupSizeY | 256 |
| maxComputeWorkgroupSizeZ | 64 |

Best practice: choose 64 or 128 as the workgroup size unless you have a specific reason to deviate.
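
Before hardcoding a size, it can help to validate it against the limits above. A small sketch (`workgroupSizeFits` is a hypothetical helper; `limits` can be `device.limits` or any object with the same fields):

```javascript
// Check a proposed @workgroup_size(x, y, z) against the compute limits.
function workgroupSizeFits(limits, x, y = 1, z = 1) {
  return (
    x <= limits.maxComputeWorkgroupSizeX &&
    y <= limits.maxComputeWorkgroupSizeY &&
    z <= limits.maxComputeWorkgroupSizeZ &&
    x * y * z <= limits.maxComputeInvocationsPerWorkgroup
  );
}

// Typical values from the table above:
const typicalLimits = {
  maxComputeInvocationsPerWorkgroup: 256,
  maxComputeWorkgroupSizeX: 256,
  maxComputeWorkgroupSizeY: 256,
  maxComputeWorkgroupSizeZ: 64,
};

workgroupSizeFits(typicalLimits, 64);        // → true
workgroupSizeFits(typicalLimits, 16, 16, 2); // → false (512 invocations > 256)
```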

4. Buffer Management

Creating Buffers

// Read-only input buffer
const inputBuffer = device.createBuffer({
  size: array.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
 
// Output buffer
const outputBuffer = device.createBuffer({
  size: outputSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
 
// Mappable buffer (for reading results back on the CPU)
const readBuffer = device.createBuffer({
  size: outputSize,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});

Buffer Usage Flags

| Flag | Meaning |
| --- | --- |
| MAP_READ | Mappable for CPU reads (read results back from the GPU) |
| MAP_WRITE | Mappable for CPU writes (upload data to the GPU) |
| COPY_SRC | Usable as a copy source |
| COPY_DST | Usable as a copy destination |
| VERTEX | Vertex buffer |
| UNIFORM | Uniform buffer |
| STORAGE | Storage buffer (compute shaders) |
| INDEX | Index buffer |
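
One validation rule worth knowing: the WebGPU spec only allows MAP_READ to be combined with COPY_DST, and MAP_WRITE with COPY_SRC. The sketch below encodes that rule (`isValidMapUsage` is a hypothetical checker; the bit values mirror the spec's GPUBufferUsage constants, which exist as globals in the browser):

```javascript
// Bit values as defined by the WebGPU spec.
const GPUBufferUsage = {
  MAP_READ: 0x0001, MAP_WRITE: 0x0002,
  COPY_SRC: 0x0004, COPY_DST: 0x0008,
  STORAGE: 0x0080,
};

// MAP_READ may only be combined with COPY_DST; MAP_WRITE only with COPY_SRC.
function isValidMapUsage(usage) {
  if (usage & GPUBufferUsage.MAP_READ) {
    return (usage & ~(GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST)) === 0;
  }
  if (usage & GPUBufferUsage.MAP_WRITE) {
    return (usage & ~(GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC)) === 0;
  }
  return true;
}

isValidMapUsage(GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST); // → true
isValidMapUsage(GPUBufferUsage.MAP_READ | GPUBufferUsage.STORAGE);  // → false
```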

Data Transfer

// CPU → GPU
device.queue.writeBuffer(buffer, 0, dataArray);
 
// GPU → CPU (requires mapping)
async function readBuffer(device, buffer, size) {
  // 1. Create a temporary mappable buffer
  const readBuffer = device.createBuffer({
    size: size,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  
  // 2. Encode the copy command
  const commandEncoder = device.createCommandEncoder();
  commandEncoder.copyBufferToBuffer(
    buffer, 0,      // source
    readBuffer, 0,  // destination
    size
  );
  device.queue.submit([commandEncoder.finish()]);
  
  // 3. Map the buffer and copy the data out; the copy matters because
  //    the mapped ArrayBuffer is detached once the buffer is unmapped
  await readBuffer.mapAsync(GPUMapMode.READ);
  const data = new Float32Array(readBuffer.getMappedRange().slice(0));
  
  // 4. Unmap when done
  readBuffer.unmap();
  
  return data;
}

5. Bind Groups and Resource Binding

Bind Group Layout

const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    {
      binding: 0,
      visibility: GPUShaderStage.COMPUTE,
      buffer: { type: 'read-only-storage' },
    },
    {
      binding: 1,
      visibility: GPUShaderStage.COMPUTE,
      buffer: { type: 'read-only-storage' },
    },
    {
      binding: 2,
      visibility: GPUShaderStage.COMPUTE,
      buffer: { type: 'storage' },
    },
  ],
});

Bind Group

const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    { binding: 0, resource: { buffer: inputBufferA } },
    { binding: 1, resource: { buffer: inputBufferB } },
    { binding: 2, resource: { buffer: outputBuffer } },
  ],
});

Bindings in WGSL

@group(0) @binding(0) var<storage, read> inputA: array<f32>;
@group(0) @binding(1) var<storage, read> inputB: array<f32>;
@group(0) @binding(2) var<storage, read_write> output: array<f32>;
 
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
  let index = id.x;
  // Guard against the padding invocations of the last workgroup
  if (index >= arrayLength(&output)) {
    return;
  }
  output[index] = inputA[index] + inputB[index];
}

6. Compute Pipelines

Creating a Compute Pipeline

const computePipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout],
  }),
  compute: {
    module: device.createShaderModule({
      code: computeShaderCode,
    }),
    entryPoint: 'main',
  },
});

Encoding Compute Commands

function dispatchCompute(device, pipeline, bindGroup, dataSize) {
  const commandEncoder = device.createCommandEncoder();
  
  const computePass = commandEncoder.beginComputePass();
  computePass.setPipeline(pipeline);
  computePass.setBindGroup(0, bindGroup);
  computePass.dispatchWorkgroups(
    Math.ceil(dataSize / 64),  // x
    1,  // y
    1   // z
  );
  computePass.end();
  
  device.queue.submit([commandEncoder.finish()]);
}

7. In Practice: Vector Addition

Complete Example

// WGSL compute shader
const shaderCode = `
  @group(0) @binding(0) var<storage, read> a: array<f32>;
  @group(0) @binding(1) var<storage, read> b: array<f32>;
  @group(0) @binding(2) var<storage, read_write> output: array<f32>;
 
  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let index = id.x;
    // Guard against the padding invocations of the last workgroup
    if (index >= arrayLength(&output)) {
      return;
    }
    output[index] = a[index] + b[index];
  }
`;
 
async function vectorAddition(device, a, b) {
  const n = a.length;
  const bufferSize = n * 4;  // f32 = 4 bytes
 
  // Create buffers
  const aBuffer = device.createBuffer({
    size: bufferSize,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  const bBuffer = device.createBuffer({
    size: bufferSize,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  const outputBuffer = device.createBuffer({
    size: bufferSize,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });
 
  // Upload input data
  device.queue.writeBuffer(aBuffer, 0, a);
  device.queue.writeBuffer(bBuffer, 0, b);
 
  // Create the compute pipeline ('auto' derives the bind group layout from the shader)
  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: {
      module: device.createShaderModule({ code: shaderCode }),
      entryPoint: 'main',
    },
  });
 
  // Create the bind group
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: aBuffer } },
      { binding: 1, resource: { buffer: bBuffer } },
      { binding: 2, resource: { buffer: outputBuffer } },
    ],
  });
 
  // Encode and submit the compute pass
  const commandEncoder = device.createCommandEncoder();
  const pass = commandEncoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(n / 64));
  pass.end();
  device.queue.submit([commandEncoder.finish()]);
 
  // Read back the result (readBuffer from section 4)
  return await readBuffer(device, outputBuffer, bufferSize);
}
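
When first getting this running, it helps to compare the GPU output against a plain CPU reference (`vectorAdditionCPU` is a hypothetical helper, not part of the WebGPU example above):

```javascript
// CPU reference implementation for checking the GPU result.
function vectorAdditionCPU(a, b) {
  const out = new Float32Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = a[i] + b[i];
  }
  return out;
}

const a = new Float32Array([1, 2, 3]);
const b = new Float32Array([10, 20, 30]);
vectorAdditionCPU(a, b); // → Float32Array [11, 22, 33]
```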

8. In Practice: Matrix Multiplication

WGSL Matrix Multiplication

struct Matrix {
  size: vec2f,
}
 
@group(0) @binding(0) var<storage, read> a: Matrix;
@group(0) @binding(1) var<storage, read> aData: array<f32>;
@group(0) @binding(2) var<storage, read> b: Matrix;
@group(0) @binding(3) var<storage, read> bData: array<f32>;
@group(0) @binding(4) var<storage, read_write> c: Matrix;
@group(0) @binding(5) var<storage, read_write> cData: array<f32>;
 
@compute @workgroup_size(8, 8)
fn main(
  @builtin(global_invocation_id) id: vec3<u32>
) {
  let row = id.x;
  let col = id.y;
  
  if (row >= u32(a.size.x) || col >= u32(b.size.y)) {
    return;
  }
  
  var sum = 0.0;
  for (var k = 0u; k < u32(a.size.y); k = k + 1u) {
    let aVal = aData[row * u32(a.size.y) + k];
    let bVal = bData[k * u32(b.size.y) + col];
    sum = sum + aVal * bVal;
  }
  
  cData[row * u32(c.size.y) + col] = sum;
}
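
Since each invocation produces one output element under @workgroup_size(8, 8), the dispatch size on the JavaScript side comes from rounding both output dimensions up (`matmulDispatchSize` is a hypothetical helper for illustration):

```javascript
// Workgroup counts for an M×N output matrix with an 8×8 workgroup.
function matmulDispatchSize(rows, cols, wgSize = 8) {
  return [Math.ceil(rows / wgSize), Math.ceil(cols / wgSize)];
}

// e.g. pass.dispatchWorkgroups(...matmulDispatchSize(100, 100));
matmulDispatchSize(100, 100); // → [13, 13] (the extra invocations return early)
```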

9. Performance Optimization

Choosing a Workgroup Size

// General default: 64
@workgroup_size(64)
 
// Good fit for 2D work
@workgroup_size(8, 8)
 
// Good fit for 3D work
@workgroup_size(4, 4, 4)

Memory Access Patterns

  • Coalesced access: adjacent invocations should access adjacent memory
  • Avoid bank conflicts: mind shared-memory access patterns
  • Use workgroup memory: keep frequently reused data in var<workgroup> storage

Handling Race Conditions

// Wrong: multiple invocations updating the same location race
output[0] = output[0] + value;  // race condition!
 
// Right: use an atomic operation
// (declared as: var<storage, read_write> counter: atomic<u32>;)
atomicAdd(&counter, 1u);
 
// Note: storageBarrier() and workgroupBarrier() order memory accesses
// within a single workgroup; they do not make writes atomic

10. Common Questions

Computing the Dispatch Count

// Suppose 1024 elements and a workgroup size of 64
const workgroupSize = 64;
const numElements = 1024;
const numWorkgroups = Math.ceil(numElements / workgroupSize);
 
// Dispatch 16 workgroups
pass.dispatchWorkgroups(numWorkgroups);

Buffer Mapping Failures

// mapAsync rejects if the buffer is already mapped (or has a map pending),
// or if it lacks the matching MAP_READ/MAP_WRITE usage flag
await buffer.mapAsync(GPUMapMode.READ);
 
// The promise resolves only after previously submitted GPU work using
// the buffer has completed; waiting explicitly makes the ordering clear
device.queue.submit([commandEncoder.finish()]);
await device.queue.onSubmittedWorkDone();
await buffer.mapAsync(GPUMapMode.READ);

Footnotes

[^1]: Chrome for Developers - GPU Compute