I’ve been working on a fairly simple physics engine in C++ for the past few months, and I was starting to hit my CPU’s limit on the number of simulated bodies, which turned out to be around 150. I began thinking there had to be a better way to detect and resolve contacts than doing the O(N*(N-1)/2) pairwise checks serially on the CPU. I decided to look into offloading contact detection from the CPU onto the GPU using CUDA. I initially got the idea after reading Daniel Houghton’s blog post about Compute Shaders in DirectX, but thought that I could do more than just simulate the particle positions on the GPU.
To begin with, I thought it would be valuable to benchmark the performance at a couple of different body counts.
| Contact Detection | Contact Resolution | FPS with 200 bodies | FPS with 1500 bodies |
| --- | --- | --- | --- |
| CPU | CPU | 83 | 2-3 |
| CPU | GPU (CUDA) | 98 | 4-5 |
| GPU (CUDA) | CPU | 165 | 24-25 |
| GPU (CUDA) | GPU (CUDA) | 202 | 27-28 |
As can be seen, using CUDA dramatically improved performance and allowed for 10x as many particles without visually ruining the simulation.
To start off, you’ll need to install the CUDA SDK and Toolkit. I’m using version 6.0, which is available to members of the NVIDIA Developer Program, which you can join for free. All of the functionality needed in this demo, however, is available with CUDA 5.5.
Next, you can either create a new CUDA project or add the functionality to an existing project. An overview of how to integrate CUDA into an existing Visual Studio project, or create a new CUDA project, can be found here.
The first thing you need to understand about CUDA, if you didn’t already, is that it uses parallel computing on the GPU, as opposed to the standard serial computing that runs on a single CPU. What this means is that CUDA excels at the kind of work you might put in a foreach loop, where no element relies on another element’s output. Think of taking an array of numbers and getting the square root of each: finding the root of the 5th element has no bearing on the result of the 2nd or the 9th. There are ways to share data between threads and blocks, but if at all possible, your CUDA kernel should not rely on other executions. You can choose how many blocks of the GPU you wish to use and how many threads each block should consist of, which lets you scale GPU usage up and down based on how many executions you expect to need.
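To make that concrete, here’s a minimal sketch of such a kernel (this is just an illustration, not code from my engine): each thread handles exactly one element of the array.

```cpp
// Minimal illustrative kernel: each thread computes the square root of one
// element of the input array.
__global__ void squareRoots(float* data, int count)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < count)
        data[i] = sqrtf(data[i]);
}
```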
The second most important thing to know is that GPU memory and CPU memory are distinct entities (unless you are using Unified Memory, available in CUDA 6.0 on GPUs with compute capability 3.0+). This means that when you want to pass data back and forth, you need to allocate memory on the GPU, copy the data over, and free the memory afterwards. This is accomplished through the cudaMalloc, cudaMemcpy and cudaFree calls. Because the CPU and GPU have separate memory spaces, you cannot pass pointers to CPU memory to the GPU and vice versa, as they cannot access each other’s memory.
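As a sketch of that round trip, using the squareRoots kernel above (h_data and count are placeholder names, not from the engine):

```cpp
// Allocate device memory, copy the input over, run the kernel, copy the
// results back, and free the device memory.
float* d_data;
cudaMalloc((void**)&d_data, count * sizeof(float));
cudaMemcpy(d_data, h_data, count * sizeof(float), cudaMemcpyHostToDevice);
squareRoots<<<(count + 255) / 256, 256>>>(d_data, count);
cudaMemcpy(h_data, d_data, count * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_data);
```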
Now that a basic understanding of CUDA has been established, let’s look at some examples.
A function that is called from the CPU but executes on the GPU is called a kernel. Kernels are marked with __global__ at the beginning of the function declaration.
```cpp
__global__ void findResolveContacts(PhysicsParticle* particles, int* contactIndex, int particleCount, float deltaTime)
```
This is the declaration for my kernel, which finds and resolves all physics contacts between particles in my engine. It takes in an array of PhysicsParticle objects, an int pointer that is shared between all blocks and threads, the number of particles, and the time delta for the frame.
Before calling the kernel, we need to allocate some device memory and copy over our particle data from CPU to GPU memory.
```cpp
// Device copy of the particle array
PhysicsParticle* d_a;
cudaMalloc((void **)&d_a, particleCount*sizeof(PhysicsParticle));
cudaMemcpy(d_a, particles, particleCount*sizeof(PhysicsParticle), cudaMemcpyHostToDevice);

// Contact counter shared by all threads, initialized to zero on the host
// and then copied to the device
int *contactIndex, *d_contactIndex;
contactIndex = (int*)malloc(sizeof(int));
*contactIndex = 0;
cudaMalloc((void **)&d_contactIndex, sizeof(int));
cudaMemcpy(d_contactIndex, contactIndex, sizeof(int), cudaMemcpyHostToDevice);
```
The CUDA versions of these calls work much like their standard C counterparts, except that cudaMemcpy requires you to specify the direction of the copy. The two directions we will use are cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, which are pretty self-explanatory as to when they are used.
With all of our device memory assigned, we can now call our kernel:
```cpp
findResolveContacts<<<(particleCount + 512 - 1) / 512, 512>>>(d_a, d_contactIndex, particleCount, deltaTime);
```
The <<< >>> syntax indicates a kernel launch on the GPU and contains the number of blocks and the number of threads per block: if n is the number of executions required and m is the number of threads per block, then (n + m - 1)/m is the number of blocks. In this case I am using 512 threads per block with a variable particleCount. The arguments follow the closing >>>, and each must be either a pointer to device memory (which I prefix with d_) or a value passed directly. If we were simulating 1024 particles, for example, we would have 2 blocks with 512 threads each.
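To make the formula explicit, the same launch could be written with named variables (the names below are just for illustration):

```cpp
int threadsPerBlock = 512;                                                // m
int numBlocks = (particleCount + threadsPerBlock - 1) / threadsPerBlock; // (n + m - 1) / m, i.e. n / m rounded up
findResolveContacts<<<numBlocks, threadsPerBlock>>>(d_a, d_contactIndex, particleCount, deltaTime);
```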
To actually get useful work done, we not only need to execute the kernel the correct number of times, but each execution also needs to know which part of the data it should work on. We achieve this by computing the thread id:
```cpp
int i = threadIdx.x + blockIdx.x * blockDim.x;
```
threadIdx, blockIdx and blockDim are all values known to the kernel and can be used to determine exactly which thread the kernel is running in. threadIdx is the offset of the thread within the block, blockIdx is the offset of the block the thread is in, and blockDim is the size of each block. Thus with 1024 particles, we will have a range of ids from 0 to 1023, which exactly correlates with the array of PhysicsParticles we fed in.
Before we access the array, we need to do a quick check to make sure this thread isn’t beyond the length of the array:
```cpp
if(i < particleCount && i >= 0)
```
Then we can begin looking for contacts between the particle at particles[i] and each other element of particles. Because this is a simple particle physics engine, we are only dealing with sphere collisions.
```cpp
Vector3 iPos = particles[i].GetPosition();
for(int j = i + 1; j < particleCount; ++j)
{
    Vector3 jPos = particles[j].GetPosition();

    // Vector from particle j to particle i, and the distance between them
    Vector3 direction;
    direction.x = iPos.x - jPos.x;
    direction.y = iPos.y - jPos.y;
    direction.z = iPos.z - jPos.z;
    float distance = direction.length();
```
Another important thing to mention here: any function you want to call from a kernel must be prefaced with __device__. If you want to call it from both the kernel and the CPU, preface it with __device__ __host__. I start j at i + 1 so that two executions don’t both resolve the same contact; each execution checks for contacts only with particles at higher indices.
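For example, for the direction.length() call above to compile for the GPU, the Vector3 member function needs those qualifiers. My actual Vector3 isn’t shown in this post, but a length() with those qualifiers would look something like this:

```cpp
// Illustrative only: the __device__ __host__ qualifiers (also required on the
// declaration inside the class) make this callable from CPU code and kernels alike.
__device__ __host__ float Vector3::length() const
{
    return sqrtf(x * x + y * y + z * z);
}
```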
```cpp
if(distance <= (particles[i].GetRadius() + particles[j].GetRadius()))
{
    if(*contactIndex < LIMIT)
    {
        ParticleContact contact;
        contact.first = i;
        contact.second = j;
        int index = atomicAdd(contactIndex, 1);
```
This section finds every pair of particles whose distance is less than or equal to the sum of their radii, meaning they are in contact or interpenetrating. I mentioned earlier that there may be cases where you do need multiple executions to reference each other or otherwise share live data. The solution comes in the form of a series of atomic operations. Each atomic operation performs its read-modify-write on the pointer as a single uninterruptible step, so no other thread can get in between, and atomic operations generally return the value that was at the pointer prior to being changed.
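The atomicAdd above is a good example: it bumps the shared contact counter and hands back the counter’s previous value, which can serve as a unique slot in an output array. A hypothetical use (the contacts array below is not part of the kernel shown in this post):

```cpp
// Hypothetical: each thread writes its contact into a device-side array at a
// unique index, so no two threads ever write to the same slot.
int index = atomicAdd(contactIndex, 1); // returns the value before the increment
if(index < LIMIT)
{
    contacts[index] = contact; // contacts would be a ParticleContact* in device memory
}
```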
```cpp
Vector3 contactNormal = direction.normalized();
float penetration = (particles[contact.first].GetRadius() + particles[contact.second].GetRadius()) - distance;
ResolveVelocity(particles[contact.first], particles[contact.second], contactNormal, deltaTime);
ResolveInterpenetration(particles[contact.first], particles[contact.second], contactNormal, penetration, deltaTime);
```
Finally, I call the __device__ functions that take the two contacting particles and resolve their velocities and interpenetration. It is important to remember that pointers can be used within __device__ functions as long as they point to device memory; my ResolveVelocity and ResolveInterpenetration take the PhysicsParticle arguments by reference so that the values for position and velocity can be changed.
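To give a feel for the shape of these functions, here is a rough sketch of an interpenetration resolver. It is illustrative rather than my exact code, and it assumes a GetInverseMass accessor that isn’t shown in this post:

```cpp
// Illustrative sketch: push two overlapping particles apart along the contact
// normal, each moving in proportion to its inverse mass.
// (deltaTime is unused in this simple positional correction but kept to match the call above.)
__device__ void ResolveInterpenetration(PhysicsParticle& a, PhysicsParticle& b,
                                        Vector3 contactNormal, float penetration,
                                        float deltaTime)
{
    float totalInverseMass = a.GetInverseMass() + b.GetInverseMass();
    if(totalInverseMass <= 0.0f)
        return; // both particles are immovable

    // contactNormal points from b towards a, so a moves along it and b against it
    float moveA =  penetration * (a.GetInverseMass() / totalInverseMass);
    float moveB = -penetration * (b.GetInverseMass() / totalInverseMass);

    Vector3 posA = a.GetPosition();
    posA.x += contactNormal.x * moveA;
    posA.y += contactNormal.y * moveA;
    posA.z += contactNormal.z * moveA;
    a.SetPosition(posA);

    Vector3 posB = b.GetPosition();
    posB.x += contactNormal.x * moveB;
    posB.y += contactNormal.y * moveB;
    posB.z += contactNormal.z * moveB;
    b.SetPosition(posB);
}
```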
We will need to wait until every execution is complete before we proceed with the serial code on the CPU, so after we call findResolveContacts, we call
```cpp
cudaDeviceSynchronize();
```
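As a side note, cudaDeviceSynchronize returns a cudaError_t, which is a convenient place to check that the kernel actually ran without problems (this check isn’t strictly required for the demo):

```cpp
// Optional error check after synchronizing
cudaError_t err = cudaDeviceSynchronize();
if(err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));
```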
Synchronizing blocks the CPU until every device execution is complete. Once we are sure every contact has been found and resolved, we must copy the altered PhysicsParticle array back from device to host memory.
```cpp
cudaMemcpy(particles, d_a, particleCount*sizeof(PhysicsParticle), cudaMemcpyDeviceToHost);
```
With that, we can free all the memory we used
```cpp
free(contactIndex);
cudaFree(d_a);
cudaFree(d_contactIndex);
```
and return the particle array to be copied back into the pointers used everywhere else in the engine.
```cpp
// Copy the updated positions and velocities from the array returned by the
// CUDA routine back into the particle pointers used by the rest of the engine.
i = 0;
for(auto it = particles.begin(); it != particles.end(); ++it)
{
    (*it)->SetPosition(particleArray[i].GetPosition());
    (*it)->SetVelocity(particleArray[i].GetVelocity());
    ++i;
}
```
With that, we have a physics particle system that runs on the GPU using CUDA many times faster than it did running serially on the CPU. If you have any questions or comments, feel free to leave them below.