## Fine-grain parallel RRB-solver for 5-/9-point stencil problems suitable for GPU-type processors

Martijn de Jong, Auke van der Ploeg, Auke Ditzel, and Kees Vuik

### Abstract

Preconditioners based on incomplete factorization are very popular for a fast convergence of the PCG-algorithm. However, these preconditioners are hard to parallelize since most operations are inherently sequential. In this paper we present the RRB-solver, which is a PCG-type solver using an incomplete Cholesky factorization based on the Repeated Red-Black (RRB) method. The RRB-solver scales nearly as well as Multigrid, and in this paper we show that this method can be parallelized very efficiently on modern computing architectures including GPUs. For an efficient parallel implementation a clever storage scheme turns out to be the key. The storage scheme is called $r_1/r_2/b_1/b_2$ and it ensures that memory transfers are coalesced throughout the algorithm, yielding near-optimal performance of the RRB-solver. The $r_1/r_2/b_1/b_2$-storage scheme in combination with a CUDA implementation on the GPU gives speedup factors of more than 30 compared to a sequential implementation on one CPU core for 5-/9-point stencils problems. A comparison with algebraic Multigrid further shows that the RRB-solver can be implemented very efficiently on the GPU.

Full Text (PDF) [708 KB], BibTeX

### Key words

repeated red-black, conjugate gradient, incomplete Cholesky, parallelization, CUDA, 2D Poisson

### AMS subject classifications

35J05, 65F08, 65F10, 65F50, 65Y05, 68W10

< Back