A. R. Brodtkorb
Simplified Ocean Models on GPUs, Submitted, 2016.
This article describes the implementation of three different simplified ocean models on a GPU (graphics processing unit) using Python and PyOpenCL. The three models are all based on solving the shallow water equations on Cartesian grids, and our work is motivated by the aim of running very large ensembles of forecast models for fully nonlinear data assimilation. The models are the linearized shallow water equations, the non-linear shallow water equations, and the two-layer non-linear shallow water equations, and they contain progressively more physical properties of the ocean dynamics. We show how these models are discretized to run efficiently on a graphics processing unit, discuss how to implement them, and show some simulation results. The implementation is available online under an open source license and may serve as a starting point for others implementing similar oceanographic models.
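As a rough illustration of the kind of scheme the simplest of these models uses, the sketch below advances the 1D linearized shallow water equations one forward-backward step on a periodic grid with NumPy. The paper's models are 2D and run as PyOpenCL kernels; all names and parameter values here are illustrative only.

```python
import numpy as np

def step_linear_swe(eta, u, H=100.0, g=9.81, dx=100.0, dt=0.1):
    """One forward-backward step of the 1D linearized shallow water
    equations on a periodic grid: first update the surface elevation
    eta from the old velocity, then update the velocity from the *new*
    elevation (the classic forward-backward trick for wave equations)."""
    # Continuity: eta_t = -H * u_x, centered difference in space.
    du_dx = (np.roll(u, -1) - np.roll(u, 1)) / (2.0 * dx)
    eta_new = eta - dt * H * du_dx
    # Momentum: u_t = -g * eta_x, using the freshly updated elevation.
    deta_dx = (np.roll(eta_new, -1) - np.roll(eta_new, 1)) / (2.0 * dx)
    u_new = u - dt * g * deta_dx
    return eta_new, u_new
```

On a periodic grid the centered differences telescope, so the scheme conserves the total elevation (mass) exactly, which is easy to check numerically.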
T. Gierlinger, A.R. Brodtkorb, A. Stumpf, M. Weilera, and F. Michel.
Visualization of marine sand dune displacements utilizing modern GPU techniques, in The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2015.
[Paper (DOI)]
Quantifying and visualizing deformation and material fluxes is an indispensable tool for many geoscientific applications at different scales, comprising for example global convective models (Burstedde et al., 2013), co-seismic slip (Leprince et al., 2007), or local slope deformation (Stumpf et al., 2014b). Within the European project IQmulus (http://www.iqmulus.eu) a special focus is laid on the efficient detection and visualization of submarine sand dune displacements. In this paper we present our approaches to visualizing the calculated displacements utilizing modern GPU techniques, enabling the user to interactively analyse intermediate and final results within the whole workflow.
M. L. Sætra, A. R. Brodtkorb, and K.-A. Lie,
Efficient GPU-Implementation of Adaptive Mesh Refinement for the Shallow-Water Equations, Journal of Scientific Computing, 2014.
[Draft (PDF)] | [Paper (DOI)]
The shallow-water equations model hydrostatic flow below a free surface for cases in which the ratio between the vertical and horizontal length scales is small and are used to describe waves in lakes, rivers, oceans, and the atmosphere. The equations admit discontinuous solutions, and numerical solutions are typically computed using high-resolution schemes. For many practical problems, there is a need to increase the grid resolution locally to capture complicated structures or steep gradients in the solution. An efficient method to this end is adaptive mesh refinement (AMR), which recursively refines the grid in parts of the domain and adaptively updates the refinement as the simulation progresses. Several authors have demonstrated that the explicit stencil computations of high-resolution schemes map particularly well to many-core architectures seen in hardware accelerators such as graphics processing units (GPUs). Herein, we present the first full GPU-implementation of a block-based AMR method for the second-order Kurganov–Petrova central scheme. We discuss implementation details, potential pitfalls, and key insights, and present a series of performance and accuracy tests. Although it is only presented for a particular case herein, we believe our approach to GPU-implementation of AMR is transferable to other hyperbolic conservation laws, numerical schemes, and architectures similar to the GPU.
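To make the block-based AMR idea concrete, the sketch below flags fixed-size blocks for refinement with a simple gradient criterion and refines a flagged block by a factor two per axis. This is a schematic CPU illustration, not the paper's GPU implementation; function names, the block size, and the piecewise-constant prolongation are all simplifications (a real AMR code would use a conservative reconstruction).

```python
import numpy as np

def flag_blocks(q, block=4, tol=0.5):
    """Flag fixed-size blocks whose largest cell-to-cell jump exceeds
    tol -- a toy version of a gradient-based refinement criterion."""
    ny, nx = q.shape
    flags = np.zeros((ny // block, nx // block), dtype=bool)
    for by in range(ny // block):
        for bx in range(nx // block):
            patch = q[by * block:(by + 1) * block,
                      bx * block:(bx + 1) * block]
            jump = max(np.abs(np.diff(patch, axis=0)).max(initial=0.0),
                       np.abs(np.diff(patch, axis=1)).max(initial=0.0))
            flags[by, bx] = jump > tol
    return flags

def refine(patch):
    """Refine a flagged block by a factor two per axis using
    piecewise-constant prolongation (each cell becomes a 2x2 tile)."""
    return np.kron(patch, np.ones((2, 2)))
```

Recursive refinement then amounts to applying the same flag-and-refine pass to the refined patches, with the coarse solution updating the ghost cells of each fine block.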
Volume carving is a well-known technique for reconstructing a 3D scene from a set of 2D images, using features (usually foreground estimates) detected in the individual cameras, together with the camera parameters, to backproject the 2D images into 3D. Spatial calibration of the cameras is trivial, but the resulting carved volume is very sensitive to temporal offsets between the cameras. Automatic synchronization between the cameras is therefore desired. In this paper, we present a highly efficient implementation of volume carving and synchronization on a heterogeneous system fitted with commodity GPUs. An online, real-time synchronization system is described and evaluated on surveillance video of an indoor scene. Improvements to state-of-the-art CPU-based algorithms are described.
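The carving step itself can be sketched in a few lines: a voxel survives only if it projects to foreground in every camera. The snippet below is a minimal NumPy illustration under assumed 3x4 projection matrices and boolean foreground masks; all names are illustrative, and a GPU implementation would evaluate the same test for all voxels in parallel.

```python
import numpy as np

def carve(grid_pts, masks, cameras):
    """Space-carving sketch: keep a voxel only if it projects to
    foreground in every camera. grid_pts is (N, 3) voxel centers,
    masks are boolean foreground images, cameras are 3x4 matrices."""
    keep = np.ones(len(grid_pts), dtype=bool)
    homog = np.hstack([grid_pts, np.ones((len(grid_pts), 1))])
    for mask, P in zip(masks, cameras):
        pix = homog @ P.T                       # homogeneous image coords
        uv = np.round(pix[:, :2] / pix[:, 2:3]).astype(int)
        h, w = mask.shape
        inside = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
                  (uv[:, 1] >= 0) & (uv[:, 1] < h))
        fg = np.zeros(len(grid_pts), dtype=bool)
        fg[inside] = mask[uv[inside, 1], uv[inside, 0]]
        keep &= fg                              # voxel must be seen by all
    return keep
```

The sensitivity to temporal offsets mentioned above follows directly from this test: if one camera's mask lags in time, a moving object's voxels fail the foreground check in that camera and are wrongly carved away.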
This report gives an introduction to using GPUs for computer vision. We start by giving an introduction to GPUs, followed by a state-of-the-art survey of computer vision on GPUs. We then present our implementation of a real-time system for running low-level image processing algorithms on the GPU, based on live H.264 data originating from commodity-level IP cameras.
A. R. Brodtkorb, T. R. Hagen, C. Schulz and G. Hasle
GPU Computing in Discrete Optimization Part I: Introduction to the GPU, EURO Journal on Transportation and Logistics, 2013.
[Draft (PDF)] | [Paper (Springer)]
C. Schulz, G. Hasle, A. R. Brodtkorb and T. R. Hagen
GPU Computing in Discrete Optimization Part II: Survey Focused on Routing Problems, EURO Journal on Transportation and Logistics, 2013.
[Draft (PDF)] | [Paper (Springer)]
Abstract: Today there is still a large gap between the performance of current optimization technology and the requirements of real-world applications. However, hardware development nowadays no longer results in higher speed for sequential algorithms, but in increased parallelism in terms of multi-core architectures and massively parallel accelerators like GPUs. The gap therefore has to be closed by utilizing this parallelism and all available hardware. Modern commodity PCs include both a multi-core CPU and at least one GPU, providing a low-cost, easily accessible heterogeneous environment for optimization algorithms. This has led to several studies of how certain optimization algorithms can use the GPU to accelerate their computations. This paper begins with a short historical introduction to modern mainstream computer architectures and the evolution of modern GPUs. To facilitate the development of optimization algorithms that utilize the GPU efficiently, we provide a thorough discussion of best practice and strategies for the development of scalable, high-performance GPU code. The heterogeneous aspect of using both the CPU and the GPU for computations is considered as well. In the second part of the paper we provide a general survey of the existing literature on heterogeneous computing in discrete optimization, followed by an in-depth critical discussion of selected papers on routing problems. We hope that the lessons that arise from the combination of the strategies for heterogeneous computing with the study of existing literature will stimulate further high-quality research on the development of efficient and powerful new heterogeneous optimization algorithms. Our point of view regarding those lessons and future research completes the paper.
A. R. Brodtkorb and M. L. Sætra,
Explicit Shallow Water Simulations on GPUs: Guidelines and Best Practices, in Proceedings of the XIX International Conference on Computational Methods for Water Resources, 2012.
[Draft (PDF)] | [Paper (PDF)]
Abstract: Graphics processing units have now been used for scientific calculations for over a decade, going from early proof-of-concepts to industrial use today. The inherent reason is that graphics processors are far more powerful than CPUs when it comes to both floating point operations and memory bandwidth, illustrated by the fact that three of the top 500 supercomputers in the world now utilize GPUs. In this paper, we present guidelines and best practices for harvesting the power of graphics processing units for shallow water simulations through stencil computations.
A. R. Brodtkorb, M. L. Sætra and T. R. Hagen,
GPU Programming Strategies and Trends in GPU Computing, Journal of Parallel and Distributed Computing, Volume 73, Issue 1, January 2013, Pages 4–13, DOI: 10.1016/j.jpdc.2012.04.003.
[Draft (PDF)] | [Paper (Elsevier)]
Abstract: Over the last decade, there has been a growing interest in the use of graphics processing units (GPUs) for non-graphics applications. From early academic proof-of-concept papers around the year 2000, the use of GPUs has now matured to a point where there are countless industrial applications. Together with the expanding use of GPUs, we have also seen a tremendous development in the programming languages and tools, and getting started programming GPUs has never been easier. However, whilst getting started with GPU programming can be simple, being able to fully utilize GPU hardware is an art that can take months and years to master. In this article, we give an overview of GPU programming strategies, with a focus on efficient hardware utilization. We give general advice in addition to step-by-step approaches to locating and removing bottlenecks through profile driven development. We conclude the article with our view on current and future trends.
M. L. Sætra and A. R. Brodtkorb,
Shallow Water Simulations on Multiple GPUs,
Proceedings of the Para 2010 Conference Part II, Lecture Notes in Computer Science 7134 (2012), pp 56--66, DOI: 10.1007/978-3-642-28145-7_6.
[Draft (PDF)] | [Paper (Springer)]
Abstract: We present a state-of-the-art shallow water simulator running on multiple GPUs. Our implementation is based on an explicit high-resolution finite volume scheme for the shallow water equations, suitable for modeling dam breaks and flooding. We use row domain decomposition to enable multi-GPU computations, and perform traditional CUDA block decomposition within each GPU for further parallelism. Our implementation shows near perfect weak and strong scaling, and enables simulation of domains consisting of up to 378 million cells at a rate of almost 400 megacells per second on the four GPUs of a Tesla S1070. Our experiments with the more recent Fermi architecture give an estimate of over 1 gigacell per second performance.
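The row domain decomposition mentioned above hinges on a ghost-row (halo) exchange between neighbouring subdomains before each time step. The sketch below illustrates the data movement with plain NumPy arrays; the function name and layout are illustrative, and in the actual multi-GPU setting each copy would be a device-to-device transfer rather than an array assignment.

```python
import numpy as np

def exchange_ghost_rows(subdomains, ghost=2):
    """Row-decomposition halo exchange sketch: each subdomain carries
    'ghost' padding rows at its top and bottom, filled here from the
    neighbouring subdomain's outermost *interior* rows."""
    for i, d in enumerate(subdomains):
        if i > 0:
            # Top ghost rows come from the bottom interior of the
            # subdomain above.
            d[:ghost] = subdomains[i - 1][-2 * ghost:-ghost]
        if i < len(subdomains) - 1:
            # Bottom ghost rows come from the top interior of the
            # subdomain below.
            d[-ghost:] = subdomains[i + 1][ghost:2 * ghost]
```

With two ghost rows per side, a second-order stencil can run several steps between exchanges, which is one way such codes hide the transfer latency behind computation.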
A. R. Brodtkorb,
Scientific Computing on Heterogeneous Architectures, Ph.D. thesis, University of Oslo, ISSN 1501-7710, No. 1031, 2010.
[Thesis (PDF)] | [Slides (PDF)]
The CPU has traditionally been the computational workhorse in scientific computing, but we have seen a tremendous increase in the use of accelerators, such as Graphics Processing Units (GPUs), in the last decade. These architectures are used because they consume less power and offer higher performance than equivalent CPU solutions. They are typically also far less expensive, as more CPUs, and even clusters, are required to match their performance. Even though these accelerators are powerful in terms of floating point operations per second, they are considerably more primitive in terms of capabilities. For example, they cannot even open a file on disk without the use of the CPU. Thus, most applications can benefit from using accelerators to perform heavy computation whilst running complex tasks on the CPU. This use of different compute resources is often referred to as heterogeneous computing, and we explore the use of heterogeneous architectures for scientific computing in this thesis. Through six papers, we present qualitative and quantitative comparisons of different heterogeneous architectures, the use of GPUs to accelerate linear algebra operations in MATLAB, and efficient shallow water simulation on GPUs. Our results show that the use of heterogeneous architectures can give large performance gains.
A. R. Brodtkorb, M. L. Sætra, and M. Altinakar,
Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation, Computers & Fluids, 55 (2011), pp. 1--12, DOI: 10.1016/j.compfluid.2011.10.012.
[Draft (PDF)] | [Paper (Elsevier)]
In this paper, we present an efficient implementation of a state-of-the-art high-resolution explicit scheme for the shallow water equations on graphics processing units. The selected scheme is well balanced, supports dry states, and suits the execution model of graphics processing units well. We verify and validate our implementation and show that efficient use of single precision hardware is sufficiently accurate. Our framework further supports real-time visualization with both photo-realistic and non-photo-realistic display of the physical quantities. We present performance results showing that we can accurately simulate the first 4000 seconds of the Malpasset dam break case in 27 seconds.
A. R. Brodtkorb, T. R. Hagen, K.-A. Lie and J. R. Natvig,
Simulation and Visualization of the Saint-Venant System using GPUs,
Computing and Visualization in Science, special issue on Hot topics in Computational Engineering, 13(7), (2011), pp. 341--353, DOI: 10.1007/s00791-010-0149-x.
[Draft (PDF)] | [Paper (Springer)]
Abstract: We consider three high-resolution schemes for computing shallow-water waves as described by the Saint-Venant system and discuss how to develop highly efficient implementations using graphical processing units (GPUs). The schemes are well-balanced for lake-at-rest problems, handle dry states, and support linear friction models. The first two schemes handle dry states by switching variables in the reconstruction step, so that bilinear reconstructions are computed using physical variables for small water depths and conserved variables elsewhere. In the third scheme, reconstructed slopes are modified in cells containing dry zones to ensure non-negative values at integration points. We discuss how single and double-precision arithmetics affect accuracy and efficiency, discuss scalability and resource utilization for our implementations, and demonstrate that all three schemes map very well to current GPU hardware. We have also implemented direct and close-to-photo-realistic visualization of simulation results on the GPU, giving visual simulations with interactive speeds for reasonably-sized grids.
A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik and O. O. Storaasli, State-of-the-Art in Heterogeneous Computing, Scientific Programming, 18(1) (2010), pp. 1--33
[Paper (PDF)] | [Paper (IOS Press)]
Abstract: Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.
A. R. Brodtkorb, An Asynchronous API for Numerical Linear Algebra, Scalable Computing: Practice and Experience, special issue on Recent Developments in Multi-Core Computing Systems, 9(3) (2008), pp. 153--163.
[Draft (PDF)] | [Paper (SCPE)]
Abstract: We present a task-parallel asynchronous API for numerical linear algebra that utilizes multiple CPUs, multiple GPUs, or a combination of both. Furthermore, we present a wrapper of this interface for use in MATLAB. Our API imposes only small overheads, scales perfectly to two processor cores, and shows even better performance when utilizing computational resources on the GPU.
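The core idea of such an asynchronous API, operations that return immediately and deliver their result through a future, can be sketched in a few lines. The toy front end below only uses a CPU thread pool and NumPy (the paper's API targets multiple CPUs and GPUs and a MATLAB wrapper); the class and method names are illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

class AsyncLinAlg:
    """Toy task-parallel linear-algebra front end: each operation is
    enqueued on a worker pool and returns a future immediately, so the
    caller can keep working while the result is computed."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def matmul(self, a, b):
        # Returns a Future; call .result() to block for the product.
        return self._pool.submit(np.dot, a, b)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Usage follows the fire-and-collect pattern: `fut = api.matmul(a, b)` returns at once, the caller does unrelated work, and `fut.result()` blocks only when the product is actually needed.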
A. R. Brodtkorb and T. R. Hagen,
A Comparison of Three Commodity-Level Parallel Architectures: Multi-core CPU, the Cell BE and the GPU, Seventh International Conference on Mathematical Methods for Curves and Surfaces, Lecture Notes in Computer Science, 5862 (2010), pp. 70--80
[Draft (PDF)] | [Paper (Springer)]
Abstract: We explore three commodity parallel architectures: multi-core CPUs, the Cell BE processor, and graphics processing units. We have implemented four algorithms on these three architectures: solving the heat equation, inpainting using the heat equation, computing the Mandelbrot set, and MJPEG movie compression. We use these four algorithms to exemplify the benefits and drawbacks of each parallel architecture.
A. R. Brodtkorb, The Graphics Processor as a Mathematical Coprocessor in MATLAB, The Second International Conference on Complex, Intelligent and Software Intensive Systems, pp. 822--827, March 2008, DOI: 10.1109/CISIS.2008.68.
[Draft (PDF)] | [Paper (DOI)]
We present an interface to the graphics processing unit (GPU) from MATLAB, and four algorithms from numerical linear algebra available through this interface: matrix-matrix multiplication, Gauss-Jordan elimination, PLU factorization, and tridiagonal Gaussian elimination. In addition to being a high-level abstraction of the GPU, the interface offers background processing, enabling computations to be executed on the CPU simultaneously. The algorithms are shown to be up to 31 times faster than highly optimized CPU code. The algorithms have only been tested on single precision hardware, but will easily run on new double precision hardware.
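Of the four routines listed, tridiagonal Gaussian elimination is compact enough to sketch in full. The reference version below is the classic Thomas algorithm in NumPy, written here only to show what the GPU routine computes; the function name and argument layout are illustrative.

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system by Gaussian elimination without
    pivoting (the Thomas algorithm).
    a: sub-diagonal (n-1), b: diagonal (n), c: super-diagonal (n-1),
    d: right-hand side (n). Assumes the system is diagonally dominant."""
    n = len(b)
    cp = np.zeros(n - 1)
    dp = np.zeros(n)
    # Forward sweep: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    # Back substitution.
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The forward sweep has a sequential dependency between rows, which is why GPU versions of this routine typically solve many independent tridiagonal systems in parallel rather than parallelizing a single solve.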
A. R. Brodtkorb, A MATLAB Interface to the GPU, Master's thesis,
Department of Informatics, Faculty of Mathematics and Natural Sciences, University of Oslo, May 2007.
[Thesis (DUO)] | [Thesis (PDF)]
Abstract: This thesis delves into the field of general purpose computation on graphics processing units (GPGPU). A MATLAB interface for solving numerical linear algebra problems on the graphics processing unit (GPU) is presented, together with three algorithms from numerical linear algebra. The algorithms are shown to be faster than the highly efficient ATLAS implementations used in MATLAB. In addition, the interface allows background processing on the GPU, enabling it to be used as a mathematical coprocessor. The computations are shown to be sufficiently accurate, and solving the shallow water equations implicitly is demonstrated, where both the CPU and the GPU are utilized for maximum performance. A comparison of the interface and other high-level languages for GPGPU is also presented.
A. R. Brodtkorb, T. Fladby and M. L. Sætra, PLU factorization on a Cluster of GPUs Using Fast Ethernet, Technical report, 2007.
Abstract: In this white paper, we present a novel approach to solve linear systems of equations on a cluster using the PLU factorization. We use the graphics processing unit (GPU) as the main computational engine at each node, and a block-cyclic data distribution to solve the system. The local computation is a new way of solving the PLU factorization on the GPU. It utilizes the full four-way vectorized arithmetic found in most GPUs, and a new pivoting strategy. The global algorithm uses the message passing interface (MPI) for communication between nodes. We show that our algorithm is highly efficient on the local nodes, but bounded by the relatively slow network. A faster network will eliminate this bottleneck, and the speed of the local computations shows promising results.
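The block-cyclic distribution mentioned above can be stated in one line: matrix block (i, j) belongs to process (i mod P, j mod Q) on a P-by-Q process grid. The helper below is only an illustration of that mapping (the function name is mine); its point is that the active trailing submatrix of a PLU factorization stays spread over all nodes as elimination proceeds, which is what keeps the load balanced.

```python
def block_owner(i, j, p_rows, p_cols):
    """Block-cyclic mapping sketch: block (i, j) of the matrix is
    owned by process (i mod p_rows, j mod p_cols) on a process grid
    with p_rows x p_cols nodes."""
    return (i % p_rows, j % p_cols)
```

For example, on a 2x2 process grid, blocks (0,0), (2,0), and (0,2) all land on process (0,0), while (1,1) and (3,3) land on process (1,1).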
A. R. Brodtkorb, Matrix-Matrix Multiplication in MATLAB using the GPU, Technical report, 2006.
Abstract: The use of GPUs as the main computing resource has yielded great speed-up factors in several fields, including solving differential equations, linear algebra, signal processing, and database queries. There have been several attempts at implementing efficient algorithms for matrix-matrix products, with varying results, and in-depth analysis of the algorithms has been presented as well. In this paper I review the work done in the field and present a crude implementation of matrix-matrix products using the GPU. The implementation is run in MATLAB.