GPU support via OpenCL

darktable offers a GPU-accelerated pixelpipe, which speeds up both editing and exporting. I have an NVIDIA GeForce GTX 960, so there is good reason to run the calculations on the GPU instead of the CPU: it reduces the time needed to render an image.

My system is quite old, so every bit of work the GeForce takes over is noticeable. Here are my current specs:

Type      Component
CPU       Core i7-2600K @ 3.40GHz
Graphics  NVIDIA GeForce GTX 960 (2GB)
Disk      Samsung SSD 840 Pro

Of course, OpenCL, including NVIDIA's OpenCL support, has to be installed on your system.

The setting to modify

First of all, you have to enable OpenCL support. This can be done in the settings dialog under core options by ticking "activate OpenCL support". The value tuned in this post lives in darktable's configuration file, which I open in an editor:

nano /home/user/.config/darktable/darktablerc

The setting we will adjust is the following:


The value above is the default. The setting tells darktable how many MB of the total available GPU memory should be left free for the driver and for video purposes. If you set it too high, darktable will not have enough memory to work with; if you set it too low, the driver will refuse to hand over the memory and force processing down the CPU path.
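Before changing anything, it can help to list the OpenCL-related entries currently in darktablerc. The following is a minimal sketch, assuming the file is plain text with one key=value pair per line and that the relevant keys start with "opencl". The function name and the sample keys are made up for illustration.

```python
def opencl_settings(rc_text):
    """Return the key=value pairs whose key starts with 'opencl'."""
    settings = {}
    for line in rc_text.splitlines():
        key, sep, value = line.partition("=")
        # Skip lines without '=' and keys unrelated to OpenCL.
        if sep and key.startswith("opencl"):
            settings[key] = value
    return settings


# Illustrative sample only, not a real darktablerc.
sample = "opencl=TRUE\nopencl_some_setting=123\nunrelated_key=abc\n"
for key, value in opencl_settings(sample).items():
    print(f"{key} = {value}")
```

To run it against the real file, read the text with pathlib.Path("/home/user/.config/darktable/darktablerc").read_text() and pass it to the function. Close darktable before editing the file by hand, since it may rewrite the file on exit.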

Now let’s start darktable in a terminal with darktable -d opencl -d perf and have a look at the output. I export an image twice: the first time with the default settings and the second time with the adjusted parameter.

Performance with default settings

[default_process_tiling_opencl_ptp] couldn't run process_cl() for module 'shadhi' in tiling mode: 0
[opencl_pixelpipe] could not run module 'shadhi' on gpu. falling back to cpu path
[dev_pixelpipe] took 1.087 secs (4.994 CPU) processed `shadows and highlights' on CPU, blended on CPU [export]
[dev_pixelpipe] took 0.116 secs (0.349 CPU) processed `gamma' on CPU, blended on CPU [export]
[opencl_profiling] profiling device 0 ('GeForce GTX 960'):
[opencl_profiling] spent  0.4383 seconds in [Write Image (from host to device)]
[opencl_profiling] spent  0.0035 seconds in rawprepare_1f
[opencl_profiling] spent  0.0149 seconds in sharpen_hblur
[opencl_profiling] spent  0.0140 seconds in highpass_invert
[opencl_profiling] spent  0.0327 seconds in colorout
[opencl_profiling] spent  1.2059 seconds totally in command queue (with 5 events missing)
[dev_process_export] pixel pipeline processing took 17.520 secs (62.199 CPU)

My test image, with a few modules activated, took about 17 seconds to complete. As you can see in the first lines, darktable fell back to the CPU. You can also see this by watching the CPU sit at 100%.
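When the debug output gets long, the fallback messages are easy to miss. Here is a small sketch that scans captured output for the "could not run module ... on gpu" line shown above and lists the affected modules. The function name is hypothetical, and the pattern is based only on the log lines in this post, so other darktable versions may word the message differently.

```python
import re

# Pattern taken from the fallback line in the log above.
FALLBACK = re.compile(r"could not run module '([^']+)' on gpu")


def cpu_fallback_modules(log_text):
    """Return the modules that fell back to the CPU, in order of appearance."""
    modules = []
    for name in FALLBACK.findall(log_text):
        if name not in modules:
            modules.append(name)
    return modules


log = (
    "[default_process_tiling_opencl_ptp] couldn't run process_cl() "
    "for module 'shadhi' in tiling mode: 0\n"
    "[opencl_pixelpipe] could not run module 'shadhi' on gpu. "
    "falling back to cpu path\n"
)
print(cpu_fallback_modules(log))
```

An empty list means no module reported a fallback in the captured output.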

Check your NVIDIA usage

If all required packages are installed, you can use nvidia-smi to get the current memory usage. The output will look similar to the one below. Add up the current usage to get an idea of where the configuration value should be.

Mon Apr  2 16:37:31 2018
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 960     On   | 00000000:01:00.0  On |                  N/A |
|  0%   60C    P2    31W / 130W |   1059MiB /  2001MiB |      6%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0       768      G   /usr/lib/xorg-server/Xorg                    324MiB |
|    0      1066      G   /usr/bin/kwin_x11                            102MiB |
|    0      1074      G   /usr/bin/krunner                               9MiB |
|    0      1076      G   /usr/bin/plasmashell                         511MiB |
|    0      1139      G   /usr/bin/seafile-applet                        2MiB |
|    0      5084      C   darktable                                     42MiB |

Leaving out darktable itself, this is around 950 MB of usage. I’d like to keep some spare memory for something like Firefox, VLC or other apps, which suggests a headroom of about 1200. The required memory variable is at 768 MB; added to 1200 this gives 1968 MB, which still fits into the graphics card’s 2001 MiB of memory.
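The arithmetic above can be sketched in a few lines. This example sums the per-process memory from the nvidia-smi process table, leaving out darktable, and checks whether required memory plus headroom fits on the card. The function names are hypothetical, the row pattern only covers the layout shown in this post (process names without spaces), and the "fits" rule mirrors my own reasoning here rather than darktable's internal check.

```python
import re

# Matches rows like: |  0  768  G  /usr/lib/xorg-server/Xorg  324MiB |
PROCESS_ROW = re.compile(r"\|\s+\d+\s+\d+\s+\S+\s+(\S+)\s+(\d+)MiB\s+\|")


def other_usage_mib(smi_text, exclude="darktable"):
    """Sum the GPU memory of all listed processes except `exclude`."""
    total = 0
    for match in PROCESS_ROW.finditer(smi_text):
        name, mib = match.group(1), int(match.group(2))
        if not name.endswith(exclude):
            total += mib
    return total


def fits(total_mib, requirement_mib, headroom_mib):
    """True if required memory plus headroom fits into the card's memory."""
    return requirement_mib + headroom_mib <= total_mib


smi = """\
|    0       768      G   /usr/lib/xorg-server/Xorg                    324MiB |
|    0      1066      G   /usr/bin/kwin_x11                            102MiB |
|    0      1074      G   /usr/bin/krunner                               9MiB |
|    0      1076      G   /usr/bin/plasmashell                         511MiB |
|    0      1139      G   /usr/bin/seafile-applet                        2MiB |
|    0      5084      C   darktable                                     42MiB |
"""
print(other_usage_mib(smi))    # 948
print(fits(2001, 768, 1200))   # True
```

With the numbers from this post, the other processes use 948 MiB, and 768 + 1200 = 1968 stays below the card's 2001 MiB.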

Performance with adjusted setting

I adjusted the settings to the value below.


And this is the result of the second run.

[opencl_profiling] profiling device 0 ('GeForce GTX 960'):
[opencl_profiling] spent  0.8706 seconds in [Write Image (from host to device)]
[opencl_profiling] spent  0.0054 seconds in rawprepare_1f
[opencl_profiling] spent  0.0043 seconds in whitebalance_1f
[opencl_profiling] spent  0.9206 seconds in [Read Image (from device to host)]
[opencl_profiling] spent  0.0127 seconds in ppg_demosaic_green
[opencl_profiling] spent  0.0156 seconds in ppg_demosaic_redblue
[opencl_profiling] spent  0.0011 seconds in border_interpolate
[opencl_profiling] spent  0.0153 seconds in exposure
[opencl_profiling] spent  0.0140 seconds in colorin_unbound
[opencl_profiling] spent  0.0435 seconds in [Copy Image (on device)]
[opencl_profiling] spent  0.0123 seconds in [Copy Image to Buffer (on device)]
[opencl_profiling] spent  0.0703 seconds in gaussian_column_4c
[opencl_profiling] spent  0.0287 seconds in gaussian_transpose_4c
[opencl_profiling] spent  0.0152 seconds in [Copy Buffer to Image (on device)]
[opencl_profiling] spent  0.0241 seconds in shadows_highlights_mix
[opencl_profiling] spent  1.1125 seconds in eaw_decompose
[opencl_profiling] spent  0.2567 seconds in eaw_synthesize
[opencl_profiling] spent  0.0220 seconds in tonecurve
[opencl_profiling] spent  0.0163 seconds in sharpen_hblur
[opencl_profiling] spent  0.0121 seconds in sharpen_vblur
[opencl_profiling] spent  0.0241 seconds in sharpen_mix
[opencl_profiling] spent  0.0153 seconds in highpass_invert
[opencl_profiling] spent  0.0606 seconds in highpass_hblur
[opencl_profiling] spent  0.0623 seconds in highpass_vblur
[opencl_profiling] spent  0.0250 seconds in highpass_mix
[opencl_profiling] spent  0.0274 seconds in colorout
[opencl_profiling] spent  3.6878 seconds totally in command queue (with 0 events missing)
[dev_process_export] pixel pipeline processing took 5.374 secs (7.241 CPU)

All modules now ran on the GPU and the image was rendered in about 5 seconds instead of about 17. That is quite a nice improvement.
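To put a number on the improvement, the two [dev_process_export] summary lines can be compared directly. This is a minimal sketch with a hypothetical helper name; the pattern is derived only from the two lines in this post.

```python
import re

EXPORT_TIME = re.compile(
    r"\[dev_process_export\] pixel pipeline processing took "
    r"([\d.]+) secs \(([\d.]+) CPU\)"
)


def export_times(line):
    """Return (wall_seconds, cpu_seconds) from a dev_process_export line."""
    match = EXPORT_TIME.search(line)
    if match is None:
        raise ValueError("not a dev_process_export summary line")
    return float(match.group(1)), float(match.group(2))


before = "[dev_process_export] pixel pipeline processing took 17.520 secs (62.199 CPU)"
after = "[dev_process_export] pixel pipeline processing took 5.374 secs (7.241 CPU)"

wall_before, cpu_before = export_times(before)
wall_after, cpu_after = export_times(after)

print(f"wall-clock speedup: {wall_before / wall_after:.2f}x")
print(f"CPU time reduction: {cpu_before / cpu_after:.2f}x")
```

For this export the wall-clock time drops by a factor of roughly 3.3, and the CPU time drops even further because the CPU no longer does the heavy lifting.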