After adding a GPU and configuring my setup, I wanted to benchmark my graphics card. I used Llama.cpp and compiled it to leverage an NVIDIA GPU. Here, I summarize the steps I followed.
Hardware Used
- OS: Ubuntu 24.04 LTS
- GPU: NVIDIA RTX 3060
- CPU: AMD Ryzen 7 5700G
- RAM: 52 GB
- Storage: Samsung SSD 990 EVO 1TB
Installing the NVIDIA CUDA Toolkit
To compile llama.cpp, you need to install the NVIDIA CUDA Toolkit. The process is straightforward: just follow NVIDIA's well-documented installation guide.
Once installed, don’t forget to configure the environment as described in the post-installation actions.
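For reference, the post-installation step mainly consists of putting the toolkit on your paths. A minimal sketch, assuming the default install prefix for CUDA 12.6 (adjust the version to match your install), to add to your ~/.bashrc:
# Make nvcc available and let the loader find the CUDA libraries
export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}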
You can verify the installation of the toolkit (and the GPU compiler) by running the provided samples. Specifically, execute:
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
Then run the resulting binary:
./deviceQuery
This should produce:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 3060"
CUDA Driver Version / Runtime Version 12.4 / 12.6
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 12004 MBytes (12587106304 bytes)
(028) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1777 MHz (1.78 GHz)
Memory Clock rate: 7501 Mhz
Memory Bus Width: 192-bit
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.4, CUDA Runtime Version = 12.6, NumDevs = 1
Result = PASS
(I trimmed the output for brevity.)
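As an even quicker sanity check, you can confirm the compiler and driver versions directly:
nvcc --version   # toolkit (compiler) version
nvidia-smi       # driver version and visible GPUs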
Now, you can proceed to install llama.cpp.
Installing Llama.cpp
The project is available here: llama.cpp. For the impatient, here are the steps:
TL;DR
- Install prerequisites
- Retrieve the code
- Compile with NVIDIA GPU options
Prerequisites
sudo apt install cmake
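Depending on your base system, you may also need a compiler toolchain; on a fresh Ubuntu install that is typically:
sudo apt install build-essential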
Retrieving the Code
Simple and straightforward:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
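Since benchmark numbers vary between llama.cpp revisions, it is worth recording the commit you are about to build:
git log -1 --oneline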
Compilation
Refer to the build documentation for additional options.
On my machine, the steps were:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
The compilation took 17 minutes and 41 seconds—long enough for one (or two) coffee breaks. 😊
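To shorten the wait on future builds, you can ask CMake to use all available cores:
cmake --build build --config Release -j $(nproc)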
Benchmarking
Now it’s time to run benchmarks.
Retrieving Models
A key point is that llama.cpp requires models in the GGUF format. You can find such models easily on Hugging Face; I found most of mine in Bartowski's repository.
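For example, here is how you could fetch one of the models used below with the Hugging Face CLI. A sketch, assuming the bartowski/Meta-Llama-3.1-8B-Instruct-GGUF repository and the models/ layout used in the next step; run it from the llama.cpp root:
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
  --local-dir models/Meta-Llama-3.1-8B-Instruct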
Running Tests
In the llama.cpp/build/bin directory, I ran the following (-p 0 skips the prompt-processing test, and -n 512 measures the generation of 512 tokens for each model):
./llama-bench -p 0 -n 512 -m ../../models/gemma2-2b-GGUF/dolphin-2.9.4-gemma2-2b-Q5_K_M.gguf -m ../../models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf -m ../../models/Qwen2.5-Coder-14B-Instruct-GGUF/Qwen2.5-14B-Instruct-Q5_K_M.gguf
This produced:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| gemma2 2B Q5_K - Medium | 1.79 GiB | 2.61 B | CUDA | 99 | tg512 | 119.00 ± 0.72 |
| llama 8B Q5_K - Medium | 5.33 GiB | 8.03 B | CUDA | 99 | tg512 | 53.01 ± 0.06 |
| qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 99 | tg512 | 28.88 ± 0.03 |
Note: If a model exceeds the GPU's memory capacity, you will encounter an error: failed to load model 'xxx.gguf'.
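Beyond raw benchmarks, you can chat with any of these models using the llama-cli binary built alongside llama-bench. A minimal sketch, assuming the same model path and offloading all layers to the GPU:
./llama-cli -m ../../models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
  -ngl 99 -p "Explain the GGUF format in one sentence."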
Conclusion
With an NVIDIA RTX 3060 GPU, it is therefore possible to run 14B (14 billion parameter) LLMs with Q5_K_M quantization at a completely satisfactory speed (the same order of magnitude as ChatGPT-4o's response speed).