Zachary W. Huang
August 9, 2025
If you’re anything like me, you might have a slightly contemptuous view of C++ as a programming language (“bah, they never should have added objects to C!”), but still recognize its importance in industry. But as I’ve gained more experience with using C++ in a professional setting, I’ve come to realize that there are some of the language features I scoffed at previously are actually pretty interesting.
C++11 added anonymous functions to the language, allowing you to write code like this:
// automatic type deduction because I'm a sane individual
auto handler = [](int a){
printf("Hello, %d!\n", a);
};
handler(10); // prints out "Hello, 10!"
But unlike raw function pointers in C, lambdas can capture variables, both by value and by reference.
int foo = 10;
// the "=" means to capture by value by default
auto handler_by_value = [=](int bar){
printf("Hello, %d!\n", foo + bar);
};
// the "&" means to capture by reference by default
auto handler_by_ref = [&](int bar){
foo--;
printf("Howdy, %d!\n", foo + bar);
};
handler_by_value(10); // prints out "Hello, 20!"
handler_by_ref(10); // prints out "Howdy, 19!"
handler_by_value(10); // prints out "Hello, 20!"
handler_by_ref(10); // prints out "Howdy, 18!"
You can also specify the specific capture type for different variables by putting it in the square brackets, with something like [foo, &bar]
capturing foo
by value and bar
by reference.
At first, this might just seem like a slightly more convenient syntax over C function pointers, but it turns out it goes deeper than that.
Both the C standard library and C++ standard template library include utility functions for sorting.
In C, you would use qsort
, and in C++, you would use std::sort
.
Both functions take (effectively) a pointer to the data along with a comparator function, and will sort the data in-place.
In addition, both functions (effectively) use quicksort (with some tricks) to get runtime on average.
The point is, you would expect similar performance from these two functions, and I probably might even hypothesized that C++ would be a little bit slower, since my impression was that the STL was complicated and maybe even bloated.
But let’s write a benchmark to see what happens.
I wrote a quick program to sort a random integer array with 100,000 elements in a loop in both C and C++. First, in C:
#include <stdlib.h>
#define T 100
#define N 100000
int arr[N];
// Custom comparator to sort in reverse order.
int foo_cmp(const void *a, const void *b) {
return *((int *) b) - *((int*) a);
}
int main(){
for (size_t t = 0; t < T; t++) {
for (size_t i = 0; i < N; i++) {
arr[i] = rand();
}
qsort(arr, N, sizeof(int), &foo_cmp);
}
return 0;
}
And now in C++:
#include <algorithm>
#define T 100
#define N 100000
int arr[N];
int main(){
// Custom comparator to sort in reverse order.
// Notice that comparison function is implemented as a lambda!
auto foo_cmp = [](auto &a, auto &b){
return b < a;
};
for (size_t t = 0; t < T; t++) {
for (size_t i = 0; i < N; i++) {
arr[i] = rand();
}
std::sort(&arr[0], &arr[N], foo_cmp);
}
return 0;
}
On my laptop, benchmarking with hyperfine gives the following:
$ gcc sort.c && hyperfine ./a.out
Benchmark 1: ./a.out
Time (mean ± σ): 912.4 ms ± 3.0 ms [User: 907.2 ms, System: 4.3 ms]
Range (min … max): 906.8 ms … 915.3 ms 10 runs
$ g++ -std=c++20 sort.cpp && hyperfine ./a.out
Benchmark 1: ./a.out
Time (mean ± σ): 1.175 s ± 0.004 s [User: 1.169 s, System: 0.005 s]
Range (min … max): 1.171 s … 1.180 s 10 runs
We see that the C++ implementation is a little bit slower on average.
But now, let’s enable optimizations with the -O3
flag.
$ gcc -O3 sort.c && hyperfine ./a.out
Benchmark 1: ./a.out
Time (mean ± σ): 760.1 ms ± 4.0 ms [User: 752.8 ms, System: 5.7 ms]
Range (min … max): 755.7 ms … 768.5 ms 10 runs
$ g++ -std=c++20 -O3 lambda.cpp && hyperfine ./a.out
Benchmark 1: ./a.out
Time (mean ± σ): 504.5 ms ± 1.7 ms [User: 498.9 ms, System: 5.0 ms]
Range (min … max): 503.1 ms … 508.9 ms 10 runs
As we can see, the C++ implementation is actually faster!
Now, compiler optimization is a hugely complicated process, and I would be lying if I told you I completely understood how it works, but to my knowledge, this particular result is a direct result of C++ lambdas.
If we put both of these programs into godbolt with the -O3
flag, we see that the compiled assembly looks something like the following (with my annotations):
For C:
foo_cmp:
; our custom comparator function
mov eax, DWORD PTR [rdi]
sub eax, DWORD PTR [rsi]
ret
main:
; Function prologue
push rbp
mov ebp, 100
push rbx
sub rsp, 8
.L4:
xor ebx, ebx
.L5:
; Inner for loop, filling in array with random values
call rand
mov DWORD PTR arr[0+rbx*4], eax
add rbx, 1
cmp rbx, 100000
jne .L5
; Outer for loop (exit condition)
mov ecx, OFFSET FLAT:foo_cmp
mov edx, 4
mov esi, 100000
mov edi, OFFSET FLAT:arr
; Call sorting function
call qsort
sub rbp, 1
jne .L4
; Function epilogue
add rsp, 8
xor eax, eax
pop rbx
pop rbp
ret
arr:
.zero 400000
This is clearly a direct translation of the code provided.
The two for loops are implemented, along with a call to the qsort
function that’s defined elsewhere.
Presumably, this function will call our foo_cmp
function in order to compare the two integers.
Meanwhile, for C++
void std::__insertion_sort<int*, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>>(auto, auto, auto) (.constprop.0):
; a whole bunch of complicated sorting logic, omitted...
void std::__adjust_heap<int*, long, int, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>>(auto, auto, auto, int, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>) (.isra.0):
; a whole bunch of complicated sorting logic, omitted...
void std::__introsort_loop<int*, long, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>>(auto, auto, auto, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>) (.isra.0):
; a whole bunch of complicated sorting logic, omitted ...
main:
; function prologue
push r13
push r12
push rbp
mov ebp, 10
push rbx
sub rsp, 8
.L77:
xor ebx, ebx
.L78:
; Inner for loop, filling in array with random values
call rand
mov DWORD PTR arr[0+rbx*4], eax
add rbx, 1
cmp rbx, 1000000
jne .L78
mov eax, 4000000
mov edi, OFFSET FLAT:arr+4000000
; a bunch of complicated sorting logic, including calls to the functions above.
; omitted...
.L100:
; Outer for loop
mov rsi, rdi
add rdi, 4
mov DWORD PTR [rsi], ecx
cmp rdi, OFFSET FLAT:arr+4000000
jne .L97
jmp .L116
arr:
.zero 4000000
Note that the GCC implementation of std::sort
uses the introsort algorithm, which combines quicksort (plus a fallback to heapsort) and insertion sort, which you can see from the de-mangled names in the assembly.
I had to omit a ton of the code, but it does has a very similar structure to the C assembly.
However, notice that our foo_cmp
lambda is nowhere to be found.
Essentially, our comparison function was inlined into the std::sort
implementation itself, which is why the output assembly contains all the sorting logic, with the comparison itself spread across many lines of assembly.
But why does this give better performance than in C?
Well, since the C sorting algorithm is only provided with a pointer to the comparator function, it was not able to do any sort of inlining.
Function pointers are opaque, and so every time the qsort
function needs to do a comparison (which is very often), it has to do it through a function call to the comparator.
But in C++, since the comparison function is provided as a lambda, the compiler has free reign to aggressively inline the comparison logic, moving it into the implementation itself and eliding all calls to the comparator lambda.
As a result, the C++ compiler essentially outputs a custom version of std::sort
that implements the comparison exactly as specified by the lambda rather than a generic function that calls a comparator, as in C.
Note that this specific example is not exactly bulletproof, as C compilers can also perform this type of inlining if the definition of qsort
is available at compile-time, as mentioned here.
But I think that the idea is still very powerful.
Effectively, because lambdas are part of the C++ language itself and not a thin wrapper over function pointers, they are easier for compilers to optimize away, plus they also give you more expressive power.
I had previously known that one of the taglines of C++ is “zero-cost abstractions”, but I never really understood it until I learned about this.
Yes, the syntax may be a little funny, but lambdas in C++ may be more powerful than you think.
Basically, I feel less guilty about using lambdas in C++ code now. Yay!