Zachary W. Huang

C++ Lambdas are Actually Pretty Cool

August 9, 2025

If you’re anything like me, you might have a slightly contemptuous view of C++ as a programming language (“bah, they never should have added objects to C!”), but still recognize its importance in industry. But as I’ve gained more experience with using C++ in a professional setting, I’ve come to realize that there are some of the language features I scoffed at previously are actually pretty interesting.

Lambda

C++11 added anonymous functions to the language, allowing you to write code like this:

// automatic type deduction because I'm a sane individual
auto handler = [](int a){
    printf("Hello, %d!\n", a);
};

handler(10); // prints out "Hello, 10!"

But unlike raw function pointers in C, lambdas can capture variables, both by value and by reference.

int foo = 10;

// the "=" means to capture by value by default
auto handler_by_value = [=](int bar){
    printf("Hello, %d!\n", foo + bar);
};

// the "&" means to capture by reference by default
auto handler_by_ref = [&](int bar){
    foo--;
    printf("Howdy, %d!\n", foo + bar);
};

handler_by_value(10);  // prints out "Hello, 20!"
handler_by_ref(10);    // prints out "Howdy, 19!"
handler_by_value(10);  // prints out "Hello, 20!"
handler_by_ref(10);    // prints out "Howdy, 18!"

You can also specify the specific capture type for different variables by putting it in the square brackets, with something like [foo, &bar] capturing foo by value and bar by reference.

At first, this might just seem like a slightly more convenient syntax over C function pointers, but it turns out it goes deeper than that.

A Sorting Mystery

Both the C standard library and C++ standard template library include utility functions for sorting. In C, you would use qsort, and in C++, you would use std::sort. Both functions take (effectively) a pointer to the data along with a comparator function, and will sort the data in-place. In addition, both functions (effectively) use quicksort (with some tricks) to get $O(n \log n)$ runtime on average. The point is, you would expect similar performance from these two functions, and I probably might even hypothesized that C++ would be a little bit slower, since my impression was that the STL was complicated and maybe even bloated. But let’s write a benchmark to see what happens.

I wrote a quick program to sort a random integer array with 100,000 elements in a loop in both C and C++. First, in C:

#include <stdlib.h>

#define T 100
#define N 100000

int arr[N];

// Custom comparator to sort in reverse order.
int foo_cmp(const void *a, const void *b) {
    return *((int *) b) - *((int*) a);
}

int main(){
    for (size_t t = 0; t < T; t++) {
        for (size_t i = 0; i < N; i++) {
            arr[i] = rand();
        }

        qsort(arr, N, sizeof(int), &foo_cmp);
    }

    return 0;
}

And now in C++:

#include <algorithm>

#define T 100
#define N 100000

int arr[N];

int main(){
    // Custom comparator to sort in reverse order.
    // Notice that comparison function is implemented as a lambda!
    auto foo_cmp = [](auto &a, auto &b){
        return b < a;
    };

    for (size_t t = 0; t < T; t++) {
        for (size_t i = 0; i < N; i++) {
            arr[i] = rand();
        }

        std::sort(&arr[0], &arr[N], foo_cmp);
    }

    return 0;
}

On my laptop, benchmarking with hyperfine gives the following:

$ gcc sort.c && hyperfine ./a.out
Benchmark 1: ./a.out
  Time (mean ± σ):     912.4 ms ±   3.0 ms    [User: 907.2 ms, System: 4.3 ms]
  Range (min … max):   906.8 ms … 915.3 ms    10 runs

$ g++ -std=c++20 sort.cpp && hyperfine ./a.out
Benchmark 1: ./a.out
  Time (mean ± σ):      1.175 s ±  0.004 s    [User: 1.169 s, System: 0.005 s]
  Range (min … max):    1.171 s …  1.180 s    10 runs

We see that the C++ implementation is a little bit slower on average. But now, let’s enable optimizations with the -O3 flag.

$ gcc -O3 sort.c && hyperfine ./a.out
Benchmark 1: ./a.out
  Time (mean ± σ):     760.1 ms ±   4.0 ms    [User: 752.8 ms, System: 5.7 ms]
  Range (min … max):   755.7 ms … 768.5 ms    10 runs

$ g++ -std=c++20 -O3 lambda.cpp && hyperfine ./a.out
Benchmark 1: ./a.out
  Time (mean ± σ):     504.5 ms ±   1.7 ms    [User: 498.9 ms, System: 5.0 ms]
  Range (min … max):   503.1 ms … 508.9 ms    10 runs

As we can see, the C++ implementation is actually faster! Now, compiler optimization is a hugely complicated process, and I would be lying if I told you I completely understood how it works, but to my knowledge, this particular result is a direct result of C++ lambdas. If we put both of these programs into godbolt with the -O3 flag, we see that the compiled assembly looks something like the following (with my annotations):

For C:

foo_cmp:
        ; our custom comparator function
        mov     eax, DWORD PTR [rdi]
        sub     eax, DWORD PTR [rsi]
        ret
main:
        ; Function prologue
        push    rbp
        mov     ebp, 100
        push    rbx
        sub     rsp, 8
.L4:
        xor     ebx, ebx
.L5:
        ; Inner for loop, filling in array with random values
        call    rand
        mov     DWORD PTR arr[0+rbx*4], eax
        add     rbx, 1
        cmp     rbx, 100000
        jne     .L5
        ; Outer for loop (exit condition)
        mov     ecx, OFFSET FLAT:foo_cmp
        mov     edx, 4
        mov     esi, 100000
        mov     edi, OFFSET FLAT:arr
        ; Call sorting function
        call    qsort
        sub     rbp, 1
        jne     .L4
        ; Function epilogue
        add     rsp, 8
        xor     eax, eax
        pop     rbx
        pop     rbp
        ret
arr:
        .zero   400000

This is clearly a direct translation of the code provided. The two for loops are implemented, along with a call to the qsort function that’s defined elsewhere. Presumably, this function will call our foo_cmp function in order to compare the two integers.

Meanwhile, for C++

void std::__insertion_sort<int*, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>>(auto, auto, auto) (.constprop.0):
        ; a whole bunch of complicated sorting logic, omitted...
void std::__adjust_heap<int*, long, int, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>>(auto, auto, auto, int, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>) (.isra.0):
        ; a whole bunch of complicated sorting logic, omitted...
void std::__introsort_loop<int*, long, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>>(auto, auto, auto, __gnu_cxx::__ops::_Iter_comp_iter<main::'lambda'(auto&, auto&)>) (.isra.0):
        ; a whole bunch of complicated sorting logic, omitted ...
main:
        ; function prologue
        push    r13
        push    r12
        push    rbp
        mov     ebp, 10
        push    rbx
        sub     rsp, 8
.L77:
        xor     ebx, ebx
.L78:
        ; Inner for loop, filling in array with random values
        call    rand
        mov     DWORD PTR arr[0+rbx*4], eax
        add     rbx, 1
        cmp     rbx, 1000000
        jne     .L78
        mov     eax, 4000000
        mov     edi, OFFSET FLAT:arr+4000000
        ; a bunch of complicated sorting logic, including calls to the functions above.
        ; omitted...
.L100:
        ; Outer for loop
        mov     rsi, rdi
        add     rdi, 4
        mov     DWORD PTR [rsi], ecx
        cmp     rdi, OFFSET FLAT:arr+4000000
        jne     .L97
        jmp     .L116
arr:
        .zero   4000000

Note that the GCC implementation of std::sort uses the introsort algorithm, which combines quicksort (plus a fallback to heapsort) and insertion sort, which you can see from the de-mangled names in the assembly. I had to omit a ton of the code, but it does has a very similar structure to the C assembly. However, notice that our foo_cmp lambda is nowhere to be found. Essentially, our comparison function was inlined into the std::sort implementation itself, which is why the output assembly contains all the sorting logic, with the comparison itself spread across many lines of assembly. But why does this give better performance than in C?

Well, since the C sorting algorithm is only provided with a pointer to the comparator function, it was not able to do any sort of inlining. Function pointers are opaque, and so every time the qsort function needs to do a comparison (which is very often), it has to do it through a function call to the comparator. But in C++, since the comparison function is provided as a lambda, the compiler has free reign to aggressively inline the comparison logic, moving it into the implementation itself and eliding all calls to the comparator lambda. As a result, the C++ compiler essentially outputs a custom version of std::sort that implements the comparison exactly as specified by the lambda rather than a generic function that calls a comparator, as in C.

Note that this specific example is not exactly bulletproof, as C compilers can also perform this type of inlining if the definition of qsort is available at compile-time, as mentioned here. But I think that the idea is still very powerful. Effectively, because lambdas are part of the C++ language itself and not a thin wrapper over function pointers, they are easier for compilers to optimize away, plus they also give you more expressive power. I had previously known that one of the taglines of C++ is “zero-cost abstractions”, but I never really understood it until I learned about this. Yes, the syntax may be a little funny, but lambdas in C++ may be more powerful than you think.

Conclusion

Basically, I feel less guilty about using lambdas in C++ code now. Yay!

back