Created 2025/01/19 at 11:36AM
Last Modified 2025/01/20 at 06:54PM
Today, I'm going to talk about built-in functions provided by GCC that let you poke the hardware in just the right way and extract maximum performance. These aren't part of the standard library; they're a compiler-specific backstage pass that reveals hidden optimizations, CPU instructions, and neat tricks lurking under the hood.
I won't be covering all of them, just some that I find interesting.
The parity of a number tells whether it has an odd or even number of bits set to 1. __builtin_parity returns 1 (true) for odd parity and 0 (false) for even parity.
// parity.cpp
#include <iostream>
#include <chrono>
#include <cstdint>

bool has_odd_parity(unsigned int x) {
    // Fold the bits down so the parity of the whole word
    // ends up in the low nibble.
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    unsigned int lowNibble = x & 0xF;
    // 0x6996 is a 16-bit constant (binary: 0110 1001 1001 0110)
    // that encodes the parity of each possible 4-bit value.
    return static_cast<bool>((0x6996 >> lowNibble) & 1);
}

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned int v = 0; v < INT32_MAX; v++) {
        bool parity = has_odd_parity(v);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration1 = end - start;
    std::cout << "Time taken: " << duration1.count() << " seconds.\n";
}
// parity_builtin.cpp
#include <iostream>
#include <chrono>
#include <cstdint>

bool has_odd_parity(unsigned int x) {
    return __builtin_parity(x);
}

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned int v = 0; v < INT32_MAX; v++) {
        bool parity = has_odd_parity(v);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration1 = end - start;
    std::cout << "Time taken: " << duration1.count() << " seconds.\n";
}
Let's compile and run both, and analyze.
$ g++-13 parity.cpp -o parity -march=native
$ ./parity
Time taken: 4.32415 seconds.
$ g++-13 parity_builtin.cpp -o parity_builtin -march=native
$ ./parity_builtin
Time taken: 3.15765 seconds.
If we analyze the parity_builtin binary (objdump -D parity_builtin), we can see it using the popcnt CPU instruction.
0000000000001169 <_Z14has_odd_parityj>:
1169: f3 0f 1e fa endbr64
116d: 55 push %rbp
116e: 48 89 e5 mov %rsp,%rbp
1171: 89 7d fc mov %edi,-0x4(%rbp)
1174: f3 0f b8 45 fc popcnt -0x4(%rbp),%eax
1179: 83 e0 01 and $0x1,%eax
117c: 85 c0 test %eax,%eax
117e: 0f 95 c0 setne %al
1181: 5d pop %rbp
1182: c3 retq
__builtin_constant_p is used to determine whether an expression is known to be constant at compile time, so that GCC can perform constant folding on expressions involving that value. The argument of the function is the expression to test. The expression is not evaluated, and side effects are discarded. By evaluating expressions at compile time when possible, you reduce runtime overhead.
// constant_p.cpp
#include <iostream>

constexpr double SCALE = 1.5;
constexpr double OFFSET = 0.75;

double Scale(double x) {
    return x * SCALE + OFFSET;
}

// If X is a compile-time constant, fold the whole computation at
// compile time; otherwise fall back to the runtime Scale() call.
#define Scale_Value(X) (__builtin_constant_p(X) ? ((X) * SCALE + OFFSET) : Scale(X))

int main() {
    constexpr double const_input = 2.0;
    double scaled_const = Scale_Value(const_input);
    std::cout << "Scaled (constant input): " << scaled_const << std::endl;
    double scaled_literal = Scale_Value(3.0);
    std::cout << "Scaled (literal input): " << scaled_literal << std::endl;
    double runtime_input = 4.0;
    double scaled_runtime = Scale_Value(runtime_input);
    std::cout << "Scaled (runtime input): " << scaled_runtime << std::endl;
    return 0;
}

$ g++-13 constant_p.cpp -o constant_p -march=native
$ ./constant_p
Scaled (constant input): 3.75
Scaled (literal input): 5.25
Scaled (runtime input): 6.75
__builtin_prefetch is used to minimize cache-miss latency by moving data into a cache before it is accessed. If the prefetch is done early enough before the access, then the data will be in the cache by the time it is accessed.
There are two optional arguments, rw and locality.
The value of rw must be a compile-time constant; zero (the default) means the prefetch is preparing for a read, and one means it is preparing for a write to the memory address.
The value locality must be a compile-time constant integer between zero and three. A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three.
NOTE - While prefetching can theoretically improve performance by reducing cache-miss penalties, several factors can influence its effectiveness, sometimes resulting in the opposite outcome. You should always experiment with each use case and analyze the assembly code to see whether it really helps.
Let's take a look at two examples.
// prefetch_sum.cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>

const size_t ARRAY_SIZE = 100'000'000;

double compute_without_prefetch(const std::vector<double>& data) {
    double sum = 0.0;
    for (size_t i = 0; i < data.size(); ++i) {
        sum += data[i] * 0.5;
    }
    return sum;
}

double compute_with_prefetch(const std::vector<double>& data) {
    double sum = 0.0;
    size_t prefetch_distance = 4096;
    for (size_t i = 0; i < data.size(); ++i) {
        // Only prefetch addresses that are still inside the vector.
        if (i % prefetch_distance == 0 && i + prefetch_distance < data.size()) {
            __builtin_prefetch(&data[i + prefetch_distance], 0, 1);
        }
        sum += data[i] * 0.5;
    }
    return sum;
}

int main() {
    std::vector<double> data(ARRAY_SIZE);
    for (size_t i = 0; i < ARRAY_SIZE; i++) {
        data[i] = drand48() * (rand() % 100);
    }
    double warm_up = compute_without_prefetch(data);
    std::cout << "Warm-up sum: " << warm_up << "\n";
    auto start = std::chrono::high_resolution_clock::now();
    double sum1 = compute_without_prefetch(data);
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration1 = end - start;
    std::cout << "Sum without prefetch: " << sum1
              << " in " << duration1.count() << " seconds.\n";
    start = std::chrono::high_resolution_clock::now();
    double sum2 = compute_with_prefetch(data);
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration2 = end - start;
    std::cout << "Sum with prefetch: " << sum2
              << " in " << duration2.count() << " seconds.\n";
    return 0;
}
$ g++-13 prefetch_sum.cpp -o prefetch_sum -O3 -march=native && ./prefetch_sum
Warm-up sum: 1.23735e+09
Sum without prefetch: 1.23735e+09 in 0.099292 seconds.
Sum with prefetch: 1.23735e+09 in 0.0677383 seconds.
So we see roughly a 1.5x speed-up.
// prefetch_bs.cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>
#include <cstdlib>

const size_t ARRAY_SIZE = 100'000'000;

int bin_search_without_prefetch(const std::vector<double>& data, double value) {
    int low = 0, high = data.size() - 1, mid;
    while (low <= high) {
        mid = (low + high) / 2;
        if (data[mid] < value)
            low = mid + 1;
        else if (data[mid] == value)
            return mid;
        else
            high = mid - 1;
    }
    return -1;
}

int bin_search_with_prefetch(const std::vector<double>& data, double value) {
    int low = 0, high = data.size() - 1, mid;
    while (low <= high) {
        mid = (low + high) / 2;
        // Prefetch both candidate midpoints of the next iteration.
        __builtin_prefetch(&data[(mid + 1 + high) / 2], 0, 1);
        __builtin_prefetch(&data[(low + mid - 1) / 2], 0, 1);
        if (data[mid] < value)
            low = mid + 1;
        else if (data[mid] == value)
            return mid;
        else
            high = mid - 1;
    }
    return -1;
}

int main() {
    std::vector<double> data(ARRAY_SIZE);
    for (size_t i = 0; i < ARRAY_SIZE; i++) {
        data[i] = drand48() * (rand() % 100);
    }
    data[6969] = 69.69;
    std::sort(data.begin(), data.end());
    for (size_t i = 0; i < ARRAY_SIZE; i++) {
        if (data[i] == 69.69) {
            std::cout << "69.69 is at index " << i << "\n";
        }
    }
    auto start = std::chrono::high_resolution_clock::now();
    int idx1 = bin_search_without_prefetch(data, 69.69);
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration1 = end - start;
    std::cout << "Index without prefetch: " << idx1
              << " in " << duration1.count() << " seconds.\n";
    start = std::chrono::high_resolution_clock::now();
    int idx2 = bin_search_with_prefetch(data, 69.69);
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration2 = end - start;
    std::cout << "Index with prefetch: " << idx2
              << " in " << duration2.count() << " seconds.\n";
    return 0;
}
$ g++-13 prefetch_bs.cpp -o prefetch_bs -march=native -O3 && ./prefetch_bs
69.69 is at index 95012343
Index without prefetch: 95012343 in 2.66e-06 seconds.
Index with prefetch: 95012343 in 4.9e-07 seconds.
And voila! We see an almost 5x speed-up in our binary search.
If I analyze both of the above binaries using objdump -D, I can see them using the prefetcht2 instruction. From my processor's manual:
PREFETCHT2 -- Prefetches temporal data into the third-level (L3) and higher-level caches, but not into the L1 or L2 cache.
__builtin_expect can be used to give the compiler a hint about which branch of an if statement is more likely to be taken at runtime. It does not force the compiler to do anything, but it often influences how the compiler arranges code in memory (e.g., the layout of branches) and can marginally improve performance if the actual runtime behavior matches the hint.
NOTE - These are only hints. Modern CPUs have sophisticated branch predictors and may or may not benefit significantly from these hints. Also, if the actual runtime behavior does not match your hints, performance can degrade.
I'm going to use clang here because I couldn't get it to work with g++. In the example, I do some processing on a list of numbers. I make sure that most of the numbers are positive, then hint the compiler to prefer the branch for positive numbers in one function and the branch for negative numbers in the other. Since we have a lot more positive numbers than negative ones, the function preferring the positive branch should, in theory, perform better.
// expect.cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <random>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

void process_numbers_correct(const std::vector<int>& numbers) {
    volatile int dummy = 0;
    for (size_t i = 0; i < numbers.size(); ++i) {
        if (LIKELY(numbers[i] >= 0)) {
            int square = numbers[i] * numbers[i];
            dummy += square;
        } else {
            int cube = numbers[i] * numbers[i] * numbers[i];
            dummy += cube;
        }
    }
}

void process_numbers_incorrect(const std::vector<int>& numbers) {
    volatile int dummy = 0;
    for (size_t i = 0; i < numbers.size(); ++i) {
        if (UNLIKELY(numbers[i] >= 0)) {
            int square = numbers[i] * numbers[i];
            dummy += square;
        } else {
            int cube = numbers[i] * numbers[i] * numbers[i];
            dummy += cube;
        }
    }
}

int main() {
    const size_t NUM_ELEMENTS = 100'000'000;
    std::vector<int> numbers(NUM_ELEMENTS, 1);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(-100, 100);
    for (size_t i = 0; i < NUM_ELEMENTS; i += 1000) {
        // Every 1000th element might be negative
        numbers[i] = dist(rng);
    }
    auto start_correct = std::chrono::high_resolution_clock::now();
    process_numbers_correct(numbers);
    auto end_correct = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration_correct = end_correct - start_correct;
    std::cout << "Time with correct __builtin_expect usage: "
              << duration_correct.count() << " seconds.\n";
    auto start_incorrect = std::chrono::high_resolution_clock::now();
    process_numbers_incorrect(numbers);
    auto end_incorrect = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration_incorrect = end_incorrect - start_incorrect;
    std::cout << "Time with incorrect __builtin_expect usage: "
              << duration_incorrect.count() << " seconds.\n";
    return 0;
}
$ clang++-18 expect.cpp -o expect
$ ./expect
Time with correct __builtin_expect usage: 0.519882 seconds.
Time with incorrect __builtin_expect usage: 0.529057 seconds.
$ ./expect
Time with correct __builtin_expect usage: 0.520423 seconds.
Time with incorrect __builtin_expect usage: 0.529819 seconds.
$ ./expect
Time with correct __builtin_expect usage: 0.520897 seconds.
Time with incorrect __builtin_expect usage: 0.533373 seconds.
So, our observation matches the theoretical assumption. I'd also recommend checking out __builtin_expect_with_probability, which works the same way but additionally takes a probability argument, letting you state how likely the expression is to match the expected value instead of implying near-certainty.
Before I end this post: one of the major pain points of using built-in functions is that they are not portable across compilers. Compilers that do support these functions may provide them for multiple architectures, but implementation quality and performance are likely to vary. If optimal low-level performance is your goal, you should also check the compiler's assembly output to determine whether it really is generating the instruction you expect it to use.