Why C++ Can Run Faster Than Hand-Written Assembly in Collatz Code

Tags: cpp, performance, assembly, optimization, x86

Question

I wrote two versions of a brute-force solution for Project Euler problem 14, one in x86-64 assembly and one in C++. Both test Collatz sequence lengths for numbers below one million.

The assembly version was built with:

nasm -felf64 p14.asm && gcc p14.o -o p14

The C++ version was built with:

g++ p14.cpp -o p14

Assembly (`p14.asm`)

section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max count
    xor rsi, rsi        ; value producing max count

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret

C++ (`p14.cpp`)

#include <iostream>

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3 * n + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    std::cout << maxi << std::endl;
}

I understand that compilers can apply optimizations, but I do not immediately see many ways to improve the assembly version further at the instruction level.

The C++ code appears to use modulus on every iteration and division on every even value, while the assembly code performs only one explicit division on even values.

However, the assembly version runs about one second slower on average than the C++ version. Why does this happen?

This is mainly a performance and code-generation question: why can hand-written assembly be slower than compiled C++ here, and what kinds of instruction choices or CPU effects explain the difference?

Short Answer

Hand-written assembly is not automatically faster than C++. In this code the assembly loop executes a 64-bit div every time the value is even, and div is one of the slowest integer instructions on x86. GCC typically compiles n / 2 and n % 2 with a constant divisor into shift and bit-test sequences, even without an optimization flag, so the C++ build never pays that cost. The rest of this page covers how instruction choice, register usage, expensive operations like div, branching, function structure, and CPU microarchitecture affect speed, and how to reason about low-level performance in a practical way.

Concept

The core concept behind this question is that performance depends on the machine code that actually runs, not on whether the source language is “high-level” or “low-level”.

A common beginner assumption is:

C++ is abstract, so it must be slower.
Assembly is explicit, so it must be faster.

That is not always true.

Modern C++ compilers can generate highly optimized machine code by:

replacing expensive operations with cheaper ones
inlining functions
keeping values in registers efficiently
removing redundant work
choosing instructions that fit the CPU well
reordering work to reduce stalls

By contrast, hand-written assembly can be slower if it uses instructions that are technically correct but inefficient on real CPUs.

In the Collatz example, the biggest issue is that some assembly instructions are very expensive:

div is much slower than a shift like shr
mul can be heavier than using lea for simple arithmetic like 3*n + 1
repeatedly loading constants into registers inside a loop wastes instructions

For example:

dividing by 2 does not need div 2
multiplying by 3 does not need a general-purpose multiply instruction

A compiler often turns code like n % 2 == 0 and n /= 2 into a test of the lowest bit and a shift, so no division instruction runs at all.
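As a sketch of that kind of transformation, the two functions below express one Collatz step first with % and /, then with the bit operations a compiler typically substitutes. The names step_arith, step_bits, and steps_agree are hypothetical, and unsigned values are assumed so that the shift and the division agree for every input; for signed operands a compiler emits a slightly longer sequence so that negative values still round toward zero.

```cpp
#include <cstdint>

// One Collatz step written the "obvious" way, with % and /.
std::uint64_t step_arith(std::uint64_t n) {
    return (n % 2 == 0) ? n / 2 : 3 * n + 1;
}

// The same step with the parity test as a bit test and the
// division by 2 as a right shift (valid because n is unsigned).
std::uint64_t step_bits(std::uint64_t n) {
    return ((n & 1) == 0) ? (n >> 1) : 3 * n + 1;
}

// Returns true if both forms agree for every n in [1, limit].
bool steps_agree(std::uint64_t limit) {
    for (std::uint64_t n = 1; n <= limit; ++n) {
        if (step_arith(n) != step_bits(n)) return false;
    }
    return true;
}
```

Calling steps_agree(1000000) is a quick way to convince yourself that the rewrite changes only the instructions, not the results.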

Mental Model

Think of this like traveling across a city.

C++ source code is like telling a professional route planner your destination.
The compiler is the route planner that knows traffic patterns, road types, and shortcuts.
Assembly is you manually choosing every road yourself.

If you know the city extremely well, your manual route can be excellent. But if you choose roads that look direct but are actually congested, your route can be slower than the planner’s route.

In this question:

div is like taking a slow road with lots of traffic lights.
shr is like taking a fast highway.
loading rbx with 2 or 3 every loop is like stopping to check the map over and over.
using lea for 3*n + 1 is like taking a shortcut the route planner knows.

So the lesson is simple:

Assembly is only faster when the instructions you choose are better than what the compiler would choose.

Syntax and Examples

Core idea: expensive instructions vs cheap instructions

In low-level optimization, different instructions that produce the same result can have very different costs.

Slower style

mov rbx, 2
xor rdx, rdx
div rbx          ; divide RDX:RAX by RBX

This works, but div is a heavy instruction.

Faster style for dividing by 2

shr rax, 1

If the value is known to be even and you only want n / 2, a right shift is much cheaper.

Multiplying by 3

General multiply

mov rbx, 3
mul rbx
inc rax

This computes 3*n + 1, but mul is a general multiplication instruction and can be slower than necessary.

More efficient arithmetic

lea rax, [rax + rax*2 + 1]

This also computes 3*n + 1, in a single instruction, without loading a constant into rbx and without overwriting rdx the way mul does.
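The base + index*scale + displacement form that lea evaluates can be mirrored in C++ to check the arithmetic. This is only a sketch: lea_style_3n1 and mul_style_3n1 are hypothetical names, and which instruction a compiler actually emits for either function depends on the compiler, the flags, and the target CPU.

```cpp
#include <cstdint>

// Mirrors lea rax, [rax + rax*2 + 1]:
// base n, index n scaled by 2, displacement 1.
std::uint64_t lea_style_3n1(std::uint64_t n) {
    return n + n * 2 + 1;
}

// Reference form using a plain multiply by 3.
std::uint64_t mul_style_3n1(std::uint64_t n) {
    return 3 * n + 1;
}
```

Both functions return the same value for every input, because unsigned arithmetic wraps identically in each form.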

What the compiler may do from C++

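C++ source like the sequence function above contains % and /, but because the divisor is the constant 2, the compiler does not need a division instruction for either operator. With optimization enabled, it can also inline sequence into the loop in main and keep n and count in registers.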

Step by Step Execution

Consider this C++ version:

int collatz_length(long n) {
    int count = 1;
    while (n != 1) {
        if ((n & 1) == 0)
            n >>= 1;
        else
            n = 3 * n + 1;
        ++count;
    }
    return count;
}

Let’s trace collatz_length(6).

Initial state

n = 6
count = 1

Iteration 1

n != 1 → continue
6 & 1 is 0, so 6 is even
n >>= 1 → n = 3
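++count → count = 2

Iteration 2

n != 1 → continue
3 & 1 is 1, so 3 is odd
n = 3 * n + 1 → n = 10
++count → count = 3

Iteration 3

n != 1 → continue
10 & 1 is 0, so 10 is even
n >>= 1 → n = 5
++count → count = 4

Iteration 4

n != 1 → continue
5 & 1 is 1, so 5 is odd
n = 3 * n + 1 → n = 16
++count → count = 5

Iteration 5

n != 1 → continue
16 & 1 is 0, so 16 is even
n >>= 1 → n = 8
++count → count = 6

Iteration 6

n != 1 → continue
8 & 1 is 0, so 8 is even
n >>= 1 → n = 4
++count → count = 7

Iteration 7

n != 1 → continue
4 & 1 is 0, so 4 is even
n >>= 1 → n = 2
++count → count = 8

Iteration 8

n != 1 → continue
2 & 1 is 0, so 2 is even
n >>= 1 → n = 1
++count → count = 9

Loop ends

n == 1, so the loop condition fails
the function returns count = 9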
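The trace can also be produced mechanically. The sketch below uses the same step logic as collatz_length but records every value n takes; collatz_trace is a hypothetical helper, and it assumes n is at least 1.

```cpp
#include <vector>

// Records every value n takes, starting value included,
// until the sequence reaches 1. Assumes n >= 1.
std::vector<long> collatz_trace(long n) {
    std::vector<long> values{n};
    while (n != 1) {
        if ((n & 1) == 0)
            n >>= 1;          // even: halve
        else
            n = 3 * n + 1;    // odd: triple and add one
        values.push_back(n);
    }
    return values;
}
```

collatz_trace(6) yields 6, 3, 10, 5, 16, 8, 4, 2, 1. That is nine values, which matches the count of 9 that collatz_length(6) returns.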

Real World Use Cases

This concept appears in many real programs, not just benchmark puzzles.

1. Numeric and algorithmic code

In:

simulations
compression
cryptography support code
parsing loops
search algorithms

small instruction-level choices inside hot loops can dominate runtime.

2. Game engines and graphics tools

Performance-critical code often runs millions of times per frame or per asset. Developers rely on profilers and compiler output because naive “low-level looking” code is not always fastest.

3. Systems programming

In operating systems, drivers, embedded software, and runtimes, developers sometimes use assembly. But even there, they usually keep assembly minimal and only where measurement proves it helps.

4. High-performance servers

Fast request handling often depends on:

branch-friendly code
avoiding expensive arithmetic
efficient memory access
letting the compiler optimize aggressively

5. Data processing and scientific code

Loops over large datasets benefit from simple arithmetic, predictable branches, and code shapes that compilers can optimize well.

The practical lesson is:

write clear code first
measure
inspect generated assembly if needed
optimize only the true bottlenecks
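The "measure" step can be as small as a timer around the search loop. The following is a minimal sketch using std::chrono::steady_clock; chain_length, TimedResult, and time_search are hypothetical names, and a single wall-clock reading like this is noisy, so repeated runs and an optimized build (for example -O2) are assumed before drawing conclusions.

```cpp
#include <chrono>
#include <cstdint>

// Number of terms in the Collatz sequence starting at n (n must be >= 1).
int chain_length(std::uint64_t n) {
    int count = 1;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++count;
    }
    return count;
}

// Result of one timed search; the struct and its fields are illustrative.
struct TimedResult {
    std::uint64_t best_start;
    int best_length;
    double seconds;
};

// Searches starting values 1..limit and reports the wall-clock time taken.
TimedResult time_search(std::uint64_t limit) {
    const auto begin = std::chrono::steady_clock::now();
    std::uint64_t best_start = 1;
    int best_length = 1;
    for (std::uint64_t i = 1; i <= limit; ++i) {
        const int len = chain_length(i);
        if (len > best_length) {
            best_length = len;
            best_start = i;
        }
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - begin;
    return {best_start, best_length, elapsed.count()};
}
```

steady_clock is used rather than system_clock because it is monotonic, so the measured interval cannot go backwards if the system time is adjusted during the run.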

Real Codebase Usage

In real projects, developers rarely choose between “all C++” and “all assembly” in a vacuum. They usually follow patterns like these.

Use high-level code in most places

Most code is written in C++ because it is easier to:

read
test
maintain
refactor

Then developers compile with optimization flags such as -O2 or -O3.

Inspect generated assembly when performance matters

A common workflow is:

write clear C++
profile the program
find the hot path
inspect compiler output
adjust source code to help optimization

For example, developers may replace:

n % 2 == 0

with:

(n & 1) == 0

if they want intent to be explicit, though compilers often optimize this anyway.
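One caveat is worth illustrating: the two spellings are interchangeable for an evenness test, but not for an oddness test on signed values, because % keeps the sign of the dividend. A small sketch with hypothetical helper names, assuming two's-complement integers (guaranteed since C++20):

```cpp
// Evenness: both forms agree for every int, negative values included.
bool is_even_mod(int n) { return n % 2 == 0; }
bool is_even_bit(int n) { return (n & 1) == 0; }

// Oddness: n % 2 is -1 for negative odd n, so comparing against 1 misses it.
bool is_odd_mod_naive(int n) { return n % 2 == 1; }

// The lowest bit is 1 for every odd value, positive or negative.
bool is_odd_bit(int n) { return (n & 1) == 1; }
```

For the Collatz code this does not matter, since n stays positive, but it is one reason the two forms are not always a drop-in replacement for each other.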

Prefer simple operations in hot loops

Real codebases try to avoid expensive instructions inside repeated loops. Common patterns include:

shifts instead of division by powers of two
additions and lea-style arithmetic instead of general multiply where possible
reducing repeated constant setup inside loops

Common Mistakes

1. Assuming assembly is automatically faster

This is the biggest misconception.

Incorrect assumption

“I wrote assembly, so it must beat C++.”

Reality

A compiler can emit better machine code than a human’s first assembly draft.

2. Using `div` for powers of two

Slower code

mov rbx, 2
xor rdx, rdx
div rbx

Better approach

shr rax, 1

Division is far more expensive than shifting.

3. Using `mul` when simple arithmetic works

Less efficient

mov rbx, 3
mul rbx
inc rax

Better

lea rax, [rax + rax*2 + 1]

For 3*n + 1, a full multiply is often unnecessary.

4. Re-loading constants inside tight loops

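Wasteful: the loop in p14.asm executes mov rbx, 3 or mov rbx, 2 on every iteration just to supply an operand to mul or div. Loading a constant once before the loop, or avoiding the constant entirely with shr and lea, removes that work.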

Comparisons

Source language vs generated machine code

Idea	Common assumption	Actual performance reality
C++	High-level, so slower	Can compile to very efficient machine code
Assembly	Low-level, so faster	Only faster if the chosen instructions are better

Expensive vs cheap arithmetic instructions

Operation	Slower choice	Faster common alternative	Why
Divide by 2	`div`	`shr`	Shift is much cheaper
Multiply by 3	`mul`	`lea`	Address arithmetic computes 3*n + 1 without a general multiply

Cheat Sheet

Performance lessons from this question

Assembly is not automatically faster than C++.
Compare generated machine code, not source syntax.
In hot loops, instruction choice matters a lot.

Important x86-64 ideas

Check odd/even

test rax, 1
jnz odd

Divide by 2

shr rax, 1

Compute `3*n + 1`

lea rax, [rax + rax*2 + 1]

Expensive instructions to be careful with

div
idiv
sometimes mul / imul when simpler arithmetic is possible

Compiler optimization facts

-O0 is not representative of real performance
-O2 and -O3 can remove large amounts of overhead
% 2 and / 2 often become bit operations for integers

FAQ

Why can optimized C++ be faster than hand-written assembly?

Because the compiler may choose better instructions, reduce overhead, and optimize register usage more effectively than a manual assembly implementation.

Does `% 2` in C++ always generate a division instruction?

No. For integer code, compilers often optimize % 2 into a bit test.

Why is `div` so slow on x86?

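div is a complex instruction with high latency: a 64-bit div can take tens of clock cycles on many x86 CPUs, while simple operations like shr, add, or a simple lea typically complete in about one cycle.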

Is assembly still useful for optimization?

Yes, but mostly in small, carefully measured hotspots where a developer understands the target CPU well.

Should I use `& 1` instead of `% 2` in C++?

It can make intent explicit, but modern compilers often optimize % 2 well for integers anyway.

What is the biggest lesson from this Collatz example?

Instruction selection matters. A correct assembly program can be much slower if it uses expensive operations repeatedly.

Is compiler optimization more important than using a low-level language?

Often, yes. Good optimization flags and good code structure usually matter more than writing everything in assembly.

Related Concepts

Compiler optimization — directly related because the compiler can transform high-level code into efficient machine instructions.
Instruction latency and throughput — explains why some CPU instructions are far slower than others.
Bitwise operations — important because parity checks and divide-by-two operations can often be expressed with bits.
Branch prediction — relevant because repeated if/else logic in tight loops can be affected by branch behavior.
Strength reduction — the optimization technique of replacing expensive operations with cheaper equivalents, such as division with shifts.
Inlining — relevant because compilers may remove function-call overhead in performance-critical code.
Profiling — essential for finding where time is really spent before optimizing.
Algorithmic optimization — important because improving the algorithm usually gives larger gains than instruction-level tuning.
x86-64 calling conventions — related because assembly correctness and efficiency depend on proper register usage.
Loop optimization — relevant because this question centers on a very hot inner loop.

Mini Project

Description

Build a small C++ program that computes the starting number under a limit that produces the longest Collatz sequence. Then create two versions of the step logic: one using straightforward arithmetic (% and /) and another using bit operations (& and >>). This project demonstrates that readable high-level code can still be fast, and it helps you compare implementation styles in a controlled way.

Goal

Write a Collatz benchmark program in C++ and compare two equivalent implementations to see how code shape affects performance.

Requirements

Write a function that returns the Collatz sequence length for a given positive integer.
Test every starting value from 1 up to a chosen limit such as 1000000.
Track which starting value produces the longest sequence.
Implement one version using % 2 and / 2, and another using & 1 and >> 1.
Print the winning starting value and the sequence length for each version.
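One possible shape for the core of this project is sketched below. The names length_arith, length_bits, and longest_chain are hypothetical, and printing and timing are left out so the block stays short. std::uint64_t is used because intermediate values for starting numbers below one million exceed the 32-bit range.

```cpp
#include <cstdint>
#include <utility>

// Version 1: sequence length using % 2 and / 2.
int length_arith(std::uint64_t n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0) n /= 2;
        else            n = 3 * n + 1;
        ++count;
    }
    return count;
}

// Version 2: sequence length using & 1 and >> 1.
int length_bits(std::uint64_t n) {
    int count = 1;
    while (n != 1) {
        if ((n & 1) == 0) n >>= 1;
        else              n = 3 * n + 1;
        ++count;
    }
    return count;
}

// Returns {best_start, best_length} over starting values 1..limit,
// using whichever length function is passed in.
template <typename LengthFn>
std::pair<std::uint64_t, int> longest_chain(std::uint64_t limit, LengthFn length) {
    std::uint64_t best_start = 1;
    int best_length = 1;
    for (std::uint64_t i = 1; i <= limit; ++i) {
        const int len = length(i);
        if (len > best_length) {
            best_length = len;
            best_start = i;
        }
    }
    return {best_start, best_length};
}
```

Calling longest_chain(1000000, length_arith) and longest_chain(1000000, length_bits) should report the same starting value, 837799, which is the known answer to Project Euler problem 14. Any timing difference between the two versions in an optimized build is likely to be small, since the compiler tends to generate similar code for both.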

Keep learning

Build mode	Typical behavior
`-O0`	Easier to debug, usually much slower
`-O2` / `-O3`	Inlining, strength reduction, better register use, much faster

Assembly style	Result
Correct but instruction-heavy	Often slower than optimized C++
Microarchitecture-aware	Can beat compiler output in narrow hot paths

Expression	Meaning	Performance note
`n % 2 == 0`	even check using remainder	Compiler often optimizes this well
`(n & 1) == 0`	even check using lowest bit	Makes bit-level intent explicit
