Question
Why C++ Can Run Faster Than Hand-Written Assembly in Collatz Code
Question
I wrote two versions of a brute-force solution for Project Euler problem 14, one in x86-64 assembly and one in C++. Both test Collatz sequence lengths for numbers below one million.
The assembly version was built with:
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ version was built with:
g++ p14.cpp -o p14
Assembly (p14.asm)
section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text
main:
    mov rcx, 1000000
    xor rdi, rdi        ; max count
    xor rsi, rsi        ; value producing max count
l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx
l2:
    test rax, 1
    jpe even
    mov rbx, 3
    mul rbx
    inc rax
    jmp c1
even:
    mov rbx, 2
    xor rdx, rdx
    div rbx
c1:
    inc r10
    cmp rax, 1
    jne l2
    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx
    cmp rcx, 2
    jne l1
    mov rdi, fmt
    xor rax, rax
    call printf
    ret
C++ (p14.cpp)
#include <iostream>

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3 * n + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    std::cout << maxi << std::endl;
}
I understand that compilers can apply optimizations, but I do not immediately see many ways to improve the assembly version further at the instruction level.
The C++ code appears to use modulus on every iteration and division on every even value, while the assembly code performs only one explicit division on even values.
However, the assembly version runs about one second slower on average than the C++ version. Why does this happen?
This is mainly a performance and code-generation question: why can hand-written assembly be slower than compiled C++ here, and what kinds of instruction choices or CPU effects explain the difference?
Short Answer
By the end of this page, you will understand why hand-written assembly is not automatically faster than C++, especially when a compiler applies strong optimizations. You will see how instruction choice, register usage, expensive operations like div, branching, function structure, and CPU microarchitecture can make a huge performance difference. You will also learn how to reason about low-level speed in a practical way.
Concept
The core concept behind this question is that performance depends on the machine code that actually runs, not on whether the source language is “high-level” or “low-level”.
A common beginner assumption is:
- C++ is abstract, so it must be slower.
- Assembly is explicit, so it must be faster.
That is not always true.
Modern C++ compilers can generate highly optimized machine code by:
- replacing expensive operations with cheaper ones
- inlining functions
- keeping values in registers efficiently
- removing redundant work
- choosing instructions that fit the CPU well
- reordering work to reduce stalls
By contrast, hand-written assembly can be slower if it uses instructions that are technically correct but inefficient on real CPUs.
In the Collatz example, the biggest issue is that some assembly instructions are very expensive:
- div is much slower than a shift like shr
- mul can be heavier than using lea for simple arithmetic like 3*n + 1
- repeatedly loading constants into registers inside a loop wastes instructions
For example:
- dividing by 2 does not need div
- multiplying by 3 does not need a general-purpose multiply instruction
A compiler often turns source-level n / 2 and 3 * n + 1 into a shift and an add or lea, so that no div or mul is left in the loop.
Mental Model
Think of this like traveling across a city.
- C++ source code is like telling a professional route planner your destination.
- The compiler is the route planner that knows traffic patterns, road types, and shortcuts.
- Assembly is you manually choosing every road yourself.
If you know the city extremely well, your manual route can be excellent. But if you choose roads that look direct but are actually congested, your route can be slower than the planner’s route.
In this question:
- div is like taking a slow road with lots of traffic lights.
- shr is like taking a fast highway.
- loading rbx with 2 or 3 every loop is like stopping to check the map over and over.
- using lea for 3*n + 1 is like taking a shortcut the route planner knows.
So the lesson is simple:
Assembly is only faster when the instructions you choose are better than what the compiler would choose.
Syntax and Examples
Core idea: expensive instructions vs cheap instructions
In low-level optimization, different instructions that produce the same result can have very different costs.
Slower style
mov rbx, 2
xor rdx, rdx
div rbx ; divide RDX:RAX by RBX
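This works, but div is one of the slowest integer instructions: a 64-bit div has a latency of tens of cycles on many x86 CPUs, while a shift typically takes one.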
Faster style for dividing by 2
shr rax, 1
If the value is known to be even and you only want n / 2, a right shift is much cheaper.
Multiplying by 3
General multiply
mov rbx, 3
mul rbx
inc rax
This computes 3*n + 1, but mul is a general multiplication instruction and can be slower than necessary.
More efficient arithmetic
lea rax, [rax + rax*2 + 1]
This also computes 3*n + 1, in a single instruction, with no constant to load and without overwriting rdx.
What the compiler may do from C++
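For C++ such as n % 2 == 0 and n /= 2, an optimizing compiler typically emits a bit test and a shift instead of a division instruction.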
Step by Step Execution
Consider this C++ version:
int collatz_length(long n) {
    int count = 1;
    while (n != 1) {
        if ((n & 1) == 0)
            n >>= 1;
        else
            n = 3 * n + 1;
        ++count;
    }
    return count;
}
Let’s trace collatz_length(6).
Initial state
- n = 6
- count = 1
Iteration 1
- n != 1 → continue
- 6 & 1 is 0, so 6 is even
- n >>= 1 → n = 3
- ++count → count = 2
Real World Use Cases
This concept appears in many real programs, not just benchmark puzzles.
1. Numeric and algorithmic code
In:
- simulations
- compression
- cryptography support code
- parsing loops
- search algorithms
small instruction-level choices inside hot loops can dominate runtime.
2. Game engines and graphics tools
Performance-critical code often runs millions of times per frame or per asset. Developers rely on profilers and compiler output because naive “low-level looking” code is not always fastest.
3. Systems programming
In operating systems, drivers, embedded software, and runtimes, developers sometimes use assembly. But even there, they usually keep assembly minimal and only where measurement proves it helps.
4. High-performance servers
Fast request handling often depends on:
- branch-friendly code
- avoiding expensive arithmetic
- efficient memory access
- letting the compiler optimize aggressively
5. Data processing and scientific code
Loops over large datasets benefit from simple arithmetic, predictable branches, and code shapes that compilers can optimize well.
The practical lesson is:
- write clear code first
- measure
- inspect generated assembly if needed
- optimize only the true bottlenecks
Real Codebase Usage
In real projects, developers rarely choose between “all C++” and “all assembly” in a vacuum. They usually follow patterns like these.
Use high-level code in most places
Most code is written in C++ because it is easier to:
- read
- test
- maintain
- refactor
Then developers compile with optimization flags such as -O2 or -O3.
Inspect generated assembly when performance matters
A common workflow is:
- write clear C++
- profile the program
- find the hot path
- inspect compiler output
- adjust source code to help optimization
For example, developers may replace:
n % 2 == 0
with:
(n & 1) == 0
if they want intent to be explicit, though compilers often optimize this anyway.
Prefer simple operations in hot loops
Real codebases try to avoid expensive instructions inside repeated loops. Common patterns include:
- shifts instead of division by powers of two
- additions and lea-style arithmetic instead of general multiply where possible
- reducing repeated constant setup inside loops
Common Mistakes
1. Assuming assembly is automatically faster
This is the biggest misconception.
Incorrect assumption
- “I wrote assembly, so it must beat C++.”
Reality
A compiler can emit better machine code than a human’s first assembly draft.
2. Using div for powers of two
Slower code
mov rbx, 2
xor rdx, rdx
div rbx
Better approach
shr rax, 1
Division is far more expensive than shifting.
3. Using mul when simple arithmetic works
Less efficient
mov rbx, 3
mul rbx
inc rax
Better
lea rax, [rax + rax*2 + 1]
For 3*n + 1, a full multiply is often unnecessary.
4. Re-loading constants inside tight loops
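Wasteful: executing mov rbx, 2 or mov rbx, 3 on every pass through the loop, as the assembly in the question does. Once shr and lea replace div and mul, neither constant is needed at all.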
Comparisons
Source language vs generated machine code
| Idea | Common assumption | Actual performance reality |
|---|---|---|
| C++ | High-level, so slower | Can compile to very efficient machine code |
| Assembly | Low-level, so faster | Only faster if the chosen instructions are better |
Expensive vs cheap arithmetic instructions
| Operation | Slower choice | Faster common alternative | Why |
|---|---|---|---|
| Divide by 2 | div | shr | Shift is much cheaper |
| Multiply by 3 | mul | lea | Avoids the constant load and the write to rdx |
Cheat Sheet
Performance lessons from this question
- Assembly is not automatically faster than C++.
- Compare generated machine code, not source syntax.
- In hot loops, instruction choice matters a lot.
Important x86-64 ideas
Check odd/even
test rax, 1
jnz odd
Divide by 2
shr rax, 1
Compute 3*n + 1
lea rax, [rax + rax*2 + 1]
Expensive instructions to be careful with
- div
- idiv
- sometimes mul/imul when simpler arithmetic is possible
Compiler optimization facts
- -O0 is not representative of real performance
- -O2 and -O3 can remove large amounts of overhead
- % 2 and / 2 often become bit operations for integers
FAQ
Why can optimized C++ be faster than hand-written assembly?
Because the compiler may choose better instructions, reduce overhead, and optimize register usage more effectively than a manual assembly implementation.
Does % 2 in C++ always generate a division instruction?
No. For integer code, compilers often optimize % 2 into a bit test.
Why is div so slow on x86?
div is a complex instruction with high latency compared to simple operations like shr, add, or lea.
Is assembly still useful for optimization?
Yes, but mostly in small, carefully measured hotspots where a developer understands the target CPU well.
Should I use & 1 instead of % 2 in C++?
It can make intent explicit, but modern compilers often optimize % 2 well for integers anyway.
What is the biggest lesson from this Collatz example?
Instruction selection matters. A correct assembly program can be much slower if it uses expensive operations repeatedly.
Is compiler optimization more important than using a low-level language?
Often, yes. Good optimization flags and good code structure usually matter more than writing everything in assembly.
Mini Project
Description
Build a small C++ program that computes the starting number under a limit that produces the longest Collatz sequence. Then create two versions of the step logic: one using straightforward arithmetic (% and /) and another using bit operations (& and >>). This project demonstrates that readable high-level code can still be fast, and it helps you compare implementation styles in a controlled way.
Goal
Write a Collatz benchmark program in C++ and compare two equivalent implementations to see how code shape affects performance.
Requirements
- Write a function that returns the Collatz sequence length for a given positive integer.
- Test every starting value from 1 up to a chosen limit such as 1000000.
- Track which starting value produces the longest sequence.
- Implement one version using % 2 and / 2, and another using & 1 and >> 1.
- Print the winning starting value and the sequence length for each version.