Optimize Rust compilation
From the compiler to how you manage your project, this article is a complete walk-through to improve compilation time, runtime performance and binary size.
Mar 13, 2026
Runtime performance and reliability (compile-time checks) are two of Rust's major strengths. But they come at a cost: compilation time (and a learning curve). Developers consider it one of the language's biggest flaws.
However, hardware, project crate / workspace management, better code and compiler understanding, and tweaking compiler options can help you mitigate this trade-off.
Before jumping in, I'd suggest that you take a quick look at my compiler reference. It is an overview of how the compiler operates, but more importantly, it contains a table referencing all the options we'll benchmark here.
This article is split in two parts:
- Compiler-related optimizations: options, toolchains, and backends one can leverage to improve compile time. Based on tests and actual numbers, available below.
- Environment-related optimizations: Project management, hardware, first and third-party tools.
Preamble: a word on iterative compilation
Cold compilation and iterative compilation are not to be put on the same level, although the former influences the latter.
Cold compilation is compilation from scratch. It matters the most when deploying to production.
Iterative compilation is incremental compilation + linking. This is the one that matters to developers on a daily basis.
Incremental compilation implies the following:
- The frontend (lexer, parser, HIR, MIR) is skipped for all unchanged parts (dependencies and local modules).
- LLVM / Cranelift backend:
  - Unchanged codegen units are skipped (which is why `codegen-units` should remain high).
  - Changed codegen units are recreated.
- All the generated code is linked again.
What is a codegen-unit?
A codegen unit (CGU) is a chunk of MIR handed to the backend (LLVM or Cranelift) as a single unit of parallel work. Each CGU produces one object file (.o).
More units means more parallelism, but less cross-unit optimization (low impact).
So: a performant backend and linker matter a lot for daily Rust development.
Optimizing the compilation process
Depending on the context, one likely wants to either improve: compilation time, runtime performance, or binary size.
Here are some recommendations based on my experience and tests, that you can find in the next section.
Honest spoiler: default cargo behavior is well set, nightly offers big gains, and an alternative linker finishes the job with brio.
There are plenty of resources out there, here are some valuable ones:
- https://doc.rust-lang.org/cargo/guide/build-performance.html, and as a reference: https://doc.rust-lang.org/cargo/reference/profiles.html
- https://corrode.dev/blog/tips-for-faster-rust-compile-times/
- https://bevy.org/learn/quick-start/getting-started/setup/
Synthesis
Options having the biggest impact, per need.
Incremental compilation time
Mostly benefits development iteration speed.
| Option | What it does for Incremental Builds | The Trade-off |
|---|---|---|
| `incremental=true` | Huge speedup: only recompiles actual source changes since the last build. | M increase in `target/` artifact bloat. |
| `wild` linker (`mold` as a fallback) | Drops link times to < 1 s. Link time can be an even bigger bottleneck than compilation depending on the project. | Setup & compatibility restrictions. Negligible runtime-performance cost. |
| `split-debuginfo="unpacked"` | Lightens the work of the linker (and lighter binary). | Free. |
| `lto=false` | Having LTO on forces a global re-link on every minor code change. | Runtime performance. |
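Put together, the options above map to a dev profile like this minimal sketch (the specific `codegen-units` value is a placeholder; tune it, and the linker, to your machine and project):

```toml
# Hypothetical minimal dev profile for fast incremental builds,
# based on the table above; adjust values for your project.
[profile.dev]
incremental = true           # recompile only what changed
lto = "off"                  # avoid a global re-link on every change
split-debuginfo = "unpacked" # lighter work for the linker
codegen-units = 32           # keep CGU parallelism (and skipping) high
```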
Cold compilation time
Benefits development iteration speed (often less important than incremental). Benefits release deployment time.
| Option | What it does for Cold Builds | The Trade-off |
|---|---|---|
| `-Zthreads=N` (N = physical cores) | Parallelizes rustc's frontend work (parsing, analysis, MIR building) across N threads. Biggest cold-build win: 48.8 s → 33.8 s (~30 %) with `-Zthreads=8`. Sweet spot is around the number of physical cores; going beyond (e.g. 32 threads on a 16-core CPU) hurts due to contention. | Nightly only. Stabilization is tracked in rust-lang/rust#122292. |
| `codegen-backend="cranelift"` | Bypasses LLVM, generating unoptimized code really fast. Most valuable in pure edit-compile-test cycles where you never care about runtime speed (UI iteration, TDD loops). | Nightly only. Compatibility restrictions. Runtime execution speed (~25 % slower in these tests). Not always as fast as the sum of other options. |
| `opt-level=0` | Skips optimizations. | Runtime performance. |
| `codegen-units=16` | Enables more parallelism. | Disk space. |
| `sccache` | Shares compiled dependency artifacts in a cached repository across projects. Incompatible with `incremental=true` (they conflict; use one or the other). | Space management: `cargo clean` won't touch it; flush manually after `rustup update`. (Not benchmarked here; included from community consensus.) |
| `-Zshare-generics=y` | Shares monomorphized generics across codegen units within the same crate, avoiding redundant work per CGU. | Only meaningful at higher `codegen-units` counts; pairs well with `opt-level=0`. |
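If you opt for `sccache` (and therefore give up `incremental=true`), wiring it in is a one-line Cargo config change. A sketch, assuming `sccache` is already installed and on your `PATH`:

```toml
# Route every rustc invocation through sccache so dependency
# artifacts are cached and shared across projects.
[build]
rustc-wrapper = "sccache"
```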
Runtime performance
Benefits production (release) performance.
| Option | What it does for Runtime Performance | The Trade-off |
|---|---|---|
| `opt-level=3` | Applies all compiler optimization passes. | XXL penalty to compilation time. L penalty to artifact deps size. L penalty to binary size. |
| `lto="thin"` | Allows the compiler to inline functions from dependencies into your code. | XXL penalty to link times / compilation time. |
| `debug-assertions=false` | Prevents the compiler from emitting runtime assertion checks, e.g. integer overflow checks. | XXL — don't disable this for critical services. |
Binary size
Benefits production (release) systems: either saving cost or enabling more hardware.
| Option | What it does for Binary Size | The Trade-off |
|---|---|---|
| `strip="symbols"` | Rips out all debug symbols and names from the final compiled binary. (XXL reduction) | XL loss of observability. Crash reports will have no stack traces. |
| `opt-level="s"` | Explicitly instructs LLVM to favor small machine code over fast machine code. (XL reduction) | M penalty to execution speed compared to `opt-level=3`. |
| `lto="thin"` | Aggressively finds and deletes unused code across all dependency boundaries. (XL reduction) | XXL penalty to link times / compilation time. |
| `panic="abort"` | Removes the "landing pads" and stack-unwinding code. (M reduction) | Cannot gracefully catch panics; the app just dies instantly. |
| `codegen-units=1` | Setting it to 1 lets LLVM see the whole crate to deduplicate code. (Size reduction by avoidance) | S penalty to compilation time. |
| `-Zshare-generics=y` | Prevents compiling the exact same `Vec<T>` monomorphization multiple times. (L reduction) | Nightly only. XS runtime overhead. |
| `split-debuginfo="unpacked"` | Lighter binary. | No effect if `debug=false`. |
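Combined, a size-first release profile could look like this sketch (stable options only; each line carries the trade-off listed in the table above):

```toml
# Hypothetical size-focused release profile based on the table above.
[profile.release]
strip = "symbols"  # XXL reduction; crash reports lose stack traces
opt-level = "s"    # favor small machine code over fast machine code
lto = "thin"       # delete unused code across dependency boundaries
panic = "abort"    # drop landing pads and unwinding code
codegen-units = 1  # let LLVM deduplicate across the whole crate
```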
A note on linkers
The Cargo `linker` option name is misleading: `cc` is not a linker; it is a compiler driver that delegates to a linker.
wild, mold, and lld are actual linkers.
In these tests, clang + wild consistently produced the shortest link times. mold is a solid fallback when wild is not available.
rust-lld (the WIP official Rust linker) showed no improvement over cc for now.
A note on the toolchain
Nightly offers access to up-to-date dependencies and features such as the parallel frontend, shared generics, and Cranelift.
If your environment allows it, compiling on nightly is the way to go.
If you know the final binary is going to be released on a given stable version X, you can always switch to X-nightly.
~/.cargo/config.toml
On debug, the order of priority is usually:
- compilation time
- ability to debug
- runtime performance
- binary size
On release, the order of priority is usually:
- runtime performance
- ability to trace back crashes
- binary size (but sometimes size matters more than performance, e.g. embedded)
- compilation time
WASM is a different beast. Several options are incompatible.
Based on my rig, tests and available information, here is my ~/.cargo/config.toml:
# Note: [toolchain] is rust-toolchain.toml syntax (read by rustup);
# Cargo itself does not read this section from config.toml.
[toolchain]
channel = "1.94.0-nightly"
[unstable]
# codegen-backend = true
[target.x86_64-unknown-linux-gnu]
linker = "clang"
rustflags = [
"-Zunstable-options",
"-C", "link-arg=--ld-path=wild",
# "-C", "link-arg=-fuse-ld=mold",
# "-C", "linker-features=+lld",
# !Warning: binaries won't work on cpus that have a different architecture!
# "-C", "target-cpu=native",
"-Z", "threads=16",
"-Z", "share-generics=y",
]
# Some WASM-specific optimizations, out of scope
# Cranelift, mold, wild, are WASM-incompatible
[target.wasm32-unknown-unknown]
rustflags = [
"-Zunstable-options",
"-C", "link-arg=--ld-path=wild",
"-C", "target-feature=+bulk-memory,+mutable-globals,+nontrapping-fptoint,+sign-ext",
"-Z", "threads=16",
"-Z", "share-generics=y",
]
[profile.release]
opt-level = 3 # default
debug-assertions = false # default
debug = false # default
codegen-units = 1
lto = "thin"
strip = "symbols"
panic = "abort"
split-debuginfo = "off"
# minimal optimization for my own code in debug
[profile.dev]
incremental = true # default
debug-assertions = true # default
opt-level = 0 # default
debug = true # default
lto = "off" # default
strip = "none" # default
panic = "unwind" # default
split-debuginfo = "unpacked"
codegen-units = 32
# Not useful for me but you should definitely try and compare.
# codegen-backend = "cranelift"
[profile.dev.package."*"]
debug = "line-tables-only"
opt-level = 3
debug-assertions = false
Where does each option go?
Options under [profile.*] and [profile.*.package."*"] can go in either ~/.cargo/config.toml or local crate Cargo.toml.
Options like linker, [build], [target.*], and [unstable] (including -Z flags passed via rustflags) go in ~/.cargo/config.toml.
Note that [profile.release.package."*"] targets all dependencies and excludes all local crates.
Of course, one may need panic="unwind" or keep debug = "line-tables-only". It may be an acceptable trade-off.
Testing compiler options
Results may vary.
Expect a variance of roughly ±2–3 %. The relative ranking between options is consistent across runs, but absolute numbers should be read as approximations.
My rig
- CPU: Ryzen 9 9950X. 16 cores, 32 threads.
- RAM: 64 GB running at 6000 MHz CAS 30.
- SSD: NVMe 980 Pro.
- OS family: Linux (EndeavourOS)
Set up
Simple: cargo new, cargo add bevy@0.18.1.
And a basic main.rs that spawns particles (entities) that are then updated (transform) with some heavy floating-point math every frame.
Throughout the whole test process, I avoided starting or stopping other programs, to limit CPU/RAM interference as much as I (easily) could.
Also, between each test I ran: cargo clean && rm -rf ~/.cache/sccache && rm -rf target.
Toolchains
The same tests were performed on the following toolchains:
- 1.94.0-stable: dropped from most of the tests, as its results remained consistent with 1.94.0-nightly.
- 1.94.0-nightly
Nightly benchmarking is basically required as significant options are only accessible from the nightly toolchain.
Versus stable:
- ~1 % variance in execution speed
- An increase of ~10 % compilation artifacts
- An increase of ~15 % up to ~20 % binary size
Why is nightly heavier? Probable causes: new compiler features and tooling versions that have not been fully optimized yet. It could also be bevy `cfg(nightly)`-only features.
- 1.95.0-nightly: entirely removed, as it suffered from a significant compilation-time regression (issue filed and bisected).
REFERENCE configuration
Every test runs against the reference configuration unless specified otherwise.
[unstable]
[build]
[target.x86_64-unknown-linux-gnu]
linker="cc"
[profile.dev]
incremental = false
debug-assertions = true
opt-level = 0
debug = true
split-debuginfo = "off"
strip = "none"
lto = false # false = "thin-local" unless opt-level=0: then "off"
panic = "unwind"
codegen-units = 256
This means most tests are about toggling on/off 1 option.
Tests are biased
Options influence each other. When combined, they often yield diminishing returns or in the worst case, conflict with each other. I tried to compensate for this by toggling on multiple options at once when it made sense (always explicitly stated).
Stable (1.94)
| stable | time (s) | Disk deps (GB) w/ loop | bin size (MB) w/ loop | exec speed (ms) w/ loop |
|---|---|---|---|---|
| REFERENCE | 48.5 | 6.9 | 1100 | 158 |
| `codegen-units=1` | 53.3 | 4.7 | 784 | 158 |
| `opt-level=3` | 96.4 | 8.9 | 1600 | 10.8 |
| `lto="thin"` | 62.4 | 5 | 312 | 158 |
| `debug=false` + `debug-assertions=false` | 42.5 | 2.3 | 87 | 130 |
| `strip="symbols"` | 47.8 | 4.9 | 55 | 158.4 |
| optimized release build | 69.6 | 1.5 | 20 | 10.4 |
| optimized dev build | 52.3 | 3.0 | 157 | 92.1 |
Nightly (1.94)
Significant improvements in bold. Significant regressions in italic.
Legend: each row toggles only the stated option(s) on top of the REFERENCE config, unless it belongs to the FULL CONFIGURATIONS or INCREMENTAL sub-tables below.
Single-option toggles
| nightly | time (s) | Disk deps (GB) | bin size (MB) | exec speed (ms) | Notes |
|---|---|---|---|---|---|
| REFERENCE (`cc`) | 48.8 | 7.6 | 1300 | 157.3 | |
| `rust-lld` linker | 49.7 | 7.6 | 1300 | 157.3 | rust-lld, the WIP official linker |
| `clang` linker | 59 | 7.5 | 1200 | 158 | |
| `cc` + `mold` linker | 49.7 | 7.7 | 1300 | 158.6 | |
| `clang` + `wild` linker | 48.7 | 7.6 | 1200 | 158.5 | wild is not compatible with cc |
| cranelift | 43.7 | 4.9 | 916 | 196.9 | |
| `codegen-units=32` | 47.5 | 6.5 | 1100 | 157 | |
| `codegen-units=16` | 47.6 | 6.2 | 1000 | 157 | |
| `codegen-units=1` | 52.9 | 5.2 | 912 | 157 | |
| `-Zthreads=1` | 46.9 | 7.6 | 1200 | 158 | Seems to be the default value |
| `-Zthreads=8` | 33.8 | 7.6 | 1200 | 158 | != 1.95: here sweet spot is 16 |
| `-Zthreads=8` + `codegen-units=16` | 31.4 | 7.6 | 1200 | 158 | |
| `-Zthreads=16` | 33.4 | 7.6 | 1200 | 158 | 9950X has 16 cores |
| `-Zthreads=32` | 40 | 7.6 | 1200 | 158 | 9950X has 32 threads |
| `-Zshare-generics=y` | 48.6 | 6.1 | 1000 | 158 | |
| `split-debuginfo="unpacked"` | 49.1 | 6.3 | 238 | 157 | |
| `-Ctarget-cpu=native` | 49.2 | 7.6 | 1200 | 166 | Surprisingly ineffective, so you should test it. |
| `panic="abort"` | 48.1 | 7.4 | 1200 | 155.3 | |
| `debug=false` | 43.15 | 2.3 | 107 | 154.2 | makes split-debuginfo ineffective |
| `debug="line-tables-only"` | 44.6 | 3.3 | 288 | 154.3 | |
| `debug-assertions=false` | 48.6 | 7.4 | 1200 | 139.8 | |
| `debug=false` + `debug-assertions=false` | 42.8 | 2.7 | 103 | 127 | cumulative |
| `strip="symbols"` | 48.8 | 5.3 | 55 | 157.9 | |
| `strip="debuginfo"` | 48.2 | 5.4 | 106 | 158 | |
| `lto="thin"` | 63.5 | 6.4 | 347 | 157.4 | |
| `opt-level=1` | 68.5 | 7.6 | 1400 | 15.4 | |
| `opt-level=2` | 75.2 | 8.2 | 1500 | 10.8 | |
| `opt-level=3` | 77.3 | 8.2 | 1500 | 14.7 | |
| `opt-level="s"` | 67.5 | 6.8 | 1200 | 14.7 | Also tried on slimmed binaries: "s" → 72 MB and 11 ms, vs 3 → 68 MB and 10.5 ms, vs 0 → 93 MB and 127 ms. "s" is the most balanced option. |
| `opt-level="z"` | 63.3 | 6.4 | 1100 | 18.9 | |
Full configurations
| config | time (s) | Disk deps (GB) | bin size (MB) | exec speed (ms) | Notes |
|---|---|---|---|---|---|
| REF_RELEASE = `opt-level=3`, `lto="thin"`, `debug=false`, `debug-assertions=false`, `strip="symbols"`, `panic="abort"`, `codegen-units=1`, `split-debuginfo="off"`, + `-Zthreads=8` | 92.4 | 1.5 | 22 | 10.6 | (release profile) |
| REF_RELEASE + `-Ctarget-cpu=native` | 92.9 | 1.5 | 22 | 10.3 | small performance benefit |
| REF_RELEASE + `-Zshare-generics=y` | 66.75 | 1.5 | 20 | 10.5 | Free compilation-time win for release builds |
| REF_RELEASE + wild | 92.3 | 1.5 | 22 | 10.3 | |
| REF_DEV = `opt-level=0`, `debug=false`, `debug-assertions=false`, `split-debuginfo="off"`, `strip=false`, `lto="off"`, `panic="abort"`, `codegen-units=16`, + `-Ctarget-cpu=native` + `-Zthreads=8` + `-Zshare-generics=y` | 29.34 | 2 | 84 | 117.6 | shortest compilation time (w/o linker) |
| REF_DEV + `debug-assertions=true` | 29.7 | 2.1 | 88 | 145.2 | |
| REF_DEV + mold | 25.74 | 2.1 | 116 | 118 | |
| REF_DEV + wild | 24.93 | 2 | 85 | 116.8 | |
| REF_DEV + cranelift | 29.5 | 2.7 | 376 | 195 | |
| REF_DEV + cranelift + wild | 24.7 | 2.7 | 366 | 196 | |
| REF_DEV_2 = `incremental=true`, `codegen-units=16`, `opt-level=0`, `lto="off"`, `strip="none"`, `panic="unwind"`, `debug=true`, `split-debuginfo="unpacked"`, `debug-assertions=true`, + `-Zthreads=8` | 41.5 | 5.2 | 220 | 157.1 | best balance |
| REF_DEV_2 + wild | 30.3 | | | 156.4 | |
| REF_DEV_2 + `-Zshare-generics=y` + `-Ctarget-cpu=native` | 33.8 | 6.3 | 219 | 166.5 | |
| REF_DEV_P = REF_DEV_2 + `[profile.dev.package."*"]` `opt-level=3` + wild | 72.2 | 5.1 | 234 | 98.6 | |
Incremental builds
Recompiling a single file (~50 LoC change).
| config | time (s) | Disk deps (GB) | bin size (MB) | Notes |
|---|---|---|---|---|
| REFERENCE (`cc`) | 1.83 | 7.6 | 1300 | |
| REFERENCE + `clang` linker | 11.3 | 8.3 | 1200 | |
| REFERENCE + cranelift | 1.53 | 4.9 | 916 | |
| REFERENCE + mold | 1.3 | 7.7 | 1300 | |
| REFERENCE + clang + wild | 1.13 | 7.7 | 1200 | |
| REF_DEV_2 + clang | 5.46 | 6.1 | 220 | |
| REF_DEV_2 + `-Zshare-generics=y` + `-Ctarget-cpu=native` | 5.3 | 6.1 | 219 | |
| REF_DEV_2 + mold (cc) | 0.83 | 5.3 | 254 | |
| REF_DEV_2 + clang + wild | 0.67 | 5.2 | 220 | This is great. |
| REF_DEV_P + clang + wild | 0.48 | 5.1 | 234 | |
Beyond the compiler
Hardware
Two key factors: RAM and CPU.
RAM
48 GB is the current sweet spot for day-to-day development. The main bottleneck is not the compiler: rust-analyzer will definitely overflow 16 GB, and I have often seen 32 GB machines crash too.
CPU & OS
CPU and OS are codependent.
macOS and Linux share a common ancestor (UNIX) and are good options.
Linux, whatever the flavour, is a first-class citizen when it comes to development. It just works and it's highly customizable.
macOS M-series chips will outperform pretty much every other laptop and most average desktop CPUs. Only higher-end desktop CPUs will beat them... in multithreaded workloads. Apple has truly impressive hardware with best-in-market single-core performance, great multi-core performance and an unmatched efficiency ratio; as well as unified memory with impressive bandwidth. Depending on your usage, you may encounter incompatibilities caused by the OS ecosystem (ARM or closed-OS related issues, e.g. Metal), but most Rust libraries and tools will be compatible out of the box.
Stay far from Windows if you get to choose. Expect:
- A compilation time increase of around 60 %.
- To fall asleep waiting for cargo locks to release.
- For OneDrive to mess with the target folder.
WSL is a decent way out. Not on par with native Linux though, and it comes with other constraints such as high RAM usage. Also note that fast linkers are not available on Windows.
Good CPUs (single and multi-core performance both matter!) are great for both cold and incremental compilation. While for the latter the difference is barely noticeable in small projects, it makes all the difference in bigger ones.
Project management
Use a workspace, keep crates small
Bad crate management is one of the main causes of long incremental compilation times. Turning a big crate into a modular workspace will definitely shorten compilation time (each independent crate's compilation runs in parallel).
Start projects with a workspace: one binary crate, multiple library crates. Find an architecture that suits the project's needs and apply programming principles to crate management. For example:
- By following the Single Responsibility Principle, local crates will emerge naturally.
- Don't Repeat Yourself: dependencies of a given version that are shared across crates should belong to `workspace.dependencies`, so they are only compiled and linked once.
- Separation of Concerns: don't mix layers, don't mix responsibilities: rendering in one crate, UI in another.
For local libraries: isolate self-contained code that grows large into a library crate. If you can't, isolate each independent responsibility behind an optional feature. This will also serve your project architecture.
Note that while creating (optional) features is more important for libraries than it is for a user-facing application, it is still a good habit to have.
Be careful though, it is all about balance: a micro-crate architecture can hurt compilation time and runtime optimization.
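As a sketch, a workspace root `Cargo.toml` following these principles might look like this (crate names are made up for illustration):

```toml
# Hypothetical workspace: one binary crate, focused library crates.
[workspace]
resolver = "2"
members = ["app", "rendering", "ui"]

# Shared dependencies are declared once here...
[workspace.dependencies]
serde = { version = "1", features = ["derive"] }

# ...and each member crate inherits them with:
#   [dependencies]
#   serde = { workspace = true }
```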
Unused dependency features
Unused features of your dependencies still get compiled.
- One solution is to add dependencies with default features disabled (`--no-default-features`).
- Another is to `cargo install cargo-features-manager` and then run `cargo features prune`.
- A third is sometimes to simply switch a dependency for a lighter alternative.
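For the first option, here is what opting out of default features looks like in `Cargo.toml` (tokio is used as an illustrative example; check each crate's docs for what its default features pull in):

```toml
# Disable the default feature set and re-enable only what you use.
[dependencies]
tokio = { version = "1", default-features = false, features = ["rt", "macros"] }
```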
Code awareness
At the code level, what impacts compilation time the most? The answer is not functions or variables. It is mainly:
- Monomorphization: Each generic (T) and trait will duplicate the function/impl block for all the concrete type variants that are actually used by the source code. Ever heard of YAGNI and KISS? Don't write generalized interfaces for a use case you don't have yet.
- Macro expansion: Everywhere a macro is called (be it declarative or procedural), the code is expanded.
- Procedural macro crates: in a Bevy project, `bevy_ecs`'s derive macros represent a significant chunk of compile time. Use `cargo build --timings` to identify which proc-macro crates dominate your build, then consider whether you really need all of them.
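To make the monomorphization point concrete, here is a small sketch: the generic function is compiled once per concrete type it is called with, while the `dyn` version is compiled exactly once, at the price of a vtable call at runtime:

```rust
use std::fmt::Display;

// Monomorphized: the compiler emits one copy of this function per
// concrete type it is used with (here: i32, f64, &str => 3 copies).
fn describe_generic<T: Display>(value: T) -> String {
    format!("value = {value}")
}

// Dynamic dispatch: a single copy is compiled, no matter how many
// types are passed in. Slightly slower at runtime, much cheaper to compile.
fn describe_dyn(value: &dyn Display) -> String {
    format!("value = {value}")
}

fn main() {
    // Each call below forces a fresh monomorphization.
    println!("{}", describe_generic(42_i32));
    println!("{}", describe_generic(1.5_f64));
    println!("{}", describe_generic("hello"));

    // These calls all go through the same compiled function.
    println!("{}", describe_dyn(&42_i32));
    println!("{}", describe_dyn(&"hello"));
}
```

`cargo-llvm-lines` (mentioned below) will show each `describe_generic` instantiation as a separate entry, while `describe_dyn` appears once.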
Tools
- Monitoring: `cargo build --timings` shows how long each dependency takes to compile, in an HTML graph.
- Monitoring: `cargo-llvm-lines` prints the number of cumulative lines and copies of each generic function in the binary.
- Monitoring: `-Zprint-mono-items=yes` shows exactly what is monomorphized in your code.
- Hygiene: `cargo-machete` and `cargo-udeps` identify unused dependencies.
- Hygiene: `cargo-outdated` identifies outdated dependencies (and `cargo-edit` automatically updates them).
Thanks for reading. I hope it has sparked some new thoughts or insights! Questions, remarks, or caught an inconsistency? My inbox is open, please reach out.