Optimize Rust compilation

From the compiler itself to how you manage your project, this article is a complete walk-through of how to improve compilation time, runtime performance, and binary size.

engineering/dev/lang/rust

32 min read

Mar 13, 2026

deep-dive · guide

Runtime performance and reliability (compile-time checks) are two of Rust's major strengths. But they come at a cost beyond the learning curve: compilation time, which developers consider one of the language's biggest flaws.

However, hardware, project crate / workspace management, better code and compiler understanding, and tweaking compiler options can help you mitigate this trade-off.

Before jumping in, I'd suggest that you take a quick look at my compiler reference. It is an overview of how the compiler operates, but more importantly, it contains a table referencing all the options we'll benchmark here.

This article is split in two parts:

  1. Compiler-related optimizations: options, toolchains, and backends one can leverage to improve compile time. Based on tests and actual numbers, accessible below.
  2. Environment-related optimizations: Project management, hardware, first and third-party tools.

Preamble: a word on iterative compilation

Cold compilation and iterative compilation are not to be put on the same level, even though the former influences the latter.

Cold compilation is compilation from scratch. It matters the most when deploying to production.

Iterative compilation is incremental compilation + linking. This is the one that matters to developers on a daily basis.

Incremental implies the following:

  1. Frontend (lexer, parser, HIR, MIR) is skipped for all unchanged parts (dependencies and local modules).
  2. LLVM / cranelift backend:
    • Unchanged codegen units are skipped (which is why codegen-units should remain high).
    • Changed codegen units are recreated.
  3. All the generated code is linked again.

What is a codegen-unit?

A codegen unit (CGU) is a chunk of MIR handed to the backend (LLVM or Cranelift) as a single unit of parallel work. Each CGU produces one object file (.o). More units means more parallelism, but less cross-unit optimization (low impact).
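As a concrete illustration of this trade-off, codegen-units is set per profile in Cargo.toml. The values below are illustrative, not recommendations:

```toml
# Cargo.toml -- illustrative values, not recommendations
[profile.dev]
codegen-units = 256  # many small CGUs: more parallel backend work, faster rebuilds

[profile.release]
codegen-units = 1    # a single CGU: maximal cross-unit optimization, slower builds
```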

So: a performant backend and linker matter a lot for daily Rust development.

Optimizing the compilation process

Depending on the context, one typically wants to improve one of: compilation time, runtime performance, or binary size.

Here are some recommendations based on my experience and tests, that you can find in the next section.

Honest spoiler: default cargo behavior is well set, nightly offers big gains, and an alternative linker finishes the job with brio.


Synthesis

Options having the biggest impact, per need.

Incremental compilation time

Mostly benefits development iteration speed.

| Option | What it does for incremental builds | The trade-off |
| --- | --- | --- |
| incremental=true | Huge speedup: only recompiles actual source changes since the last build. | M increase in target/ artifact bloat. |
| wild linker (mold as a fallback) | Drops link times to < 1 s. Link time can be an even bigger bottleneck than compilation, depending on the project. | Setup & compatibility restrictions. Runtime performance (negligible). |
| split-debuginfo="unpacked" | Lightens the work of the linker (and lightens the binary). | Free. |
| lto=false | Having LTO on forces a global re-link on every minor code change. | Runtime performance. |

Cold compilation time

Benefits development iteration speed (often less important than incremental). Benefits release deployment time.

| Option | What it does for cold builds | The trade-off |
| --- | --- | --- |
| -Zthreads=N (N = physical cores) | Parallelizes rustc's frontend work (parsing, analysis, MIR building) across N threads. Biggest cold-build win: 48.8 s → 33.8 s (~30 %) with -Zthreads=8. Sweet spot is around the number of physical cores; going beyond (e.g. 32 threads on a 16-core CPU) hurts due to contention. | Nightly only. Stabilization is tracked in rust-lang/rust#122292. |
| codegen-backend="cranelift" | Bypasses LLVM, generating unoptimized code really fast. Most valuable in pure edit-compile-test cycles where you never care about runtime speed (UI iteration, TDD loops). | Nightly only. Compatibility restrictions. Runtime execution speed (~25 % slower in these tests). Not always as fast as the sum of other options. |
| opt-level=0 | Skips optimizations. | Runtime performance. |
| codegen-units=16 | Enables more parallelism. | Disk space. |
| sccache | Shares compiled dependency artifacts in a cached repository across projects. (Not benchmarked here; included from community consensus.) | Incompatible with incremental=true (they conflict; use one or the other). Space management: cargo clean won't touch it; flush manually after rustup update. |
| -Zshare-generics=y | Shares monomorphized generics across codegen units within the same crate, avoiding redundant work per CGU. | Nightly only. Only meaningful at higher codegen-units counts; pairs well with opt-level=0. |
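If you want to try sccache, it hooks into cargo as a rustc wrapper. A minimal sketch of the setup, assuming sccache is installed and on your PATH:

```toml
# ~/.cargo/config.toml -- assumes `cargo install sccache` has been run
[build]
rustc-wrapper = "sccache"
```

Remember that it conflicts with incremental = true, so keep incremental compilation off in the profiles you want cached.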

Runtime performance

Benefits production (release) performance.

| Option | What it does for runtime performance | The trade-off |
| --- | --- | --- |
| opt-level=3 | Applies all compiler optimization passes. | XXL penalty to compilation time. L penalty to artifact deps size. L penalty to binary size. |
| lto="thin" | Allows the compiler to inline functions from dependencies into the code. | XXL penalty to link times / compilation time. |
| debug-assertions=false | Prevents the compiler from emitting runtime assertion checks, e.g. integer overflow checks. | XXL: don't disable this for critical services. |

Binary size

Benefits production (release) systems: either saving cost or enabling more hardware.

| Option | What it does for binary size | The trade-off |
| --- | --- | --- |
| strip="symbols" | Rips out all debug symbols and names from the final compiled binary. (XXL reduction) | XL loss of observability: crash reports will have no stack traces. |
| opt-level="s" | Explicitly instructs LLVM to favor small machine code over fast machine code. (XL reduction) | M penalty to execution speed compared to opt-level=3. |
| lto="thin" | Aggressively finds and deletes unused code across all dependency boundaries. (XL reduction) | XXL penalty to link times / compilation time. |
| panic="abort" | Removes the "landing pads" and stack unwinding code. (M reduction) | Cannot gracefully catch panics; the app just dies instantly. |
| codegen-units=1 | (Avoidance): setting to 1 allows LLVM to see the whole crate to deduplicate code. | S penalty to compilation time. |
| -Zshare-generics=y | Prevents compiling the exact same Vec<T> monomorphization multiple times. (L reduction) | Nightly only. XS runtime overhead. |
| split-debuginfo="unpacked" | Lighter binary. | No effect if debug=false. |
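These options can be combined into a dedicated profile so they don't affect every release build. A sketch using Cargo's custom profiles (the profile name is mine, the values are illustrative):

```toml
# Cargo.toml -- a size-focused profile, built with `cargo build --profile min-size`
[profile.min-size]
inherits = "release"
opt-level = "s"      # favor small machine code over fast machine code
lto = "thin"
codegen-units = 1
strip = "symbols"
panic = "abort"
```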

A note on linkers

The Cargo option name is misleading: cc is not a linker. It is a compiler driver that delegates to a linker.

wild, mold, and lld are actual linkers.

In these tests, clang + wild consistently produced the shortest link times. mold is a solid fallback when wild is not available.

rust-lld (the WIP official Rust linker) showed no improvement over cc for now.

A note on the toolchain

Nightly offers access to cutting-edge features such as the parallel frontend, shared generics, and Cranelift.

If your environment allows it, nightly compilation is the way to go. If you know the final binary is going to be released on a given stable version X, you can always switch to the nightly corresponding to X.
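Note that the channel itself is pinned per project in a rust-toolchain.toml file at the repository root; rustup does not read it from ~/.cargo/config.toml. A sketch, with an illustrative pinned date:

```toml
# rust-toolchain.toml -- the pinned date is illustrative
[toolchain]
channel = "nightly-2026-03-01"
components = ["rustfmt", "clippy"]
```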

~/.cargo/config.toml

On debug, the order of priority is usually:

  1. compilation time
  2. ability to debug
  3. runtime performance
  4. binary size

On release, the order of priority is usually:

  1. runtime performance
  2. ability to trace back crashes
  3. binary size (but sometimes size matters more than performance, e.g. embedded)
  4. compilation time

WASM is a different beast. Several options are incompatible.

Based on my rig, tests and available information, here is my ~/.cargo/config.toml:

# Note: rustup does not read [toolchain] from ~/.cargo/config.toml;
# pin the channel in a rust-toolchain.toml file at the project root instead.
# [toolchain]
# channel = "1.94.0-nightly"

[unstable]
# codegen-backend = true

[target.x86_64-unknown-linux-gnu]
linker = "clang"
rustflags = [
    "-Zunstable-options",
    "-C", "link-arg=--ld-path=wild",
    # "-C", "link-arg=-fuse-ld=mold",
    # "-C", "linker-features=+lld",
    # !Warning: binaries won't work on cpus that have a different architecture!
    # "-C", "target-cpu=native",
    "-Z", "threads=16",
    "-Z", "share-generics=y",
]

# Some WASM-specific optimizations, out of scope
# Cranelift, mold, wild, are WASM-incompatible
[target.wasm32-unknown-unknown]
rustflags = [
    "-Zunstable-options",
    "-C", "target-feature=+bulk-memory,+mutable-globals,+nontrapping-fptoint,+sign-ext",
    "-Z", "threads=16",
    "-Z", "share-generics=y",
]

[profile.release]
opt-level = 3  # default
debug-assertions = false # default
debug = false # default
codegen-units = 1
lto = "thin"
strip = "symbols"
panic = "abort"
split-debuginfo = "off"

# minimal optimization for my own code in debug
[profile.dev]
incremental = true # default
debug-assertions = true # default
opt-level = 0 # default
debug = true # default
lto = "off" # default
strip = "none" # default
panic = "unwind" # default
split-debuginfo = "unpacked"
codegen-units = 32
# Not useful for me but you should definitely try and compare.
# codegen-backend = "cranelift"
[profile.dev.package."*"]
debug = "line-tables-only"
opt-level = 3
debug-assertions = false

Where does each option go?

Options under [profile.*] and [profile.*.package."*"] can go in either ~/.cargo/config.toml or local crate Cargo.toml. Options like linker, [build], [target.*], and [unstable] (including -Z flags passed via rustflags) go in ~/.cargo/config.toml.

Note that [profile.release.package."*"] targets all dependencies and excludes all local crates.

Of course, one may need panic="unwind", or to keep debug = "line-tables-only"; that can be an acceptable trade-off.

Testing compiler options

Results may vary.

Expect a variance of roughly ±2–3 %. The relative ranking between options is consistent across runs, but absolute numbers should be read as approximations.

My rig

  • CPU: Ryzen 9 9950X. 16 cores, 32 threads.
  • RAM: 64 GB running at 6000 MHz CAS 30.
  • SSD: NVMe 980 Pro.
  • OS family: Linux (EndeavourOS)

Setup

Simple: cargo new, cargo add bevy@0.18.1.

And a basic main.rs that spawns particles (entities) that are then updated (transform) with some heavy floating-point math every frame.
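For reference, the workload looks roughly like this. This is a Bevy-free sketch with made-up names and constants, not the actual benchmark code:

```rust
// A Bevy-free sketch of the benchmark workload: particles whose positions
// are updated with trig-heavy floating-point math every frame.
// All names and constants here are illustrative.
struct Particle {
    pos: [f32; 3],
    vel: [f32; 3],
}

fn update(particles: &mut [Particle], dt: f32) {
    for p in particles.iter_mut() {
        for i in 0..3 {
            // Arbitrary heavy math to keep the FPU busy.
            p.vel[i] += (p.pos[i].sin() * p.pos[(i + 1) % 3].cos()) * dt;
            p.pos[i] += p.vel[i] * dt;
        }
    }
}

fn main() {
    let mut particles: Vec<Particle> = (0..10_000)
        .map(|i| Particle { pos: [i as f32 * 0.001, 0.0, 0.0], vel: [0.0; 3] })
        .collect();
    for _frame in 0..60 {
        update(&mut particles, 1.0 / 60.0);
    }
    println!("pos[0].x after 60 frames: {}", particles[0].pos[0]);
}
```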

Throughout the whole test process, I did not open or close any processes, to limit CPU/RAM interference as much as I (easily) could. In between each test I ran: cargo clean && rm -rf ~/.cache/sccache && rm -rf target.


Toolchains

The same tests were performed on the following toolchains:

  • 1.94.0-stable: cleared most of the tests, as results remained consistent when compared to 1.94.0-nightly.
  • 1.94.0-nightly: nightly benchmarking is basically required, as significant options are only accessible from the nightly toolchain. Versus stable:
    • ~1 % variance in execution speed
    • ~10 % increase in compilation artifacts
    • ~15 % up to ~20 % increase in binary size
    Why is nightly heavier? Probable causes: new compiler features and tooling versions that have not yet been fully optimized. It could also be Bevy's cfg(nightly)-only features.
  • 1.95.0-nightly: entirely removed, as it suffered from a significant compilation time regression (issue filed and bisected).

REFERENCE configuration

Every test runs against the reference configuration unless specified otherwise.

[unstable]

[build]

[target.x86_64-unknown-linux-gnu]
linker="cc"

[profile.dev]
incremental = false
debug-assertions = true
opt-level = 0
debug = true
split-debuginfo = "off"
strip = "none"
lto = false # false = "thin-local" unless opt-level=0: then "off"
panic = "unwind"
codegen-units = 256

This means most tests are about toggling one option on or off.

Tests are biased

Options influence each other. When combined, they often yield diminishing returns or, in the worst case, conflict with each other. I tried to compensate for this by toggling multiple options on at once when it made sense (always explicitly stated).

Stable (1.94)

| stable | time (s) | Disk deps (GB) w/ loop | bin size (MB) w/ loop | exec speed (ms) w/ loop |
| --- | --- | --- | --- | --- |
| REFERENCE | 48.5 | 6.9 | 1100 | 158 |
| codegen-units=1 | 53.3 | 4.7 | 784 | 158 |
| opt-level=3 | 96.4 | 8.9 | 1600 | 10.8 |
| lto="thin" | 62.4 | 5 | 312 | 158 |
| debug=false + debug-assertions=false | 42.5 | 2.3 | 87 | 130 |
| strip="symbols" | 47.8 | 4.9 | 55 | 158.4 |
| optimized release build | 69.6 | 1.5 | 20 | 10.4 |
| optimized dev build | 52.3 | 3.0 | 157 | 92.1 |

Nightly (1.94)

Significant improvements in bold. Significant regressions in italic.

Legend: each row toggles only the stated option(s) on top of the REFERENCE config, unless it belongs to the FULL CONFIGURATIONS or INCREMENTAL sub-tables below.

Single-option toggles

| nightly | time (s) | Disk deps (GB) | bin size (MB) | exec speed (ms) | Notes |
| --- | --- | --- | --- | --- | --- |
| REFERENCE (cc) | 48.8 | 7.6 | 1300 | 157.3 | |
| rust-lld linker | 49.7 | 7.6 | 1300 | 157.3 | rust-lld, the WIP official linker |
| clang linker | 59 | 7.5 | 1200 | 158 | |
| cc + mold linker | 49.7 | 7.7 | 1300 | 158.6 | |
| clang + wild linker | 48.7 | 7.6 | 1200 | 158.5 | wild is not compatible with cc |
| cranelift | 43.7 | 4.9 | 916 | 196.9 | |
| codegen-units=32 | 47.5 | 6.5 | 1100 | 157 | |
| codegen-units=16 | 47.6 | 6.2 | 1000 | 157 | |
| codegen-units=1 | 52.9 | 5.2 | 912 | 157 | |
| -Zthreads=1 | 46.9 | 7.6 | 1200 | 158 | Seems to be the default value |
| -Zthreads=8 | 33.8 | 7.6 | 1200 | 158 | != 1.95: here sweet spot is 16 |
| -Zthreads=8 + codegen-units=16 | 31.4 | 7.6 | 1200 | 158 | |
| -Zthreads=16 | 33.4 | 7.6 | 1200 | 158 | 9950X has 16 cores |
| -Zthreads=32 | 40 | 7.6 | 1200 | 158 | 9950X has 32 threads |
| -Zshare-generics=y | 48.6 | 6.1 | 1000 | 158 | |
| split-debuginfo="unpacked" | 49.1 | 6.3 | 238 | 157 | |
| -Ctarget-cpu=native | 49.2 | 7.6 | 1200 | 166 | Surprisingly ineffective, so you should test it. |
| panic="abort" | 48.1 | 7.4 | 1200 | 155.3 | |
| debug=false | 43.15 | 2.3 | 107 | 154.2 | makes split-debuginfo ineffective |
| debug="line-tables-only" | 44.6 | 3.3 | 288 | 154.3 | |
| debug-assertions=false | 48.6 | 7.4 | 1200 | 139.8 | |
| debug=false + debug-assertions=false | 42.8 | 2.7 | 103 | 127 | cumulative |
| strip="symbols" | 48.8 | 5.3 | 55 | 157.9 | |
| strip="debuginfo" | 48.2 | 5.4 | 106 | 158 | |
| lto="thin" | 63.5 | 6.4 | 347 | 157.4 | |
| opt-level=1 | 68.5 | 7.6 | 1400 | 15.4 | |
| opt-level=2 | 75.2 | 8.2 | 1500 | 10.8 | |
| opt-level=3 | 77.3 | 8.2 | 1500 | 14.7 | |
| opt-level="s" | 67.5 | 6.8 | 1200 | 14.7 | Also tried on slimmed binaries: "s" → 72 MB and 11 ms, 3 → 68 MB and 10.5 ms, 0 → 93 MB and 127 ms. "s" is the most balanced option. |
| opt-level="z" | 63.3 | 6.4 | 1100 | 18.9 | |

Full configurations

| config | time (s) | Disk deps (GB) | bin size (MB) | exec speed (ms) | Notes |
| --- | --- | --- | --- | --- | --- |
| REF_RELEASE = opt-level=3, lto="thin", debug=false, debug-assertions=false, strip="symbols", panic="abort", codegen-units=1, split-debuginfo="off", + -Zthreads=8 | 92.4 | 1.5 | 22 | 10.6 | (release profile) |
| REF_RELEASE + -Ctarget-cpu=native | 92.9 | 1.5 | 22 | 10.3 | small performance benefit |
| REF_RELEASE + -Zshare-generics=y | 66.75 | 1.5 | 20 | 10.5 | Free compilation time win for release builds |
| REF_RELEASE + wild | 92.3 | 1.5 | 22 | 10.3 | |
| REF_DEV = opt-level=0, debug=false, debug-assertions=false, split-debuginfo="off", strip=false, lto="off", panic="abort", codegen-units=16, + -Ctarget-cpu=native + -Zthreads=8 + -Zshare-generics=y | 29.34 | 2 | 84 | 117.6 | shortest compilation time (w/o linker) |
| REF_DEV + debug-assertions=true | 29.7 | 2.1 | 88 | 145.2 | |
| REF_DEV + mold | 25.74 | 2.1 | 116 | 118 | |
| REF_DEV + wild | 24.93 | 2 | 85 | 116.8 | |
| REF_DEV + cranelift | 29.5 | 2.7 | 376 | 195 | |
| REF_DEV + cranelift + wild | 24.7 | 2.7 | 366 | 196 | |
| REF_DEV_2 = incremental=true, codegen-units=16, opt-level=0, lto="off", strip="none", panic="unwind", debug=true, split-debuginfo="unpacked", debug-assertions=true, + -Zthreads=8 | 41.5 | 5.2 | 220 | 157.1 | best balance |
| REF_DEV_2 + wild | 30.3 | | | 156.4 | |
| REF_DEV_2 + -Zshare-generics=y + -Ctarget-cpu=native | 33.8 | 6.3 | 219 | 166.5 | |
| REF_DEV_P = REF_DEV_2 + [profile.dev.package."*"] opt-level=3, + wild | 72.2 | 5.1 | 234 | 98.6 | |

Incremental builds

Recompiling a single file (~50 LoC change).

| config | time (s) | Disk deps (GB) | bin size (MB) | Notes |
| --- | --- | --- | --- | --- |
| REFERENCE (cc) | 1.83 | 7.6 | 1300 | |
| REFERENCE + clang linker | 11.3 | 8.3 | 1200 | |
| REFERENCE + cranelift | 1.53 | 4.9 | 916 | |
| REFERENCE + mold | 1.3 | 7.7 | 1300 | |
| REFERENCE + clang + wild | 1.13 | 7.7 | 1200 | |
| REF_DEV_2 + clang | 5.46 | 6.1 | 220 | |
| REF_DEV_2 + -Zshare-generics=y + -Ctarget-cpu=native | 5.3 | 6.1 | 219 | |
| REF_DEV_2 + mold (cc) | 0.83 | 5.3 | 254 | |
| REF_DEV_2 + clang + wild | 0.67 | 5.2 | 220 | This is great. |
| REF_DEV_P + clang + wild | 0.48 | 5.1 | 234 | |

Beyond the compiler

Hardware

Two key factors: RAM and CPU.

RAM

48 GB is the current sweet spot for day-to-day development. The main bottleneck is not the compiler: rust-analyzer will definitely overflow 16 GB. I often witnessed 32 GB machines crashing too.

CPU & OS

CPU and OS are codependent.

macOS and Linux share a common ancestor (UNIX) and are good options.

Linux, whatever the flavour, is a first-class citizen when it comes to development. It just works and it's highly customizable.

macOS M-series chips will outperform pretty much every other laptop and most average desktop CPUs. Only higher-end desktop CPUs will beat them... in multithreaded workloads. Apple has truly impressive hardware with best-in-market single-core performance, great multi-core performance and an unmatched efficiency ratio; as well as unified memory with impressive bandwidth. Depending on your usage, you may encounter incompatibilities caused by the OS ecosystem (ARM or closed-OS related issues, e.g. Metal), but most Rust libraries and tools will be compatible out of the box.

Stay far from Windows if you get to choose. Expect:

  • A ~60 % increase in compilation time.
  • To fall asleep waiting for cargo locks to release.
  • For OneDrive to mess with the target folder.

WSL is a decent way out. Not on par with native Linux though, and it comes with other constraints such as high RAM usage. Also note that fast linkers are not available on Windows.

Good CPUs (single and multi-core performance both matter!) are great for both cold and incremental compilation. While for the latter the difference is barely noticeable in small projects, it makes all the difference in bigger ones.

Project management

Use a workspace, keep crates small

Bad crate management is one of the main causes of long incremental compilation times. Turning a big crate into a modular workspace will definitely shorten compilation time (each independent crate's compilation runs in parallel).

Start projects with a workspace: one binary crate, multiple library crates. Find an architecture that suits the project's needs and apply programming principles to crate management. For example:

  • By following the Single Responsibility Principle, local crates will emerge naturally.
  • Don't Repeat Yourself: dependencies of a given version that are shared across crates should belong to workspace.dependencies, so they are only compiled and linked once.
  • Separation of Concerns: don't mix layers, don't mix responsibilities: rendering in one crate, UI in another.
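A minimal workspace layout following these principles might look like this (crate names are illustrative):

```toml
# Root Cargo.toml
[workspace]
resolver = "2"
members = ["app", "crates/rendering", "crates/ui"]

[workspace.dependencies]
# One shared version, compiled and linked once for the whole workspace.
serde = { version = "1", features = ["derive"] }

# A member crate's Cargo.toml then inherits it with:
# [dependencies]
# serde = { workspace = true }
```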

For local libraries: isolate self-contained code that grows large into a library crate. If you can't, isolate each independent responsibility behind an optional feature. This will also serve your project architecture.

Note that while creating (optional) features is more important for libraries than it is for a user-facing application, it is still a good habit to have.

Be careful though, it is all about balance: a micro-crate architecture can hurt compilation time and runtime optimization.

Unused dependency features

Unused features of your dependencies still get compiled.

  • One solution is to add dependencies but disable default features (--no-default-features).
  • Another is to cargo install cargo-features-manager and then cargo features prune.
  • A third is sometimes to simply switch a dependency for a lighter alternative.
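In Cargo.toml, disabling default features looks like this (the tokio feature selection is just an example; pick the features your code actually uses):

```toml
[dependencies]
# Opt out of default features, then opt back in to only what you need.
tokio = { version = "1", default-features = false, features = ["rt", "macros"] }
```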

Code awareness

At the code level, what impacts compilation time the most? The answer is not functions or variables. It is mainly:

  1. Monomorphization: each generic function or impl block is duplicated for every concrete type it is actually instantiated with in the source code. Ever heard of YAGNI and KISS? Don't write generalized interfaces for a use case you don't have yet.
  2. Macro expansion: Everywhere a macro is called (be it declarative or procedural), the code is expanded.
  3. Procedural macro crates: In a Bevy project, bevy_ecs's derive macros represent a significant chunk of compile time. Use cargo build --timings to identify which proc-macro crates dominate your build, then consider whether you really need all of them.
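To make point 1 concrete, here is a minimal sketch (my own example, not from the benchmark) of the trade: a generic function is compiled once per concrete type it is used with, while dynamic dispatch compiles a single copy at the cost of a vtable call:

```rust
use std::fmt::Display;

// Monomorphized: the compiler emits one copy per concrete T used below.
fn describe_generic<T: Display>(x: T) -> String {
    format!("value: {x}")
}

// Dynamic dispatch: a single compiled copy, resolved through a vtable.
fn describe_dyn(x: &dyn Display) -> String {
    format!("value: {x}")
}

fn main() {
    // Two instantiations -> two copies of describe_generic in the binary.
    println!("{}", describe_generic(42u32));
    println!("{}", describe_generic("hello"));
    // One copy of describe_dyn serves both calls.
    println!("{}", describe_dyn(&42u32));
    println!("{}", describe_dyn(&"hello"));
}
```

cargo-llvm-lines (see the tools list below) is the way to measure exactly this kind of duplication.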

Tools

  • Monitoring: cargo build --timings to see how long each dep takes to compile in an HTML graph.
  • Monitoring: cargo-llvm-lines prints the number of cumulative lines and copies of each generic function in the binary.
  • Monitoring: -Zprint-mono-items=yes to see exactly what is monomorphized in your code.
  • Hygiene: cargo-machete and cargo-udeps to identify unused dependencies.
  • Hygiene: cargo-outdated to identify outdated dependencies (and cargo-edit to automatically update them).

Thanks for reading. I hope it has sparked some new thoughts or insights! Questions, remarks, or caught an inconsistency? My inbox is open, please reach out.