Optimize Rust compilation
From the compiler to how you manage your project, this article is a complete walk-through to improve compilation time, runtime performance and binary size.
Mar 13, 2026
Runtime performance and reliability (compile-time checks) are two of Rust's major strengths. But they come at a cost: compilation time (and a learning curve). Developers consider it one of the language's biggest flaws.
However, hardware, project crate / workspace management, better code and compiler understanding, and tweaking compiler options can help you mitigate this trade-off.
Before jumping in, I'd suggest that you take a quick look at my compiler reference. It is an overview of how the compiler operates, but more importantly, it contains a table referencing all the options we'll benchmark here.
This article is split in two parts:
- Compiler-related optimizations: options, toolchains, and backends one can leverage to improve compile time. Based on tests and actual numbers, available below.
- Environment-related optimizations: Project management, hardware, first and third-party tools.
Preamble: a word on iterative compilation
Cold compilation and iterative compilation are not to be put on the same level, although the former influences the latter.
Cold compilation is compilation from scratch. It matters the most when deploying to production.
Iterative compilation is incremental compilation + linking. This is the one that matters to developers on a daily basis.
Incremental compilation implies the following:
- The frontend (lexer, parser, HIR, MIR) is skipped for all unchanged parts (dependencies and local modules).
- LLVM / Cranelift backend:
  - Unchanged codegen units are skipped (which is why `codegen-units` should remain high).
  - Changed codegen units are recreated.
- All the generated code is linked again.
What is a codegen-unit?
A codegen unit (CGU) is a chunk of MIR handed to the backend (LLVM or Cranelift) as a single unit of parallel work. Each CGU produces one object file (.o).
More units means more parallelism, but less cross-unit optimization (low impact).
So: a performant backend and linker matter a lot for daily Rust development.
Optimizing the compilation process
Depending on the context, one likely wants to either improve: compilation time, runtime performance, or binary size.
Here are some recommendations based on my experience and tests, that you can find in the next section.
Honest spoiler: default cargo behavior is well set, nightly offers big gains, and an alternative linker finishes the job with brio.
There are plenty of resources out there, here are some valuable ones:
- https://doc.rust-lang.org/cargo/guide/build-performance.html, and as a reference: https://doc.rust-lang.org/cargo/reference/profiles.html
- https://corrode.dev/blog/tips-for-faster-rust-compile-times/
- https://bevy.org/learn/quick-start/getting-started/setup/
Synthesis
Options having the biggest impact, per need.
Incremental compilation time
Mostly benefits development iteration speed.
| Option | What it does for Incremental Builds | The Trade-off |
|---|---|---|
| `incremental=true` | Huge speedup: only recompiles actual source changes since the last build. | M increase in `target/` artifact bloat. |
| `wild` linker (`mold` as a fallback) | Drops link times to < 1 s. Link time can be an even bigger bottleneck than compilation depending on the project. | Setup & compatibility restrictions. Negligible runtime-performance cost. |
| `split-debuginfo="unpacked"` | Lightens the work of the linker (and lighter binary). | Free. |
| `lto=false` | Having LTO on forces a global re-link on every minor code change. | Runtime performance. |
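Put together, the options above map to a dev profile like this minimal sketch (the specific `codegen-units` value is a placeholder; tune it, and the linker, to your machine and project):

```toml
# Hypothetical minimal dev profile for fast incremental builds,
# based on the table above; adjust values for your project.
[profile.dev]
incremental = true           # recompile only what changed
lto = "off"                  # avoid a global re-link on every change
split-debuginfo = "unpacked" # lighter work for the linker
codegen-units = 32           # keep CGU parallelism (and skipping) high
```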
Cold compilation time
Benefits development iteration speed (often less important than incremental). Benefits release deployment time.
| Option | What it does for Cold Builds | The Trade-off |
|---|---|---|
| `-Zthreads=N` (N = physical cores) | Parallelizes rustc's frontend work (parsing, analysis, MIR building) across N threads. Biggest cold-build win: 48.8 s → 33.8 s (~30 %) with `-Zthreads=8`. Sweet spot is around the number of physical cores; going beyond (e.g. 32 threads on a 16-core CPU) hurts due to contention. | Nightly only. Stabilization is tracked in rust-lang/rust#122292. |
| `codegen-backend="cranelift"` | Bypasses LLVM, generating unoptimized code really fast. Most valuable in pure edit-compile-test cycles where you never care about runtime speed (UI iteration, TDD loops). | Nightly only. Compatibility restrictions. Runtime execution speed (~25 % slower in these tests). Not always as fast as the sum of other options. |
| `opt-level=0` | Skips optimizations. | Runtime performance. |
| `codegen-units=16` | Enables more parallelism. | Disk space. |
| `sccache` | Shares compiled dependency artifacts in a cached repository across projects. Incompatible with `incremental=true` (they conflict; use one or the other). | Space management: `cargo clean` won't touch it; flush manually after `rustup update`. (Not benchmarked here; included from community consensus.) |
| `-Zshare-generics=y` | Shares monomorphized generics across codegen units within the same crate, avoiding redundant work per CGU. | Only meaningful at higher `codegen-units` counts; pairs well with `opt-level=0`. |
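If you opt for `sccache` (and therefore give up `incremental=true`), wiring it in is a one-line Cargo config change. A sketch, assuming `sccache` is already installed and on your `PATH`:

```toml
# Route every rustc invocation through sccache so dependency
# artifacts are cached and shared across projects.
[build]
rustc-wrapper = "sccache"
```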
Runtime performance
Benefits production (release) performance.
| Option | What it does for Runtime Performance | The Trade-off |
|---|---|---|
| `opt-level=3` | Applies all compiler optimization passes. | XXL penalty to compilation time. L penalty to artifact deps size. L penalty to binary size. |
| `lto="thin"` | Allows the compiler to inline functions from dependencies into your code. | XXL penalty to link times / compilation time. |
| `debug-assertions=false` | Prevents the compiler from emitting runtime assertion checks, e.g. integer overflow checks. | XXL — don't disable this for critical services. |
Binary size
Benefits production (release) systems: either saving cost or enabling more hardware.
| Option | What it does for Binary Size | The Trade-off |
|---|---|---|
| `strip="symbols"` | Rips out all debug symbols and names from the final compiled binary. (XXL reduction) | XL loss of observability. Crash reports will have no stack traces. |
| `opt-level="s"` | Explicitly instructs LLVM to favor small machine code over fast machine code. (XL reduction) | M penalty to execution speed compared to `opt-level=3`. |
| `lto="thin"` | Aggressively finds and deletes unused code across all dependency boundaries. (XL reduction) | XXL penalty to link times / compilation time. |
| `panic="abort"` | Removes the "landing pads" and stack-unwinding code. (M reduction) | Cannot gracefully catch panics; the app just dies instantly. |
| `codegen-units=1` | Setting it to 1 lets LLVM see the whole crate to deduplicate code. (Size reduction by avoidance) | S penalty to compilation time. |
| `-Zshare-generics=y` | Prevents compiling the exact same `Vec<T>` monomorphization multiple times. (L reduction) | Nightly only. XS runtime overhead. |
| `split-debuginfo="unpacked"` | Lighter binary. | No effect if `debug=false`. |
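Combined, a size-first release profile could look like this sketch (stable options only; each line carries the trade-off listed in the table above):

```toml
# Hypothetical size-focused release profile based on the table above.
[profile.release]
strip = "symbols"  # XXL reduction; crash reports lose stack traces
opt-level = "s"    # favor small machine code over fast machine code
lto = "thin"       # delete unused code across dependency boundaries
panic = "abort"    # drop landing pads and unwinding code
codegen-units = 1  # let LLVM deduplicate across the whole crate
```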
A note on linkers
The Cargo `linker` option name is misleading: `cc` is not a linker; it is a compiler driver that delegates to a linker.
wild, mold, and lld are actual linkers.
In these tests, clang + wild consistently produced the shortest link times. mold is a solid fallback when wild is not available.
rust-lld (the WIP official Rust linker) showed no improvement over cc for now.
A note on the toolchain
Nightly offers access to up-to-date dependencies and features such as the parallel frontend, shared generics, and Cranelift.
If your environment allows it, compiling on nightly is the way to go.
If you know the final binary is going to be released on a given stable version X, you can always switch to X-nightly.
~/.cargo/config.toml
On debug, the order of priority is usually:
- compilation time
- ability to debug
- runtime performance
- binary size
On release, the order of priority is usually:
- runtime performance
- ability to trace back crashes
- binary size (but sometimes size matters more than performance, e.g. embedded)
- compilation time
WASM is a different beast. Several options are incompatible.
Based on my rig, tests and available information, here is my ~/.cargo/config.toml:
# Note: [toolchain] is rust-toolchain.toml syntax (read by rustup);
# Cargo itself does not read this section from config.toml.
[toolchain]
channel = "1.94.0-nightly"
[unstable]
# codegen-backend = true
[target.x86_64-unknown-linux-gnu]
linker = "clang"
rustflags = [
"-Zunstable-options",
"-C", "link-arg=--ld-path=wild",
# "-C", "link-arg=-fuse-ld=mold",
# "-C", "linker-features=+lld",
# !Warning: binaries won't work on cpus that have a different architecture!
# "-C", "target-cpu=native",
"-Z", "threads=16",
"-Z", "share-generics=y",
]
# Some WASM-specific optimizations, out of scope
# Cranelift, mold, wild, are WASM-incompatible
[target.wasm32-unknown-unknown]
rustflags = [
"-Zunstable-options",
"-C", "link-arg=--ld-path=wild",
"-C", "target-feature=+bulk-memory,+mutable-globals,+nontrapping-fptoint,+sign-ext",
"-Z", "threads=16",
"-Z", "share-generics=y",
]
[profile.release]
opt-level = 3 # default
debug-assertions = false # default
debug = false # default
codegen-units = 1
lto = "thin"
strip = "symbols"
panic = "abort"
split-debuginfo = "off"
# minimal optimization for my own code in debug
[profile.dev]
incremental = true # default
debug-assertions = true # default
opt-level = 0 # default
debug = true # default
lto = "off" # default
strip = "none" # default
panic = "unwind" # default
split-debuginfo = "unpacked"
codegen-units = 32
# Not useful for me but you should definitely try and compare.
# codegen-backend = "cranelift"
[profile.dev.package."*"]
debug = "line-tables-only"
opt-level = 3
debug-assertions = false
Where does each option go?
Options under [profile.*] and [profile.*.package."*"] can go in either ~/.cargo/config.toml or local crate Cargo.toml.
Options like linker, [build], [target.*], and [unstable] (including -Z flags passed via rustflags) go in ~/.cargo/config.toml.
Note that [profile.release.package."*"] targets all dependencies and excludes all local crates.
Of course, one may need panic="unwind" or keep debug = "line-tables-only". It may be an acceptable trade-off.
Testing compiler options
Results may vary.
Expect a variance of roughly ±2–3 %. The relative ranking between options is consistent across runs, but absolute numbers should be read as approximations.
My rig
- CPU: Ryzen 9 9950X. 16 cores, 32 threads.
- RAM: 64 GB running at 6000 MHz CAS 30.
- SSD: NVMe 980 Pro.
- OS family: Linux (EndeavourOS)
Set up
Simple: cargo new, cargo add bevy@0.18.1.
And a basic main.rs that spawns particles (entities) that are then updated (transform) with some heavy floating-point math every frame.
Throughout the whole test process, I avoided starting or stopping other programs, to limit CPU/RAM interference as much as I (easily) could.
Also, between each test I ran: cargo clean && rm -rf ~/.cache/sccache && rm -rf target.
Toolchains
The same tests were performed on the following toolchains:
- 1.94.0-stable: dropped from most of the tests, as its results remained consistent with 1.94.0-nightly.
- 1.94.0-nightly
Nightly benchmarking is basically required as significant options are only accessible from the nightly toolchain.
Versus stable:
- ~1 % variance in execution speed
- An increase of ~10 % compilation artifacts
- An increase of ~15 % up to ~20 % binary size
Why is nightly heavier? Probable causes: new compiler features and tooling versions that have not been fully optimized yet. It could also be bevy `cfg(nightly)`-only features.
- 1.95.0-nightly: entirely removed, as it suffered from a significant compilation-time regression (issue filed and bisected).
REFERENCE configuration
Every test runs against the reference configuration unless specified otherwise.
[unstable]
[build]
[target.x86_64-unknown-linux-gnu]
linker="cc"
[profile.dev]
incremental = false
debug-assertions = true
opt-level = 0
debug = true
split-debuginfo = "off"
strip = "none"
lto = false # false = "thin-local" unless opt-level=0: then "off"
panic = "unwind"
codegen-units = 256
This means most tests are about toggling on/off 1 option.
Tests are biased
Options influence each other. When combined, they often yield diminishing returns or in the worst case, conflict with each other. I tried to compensate for this by toggling on multiple options at once when it made sense (always explicitly stated).
Stable (1.94)
| stable | time (s) | Disk deps (GB) w/ loop | bin size (MB) w/ loop | exec speed (ms) w/ loop |
|---|---|---|---|---|
| REFERENCE | 48.5 | 6.9 | 1100 | 158 |
| `codegen-units=1` | 53.3 | 4.7 | 784 | 158 |
| `opt-level=3` | 96.4 | 8.9 | 1600 | 10.8 |
| `lto="thin"` | 62.4 | 5 | 312 | 158 |
| `debug=false` + `debug-assertions=false` | 42.5 | 2.3 | 87 | 130 |
| `strip="symbols"` | 47.8 | 4.9 | 55 | 158.4 |
| optimized release build | 69.6 | 1.5 | 20 | 10.4 |
| optimized dev build | 52.3 | 3.0 | 157 | 92.1 |
Nightly (1.94)
Significant improvements in bold. Significant regressions in italic.
Legend: each row toggles only the stated option(s) on top of the REFERENCE config, unless it belongs to the FULL CONFIGURATIONS or INCREMENTAL sub-tables below.
Single-option toggles
| nightly | time (s) | Disk deps (GB) | bin size (MB) | exec speed (ms) | Notes |
|---|---|---|---|---|---|
| REFERENCE (`cc`) | 48.8 | 7.6 | 1300 | 157.3 | |
| `rust-lld` linker | 49.7 | 7.6 | 1300 | 157.3 | rust-lld, the WIP official linker |
| `clang` linker | 59 | 7.5 | 1200 | 158 | |
| `cc` + `mold` linker | 49.7 | 7.7 | 1300 | 158.6 | |
| `clang` + `wild` linker | 48.7 | 7.6 | 1200 | 158.5 | wild is not compatible with cc |
| cranelift | 43.7 | 4.9 | 916 | 196.9 | |
| `codegen-units=32` | 47.5 | 6.5 | 1100 | 157 | |
| `codegen-units=16` | 47.6 | 6.2 | 1000 | 157 | |
| `codegen-units=1` | 52.9 | 5.2 | 912 | 157 | |
| `-Zthreads=1` | 46.9 | 7.6 | 1200 | 158 | Seems to be the default value |
| `-Zthreads=8` | 33.8 | 7.6 | 1200 | 158 | != 1.95: here sweet spot is 16 |
| `-Zthreads=8` + `codegen-units=16` | 31.4 | 7.6 | 1200 | 158 | |
| `-Zthreads=16` | 33.4 | 7.6 | 1200 | 158 | 9950X has 16 cores |
| `-Zthreads=32` | 40 | 7.6 | 1200 | 158 | 9950X has 32 threads |
| `-Zshare-generics=y` | 48.6 | 6.1 | 1000 | 158 | |
| `split-debuginfo="unpacked"` | 49.1 | 6.3 | 238 | 157 | |
| `-Ctarget-cpu=native` | 49.2 | 7.6 | 1200 | 166 | Surprisingly ineffective, so you should test it. |
| `panic="abort"` | 48.1 | 7.4 | 1200 | 155.3 | |
| `debug=false` | 43.15 | 2.3 | 107 | 154.2 | makes split-debuginfo ineffective |
| `debug="line-tables-only"` | 44.6 | 3.3 | 288 | 154.3 | |
| `debug-assertions=false` | 48.6 | 7.4 | 1200 | 139.8 | |
| `debug=false` + `debug-assertions=false` | 42.8 | 2.7 | 103 | 127 | cumulative |
| `strip="symbols"` | 48.8 | 5.3 | 55 | 157.9 | |
| `strip="debuginfo"` | 48.2 | 5.4 | 106 | 158 | |
| `lto="thin"` | 63.5 | 6.4 | 347 | 157.4 | |
| `opt-level=1` | 68.5 | 7.6 | 1400 | 15.4 | |
| `opt-level=2` | 75.2 | 8.2 | 1500 | 10.8 | |
| `opt-level=3` | 77.3 | 8.2 | 1500 | 14.7 | |
| `opt-level="s"` | 67.5 | 6.8 | 1200 | 14.7 | Also tried on slimmed binaries: "s" → 72 MB and 11 ms, vs 3 → 68 MB and 10.5 ms, vs 0 → 93 MB and 127 ms. "s" is the most balanced option. |
| `opt-level="z"` | 63.3 | 6.4 | 1100 | 18.9 | |
Full configurations
| config | time (s) | Disk deps (GB) | bin size (MB) | exec speed (ms) | Notes |
|---|---|---|---|---|---|
| REF_RELEASE = `opt-level=3`, `lto="thin"`, `debug=false`, `debug-assertions=false`, `strip="symbols"`, `panic="abort"`, `codegen-units=1`, `split-debuginfo="off"`, + `-Zthreads=8` | 92.4 | 1.5 | 22 | 10.6 | (release profile) |
| REF_RELEASE + `-Ctarget-cpu=native` | 92.9 | 1.5 | 22 | 10.3 | small performance benefit |
| REF_RELEASE + `-Zshare-generics=y` | 66.75 | 1.5 | 20 | 10.5 | Free compilation-time win for release builds |
| REF_RELEASE + wild | 92.3 | 1.5 | 22 | 10.3 | |
| REF_DEV = `opt-level=0`, `debug=false`, `debug-assertions=false`, `split-debuginfo="off"`, `strip=false`, `lto="off"`, `panic="abort"`, `codegen-units=16`, + `-Ctarget-cpu=native` + `-Zthreads=8` + `-Zshare-generics=y` | 29.34 | 2 | 84 | 117.6 | shortest compilation time (w/o linker) |
| REF_DEV + `debug-assertions=true` | 29.7 | 2.1 | 88 | 145.2 | |
| REF_DEV + mold | 25.74 | 2.1 | 116 | 118 | |
| REF_DEV + wild | 24.93 | 2 | 85 | 116.8 | |
| REF_DEV + cranelift | 29.5 | 2.7 | 376 | 195 | |
| REF_DEV + cranelift + wild | 24.7 | 2.7 | 366 | 196 | |
| REF_DEV_2 = `incremental=true`, `codegen-units=16`, `opt-level=0`, `lto="off"`, `strip="none"`, `panic="unwind"`, `debug=true`, `split-debuginfo="unpacked"`, `debug-assertions=true`, + `-Zthreads=8` | 41.5 | 5.2 | 220 | 157.1 | best balance |
| REF_DEV_2 + wild | 30.3 | | | 156.4 | |
| REF_DEV_2 + `-Zshare-generics=y` + `-Ctarget-cpu=native` | 33.8 | 6.3 | 219 | 166.5 | |
| REF_DEV_P = REF_DEV_2 + `[profile.dev.package."*"]` `opt-level=3` + wild | 72.2 | 5.1 | 234 | 98.6 | |
Incremental builds
Recompiling a single file (~50 LoC change).
| config | time (s) | Disk deps (GB) | bin size (MB) | Notes |
|---|---|---|---|---|
| REFERENCE (`cc`) | 1.83 | 7.6 | 1300 | |
| REFERENCE + `clang` linker | 11.3 | 8.3 | 1200 | |
| REFERENCE + cranelift | 1.53 | 4.9 | 916 | |
| REFERENCE + mold | 1.3 | 7.7 | 1300 | |
| REFERENCE + clang + wild | 1.13 | 7.7 | 1200 | |
| REF_DEV_2 + clang | 5.46 | 6.1 | 220 | |
| REF_DEV_2 + `-Zshare-generics=y` + `-Ctarget-cpu=native` | 5.3 | 6.1 | 219 | |
| REF_DEV_2 + mold (cc) | 0.83 | 5.3 | 254 | |
| REF_DEV_2 + clang + wild | 0.67 | 5.2 | 220 | This is great. |
| REF_DEV_P + clang + wild | 0.48 | 5.1 | 234 | |
Beyond the compiler
Hardware
Two key factors: RAM and CPU.
RAM
48 GB is the current sweet spot for day-to-day development. The main bottleneck is not the compiler: rust-analyzer will definitely overflow 16 GB, and I have often seen 32 GB machines crash too.
CPU & OS
CPU and OS are codependent.
macOS and Linux share a common ancestor (UNIX) and are good options.
Linux, whatever the flavour, is a first-class citizen when it comes to development. It just works and it's highly customizable.
macOS M-series chips will outperform pretty much every other laptop and most average desktop CPUs. Only higher-end desktop CPUs will beat them... in multithreaded workloads. Apple has truly impressive hardware with best-in-market single-core performance, great multi-core performance and an unmatched efficiency ratio; as well as unified memory with impressive bandwidth. Depending on your usage, you may encounter incompatibilities caused by the OS ecosystem (ARM or closed-OS related issues, e.g. Metal), but most Rust libraries and tools will be compatible out of the box.
Stay far from Windows if you get to choose. Expect:
- A compilation time increase of around 60 %.
- To fall asleep waiting for cargo locks to release.
- For OneDrive to mess with the target folder.
WSL is a decent way out. Not on par with native Linux though, and it comes with other constraints such as high RAM usage. Also note that fast linkers are not available on Windows.
Good CPUs (single and multi-core performance both matter!) are great for both cold and incremental compilation. While for the latter the difference is barely noticeable in small projects, it makes all the difference in bigger ones.
Project management
Use a workspace, keep crates small
Bad crate management is one of the main causes of long incremental compilation times. Turning a big crate into a modular workspace will definitely shorten compilation time (each independent crate's compilation runs in parallel).
Start projects with a workspace: one binary crate, multiple library crates. Find an architecture that suits the project's needs and apply programming principles to crate management. For example:
- By following the Single Responsibility Principle, local crates will emerge naturally.
- Don't Repeat Yourself: dependencies of a given version that are shared across crates should belong to `workspace.dependencies`, so they are only compiled and linked once.
- Separation of Concerns: don't mix layers, don't mix responsibilities: rendering in one crate, UI in another.
For local libraries: isolate self-contained code that grows large into a library crate. If you can't, isolate each independent responsibility behind an optional feature. This will also serve your project architecture.
Note that while creating (optional) features is more important for libraries than it is for a user-facing application, it is still a good habit to have.
Be careful though, it is all about balance: a micro-crate architecture can hurt compilation time and runtime optimization.
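As a sketch, a workspace root `Cargo.toml` following these principles might look like this (crate names are made up for illustration):

```toml
# Hypothetical workspace: one binary crate, focused library crates.
[workspace]
resolver = "2"
members = ["app", "rendering", "ui"]

# Shared dependencies are declared once here...
[workspace.dependencies]
serde = { version = "1", features = ["derive"] }

# ...and each member crate inherits them with:
#   [dependencies]
#   serde = { workspace = true }
```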
Unused dependency features
Unused features of your dependencies still get compiled.
- One solution is to add dependencies with default features disabled (`--no-default-features`).
- Another is to `cargo install cargo-features-manager` and then run `cargo features prune`.
- A third is sometimes to simply switch a dependency for a lighter alternative.
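For the first option, here is what opting out of default features looks like in `Cargo.toml` (tokio is used as an illustrative example; check each crate's docs for what its default features pull in):

```toml
# Disable the default feature set and re-enable only what you use.
[dependencies]
tokio = { version = "1", default-features = false, features = ["rt", "macros"] }
```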
Code awareness
At the code level, what impacts compilation time the most? The answer is not functions or variables. It is mainly:
- Monomorphization: Each generic (T) and trait will duplicate the function/impl block for all the concrete type variants that are actually used by the source code. Ever heard of YAGNI and KISS? Don't write generalized interfaces for a use case you don't have yet.
- Macro expansion: Everywhere a macro is called (be it declarative or procedural), the code is expanded.
- Procedural macro crates: in a Bevy project, `bevy_ecs`'s derive macros represent a significant chunk of compile time. Use `cargo build --timings` to identify which proc-macro crates dominate your build, then consider whether you really need all of them.
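To make the monomorphization point concrete, here is a small sketch: the generic function is compiled once per concrete type it is called with, while the `dyn` version is compiled exactly once, at the price of a vtable call at runtime:

```rust
use std::fmt::Display;

// Monomorphized: the compiler emits one copy of this function per
// concrete type it is used with (here: i32, f64, &str => 3 copies).
fn describe_generic<T: Display>(value: T) -> String {
    format!("value = {value}")
}

// Dynamic dispatch: a single copy is compiled, no matter how many
// types are passed in. Slightly slower at runtime, much cheaper to compile.
fn describe_dyn(value: &dyn Display) -> String {
    format!("value = {value}")
}

fn main() {
    // Each call below forces a fresh monomorphization.
    println!("{}", describe_generic(42_i32));
    println!("{}", describe_generic(1.5_f64));
    println!("{}", describe_generic("hello"));

    // These calls all go through the same compiled function.
    println!("{}", describe_dyn(&42_i32));
    println!("{}", describe_dyn(&"hello"));
}
```

`cargo-llvm-lines` (mentioned below) will show each `describe_generic` instantiation as a separate entry, while `describe_dyn` appears once.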
Tools
- Monitoring: `cargo build --timings` shows how long each dependency takes to compile, in an HTML graph.
- Monitoring: `cargo-llvm-lines` prints the number of cumulative lines and copies of each generic function in the binary.
- Monitoring: `-Zprint-mono-items=yes` shows exactly what is monomorphized in your code.
- Hygiene: `cargo-machete` and `cargo-udeps` identify unused dependencies.
- Hygiene: `cargo-outdated` identifies outdated dependencies (and `cargo-edit` automatically updates them).
Thanks for reading. I hope it has sparked some new thoughts or insights! Questions, remarks, or caught an inconsistency? My inbox is open, please reach out.