Profiling Python CLI Startup Time

Before you speed up a slow CLI, you have to know where the time actually goes — and for almost every Python CLI, the answer is imports, not your code. This guide shows you how to profile startup precisely: find the expensive imports with python -X importtime, visualize the tree with tuna, measure real wall-clock time with hyperfine, and lock in your gains with a startup budget you enforce in CI.

TL;DR

Run python -X importtime -c "import yourcli.cli" and read the cumulative column. The imports with the largest cumulative values are your targets; each parent is listed below the modules it pulls in.
Visualize the same data with tuna to see the import waterfall as a flame graph.
Measure end-to-end wall-clock time of the real command with hyperfine 'yourcli --help' — that's what users feel.
The usual offenders are requests, pandas, pydantic, and cloud SDKs; interpreter site setup adds a fixed floor you can inspect with -S.
Encode a budget as a pytest that asserts --help runs under N milliseconds, so a heavy import added later fails CI instead of taxing every user.

Find expensive imports with -X importtime

-X importtime is a built-in interpreter flag that logs every import and how long it took. Point it at importing your CLI's top module — that import is what runs on startup:

$ python -X importtime -c "import yourcli.cli" 2> importtime.log

The output goes to stderr (hence 2>), one line per import, with three columns:

import time:      self [us] | cumulative | imported package
import time:       125 |        125 |   collections.abc
import time:      2140 |      48210 |     pandas
import time:       310 |      51900 |   yourcli.commands.convert
import time:       180 |      54120 | yourcli.cli

self — time spent importing just that module, excluding its children.
cumulative — time for that module and everything it imports. This is the column that matters.
Indentation shows the import tree: a more deeply indented module was pulled in by the less-indented module printed below it, because each line is written only once that module has finished importing.

Read it by cumulative cost, starting from the largest values. In the sample above, pandas at 48 ms cumulative is dragged in by yourcli.commands.convert, so the fix is to stop importing that command module at startup. You are hunting for a single parent import with a large cumulative number; deferring that one import removes its whole subtree from the startup path.

A useful one-liner to surface the worst offenders sorts the log by cumulative time:

$ python -X importtime -c "import yourcli.cli" 2>&1 \
    | sort -t'|' -k2 -n -r | head -15
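When you want the ranking as data rather than text, the same log can be parsed in Python. The following is a minimal sketch, assuming the log has the three-column layout shown in the sample above and that each nesting level adds two spaces of indentation; the names ImportRecord, parse_importtime, and worst_offenders are illustrative, not part of any library.

```python
import re
from typing import NamedTuple

# Matches data lines such as "import time:      2140 |      48210 |     pandas".
# The header line has words where the numbers go, so it never matches.
_LINE = re.compile(r"^import time:\s+(\d+)\s+\|\s+(\d+)\s+\|(\s*)(\S.*)$")


class ImportRecord(NamedTuple):
    module: str
    self_us: int
    cumulative_us: int
    depth: int  # nesting level derived from indentation (assumed 2 spaces per level)


def parse_importtime(text: str) -> list[ImportRecord]:
    """Turn raw -X importtime output into records, skipping unrelated lines."""
    records = []
    for line in text.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue
        self_us, cumulative_us, indent, module = match.groups()
        # One space always follows the "|"; the rest is tree indentation.
        depth = max(len(indent) - 1, 0) // 2
        records.append(
            ImportRecord(module.strip(), int(self_us), int(cumulative_us), depth)
        )
    return records


def worst_offenders(text: str, limit: int = 15) -> list[ImportRecord]:
    """Return the records with the largest cumulative time, biggest first."""
    ranked = sorted(parse_importtime(text), key=lambda r: r.cumulative_us, reverse=True)
    return ranked[:limit]
```

Feeding it the contents of importtime.log gives the same ordering as the sort one-liner, with the depth field available if you want to filter out deeply nested children and look only at direct imports of your own package.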

Visualize the tree with tuna

Raw importtime output gets unwieldy for a large app. tuna turns it into an interactive flame graph in your browser — the wide bars are the expensive subtrees, and you can click to zoom:

$ uv tool install tuna          # or: pipx install tuna
$ python -X importtime -c "import yourcli.cli" 2> importtime.log
$ tuna importtime.log

The graphical view makes it obvious when one dependency dominates — a single wide block for pandas or boto3 next to a sea of thin standard-library imports tells you exactly what to defer. Installing tuna as an isolated tool keeps it out of your project's environment; if you're weighing how to install these dev tools, see uv tool install vs pipx for CLIs.

Measure real wall-clock time with hyperfine

importtime measures imports; it does not measure the full user-visible latency, which also includes interpreter start and site initialization. For the number users actually feel, benchmark the real command with hyperfine, which runs it many times and reports the mean, standard deviation, and min/max range, with optional warmup runs:

$ hyperfine --warmup 3 'yourcli --help'
Benchmark 1: yourcli --help
  Time (mean ± σ):     412.6 ms ±   9.1 ms    [User: 380.2 ms, System: 41.7 ms]
  Range (min … max):   401.3 ms … 428.9 ms    10 runs

--warmup 3 runs the command three times before measurement starts and discards those runs, so filesystem and bytecode caches are warm; otherwise the first run's .pyc compilation skews the mean. Use hyperfine to compare two versions directly:

$ hyperfine --warmup 3 'yourcli-eager --help' 'yourcli-lazy --help'

It prints a "N times faster" ratio, which is the single most convincing before/after number to put in a pull request. This end-to-end measurement is what you optimize against; importtime just tells you which import to attack.
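If you want that ratio in a script, for example to post it on a pull request, hyperfine can write its results to a file with --export-json. The sketch below assumes the export has a top-level "results" list whose entries carry "command" and "mean" (in seconds); the function names and the startup.json filename are illustrative.

```python
import json


def load_export(path: str) -> dict:
    """Read a file produced by `hyperfine --export-json <path> ...`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def speedup(export: dict) -> tuple[str, str, float]:
    """Return (fastest command, slowest command, slowest mean / fastest mean)."""
    results = export["results"]
    fastest = min(results, key=lambda r: r["mean"])
    slowest = max(results, key=lambda r: r["mean"])
    return fastest["command"], slowest["command"], slowest["mean"] / fastest["mean"]


def describe(export: dict) -> str:
    """Format the comparison as a one-line summary."""
    fast, slow, ratio = speedup(export)
    return f"{fast!r} ran {ratio:.2f}x faster than {slow!r}"
```

A typical use would be to run hyperfine with --export-json startup.json and then print describe(load_export("startup.json")).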

Understanding the interpreter's fixed floor

Not all startup time is your imports. Python itself does work before your code runs: initializing built-in modules and importing the site module, which adds site-packages to sys.path and processes any .pth files it finds there. You can measure that floor by disabling site initialization with -S:

$ hyperfine --warmup 3 'python -c pass' 'python -S -c pass'

The difference is your site overhead. In environments with many installed packages (lots of .pth files), site processing can add tens of milliseconds. You generally shouldn't ship -S, because without site initialization site-packages is not added to sys.path and third-party packages stop being importable, but knowing the floor tells you how much of your startup is even addressable. If python -c pass already takes 40 ms, no amount of import-deferring gets your CLI below that.
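Where hyperfine isn't installed, the same floor can be roughly estimated from Python itself. This is a sketch, not a substitute for a proper benchmark: it times the current interpreter running pass, with and without -S, and takes the best of a few samples. The helper names are made up for illustration.

```python
import subprocess
import sys
import time


def best_startup_ms(extra_flags=(), runs=5):
    """Best-of-N wall-clock time, in ms, to start Python and run `pass`."""
    cmd = [sys.executable, *extra_flags, "-c", "pass"]
    subprocess.run(cmd, check=True)  # throwaway run to warm caches
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        samples.append((time.perf_counter() - start) * 1000)
    return min(samples)


def site_overhead_ms(runs=5):
    """Estimated cost of site initialization: normal startup minus -S startup."""
    return best_startup_ms(runs=runs) - best_startup_ms(("-S",), runs=runs)
```

On a noisy machine the difference can come out slightly negative; treat the result as an order-of-magnitude estimate and rely on hyperfine for numbers you intend to publish.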

Spotting the usual offenders

A handful of popular libraries dominate CLI startup profiles. Knowing them lets you predict problems before profiling:

requests: pulls in urllib3, charset_normalizer, and certifi; commonly 30–60 ms. For a CLI that only occasionally makes HTTP calls, defer it or use urllib.request or httpx behind a lazy import.
pandas: imports NumPy, pytz, and its own large module tree; 100–300 ms is typical. Almost never belongs at a module top level in a CLI.
pydantic: v2's compiled core is fast at runtime but still adds noticeable import cost; if you use it only for one command's config model, defer it.
Cloud SDKs (boto3, google-cloud-*): among the heaviest, often 200 ms+, because importing them pulls in a large tree of supporting modules (botocore, in boto3's case) before any service client is created.

The pattern is consistent: these are fine inside the command that needs them and expensive at your package's top level. The fix is deferral — either an import inside the function or, better, lazy loading the whole subcommand so the module holding the heavy import isn't touched until invoked. Profile to confirm which offender you actually have before rewriting anything.
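One way to sketch that deferral is a command table that stores import paths as strings and resolves them only when a command is invoked. The example below uses standard-library modules as stand-ins for heavy dependencies so that it runs anywhere; in a real CLI the entries would point at your own command modules (something like "yourcli.commands.convert:main", which is hypothetical here).

```python
import importlib

# Command name -> "module:attribute". Nothing on the right-hand side is
# imported until the command is actually run. The stdlib targets below are
# placeholders for real subcommand modules.
COMMANDS = {
    "mean": "statistics:mean",
    "dumps": "json:dumps",
}


def resolve(name):
    """Import the module behind a command on first use and return its callable."""
    module_name, _, attribute = COMMANDS[name].partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


def run(name, *args, **kwargs):
    """Dispatch to a command, paying its import cost only now."""
    return resolve(name)(*args, **kwargs)
```

With this shape, --help only needs the keys of COMMANDS, so a heavy import inside one command module costs nothing until that command is chosen. CLI frameworks offer their own hooks for lazy subcommands; this sketch only shows the underlying idea.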

A pytest that enforces a startup budget

Profiling is a one-time act; a budget keeps the win permanent. Encode your target as a test that runs the real CLI in a subprocess and asserts it finishes under a threshold:

# tests/test_startup_budget.py
import subprocess
import sys
import time

BUDGET_MS = 200  # generous headroom over local measurement for slower CI

def _time_help() -> float:
    start = time.perf_counter()
    result = subprocess.run(
        [sys.executable, "-m", "yourcli", "--help"],
        capture_output=True,
        text=True,
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert result.returncode == 0, result.stderr
    return elapsed_ms

def test_help_within_budget() -> None:
    # Warm bytecode/filesystem caches, then take the best of three runs
    # to reduce CI noise from a single unlucky sample.
    _time_help()
    best = min(_time_help() for _ in range(3))
    assert best < BUDGET_MS, f"--help took {best:.0f} ms (budget {BUDGET_MS} ms)"

Two design choices make this robust rather than flaky: warm the caches with a throwaway run first, and take the best of several samples so one noisy CI moment doesn't fail the build. Set BUDGET_MS well above your measured local time — CI runners are slower and shared — and tighten it only if you have headroom. The goal isn't a precise benchmark; it's a tripwire. The day someone adds import boto3 to a shared module, this test fails with a clear message pointing at the regression, and you profile again with importtime to find the new offender.
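Because CI runners and laptops differ, it can help to let the environment override the budget instead of hard-coding one number. A minimal sketch, assuming an environment variable named STARTUP_BUDGET_MS (a name invented for this example, not a pytest or CI convention):

```python
import os

DEFAULT_BUDGET_MS = 200.0


def startup_budget_ms(environ=None):
    """Return the startup budget in ms, honoring an optional override.

    STARTUP_BUDGET_MS is a hypothetical variable name; pick whatever fits
    your CI configuration.
    """
    env = os.environ if environ is None else environ
    raw = env.get("STARTUP_BUDGET_MS", "").strip()
    if not raw:
        return DEFAULT_BUDGET_MS
    budget = float(raw)
    if budget <= 0:
        raise ValueError(f"STARTUP_BUDGET_MS must be positive, got {raw!r}")
    return budget
```

In the test module, BUDGET_MS = startup_budget_ms() would then pick up a looser value set on a slow runner while keeping the default for local runs.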

Production notes

Always warm up. The first invocation compiles .pyc files and populates OS caches, so an unwarmed measurement can overstate startup by tens of milliseconds. Both hyperfine --warmup and the test above account for this.
Profile in a clean environment. A cluttered virtualenv with many .pth files inflates site overhead; measure in something close to what users install into. Virtual environment isolation keeps this reproducible.
-X importtime counts each import once. A module already in sys.modules is not loaded again and does not get a second line, so import order affects the attribution: the first importer of a shared dependency gets "charged" for it. Read the tree, not just one line.
CI numbers are not local numbers. Don't copy your laptop's 90 ms into the budget; measure on the CI runner and add margin. A budget that flaps erodes trust and gets disabled.
Measure the hot path, not just --help. If shell completion is latency-critical, benchmark the completion invocation too — it runs your CLI on every Tab. See shell completion for Python CLIs.

CLI Startup Performance and Lazy Loading — the overview that frames the fixes.
Lazy loading subcommands for faster startup — the deferral technique your profiling justifies.
uv tool install vs pipx for CLIs — installing tools like tuna in isolation.
Virtual environments & isolation best practices — profiling in a clean, reproducible environment.