CLI Startup Performance and Lazy Loading

Python CLIs get slow to start long before they get slow to run. The culprit is almost never your code — it's the import graph that fires the instant the interpreter loads your entry module, dragging in pandas, requests, or a cloud SDK before the user has even chosen a subcommand. This section shows you how to measure where the startup time actually goes, then defer the expensive work so --help, tab completion, and quick commands stay instant.

TL;DR

Startup latency is felt most where users don't expect any work: --help, shell completion, --version, and short-lived commands in scripts and loops.
On a cold CLI, imports dominate — often 80–95% of wall-clock time before your function runs. Your logic is rarely the problem.
Measure first. Use python -X importtime to find the expensive imports before you change anything; optimizing by guesswork wastes effort on cheap imports.
Three fixes, in order of payoff: move module-level heavy imports inside the functions that use them; lazy-load whole subcommands so their dependencies load only when invoked; and avoid pulling heavy libraries into your top-level package at all.
Set a startup budget (e.g. --help under 100 ms) and assert it in CI so a careless import can't quietly regress it.

Why startup latency matters

A CLI is not a web server that pays its startup cost once and amortizes it over millions of requests. It pays that cost on every single invocation. If your tool takes 600 ms to print --help, those 600 ms are charged to the user every time, and they compound in the places that should feel free:

Shell completion. Every time the user hits Tab, your CLI runs to produce candidates. If completion takes 400 ms, the shell feels broken — users stop pressing Tab. This is the single most latency-sensitive path in any CLI. See shell completion for Python CLIs for how that path is wired.
--help and --version. These are pure metadata. A user running mycli --help to remember a flag should never wait on pandas importing NumPy and pytz.
Scripts and loops. When your CLI is called inside a shell loop — for f in *.csv; do mycli convert "$f"; done — a 500 ms startup on 1,000 files is over eight minutes of pure import overhead.

Sub-100 ms feels instant. Past ~250 ms, interactive use starts to feel sluggish. The good news: the fix is almost always mechanical once you know where the time goes.
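The shell-loop arithmetic is worth making explicit, because the overhead scales linearly with both the startup time and the number of invocations. A small sketch; the function name and the 80 ms "after" figure are illustrative assumptions, not measurements:

```python
def loop_overhead_seconds(startup_ms: float, invocations: int) -> float:
    """Total time spent on startup alone across repeated invocations."""
    return startup_ms * invocations / 1000


# 500 ms of startup across 1,000 files: 500 s, a little over eight minutes.
slow = loop_overhead_seconds(500, 1_000)
# The same loop at an assumed 80 ms startup after deferring imports.
fast = loop_overhead_seconds(80, 1_000)
print(f"{slow / 60:.1f} min vs {fast / 60:.1f} min")  # prints: 8.3 min vs 1.3 min
```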

Where the time goes: imports dominate

Here is the mental model. When the shell runs mycli, Python starts, initializes the site machinery, imports your entry-point module, and then parses arguments and dispatches. That import step transitively pulls in everything your modules reference at the top level. A single import pandas at the top of a command module can cost 150–300 ms on its own — and it runs whether or not the user invoked the command that needs it.

Consider a CLI with convert (needs pandas), fetch (needs requests), and report (needs matplotlib). If all three command modules are imported when the app is constructed, then mycli --help pays for all three. The user asked for one line of help and got charged for the entire dependency tree.

The insight most people miss: your own code is almost never the bottleneck. Parsing arguments, building a Click group, and dispatching typically take a few milliseconds at most. The rest of the wall-clock time is spent inside import statements in third-party packages you don't control. That is why the fix is about when imports happen, not about making anything faster.
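One way to see this split on your own machine is to time a bare interpreter against an interpreter that performs a single import. The sketch below is a rough illustration, not a benchmark harness: run_ms is a hypothetical helper, and the standard library's decimal module stands in for a heavy dependency so the example runs anywhere.

```python
import subprocess
import sys
import time


def run_ms(code: str, repeats: int = 3) -> float:
    """Best-of-N wall-clock time, in ms, to run `code` in a fresh interpreter."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", code], check=True, capture_output=True)
        best = min(best, (time.perf_counter() - start) * 1000)
    return best


baseline = run_ms("pass")               # interpreter startup alone
with_import = run_ms("import decimal")  # stand-in; try "import pandas" if installed
print(f"interpreter: {baseline:.0f} ms, with import: {with_import:.0f} ms")
```

The difference between the two numbers is the cost attributable to the import; the baseline is a floor that no amount of lazy loading will remove.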

The measure-first rule

Do not optimize startup by intuition. The expensive import is frequently not the one you'd guess — a validation library or a lazily-configured logging package can outweigh the "obviously heavy" one. Python has a built-in profiler for exactly this:

$ python -X importtime -c "import mycli.cli" 2> importtime.log

Each line of the log reports a module's self time and cumulative time in microseconds. The cumulative column includes the cost of everything that module imports in turn, so the largest cumulative values mark your targets. Nested imports are printed before the parent that triggered them, so the top-level modules appear near the bottom of the log. Reading that output, visualizing it with tuna, timing the real command with hyperfine, and spotting the usual offenders is a topic on its own: profiling Python CLI startup time walks through the full workflow. Measure, change one thing, measure again. Anything else is superstition.
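If you only want the worst offenders, the log is easy to sort yourself. The following is a minimal sketch that assumes the three-column "import time: self | cumulative | name" layout CPython writes to stderr; slowest_imports is a hypothetical helper, and json is used as the target only so the example runs without your package installed.

```python
import re
import subprocess
import sys

# Matches lines such as: "import time:       123 |        456 |   json.decoder"
# The header line does not match because its columns are not numeric.
LINE = re.compile(r"^import time:\s+(\d+)\s+\|\s+(\d+)\s+\|\s+(\S+)")


def slowest_imports(module: str, top: int = 10) -> list[tuple[str, float]]:
    """Return (module name, cumulative ms) pairs, slowest first."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True, text=True, check=True,
    )
    rows = []
    for line in proc.stderr.splitlines():  # importtime reports on stderr
        match = LINE.match(line)
        if match:
            _self_us, cumulative_us, name = match.groups()
            rows.append((name, int(cumulative_us) / 1000))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


for name, ms in slowest_imports("json", top=5):  # swap in "mycli.cli"
    print(f"{ms:8.2f} ms  {name}")
```

Because cumulative time includes children, a parent package and its slowest submodule will often both appear in the list; treat the deepest heavy entry as the real target.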

Fix 1: defer module-level imports

The lowest-effort win is moving a heavy import off the module top level and into the function that uses it. Nothing else about your code changes:

# Before: pandas loads the moment this module is imported,
# i.e. every time the CLI starts, for every command.
import pandas as pd

def convert(path: str) -> None:
    df = pd.read_csv(path)
    df.to_parquet(path.replace(".csv", ".parquet"))

# After: pandas loads only when convert() actually runs.
def convert(path: str) -> None:
    import pandas as pd            # deferred to call time
    df = pd.read_csv(path)
    df.to_parquet(path.replace(".csv", ".parquet"))

The deferred import still gets cached in sys.modules after the first call, so a command that calls convert in a loop pays the import cost once, not per iteration. This one change often halves --help time in a CLI that touches data libraries. It reads as unusual to programmers trained to keep imports at the top — but for a CLI, an import inside a function is a deliberate performance tool, not a smell.
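One side effect of moving an import into a function is that the library's names are no longer available for type annotations at module level. A common companion pattern is to import under typing.TYPE_CHECKING and quote the annotation. The sketch below uses the standard library's fractions module as a stand-in for a heavy dependency such as pandas; parse_ratio is a hypothetical function chosen only for illustration.

```python
import sys
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; this branch never executes at runtime.
    from fractions import Fraction  # stand-in for a heavy type, e.g. a DataFrame


def parse_ratio(text: str) -> "Fraction":  # quoted, so it is not evaluated at import
    from fractions import Fraction  # deferred: runs on the first call only
    return Fraction(text)


result = parse_ratio("3/4")
# After the first call the module is cached in sys.modules.
print(result, "fractions" in sys.modules)  # prints: 3/4 True
```

Type checkers still see the real type, while the running program pays for the import only on the code path that needs it.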

Fix 2: lazy-load whole subcommands

Deferring imports inside a function still requires importing the module that defines the command, which may itself pull in heavy dependencies at its own top level. The complete fix is to not import a subcommand's module at all until that subcommand is invoked. With Click you do this by subclassing Group and overriding command lookup so each subcommand is imported on demand from a string path:

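from __future__ import annotations  # keeps the "X | None" hint unevaluated on Python < 3.10

import importlib
import click

class LazyGroup(click.Group):
    """A Group that imports each subcommand only when it is looked up."""

    def __init__(self, *args, lazy_subcommands: dict[str, str] | None = None, **kwargs):
        super().__init__(*args, **kwargs)
        self._lazy = lazy_subcommands or {}   # name -> "module:attr"

    def list_commands(self, ctx):
        # Names only: eager commands plus lazy ones, with nothing imported.
        return sorted({*super().list_commands(ctx), *self._lazy})

    def get_command(self, ctx, name):
        if name in self._lazy:
            # Import on lookup; repeat lookups hit the sys.modules cache.
            module_path, attr = self._lazy[name].split(":")
            return getattr(importlib.import_module(module_path), attr)
        return super().get_command(ctx, name)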

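With this group, mycli convert … imports only the convert module and its pandas; fetch and report stay untouched. Help output needs one more step. list_commands is cheap because it only returns names, but Click's default help formatter then calls get_command for every listed name to read each command's short help, so a bare mycli --help still imports every command module. To keep --help instant, either keep each command module's own top level light by deferring heavy imports into the callback (Fix 1), or override format_commands to render the listing from static help strings. The full runnable version, the registry pattern, and the Typer equivalent are in lazy loading subcommands for faster startup.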

This technique layers cleanly on top of a well-organized command tree. If your CLI is still a single file, restructure it first — how to structure a large Python CLI project covers the package layout that makes per-command lazy loading natural.
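The same idea can be sketched without any framework, which makes the moving parts easier to see. The example below swaps Click for the standard library's argparse and is an illustration under stated assumptions, not the recipe from the linked guide: LAZY_COMMANDS, build_parser, resolve, and main are hypothetical names, and the targets json:dumps and shlex:quote are stdlib stand-ins for real command functions. Help strings live in the static registry, so listing commands never requires importing a command module.

```python
import argparse
import importlib

# name -> ("module:attr", one-line help). Targets are stdlib stand-ins for
# real command functions such as "mycli.commands.convert:run".
LAZY_COMMANDS = {
    "dumps": ("json:dumps", "serialize a value as JSON"),
    "quote": ("shlex:quote", "shell-quote a string"),
}


def build_parser() -> argparse.ArgumentParser:
    """Build the full parser from static strings; no command module is imported."""
    parser = argparse.ArgumentParser(prog="mycli")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, (_target, help_text) in LAZY_COMMANDS.items():
        sub.add_parser(name, help=help_text).add_argument("value")
    return parser


def resolve(name: str):
    """Import the handler for one command, and only that command."""
    module_path, attr = LAZY_COMMANDS[name][0].split(":")
    return getattr(importlib.import_module(module_path), attr)


def main(argv=None):
    args = build_parser().parse_args(argv)
    handler = resolve(args.command)  # the only import triggered by dispatch
    return handler(args.value)


print(main(["quote", "a b"]))  # prints: 'a b'
```

The trade-off is that each help string is duplicated in the registry rather than read from the command's docstring, which is the price of never importing the command to describe it.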

Fix 3: keep heavy deps out of your top-level package

The trap that undoes both fixes above is a heavy import in your package's __init__.py or in a shared utils module that every command imports. If mycli/__init__.py does from .analytics import tracker and analytics imports pandas, then importing your package at all pays for pandas — lazy subcommands won't save you.

Audit your top-level and shared modules ruthlessly. A package __init__.py for a CLI should be nearly empty. Push heavy dependencies down into the specific command modules that need them, behind the lazy boundary. The same discipline applies to plugin systems: an extensible plugin architecture already loads each plugin only when needed via entry points, which is lazy loading by another name — don't undo it by importing every plugin eagerly at startup.
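That audit can be automated by importing the package in a fresh interpreter and checking sys.modules for libraries that should not be there. A minimal sketch, assuming you maintain your own list of banned module names; leaked_imports is a hypothetical helper, and json is probed here only so the example runs without your package installed.

```python
import subprocess
import sys


def leaked_imports(package: str, banned: list[str]) -> list[str]:
    """Import `package` in a fresh interpreter; return the banned modules it loaded."""
    probe = (
        "import sys, importlib\n"
        f"importlib.import_module({package!r})\n"
        f"print('\\n'.join(m for m in {banned!r} if m in sys.modules))"
    )
    out = subprocess.run(
        [sys.executable, "-c", probe], capture_output=True, text=True, check=True
    ).stdout
    return out.split()


# json eagerly imports its decoder submodule but has no reason to load colorsys.
print(leaked_imports("json", ["json.decoder", "colorsys"]))  # prints: ['json.decoder']
# For a real CLI the goal is an empty list, e.g.:
#   leaked_imports("mycli", ["pandas", "requests", "matplotlib"])
```

Unlike a timing test, this check is deterministic, so it is not affected by noisy CI machines.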

A startup budget you enforce in CI

Optimizations rot. Someone adds import boto3 to a shared module six months from now and your instant CLI is slow again. Prevent the regression by encoding a budget as a test:

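import subprocess
import sys
import time

BUDGET_MS = 150
CMD = [sys.executable, "-m", "mycli", "--help"]

def test_help_is_fast():
    # The first run warms the filesystem and bytecode caches. check=True makes
    # a crashing CLI fail the test instead of looking fast.
    subprocess.run(CMD, capture_output=True, check=True)
    # Time a few warm runs and keep the fastest to damp CI noise.
    timings_ms = []
    for _ in range(3):
        start = time.perf_counter()
        subprocess.run(CMD, capture_output=True, check=True)
        timings_ms.append((time.perf_counter() - start) * 1000)
    elapsed_ms = min(timings_ms)
    assert elapsed_ms < BUDGET_MS, f"--help took {elapsed_ms:.0f} ms (budget {BUDGET_MS} ms)"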

Give the budget generous headroom over your measured local time — CI machines are slower and noisier — and run it as a normal part of the suite. Now a heavy import added anywhere in the import graph fails a test with a clear message, instead of silently taxing every user. The full budgeting approach, including choosing the number and reducing CI flakiness, is covered in the profiling guide.

Where to go next

Work the problem in order: measure, then fix the biggest offenders.

Profiling Python CLI startup time — find the expensive imports with -X importtime, tuna, and hyperfine, and set the budget.
Lazy loading subcommands for faster startup — the full LazyGroup recipe and Typer notes to defer whole commands.

Modern Python CLI Frameworks & Architecture — the track this section belongs to.
Structuring multi-command Python CLIs — the command layout lazy loading builds on.
Plugin architectures for extensible CLIs — entry-point loading that is already lazy by design.