Anonymisation Guide

anonymise — exclusion-based full-scramble PDF anonymisation utility.

Unlike a traditional inclusion-based approach (where you specify what to redact), this module starts from a completely scrambled PDF — every letter on every page is replaced with a different random letter of the same case — and then uses anonymise.toml to specify what to keep unchanged and what numbers to scramble.

Scrambling Rules

  • Only letters (a-z, A-Z) are scrambled by default — digits, punctuation, symbols and whitespace are left exactly as they appear in the PDF.
  • Numbers listed under [numbers_to_scramble] are an opt-in exception: any Tj fragment whose decoded text contains a listed number has its digit characters scrambled (replaced by random different digits); all non-digit characters in that fragment are left unchanged.
  • Text listed under [words_to_not_scramble] (exact, case-insensitive) is never scrambled, regardless of its content.
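The default letter rule above can be sketched in a few lines of Python. This is a minimal illustration, not the module's code: the name scramble_letters is hypothetical, and it draws a fresh random letter for every character, whereas the real module uses a scramble-map builder whose behaviour is not described here.

```python
import random
import string


def scramble_letters(text: str, rng: random.Random) -> str:
    """Replace each ASCII letter with a different random letter of the same case.

    Hypothetical helper for illustration only.
    """
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            # Pick any lowercase letter except the original one.
            out.append(rng.choice([c for c in string.ascii_lowercase if c != ch]))
        elif ch in string.ascii_uppercase:
            # Pick any uppercase letter except the original one.
            out.append(rng.choice([c for c in string.ascii_uppercase if c != ch]))
        else:
            # Digits, punctuation, symbols and whitespace pass through unchanged.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(scramble_letters("Paid 12 Jan, ref A-9!", random.Random()))
```

The output keeps the length, case pattern, digits and punctuation of the input, so the page layout is preserved while the words become unreadable.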

Config file (anonymise.toml)

[numbers_to_scramble] values = [...]: list of number strings (e.g. "11-22-33", "12345678") whose digit characters should be scrambled wherever those strings appear as a substring of a Tj fragment's decoded text. Matching is substring-based and case-insensitive. Only the digit characters inside the matching fragment are replaced; hyphens, spaces and other separators are preserved.
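The digit rule for [numbers_to_scramble] could look roughly like the following sketch. The function name scramble_listed_numbers is an assumption made for illustration. It operates on an already decoded fragment string rather than on a real PDF content stream.

```python
import random

DIGITS = "0123456789"


def scramble_listed_numbers(fragment: str, numbers: list[str], rng: random.Random) -> str:
    """Scramble the digits of a fragment if it contains any listed number.

    Hypothetical helper: matching is substring-based and case-insensitive,
    as the config description states.
    """
    lowered = fragment.lower()
    if not any(number.lower() in lowered for number in numbers):
        return fragment  # No listed number present: leave the fragment alone.
    return "".join(
        # Each digit becomes a different random digit; separators are kept.
        rng.choice([d for d in DIGITS if d != ch]) if ch in DIGITS else ch
        for ch in fragment
    )


if __name__ == "__main__":
    print(scramble_listed_numbers("Sort code 11-22-33", ["11-22-33"], random.Random()))
```

A fragment that does not contain a listed number, such as an account balance, is returned unchanged.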

[words_to_not_scramble] exclude = [...]: list of words or phrases (e.g. month names, transaction type codes, date conjunctions) that must appear unchanged in the output. Matching is case-insensitive and ignores all whitespace; a sliding-window accumulator joins consecutive Tj fragments so that multi-token phrases rendered as separate Tj calls are still detected. Pre-populate with English month names, "from", "to", and any bank-specific transaction type codes.
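One way a sliding-window accumulator of this kind might work is sketched below. The names protected_fragments and max_window are hypothetical, and the module's actual algorithm may differ. The sketch assumes a run of fragments is protected when its joined text, with whitespace removed and case folded, exactly equals an excluded entry.

```python
def _normalise(text: str) -> str:
    """Drop all whitespace and lowercase the text."""
    return "".join(text.split()).lower()


def protected_fragments(
    fragments: list[str], exclude: list[str], max_window: int = 6
) -> list[bool]:
    """Return one flag per fragment: True if it must not be scrambled.

    Hypothetical sketch. max_window caps how many consecutive fragments
    are joined when looking for a phrase split across several Tj calls.
    """
    targets = {_normalise(p) for p in exclude if _normalise(p)}
    longest = max((len(t) for t in targets), default=0)
    keep = [False] * len(fragments)
    for start in range(len(fragments)):
        accumulated = ""
        for end in range(start, min(start + max_window, len(fragments))):
            accumulated += _normalise(fragments[end])
            if len(accumulated) > longest:
                break  # Already longer than any excluded phrase.
            if accumulated in targets:
                # Protect every fragment that contributed to the match.
                for i in range(start, end + 1):
                    keep[i] = True
    return keep


if __name__ == "__main__":
    parts = ["Statement", "from", "1", "Janu", "ary", "to", "Total"]
    print(protected_fragments(parts, ["January", "from", "to"]))
```

Because the comparison is exact after normalisation, "to" protects a standalone "to" fragment but not the "To" inside "Total".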

[filename_replacements]: key/value pairs applied to the output file stem before the anonymised_ prefix is prepended. Same format as the old [global_replacements] section.

Implementation Notes

This module builds on the low-level PDF engine in the shared module _anonymise_shared. The scramble-map builder, content-stream rewriter, and ToUnicode CMap parser live in that module and are imported here.

Scrambling is full-page: every Tj fragment on every page is scrambled by default, and the exclusion rules in the config file protect the text that must remain readable.

Public API

anonymise_pdf()

anonymise_pdf(input_path: Path, output_path: Path | None = None, config_path: Path | None = None) -> Path

Anonymise a single PDF using exclusion-based full-page letter scrambling.

Every letter on every page is scrambled. Digits and symbols are left unchanged unless they match a [numbers_to_scramble] entry. Text listed in [words_to_not_scramble] is preserved as-is.

Args:

  • input_path — Path to the source PDF to anonymise.
  • output_path — Destination path for the anonymised PDF. When None, the output filename is derived from input_path by applying filename_replacements from the config and prepending anonymised_.
  • config_path — Path to the anonymise.toml exclusion config. When None, uses the default project config.

Returns:

Path to the anonymised output PDF.

Raises:

  • FileNotFoundError — If input_path or the config file does not exist.
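The default output name used when output_path is None can be sketched as follows. This is an assumption-laden illustration: derive_output_path is a hypothetical name, replacements are treated as plain substring substitutions on the stem, and the result is placed next to the source file.

```python
from pathlib import Path


def derive_output_path(input_path: Path, replacements: dict[str, str]) -> Path:
    """Apply filename replacements to the stem, then prepend 'anonymised_'.

    Hypothetical helper; the real module may apply replacements differently.
    """
    stem = input_path.stem
    for old, new in replacements.items():
        stem = stem.replace(old, new)
    # Keep the original directory and extension.
    return input_path.with_name(f"anonymised_{stem}{input_path.suffix}")


if __name__ == "__main__":
    print(derive_output_path(Path("Jane_Doe_2024-01.pdf"), {"Jane_Doe": "customer"}))
```

Applying the replacements before adding the prefix keeps identifying details, such as a customer name, out of the output filename as well as the PDF body.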

anonymise_folder()

anonymise_folder(folder_path: Path, pattern: str = '*.pdf', output_dir: Path | None = None, config_path: Path | None = None) -> list[Path]

Anonymise all PDFs matching pattern in folder_path using exclusion-based scrambling.

Skips any PDF whose stem already starts with anonymised_ to avoid re-processing previously anonymised files.

Args:

  • folder_path — Directory to search for PDFs.
  • pattern — Glob pattern used to find PDFs within folder_path. Defaults to "*.pdf".
  • output_dir — Directory to write anonymised PDFs into. When None, each output file is written alongside its source.
  • config_path — Path to the anonymise.toml config file. When None, uses the default project config.

Returns:

List of paths to the anonymised output PDFs, in the order processed.

Raises:

  • FileNotFoundError — If folder_path or the config file does not exist.
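The file selection described for anonymise_folder() might be approximated like this. The helper name select_pdfs is hypothetical, and sorting the matches is an assumption, since the processing order is not specified.

```python
from pathlib import Path


def select_pdfs(folder_path: Path, pattern: str = "*.pdf") -> list[Path]:
    """List files matching pattern, skipping previously anonymised outputs.

    Hypothetical helper for illustration only.
    """
    if not folder_path.is_dir():
        raise FileNotFoundError(folder_path)
    return sorted(
        path
        for path in folder_path.glob(pattern)
        # Files whose stem already carries the prefix were produced earlier.
        if not path.stem.startswith("anonymised_")
    )


if __name__ == "__main__":
    for pdf in select_pdfs(Path(".")):
        print(pdf)
```

Skipping the anonymised_ prefix makes repeated runs over the same folder safe when outputs are written alongside their sources.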

CLI Usage

The bsp anonymise command wraps the Python API:

# Anonymise a single PDF
bsp anonymise statement.pdf

# Anonymise all PDFs in a folder
bsp anonymise ~/statements --folder

# Use a custom config file
bsp anonymise statement.pdf --config anonymise.toml

# Specify output location
bsp anonymise statement.pdf --output ~/anonymised/output.pdf

See the CLI Reference for all available options.