Anonymisation Guide
anonymise — exclusion-based full-scramble PDF anonymisation utility.
Unlike a traditional inclusion-based approach (where you specify what to redact),
this module starts from a completely scrambled PDF — every letter on
every page is replaced with a different random letter of the same case —
and then uses anonymise.toml to specify what to keep unchanged and
what numbers to scramble.
Scrambling Rules
- Only letters (a-z, A-Z) are scrambled by default; digits, punctuation, symbols and whitespace are left exactly as they appear in the PDF.
- Numbers listed under [numbers_to_scramble] are an opt-in exception: any Tj fragment whose decoded text contains a listed number has its digit characters scrambled (replaced by random different digits); all non-digit characters in that fragment are left unchanged.
- Text listed under [words_to_not_scramble] (exact, case-insensitive) is never scrambled, regardless of its content.
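The default letter rule can be illustrated with a minimal sketch. The function name `scramble_letters` is hypothetical, and the sketch picks a fresh random letter for every character; the module itself mentions a scramble-map builder, so the real implementation may well use a fixed substitution map instead.

```python
import random
import string


def scramble_letters(text: str, rng: random.Random) -> str:
    """Replace each ASCII letter with a different random letter of the same case.

    Hypothetical helper for illustration only; not the module's own code.
    """
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            pool = string.ascii_lowercase
        elif ch in string.ascii_uppercase:
            pool = string.ascii_uppercase
        else:
            # Digits, punctuation, symbols and whitespace pass through untouched.
            out.append(ch)
            continue
        # Exclude the original letter so the replacement is always different.
        out.append(rng.choice([c for c in pool if c != ch]))
    return "".join(out)


rng = random.Random(0)  # seeded only so the demo is repeatable
print(scramble_letters("Paid to J. Smith on 12/03/2024", rng))
```

The output keeps the length, letter case, digits and punctuation of the input, which is why the page layout of a scrambled statement stays recognisable.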
Config file (anonymise.toml)
[numbers_to_scramble]
values = [...] — list of number strings (e.g. "11-22-33",
"12345678") whose digit characters should be scrambled wherever
those strings appear as a substring of a Tj fragment's decoded text.
Matching is substring-based and case-insensitive. Only the digit
characters inside the matching fragment are replaced; hyphens, spaces
and other separators are preserved.
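A rough sketch of this opt-in rule, assuming the fragment has already been decoded to a Python string. The name `scramble_listed_numbers` is hypothetical and not part of the module's API.

```python
import random

DIGITS = "0123456789"


def scramble_listed_numbers(fragment: str, listed: list[str], rng: random.Random) -> str:
    """Scramble every digit in `fragment` if it contains any listed number.

    Hypothetical helper: substring match, case-insensitive, separators kept.
    """
    lowered = fragment.lower()
    # Empty entries are ignored, since "" is a substring of every string.
    if not any(entry and entry.lower() in lowered for entry in listed):
        return fragment
    return "".join(
        rng.choice([d for d in DIGITS if d != ch]) if ch in DIGITS else ch
        for ch in fragment
    )


rng = random.Random(0)  # seeded only so the demo is repeatable
print(scramble_listed_numbers("Sort code 11-22-33", ["11-22-33"], rng))
print(scramble_listed_numbers("Ref 999", ["11-22-33"], rng))  # no match, unchanged
```

Note that the sketch scrambles all digits in a matching fragment, not only those inside the matched substring, which follows the wording of the rule above.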
[words_to_not_scramble]
exclude = [...] — list of words or phrases (e.g. month names,
transaction type codes, date conjunctions) that must appear unchanged
in the output. Matching is case-insensitive and ignores all whitespace
(so multi-token phrases rendered as separate Tj calls are detected via
a sliding-window accumulator). Pre-populate with English month names,
"from", "to", and any bank-specific transaction type codes.
[filename_replacements]
Key/value pairs applied to the output file stem before prepending the
anonymised_ prefix. Same format as the old [global_replacements]
section.
Implementation Notes
This module builds on the low-level PDF engine provided by the
_anonymise_shared module. The scramble-map builder, content-stream
rewriter, and ToUnicode CMap parser live in that shared module and are
imported here.
The scope of scrambling is full-page: every Tj fragment on every page is scrambled by default, relying on the exclusion rules in the config file to protect text that must remain readable.
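One way the exclusion rules could pick out protected fragments is sketched below. It assumes fragments arrive as already-decoded strings in drawing order, and it implements the whitespace-insensitive, case-insensitive matching described for [words_to_not_scramble] as a simple sliding window over consecutive fragments. The names `protected_fragments` and `_normalise` are hypothetical, and the real accumulator may work differently.

```python
def _normalise(text: str) -> str:
    """Drop all whitespace and lower-case, so 'Jan uary' matches 'January'."""
    return "".join(text.split()).lower()


def protected_fragments(fragments: list[str], exclude: list[str]) -> list[bool]:
    """Return one flag per fragment: True if it is part of an excluded phrase."""
    targets = {_normalise(p) for p in exclude} - {""}
    protected = [False] * len(fragments)
    if not targets:
        return protected
    longest = max(len(t) for t in targets)
    for start in range(len(fragments)):
        window = ""
        for end in range(start, len(fragments)):
            # Accumulate consecutive fragments into one normalised string.
            window += _normalise(fragments[end])
            if window in targets:
                for i in range(start, end + 1):
                    protected[i] = True
            # A window at least as long as the longest phrase cannot match later.
            if len(window) >= longest:
                break
    return protected


frags = ["1 ", "Jan", "uary", " to ", "Smith"]
print(protected_fragments(frags, ["January", "to"]))
```

In this example "Jan" and "uary" are protected together because their concatenation matches "January", while "Smith" stays eligible for scrambling.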
Public API
anonymise_pdf()
anonymise_pdf(input_path: Path, output_path: Path | None = None, config_path: Path | None = None) -> Path
Anonymise a single PDF using exclusion-based full-page letter scrambling.
Every letter on every page is scrambled. Digits and symbols are left
unchanged unless they match a [numbers_to_scramble] entry. Text
listed in [words_to_not_scramble] is preserved as-is.
Args:
- input_path: Path to the source PDF to anonymise.
- output_path: Destination path for the anonymised PDF. When None, the output filename is derived from input_path by applying filename_replacements from the config and prepending anonymised_.
- config_path: Path to the anonymise.toml exclusion config. When None, uses the default project config.
Returns:
Path to the anonymised output PDF.
Raises:
- FileNotFoundError: If input_path or the config file does not exist.
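The default output name used when output_path is None can be approximated as follows. This is a sketch under two assumptions that the text does not confirm: replacements are plain substring substitutions on the stem, and the output is written next to the source file. `derive_output_path` is a hypothetical name.

```python
from pathlib import Path


def derive_output_path(input_path: Path, replacements: dict[str, str]) -> Path:
    """Apply filename replacements to the stem, then prepend 'anonymised_'."""
    stem = input_path.stem
    for old, new in replacements.items():
        # Assumed semantics: simple substring replacement, applied in order.
        stem = stem.replace(old, new)
    # Keep the original directory and extension.
    return input_path.with_name(f"anonymised_{stem}{input_path.suffix}")


print(derive_output_path(Path("statements/acme_2024-01.pdf"), {"acme": "bank"}))
```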
anonymise_folder()
anonymise_folder(folder_path: Path, pattern: str = '*.pdf', output_dir: Path | None = None, config_path: Path | None = None) -> list[Path]
Anonymise all PDFs matching pattern in folder_path using exclusion-based scrambling.
Skips any PDF whose stem already starts with anonymised_ to avoid
re-processing previously anonymised files.
Args:
- folder_path: Directory to search for PDFs.
- pattern: Glob pattern used to find PDFs within folder_path. Defaults to "*.pdf".
- output_dir: Directory to write anonymised PDFs into. When None, each output file is written alongside its source.
- config_path: Path to the anonymise.toml config file. When None, uses the default project config.
Returns:
List of paths to the anonymised output PDFs, in the order processed.
Raises:
- FileNotFoundError: If folder_path or the config file does not exist.
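The file selection step described above (glob, then skip anything already prefixed with anonymised_) might look roughly like this. `find_pdfs_to_process` is a hypothetical helper, and sorting the matches is an assumption made here for a stable order; the documented function only promises to return paths in the order processed.

```python
from pathlib import Path


def find_pdfs_to_process(folder_path: Path, pattern: str = "*.pdf") -> list[Path]:
    """List files matching `pattern`, skipping previously anonymised outputs."""
    if not folder_path.is_dir():
        raise FileNotFoundError(folder_path)
    return sorted(
        p
        for p in folder_path.glob(pattern)
        # Outputs from an earlier run start with this prefix and are skipped.
        if not p.stem.startswith("anonymised_")
    )
```

Calling `find_pdfs_to_process(Path("~/statements").expanduser())` would then give the list of source PDFs to pass, one at a time, to the single-file routine.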
CLI Usage
The bsp anonymise command wraps the Python API:
# Anonymise a single PDF
bsp anonymise statement.pdf
# Anonymise all PDFs in a folder
bsp anonymise ~/statements --folder
# Use a custom config file
bsp anonymise statement.pdf --config anonymise.toml
# Specify output location
bsp anonymise statement.pdf --output ~/anonymised/output.pdf
See the CLI Reference for all available options.