Adding a New Bank

This guide walks through the process of configuring bank_statement_parser to parse PDF statements from a new bank. The configuration is entirely TOML-based and does not require writing any Python code.

Overview

Adding support for a new bank involves creating and editing several TOML files that describe how to identify the bank's PDFs, locate tables on each page, extract field values, and map them to standard output columns.

The configuration lives in three places: one bank-specific folder and two shared files.

Location | Purpose
project/config/import/<BANK>_<COUNTRY>/ | Bank-specific config folder (4 TOML files)
project/config/import/account_types.toml | Shared account type registry
project/config/import/standard_fields.toml | Shared standard field mappings

Bank config folder structure

Each bank has its own subfolder named in SCREAMING_SNAKE_CASE (e.g. HSBC_UK, TSB_UK). A complete folder contains exactly four files:

File | Purpose | Key Dataclass
companies.toml | Bank identification (name + PDF detection rule) | Company
accounts.toml | Account definitions (one per product/card type) | Account
statement_types.toml | Statement layout definitions (header + lines extraction) | StatementType
statement_tables.toml | Physical table extraction rules (locations, fields, bookends) | StatementTable

Processing pipeline

Understanding the processing order helps when writing config:

  1. Company identification — the Company.config extraction is run against page 1 to determine which bank issued the PDF.
  2. Account identification — each Account.config is tried until one matches, identifying the specific account product.
  3. Header extraction — the StatementType.header configs run to extract statement-level metadata (dates, balances, account details).
  4. Lines extraction — the StatementType.lines configs run per-page to extract transaction rows.
  5. Standard field mapping — raw extracted fields are mapped to STD_* output columns via standard_fields.toml.
  6. Checks & balances — opening balance + payments in - payments out = closing balance is validated.
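
The check in step 6 comes down to one arithmetic identity. A minimal sketch, assuming the four amounts are already available as Decimal values; the function name balances_reconcile and its tolerance parameter are hypothetical and not part of bank_statement_parser:

```python
from decimal import Decimal


def balances_reconcile(opening, payments_in, payments_out, closing,
                       tolerance=Decimal("0.00")):
    """Return True when opening + in - out equals closing within tolerance."""
    expected = opening + payments_in - payments_out
    return abs(expected - closing) <= tolerance


# Example figures, made up for illustration.
ok = balances_reconcile(Decimal("100.00"), Decimal("250.50"),
                        Decimal("75.25"), Decimal("275.25"))
print(ok)
```

Decimal is used here rather than float so that pence-level comparisons are exact.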

Step 1: Register the Account Type

If your bank uses an account type not already in account_types.toml, add a new entry. Most banks will use the existing types (CRD, CUR, SAV, ISA).

File: project/config/import/account_types.toml

[CRD]
account_type = "Credit Card"

[CUR]
account_type = "Current Account"

[SAV]
account_type = "Savings Account"

[ISA]
account_type = "ISA"

AccountType

Simple lookup label for an account type category.

Field Type Status Description
account_type str ACTIVE Human-readable account type label (e.g. "Credit Card" under the CRD key, "Current Account" under the CUR key). Populated at load time but not subsequently consumed by the pipeline; present for potential reporting or routing use.

Step 2: Create the Bank Config Folder

Create a new subfolder under project/config/import/ using the naming convention <BANK>_<COUNTRY> in SCREAMING_SNAKE_CASE:

project/config/import/
  HSBC_UK/          # existing
  TSB_UK/           # existing
  NEWBANK_UK/       # <- your new folder
    companies.toml
    accounts.toml
    statement_types.toml
    statement_tables.toml
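
The folder and its four files can be created by hand or scripted. A sketch using pathlib, assuming empty placeholder files are an acceptable starting point; scaffold_bank_folder is a hypothetical helper, not something shipped with the package, and the demo writes to a temporary directory rather than the real project tree:

```python
import tempfile
from pathlib import Path

CONFIG_FILES = (
    "companies.toml",
    "accounts.toml",
    "statement_types.toml",
    "statement_tables.toml",
)


def scaffold_bank_folder(import_root, bank_key):
    """Create <import_root>/<bank_key>/ containing four empty TOML files."""
    if bank_key != bank_key.upper() or " " in bank_key:
        raise ValueError("bank_key should be SCREAMING_SNAKE_CASE, e.g. NEWBANK_UK")
    folder = Path(import_root) / bank_key
    folder.mkdir(parents=True, exist_ok=True)
    for name in CONFIG_FILES:
        (folder / name).touch(exist_ok=True)  # existing files are left untouched
    return folder


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = sorted(p.name for p in scaffold_bank_folder(tmp, "NEWBANK_UK").iterdir())
print(created)
```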

Step 3: Define the Company

Create companies.toml in your new folder. This file identifies the bank by extracting a distinguishing piece of text from page 1 of the PDF (typically a website URL or bank name).

Example (HSBC_UK/companies.toml):

[HSBC_UK]
company = 'HSBC Bank UK'
[HSBC_UK.config]
    config = 'Company Info'
    locations = [
        {page_number = 1, top_left = [475,110], bottom_right = [575, 130]},
        {page_number = 1, top_left = [460,145], bottom_right = [575, 165]},
        {page_number = 1, top_left = [460,165], bottom_right = [575, 185]},
    ]
    field = {field = 'website', vital=true, type="string", string_pattern ='^www\.hsbc\.co\.uk$'}

How it works: The config block defines a small extraction region on page 1. The field spec extracts text from that region and checks it against string_pattern. If the pattern matches, this company is selected. Multiple locations can be provided — the pipeline tries each until one succeeds.
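
The matching logic described above can be pictured as follows. This is a sketch only: it assumes the text of each location has already been extracted, that surrounding whitespace is trimmed before matching, and that the first matching location wins. The function identify_company is hypothetical and does not mirror the package's internal code:

```python
import re


def identify_company(region_texts, string_pattern):
    """Return the first extracted text that matches the pattern, else None.

    region_texts holds one extracted string (or None) per configured
    location, in the order the locations are listed.
    """
    pattern = re.compile(string_pattern)
    for text in region_texts:
        if text is not None and pattern.search(text.strip()):
            return text.strip()
    return None


# Made-up extraction results for the three HSBC_UK locations.
texts = ["Your Statement", None, " www.hsbc.co.uk "]
matched = identify_company(texts, r"^www\.hsbc\.co\.uk$")
print(matched)
```

Listing several locations is useful when the same text moves between statement versions: only one of them needs to yield a match.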

Key dataclasses

Company

Configuration for a financial institution (bank/provider).

Field Type Status Description
company str ACTIVE Human-readable company name (e.g. "HSBC Bank UK"). Used to populate the STD_COMPANY standard field.
config Config ACTIVE Extraction config used during the company-identification pass. Extracts a discriminating field (e.g. a bank-specific header string) to confirm the PDF belongs to this company before attempting account matching.
accounts dict STUB Declared but never accessed by the pipeline after load. Intended as a lookup from account key to Account object but currently unused.

Config

A single extraction step: one table (or one standalone field) from one location.

Field Type Status Description
config str ACTIVE Human-readable label for this extraction step (e.g. "Statement Balances"). Written into the "config" column of the long-format results DataFrame for traceability.
statement_table_key str ACTIVE Key into statement_tables.toml that identifies the StatementTable to use. Resolved to statement_table at load time. Set to None for inline single-field configs.
statement_table StatementTable ACTIVE Resolved at load time from statement_table_key. The StatementTable object used during extraction. Not set directly in TOML.
locations list[Location] ACTIVE Used only for inline single-field configs (where statement_table is None). Defines where on the page to find the field value.
field Field ACTIVE Used only for inline single-field configs. Defines the extraction spec for the single value to read from the location.

Location

Describes a rectangular region on a PDF page from which a table or text is extracted.

Field Type Status Description
page_number int ACTIVE 1-based page number. When set the location is used only on that page. When None the location is cloned for every page (spawn_locations()).
top_left list[int] ACTIVE [x, y] coordinates of the top-left corner of the crop rectangle. Must be set together with bottom_right. When both are None the full page is used.
bottom_right list[int] ACTIVE [x, y] coordinates of the bottom-right corner of the crop rectangle. Must be set together with top_left.
vertical_lines list[int] ACTIVE Explicit x-coordinates of vertical column dividers supplied to pdfplumber as explicit_vertical_lines. Pairs of identical values create a zero-width gap that forces a column boundary (e.g. [100, 100, 200, 200]). When set, pdfplumber's automatic column detection is disabled for this region.
dynamic_last_vertical_line DynamicLineSpec ACTIVE When set, the final value in vertical_lines is replaced at runtime with an x-coordinate derived from a PDF image's bounding box. See DynamicLineSpec. Used where the rightmost column boundary floats with a logo.
allow_text_failover bool ACTIVE When True and the extracted table has the wrong number of columns, the extraction is retried without vertical_lines, falling back to pdfplumber's text-based column detection. Useful as a safety net for pages where the explicit dividers produce a malformed table.
try_shift_down int ACTIVE Number of PDF points to shift the crop rectangle downward (applied to both top_left[1] and bottom_right[1]) when the initial extraction returns an empty region. Handles statements where the table top boundary varies slightly between pages.
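
Two of the behaviours in the table, per-page cloning and the downward retry shift, are easy to picture in code. The sketch below uses a cut-down stand-in for Location with only the fields it needs, and assumes pdfplumber-style coordinates where y grows towards the bottom of the page. The helper names spawn_for_pages and shifted_down are hypothetical and are not the package's spawn_locations():

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Cut-down stand-in holding only the fields used in this sketch."""
    page_number: Optional[int] = None
    top_left: Optional[tuple] = None
    bottom_right: Optional[tuple] = None
    try_shift_down: Optional[int] = None


def spawn_for_pages(location, page_count):
    """Clone a page-less location once per page; keep a pinned one as is."""
    if location.page_number is not None:
        return [location]
    return [replace(location, page_number=n) for n in range(1, page_count + 1)]


def shifted_down(location):
    """Return a copy with both y-coordinates moved down by try_shift_down."""
    dy = location.try_shift_down or 0
    (x0, y0), (x1, y1) = location.top_left, location.bottom_right
    return replace(location, top_left=(x0, y0 + dy), bottom_right=(x1, y1 + dy))


template = Location(top_left=(50, 100), bottom_right=(555, 700), try_shift_down=10)
pages = spawn_for_pages(template, 3)
print([p.page_number for p in pages])
print(shifted_down(pages[0]))
```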

Field

Extraction specification for a single column or cell within a PDF table.

Field Type Status Description
field str ACTIVE Output column name for this field (e.g. "date", "£_paid_out"). Used as the field identifier throughout the pipeline and in the output Parquet files.
cell Cell ACTIVE Row/column address for summary or detail table extraction. Mutually exclusive with column; set to None for transaction tables.
column int | None ACTIVE Zero-based column index for transaction table extraction. Mutually exclusive with cell; set to None for summary/detail tables.
vital bool ACTIVE When True, extraction failure for this field causes the row to be flagged as a hard failure and excluded from output. When False, failure is recorded but the row is retained.
type str ACTIVE Data type: "string", "numeric", or "currency". * "string" — raw text extraction; pattern matching and trimming applied. * "numeric" — numeric extraction with optional explicit currency stripping via currency_override. * "currency" — identical to "numeric" but inherits the CurrencySpec from the account's Account.currency rather than requiring an explicit currency_override on every field. Use this for all monetary amount fields; reserve "numeric" for non-monetary numerics (e.g. APR, sort code).
strip_characters_start str ACTIVE Characters to strip from the start of the raw string before pattern matching (passed to Polars str.strip_chars_start()). Useful for leading currency symbols not covered by the account currency spec.
strip_characters_end str ACTIVE Characters to strip from the end of the raw string before pattern matching (passed to Polars str.strip_chars_end()).
currency_override str | None ACTIVE Explicit ISO 4217 currency key (e.g. "GBP") used when type == "numeric" and currency stripping is needed but should differ from the account-level Account.currency. Ignored when type == "currency" (which always uses the account-level currency). Omit for non-monetary numeric fields (e.g. APR, sort code) where no currency stripping is required.
numeric_modifier NumericModifier ACTIVE Sign/multiplier transformation applied after numeric casting. See NumericModifier. Omit for straightforward positive numeric values.
string_pattern str ACTIVE Regex pattern the extracted string must match. Extraction is marked as failed (success = False) if the value does not match. Used to validate field contents (e.g. date format) and to skip blank or irrelevant rows.
string_max_length int ACTIVE Maximum character length for string values; longer strings are truncated via str.head(). Useful for capping free-text description fields. Defaults to 999 if not set.
date_format str STUB Intended strptime format for date parsing at the Field level. Declared but never read by the pipeline; date format parsing is handled via StdRefs.format in get_standard_fields() instead.
value_offset FieldOffset ACTIVE When set, reads the field's value from an adjacent column (Field.column + FieldOffset.cols_offset) using the type and currency rules defined in the FieldOffset rather than those on this Field. The primary field column is still extracted normally; the offset column value replaces it in the output. See FieldOffset.

Step 4: Define Statement Tables

Create statement_tables.toml to define how tables are physically extracted from the PDF pages. This is usually the most complex configuration file, as it requires understanding the precise layout of the bank's PDF statements.

Each table entry defines:

  • Where on the page to look (bounding box coordinates, vertical column dividers)
  • What fields to extract (column indices or cell addresses, data types, patterns)
  • How to handle multi-row transactions (bookend detection, field merging)

Table types

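Three table type labels are in use. The type value itself is informational: the pipeline chooses the extraction path from whether transaction_spec is present, so summary and detail tables are both processed with cell addressing, and only transaction tables use column addressing.

Type | Use Case | Field Addressing | Has transaction_spec?
summary | Account balances, totals | cell = {row, col} | No
detail | Account holder info, sort codes | cell = {row, col} | No
transaction | Transaction line items | column = N | Yes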

Summary table example

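A summary table extracts fixed values from known cell positions. Cell addresses are zero-based, so in this example the opening balance at row = 1, col = 1 is read from the second row and second column of the extracted table: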

[HSBC_UK_CUR_ACCT_SUM]
type = "summary"
statement_table = 'Account Summary'
table_columns = 2
table_rows = 4
row_spacing = 7
locations = [
    {page_number=1, top_left = [345, 180], bottom_right = [575, 300], vertical_lines = [360, 475, 475, 550], dynamic_last_vertical_line = {image_id = 0, image_location_tag = "x1"}, allow_text_failover = true},
]
fields = [
    {field = 'opening_balance', cell = {row = 1, col = 1}, vital=true, type = 'currency', numeric_modifier = {suffix = "D", multiplier = -1}},
    {field = 'payments_in', cell = {row = 2, col = 1}, vital=true, type = 'currency'},
    {field = 'payments_out', cell = {row = 3, col = 1}, vital=true, type = 'currency'},
    {field = 'closing_balance', cell = {row = 4, col = 1}, vital=true, type = 'currency', numeric_modifier = {suffix = "D", multiplier = -1}},
]
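
To make the cell addressing concrete, the sketch below reads two of the configured cells from a made-up extraction result. It assumes the extracted table is a list of rows with a heading line at row 0; read_cell is a hypothetical helper and the amounts are invented:

```python
def read_cell(table, row, col):
    """Return table[row][col] using zero-based indices, or None if out of range."""
    if 0 <= row < len(table) and 0 <= col < len(table[row]):
        return table[row][col]
    return None


# A made-up two-column extraction; row 0 is assumed to be a heading line.
table = [
    ["Account Summary", ""],
    ["Opening Balance", "1,234.56"],
    ["Payments In", "2,000.00"],
    ["Payments Out", "1,500.00"],
    ["Closing Balance", "1,734.56"],
]
fields = [
    {"field": "opening_balance", "cell": {"row": 1, "col": 1}},
    {"field": "closing_balance", "cell": {"row": 4, "col": 1}},
]
values = {f["field"]: read_cell(table, **f["cell"]) for f in fields}
print(values)
```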

Transaction table example

A transaction table extracts transactions that may each span a variable number of table rows, using bookend detection to identify where each transaction starts and ends:

[HSBC_UK_CUR_TRANSACTIONS]
type = "transaction"
statement_table = 'Transactions'
table_columns = 6
locations = [
    {vertical_lines = [50, 100, 100, 130, 130, 320, 320, 400, 400, 480, 480, 555]},
]
fields = [
    {field = 'date', column = 0, vital=false, type = "string", string_pattern ='^[0-3][0-9]\s?[A-Z][a-z]{2}\s?[0-3][0-9]$'},
    {field = 'payment_type', column = 1, vital=false, type = "string", string_pattern ='(^[A-Z0-9]{1,3}$)|(^[)]{3}$)'},
    {field = 'details', column = 2, vital=true, type = "string", string_pattern ='.+', string_max_length = 100},
    {field = '£_paid_out', column = 3, vital=false, type = "currency"},
    {field = '£_paid_in', column = 4, vital=false, type = "currency"},
    {field = '£_balance', column = 5, vital=false, type = "currency", numeric_modifier = {suffix = "D", multiplier = -1.0000}},
]
delete_success_false = true
delete_cast_success_false = true
delete_rows_with_missing_vital_fields = true

[HSBC_UK_CUR_TRANSACTIONS.transaction_spec]
transaction_bookends = [
    {start_fields = ['payment_type','details'], min_non_empty_start = 2, end_fields = ['£_paid_out','£_paid_in'], min_non_empty_end = 1}
]
fill_forward_fields = ['date','payment_type']
merge_fields = {fields=['details'], separator=' | '}
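
The interplay of transaction_bookends, fill_forward_fields and merge_fields in this example can be sketched in plain Python. The sketch treats None as "extraction failed or empty", evaluates a single bookend, and keeps the end row as the output row. The real pipeline works on DataFrames and may differ in detail; count_present and group_transactions are hypothetical names and the rows are invented:

```python
def count_present(row, names):
    """Number of the named fields that hold a value in this row."""
    return sum(1 for name in names if row.get(name) is not None)


def group_transactions(rows, bookend, fill_forward, merge_names, separator):
    """Group extracted rows into transactions using one bookend rule."""
    transactions, current, last_seen = [], None, {}
    for raw in rows:
        # Bookend flags are computed on the values as extracted.
        is_start = count_present(raw, bookend["start_fields"]) >= bookend["min_non_empty_start"]
        is_end = count_present(raw, bookend["end_fields"]) >= bookend["min_non_empty_end"]
        # Forward-fill sparse fields (e.g. date) from the last row that had them.
        row = dict(raw)
        for name in fill_forward:
            if row.get(name) is None:
                row[name] = last_seen.get(name)
            else:
                last_seen[name] = row[name]
        if is_start:
            current = []
        if current is None:
            continue  # row sits outside any transaction
        current.append(row)
        if is_end:
            merged = dict(current[-1])  # the end row carries the amounts
            for name in merge_names:
                merged[name] = separator.join(r[name] for r in current if r.get(name))
            transactions.append(merged)
            current = None
    return transactions


bookend = {"start_fields": ["payment_type", "details"], "min_non_empty_start": 2,
           "end_fields": ["£_paid_out", "£_paid_in"], "min_non_empty_end": 1}
rows = [
    {"date": "01 Jan 24", "payment_type": "VIS", "details": "COFFEE SHOP",
     "£_paid_out": None, "£_paid_in": None},
    {"date": None, "payment_type": None, "details": "LONDON",
     "£_paid_out": "3.20", "£_paid_in": None},
    {"date": None, "payment_type": "CR", "details": "SALARY",
     "£_paid_out": None, "£_paid_in": "1500.00"},
]
txns = group_transactions(rows, bookend, ["date", "payment_type"], ["details"], " | ")
for txn in txns:
    print(txn)
```

The first two rows collapse into one transaction because only the second row carries an amount; the third row is both a start and an end row, so it forms a transaction on its own.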

Key dataclasses

StatementTable

Full configuration for extracting one table from a PDF statement.

Field Type Status Description
type str STUB Table type label: "transaction", "summary", or "detail". Loaded from TOML but not currently read by the pipeline; the extraction path is determined by whether transaction_spec is present rather than this field.
statement_table str STUB Human-readable table label (e.g. "Transactions", "Account Summary"). Loaded from TOML for documentation purposes but not consumed by the pipeline.
header_text str ACTIVE When set, the first table row whose text matches this string is stripped before extraction. Use when pdfplumber includes the column header row in the extracted data.
remove_header bool ACTIVE When True the first table row is unconditionally stripped. Use when the header row is always present but its text varies (making header_text impractical).
locations list[Location] ACTIVE One or more Location entries describing where on the page to find this table. Locations without a page_number are cloned for every page.
fields list[Field] ACTIVE Ordered list of field extraction specs. For transaction tables each field must have a column; for summary/detail tables each field must have a cell.
table_columns int ACTIVE Expected minimum number of columns in the extracted table. Passed to pdfplumber as min_words_horizontal and used to validate column count after extraction. Also triggers allow_text_failover retry logic.
table_rows int ACTIVE Expected minimum number of rows in the extracted table. Passed to pdfplumber as min_words_vertical.
row_spacing int ACTIVE pdfplumber snap_y_tolerance in PDF points. Rows whose top edges fall within this distance of each other are merged into the same table row. Increase if the statement uses tight line spacing that splits a single visual row across multiple pdfplumber rows.
tests list[Test] STUB Declarative post-extraction assertions. Declared and accepted in TOML but no pipeline code evaluates them. Reserved for a future config validation pass.
delete_success_false bool STUB Intended to drop rows where any field extraction returned success = False. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag.
delete_cast_success_false bool STUB Intended to drop rows where numeric casting failed. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag.
delete_rows_with_missing_vital_fields bool STUB Intended to drop rows where any vital field is missing after extraction. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. Note: vital-field hard-failure logic exists in validate() but is separate from this flag.
transaction_spec TransactionSpec ACTIVE When set, the table is processed as a transaction table using the bookend-based multi-row extraction path. Must be None for summary/detail tables.

DynamicLineSpec

Locates the position of the last vertical column divider from an embedded PDF image.

Field Type Status Description
image_id int ACTIVE Zero-based index into the list of images on the page, identifying which image provides the boundary coordinate.
image_location_tag str ACTIVE Bounding-box attribute of the image to use as the x-coordinate (e.g. "x0" for left edge, "x1" for right edge).
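
The substitution a DynamicLineSpec describes amounts to replacing one list element. A sketch, assuming page images are available as dictionaries with bounding-box keys such as x0 and x1 (the shape pdfplumber uses for page.images); resolve_vertical_lines is a hypothetical helper and the coordinates are made up:

```python
def resolve_vertical_lines(vertical_lines, images, spec):
    """Return a copy of vertical_lines with the last value taken from an image."""
    image = images[spec["image_id"]]        # zero-based index into the page images
    x = image[spec["image_location_tag"]]   # e.g. "x0" or "x1"
    return [*vertical_lines[:-1], x]


images = [{"x0": 500.0, "x1": 562.5, "top": 20.0, "bottom": 60.0}]
spec = {"image_id": 0, "image_location_tag": "x1"}
resolved = resolve_vertical_lines([360, 475, 475, 550], images, spec)
print(resolved)
```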

Cell

Zero-based row and column address of a cell within a PDF table.

Field Type Status Description
row int ACTIVE Zero-based row index within the extracted table.
col int ACTIVE Zero-based column index within the extracted table.

NumericModifier

Optional sign/multiplier transformation applied after numeric casting.

Field Type Status Description
prefix str ACTIVE If the raw value starts with this string the prefix is stripped and the multiplier applied. Use for formats like "(123.45)" where "(" signals a negative value.
suffix str ACTIVE If the raw value ends with this string the suffix is stripped and the multiplier applied. Use for formats like "123.45 CR" or "123.45D".
multiplier float ACTIVE Scalar applied to the cast value when the prefix/suffix matches, or unconditionally if neither prefix nor suffix is set. Typically -1 to invert sign.
exclude_negative_values bool ACTIVE When True, any negative result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column.
exclude_positive_values bool ACTIVE When True, any positive result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column.
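
The rules in this table can be summarised in a few lines. The sketch below is one reading of them, assuming the raw text has already had currency symbols and separators removed; apply_modifier is a hypothetical function and not the package's implementation:

```python
from decimal import Decimal


def apply_modifier(raw, prefix=None, suffix=None, multiplier=1,
                   exclude_negative_values=False, exclude_positive_values=False):
    """Cast raw text to Decimal, applying prefix/suffix-triggered sign rules."""
    text = raw.strip()
    matched = False
    if prefix and text.startswith(prefix):
        text, matched = text[len(prefix):], True
    if suffix and text.endswith(suffix):
        text, matched = text[:-len(suffix)], True
    value = Decimal(text.strip())
    # Apply the multiplier on a match, or always when no prefix/suffix is set.
    if matched or (not prefix and not suffix):
        value *= Decimal(str(multiplier))
    if exclude_negative_values and value < 0:
        value = Decimal("0")
    if exclude_positive_values and value > 0:
        value = Decimal("0")
    return value


print(apply_modifier("123.45D", suffix="D", multiplier=-1))  # overdrawn balance
print(apply_modifier("123.45", suffix="D", multiplier=-1))   # no suffix, sign kept
```

With a prefix such as "(", the closing ")" is not removed by this sketch and would need stripping separately, for example through strip_characters_end on the field.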

FieldOffset

Reads a field's value from an adjacent column rather than the field's own column.

Field Type Status Description
rows_offset int STUB Intended row offset for reading the value from a different row. Declared and accepted in TOML but never read by the pipeline; only cols_offset is currently consumed. Always set to 0 in TOML examples.
cols_offset int ACTIVE Column offset applied to Field.column to locate the source cell (e.g. 1 reads from the column immediately to the right).
vital bool ACTIVE Passed to the extraction pipeline for the offset field; when True extraction failure is treated as a hard failure for that row.
type str ACTIVE Data type for the offset value: "string", "numeric", or "currency". Overrides the parent Field.type for this value read.
currency_override str | None ACTIVE Explicit currency key (e.g. "GBP") for numeric stripping of the offset value when type == "numeric". Overrides the account-level currency. When type == "currency" the account-level currency is used and this is ignored.
numeric_modifier NumericModifier ACTIVE Sign/multiplier modifier for the offset value. Overrides the parent Field.numeric_modifier.

CurrencySpec

Currency formatting rules used to strip symbols and separators before numeric casting.

Field Type Status Description
name str ACTIVE Human-readable currency name (e.g. "British Pound Sterling").
symbols list[str] ACTIVE List of currency symbol strings to strip from the raw value before casting (e.g. ["£", "$"]). Replaced with empty string via str.replace_many().
seperator_decimal str STUB Intended decimal separator character (e.g. "."). Declared but never read by the pipeline; decimal handling is implicit after symbols and thousands separators are stripped.
seperators_thousands list[str] ACTIVE List of thousands-separator strings to strip (e.g. [","]). Replaced with empty string via str.replace_many() before casting.
round_decimals int STUB Intended rounding precision after casting. Declared but never read by the pipeline; no rounding is currently applied.
pattern str ACTIVE Regex pattern used to extract the numeric substring from the raw cell text before symbol/separator stripping. Passed to patmatch() via build_pattern().
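
As an illustration of how the ACTIVE fields combine, the sketch below extracts the numeric substring with a pattern, removes symbols and thousands separators, and casts the result. It uses plain string replacement in place of Polars, and the regex is a made-up example rather than a pattern shipped with the package; clean_amount is a hypothetical helper:

```python
import re
from decimal import Decimal


def clean_amount(raw, symbols, thousands_separators, pattern):
    """Extract the numeric substring, strip currency debris and cast it."""
    match = re.search(pattern, raw)
    if match is None:
        return None
    text = match.group(0)
    for token in [*symbols, *thousands_separators]:
        text = text.replace(token, "")
    return Decimal(text)


# Illustrative pattern only.
amount = clean_amount("£1,234.56", ["£"], [","], r"[£]?[\d,]+\.\d{2}")
print(amount)
```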

TransactionSpec

Full specification for extracting transactions from a transaction-type table.

Field Type Status Description
transaction_bookends list[TransactionBookend] ACTIVE One or more bookend definitions that identify transaction boundaries. Evaluated in order; a row matched by an earlier bookend is not re-matched by a later one. At least one bookend is required.
fill_forward_fields list[str] ACTIVE Field names whose null values should be forward-filled across rows within the same page after pivot. Use for sparse columns where a value (e.g. a date or payment type) appears only on the first row of a multi-row block and needs propagating to the end row.
merge_fields MergeFields ACTIVE When set, collapses multi-row text fields within each transaction into a single joined string. See MergeFields.
exclude_rows list[FieldValidation] ACTIVE Rows where any rule's field value matches its pattern are removed from the results before bookend detection runs. Use to suppress known non-transaction rows (e.g. a closing balance summary line) that would otherwise interfere with transaction counting or checks & balances. Each rule is a {field, pattern} pair; a row is excluded if any rule matches.

TransactionBookend

Defines how the start and end of a single transaction are detected within a table.

Field Type Status Description
start_fields list[str] ACTIVE Field names that are checked to identify the first row of a transaction. A row qualifies as a start row when at least min_non_empty_start of these fields extracted successfully (success = True).
min_non_empty_start int ACTIVE Minimum number of start_fields that must have extracted successfully for a row to be flagged as transaction_start = True.
end_fields list[str] ACTIVE Field names checked to identify the last row of a transaction. A row qualifies as an end row when at least min_non_empty_end of these fields extracted successfully.
min_non_empty_end int ACTIVE Minimum number of end_fields that must have extracted successfully for a row to be flagged as transaction_end = True.
extra_validation_start FieldValidation ACTIVE When set, any row where the named field's value does NOT match the pattern is excluded from being a start-bookend candidate for this bookend. Rows excluded here may still be captured by another bookend in the list. Useful for bookends that should only trigger on a specific row shape (e.g. an interest charge line identified by its details text).
extra_validation_end FieldValidation STUB Symmetric counterpart to extra_validation_start for end rows. Declared but not yet implemented in the pipeline; no code currently reads this field. Reserved for future use.
sticky_fields list[str] STUB Intended to forward-fill named fields from the start row of a transaction down to its end row, scoped within a single transaction (as opposed to fill_forward_fields which fills across transactions). Declared but not implemented; no pipeline code reads this field.

FieldValidation

A field-name/regex-pattern pair used as a row filter or row qualification rule.

Field Type Status Description
field str ACTIVE Name of the extracted field (output column name) whose value is tested against the pattern.
pattern str ACTIVE Regex pattern tested via Polars str.contains(). For exclude_rows a match causes exclusion; for extra_validation_start a non-match causes exclusion.
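
For exclude_rows, the rule semantics reduce to "drop the row if any pattern is found in its field". A sketch using re.search as a stand-in for the regex search performed by Polars str.contains(); apply_exclude_rows, the sample rows and the pattern are all hypothetical:

```python
import re


def apply_exclude_rows(rows, rules):
    """Drop rows where any rule's pattern is found in the named field."""
    def excluded(row):
        return any(
            row.get(rule["field"]) is not None
            and re.search(rule["pattern"], row[rule["field"]]) is not None
            for rule in rules
        )
    return [row for row in rows if not excluded(row)]


rows = [
    {"details": "COFFEE SHOP"},
    {"details": "BALANCE CARRIED FORWARD"},
]
rules = [{"field": "details", "pattern": "^BALANCE (BROUGHT|CARRIED) FORWARD$"}]
kept = apply_exclude_rows(rows, rules)
print(kept)
```

For extra_validation_start the test is inverted: a row stays a start candidate only when the pattern is found.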

MergeFields

Specifies how multi-row text fields are collapsed into a single output value.

Field Type Status Description
fields list[str] ACTIVE Names of the fields whose per-row values should be joined.
separator str ACTIVE Delimiter inserted between joined values (e.g. " | ").

Step 5: Define Statement Types

Create statement_types.toml to define the extraction workflow for each distinct statement layout. A single bank may have multiple statement types (e.g. current account vs. credit card) if their PDF layouts differ.

Each statement type groups extraction into two sections:

  • header — runs once per statement to extract metadata (dates, balances, account info)
  • lines — runs per-page to extract transaction rows

Configs within each section either reference a statement_table_key from statement_tables.toml or define an inline single-field extraction.

Example (HSBC_UK/statement_types.toml):

[HSBC_UK_CUR]
statement_type = 'HSBC UK Current Account'
    [HSBC_UK_CUR.header]
        [[HSBC_UK_CUR.header.configs]]
            config = 'Statement Balances'
            statement_table_key = 'HSBC_UK_CUR_ACCT_SUM'

        [[HSBC_UK_CUR.header.configs]]
            config = 'Statement Info'
            locations = [
                {page_number=1, top_left = [37, 325], bottom_right = [290, 385]}
                ]
            field = {field = 'statement_date', vital=true, type = "string", string_pattern ='^[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s(\d{4}\s)?to\s[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$'}

        [[HSBC_UK_CUR.header.configs]]
            config = 'Account Info'
            statement_table_key = 'HSBC_UK_CUR_ACCT_DET'

    [HSBC_UK_CUR.lines]
        [[HSBC_UK_CUR.lines.configs]]
            config = 'Transaction Lines'
            statement_table_key = 'HSBC_UK_CUR_TRANSACTIONS'

[HSBC_UK_SAV]
statement_type = 'HSBC UK Saving Account'
    [HSBC_UK_SAV.header]
        [[HSBC_UK_SAV.header.configs]]
            config = 'Statement Balances'
            statement_table_key = 'HSBC_UK_CUR_ACCT_SUM'

        [[HSBC_UK_SAV.header.configs]]
            config = 'Statement Info'
            locations = [{page_number=1, top_left = [37, 325], bottom_right = [290, 385]}]
            field = {field = 'statement_date', vital=true, type = "string", string_pattern ='^[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s(\d{4}\s)?to\s[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$'}

        [[HSBC_UK_SAV.header.configs]]
            config = 'Account Info'
            statement_table_key = 'HSBC_UK_SAV_ACCT_DET'

    [HSBC_UK_SAV.lines]
        [[HSBC_UK_SAV.lines.configs]]
            config = 'Transaction Lines'
            statement_table_key = 'HSBC_UK_CUR_TRANSACTIONS'

[HSBC_UK_CRD]
statement_type = 'HSBC UK Credit Card'
    [HSBC_UK_CRD.header]
        [[HSBC_UK_CRD.header.configs]]
            config = 'Statement Balances'
            statement_table_key = 'HSBC_UK_CRD_ACCT_SUM'

        [[HSBC_UK_CRD.header.configs]]
            config = 'Statement Info'
            locations = [{page_number=1, top_left = [120, 470], bottom_right = [250, 500]}]
            field = {field = 'statement_date', vital=true, type = "string", string_pattern ='^[0-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$'}

        [[HSBC_UK_CRD.header.configs]]
            config = 'Account Info'
            locations = [{page_number=2, top_left = [44, 174], bottom_right = [215, 200]}]
            field = {field = 'account_name', vital=true, type = "string", string_pattern ='^[A-Z]+[a-z]*\s[A-Z]+[a-z]*.*$'}

        [[HSBC_UK_CRD.header.configs]]
            config = 'Account Info'
            locations = [{page_number=2, top_left = [220, 175], bottom_right = [340, 197]}]
            field = {field = 'card_number', vital=true, type = "string", string_pattern ='^[0-9]{4}\s[0-9]{4}\s[0-9]{4}\s[0-9]{4}\s?$'}

    [HSBC_UK_CRD.lines]
        [[HSBC_UK_CRD.lines.configs]]
            config = 'Transaction Lines'
            statement_table_key = 'HSBC_UK_CRD_TRANSACTIONS'
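
Before committing a string_pattern it can help to try it against text copied from a real statement. The sketch below runs the two statement_date patterns from the example above through Python's re module. The sample strings are invented for illustration, and the parser's own matching may use a different regex engine, although these simple patterns are assumed to behave the same way there.

```python
import re

# Patterns copied from the statement_types.toml example above.
CURRENT_ACCOUNT_PERIOD = (
    r"^[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s(\d{4}\s)?to\s"
    r"[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$"
)
CREDIT_CARD_DATE = r"^[0-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$"


def matches(pattern: str, text: str) -> bool:
    """Return True when the text satisfies the anchored pattern."""
    return re.search(pattern, text) is not None


# Hypothetical sample values, not taken from a real statement.
print(matches(CURRENT_ACCOUNT_PERIOD, "5 December 2023 to 4 January 2024"))  # True
print(matches(CURRENT_ACCOUNT_PERIOD, "5 December to 4 January 2024"))       # True
print(matches(CURRENT_ACCOUNT_PERIOD, "05 December to 4 January 2024"))      # False
print(matches(CREDIT_CARD_DATE, "05 December 2023"))                         # True
```

The third case fails because the current-account pattern starts with [1-3]?, which does not accept a zero-padded day; the credit card pattern uses [0-3]? and does.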

Key dataclasses

StatementType

Full extraction specification for one statement layout variant.

Field Type Status Description
statement_type str ACTIVE Human-readable label matching the value used in StdRefs.statement_type (e.g. "HSBC UK Current Account"). Used to select the correct StdRefs mapping when promoting raw fields to standard columns.
header ConfigGroup ACTIVE Config steps that extract statement-level metadata: dates, account numbers, opening/closing balances, etc.
lines ConfigGroup ACTIVE Config steps that extract per-transaction data from the body of each page.

ConfigGroup

An ordered list of Config extraction steps for one pipeline section.

Field Type Status Description
configs list[Config] ACTIVE Ordered list of Config steps. Executed in sequence during extraction; results are stacked into the section's results DataFrame.

Config

A single extraction step: one table (or one standalone field) from one location.

Field Type Status Description
config str ACTIVE Human-readable label for this extraction step (e.g. "Statement Balances"). Written into the "config" column of the long-format results DataFrame for traceability.
statement_table_key str ACTIVE Key into statement_tables.toml that identifies the StatementTable to use. Resolved to statement_table at load time. Set to None for inline single-field configs.
statement_table StatementTable ACTIVE Resolved at load time from statement_table_key. The StatementTable object used during extraction. Not set directly in TOML.
locations list[Location] ACTIVE Used only for inline single-field configs (where statement_table is None). Defines where on the page to find the field value.
field Field ACTIVE Used only for inline single-field configs. Defines the extraction spec for the single value to read from the location.

Step 6: Define Accounts

Create accounts.toml to define each account product offered by the bank. Each account links a company, an account type, and a statement type, and defines a PDF detection rule that identifies which account a given statement belongs to.

Example (HSBC_UK/accounts.toml — first entry shown):

[HSBC_UK_CRD_RCC]
account = "Rewards Credit Card"
company_key = 'HSBC_UK'
account_type_key = 'CRD'
statement_type_key = 'HSBC_UK_CRD'
exclude_last_n_pages = 1
currency = "GBP"
    [HSBC_UK_CRD_RCC.config]
    config = 'Account Info'
    locations = [{page_number = 1, top_left = [275, 20], bottom_right = [575, 70]}]
    field = {field = 'account', vital=true, type="string", string_pattern ='^Your[\s]*Rewards[\s]*Credit[\s]*Card[\s]*statement[\s]*'}

Key fields:

  • company_key — must match a key in your companies.toml
  • account_type_key — must match a key in the shared account_types.toml (e.g. CRD, CUR)
  • statement_type_key — must match a key in your statement_types.toml
  • exclude_last_n_pages — number of trailing pages to skip (terms & conditions, etc.)
  • config — inline extraction rule to identify this account from page 1 of the PDF
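
The three *_key fields are cross-references, so a typo in any of them leaves the account unresolvable. As a rough pre-flight check, the sketch below compares one account entry against the sets of keys defined elsewhere. The helper name missing_references and the plain-dict representation of the parsed TOML are assumptions made for illustration; the parser's own loader may validate these differently.

```python
def missing_references(account: dict, companies: set, account_types: set,
                       statement_types: set) -> list:
    """Return the names of the *_key fields that point at an undefined key."""
    lookups = {
        "company_key": companies,
        "account_type_key": account_types,
        "statement_type_key": statement_types,
    }
    return [name for name, known in lookups.items()
            if account.get(name) not in known]


# Values taken from the HSBC_UK_CRD_RCC example above.
account = {
    "company_key": "HSBC_UK",
    "account_type_key": "CRD",
    "statement_type_key": "HSBC_UK_CRD",
}
companies = {"HSBC_UK"}
account_types = {"CRD", "CUR", "SAV"}
statement_types = {"HSBC_UK_CUR", "HSBC_UK_SAV", "HSBC_UK_CRD"}

print(missing_references(account, companies, account_types, statement_types))  # []

# A mistyped statement type key is reported by field name.
broken = {**account, "statement_type_key": "HSBC_UK_CARD"}
print(missing_references(broken, companies, account_types, statement_types))
# ['statement_type_key']
```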

Key dataclasses

Account

Full runtime configuration for one bank account.

Field Type Status Description
account str ACTIVE Human-readable account name (e.g. "Current Account"). Written to the STD_ACCOUNT standard field in the output.
company_key str ACTIVE Key into companies.toml identifying the issuing bank. Used to build ID_ACCOUNT and to look up the Company object at load time.
company Company ACTIVE Resolved at load time from company_key. Provides the company name and company-level identification config.
account_type_key str ACTIVE Key into account_types.toml (e.g. "CRD", "CUR", "SAV"). Used to look up the AccountType object at load time.
account_type AccountType STUB Resolved at load time from account_type_key. The AccountType object is populated but never subsequently read by any pipeline consumer.
statement_type_key str ACTIVE Key into statement_types.toml identifying the extraction layout for this account's statements. Used to look up the StatementType object at load time.
statement_type StatementType ACTIVE Resolved at load time from statement_type_key. Provides the header and lines ConfigGroups used during extraction.
exclude_last_n_pages int ACTIVE Number of trailing pages to skip when cloning per-page locations. Set to 1 (or more) when the final page(s) contain terms & conditions or other non-transaction content that would otherwise be passed to the extraction pipeline.
currency str ACTIVE ISO 4217 currency code for all monetary fields on this account (e.g. "GBP", "USD", "PHP"). Must be a key in currency_spec in currency.py; validated at config load time. Used by the extraction pipeline to resolve the CurrencySpec for fields of type "currency".
config Config ACTIVE Account-level identification config. A lightweight extraction step run to confirm a PDF belongs to this account before the full extraction pass. Defined inline under [ACCOUNT_KEY.config] in accounts.toml.

Step 7: Register Standard Field Mappings

Finally, add entries for your new statement type(s) to the shared standard_fields.toml. This file maps bank-specific raw field names to standardised output columns (STD_*).

For each STD_* field, add a new std_refs entry with your statement type's name and the corresponding raw field name from your statement_tables.toml.

File: project/config/import/standard_fields.toml

Example (showing STD_OPENING_BALANCE with entries for multiple banks):

[STD_OPENING_BALANCE]
    section = "header"
    type = "numeric"
    vital = true
    std_refs = [
        {statement_type="HSBC UK Current Account", field="opening_balance"},
        {statement_type="HSBC UK Saving Account", field="opening_balance"},
        {statement_type="HSBC UK Credit Card", field="previous_balance", multiplier=-1.0000},
        {statement_type="TSB UK Current Account", field="opening_balance"},
    ]
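
To make the mapping concrete, the following sketch selects the std_refs entry whose statement_type matches and applies its multiplier. It is a simplified stand-in rather than the parser's get_standard_fields(): the function name resolve_numeric, the local ConfigError class, and the raw-value dictionary are all assumptions made for illustration.

```python
class ConfigError(Exception):
    """Stand-in for the parser's own ConfigError (assumed, not imported)."""


# A trimmed copy of the STD_OPENING_BALANCE entry shown above.
STD_OPENING_BALANCE = {
    "section": "header",
    "type": "numeric",
    "vital": True,
    "std_refs": [
        {"statement_type": "HSBC UK Current Account", "field": "opening_balance"},
        {"statement_type": "HSBC UK Credit Card", "field": "previous_balance",
         "multiplier": -1.0},
    ],
}


def resolve_numeric(std_field: dict, statement_type: str, raw: dict):
    """Pick the matching std_refs entry, cast the raw value, apply the multiplier."""
    for ref in std_field["std_refs"]:
        if ref["statement_type"] == statement_type:
            return float(raw[ref["field"]]) * ref.get("multiplier", 1.0)
    if std_field["vital"]:
        raise ConfigError(f"no std_refs entry for {statement_type!r}")
    return None


# A credit card's previous balance is owed, so it becomes a negative opening balance.
print(resolve_numeric(STD_OPENING_BALANCE, "HSBC UK Credit Card",
                      {"previous_balance": "250.00"}))  # -250.0
```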

Standard fields reference

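Map each of the following standard fields for every statement type you add. If a field marked vital = true has no std_refs entry for the statement type being processed, a ConfigError is raised; non-vital fields may be left unmapped when the statement does not provide them.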

Standard Field Section Type Vital Purpose
STD_STATEMENT_DATE header date Yes Statement period end date
STD_SORTCODE header string No Bank sort code
STD_ACCOUNT_NUMBER header string Yes Account or card number
STD_ACCOUNT_HOLDER header string No Account holder name
STD_OPENING_BALANCE header numeric Yes Opening balance
STD_CLOSING_BALANCE header numeric Yes Closing balance
STD_PAYMENTS_IN header numeric Yes Total credits in period
STD_PAYMENTS_OUT header numeric Yes Total debits in period
STD_TRANSACTION_DATE lines date Yes Individual transaction date
STD_TRANSACTION_TYPE lines string Yes Payment type code
STD_TRANSACTION_DESC lines string Yes Transaction description
STD_PAYMENT_IN lines numeric Yes Credit amount per transaction
STD_PAYMENT_OUT lines numeric Yes Debit amount per transaction

std_refs entry options

Each std_refs entry supports the following options:

StdRefs

Mapping rule that promotes a raw extracted field to a standard output column.

Field Type Status Description
statement_type str ACTIVE Key used to select this rule; matched against the statement type string of the PDF being processed (e.g. "HSBC UK Current Account").
field str ACTIVE Name of the raw extracted column to promote. Set to None (or omit) when a literal default value should be used instead of a column value.
format str ACTIVE strptime format string applied when StandardFields.type == "date" (e.g. "%-d %B %Y"). Ignored for numeric and string types.
default str ACTIVE Literal string value used as the output when field is None/absent. Useful for injecting constant metadata (e.g. transaction_type = "CC").
multiplier float ACTIVE Scalar applied to the value after casting when StandardFields.type == "numeric". Use -1 to invert sign (e.g. to convert a credit amount stored as positive into a negative figure).
exclude_positive_values bool ACTIVE When True, any positive numeric value is replaced with 0 after casting. Used to isolate debit-side figures from a combined amount column.
exclude_negative_values bool ACTIVE When True, any negative numeric value is replaced with 0 after casting. Used to isolate credit-side figures from a combined amount column.
terminator str ACTIVE Regex pattern; when present the string value is truncated at the first match position before being written to the standard column. Useful for stripping trailing boilerplate appended by merge_fields (e.g. " | BALANCE CARRIED FORWARD").
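
The two exclude flags and terminator are easiest to understand by example. The sketch below is an approximation of the documented behaviour, assuming a signed amount column where debits are negative; the helper names isolate_side and truncate_at are hypothetical. Because terminator is a regex, the literal pipe in the sample pattern is escaped.

```python
import re


def isolate_side(amount: float, exclude_positive_values: bool = False,
                 exclude_negative_values: bool = False) -> float:
    """Replace the unwanted side of a combined amount column with 0."""
    if exclude_positive_values and amount > 0:
        return 0.0
    if exclude_negative_values and amount < 0:
        return 0.0
    return amount


def truncate_at(text: str, terminator: str) -> str:
    """Cut the string at the first position where the terminator regex matches."""
    match = re.search(terminator, text)
    return text if match is None else text[: match.start()]


# One signed amount feeds two standard columns.
amount = -42.5
print(isolate_side(amount, exclude_positive_values=True))  # -42.5 (debit side kept)
print(isolate_side(amount, exclude_negative_values=True))  # 0.0  (credit side empty)

# Trailing boilerplate is removed from a merged description.
merged = "CARD PAYMENT | TESCO | BALANCE CARRIED FORWARD"
print(truncate_at(merged, r" \| BALANCE CARRIED FORWARD"))  # CARD PAYMENT | TESCO
```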

StandardFields

Declaration of a single standard output column and how to derive it.

Field Type Status Description
section str ACTIVE Pipeline section this field belongs to: "header" (statement-level metadata extracted once per statement) or "lines" (per-transaction data). Used to dispatch the field to the correct extraction pass.
type str ACTIVE Data type of the standard column: "string", "numeric", or "date". Controls casting, multiplier application, and date parsing in get_standard_fields().
vital bool ACTIVE When True a ConfigError is raised if no matching StdRefs entry is found for the current statement type, halting processing. Set False for optional fields that not all statement types provide.
std_refs list[StdRefs] ACTIVE One entry per supported statement type. The correct entry is selected at runtime by matching StdRefs.statement_type.

Configuration Checklist

Use this checklist to verify your configuration is complete:

  • [ ] Account type registered in account_types.toml (or existing type reused)
  • [ ] Bank config folder created: project/config/import/<BANK_COUNTRY>/
  • [ ] companies.toml — company key, name, and PDF detection rule
  • [ ] statement_tables.toml — all table extraction rules (summary, detail, transaction)
  • [ ] statement_types.toml — header and lines config groups referencing your table keys
  • [ ] accounts.toml — account entries linking company, type, and statement type
  • [ ] standard_fields.toml: std_refs entries added for all 13 standard fields
  • [ ] Test with a real PDF: bsp process --pdfs /path/to/statements
  • [ ] Verify checks & balances pass (opening + payments_in - payments_out = closing)
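
The final check is plain arithmetic, so it is easy to reproduce by hand when a statement fails. The sketch below uses Decimal to avoid floating-point rounding; the function name balances_reconcile and the sample figures are invented, and it assumes payments out are stored as positive totals, as the formula implies.

```python
from decimal import Decimal


def balances_reconcile(opening: Decimal, payments_in: Decimal,
                       payments_out: Decimal, closing: Decimal) -> bool:
    """Return True when opening + payments_in - payments_out equals closing."""
    return opening + payments_in - payments_out == closing


# 1000.00 + 250.50 - 300.25 = 950.25
print(balances_reconcile(Decimal("1000.00"), Decimal("250.50"),
                         Decimal("300.25"), Decimal("950.25")))  # True

# A single missed transaction row shows up as a mismatch.
print(balances_reconcile(Decimal("1000.00"), Decimal("250.50"),
                         Decimal("280.25"), Decimal("950.25")))  # False
```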

Dataclass Reference

Complete reference for all configuration dataclasses defined in bank_statement_parser.modules.data. Fields marked STUB are declared but not currently read by the pipeline — they are reserved for future use.

Company

Configuration for a financial institution (bank/provider).

Field Type Status Description
company str ACTIVE Human-readable company name (e.g. "HSBC UK"). Used to populate the STD_COMPANY standard field.
config Config ACTIVE Extraction config used during the company-identification pass. Extracts a discriminating field (e.g. a bank-specific header string) to confirm the PDF belongs to this company before attempting account matching.
accounts dict STUB Declared but never accessed by the pipeline after load. Intended as a lookup from account key to Account object but currently unused.

Account

Full runtime configuration for one bank account.

Field Type Status Description
account str ACTIVE Human-readable account name (e.g. "Current Account"). Written to the STD_ACCOUNT standard field in the output.
company_key str ACTIVE Key into companies.toml identifying the issuing bank. Used to build ID_ACCOUNT and to look up the Company object at load time.
company Company ACTIVE Resolved at load time from company_key. Provides the company name and company-level identification config.
account_type_key str ACTIVE Key into account_types.toml (e.g. "CRD", "CUR", "SAV"). Used to look up the AccountType object at load time.
account_type AccountType STUB Resolved at load time from account_type_key. The AccountType object is populated but never subsequently read by any pipeline consumer.
statement_type_key str ACTIVE Key into statement_types.toml identifying the extraction layout for this account's statements. Used to look up the StatementType object at load time.
statement_type StatementType ACTIVE Resolved at load time from statement_type_key. Provides the header and lines ConfigGroups used during extraction.
exclude_last_n_pages int ACTIVE Number of trailing pages to skip when cloning per-page locations. Set to 1 (or more) when the final page(s) contain terms & conditions or other non-transaction content that would otherwise be passed to the extraction pipeline.
currency str ACTIVE ISO 4217 currency code for all monetary fields on this account (e.g. "GBP", "USD", "PHP"). Must be a key in currency_spec in currency.py; validated at config load time. Used by the extraction pipeline to resolve the CurrencySpec for fields of type "currency".
config Config ACTIVE Account-level identification config. A lightweight extraction step run to confirm a PDF belongs to this account before the full extraction pass. Defined inline under [ACCOUNT_KEY.config] in accounts.toml.

AccountType

Simple lookup label for an account type category.

Field Type Status Description
account_type str ACTIVE Account type label (e.g. "CRD" for credit card, "CUR" for current account). Populated at load time but not subsequently consumed by the pipeline; present for potential reporting or routing use.

StatementType

Full extraction specification for one statement layout variant.

Field Type Status Description
statement_type str ACTIVE Human-readable label matching the value used in StdRefs.statement_type (e.g. "HSBC UK Current Account"). Used to select the correct StdRefs mapping when promoting raw fields to standard columns.
header ConfigGroup ACTIVE Config steps that extract statement-level metadata: dates, account numbers, opening/closing balances, etc.
lines ConfigGroup ACTIVE Config steps that extract per-transaction data from the body of each page.

ConfigGroup

An ordered list of Config extraction steps for one pipeline section.

Field Type Status Description
configs list[Config] ACTIVE Ordered list of Config steps. Executed in sequence during extraction; results are stacked into the section's results DataFrame.

Config

A single extraction step: one table (or one standalone field) from one location.

Field Type Status Description
config str ACTIVE Human-readable label for this extraction step (e.g. "Statement Balances"). Written into the "config" column of the long-format results DataFrame for traceability.
statement_table_key str ACTIVE Key into statement_tables.toml that identifies the StatementTable to use. Resolved to statement_table at load time. Set to None for inline single-field configs.
statement_table StatementTable ACTIVE Resolved at load time from statement_table_key. The StatementTable object used during extraction. Not set directly in TOML.
locations list[Location] ACTIVE Used only for inline single-field configs (where statement_table is None). Defines where on the page to find the field value.
field Field ACTIVE Used only for inline single-field configs. Defines the extraction spec for the single value to read from the location.

StatementTable

Full configuration for extracting one table from a PDF statement.

Field Type Status Description
type str STUB Table type label: "transaction", "summary", or "detail". Loaded from TOML but not currently read by the pipeline; the extraction path is determined by whether transaction_spec is present rather than this field.
statement_table str STUB Human-readable table label (e.g. "Transactions", "Account Summary"). Loaded from TOML for documentation purposes but not consumed by the pipeline.
header_text str ACTIVE When set, the first table row whose text matches this string is stripped before extraction. Use when pdfplumber includes the column header row in the extracted data.
remove_header bool ACTIVE When True the first table row is unconditionally stripped. Use when the header row is always present but its text varies (making header_text impractical).
locations list[Location] ACTIVE One or more Location entries describing where on the page to find this table. Locations without a page_number are cloned for every page.
fields list[Field] ACTIVE Ordered list of field extraction specs. For transaction tables each field must have a column; for summary/detail tables each field must have a cell.
table_columns int ACTIVE Expected minimum number of columns in the extracted table. Passed to pdfplumber as min_words_horizontal and used to validate column count after extraction. Also triggers allow_text_failover retry logic.
table_rows int ACTIVE Expected minimum number of rows in the extracted table. Passed to pdfplumber as min_words_vertical.
row_spacing int ACTIVE pdfplumber snap_y_tolerance in PDF points. Rows whose top edges fall within this distance of each other are merged into the same table row. Increase if the statement uses tight line spacing that splits a single visual row across multiple pdfplumber rows.
tests list[Test] STUB Declarative post-extraction assertions. Declared and accepted in TOML but no pipeline code evaluates them. Reserved for a future config validation pass.
delete_success_false bool STUB Intended to drop rows where any field extraction returned success = False. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag.
delete_cast_success_false bool STUB Intended to drop rows where numeric casting failed. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag.
delete_rows_with_missing_vital_fields bool STUB Intended to drop rows where any vital field is missing after extraction. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. Note: vital-field hard-failure logic exists in validate() but is separate from this flag.
transaction_spec TransactionSpec ACTIVE When set, the table is processed as a transaction table using the bookend-based multi-row extraction path. Must be None for summary/detail tables.

Location

Describes a rectangular region on a PDF page from which a table or text is extracted.

Field Type Status Description
page_number int ACTIVE 1-based page number. When set the location is used only on that page. When None the location is cloned for every page (spawn_locations()).
top_left list[int] ACTIVE [x, y] coordinates of the top-left corner of the crop rectangle. Must be set together with bottom_right. When both are None the full page is used.
bottom_right list[int] ACTIVE [x, y] coordinates of the bottom-right corner of the crop rectangle. Must be set together with top_left.
vertical_lines list[int] ACTIVE Explicit x-coordinates of vertical column dividers supplied to pdfplumber as explicit_vertical_lines. Pairs of identical values create a zero-width gap that forces a column boundary (e.g. [100, 100, 200, 200]). When set, pdfplumber's automatic column detection is disabled for this region.
dynamic_last_vertical_line DynamicLineSpec ACTIVE When set, the final value in vertical_lines is replaced at runtime with an x-coordinate derived from a PDF image's bounding box. See DynamicLineSpec. Used where the rightmost column boundary floats with a logo.
allow_text_failover bool ACTIVE When True and the extracted table has the wrong number of columns, the extraction is retried without vertical_lines, falling back to pdfplumber's text-based column detection. Useful as a safety net for pages where the explicit dividers produce a malformed table.
try_shift_down int ACTIVE Number of PDF points to shift the crop rectangle downward (applied to both top_left[1] and bottom_right[1]) when the initial extraction returns an empty region. Handles statements where the table top boundary varies slightly between pages.
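
The interaction between page_number and Account.exclude_last_n_pages can be pictured with a short sketch. This is not the parser's spawn_locations(); clone_locations is a hypothetical helper operating on plain dictionaries, and it assumes that locations pinned to a page are passed through unchanged.

```python
def clone_locations(locations: list, page_count: int,
                    exclude_last_n_pages: int = 0) -> list:
    """Expand page-less locations to one copy per page, skipping trailing pages."""
    last_page = page_count - exclude_last_n_pages
    resolved = []
    for location in locations:
        if location.get("page_number") is not None:
            # Pinned locations are kept as written (assumption).
            resolved.append(dict(location))
            continue
        for page in range(1, last_page + 1):
            resolved.append({**location, "page_number": page})
    return resolved


# One page-less transaction region on a 4-page PDF whose last page is T&Cs.
region = {"top_left": [40, 200], "bottom_right": [560, 780]}
pages = [loc["page_number"] for loc in clone_locations([region], 4, 1)]
print(pages)  # [1, 2, 3]
```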

DynamicLineSpec

Locates the position of the last vertical column divider from an embedded PDF image.

Field Type Status Description
image_id int ACTIVE Zero-based index into the list of images on the page, identifying which image provides the boundary coordinate.
image_location_tag str ACTIVE Bounding-box attribute of the image to use as the x-coordinate (e.g. "x0" for left edge, "x1" for right edge).
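
As a rough illustration of how the two fields combine, the sketch below swaps the final vertical_lines entry for a coordinate read from an image's bounding box. The images list mimics the per-page image dictionaries that pdfplumber exposes, with invented coordinates; resolve_vertical_lines is a hypothetical name, not a function from the parser.

```python
def resolve_vertical_lines(vertical_lines: list, images: list,
                           image_id: int, image_location_tag: str) -> list:
    """Return a copy of vertical_lines with the last value taken from an image edge."""
    resolved = list(vertical_lines)
    resolved[-1] = images[image_id][image_location_tag]
    return resolved


# Invented bounding box for the first image on the page (e.g. a logo).
images = [{"x0": 480.0, "x1": 570.0}]
configured = [50, 50, 200, 200, 560]

print(resolve_vertical_lines(configured, images, image_id=0, image_location_tag="x0"))
# [50, 50, 200, 200, 480.0]
```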

Field

Extraction specification for a single column or cell within a PDF table.

Field Type Status Description
field str ACTIVE Output column name for this field (e.g. "date", "£_paid_out"). Used as the field identifier throughout the pipeline and in the output Parquet files.
cell Cell ACTIVE Row/column address for summary or detail table extraction. Mutually exclusive with column; set to None for transaction tables.
column int | None ACTIVE Zero-based column index for transaction table extraction. Mutually exclusive with cell; set to None for summary/detail tables.
vital bool ACTIVE When True, extraction failure for this field causes the row to be flagged as a hard failure and excluded from output. When False, failure is recorded but the row is retained.
type str ACTIVE Data type: "string", "numeric", or "currency". "string": raw text extraction; pattern matching and trimming applied. "numeric": numeric extraction with optional explicit currency stripping via currency_override. "currency": identical to "numeric" but inherits the CurrencySpec from the account's Account.currency rather than requiring an explicit currency_override on every field. Use "currency" for all monetary amount fields; reserve "numeric" for non-monetary numerics (e.g. APR, sort code).
strip_characters_start str ACTIVE Characters to strip from the start of the raw string before pattern matching (passed to Polars str.strip_chars_start()). Useful for leading currency symbols not covered by the account currency spec.
strip_characters_end str ACTIVE Characters to strip from the end of the raw string before pattern matching (passed to Polars str.strip_chars_end()).
currency_override str | None ACTIVE Explicit ISO 4217 currency key (e.g. "GBP") used when type == "numeric" and currency stripping is needed but should differ from the account-level Account.currency. Ignored when type == "currency" (which always uses the account-level currency). Omit for non-monetary numeric fields (e.g. APR, sort code) where no currency stripping is required.
numeric_modifier NumericModifier ACTIVE Sign/multiplier transformation applied after numeric casting. See NumericModifier. Omit for straightforward positive numeric values.
string_pattern str ACTIVE Regex pattern the extracted string must match. Extraction is marked as failed (success = False) if the value does not match. Used to validate field contents (e.g. date format) and to skip blank or irrelevant rows.
string_max_length int ACTIVE Maximum character length for string values; longer strings are truncated via str.head(). Useful for capping free-text description fields. Defaults to 999 if not set.
date_format str STUB Intended strptime format for date parsing at the Field level. Declared but never read by the pipeline; date format parsing is handled via StdRefs.format in get_standard_fields() instead.
value_offset FieldOffset ACTIVE When set, reads the field's value from an adjacent column (Field.column + FieldOffset.cols_offset) using the type and currency rules defined in the FieldOffset rather than those on this Field. The primary field column is still extracted normally; the offset column value replaces it in the output. See FieldOffset.

Cell

Zero-based row and column address of a cell within a PDF table.

Field Type Status Description
row int ACTIVE Zero-based row index within the extracted table.
col int ACTIVE Zero-based column index within the extracted table.

FieldOffset

Reads a field's value from an adjacent column rather than the field's own column.

Field Type Status Description
rows_offset int STUB Intended row offset for reading the value from a different row. Declared and accepted in TOML but never read by the pipeline; only cols_offset is currently consumed. Always set to 0 in TOML examples.
cols_offset int ACTIVE Column offset applied to Field.column to locate the source cell (e.g. 1 reads from the column immediately to the right).
vital bool ACTIVE Passed to the extraction pipeline for the offset field; when True extraction failure is treated as a hard failure for that row.
type str ACTIVE Data type for the offset value: "string", "numeric", or "currency". Overrides the parent Field.type for this value read.
currency_override str | None ACTIVE Explicit currency key (e.g. "GBP") for numeric stripping of the offset value when type == "numeric". Overrides the account-level currency. When type == "currency" the account-level currency is used and this is ignored.
numeric_modifier NumericModifier ACTIVE Sign/multiplier modifier for the offset value. Overrides the parent Field.numeric_modifier.

NumericModifier

Optional sign/multiplier transformation applied after numeric casting.

Field Type Status Description
prefix str ACTIVE If the raw value starts with this string the prefix is stripped and the multiplier applied. Use for formats like "(123.45)" where "(" signals a negative value.
suffix str ACTIVE If the raw value ends with this string the suffix is stripped and the multiplier applied. Use for formats like "123.45 CR" or "123.45D".
multiplier float ACTIVE Scalar applied to the cast value when the prefix/suffix matches, or unconditionally if neither prefix nor suffix is set. Typically -1 to invert sign.
exclude_negative_values bool ACTIVE When True, any negative result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column.
exclude_positive_values bool ACTIVE When True, any positive result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column.
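
The prefix, suffix, and multiplier rules interact in a way that is easier to follow in code. The sketch below is one reading of the table above and not the parser's implementation: apply_numeric_modifier is a hypothetical helper, and it assumes currency symbols and thousands separators have already been removed from the raw text.

```python
def apply_numeric_modifier(raw: str, prefix=None, suffix=None,
                           multiplier: float = 1.0,
                           exclude_negative_values: bool = False,
                           exclude_positive_values: bool = False) -> float:
    """Strip a matching prefix/suffix, cast, then apply multiplier and exclusions."""
    text = raw.strip()
    matched = False
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
        matched = True
    if suffix and text.endswith(suffix):
        text = text[: -len(suffix)]
        matched = True
    value = float(text.strip())
    # The multiplier applies on a match, or always when no prefix/suffix is configured.
    if matched or (prefix is None and suffix is None):
        value *= multiplier
    if exclude_negative_values and value < 0:
        return 0.0
    if exclude_positive_values and value > 0:
        return 0.0
    return value


print(apply_numeric_modifier("123.45 CR", suffix="CR", multiplier=-1.0))  # -123.45
print(apply_numeric_modifier("123.45", suffix="CR", multiplier=-1.0))     # 123.45
print(apply_numeric_modifier("50.00", multiplier=-1.0))                   # -50.0
```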

CurrencySpec

Currency formatting rules used to strip symbols and separators before numeric casting.

Field Type Status Description
name str ACTIVE Human-readable currency name (e.g. "British Pound Sterling").
symbols list[str] ACTIVE List of currency symbol strings to strip from the raw value before casting (e.g. ["£", "$"]). Replaced with empty string via str.replace_many().
seperator_decimal str STUB Intended decimal separator character (e.g. "."). Declared but never read by the pipeline; decimal handling is implicit after symbols and thousands separators are stripped.
seperators_thousands list[str] ACTIVE List of thousands-separator strings to strip (e.g. [","]). Replaced with empty string via str.replace_many() before casting.
round_decimals int STUB Intended rounding precision after casting. Declared but never read by the pipeline; no rounding is currently applied.
pattern str ACTIVE Regex pattern used to extract the numeric substring from the raw cell text before symbol/separator stripping. Passed to patmatch() via build_pattern().
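
A minimal sketch of the stripping step, assuming the numeric substring has already been isolated: every symbol and thousands separator is replaced with an empty string and the remainder is cast. The function name strip_currency is hypothetical, plain str.replace stands in for Polars str.replace_many(), and the parameter keeps the source's spelling of seperators_thousands.

```python
def strip_currency(raw: str, symbols: list, seperators_thousands: list) -> float:
    """Remove currency symbols and thousands separators, then cast to float."""
    text = raw
    for token in [*symbols, *seperators_thousands]:
        text = text.replace(token, "")
    return float(text)


# Illustrative spec values for a sterling account.
print(strip_currency("£1,234.56", symbols=["£"], seperators_thousands=[","]))  # 1234.56
```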

TransactionSpec

Full specification for extracting transactions from a transaction-type table.

Field Type Status Description
transaction_bookends list[TransactionBookend] ACTIVE One or more bookend definitions that identify transaction boundaries. Evaluated in order; a row matched by an earlier bookend is not re-matched by a later one. At least one bookend is required.
fill_forward_fields list[str] ACTIVE Field names whose null values should be forward-filled across rows within the same page after pivot. Use for sparse columns where a value (e.g. a date or payment type) appears only on the first row of a multi-row block and needs propagating to the end row.
merge_fields MergeFields ACTIVE When set, collapses multi-row text fields within each transaction into a single joined string. See MergeFields.
exclude_rows list[FieldValidation] ACTIVE Rows where any rule's field value matches its pattern are removed from the results before bookend detection runs. Use to suppress known non-transaction rows (e.g. a closing balance summary line) that would otherwise interfere with transaction counting or checks & balances. Each rule is a {field, pattern} pair; a row is excluded if any rule matches.
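
Forward-filling is the piece of TransactionSpec that most often surprises people, so here is a small sketch of the idea on one page of rows. It uses plain dictionaries instead of a DataFrame; fill_forward and the field names date and details are hypothetical.

```python
def fill_forward(rows: list, fields: list) -> list:
    """Copy the last seen non-null value of each listed field into later null rows."""
    last_seen = {}
    filled = []
    for row in rows:
        new_row = dict(row)
        for name in fields:
            if new_row.get(name) is None:
                new_row[name] = last_seen.get(name)
            else:
                last_seen[name] = new_row[name]
        filled.append(new_row)
    return filled


# The date is printed once, then left blank for the next transaction on that day.
page_rows = [
    {"date": "05 Dec 23", "details": "CARD PAYMENT"},
    {"date": None, "details": "DIRECT DEBIT"},
    {"date": "06 Dec 23", "details": "TRANSFER"},
]
print([row["date"] for row in fill_forward(page_rows, ["date"])])
# ['05 Dec 23', '05 Dec 23', '06 Dec 23']
```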

TransactionBookend

Defines how the start and end of a single transaction are detected within a table.

Field Type Status Description
start_fields list[str] ACTIVE Field names that are checked to identify the first row of a transaction. A row qualifies as a start row when at least min_non_empty_start of these fields extracted successfully (success = True).
min_non_empty_start int ACTIVE Minimum number of start_fields that must have extracted successfully for a row to be flagged as transaction_start = True.
end_fields list[str] ACTIVE Field names checked to identify the last row of a transaction. A row qualifies as an end row when at least min_non_empty_end of these fields extracted successfully.
min_non_empty_end int ACTIVE Minimum number of end_fields that must have extracted successfully for a row to be flagged as transaction_end = True.
extra_validation_start FieldValidation ACTIVE When set, any row where the named field's value does NOT match the pattern is excluded from being a start-bookend candidate for this bookend. Rows excluded here may still be captured by another bookend in the list. Useful for bookends that should only trigger on a specific row shape (e.g. an interest charge line identified by its details text).
extra_validation_end FieldValidation STUB Symmetric counterpart to extra_validation_start for end rows. Declared but not yet implemented in the pipeline; no code currently reads this field. Reserved for future use.
sticky_fields list[str] STUB Intended to forward-fill named fields from the start row of a transaction down to its end row, scoped within a single transaction (as opposed to fill_forward_fields which fills across transactions). Declared but not implemented; no pipeline code reads this field.
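
The start and end thresholds amount to counting how many of the named fields extracted successfully on each row. The sketch below shows that counting for a two-row transaction. It is a simplification with a single bookend and no extra_validation_start; flag_bookends and the field names are hypothetical, and each row maps a field name to its success flag.

```python
def flag_bookends(rows: list, start_fields: list, min_non_empty_start: int,
                  end_fields: list, min_non_empty_end: int) -> list:
    """Add transaction_start / transaction_end flags based on field success counts."""
    flagged = []
    for row in rows:
        starts = sum(1 for name in start_fields if row.get(name))
        ends = sum(1 for name in end_fields if row.get(name))
        flagged.append({
            **row,
            "transaction_start": starts >= min_non_empty_start,
            "transaction_end": ends >= min_non_empty_end,
        })
    return flagged


# Row 1 carries the date and description; row 2 carries the amount.
success = [
    {"date": True, "details": True, "paid_out": False, "paid_in": False},
    {"date": False, "details": True, "paid_out": True, "paid_in": False},
]
result = flag_bookends(success, ["date", "details"], 2, ["paid_out", "paid_in"], 1)
print([(r["transaction_start"], r["transaction_end"]) for r in result])
# [(True, False), (False, True)]
```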

FieldValidation

A field-name/regex-pattern pair used as a row filter or row qualification rule.

Field Type Status Description
field str ACTIVE Name of the extracted field (output column name) whose value is tested against the pattern.
pattern str ACTIVE Regex pattern tested via Polars str.contains(). For exclude_rows a match causes exclusion; for extra_validation_start a non-match causes exclusion.
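
For exclude_rows, the rule reads as "drop the row if any pattern is found anywhere in its field value". The sketch below reproduces that with re.search, which has the same contains-style semantics as Polars str.contains() for these patterns; the helper name exclude_rows and the sample rows are invented.

```python
import re


def exclude_rows(rows: list, rules: list) -> list:
    """Keep only rows where no rule's pattern is found in the rule's field."""
    return [
        row for row in rows
        if not any(re.search(rule["pattern"], row.get(rule["field"]) or "")
                   for rule in rules)
    ]


rows = [
    {"details": "BALANCE BROUGHT FORWARD"},
    {"details": "CARD PAYMENT TESCO"},
    {"details": "BALANCE CARRIED FORWARD"},
]
rules = [{"field": "details", "pattern": r"^BALANCE (BROUGHT|CARRIED) FORWARD"}]

print(exclude_rows(rows, rules))  # [{'details': 'CARD PAYMENT TESCO'}]
```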

MergeFields

Specifies how multi-row text fields are collapsed into a single output value.

Field Type Status Description
fields list[str] ACTIVE Names of the fields whose per-row values should be joined.
separator str ACTIVE Delimiter inserted between joined values (e.g. " | ").
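
In effect, merging joins the non-empty per-row values of a field with the separator. A minimal sketch for one transaction follows; merge_field is a hypothetical helper and the rows are invented, whereas the parser performs this step on its results DataFrame.

```python
def merge_field(rows: list, field: str, separator: str) -> str:
    """Join the non-empty values of one field across a transaction's rows."""
    return separator.join(row[field] for row in rows if row.get(field))


transaction_rows = [
    {"details": "CARD PAYMENT"},
    {"details": "TESCO STORES 1234"},
    {"details": None},
]
print(merge_field(transaction_rows, "details", " | "))
# CARD PAYMENT | TESCO STORES 1234
```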

StandardFields

Declaration of a single standard output column and how to derive it.

Field Type Status Description
section str ACTIVE Pipeline section this field belongs to: "header" (statement-level metadata extracted once per statement) or "lines" (per-transaction data). Used to dispatch the field to the correct extraction pass.
type str ACTIVE Data type of the standard column: "string", "numeric", or "date". Controls casting, multiplier application, and date parsing in get_standard_fields().
vital bool ACTIVE When True a ConfigError is raised if no matching StdRefs entry is found for the current statement type, halting processing. Set False for optional fields that not all statement types provide.
std_refs list[StdRefs] ACTIVE One entry per supported statement type. The correct entry is selected at runtime by matching StdRefs.statement_type.

StdRefs

Mapping rule that promotes a raw extracted field to a standard output column.

Field Type Status Description
statement_type str ACTIVE Key used to select this rule; matched against the statement type string of the PDF being processed (e.g. "HSBC UK Current Account").
field str ACTIVE Name of the raw extracted column to promote. Set to None (or omit) when a literal default value should be used instead of a column value.
format str ACTIVE strptime format string applied when StandardFields.type == "date" (e.g. "%-d %B %Y"). Ignored for numeric and string types.
default str ACTIVE Literal string value used as the output when field is None/absent. Useful for injecting constant metadata (e.g. transaction_type = "CC").
multiplier float ACTIVE Scalar applied to the value after casting when StandardFields.type == "numeric". Use -1 to invert sign (e.g. to convert a credit amount stored as positive into a negative figure).
exclude_positive_values bool ACTIVE When True, any positive numeric value is replaced with 0 after casting. Used to isolate debit-side figures from a combined amount column.
exclude_negative_values bool ACTIVE When True, any negative numeric value is replaced with 0 after casting. Used to isolate credit-side figures from a combined amount column.
terminator str ACTIVE Regex pattern; when present the string value is truncated at the first match position before being written to the standard column. Useful for stripping trailing boilerplate appended by merge_fields (e.g. " | BALANCE CARRIED FORWARD").

Test

Declarative test assertion attached to a StatementTable.

Field Type Status Description
test_desc str STUB Human-readable description of the test assertion.
assertion str STUB The assertion expression to evaluate (format TBD).