Adding a New Bank¶

This guide walks through the process of configuring bank_statement_parser to parse PDF statements from a new bank. The configuration is entirely TOML-based and does not require writing any Python code.

Not comfortable with TOML configuration?

Adding a new bank is a technical process that requires understanding the PDF layout of your statements and writing structured configuration files. If you'd prefer to request support instead, open a new bank request on the issue tracker — please attach an anonymised statement to help us build and test the configuration.

Overview¶

Adding support for a new bank involves creating and editing several TOML files that describe how to identify the bank's PDFs, locate tables on each page, extract field values, and map them to standard output columns.

The configuration lives in one bank-specific folder and two shared files:

Location	Purpose
`project/config/import/<BANK_COUNTRY>/`	Bank-specific config folder (4 TOML files)
`project/config/import/account_types.toml`	Shared account type registry
`project/config/import/standard_fields.toml`	Shared standard field mappings

Bank config folder structure¶

Each bank has its own subfolder named in SCREAMING_SNAKE_CASE (e.g. HSBC_UK, TSB_UK). A complete folder contains exactly four files:

File	Purpose	Key Dataclass
`companies.toml`	Bank identification (name + PDF detection rule)	`Company`
`accounts.toml`	Account definitions (one per product/card type)	`Account`
`statement_types.toml`	Statement layout definitions (header + lines extraction)	`StatementType`
`statement_tables.toml`	Physical table extraction rules (locations, fields, bookends)	`StatementTable`

Processing pipeline¶

Understanding the processing order helps when writing config:

Company identification — the Company.config extraction is run against page 1 to determine which bank issued the PDF.
Account identification — each Account.config is tried until one matches, identifying the specific account product.
Header extraction — the StatementType.header configs run to extract statement-level metadata (dates, balances, account details).
Lines extraction — the StatementType.lines configs run per-page to extract transaction rows.
Standard field mapping — raw extracted fields are mapped to STD_* output columns via standard_fields.toml.
Checks & balances — opening balance + payments in - payments out = closing balance is validated.
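The final check is plain arithmetic, and a minimal sketch of it may help when debugging a config whose totals do not reconcile. The function name `balances_reconcile` and the `tolerance` parameter are hypothetical and not part of bank_statement_parser; the pipeline's own validation may differ in detail.

```python
from decimal import Decimal


def balances_reconcile(opening, payments_in, payments_out, closing,
                       tolerance=Decimal("0.00")):
    """Return True when opening + payments in - payments out equals closing.

    Hypothetical helper: it only mirrors the rule stated in the pipeline list.
    """
    expected = opening + payments_in - payments_out
    return abs(expected - closing) <= tolerance


# Illustrative figures, not taken from a real statement.
ok = balances_reconcile(
    opening=Decimal("1200.50"),
    payments_in=Decimal("300.00"),
    payments_out=Decimal("450.25"),
    closing=Decimal("1050.25"),
)
print(ok)
```

Decimal is used here rather than float so that pence values compare exactly; that is a choice made for this sketch only.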

Step 1: Register the Account Type¶

If your bank uses an account type not already in account_types.toml, add a new entry. Most banks will use the existing types (CRD, CUR, SAV, ISA).

File: project/config/import/account_types.toml

[CRD]
account_type = "Credit Card"

[CUR]
account_type = "Current Account"

[SAV]
account_type = "Savings Account"

[ISA]
account_type = "ISA"

`AccountType`¶

Simple lookup label for an account type category.

Field	Type	Status	Description
`account_type`	`str`	ACTIVE	Human-readable account type label (e.g. "Credit Card" under the key CRD, "Current Account" under the key CUR). Populated at load time but not subsequently consumed by the pipeline; present for potential reporting or routing use.

Step 2: Create the Bank Config Folder¶

Create a new subfolder under project/config/import/ using the naming convention <BANK>_<COUNTRY> in SCREAMING_SNAKE_CASE:

project/config/import/
  HSBC_UK/          # existing
  TSB_UK/           # existing
  NEWBANK_UK/       # <- your new folder
    companies.toml
    accounts.toml
    statement_types.toml
    statement_tables.toml
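The folder and its four empty files can be created by hand; the sketch below shows one way to script it. The helper `scaffold_bank_folder` and the regex used to check the folder name are assumptions made for illustration and are not part of bank_statement_parser.

```python
import re
from pathlib import Path

# File names taken from the folder layout shown above.
CONFIG_FILES = (
    "companies.toml",
    "accounts.toml",
    "statement_types.toml",
    "statement_tables.toml",
)

# Assumed reading of the naming convention: upper-case words joined by underscores.
BANK_KEY_PATTERN = re.compile(r"^[A-Z0-9]+(_[A-Z0-9]+)+$")


def scaffold_bank_folder(import_root, bank_key):
    """Create <import_root>/<bank_key>/ containing four empty TOML files."""
    if not BANK_KEY_PATTERN.match(bank_key):
        raise ValueError(f"{bank_key!r} is not in <BANK>_<COUNTRY> form")
    folder = Path(import_root) / bank_key
    folder.mkdir(parents=True, exist_ok=True)
    for name in CONFIG_FILES:
        # touch() leaves any existing file content untouched.
        (folder / name).touch(exist_ok=True)
    return folder


if __name__ == "__main__":
    import tempfile

    # Demonstrated in a temporary directory rather than a real project tree.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_bank_folder(tmp, "NEWBANK_UK")
        print(sorted(p.name for p in created.iterdir()))
```

In a real checkout the first argument would be the project/config/import/ directory.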

Step 3: Define the Company¶

Create companies.toml in your new folder. This file identifies the bank by extracting a distinguishing piece of text from page 1 of the PDF (typically a website URL or bank name).

Example (HSBC_UK/companies.toml):

[HSBC_UK]
company = 'HSBC Bank UK'
[HSBC_UK.config]
    config = 'Company Info'
    locations = [
        {page_number = 1, top_left = [475,110], bottom_right = [575, 130]},
        {page_number = 1, top_left = [460,145], bottom_right = [575, 165]},
        {page_number = 1, top_left = [460,165], bottom_right = [575, 185]},
    ]
    field = {field = 'website', vital=true, type="string", string_pattern ='^www\.hsbc\.co\.uk$'}

How it works: The config block defines a small extraction region on page 1. The field spec extracts text from that region and checks it against string_pattern. If the pattern matches, this company is selected. Multiple locations can be provided — the pipeline tries each until one succeeds.
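The matching logic described above can be sketched in a few lines. This is an illustration only: `identify_company` is a hypothetical function, the input is assumed to be the text already extracted from each location in order, and trimming whitespace before matching is an assumption about the pipeline's behaviour.

```python
import re


def identify_company(texts_by_location, string_pattern):
    """Return the first extracted text that matches the pattern, else None.

    texts_by_location: text extracted from each configured location, in order.
    """
    pattern = re.compile(string_pattern)
    for text in texts_by_location:
        candidate = text.strip()  # assumption: surrounding whitespace is ignored
        if pattern.search(candidate):
            return candidate
    return None


# The pattern is the one from the HSBC_UK example; the extracted texts are made up.
hsbc_pattern = r"^www\.hsbc\.co\.uk$"
match = identify_company(["", "Your Statement", "www.hsbc.co.uk"], hsbc_pattern)
no_match = identify_company(["www.tsb.co.uk"], hsbc_pattern)
print(match, no_match)
```

Anchoring the pattern with ^ and $ keeps a company from matching on a longer string that merely contains its URL.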

Key dataclasses¶

`Company`¶

Configuration for a financial institution (bank/provider).

Field	Type	Status	Description
`company`	`str`	ACTIVE	Human-readable company name (e.g. "HSBC Bank UK"). Used to populate the STD_COMPANY standard field.
`config`	`Config`	ACTIVE	Extraction config used during the company-identification pass. Extracts a discriminating field (e.g. a bank-specific header string) to confirm the PDF belongs to this company before attempting account matching.
`accounts`	`dict`	STUB	Declared but never accessed by the pipeline after load. Intended as a lookup from account key to Account object but currently unused.

`Config`¶

A single extraction step: one table (or one standalone field) from one location.

Field	Type	Status	Description
`config`	`str`	ACTIVE	Human-readable label for this extraction step (e.g. "Statement Balances"). Written into the "config" column of the long-format results DataFrame for traceability.
`statement_table_key`	`str`	ACTIVE	Key into statement_tables.toml that identifies the StatementTable to use. Resolved to statement_table at load time. Set to None for inline single-field configs.
`statement_table`	`StatementTable`	ACTIVE	Resolved at load time from statement_table_key. The StatementTable object used during extraction. Not set directly in TOML.
`locations`	`list[Location]`	ACTIVE	Used only for inline single-field configs (where statement_table is None). Defines where on the page to find the field value.
`field`	`Field`	ACTIVE	Used only for inline single-field configs. Defines the extraction spec for the single value to read from the location.

`Location`¶

Describes a rectangular region on a PDF page from which a table or text is extracted.

Field	Type	Status	Description
`page_number`	`int`	ACTIVE	1-based page number. When set the location is used only on that page. When None the location is cloned for every page (spawn_locations()).
`top_left`	`list[int]`	ACTIVE	[x, y] coordinates of the top-left corner of the crop rectangle. Must be set together with bottom_right. When both are None the full page is used.
`bottom_right`	`list[int]`	ACTIVE	[x, y] coordinates of the bottom-right corner of the crop rectangle. Must be set together with top_left.
`vertical_lines`	`list[int]`	ACTIVE	Explicit x-coordinates of vertical column dividers supplied to pdfplumber as explicit_vertical_lines. Pairs of identical values create a zero-width gap that forces a column boundary (e.g. [100, 100, 200, 200]). When set, pdfplumber's automatic column detection is disabled for this region.
`dynamic_last_vertical_line`	`DynamicLineSpec`	ACTIVE	When set, the final value in vertical_lines is replaced at runtime with an x-coordinate derived from a PDF image's bounding box. See DynamicLineSpec. Used where the rightmost column boundary floats with a logo.
`allow_text_failover`	`bool`	ACTIVE	When True and the extracted table has the wrong number of columns, the extraction is retried without vertical_lines, falling back to pdfplumber's text-based column detection. Useful as a safety net for pages where the explicit dividers produce a malformed table.
`try_shift_down`	`int`	ACTIVE	Number of PDF points to shift the crop rectangle downward (applied to both top_left[1] and bottom_right[1]) when the initial extraction returns an empty region. Handles statements where the table top boundary varies slightly between pages.

`Field`¶

Extraction specification for a single column or cell within a PDF table.

Field	Type	Status	Description
`field`	`str`	ACTIVE	Output column name for this field (e.g. "date", "£_paid_out"). Used as the field identifier throughout the pipeline and in the output Parquet files.
`cell`	`Cell`	ACTIVE	Row/column address for summary or detail table extraction. Mutually exclusive with `column`; set to None for transaction tables.
`column`	`int | None`	ACTIVE	Zero-based column index for transaction table extraction. Mutually exclusive with `cell`; set to None for summary/detail tables.
`vital`	`bool`	ACTIVE	When True, extraction failure for this field causes the row to be flagged as a hard failure and excluded from output. When False, failure is recorded but the row is retained.
`type`	`str`	ACTIVE	Data type: "string", "numeric", or "currency". "string": raw text extraction, with pattern matching and trimming applied. "numeric": numeric extraction with optional explicit currency stripping via `currency_override`. "currency": identical to "numeric" but inherits the CurrencySpec from the account's `Account.currency` rather than requiring an explicit `currency_override` on every field. Use "currency" for all monetary amount fields; reserve "numeric" for non-monetary numerics (e.g. APR, sort code).
`strip_characters_start`	`str`	ACTIVE	Characters to strip from the start of the raw string before pattern matching (passed to Polars str.strip_chars_start()). Useful for leading currency symbols not covered by the account currency spec.
`strip_characters_end`	`str`	ACTIVE	Characters to strip from the end of the raw string before pattern matching (passed to Polars str.strip_chars_end()).
`currency_override`	`str | None`	ACTIVE	Explicit ISO 4217 currency key (e.g. "GBP") used when `type == "numeric"` and currency stripping is needed but should differ from the account-level `Account.currency`. Ignored when `type == "currency"` (which always uses the account-level currency). Omit for non-monetary numeric fields (e.g. APR, sort code) where no currency stripping is required.
`numeric_modifier`	`NumericModifier`	ACTIVE	Sign/multiplier transformation applied after numeric casting. See NumericModifier. Omit for straightforward positive numeric values.
`string_pattern`	`str`	ACTIVE	Regex pattern the extracted string must match. Extraction is marked as failed (success = False) if the value does not match. Used to validate field contents (e.g. date format) and to skip blank or irrelevant rows.
`string_max_length`	`int`	ACTIVE	Maximum character length for string values; longer strings are truncated via str.head(). Useful for capping free-text description fields. Defaults to 999 if not set.
`date_format`	`str`	STUB	Intended strptime format for date parsing at the Field level. Declared but never read by the pipeline; date format parsing is handled via StdRefs.format in get_standard_fields() instead.
`value_offset`	`FieldOffset`	ACTIVE	When set, reads the field's value from an adjacent column (Field.column + FieldOffset.cols_offset) using the type and currency rules defined in the FieldOffset rather than those on this Field. The primary field column is still extracted normally; the offset column value replaces it in the output. See FieldOffset.

Step 4: Define Statement Tables¶

Create statement_tables.toml to define how tables are physically extracted from the PDF pages. This is usually the most complex configuration file, as it requires understanding the precise layout of the bank's PDF statements.

Each table entry defines:

Where on the page to look (bounding box coordinates, vertical column dividers)
What fields to extract (column indices or cell addresses, data types, patterns)
How to handle multi-row transactions (bookend detection, field merging)

Table types¶

Three table type labels are available. The extraction path itself is chosen by whether transaction_spec is present, not by the type value:

Type	Use Case	Field Addressing	Has `transaction_spec`?
`summary`	Account balances, totals	`cell = {row, col}`	No
`detail`	Account holder info, sort codes	`cell = {row, col}`	No
`transaction`	Transaction line items	`column = N`	Yes

Summary table example¶

A summary table extracts fixed values from known cell positions (e.g. opening balance at row 1, column 1):

[HSBC_UK_CUR_ACCT_SUM]
type = "summary"
statement_table = 'Account Summary'
table_columns = 2
table_rows = 4
row_spacing = 7
locations = [
    {page_number=1, top_left = [345, 180], bottom_right = [575, 300], vertical_lines = [360, 475, 475, 550], dynamic_last_vertical_line = {image_id = 0, image_location_tag = "x1"}, allow_text_failover = true},
]
fields = [
    {field = 'opening_balance', cell = {row = 1, col = 1}, vital=true, type = 'currency', numeric_modifier = {suffix = "D", multiplier = -1}},
    {field = 'payments_in', cell = {row = 2, col = 1}, vital=true, type = 'currency'},
    {field = 'payments_out', cell = {row = 3, col = 1}, vital=true, type = 'currency'},
    {field = 'closing_balance', cell = {row = 4, col = 1}, vital=true, type = 'currency', numeric_modifier = {suffix = "D", multiplier = -1}},
]

Transaction table example¶

A transaction table extracts transactions that may each span a variable number of rows, using bookend detection to identify where each transaction starts and ends:

[HSBC_UK_CUR_TRANSACTIONS]
type = "transaction"
statement_table = 'Transactions'
table_columns = 6
locations = [
    {vertical_lines = [50, 100, 100, 130, 130, 320, 320, 400, 400, 480, 480, 555]},
]
fields = [
    {field = 'date', column = 0, vital=false, type = "string", string_pattern ='^[0-3][0-9]\s?[A-Z][a-z]{2}\s?[0-3][0-9]$'},
    {field = 'payment_type', column = 1, vital=false, type = "string", string_pattern ='(^[A-Z0-9]{1,3}$)|(^[)]{3}$)'},
    {field = 'details', column = 2, vital=true, type = "string", string_pattern ='.+', string_max_length = 100},
    {field = '£_paid_out', column = 3, vital=false, type = "currency"},
    {field = '£_paid_in', column = 4, vital=false, type = "currency"},
    {field = '£_balance', column = 5, vital=false, type = "currency", numeric_modifier = {suffix = "D", multiplier = -1.0000}},
]
delete_success_false = true
delete_cast_success_false = true
delete_rows_with_missing_vital_fields = true

[HSBC_UK_CUR_TRANSACTIONS.transaction_spec]
transaction_bookends = [
    {start_fields = ['payment_type','details'], min_non_empty_start = 2, end_fields = ['£_paid_out','£_paid_in'], min_non_empty_end = 1}
]
fill_forward_fields = ['date','payment_type']
merge_fields = {fields=['details'], separator=' | '}
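To make the bookend settings above more concrete, here is a simplified sketch of how start/end detection, forward filling and field merging could fit together. It works on plain dictionaries rather than the pipeline's Polars DataFrames, handles a single bookend only, and treats an empty or missing value as a failed extraction; `count_non_empty` and `group_transactions` are hypothetical names, and the sample rows are invented.

```python
def count_non_empty(row, fields):
    """Count how many of the named fields hold a non-empty value in this row."""
    return sum(1 for name in fields if row.get(name))


def group_transactions(rows, bookend, fill_forward_fields, merge_field, separator):
    """Group table rows into transactions using one start/end bookend."""
    transactions = []
    current = None   # rows collected for the transaction in progress
    last_seen = {}   # most recent non-empty value of each fill-forward field
    for row in rows:
        is_start = count_non_empty(row, bookend["start_fields"]) >= bookend["min_non_empty_start"]
        is_end = count_non_empty(row, bookend["end_fields"]) >= bookend["min_non_empty_end"]
        for name in fill_forward_fields:
            if row.get(name):
                last_seen[name] = row[name]
        if is_start:
            current = []
        if current is None:
            continue  # row lies outside any transaction
        current.append(row)
        if is_end:
            merged = {}
            for part in current:
                for key, value in part.items():
                    if value and key != merge_field:
                        merged[key] = value
            # Join the multi-row text field into one string.
            merged[merge_field] = separator.join(
                p[merge_field] for p in current if p.get(merge_field)
            )
            # Forward-fill sparse fields that this transaction did not carry itself.
            for name in fill_forward_fields:
                merged.setdefault(name, last_seen.get(name))
            transactions.append(merged)
            current = None
    return transactions


bookend = {
    "start_fields": ["payment_type", "details"], "min_non_empty_start": 2,
    "end_fields": ["£_paid_out", "£_paid_in"], "min_non_empty_end": 1,
}
rows = [
    {"date": "01 Jan 24", "payment_type": "DD", "details": "ACME INSURANCE"},
    {"details": "POLICY 12345", "£_paid_out": "25.00"},
    {"payment_type": "VIS", "details": "COFFEE SHOP", "£_paid_out": "3.20"},
]
transactions = group_transactions(rows, bookend, ["date", "payment_type"], "details", " | ")
for txn in transactions:
    print(txn)
```

The first two rows collapse into one transaction whose details are joined with the separator, and the third row inherits its date from the first. The real pipeline evaluates several bookends in order and supports extra_validation_start, neither of which this sketch attempts.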

Key dataclasses¶

`StatementTable`¶

Full configuration for extracting one table from a PDF statement.

Field	Type	Status	Description
`type`	`str`	STUB	Table type label: "transaction", "summary", or "detail". Loaded from TOML but not currently read by the pipeline; the extraction path is determined by whether transaction_spec is present rather than this field.
`statement_table`	`str`	STUB	Human-readable table label (e.g. "Transactions", "Account Summary"). Loaded from TOML for documentation purposes but not consumed by the pipeline.
`header_text`	`str`	ACTIVE	When set, the first table row whose text matches this string is stripped before extraction. Use when pdfplumber includes the column header row in the extracted data.
`remove_header`	`bool`	ACTIVE	When True the first table row is unconditionally stripped. Use when the header row is always present but its text varies (making header_text impractical).
`locations`	`list[Location]`	ACTIVE	One or more Location entries describing where on the page to find this table. Locations without a page_number are cloned for every page.
`fields`	`list[Field]`	ACTIVE	Ordered list of field extraction specs. For transaction tables each field must have a column; for summary/detail tables each field must have a cell.
`table_columns`	`int`	ACTIVE	Expected minimum number of columns in the extracted table. Passed to pdfplumber as min_words_horizontal and used to validate column count after extraction. Also triggers allow_text_failover retry logic.
`table_rows`	`int`	ACTIVE	Expected minimum number of rows in the extracted table. Passed to pdfplumber as min_words_vertical.
`row_spacing`	`int`	ACTIVE	pdfplumber snap_y_tolerance in PDF points. Rows whose top edges fall within this distance of each other are merged into the same table row. Increase if the statement uses tight line spacing that splits a single visual row across multiple pdfplumber rows.
`tests`	`list[Test]`	STUB	Declarative post-extraction assertions. Declared and accepted in TOML but no pipeline code evaluates them. Reserved for a future config validation pass.
`delete_success_false`	`bool`	STUB	Intended to drop rows where any field extraction returned success = False. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag.
`delete_cast_success_false`	`bool`	STUB	Intended to drop rows where numeric casting failed. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag.
`delete_rows_with_missing_vital_fields`	`bool`	STUB	Intended to drop rows where any vital field is missing after extraction. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. Note: vital-field hard-failure logic exists in validate() but is separate from this flag.
`transaction_spec`	`TransactionSpec`	ACTIVE	When set, the table is processed as a transaction table using the bookend-based multi-row extraction path. Must be None for summary/detail tables.

`Location`¶

Describes a rectangular region on a PDF page from which a table or text is extracted.

Field	Type	Status	Description
`page_number`	`int`	ACTIVE	1-based page number. When set the location is used only on that page. When None the location is cloned for every page (spawn_locations()).
`top_left`	`list[int]`	ACTIVE	[x, y] coordinates of the top-left corner of the crop rectangle. Must be set together with bottom_right. When both are None the full page is used.
`bottom_right`	`list[int]`	ACTIVE	[x, y] coordinates of the bottom-right corner of the crop rectangle. Must be set together with top_left.
`vertical_lines`	`list[int]`	ACTIVE	Explicit x-coordinates of vertical column dividers supplied to pdfplumber as explicit_vertical_lines. Pairs of identical values create a zero-width gap that forces a column boundary (e.g. [100, 100, 200, 200]). When set, pdfplumber's automatic column detection is disabled for this region.
`dynamic_last_vertical_line`	`DynamicLineSpec`	ACTIVE	When set, the final value in vertical_lines is replaced at runtime with an x-coordinate derived from a PDF image's bounding box. See DynamicLineSpec. Used where the rightmost column boundary floats with a logo.
`allow_text_failover`	`bool`	ACTIVE	When True and the extracted table has the wrong number of columns, the extraction is retried without vertical_lines, falling back to pdfplumber's text-based column detection. Useful as a safety net for pages where the explicit dividers produce a malformed table.
`try_shift_down`	`int`	ACTIVE	Number of PDF points to shift the crop rectangle downward (applied to both top_left[1] and bottom_right[1]) when the initial extraction returns an empty region. Handles statements where the table top boundary varies slightly between pages.

`DynamicLineSpec`¶

Locates the position of the last vertical column divider from an embedded PDF image.

Field	Type	Status	Description
`image_id`	`int`	ACTIVE	Zero-based index into the list of images on the page, identifying which image provides the boundary coordinate.
`image_location_tag`	`str`	ACTIVE	Bounding-box attribute of the image to use as the x-coordinate (e.g. "x0" for left edge, "x1" for right edge).

`Field`¶

Extraction specification for a single column or cell within a PDF table.

Field	Type	Status	Description
`field`	`str`	ACTIVE	Output column name for this field (e.g. "date", "£_paid_out"). Used as the field identifier throughout the pipeline and in the output Parquet files.
`cell`	`Cell`	ACTIVE	Row/column address for summary or detail table extraction. Mutually exclusive with `column`; set to None for transaction tables.
`column`	`int | None`	ACTIVE	Zero-based column index for transaction table extraction. Mutually exclusive with `cell`; set to None for summary/detail tables.
`vital`	`bool`	ACTIVE	When True, extraction failure for this field causes the row to be flagged as a hard failure and excluded from output. When False, failure is recorded but the row is retained.
`type`	`str`	ACTIVE	Data type: "string", "numeric", or "currency". "string": raw text extraction, with pattern matching and trimming applied. "numeric": numeric extraction with optional explicit currency stripping via `currency_override`. "currency": identical to "numeric" but inherits the CurrencySpec from the account's `Account.currency` rather than requiring an explicit `currency_override` on every field. Use "currency" for all monetary amount fields; reserve "numeric" for non-monetary numerics (e.g. APR, sort code).
`strip_characters_start`	`str`	ACTIVE	Characters to strip from the start of the raw string before pattern matching (passed to Polars str.strip_chars_start()). Useful for leading currency symbols not covered by the account currency spec.
`strip_characters_end`	`str`	ACTIVE	Characters to strip from the end of the raw string before pattern matching (passed to Polars str.strip_chars_end()).
`currency_override`	`str | None`	ACTIVE	Explicit ISO 4217 currency key (e.g. "GBP") used when `type == "numeric"` and currency stripping is needed but should differ from the account-level `Account.currency`. Ignored when `type == "currency"` (which always uses the account-level currency). Omit for non-monetary numeric fields (e.g. APR, sort code) where no currency stripping is required.
`numeric_modifier`	`NumericModifier`	ACTIVE	Sign/multiplier transformation applied after numeric casting. See NumericModifier. Omit for straightforward positive numeric values.
`string_pattern`	`str`	ACTIVE	Regex pattern the extracted string must match. Extraction is marked as failed (success = False) if the value does not match. Used to validate field contents (e.g. date format) and to skip blank or irrelevant rows.
`string_max_length`	`int`	ACTIVE	Maximum character length for string values; longer strings are truncated via str.head(). Useful for capping free-text description fields. Defaults to 999 if not set.
`date_format`	`str`	STUB	Intended strptime format for date parsing at the Field level. Declared but never read by the pipeline; date format parsing is handled via StdRefs.format in get_standard_fields() instead.
`value_offset`	`FieldOffset`	ACTIVE	When set, reads the field's value from an adjacent column (Field.column + FieldOffset.cols_offset) using the type and currency rules defined in the FieldOffset rather than those on this Field. The primary field column is still extracted normally; the offset column value replaces it in the output. See FieldOffset.

`Cell`¶

Zero-based row and column address of a cell within a PDF table.

Field	Type	Status	Description
`row`	`int`	ACTIVE	Zero-based row index within the extracted table.
`col`	`int`	ACTIVE	Zero-based column index within the extracted table.

`NumericModifier`¶

Optional sign/multiplier transformation applied after numeric casting.

Field	Type	Status	Description
`prefix`	`str`	ACTIVE	If the raw value starts with this string the prefix is stripped and the multiplier applied. Use for formats like "(123.45)" where "(" signals a negative value.
`suffix`	`str`	ACTIVE	If the raw value ends with this string the suffix is stripped and the multiplier applied. Use for formats like "123.45 CR" or "123.45D".
`multiplier`	`float`	ACTIVE	Scalar applied to the cast value when the prefix/suffix matches, or unconditionally if neither prefix nor suffix is set. Typically -1 to invert sign.
`exclude_negative_values`	`bool`	ACTIVE	When True, any negative result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column.
`exclude_positive_values`	`bool`	ACTIVE	When True, any positive result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column.
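The interaction between suffix, multiplier and the exclude flags is easier to see with a small sketch. The `Modifier` dataclass and `apply_modifier` function below are hypothetical stand-ins written for this guide, not the library's own NumericModifier implementation, and the cast to float is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Modifier:
    """Hypothetical stand-in mirroring the NumericModifier fields."""
    prefix: Optional[str] = None
    suffix: Optional[str] = None
    multiplier: float = 1.0
    exclude_negative_values: bool = False
    exclude_positive_values: bool = False


def apply_modifier(raw, mod):
    """Strip a matching prefix/suffix, cast to float, then apply the modifier."""
    text = raw.strip()
    # With neither prefix nor suffix set, the multiplier applies unconditionally.
    matched = mod.prefix is None and mod.suffix is None
    if mod.prefix and text.startswith(mod.prefix):
        text = text[len(mod.prefix):]
        matched = True
    if mod.suffix and text.endswith(mod.suffix):
        text = text[:-len(mod.suffix)]
        matched = True
    value = float(text.strip())
    if matched:
        value *= mod.multiplier
    if mod.exclude_negative_values and value < 0:
        value = 0.0
    if mod.exclude_positive_values and value > 0:
        value = 0.0
    return value


overdrawn = Modifier(suffix="D", multiplier=-1)
print(apply_modifier("123.45D", overdrawn))  # suffix present, sign inverted
print(apply_modifier("123.45", overdrawn))   # suffix absent, value unchanged
```

This matches the `numeric_modifier = {suffix = "D", multiplier = -1}` setting used for balances in the HSBC examples, where a trailing "D" marks an overdrawn amount.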

`FieldOffset`¶

Reads a field's value from an adjacent column rather than the field's own column.

Field	Type	Status	Description
`rows_offset`	`int`	STUB	Intended row offset for reading the value from a different row. Declared and accepted in TOML but never read by the pipeline; only cols_offset is currently consumed. Always set to 0 in TOML examples.
`cols_offset`	`int`	ACTIVE	Column offset applied to Field.column to locate the source cell (e.g. 1 reads from the column immediately to the right).
`vital`	`bool`	ACTIVE	Passed to the extraction pipeline for the offset field; when True extraction failure is treated as a hard failure for that row.
`type`	`str`	ACTIVE	Data type for the offset value: "string", "numeric", or "currency". Overrides the parent Field.type for this value read.
`currency_override`	`str | None`	ACTIVE	Explicit currency key (e.g. "GBP") for numeric stripping of the offset value when type == "numeric". Overrides the account-level currency. When type == "currency" the account-level currency is used and this is ignored.
`numeric_modifier`	`NumericModifier`	ACTIVE	Sign/multiplier modifier for the offset value. Overrides the parent Field.numeric_modifier.

`CurrencySpec`¶

Currency formatting rules used to strip symbols and separators before numeric casting.

Field	Type	Status	Description
`name`	`str`	ACTIVE	Human-readable currency name (e.g. "British Pound Sterling").
`symbols`	`list[str]`	ACTIVE	List of currency symbol strings to strip from the raw value before casting (e.g. ["£", "$"]). Replaced with empty string via str.replace_many().
`seperator_decimal`	`str`	STUB	Intended decimal separator character (e.g. "."). Declared but never read by the pipeline; decimal handling is implicit after symbols and thousands separators are stripped.
`seperators_thousands`	`list[str]`	ACTIVE	List of thousands-separator strings to strip (e.g. [","]). Replaced with empty string via str.replace_many() before casting.
`round_decimals`	`int`	STUB	Intended rounding precision after casting. Declared but never read by the pipeline; no rounding is currently applied.
`pattern`	`str`	ACTIVE	Regex pattern used to extract the numeric substring from the raw cell text before symbol/separator stripping. Passed to patmatch() via build_pattern().
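As a rough illustration of the order of operations described in this table (extract the numeric substring with the pattern, strip symbols and thousands separators, then cast), consider the sketch below. The function `to_number` and the regex `GBP_PATTERN` are assumptions made for this example; the pattern actually shipped for any currency may differ.

```python
import re

# Hypothetical pattern: optional sign and symbol, grouped thousands, optional pence.
GBP_PATTERN = r"-?[£$]?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"


def to_number(raw, symbols, separators_thousands, pattern):
    """Extract the numeric substring, strip symbols/separators, cast to float."""
    found = re.search(pattern, raw)
    if found is None:
        return None  # nothing numeric in the cell
    text = found.group(0)
    for token in [*symbols, *separators_thousands]:
        text = text.replace(token, "")
    return float(text)


amount = to_number("£1,234.56", ["£"], [","], GBP_PATTERN)
missing = to_number("n/a", ["£"], [","], GBP_PATTERN)
print(amount, missing)
```

Because the decimal separator is left in place once symbols and thousands separators are removed, no separate decimal setting is needed for the cast, which is consistent with seperator_decimal being marked STUB.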

`TransactionSpec`¶

Full specification for extracting transactions from a transaction-type table.

Field	Type	Status	Description
`transaction_bookends`	`list[TransactionBookend]`	ACTIVE	One or more bookend definitions that identify transaction boundaries. Evaluated in order; a row matched by an earlier bookend is not re-matched by a later one. At least one bookend is required.
`fill_forward_fields`	`list[str]`	ACTIVE	Field names whose null values should be forward-filled across rows within the same page after pivot. Use for sparse columns where a value (e.g. a date or payment type) appears only on the first row of a multi-row block and needs propagating to the end row.
`merge_fields`	`MergeFields`	ACTIVE	When set, collapses multi-row text fields within each transaction into a single joined string. See MergeFields.
`exclude_rows`	`list[FieldValidation]`	ACTIVE	Rows where any rule's field value matches its pattern are removed from the results before bookend detection runs. Use to suppress known non-transaction rows (e.g. a closing balance summary line) that would otherwise interfere with transaction counting or checks & balances. Each rule is a {field, pattern} pair; a row is excluded if any rule matches.

`TransactionBookend`¶

Defines how the start and end of a single transaction are detected within a table.

Field	Type	Status	Description
`start_fields`	`list[str]`	ACTIVE	Field names that are checked to identify the first row of a transaction. A row qualifies as a start row when at least min_non_empty_start of these fields extracted successfully (success = True).
`min_non_empty_start`	`int`	ACTIVE	Minimum number of start_fields that must have extracted successfully for a row to be flagged as transaction_start = True.
`end_fields`	`list[str]`	ACTIVE	Field names checked to identify the last row of a transaction. A row qualifies as an end row when at least min_non_empty_end of these fields extracted successfully.
`min_non_empty_end`	`int`	ACTIVE	Minimum number of end_fields that must have extracted successfully for a row to be flagged as transaction_end = True.
`extra_validation_start`	`FieldValidation`	ACTIVE	When set, any row where the named field's value does NOT match the pattern is excluded from being a start-bookend candidate for this bookend. Rows excluded here may still be captured by another bookend in the list. Useful for bookends that should only trigger on a specific row shape (e.g. an interest charge line identified by its details text).
`extra_validation_end`	`FieldValidation`	STUB	Symmetric counterpart to extra_validation_start for end rows. Declared but not yet implemented in the pipeline; no code currently reads this field. Reserved for future use.
`sticky_fields`	`list[str]`	STUB	Intended to forward-fill named fields from the start row of a transaction down to its end row, scoped within a single transaction (as opposed to fill_forward_fields which fills across transactions). Declared but not implemented; no pipeline code reads this field.

`FieldValidation`¶

A field-name/regex-pattern pair used as a row filter or row qualification rule.

Field	Type	Status	Description
`field`	`str`	ACTIVE	Name of the extracted field (output column name) whose value is tested against the pattern.
`pattern`	`str`	ACTIVE	Regex pattern tested via Polars str.contains(). For exclude_rows a match causes exclusion; for extra_validation_start a non-match causes exclusion.
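The exclude_rows behaviour can be approximated with the standard library. In this sketch `re.search` stands in for Polars `str.contains()` (both look for the pattern anywhere in the string unless it is anchored); the function name `apply_exclude_rows`, the rule and the sample rows are hypothetical.

```python
import re


def apply_exclude_rows(rows, rules):
    """Drop every row in which any rule's field value matches its pattern."""
    kept = []
    for row in rows:
        excluded = any(
            re.search(rule["pattern"], row.get(rule["field"]) or "")
            for rule in rules
        )
        if not excluded:
            kept.append(row)
    return kept


# Illustrative rule: suppress carried/brought-forward balance lines.
rules = [{"field": "details", "pattern": r"^BALANCE (CARRIED|BROUGHT) FORWARD$"}]
rows = [
    {"details": "BALANCE BROUGHT FORWARD", "£_balance": "100.00"},
    {"details": "COFFEE SHOP", "£_paid_out": "3.20"},
    {"details": "BALANCE CARRIED FORWARD", "£_balance": "96.80"},
]
kept = apply_exclude_rows(rows, rules)
print(kept)
```

For extra_validation_start the test is inverted: a row is dropped as a start candidate when its value does not match the pattern.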

`MergeFields`¶

Specifies how multi-row text fields are collapsed into a single output value.

Field	Type	Status	Description
`fields`	`list[str]`	ACTIVE	Names of the fields whose per-row values should be joined.
`separator`	`str`	ACTIVE	Delimiter inserted between joined values (e.g. " | ").

Step 5: Define Statement Types¶

Create statement_types.toml to define the extraction workflow for each distinct statement layout. A single bank may have multiple statement types (e.g. current account vs. credit card) if their PDF layouts differ.

Each statement type groups extraction into two sections:

header — runs once per statement to extract metadata (dates, balances, account info)
lines — runs per-page to extract transaction rows

Configs within each section either reference a statement_table_key from statement_tables.toml or define an inline single-field extraction.

Example (HSBC_UK/statement_types.toml):

[HSBC_UK_CUR]
statement_type = 'HSBC UK Current Account'
    [HSBC_UK_CUR.header]
        [[HSBC_UK_CUR.header.configs]]
            config = 'Statement Balances'
            statement_table_key = 'HSBC_UK_CUR_ACCT_SUM'

        [[HSBC_UK_CUR.header.configs]]
            config = 'Statement Info'
            locations = [
                {page_number=1, top_left = [37, 325], bottom_right = [290, 385]}
                ]
            field = {field = 'statement_date', vital=true, type = "string", string_pattern ='^[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s(\d{4}\s)?to\s[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$'}

        [[HSBC_UK_CUR.header.configs]]
            config = 'Account Info'
            statement_table_key = 'HSBC_UK_CUR_ACCT_DET'

    [HSBC_UK_CUR.lines]
        [[HSBC_UK_CUR.lines.configs]]
            config = 'Transaction Lines'
            statement_table_key = 'HSBC_UK_CUR_TRANSACTIONS'

[HSBC_UK_SAV]
statement_type = 'HSBC UK Saving Account'
    [HSBC_UK_SAV.header]
        [[HSBC_UK_SAV.header.configs]]
            config = 'Statement Balances'
            statement_table_key = 'HSBC_UK_CUR_ACCT_SUM'

        [[HSBC_UK_SAV.header.configs]]
            config = 'Statement Info'
            locations = [{page_number=1, top_left = [37, 325], bottom_right = [290, 385]}]
            field = {field = 'statement_date', vital=true, type = "string", string_pattern ='^[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s(\d{4}\s)?to\s[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$'}

        [[HSBC_UK_SAV.header.configs]]
            config = 'Account Info'
            statement_table_key = 'HSBC_UK_SAV_ACCT_DET'

    [HSBC_UK_SAV.lines]
        [[HSBC_UK_SAV.lines.configs]]
            config = 'Transaction Lines'
            statement_table_key = 'HSBC_UK_CUR_TRANSACTIONS'

[HSBC_UK_CRD]
statement_type = 'HSBC UK Credit Card'
    [HSBC_UK_CRD.header]
        [[HSBC_UK_CRD.header.configs]]
            config = 'Statement Balances'
            statement_table_key = 'HSBC_UK_CRD_ACCT_SUM'

        [[HSBC_UK_CRD.header.configs]]
            config = 'Statement Info'
            locations = [{page_number=1, top_left = [120, 470], bottom_right = [250, 500]}]
            field = {field = 'statement_date', vital=true, type = "string", string_pattern ='^[0-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$'}

        [[HSBC_UK_CRD.header.configs]]
            config = 'Account Info'
            locations = [{page_number=2, top_left = [44, 174], bottom_right = [215, 200]}]
            field = {field = 'account_name', vital=true, type = "string", string_pattern ='^[A-Z]+[a-z]*\s[A-Z]+[a-z]*.*$'}

        [[HSBC_UK_CRD.header.configs]]
            config = 'Account Info'
            locations = [{page_number=2, top_left = [220, 175], bottom_right = [340, 197]}]
            field = {field = 'card_number', vital=true, type = "string", string_pattern ='^[0-9]{4}\s[0-9]{4}\s[0-9]{4}\s[0-9]{4}\s?$'}

    [HSBC_UK_CRD.lines]
        [[HSBC_UK_CRD.lines.configs]]
            config = 'Transaction Lines'
            statement_table_key = 'HSBC_UK_CRD_TRANSACTIONS'
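
The two config styles used above can be told apart mechanically. The sketch below works on plain Python dicts shaped like the parsed TOML; the `classify_config` helper is hypothetical (it is not part of bank_statement_parser), and treating the two styles as mutually exclusive is an assumption drawn from the field descriptions in this guide.

```python
# Hypothetical helper, not part of bank_statement_parser: it only illustrates
# the two shapes a config step can take in statement_types.toml.

def classify_config(step: dict) -> str:
    """Return 'table' for a statement_table_key reference, 'inline' for a single field."""
    has_table = "statement_table_key" in step
    has_inline = "locations" in step and "field" in step
    # Assumption: a step uses exactly one of the two styles.
    if has_table == has_inline:
        raise ValueError(f"config {step.get('config')!r} must use exactly one style")
    return "table" if has_table else "inline"


# Dict equivalent of the first two HSBC_UK_CUR header configs
# (string_pattern omitted for brevity).
header_configs = [
    {"config": "Statement Balances", "statement_table_key": "HSBC_UK_CUR_ACCT_SUM"},
    {
        "config": "Statement Info",
        "locations": [{"page_number": 1, "top_left": [37, 325], "bottom_right": [290, 385]}],
        "field": {"field": "statement_date", "vital": True, "type": "string"},
    },
]

kinds = [classify_config(step) for step in header_configs]
print(kinds)
```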

Key dataclasses¶

`StatementType`¶

Full extraction specification for one statement layout variant.

Field	Type	Status	Description
`statement_type`	`str`	ACTIVE	Human-readable label matching the value used in StdRefs.statement_type (e.g. "HSBC UK Current Account"). Used to select the correct StdRefs mapping when promoting raw fields to standard columns.
`header`	`ConfigGroup`	ACTIVE	Config steps that extract statement-level metadata: dates, account numbers, opening/closing balances, etc.
`lines`	`ConfigGroup`	ACTIVE	Config steps that extract per-transaction data from the body of each page.

`ConfigGroup`¶

An ordered list of Config extraction steps for one pipeline section.

Field	Type	Status	Description
`configs`	`list[Config]`	ACTIVE	Ordered list of Config steps. Executed in sequence during extraction; results are stacked into the section's results DataFrame.

`Config`¶

A single extraction step: one table (or one standalone field) from one location.

Field	Type	Status	Description
`config`	`str`	ACTIVE	Human-readable label for this extraction step (e.g. "Statement Balances"). Written into the "config" column of the long-format results DataFrame for traceability.
`statement_table_key`	`str`	ACTIVE	Key into statement_tables.toml that identifies the StatementTable to use. Resolved to statement_table at load time. Set to None for inline single-field configs.
`statement_table`	`StatementTable`	ACTIVE	Resolved at load time from statement_table_key. The StatementTable object used during extraction. Not set directly in TOML.
`locations`	`list[Location]`	ACTIVE	Used only for inline single-field configs (where statement_table is None). Defines where on the page to find the field value.
`field`	`Field`	ACTIVE	Used only for inline single-field configs. Defines the extraction spec for the single value to read from the location.

Step 6: Define Accounts¶

Create accounts.toml to define each account product offered by the bank. Each account links together a company, an account type, and a statement type, and defines a PDF detection rule that identifies which account a given statement belongs to.

Example (HSBC_UK/accounts.toml — first entry shown):

[HSBC_UK_CRD_RCC]
account = "Rewards Credit Card"
company_key = 'HSBC_UK'
account_type_key = 'CRD'
statement_type_key = 'HSBC_UK_CRD'
exclude_last_n_pages = 1
currency = "GBP"
    [HSBC_UK_CRD_RCC.config]
    config = 'Account Info'
    locations = [{page_number = 1, top_left = [275, 20], bottom_right = [575, 70]}]
    field = {field = 'account', vital=true, type="string", string_pattern ='^Your[\s]*Rewards[\s]*Credit[\s]*Card[\s]*statement[\s]*'}

Key fields:

company_key — must match a key in your companies.toml
account_type_key — must match a key in the shared account_types.toml (e.g. CRD, CUR)
statement_type_key — must match a key in your statement_types.toml
exclude_last_n_pages — number of trailing pages to skip (terms & conditions, etc.)
config — inline extraction rule to identify this account from page 1 of the PDF
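
Because the three `*_key` fields are cross-file references, a quick pre-flight check can catch typos before a PDF is processed. This is a minimal sketch on plain dicts standing in for the parsed TOML files; `missing_references` is a hypothetical helper, not a function provided by bank_statement_parser.

```python
# Hypothetical pre-flight check, not part of bank_statement_parser.

def missing_references(account_key, account, companies, account_types, statement_types):
    """Return human-readable problems for one accounts.toml entry."""
    lookups = {
        "company_key": companies,
        "account_type_key": account_types,
        "statement_type_key": statement_types,
    }
    problems = []
    for field_name, registry in lookups.items():
        value = account.get(field_name)
        if value not in registry:
            problems.append(f"{account_key}.{field_name} = {value!r} is not defined")
    return problems


account = {
    "account": "Rewards Credit Card",
    "company_key": "HSBC_UK",
    "account_type_key": "CRD",
    "statement_type_key": "HSBC_UK_CRD",
}
companies = {"HSBC_UK": {}}
account_types = {"CRD": {}, "CUR": {}}
statement_types = {"HSBC_UK_CUR": {}}  # HSBC_UK_CRD deliberately left out

problems = missing_references("HSBC_UK_CRD_RCC", account, companies, account_types, statement_types)
print(problems)
```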

Key dataclasses¶

`Account`¶

Full runtime configuration for one bank account.

Field	Type	Status	Description
`account`	`str`	ACTIVE	Human-readable account name (e.g. "Current Account"). Written to the STD_ACCOUNT standard field in the output.
`company_key`	`str`	ACTIVE	Key into companies.toml identifying the issuing bank. Used to build ID_ACCOUNT and to look up the Company object at load time.
`company`	`Company`	ACTIVE	Resolved at load time from company_key. Provides the company name and company-level identification config.
`account_type_key`	`str`	ACTIVE	Key into account_types.toml (e.g. "CRD", "CUR", "SAV"). Used to look up the AccountType object at load time.
`account_type`	`AccountType`	STUB	Resolved at load time from account_type_key. The AccountType object is populated but never subsequently read by any pipeline consumer.
`statement_type_key`	`str`	ACTIVE	Key into statement_types.toml identifying the extraction layout for this account's statements. Used to look up the StatementType object at load time.
`statement_type`	`StatementType`	ACTIVE	Resolved at load time from statement_type_key. Provides the header and lines ConfigGroups used during extraction.
`exclude_last_n_pages`	`int`	ACTIVE	Number of trailing pages to skip when cloning per-page locations. Set to 1 (or more) when the final page(s) contain terms & conditions or other non-transaction content that would otherwise be passed to the extraction pipeline.
`currency`	`str`	ACTIVE	ISO 4217 currency code for all monetary fields on this account (e.g. "GBP", "USD", "PHP"). Must be a key in `currency_spec` in `currency.py`; validated at config load time. Used by the extraction pipeline to resolve the CurrencySpec for fields of type "currency".
`config`	`Config`	ACTIVE	Account-level identification config. A lightweight extraction step run to confirm a PDF belongs to this account before the full extraction pass. Defined inline under `[ACCOUNT_KEY.config]` in accounts.toml.

Step 7: Register Standard Field Mappings¶

Finally, add entries for your new statement type(s) to the shared standard_fields.toml. This file maps bank-specific raw field names to standardised output columns (STD_*).

For each STD_* field, add a new std_refs entry with your statement type's name and the corresponding raw field name from your statement_tables.toml.

File: project/config/import/standard_fields.toml

Example (showing STD_OPENING_BALANCE with entries for multiple banks):

[STD_OPENING_BALANCE]
    section = "header"
    type = "numeric"
    vital = true
    std_refs = [
        {statement_type="HSBC UK Current Account", field="opening_balance"},
        {statement_type="HSBC UK Saving Account", field="opening_balance"},
        {statement_type="HSBC UK Credit Card", field="previous_balance", multiplier=-1.0000},
        {statement_type="TSB UK Current Account", field="opening_balance"},
    ]

Standard fields reference¶

Each statement type needs mappings for the following standard fields. A field with Vital = Yes (vital = true in standard_fields.toml) raises a ConfigError when no mapping is found for the statement type being processed.

Standard Field	Section	Type	Vital	Purpose
`STD_STATEMENT_DATE`	header	date	Yes	Statement period end date
`STD_SORTCODE`	header	string	No	Bank sort code
`STD_ACCOUNT_NUMBER`	header	string	Yes	Account or card number
`STD_ACCOUNT_HOLDER`	header	string	No	Account holder name
`STD_OPENING_BALANCE`	header	numeric	Yes	Opening balance
`STD_CLOSING_BALANCE`	header	numeric	Yes	Closing balance
`STD_PAYMENTS_IN`	header	numeric	Yes	Total credits in period
`STD_PAYMENTS_OUT`	header	numeric	Yes	Total debits in period
`STD_TRANSACTION_DATE`	lines	date	Yes	Individual transaction date
`STD_TRANSACTION_TYPE`	lines	string	Yes	Payment type code
`STD_TRANSACTION_DESC`	lines	string	Yes	Transaction description
`STD_PAYMENT_IN`	lines	numeric	Yes	Credit amount per transaction
`STD_PAYMENT_OUT`	lines	numeric	Yes	Debit amount per transaction

`std_refs` entry options¶

Each std_refs entry supports the following options:

`StdRefs`¶

Mapping rule that promotes a raw extracted field to a standard output column.

Field	Type	Status	Description
`statement_type`	`str`	ACTIVE	Key used to select this rule; matched against the statement type string of the PDF being processed (e.g. "HSBC UK Current Account").
`field`	`str`	ACTIVE	Name of the raw extracted column to promote. Set to None (or omit) when a literal default value should be used instead of a column value.
`format`	`str`	ACTIVE	strptime format string applied when StandardFields.type == "date" (e.g. "%-d %B %Y"). Ignored for numeric and string types.
`default`	`str`	ACTIVE	Literal string value used as the output when `field` is None/absent. Useful for injecting constant metadata (e.g. transaction_type = "CC").
`multiplier`	`float`	ACTIVE	Scalar applied to the value after casting when StandardFields.type == "numeric". Use -1 to invert sign (e.g. to convert a credit amount stored as positive into a negative figure).
`exclude_positive_values`	`bool`	ACTIVE	When True, any positive numeric value is replaced with 0 after casting. Used to isolate debit-side figures from a combined amount column.
`exclude_negative_values`	`bool`	ACTIVE	When True, any negative numeric value is replaced with 0 after casting. Used to isolate credit-side figures from a combined amount column.
`terminator`	`str`	ACTIVE	Regex pattern; when present the string value is truncated at the first match position before being written to the standard column. Useful for stripping trailing boilerplate appended by merge_fields (e.g. " | BALANCE CARRIED FORWARD").
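
The numeric and string options in the table can be read as a small post-processing step. The sketch below is an illustration only: both helpers are hypothetical, and applying the multiplier before the exclude flags is an assumption (the table only says each happens "after casting").

```python
import re

# Hypothetical helpers, not part of bank_statement_parser.

def apply_numeric_ref(value: float, ref: dict) -> float:
    """Apply multiplier, then the exclude_* flags (ordering is an assumption)."""
    result = value * ref.get("multiplier", 1.0)
    if ref.get("exclude_positive_values") and result > 0:
        result = 0.0
    if ref.get("exclude_negative_values") and result < 0:
        result = 0.0
    return result


def apply_terminator(text: str, ref: dict) -> str:
    """Truncate text at the first match of the terminator regex, if one is set."""
    terminator = ref.get("terminator")
    if terminator is None:
        return text
    match = re.search(terminator, text)
    return text if match is None else text[: match.start()]


# Credit card balances are stored as positive amounts owed, so the sign is flipped.
flipped = apply_numeric_ref(250.0, {"multiplier": -1.0})

# Keep only the debit side of a combined amount column.
debit_kept = apply_numeric_ref(-42.5, {"exclude_positive_values": True})
credit_dropped = apply_numeric_ref(42.5, {"exclude_positive_values": True})

trimmed = apply_terminator(
    "TESCO STORES | BALANCE CARRIED FORWARD",
    {"terminator": r" \| BALANCE CARRIED FORWARD"},
)
print(flipped, debit_kept, credit_dropped, trimmed)
```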

`StandardFields`¶

Declaration of a single standard output column and how to derive it.

Field	Type	Status	Description
`section`	`str`	ACTIVE	Pipeline section this field belongs to: "header" (statement-level metadata extracted once per statement) or "lines" (per-transaction data). Used to dispatch the field to the correct extraction pass.
`type`	`str`	ACTIVE	Data type of the standard column: "string", "numeric", or "date". Controls casting, multiplier application, and date parsing in get_standard_fields().
`vital`	`bool`	ACTIVE	When True a ConfigError is raised if no matching StdRefs entry is found for the current statement type, halting processing. Set False for optional fields that not all statement types provide.
`std_refs`	`list[StdRefs]`	ACTIVE	One entry per supported statement type. The correct entry is selected at runtime by matching StdRefs.statement_type.
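
The interaction between `vital` and `std_refs` can be sketched as a simple lookup. The `ConfigError` class below is a local stand-in and `select_std_ref` is a hypothetical function; the real selection logic may differ in detail.

```python
# Illustrative sketch only; names below are stand-ins, not the library's API.

class ConfigError(Exception):
    """Local stand-in for the error this guide says is raised."""


def select_std_ref(std_field_name: str, std_field: dict, statement_type: str):
    """Return the std_refs entry for statement_type, or None for optional fields."""
    for ref in std_field["std_refs"]:
        if ref["statement_type"] == statement_type:
            return ref
    if std_field.get("vital", False):
        raise ConfigError(f"{std_field_name}: no std_refs entry for {statement_type!r}")
    return None


# Dict equivalent of part of the STD_OPENING_BALANCE example.
opening_balance = {
    "section": "header",
    "type": "numeric",
    "vital": True,
    "std_refs": [
        {"statement_type": "HSBC UK Current Account", "field": "opening_balance"},
        {"statement_type": "HSBC UK Credit Card", "field": "previous_balance", "multiplier": -1.0},
    ],
}

ref = select_std_ref("STD_OPENING_BALANCE", opening_balance, "HSBC UK Credit Card")
print(ref["field"])
```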

Configuration Checklist¶

Use this checklist to verify your configuration is complete:

[ ] Account type registered in account_types.toml (or existing type reused)
[ ] Bank config folder created: project/config/import/<BANK_COUNTRY>/
[ ] companies.toml — company key, name, and PDF detection rule
[ ] statement_tables.toml — all table extraction rules (summary, detail, transaction)
[ ] statement_types.toml — header and lines config groups referencing your table keys
[ ] accounts.toml — account entries linking company, type, and statement type
[ ] standard_fields.toml — std_refs entries added for all 13 standard fields
[ ] Test with a real PDF: bsp process --pdfs /path/to/statements
[ ] Verify checks & balances pass (opening + payments_in - payments_out = closing)
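
The checks & balances identity in the last item can also be verified by hand against the extracted header values. A minimal sketch, using `Decimal` to avoid floating-point rounding; the function name and the sample figures are made up for illustration.

```python
from decimal import Decimal

# Hypothetical helper: verifies opening + payments_in - payments_out = closing.

def balances_reconcile(opening: Decimal, payments_in: Decimal,
                       payments_out: Decimal, closing: Decimal) -> bool:
    return opening + payments_in - payments_out == closing


ok = balances_reconcile(
    Decimal("1200.50"), Decimal("300.00"), Decimal("450.25"), Decimal("1050.25")
)
mismatch = balances_reconcile(
    Decimal("1200.50"), Decimal("300.00"), Decimal("450.25"), Decimal("1050.00")
)
print(ok, mismatch)
```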

Dataclass Reference¶

Complete reference for all configuration dataclasses defined in bank_statement_parser.modules.data. Fields marked STUB are declared but not currently read by the pipeline — they are reserved for future use.

`Company`¶

Configuration for a financial institution (bank/provider).

Field	Type	Status	Description
`company`	`str`	ACTIVE	Human-readable company name (e.g. "HSBC UK"). Used to populate the STD_COMPANY standard field.
`config`	`Config`	ACTIVE	Extraction config used during the company-identification pass. Extracts a discriminating field (e.g. a bank-specific header string) to confirm the PDF belongs to this company before attempting account matching.
`accounts`	`dict`	STUB	Declared but never accessed by the pipeline after load. Intended as a lookup from account key to Account object but currently unused.

`Account`¶

Full runtime configuration for one bank account.

Field	Type	Status	Description
`account`	`str`	ACTIVE	Human-readable account name (e.g. "Current Account"). Written to the STD_ACCOUNT standard field in the output.
`company_key`	`str`	ACTIVE	Key into companies.toml identifying the issuing bank. Used to build ID_ACCOUNT and to look up the Company object at load time.
`company`	`Company`	ACTIVE	Resolved at load time from company_key. Provides the company name and company-level identification config.
`account_type_key`	`str`	ACTIVE	Key into account_types.toml (e.g. "CRD", "CUR", "SAV"). Used to look up the AccountType object at load time.
`account_type`	`AccountType`	STUB	Resolved at load time from account_type_key. The AccountType object is populated but never subsequently read by any pipeline consumer.
`statement_type_key`	`str`	ACTIVE	Key into statement_types.toml identifying the extraction layout for this account's statements. Used to look up the StatementType object at load time.
`statement_type`	`StatementType`	ACTIVE	Resolved at load time from statement_type_key. Provides the header and lines ConfigGroups used during extraction.
`exclude_last_n_pages`	`int`	ACTIVE	Number of trailing pages to skip when cloning per-page locations. Set to 1 (or more) when the final page(s) contain terms & conditions or other non-transaction content that would otherwise be passed to the extraction pipeline.
`currency`	`str`	ACTIVE	ISO 4217 currency code for all monetary fields on this account (e.g. "GBP", "USD", "PHP"). Must be a key in `currency_spec` in `currency.py`; validated at config load time. Used by the extraction pipeline to resolve the CurrencySpec for fields of type "currency".
`config`	`Config`	ACTIVE	Account-level identification config. A lightweight extraction step run to confirm a PDF belongs to this account before the full extraction pass. Defined inline under `[ACCOUNT_KEY.config]` in accounts.toml.

`AccountType`¶

Simple lookup label for an account type category.

Field	Type	Status	Description
`account_type`	`str`	ACTIVE	Account type label (e.g. "CRD" for credit card, "CUR" for current account). Populated at load time but not subsequently consumed by the pipeline; present for potential reporting or routing use.

`StatementType`¶

Full extraction specification for one statement layout variant.

Field	Type	Status	Description
`statement_type`	`str`	ACTIVE	Human-readable label matching the value used in StdRefs.statement_type (e.g. "HSBC UK Current Account"). Used to select the correct StdRefs mapping when promoting raw fields to standard columns.
`header`	`ConfigGroup`	ACTIVE	Config steps that extract statement-level metadata: dates, account numbers, opening/closing balances, etc.
`lines`	`ConfigGroup`	ACTIVE	Config steps that extract per-transaction data from the body of each page.

`ConfigGroup`¶

An ordered list of Config extraction steps for one pipeline section.

Field	Type	Status	Description
`configs`	`list[Config]`	ACTIVE	Ordered list of Config steps. Executed in sequence during extraction; results are stacked into the section's results DataFrame.

`Config`¶

A single extraction step: one table (or one standalone field) from one location.

Field	Type	Status	Description
`config`	`str`	ACTIVE	Human-readable label for this extraction step (e.g. "Statement Balances"). Written into the "config" column of the long-format results DataFrame for traceability.
`statement_table_key`	`str`	ACTIVE	Key into statement_tables.toml that identifies the StatementTable to use. Resolved to statement_table at load time. Set to None for inline single-field configs.
`statement_table`	`StatementTable`	ACTIVE	Resolved at load time from statement_table_key. The StatementTable object used during extraction. Not set directly in TOML.
`locations`	`list[Location]`	ACTIVE	Used only for inline single-field configs (where statement_table is None). Defines where on the page to find the field value.
`field`	`Field`	ACTIVE	Used only for inline single-field configs. Defines the extraction spec for the single value to read from the location.

`StatementTable`¶

Full configuration for extracting one table from a PDF statement.

Field	Type	Status	Description
`type`	`str`	STUB	Table type label: "transaction", "summary", or "detail". Loaded from TOML but not currently read by the pipeline; the extraction path is determined by whether transaction_spec is present rather than this field.
`statement_table`	`str`	STUB	Human-readable table label (e.g. "Transactions", "Account Summary"). Loaded from TOML for documentation purposes but not consumed by the pipeline.
`header_text`	`str`	ACTIVE	When set, the first table row whose text matches this string is stripped before extraction. Use when pdfplumber includes the column header row in the extracted data.
`remove_header`	`bool`	ACTIVE	When True the first table row is unconditionally stripped. Use when the header row is always present but its text varies (making header_text impractical).
`locations`	`list[Location]`	ACTIVE	One or more Location entries describing where on the page to find this table. Locations without a page_number are cloned for every page.
`fields`	`list[Field]`	ACTIVE	Ordered list of field extraction specs. For transaction tables each field must have a column; for summary/detail tables each field must have a cell.
`table_columns`	`int`	ACTIVE	Expected minimum number of columns in the extracted table. Passed to pdfplumber as min_words_horizontal and used to validate column count after extraction. Also triggers allow_text_failover retry logic.
`table_rows`	`int`	ACTIVE	Expected minimum number of rows in the extracted table. Passed to pdfplumber as min_words_vertical.
`row_spacing`	`int`	ACTIVE	pdfplumber snap_y_tolerance in PDF points. Rows whose top edges fall within this distance of each other are merged into the same table row. Increase if the statement uses tight line spacing that splits a single visual row across multiple pdfplumber rows.
`tests`	`list[Test]`	STUB	Declarative post-extraction assertions. Declared and accepted in TOML but no pipeline code evaluates them. Reserved for a future config validation pass.
`delete_success_false`	`bool`	STUB	Intended to drop rows where any field extraction returned success = False. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag.
`delete_cast_success_false`	`bool`	STUB	Intended to drop rows where numeric casting failed. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag.
`delete_rows_with_missing_vital_fields`	`bool`	STUB	Intended to drop rows where any vital field is missing after extraction. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. Note: vital-field hard-failure logic exists in validate() but is separate from this flag.
`transaction_spec`	`TransactionSpec`	ACTIVE	When set, the table is processed as a transaction table using the bookend-based multi-row extraction path. Must be None for summary/detail tables.

`Location`¶

Describes a rectangular region on a PDF page from which a table or text is extracted.

Field	Type	Status	Description
`page_number`	`int`	ACTIVE	1-based page number. When set the location is used only on that page. When None the location is cloned for every page (spawn_locations()).
`top_left`	`list[int]`	ACTIVE	[x, y] coordinates of the top-left corner of the crop rectangle. Must be set together with bottom_right. When both are None the full page is used.
`bottom_right`	`list[int]`	ACTIVE	[x, y] coordinates of the bottom-right corner of the crop rectangle. Must be set together with top_left.
`vertical_lines`	`list[int]`	ACTIVE	Explicit x-coordinates of vertical column dividers supplied to pdfplumber as explicit_vertical_lines. Pairs of identical values create a zero-width gap that forces a column boundary (e.g. [100, 100, 200, 200]). When set, pdfplumber's automatic column detection is disabled for this region.
`dynamic_last_vertical_line`	`DynamicLineSpec`	ACTIVE	When set, the final value in vertical_lines is replaced at runtime with an x-coordinate derived from a PDF image's bounding box. See DynamicLineSpec. Used where the rightmost column boundary floats with a logo.
`allow_text_failover`	`bool`	ACTIVE	When True and the extracted table has the wrong number of columns, the extraction is retried without vertical_lines, falling back to pdfplumber's text-based column detection. Useful as a safety net for pages where the explicit dividers produce a malformed table.
`try_shift_down`	`int`	ACTIVE	Number of PDF points to shift the crop rectangle downward (applied to both top_left[1] and bottom_right[1]) when the initial extraction returns an empty region. Handles statements where the table top boundary varies slightly between pages.
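
How page-less locations fan out across a statement, and how `exclude_last_n_pages` trims the tail, can be pictured with a short sketch. This is not the library's `spawn_locations()`; `clone_locations` is a hypothetical re-creation, and leaving locations with a fixed `page_number` untouched is an assumption.

```python
# Hypothetical re-creation for illustration; not the library's spawn_locations().

def clone_locations(locations: list, page_count: int, exclude_last_n_pages: int = 0) -> list:
    """Expand locations without a page_number into one copy per usable page."""
    last_page = page_count - exclude_last_n_pages
    expanded = []
    for location in locations:
        if location.get("page_number") is not None:
            # Assumption: page-specific locations are kept exactly as written.
            expanded.append(dict(location))
            continue
        for page in range(1, last_page + 1):
            expanded.append({**location, "page_number": page})
    return expanded


locations = [
    {"page_number": 1, "top_left": [37, 325], "bottom_right": [290, 385]},
    {"top_left": [30, 100], "bottom_right": [570, 800]},  # no page_number: cloned
]

# A 4-page statement whose last page is terms & conditions.
expanded = clone_locations(locations, page_count=4, exclude_last_n_pages=1)
pages = [location["page_number"] for location in expanded]
print(pages)
```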

`DynamicLineSpec`¶

Locates the position of the last vertical column divider from an embedded PDF image.

Field	Type	Status	Description
`image_id`	`int`	ACTIVE	Zero-based index into the list of images on the page, identifying which image provides the boundary coordinate.
`image_location_tag`	`str`	ACTIVE	Bounding-box attribute of the image to use as the x-coordinate (e.g. "x0" for left edge, "x1" for right edge).
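
As an illustration of how a DynamicLineSpec could resolve, the sketch below swaps the last entry of `vertical_lines` for a coordinate read from an image's bounding box. It assumes images are available as a list of dicts carrying `x0` and `x1` keys; `resolve_vertical_lines` is a hypothetical name and the coordinates are invented.

```python
# Illustrative sketch; the function name and sample coordinates are assumptions.

def resolve_vertical_lines(vertical_lines: list, images: list, spec: dict) -> list:
    """Replace the final vertical line with a coordinate taken from an image bbox."""
    image = images[spec["image_id"]]          # zero-based index into the page images
    x = image[spec["image_location_tag"]]     # e.g. "x0" (left edge) or "x1" (right edge)
    return vertical_lines[:-1] + [x]


page_images = [{"x0": 512.3, "x1": 570.0}]    # e.g. a logo at the right-hand margin
spec = {"image_id": 0, "image_location_tag": "x0"}

resolved = resolve_vertical_lines([40, 120, 400, 560], page_images, spec)
print(resolved)
```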

`Field`¶

Extraction specification for a single column or cell within a PDF table.

Field	Type	Status	Description
`field`	`str`	ACTIVE	Output column name for this field (e.g. "date", "£_paid_out"). Used as the field identifier throughout the pipeline and in the output Parquet files.
`cell`	`Cell`	ACTIVE	Row/column address for summary or detail table extraction. Mutually exclusive with `column`; set to None for transaction tables.
`column`	`int | None`	ACTIVE	Zero-based column index for transaction table extraction. Mutually exclusive with `cell`; set to None for summary/detail tables.
`vital`	`bool`	ACTIVE	When True, extraction failure for this field causes the row to be flagged as a hard failure and excluded from output. When False, failure is recorded but the row is retained.
`type`	`str`	ACTIVE	Data type: "string", "numeric", or "currency". "string": raw text extraction; pattern matching and trimming applied. "numeric": numeric extraction with optional explicit currency stripping via `currency_override`. "currency": identical to "numeric" but inherits the CurrencySpec from the account's `Account.currency` rather than requiring an explicit `currency_override` on every field. Use "currency" for all monetary amount fields; reserve "numeric" for non-monetary numerics (e.g. APR, sort code).
`strip_characters_start`	`str`	ACTIVE	Characters to strip from the start of the raw string before pattern matching (passed to Polars str.strip_chars_start()). Useful for leading currency symbols not covered by the account currency spec.
`strip_characters_end`	`str`	ACTIVE	Characters to strip from the end of the raw string before pattern matching (passed to Polars str.strip_chars_end()).
`currency_override`	`str | None`	ACTIVE	Explicit ISO 4217 currency key (e.g. "GBP") used when `type == "numeric"` and currency stripping is needed but should differ from the account-level `Account.currency`. Ignored when `type == "currency"` (which always uses the account-level currency). Omit for non-monetary numeric fields (e.g. APR, sort code) where no currency stripping is required.
`numeric_modifier`	`NumericModifier`	ACTIVE	Sign/multiplier transformation applied after numeric casting. See NumericModifier. Omit for straightforward positive numeric values.
`string_pattern`	`str`	ACTIVE	Regex pattern the extracted string must match. Extraction is marked as failed (success = False) if the value does not match. Used to validate field contents (e.g. date format) and to skip blank or irrelevant rows.
`string_max_length`	`int`	ACTIVE	Maximum character length for string values; longer strings are truncated via str.head(). Useful for capping free-text description fields. Defaults to 999 if not set.
`date_format`	`str`	STUB	Intended strptime format for date parsing at the Field level. Declared but never read by the pipeline; date format parsing is handled via StdRefs.format in get_standard_fields() instead.
`value_offset`	`FieldOffset`	ACTIVE	When set, reads the field's value from an adjacent column (Field.column + FieldOffset.cols_offset) using the type and currency rules defined in the FieldOffset rather than those on this Field. The primary field column is still extracted normally; the offset column value replaces it in the output. See FieldOffset.

`Cell`¶

Zero-based row and column address of a cell within a PDF table.

Field	Type	Status	Description
`row`	`int`	ACTIVE	Zero-based row index within the extracted table.
`col`	`int`	ACTIVE	Zero-based column index within the extracted table.

`FieldOffset`¶

Reads a field's value from an adjacent column rather than the field's own column.

Field	Type	Status	Description
`rows_offset`	`int`	STUB	Intended row offset for reading the value from a different row. Declared and accepted in TOML but never read by the pipeline; only cols_offset is currently consumed. Always set to 0 in TOML examples.
`cols_offset`	`int`	ACTIVE	Column offset applied to Field.column to locate the source cell (e.g. 1 reads from the column immediately to the right).
`vital`	`bool`	ACTIVE	Passed to the extraction pipeline for the offset field; when True extraction failure is treated as a hard failure for that row.
`type`	`str`	ACTIVE	Data type for the offset value: "string", "numeric", or "currency". Overrides the parent Field.type for this value read.
`currency_override`	`str | None`	ACTIVE	Explicit currency key (e.g. "GBP") for numeric stripping of the offset value when type == "numeric". Overrides the account-level currency. When type == "currency" the account-level currency is used and this is ignored.
`numeric_modifier`	`NumericModifier`	ACTIVE	Sign/multiplier modifier for the offset value. Overrides the parent Field.numeric_modifier.

`NumericModifier`¶

Optional sign/multiplier transformation applied after numeric casting.

Field	Type	Status	Description
`prefix`	`str`	ACTIVE	If the raw value starts with this string the prefix is stripped and the multiplier applied. Use for formats like "(123.45)" where "(" signals a negative value.
`suffix`	`str`	ACTIVE	If the raw value ends with this string the suffix is stripped and the multiplier applied. Use for formats like "123.45 CR" or "123.45D".
`multiplier`	`float`	ACTIVE	Scalar applied to the cast value when the prefix/suffix matches, or unconditionally if neither prefix nor suffix is set. Typically -1 to invert sign.
`exclude_negative_values`	`bool`	ACTIVE	When True, any negative result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column.
`exclude_positive_values`	`bool`	ACTIVE	When True, any positive result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column.
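
The prefix/suffix rules above can be sketched as follows. `apply_numeric_modifier` is a hypothetical function written from the field descriptions, not the pipeline's own code; in particular it strips the marker before casting, which is an assumption about ordering.

```python
# Hypothetical helper written from the table above; not the pipeline's implementation.

def apply_numeric_modifier(raw: str, modifier: dict) -> float:
    text = raw.strip()
    prefix = modifier.get("prefix")
    suffix = modifier.get("suffix")
    multiplier = modifier.get("multiplier", 1.0)

    matched = False
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
        matched = True
    if suffix and text.endswith(suffix):
        text = text[: -len(suffix)]
        matched = True

    value = float(text)  # float() tolerates surrounding whitespace
    # Multiplier applies on a match, or unconditionally when no prefix/suffix is set.
    if matched or (not prefix and not suffix):
        value *= multiplier

    if modifier.get("exclude_negative_values") and value < 0:
        value = 0.0
    if modifier.get("exclude_positive_values") and value > 0:
        value = 0.0
    return value


credit_marker = {"suffix": "CR", "multiplier": -1.0}
with_marker = apply_numeric_modifier("123.45 CR", credit_marker)
without_marker = apply_numeric_modifier("123.45", credit_marker)
always_flipped = apply_numeric_modifier("10.00", {"multiplier": -1.0})
print(with_marker, without_marker, always_flipped)
```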

`CurrencySpec`¶

Currency formatting rules used to strip symbols and separators before numeric casting.

Field	Type	Status	Description
`name`	`str`	ACTIVE	Human-readable currency name (e.g. "British Pound Sterling").
`symbols`	`list[str]`	ACTIVE	List of currency symbol strings to strip from the raw value before casting (e.g. ["£", "$"]). Replaced with empty string via str.replace_many().
`seperator_decimal`	`str`	STUB	Intended decimal separator character (e.g. "."). Declared but never read by the pipeline; decimal handling is implicit after symbols and thousands separators are stripped.
`seperators_thousands`	`list[str]`	ACTIVE	List of thousands-separator strings to strip (e.g. [","]). Replaced with empty string via str.replace_many() before casting.
`round_decimals`	`int`	STUB	Intended rounding precision after casting. Declared but never read by the pipeline; no rounding is currently applied.
`pattern`	`str`	ACTIVE	Regex pattern used to extract the numeric substring from the raw cell text before symbol/separator stripping. Passed to patmatch() via build_pattern().
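
A rough picture of what the ACTIVE fields do to a raw cell value: symbols and thousands separators are removed, then the remainder is cast. The sketch uses plain `str.replace` in place of Polars and skips the `pattern` extraction step; `strip_currency` and the spec dict are illustrative assumptions.

```python
# Illustrative sketch; plain Python stands in for the Polars string operations.

def strip_currency(raw: str, spec: dict) -> float:
    """Remove currency symbols and thousands separators, then cast to float."""
    text = raw
    for token in spec["symbols"] + spec["seperators_thousands"]:
        text = text.replace(token, "")
    return float(text)


# Key spelling ("seperators_thousands") follows the dataclass field name.
gbp = {
    "name": "British Pound Sterling",
    "symbols": ["£"],
    "seperators_thousands": [","],
}

amount = strip_currency("£1,234.56", gbp)
print(amount)
```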

`TransactionSpec`¶

Full specification for extracting transactions from a transaction-type table.

Field	Type	Status	Description
`transaction_bookends`	`list[TransactionBookend]`	ACTIVE	One or more bookend definitions that identify transaction boundaries. Evaluated in order; a row matched by an earlier bookend is not re-matched by a later one. At least one bookend is required.
`fill_forward_fields`	`list[str]`	ACTIVE	Field names whose null values should be forward-filled across rows within the same page after pivot. Use for sparse columns where a value (e.g. a date or payment type) appears only on the first row of a multi-row block and needs propagating to the end row.
`merge_fields`	`MergeFields`	ACTIVE	When set, collapses multi-row text fields within each transaction into a single joined string. See MergeFields.
`exclude_rows`	`list[FieldValidation]`	ACTIVE	Rows where any rule's field value matches its pattern are removed from the results before bookend detection runs. Use to suppress known non-transaction rows (e.g. a closing balance summary line) that would otherwise interfere with transaction counting or checks & balances. Each rule is a {field, pattern} pair; a row is excluded if any rule matches.

`TransactionBookend`¶

Defines how the start and end of a single transaction are detected within a table.

Field	Type	Status	Description
`start_fields`	`list[str]`	ACTIVE	Field names that are checked to identify the first row of a transaction. A row qualifies as a start row when at least min_non_empty_start of these fields extracted successfully (success = True).
`min_non_empty_start`	`int`	ACTIVE	Minimum number of start_fields that must have extracted successfully for a row to be flagged as transaction_start = True.
`end_fields`	`list[str]`	ACTIVE	Field names checked to identify the last row of a transaction. A row qualifies as an end row when at least min_non_empty_end of these fields extracted successfully.
`min_non_empty_end`	`int`	ACTIVE	Minimum number of end_fields that must have extracted successfully for a row to be flagged as transaction_end = True.
`extra_validation_start`	`FieldValidation`	ACTIVE	When set, any row where the named field's value does NOT match the pattern is excluded from being a start-bookend candidate for this bookend. Rows excluded here may still be captured by another bookend in the list. Useful for bookends that should only trigger on a specific row shape (e.g. an interest charge line identified by its details text).
`extra_validation_end`	`FieldValidation`	STUB	Symmetric counterpart to extra_validation_start for end rows. Declared but not yet implemented in the pipeline; no code currently reads this field. Reserved for future use.
`sticky_fields`	`list[str]`	STUB	Intended to forward-fill named fields from the start row of a transaction down to its end row, scoped within a single transaction (as opposed to fill_forward_fields which fills across transactions). Declared but not implemented; no pipeline code reads this field.
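
The start/end thresholds can be illustrated with a two-row transaction. In this sketch each row is a dict mapping field name to whether extraction succeeded; the field names and the `meets_threshold` helper are hypothetical.

```python
# Hypothetical helper illustrating min_non_empty_start / min_non_empty_end.

def meets_threshold(row_success: dict, fields: list, minimum: int) -> bool:
    """True when at least `minimum` of the named fields extracted successfully."""
    hits = sum(1 for name in fields if row_success.get(name, False))
    return hits >= minimum


bookend = {
    "start_fields": ["date", "payment_type"],
    "min_non_empty_start": 2,
    "end_fields": ["paid_out", "paid_in"],
    "min_non_empty_end": 1,
}

# One transaction spread over two table rows: the amount sits on the second row.
rows = [
    {"date": True, "payment_type": True, "details": True, "paid_out": False, "paid_in": False},
    {"date": False, "payment_type": False, "details": True, "paid_out": True, "paid_in": False},
]

flags = [
    (
        meets_threshold(row, bookend["start_fields"], bookend["min_non_empty_start"]),
        meets_threshold(row, bookend["end_fields"], bookend["min_non_empty_end"]),
    )
    for row in rows
]
print(flags)  # (transaction_start, transaction_end) per row
```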

`FieldValidation`¶

A field-name/regex-pattern pair used as a row filter or row qualification rule.

Field	Type	Status	Description
`field`	`str`	ACTIVE	Name of the extracted field (output column name) whose value is tested against the pattern.
`pattern`	`str`	ACTIVE	Regex pattern tested via Polars str.contains(). For exclude_rows a match causes exclusion; for extra_validation_start a non-match causes exclusion.
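
For `exclude_rows`, the any-rule-matches behaviour looks roughly like the sketch below. It uses Python's `re.search` as a stand-in for Polars `str.contains()` (the two regex engines differ in some syntax), and the `drop_excluded` helper and sample rows are made up for illustration.

```python
import re

# Illustrative sketch; re.search approximates a regex "contains" test.

def drop_excluded(rows: list, rules: list) -> list:
    """Keep only rows where no rule's pattern matches its field value."""
    kept = []
    for row in rows:
        excluded = any(
            re.search(rule["pattern"], str(row.get(rule["field"], ""))) is not None
            for rule in rules
        )
        if not excluded:
            kept.append(row)
    return kept


rules = [{"field": "details", "pattern": "^BALANCE CARRIED FORWARD"}]
rows = [
    {"details": "TESCO STORES", "paid_out": "12.50"},
    {"details": "BALANCE CARRIED FORWARD", "paid_out": ""},
]

kept = drop_excluded(rows, rules)
print([row["details"] for row in kept])
```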

`MergeFields`¶

Specifies how multi-row text fields are collapsed into a single output value.

Field	Type	Status	Description
`fields`	`list[str]`	ACTIVE	Names of the fields whose per-row values should be joined.
`separator`	`str`	ACTIVE	Delimiter inserted between joined values (e.g. " | ").
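As a rough illustration of the join, the sketch below collapses the rows of one transaction into a single value per named field. The `merge_rows` helper is hypothetical, and skipping empty per-row values is an assumption made here for readability, not documented behaviour.

```python
def merge_rows(
    rows: list[dict[str, str]], fields: list[str], separator: str
) -> dict[str, str]:
    """Join the per-row values of each named field into one string."""
    merged = {}
    for name in fields:
        # Assumption: rows where the field is missing or empty contribute nothing.
        parts = [row[name] for row in rows if row.get(name)]
        merged[name] = separator.join(parts)
    return merged


transaction_rows = [
    {"details": "CARD PAYMENT"},
    {"details": "ACME LTD"},
    {"details": ""},
]
merged = merge_rows(transaction_rows, fields=["details"], separator=" | ")
```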

`StandardFields`¶

Declaration of a single standard output column and how to derive it.

Field	Type	Status	Description
`section`	`str`	ACTIVE	Pipeline section this field belongs to: "header" (statement-level metadata extracted once per statement) or "lines" (per-transaction data). Used to dispatch the field to the correct extraction pass.
`type`	`str`	ACTIVE	Data type of the standard column: "string", "numeric", or "date". Controls casting, multiplier application, and date parsing in get_standard_fields().
`vital`	`bool`	ACTIVE	When True a ConfigError is raised if no matching StdRefs entry is found for the current statement type, halting processing. Set False for optional fields that not all statement types provide.
`std_refs`	`list[StdRefs]`	ACTIVE	One entry per supported statement type. The correct entry is selected at runtime by matching StdRefs.statement_type.
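The interaction between `std_refs` selection and `vital` can be sketched as follows. The `ConfigError` class defined here is only a local stand-in so the example runs, `select_std_ref` is a hypothetical helper, and returning `None` for a non-vital field with no matching entry is an assumption about how optional fields are skipped.

```python
from typing import Optional


class ConfigError(Exception):
    """Local stand-in for the library's ConfigError, defined so the sketch runs."""


def select_std_ref(
    std_refs: list[dict], statement_type: str, vital: bool
) -> Optional[dict]:
    """Pick the std_refs entry whose statement_type matches the current PDF."""
    for ref in std_refs:
        if ref["statement_type"] == statement_type:
            return ref
    if vital:
        # A vital field with no mapping halts processing.
        raise ConfigError(f"no std_refs entry for statement type {statement_type!r}")
    # Assumption: a non-vital field without a mapping is simply left out.
    return None


std_refs = [
    {"statement_type": "HSBC UK Current Account", "field": "payment_date"},
]

matched = select_std_ref(std_refs, "HSBC UK Current Account", vital=True)
optional_miss = select_std_ref(std_refs, "Some Other Statement", vital=False)

try:
    select_std_ref(std_refs, "Some Other Statement", vital=True)
    vital_miss_raised = False
except ConfigError:
    vital_miss_raised = True
```

The raw field name `payment_date` and the second statement type string are placeholders chosen for the example.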

`StdRefs`¶

Mapping rule that promotes a raw extracted field to a standard output column.

Field	Type	Status	Description
`statement_type`	`str`	ACTIVE	Key used to select this rule; matched against the statement type string of the PDF being processed (e.g. "HSBC UK Current Account").
`field`	`str`	ACTIVE	Name of the raw extracted column to promote. Set to None (or omit) when a literal default value should be used instead of a column value.
`format`	`str`	ACTIVE	strptime format string applied when StandardFields.type == "date" (e.g. "%-d %B %Y"). Ignored for numeric and string types.
`default`	`str`	ACTIVE	Literal string value used as the output when `field` is None/absent. Useful for injecting constant metadata (e.g. transaction_type = "CC").
`multiplier`	`float`	ACTIVE	Scalar applied to the value after casting when StandardFields.type == "numeric". Use -1 to invert sign (e.g. to convert a credit amount stored as positive into a negative figure).
`exclude_positive_values`	`bool`	ACTIVE	When True, any positive numeric value is replaced with 0 after casting. Used to isolate debit-side figures from a combined amount column.
`exclude_negative_values`	`bool`	ACTIVE	When True, any negative numeric value is replaced with 0 after casting. Used to isolate credit-side figures from a combined amount column.
`terminator`	`str`	ACTIVE	Regex pattern; when present the string value is truncated at the first match position before being written to the standard column. Useful for stripping trailing boilerplate appended by merge_fields (e.g. " \| BALANCE CARRIED FORWARD").
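The numeric and string options above can be approximated in plain Python. This is a sketch of the described behaviour, not the code in `get_standard_fields()`: both helper names are hypothetical, and applying the multiplier before the positive/negative exclusion is an assumption, since the table only states that each step happens after casting.

```python
import re


def apply_numeric_options(
    value: float,
    multiplier: float = 1.0,
    exclude_positive_values: bool = False,
    exclude_negative_values: bool = False,
) -> float:
    """Scale a cast numeric value, then zero out the unwanted sign."""
    result = value * multiplier  # assumption: multiplier is applied first
    if exclude_positive_values and result > 0:
        return 0.0
    if exclude_negative_values and result < 0:
        return 0.0
    return result


def apply_terminator(value: str, terminator: str) -> str:
    """Truncate the string at the first position where the regex matches."""
    match = re.search(terminator, value)
    return value[: match.start()] if match else value


# Sign inversion with multiplier = -1.
inverted = apply_numeric_options(25.0, multiplier=-1)

# Splitting a combined amount column into two single-sided columns.
kept_positive = apply_numeric_options(40.0, exclude_negative_values=True)
dropped_negative = apply_numeric_options(-40.0, exclude_negative_values=True)

# The pipe is escaped because the terminator is a regex, where a bare | means "or".
trimmed = apply_terminator(
    "ACME LTD | BALANCE CARRIED FORWARD", r" \| BALANCE CARRIED FORWARD"
)
```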

`Test`¶

Declarative test assertion attached to a StatementTable.

Field	Type	Status	Description
`test_desc`	`str`	STUB	Human-readable description of the test assertion.
`assertion`	`str`	STUB	The assertion expression to evaluate (format TBD).
