Adding a New Bank¶
This guide walks through the process of configuring bank_statement_parser to parse PDF statements from a new bank. The configuration is entirely TOML-based and does not require writing any Python code.
Overview¶
Adding support for a new bank involves creating and editing several TOML files that describe how to identify the bank's PDFs, locate tables on each page, extract field values, and map them to standard output columns.
The configuration lives in one bank-specific folder and two shared files:
| Location | Purpose |
|---|---|
| project/config/import/<BANK_COUNTRY>/ | Bank-specific config folder (4 TOML files) |
| project/config/import/account_types.toml | Shared account type registry |
| project/config/import/standard_fields.toml | Shared standard field mappings |
Bank config folder structure¶
Each bank has its own subfolder named in SCREAMING_SNAKE_CASE (e.g. HSBC_UK,
TSB_UK). A complete folder contains exactly four files:
| File | Purpose | Key Dataclass |
|---|---|---|
| companies.toml | Bank identification (name + PDF detection rule) | Company |
| accounts.toml | Account definitions (one per product/card type) | Account |
| statement_types.toml | Statement layout definitions (header + lines extraction) | StatementType |
| statement_tables.toml | Physical table extraction rules (locations, fields, bookends) | StatementTable |
Processing pipeline¶
Understanding the processing order helps when writing config:
1. Company identification: the Company.config extraction is run against page 1 to determine which bank issued the PDF.
2. Account identification: each Account.config is tried until one matches, identifying the specific account product.
3. Header extraction: the StatementType.header configs run to extract statement-level metadata (dates, balances, account details).
4. Lines extraction: the StatementType.lines configs run per-page to extract transaction rows.
5. Standard field mapping: raw extracted fields are mapped to STD_* output columns via standard_fields.toml.
6. Checks & balances: opening balance + payments in - payments out = closing balance is validated.
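The final checks & balances step can be pictured with a short sketch. This is only an illustration of the arithmetic, not the package's own code: the function name `balances_reconcile` is hypothetical, and the figures are made up for the example.

```python
from decimal import Decimal


def balances_reconcile(opening, payments_in, payments_out, closing):
    """Return True when opening + payments in - payments out equals closing."""
    return opening + payments_in - payments_out == closing


# Illustrative figures only, not taken from a real statement.
reconciled = balances_reconcile(
    Decimal("100.00"), Decimal("250.00"), Decimal("75.50"), Decimal("274.50")
)
print(reconciled)
```

Decimal is used here so that pence-level amounts compare exactly; whether the real pipeline compares with a tolerance is not stated in this guide.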
Step 1: Register the Account Type¶
If your bank uses an account type not already in account_types.toml, add a new
entry. Most banks will use the existing types (CRD, CUR, SAV, ISA).
File: project/config/import/account_types.toml
[CRD]
account_type = "Credit Card"
[CUR]
account_type = "Current Account"
[SAV]
account_type = "Savings Account"
[ISA]
account_type = "ISA"
AccountType¶
Simple lookup label for an account type category.
| Field | Type | Status | Description |
|---|---|---|---|
| account_type | str | ACTIVE | Account type label (e.g. "Credit Card" under the CRD key, "Current Account" under the CUR key). Populated at load time but not subsequently consumed by the pipeline; present for potential reporting or routing use. |
Step 2: Create the Bank Config Folder¶
Create a new subfolder under project/config/import/ using the naming convention
<BANK>_<COUNTRY> in SCREAMING_SNAKE_CASE:
project/config/import/
HSBC_UK/ # existing
TSB_UK/ # existing
NEWBANK_UK/ # <- your new folder
companies.toml
accounts.toml
statement_types.toml
statement_tables.toml
Step 3: Define the Company¶
Create companies.toml in your new folder. This file identifies the bank by
extracting a distinguishing piece of text from page 1 of the PDF (typically a
website URL or bank name).
Example (HSBC_UK/companies.toml):
[HSBC_UK]
company = 'HSBC Bank UK'
[HSBC_UK.config]
config = 'Company Info'
locations = [
{page_number = 1, top_left = [475,110], bottom_right = [575, 130]},
{page_number = 1, top_left = [460,145], bottom_right = [575, 165]},
{page_number = 1, top_left = [460,165], bottom_right = [575, 185]},
]
field = {field = 'website', vital=true, type="string", string_pattern ='^www\.hsbc\.co\.uk$'}
How it works: The config block defines a small extraction region on page 1.
The field spec extracts text from that region and checks it against
string_pattern. If the pattern matches, this company is selected. Multiple
locations can be provided — the pipeline tries each until one succeeds.
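A rough sketch of that matching logic, assuming the text of each location has already been extracted into a list of strings. The helper `identify_company` and the sample texts are hypothetical, and Python's `re` module stands in for whatever regex engine the pipeline actually uses.

```python
import re


def identify_company(region_texts, string_pattern):
    """Return the first region text that matches string_pattern, else None."""
    pattern = re.compile(string_pattern)
    for text in region_texts:
        candidate = (text or "").strip()
        if pattern.search(candidate):
            return candidate
    return None


# Pattern copied from the HSBC_UK example above.
HSBC_PATTERN = r"^www\.hsbc\.co\.uk$"

# Hypothetical text pulled from the three configured locations, in order.
texts = ["", "Your Statement", "www.hsbc.co.uk"]
match = identify_company(texts, HSBC_PATTERN)
print(match)
```

Because the pattern is anchored with `^` and `$`, a region that contains the URL plus surrounding text would not match, which is why the crop rectangles in the example are kept small.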
Key dataclasses¶
Company¶
Configuration for a financial institution (bank/provider).
| Field | Type | Status | Description |
|---|---|---|---|
| company | str | ACTIVE | Human-readable company name (e.g. "HSBC UK"). Used to populate the STD_COMPANY standard field. |
| config | Config | ACTIVE | Extraction config used during the company-identification pass. Extracts a discriminating field (e.g. a bank-specific header string) to confirm the PDF belongs to this company before attempting account matching. |
| accounts | dict | STUB | Declared but never accessed by the pipeline after load. Intended as a lookup from account key to Account object but currently unused. |
Config¶
A single extraction step: one table (or one standalone field) from one location.
| Field | Type | Status | Description |
|---|---|---|---|
| config | str | ACTIVE | Human-readable label for this extraction step (e.g. "Statement Balances"). Written into the "config" column of the long-format results DataFrame for traceability. |
| statement_table_key | str | ACTIVE | Key into statement_tables.toml that identifies the StatementTable to use. Resolved to statement_table at load time. Set to None for inline single-field configs. |
| statement_table | StatementTable | ACTIVE | Resolved at load time from statement_table_key. The StatementTable object used during extraction. Not set directly in TOML. |
| locations | list[Location] | ACTIVE | Used only for inline single-field configs (where statement_table is None). Defines where on the page to find the field value. |
| field | Field | ACTIVE | Used only for inline single-field configs. Defines the extraction spec for the single value to read from the location. |
Location¶
Describes a rectangular region on a PDF page from which a table or text is extracted.
| Field | Type | Status | Description |
|---|---|---|---|
| page_number | int | ACTIVE | 1-based page number. When set the location is used only on that page. When None the location is cloned for every page (spawn_locations()). |
| top_left | list[int] | ACTIVE | [x, y] coordinates of the top-left corner of the crop rectangle. Must be set together with bottom_right. When both are None the full page is used. |
| bottom_right | list[int] | ACTIVE | [x, y] coordinates of the bottom-right corner of the crop rectangle. Must be set together with top_left. |
| vertical_lines | list[int] | ACTIVE | Explicit x-coordinates of vertical column dividers supplied to pdfplumber as explicit_vertical_lines. Pairs of identical values create a zero-width gap that forces a column boundary (e.g. [100, 100, 200, 200]). When set, pdfplumber's automatic column detection is disabled for this region. |
| dynamic_last_vertical_line | DynamicLineSpec | ACTIVE | When set, the final value in vertical_lines is replaced at runtime with an x-coordinate derived from a PDF image's bounding box. See DynamicLineSpec. Used where the rightmost column boundary floats with a logo. |
| allow_text_failover | bool | ACTIVE | When True and the extracted table has the wrong number of columns, the extraction is retried without vertical_lines, falling back to pdfplumber's text-based column detection. Useful as a safety net for pages where the explicit dividers produce a malformed table. |
| try_shift_down | int | ACTIVE | Number of PDF points to shift the crop rectangle downward (applied to both top_left[1] and bottom_right[1]) when the initial extraction returns an empty region. Handles statements where the table top boundary varies slightly between pages. |
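The try_shift_down retry can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and the stand-in extractor `fake_extract` are hypothetical, and it assumes y-coordinates grow downward, so adding points to both y values moves the rectangle down the page.

```python
def shift_rectangle_down(top_left, bottom_right, points):
    """Return the crop rectangle moved down by the given number of PDF points."""
    return (
        [top_left[0], top_left[1] + points],
        [bottom_right[0], bottom_right[1] + points],
    )


def extract_with_shift(extract, top_left, bottom_right, try_shift_down=None):
    """Extract once; if nothing comes back and a shift is configured, retry lower."""
    rows = extract(top_left, bottom_right)
    if not rows and try_shift_down:
        shifted = shift_rectangle_down(top_left, bottom_right, try_shift_down)
        rows = extract(*shifted)
    return rows


def fake_extract(top_left, bottom_right):
    # Stand-in extractor: pretends the table is only found once the crop
    # rectangle starts at y >= 190.
    return [["Opening balance", "100.00"]] if top_left[1] >= 190 else []


rows = extract_with_shift(fake_extract, [345, 180], [575, 300], try_shift_down=10)
print(rows)
```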
Field¶
Extraction specification for a single column or cell within a PDF table.
| Field | Type | Status | Description |
|---|---|---|---|
| field | str | ACTIVE | Output column name for this field (e.g. "date", "£_paid_out"). Used as the field identifier throughout the pipeline and in the output Parquet files. |
| cell | Cell | ACTIVE | Row/column address for summary or detail table extraction. Mutually exclusive with column; set to None for transaction tables. |
| column | int \| None | ACTIVE | Zero-based column index for transaction table extraction. Mutually exclusive with cell; set to None for summary/detail tables. |
| vital | bool | ACTIVE | When True, extraction failure for this field causes the row to be flagged as a hard failure and excluded from output. When False, failure is recorded but the row is retained. |
| type | str | ACTIVE | Data type: "string", "numeric", or "currency". "string": raw text extraction; pattern matching and trimming applied. "numeric": numeric extraction with optional explicit currency stripping via currency_override. "currency": identical to "numeric" but inherits the CurrencySpec from the account's Account.currency rather than requiring an explicit currency_override on every field. Use this for all monetary amount fields; reserve "numeric" for non-monetary numerics (e.g. APR, sort code). |
| strip_characters_start | str | ACTIVE | Characters to strip from the start of the raw string before pattern matching (passed to Polars str.strip_chars_start()). Useful for leading currency symbols not covered by the account currency spec. |
| strip_characters_end | str | ACTIVE | Characters to strip from the end of the raw string before pattern matching (passed to Polars str.strip_chars_end()). |
| currency_override | str \| None | ACTIVE | Explicit ISO 4217 currency key (e.g. "GBP") used when type == "numeric" and currency stripping is needed but should differ from the account-level Account.currency. Ignored when type == "currency" (which always uses the account-level currency). Omit for non-monetary numeric fields (e.g. APR, sort code) where no currency stripping is required. |
| numeric_modifier | NumericModifier | ACTIVE | Sign/multiplier transformation applied after numeric casting. See NumericModifier. Omit for straightforward positive numeric values. |
| string_pattern | str | ACTIVE | Regex pattern the extracted string must match. Extraction is marked as failed (success = False) if the value does not match. Used to validate field contents (e.g. date format) and to skip blank or irrelevant rows. |
| string_max_length | int | ACTIVE | Maximum character length for string values; longer strings are truncated via str.head(). Useful for capping free-text description fields. Defaults to 999 if not set. |
| date_format | str | STUB | Intended strptime format for date parsing at the Field level. Declared but never read by the pipeline; date format parsing is handled via StdRefs.format in get_standard_fields() instead. |
| value_offset | FieldOffset | ACTIVE | When set, reads the field's value from an adjacent column (Field.column + FieldOffset.cols_offset) using the type and currency rules defined in the FieldOffset rather than those on this Field. The primary field column is still extracted normally; the offset column value replaces it in the output. See FieldOffset. |
Step 4: Define Statement Tables¶
Create statement_tables.toml to define how tables are physically extracted from
the PDF pages. This is usually the most complex configuration file, as it requires
understanding the precise layout of the bank's PDF statements.
Each table entry defines:
- Where on the page to look (bounding box coordinates, vertical column dividers)
- What fields to extract (column indices or cell addresses, data types, patterns)
- How to handle multi-row transactions (bookend detection, field merging)
Table types¶
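There are three table types. The type label itself is informational: the pipeline takes the transaction extraction path whenever transaction_spec is present, and the summary/detail path otherwise.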
| Type | Use Case | Field Addressing | Has transaction_spec? |
|---|---|---|---|
| summary | Account balances, totals | cell = {row, col} | No |
| detail | Account holder info, sort codes | cell = {row, col} | No |
| transaction | Transaction line items | column = N | Yes |
Summary table example¶
A summary table extracts fixed values from known cell positions (e.g. opening balance at row 1, column 1):
[HSBC_UK_CUR_ACCT_SUM]
type = "summary"
statement_table = 'Account Summary'
table_columns = 2
table_rows = 4
row_spacing = 7
locations = [
{page_number=1, top_left = [345, 180], bottom_right = [575, 300], vertical_lines = [360, 475, 475, 550], dynamic_last_vertical_line = {image_id = 0, image_location_tag = "x1"}, allow_text_failover = true},
]
fields = [
{field = 'opening_balance', cell = {row = 1, col = 1}, vital=true, type = 'currency', numeric_modifier = {suffix = "D", multiplier = -1}},
{field = 'payments_in', cell = {row = 2, col = 1}, vital=true, type = 'currency'},
{field = 'payments_out', cell = {row = 3, col = 1}, vital=true, type = 'currency'},
{field = 'closing_balance', cell = {row = 4, col = 1}, vital=true, type = 'currency', numeric_modifier = {suffix = "D", multiplier = -1}},
]
Transaction table example¶
A transaction table extracts rows of variable length, using bookend detection to identify where each transaction starts and ends:
[HSBC_UK_CUR_TRANSACTIONS]
type = "transaction"
statement_table = 'Transactions'
table_columns = 6
locations = [
{vertical_lines = [50, 100, 100, 130, 130, 320, 320, 400, 400, 480, 480, 555]},
]
fields = [
{field = 'date', column = 0, vital=false, type = "string", string_pattern ='^[0-3][0-9]\s?[A-Z][a-z]{2}\s?[0-3][0-9]$'},
{field = 'payment_type', column = 1, vital=false, type = "string", string_pattern ='(^[A-Z0-9]{1,3}$)|(^[)]{3}$)'},
{field = 'details', column = 2, vital=true, type = "string", string_pattern ='.+', string_max_length = 100},
{field = '£_paid_out', column = 3, vital=false, type = "currency"},
{field = '£_paid_in', column = 4, vital=false, type = "currency"},
{field = '£_balance', column = 5, vital=false, type = "currency", numeric_modifier = {suffix = "D", multiplier = -1.0000}},
]
delete_success_false = true
delete_cast_success_false = true
delete_rows_with_missing_vital_fields = true
[HSBC_UK_CUR_TRANSACTIONS.transaction_spec]
transaction_bookends = [
{start_fields = ['payment_type','details'], min_non_empty_start = 2, end_fields = ['£_paid_out','£_paid_in'], min_non_empty_end = 1}
]
fill_forward_fields = ['date','payment_type']
merge_fields = {fields=['details'], separator=' | '}
Key dataclasses¶
StatementTable¶
Full configuration for extracting one table from a PDF statement.
| Field | Type | Status | Description |
|---|---|---|---|
| type | str | STUB | Table type label: "transaction", "summary", or "detail". Loaded from TOML but not currently read by the pipeline; the extraction path is determined by whether transaction_spec is present rather than this field. |
| statement_table | str | STUB | Human-readable table label (e.g. "Transactions", "Account Summary"). Loaded from TOML for documentation purposes but not consumed by the pipeline. |
| header_text | str | ACTIVE | When set, the first table row whose text matches this string is stripped before extraction. Use when pdfplumber includes the column header row in the extracted data. |
| remove_header | bool | ACTIVE | When True the first table row is unconditionally stripped. Use when the header row is always present but its text varies (making header_text impractical). |
| locations | list[Location] | ACTIVE | One or more Location entries describing where on the page to find this table. Locations without a page_number are cloned for every page. |
| fields | list[Field] | ACTIVE | Ordered list of field extraction specs. For transaction tables each field must have a column; for summary/detail tables each field must have a cell. |
| table_columns | int | ACTIVE | Expected minimum number of columns in the extracted table. Passed to pdfplumber as min_words_horizontal and used to validate column count after extraction. Also triggers allow_text_failover retry logic. |
| table_rows | int | ACTIVE | Expected minimum number of rows in the extracted table. Passed to pdfplumber as min_words_vertical. |
| row_spacing | int | ACTIVE | pdfplumber snap_y_tolerance in PDF points. Rows whose top edges fall within this distance of each other are merged into the same table row. Increase if the statement uses tight line spacing that splits a single visual row across multiple pdfplumber rows. |
| tests | list[Test] | STUB | Declarative post-extraction assertions. Declared and accepted in TOML but no pipeline code evaluates them. Reserved for a future config validation pass. |
| delete_success_false | bool | STUB | Intended to drop rows where any field extraction returned success = False. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. |
| delete_cast_success_false | bool | STUB | Intended to drop rows where numeric casting failed. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. |
| delete_rows_with_missing_vital_fields | bool | STUB | Intended to drop rows where any vital field is missing after extraction. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. Note: vital-field hard-failure logic exists in validate() but is separate from this flag. |
| transaction_spec | TransactionSpec | ACTIVE | When set, the table is processed as a transaction table using the bookend-based multi-row extraction path. Must be None for summary/detail tables. |
DynamicLineSpec¶
Locates the position of the last vertical column divider from an embedded PDF image.
| Field | Type | Status | Description |
|---|---|---|---|
| image_id | int | ACTIVE | Zero-based index into the list of images on the page, identifying which image provides the boundary coordinate. |
| image_location_tag | str | ACTIVE | Bounding-box attribute of the image to use as the x-coordinate (e.g. "x0" for left edge, "x1" for right edge). |
Cell¶
Zero-based row and column address of a cell within a PDF table.
| Field | Type | Status | Description |
|---|---|---|---|
| row | int | ACTIVE | Zero-based row index within the extracted table. |
| col | int | ACTIVE | Zero-based column index within the extracted table. |
NumericModifier¶
Optional sign/multiplier transformation applied after numeric casting.
| Field | Type | Status | Description |
|---|---|---|---|
| prefix | str | ACTIVE | If the raw value starts with this string the prefix is stripped and the multiplier applied. Use for formats like "(123.45)" where "(" signals a negative value. |
| suffix | str | ACTIVE | If the raw value ends with this string the suffix is stripped and the multiplier applied. Use for formats like "123.45 CR" or "123.45D". |
| multiplier | float | ACTIVE | Scalar applied to the cast value when the prefix/suffix matches, or unconditionally if neither prefix nor suffix is set. Typically -1 to invert sign. |
| exclude_negative_values | bool | ACTIVE | When True, any negative result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column. |
| exclude_positive_values | bool | ACTIVE | When True, any positive result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column. |
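The rules in this table can be approximated with a small sketch. It is an interpretation of the descriptions above rather than the pipeline's code: the dataclass below is a simplified stand-in that reuses the documented field names, and `apply_modifier` is a hypothetical helper that strips the marker before casting for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumericModifier:
    # Simplified stand-in mirroring the documented fields.
    prefix: Optional[str] = None
    suffix: Optional[str] = None
    multiplier: float = 1.0
    exclude_negative_values: bool = False
    exclude_positive_values: bool = False


def apply_modifier(raw: str, mod: NumericModifier) -> float:
    """Strip a matching prefix/suffix, cast, then apply multiplier and exclusions."""
    text = raw.strip()
    # With neither prefix nor suffix set, the multiplier applies unconditionally.
    matched = mod.prefix is None and mod.suffix is None
    if mod.prefix and text.startswith(mod.prefix):
        text = text[len(mod.prefix):]
        matched = True
    if mod.suffix and text.endswith(mod.suffix):
        text = text[: -len(mod.suffix)]
        matched = True
    value = float(text.strip())
    if matched:
        value *= mod.multiplier
    if mod.exclude_negative_values and value < 0:
        value = 0.0
    if mod.exclude_positive_values and value > 0:
        value = 0.0
    return value


# Same modifier as the HSBC balance fields: a trailing "D" marks a debit.
debit_marker = NumericModifier(suffix="D", multiplier=-1)
print(apply_modifier("123.45D", debit_marker))
print(apply_modifier("123.45", debit_marker))
```

With this modifier a balance printed as "123.45D" becomes negative, while an unmarked "123.45" is left positive because the suffix did not match.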
FieldOffset¶
Reads a field's value from an adjacent column rather than the field's own column.
| Field | Type | Status | Description |
|---|---|---|---|
| rows_offset | int | STUB | Intended row offset for reading the value from a different row. Declared and accepted in TOML but never read by the pipeline; only cols_offset is currently consumed. Always set to 0 in TOML examples. |
| cols_offset | int | ACTIVE | Column offset applied to Field.column to locate the source cell (e.g. 1 reads from the column immediately to the right). |
| vital | bool | ACTIVE | Passed to the extraction pipeline for the offset field; when True extraction failure is treated as a hard failure for that row. |
| type | str | ACTIVE | Data type for the offset value: "string", "numeric", or "currency". Overrides the parent Field.type for this value read. |
| currency_override | str \| None | ACTIVE | Explicit currency key (e.g. "GBP") for numeric stripping of the offset value when type == "numeric". Overrides the account-level currency. When type == "currency" the account-level currency is used and this is ignored. |
| numeric_modifier | NumericModifier | ACTIVE | Sign/multiplier modifier for the offset value. Overrides the parent Field.numeric_modifier. |
CurrencySpec¶
Currency formatting rules used to strip symbols and separators before numeric casting.
| Field | Type | Status | Description |
|---|---|---|---|
| name | str | ACTIVE | Human-readable currency name (e.g. "British Pound Sterling"). |
| symbols | list[str] | ACTIVE | List of currency symbol strings to strip from the raw value before casting (e.g. ["£", "$"]). Replaced with empty string via str.replace_many(). |
| seperator_decimal | str | STUB | Intended decimal separator character (e.g. "."). Declared but never read by the pipeline; decimal handling is implicit after symbols and thousands separators are stripped. |
| seperators_thousands | list[str] | ACTIVE | List of thousands-separator strings to strip (e.g. [","]). Replaced with empty string via str.replace_many() before casting. |
| round_decimals | int | STUB | Intended rounding precision after casting. Declared but never read by the pipeline; no rounding is currently applied. |
| pattern | str | ACTIVE | Regex pattern used to extract the numeric substring from the raw cell text before symbol/separator stripping. Passed to patmatch() via build_pattern(). |
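The order of operations described here (pattern match, then symbol and separator stripping, then casting) can be sketched in plain Python. The helper `strip_currency` and the sterling pattern below are assumptions made for illustration; the real pipeline uses Polars expressions and its own pattern from the currency config.

```python
import re


def strip_currency(raw, symbols, seperators_thousands, pattern):
    """Extract the numeric substring, remove symbols/separators, cast to float."""
    found = re.search(pattern, raw)
    if found is None:
        return None  # nothing numeric to cast
    text = found.group(0)
    for token in list(symbols) + list(seperators_thousands):
        text = text.replace(token, "")
    return float(text)


# Hypothetical sterling settings, using the field names from the table above.
GBP_SYMBOLS = ["£"]
GBP_THOUSANDS = [","]
GBP_PATTERN = r"-?£?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"

print(strip_currency("£1,234.56", GBP_SYMBOLS, GBP_THOUSANDS, GBP_PATTERN))
```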
TransactionSpec¶
Full specification for extracting transactions from a transaction-type table.
| Field | Type | Status | Description |
|---|---|---|---|
| transaction_bookends | list[TransactionBookend] | ACTIVE | One or more bookend definitions that identify transaction boundaries. Evaluated in order; a row matched by an earlier bookend is not re-matched by a later one. At least one bookend is required. |
| fill_forward_fields | list[str] | ACTIVE | Field names whose null values should be forward-filled across rows within the same page after pivot. Use for sparse columns where a value (e.g. a date or payment type) appears only on the first row of a multi-row block and needs propagating to the end row. |
| merge_fields | MergeFields | ACTIVE | When set, collapses multi-row text fields within each transaction into a single joined string. See MergeFields. |
| exclude_rows | list[FieldValidation] | ACTIVE | Rows where any rule's field value matches its pattern are removed from the results before bookend detection runs. Use to suppress known non-transaction rows (e.g. a closing balance summary line) that would otherwise interfere with transaction counting or checks & balances. Each rule is a {field, pattern} pair; a row is excluded if any rule matches. |
TransactionBookend¶
Defines how the start and end of a single transaction are detected within a table.
| Field | Type | Status | Description |
|---|---|---|---|
| start_fields | list[str] | ACTIVE | Field names that are checked to identify the first row of a transaction. A row qualifies as a start row when at least min_non_empty_start of these fields extracted successfully (success = True). |
| min_non_empty_start | int | ACTIVE | Minimum number of start_fields that must have extracted successfully for a row to be flagged as transaction_start = True. |
| end_fields | list[str] | ACTIVE | Field names checked to identify the last row of a transaction. A row qualifies as an end row when at least min_non_empty_end of these fields extracted successfully. |
| min_non_empty_end | int | ACTIVE | Minimum number of end_fields that must have extracted successfully for a row to be flagged as transaction_end = True. |
| extra_validation_start | FieldValidation | ACTIVE | When set, any row where the named field's value does NOT match the pattern is excluded from being a start-bookend candidate for this bookend. Rows excluded here may still be captured by another bookend in the list. Useful for bookends that should only trigger on a specific row shape (e.g. an interest charge line identified by its details text). |
| extra_validation_end | FieldValidation | STUB | Symmetric counterpart to extra_validation_start for end rows. Declared but not yet implemented in the pipeline; no code currently reads this field. Reserved for future use. |
| sticky_fields | list[str] | STUB | Intended to forward-fill named fields from the start row of a transaction down to its end row, scoped within a single transaction (as opposed to fill_forward_fields which fills across transactions). Declared but not implemented; no pipeline code reads this field. |
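The start/end counting rule can be illustrated with a plain-Python sketch. It assumes each table row is a dict whose values are the extracted strings, with None standing for a failed extraction; `flag_bookends` and the sample rows are hypothetical and only mirror the documented thresholds.

```python
def flag_bookends(rows, start_fields, min_non_empty_start, end_fields, min_non_empty_end):
    """Add transaction_start / transaction_end flags to each row."""
    flagged = []
    for row in rows:
        n_start = sum(1 for name in start_fields if row.get(name))
        n_end = sum(1 for name in end_fields if row.get(name))
        flagged.append(
            {
                **row,
                "transaction_start": n_start >= min_non_empty_start,
                "transaction_end": n_end >= min_non_empty_end,
            }
        )
    return flagged


# A made-up transaction spread over two table rows.
rows = [
    {"payment_type": "DD", "details": "ACME ENERGY", "£_paid_out": None, "£_paid_in": None},
    {"payment_type": None, "details": "REF 12345", "£_paid_out": "45.00", "£_paid_in": None},
]

# Thresholds copied from the HSBC_UK_CUR_TRANSACTIONS example.
flagged = flag_bookends(
    rows,
    start_fields=["payment_type", "details"],
    min_non_empty_start=2,
    end_fields=["£_paid_out", "£_paid_in"],
    min_non_empty_end=1,
)
for row in flagged:
    print(row["details"], row["transaction_start"], row["transaction_end"])
```

The first row opens the transaction because both start fields are present; the second closes it because one amount column is populated.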
FieldValidation¶
A field-name/regex-pattern pair used as a row filter or row qualification rule.
| Field | Type | Status | Description |
|---|---|---|---|
| field | str | ACTIVE | Name of the extracted field (output column name) whose value is tested against the pattern. |
| pattern | str | ACTIVE | Regex pattern tested via Polars str.contains(). For exclude_rows a match causes exclusion; for extra_validation_start a non-match causes exclusion. |
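The two opposite uses of a FieldValidation rule are easy to mix up, so here is a small sketch of both. Python's `re.search` is used as an approximation of a regex "contains" test; the helper names, rules, and sample rows are all hypothetical.

```python
import re


def rule_matches(row, rule):
    """True when the rule's pattern is found in the named field's value."""
    return re.search(rule["pattern"], row.get(rule["field"]) or "") is not None


def apply_exclude_rows(rows, rules):
    """exclude_rows semantics: drop a row if ANY rule matches."""
    return [row for row in rows if not any(rule_matches(row, r) for r in rules)]


def is_start_candidate(row, rule):
    """extra_validation_start semantics: keep a row only if the rule matches."""
    return rule_matches(row, rule)


rows = [
    {"details": "BALANCE CARRIED FORWARD"},
    {"details": "INTEREST CHARGED"},
    {"details": "ACME ENERGY"},
]
exclude_rules = [{"field": "details", "pattern": "^BALANCE CARRIED FORWARD"}]
interest_rule = {"field": "details", "pattern": "INTEREST"}

kept = apply_exclude_rows(rows, exclude_rules)
candidates = [row for row in kept if is_start_candidate(row, interest_rule)]
print(kept)
print(candidates)
```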
MergeFields¶
Specifies how multi-row text fields are collapsed into a single output value.
| Field | Type | Status | Description |
|---|---|---|---|
| fields | list[str] | ACTIVE | Names of the fields whose per-row values should be joined. |
| separator | str | ACTIVE | Delimiter inserted between joined values (e.g. " \| "). |
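As an illustration of merge_fields together with fill_forward_fields from TransactionSpec, the sketch below joins the per-row values of one field and forward-fills a sparse column. Both helpers are hypothetical plain-Python stand-ins; whether the pipeline skips empty values when joining is an assumption.

```python
def merge_field(values, separator):
    """Join the non-empty per-row values of one field within a transaction."""
    return separator.join(value for value in values if value)


def fill_forward(values):
    """Replace each None with the most recent non-None value above it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Made-up values for one multi-row transaction and one sparse date column.
details = merge_field(["ACME ENERGY", None, "REF 12345"], " | ")
dates = fill_forward(["01 Jan 24", None, None, "02 Jan 24"])
print(details)
print(dates)
```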
Step 5: Define Statement Types¶
Create statement_types.toml to define the extraction workflow for each distinct
statement layout. A single bank may have multiple statement types (e.g. current
account vs. credit card) if their PDF layouts differ.
Each statement type groups extraction into two sections:
- header: runs once per statement to extract metadata (dates, balances, account info)
- lines: runs per-page to extract transaction rows
Configs within each section either reference a statement_table_key from
statement_tables.toml or define an inline single-field extraction.
Example (HSBC_UK/statement_types.toml):
```toml
[HSBC_UK_CUR]
statement_type = 'HSBC UK Current Account'

[HSBC_UK_CUR.header]
[[HSBC_UK_CUR.header.configs]]
config = 'Statement Balances'
statement_table_key = 'HSBC_UK_CUR_ACCT_SUM'

[[HSBC_UK_CUR.header.configs]]
config = 'Statement Info'
locations = [
    {page_number=1, top_left = [37, 325], bottom_right = [290, 385]}
]
field = {field = 'statement_date', vital=true, type = "string", string_pattern ='^[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s(\d{4}\s)?to\s[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$'}

[[HSBC_UK_CUR.header.configs]]
config = 'Account Info'
statement_table_key = 'HSBC_UK_CUR_ACCT_DET'

[HSBC_UK_CUR.lines]
[[HSBC_UK_CUR.lines.configs]]
config = 'Transaction Lines'
statement_table_key = 'HSBC_UK_CUR_TRANSACTIONS'

[HSBC_UK_SAV]
statement_type = 'HSBC UK Saving Account'

[HSBC_UK_SAV.header]
[[HSBC_UK_SAV.header.configs]]
config = 'Statement Balances'
statement_table_key = 'HSBC_UK_CUR_ACCT_SUM'

[[HSBC_UK_SAV.header.configs]]
config = 'Statement Info'
locations = [{page_number=1, top_left = [37, 325], bottom_right = [290, 385]}]
field = {field = 'statement_date', vital=true, type = "string", string_pattern ='^[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s(\d{4}\s)?to\s[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$'}

[[HSBC_UK_SAV.header.configs]]
config = 'Account Info'
statement_table_key = 'HSBC_UK_SAV_ACCT_DET'

[HSBC_UK_SAV.lines]
[[HSBC_UK_SAV.lines.configs]]
config = 'Transaction Lines'
statement_table_key = 'HSBC_UK_CUR_TRANSACTIONS'

[HSBC_UK_CRD]
statement_type = 'HSBC UK Credit Card'

[HSBC_UK_CRD.header]
[[HSBC_UK_CRD.header.configs]]
config = 'Statement Balances'
statement_table_key = 'HSBC_UK_CRD_ACCT_SUM'

[[HSBC_UK_CRD.header.configs]]
config = 'Statement Info'
locations = [{page_number=1, top_left = [120, 470], bottom_right = [250, 500]}]
field = {field = 'statement_date', vital=true, type = "string", string_pattern ='^[0-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$'}

[[HSBC_UK_CRD.header.configs]]
config = 'Account Info'
locations = [{page_number=2, top_left = [44, 174], bottom_right = [215, 200]}]
field = {field = 'account_name', vital=true, type = "string", string_pattern ='^[A-Z]+[a-z]*\s[A-Z]+[a-z]*.*$'}

[[HSBC_UK_CRD.header.configs]]
config = 'Account Info'
locations = [{page_number=2, top_left = [220, 175], bottom_right = [340, 197]}]
field = {field = 'card_number', vital=true, type = "string", string_pattern ='^[0-9]{4}\s[0-9]{4}\s[0-9]{4}\s[0-9]{4}\s?$'}

[HSBC_UK_CRD.lines]
[[HSBC_UK_CRD.lines.configs]]
config = 'Transaction Lines'
statement_table_key = 'HSBC_UK_CRD_TRANSACTIONS'
```
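
Before committing a string_pattern to config, it can help to try it against sample text copied from a statement. The sketch below checks the HSBC_UK_CUR statement_date pattern from the example above using Python's re module, which stands in here for whatever regex engine the pipeline uses. The sample strings and the `matches` helper are hypothetical.

```python
import re

# string_pattern copied from the HSBC_UK_CUR 'Statement Info' config.
STATEMENT_DATE_PATTERN = (
    r"^[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s(\d{4}\s)?to\s"
    r"[1-3]?[0-9]\s[A-Z][a-z]{2,8}\s[0-9]{4}$"
)


def matches(pattern, text):
    """Return True when the pattern is found in the extracted text."""
    return re.search(pattern, text) is not None


# Hypothetical text as it might be read from the crop rectangle.
samples = [
    "1 January 2024 to 31 January 2024",  # year on both dates
    "14 December to 13 January 2024",     # first year omitted
    "Statement of account",               # unrelated text
]
results = [matches(STATEMENT_DATE_PATTERN, s) for s in samples]
print(results)
```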
Key dataclasses¶
StatementType¶
Full extraction specification for one statement layout variant.
| Field | Type | Status | Description |
|---|---|---|---|
| statement_type | str | ACTIVE | Human-readable label matching the value used in StdRefs.statement_type (e.g. "HSBC UK Current Account"). Used to select the correct StdRefs mapping when promoting raw fields to standard columns. |
| header | ConfigGroup | ACTIVE | Config steps that extract statement-level metadata: dates, account numbers, opening/closing balances, etc. |
| lines | ConfigGroup | ACTIVE | Config steps that extract per-transaction data from the body of each page. |
ConfigGroup¶
An ordered list of Config extraction steps for one pipeline section.
| Field | Type | Status | Description |
|---|---|---|---|
| configs | list[Config] | ACTIVE | Ordered list of Config steps. Executed in sequence during extraction; results are stacked into the section's results DataFrame. |
Config¶
A single extraction step: one table (or one standalone field) from one location.
| Field | Type | Status | Description |
|---|---|---|---|
| config | str | ACTIVE | Human-readable label for this extraction step (e.g. "Statement Balances"). Written into the "config" column of the long-format results DataFrame for traceability. |
| statement_table_key | str | ACTIVE | Key into statement_tables.toml that identifies the StatementTable to use. Resolved to statement_table at load time. Set to None for inline single-field configs. |
| statement_table | StatementTable | ACTIVE | Resolved at load time from statement_table_key. The StatementTable object used during extraction. Not set directly in TOML. |
| locations | list[Location] | ACTIVE | Used only for inline single-field configs (where statement_table is None). Defines where on the page to find the field value. |
| field | Field | ACTIVE | Used only for inline single-field configs. Defines the extraction spec for the single value to read from the location. |
Step 6: Define Accounts¶
Create accounts.toml to define each account product offered by the bank. Each
account links together a company, an account type, and a statement type, plus
defines a PDF detection rule to identify which account a given statement belongs to.
Example (HSBC_UK/accounts.toml — first entry shown):
```toml
[HSBC_UK_CRD_RCC]
account = "Rewards Credit Card"
company_key = 'HSBC_UK'
account_type_key = 'CRD'
statement_type_key = 'HSBC_UK_CRD'
exclude_last_n_pages = 1
currency = "GBP"

[HSBC_UK_CRD_RCC.config]
config = 'Account Info'
locations = [{page_number = 1, top_left = [275, 20], bottom_right = [575, 70]}]
field = {field = 'account', vital=true, type="string", string_pattern ='^Your[\s]*Rewards[\s]*Credit[\s]*Card[\s]*statement[\s]*'}
```
Key fields:
- company_key: must match a key in your companies.toml
- account_type_key: must match a key in the shared account_types.toml (e.g. CRD, CUR)
- statement_type_key: must match a key in your statement_types.toml
- exclude_last_n_pages: number of trailing pages to skip (terms & conditions, etc.)
- config: inline extraction rule to identify this account from page 1 of the PDF
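
Because the three *_key fields point into other files, a typo in any of them leaves a dangling reference. The sketch below shows one way to check the links yourself before running the parser. The `missing_references` helper and the sample dictionaries are hypothetical, and the parser's own load-time validation may behave differently.

```python
def missing_references(account, companies, account_types, statement_types):
    """Return the names of key fields that do not resolve to a known entry."""
    checks = [
        ("company_key", companies),
        ("account_type_key", account_types),
        ("statement_type_key", statement_types),
    ]
    return [name for name, registry in checks if account.get(name) not in registry]


account = {
    "company_key": "HSBC_UK",
    "account_type_key": "CRD",
    "statement_type_key": "HSBC_UK_CRD",
}
companies = {"HSBC_UK": {}}
account_types = {"CRD": {}, "CUR": {}}
statement_types = {"HSBC_UK_CUR": {}}  # HSBC_UK_CRD deliberately left out

problems = missing_references(account, companies, account_types, statement_types)
print(problems)
```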
Key dataclasses¶
Account¶
Full runtime configuration for one bank account.
| Field | Type | Status | Description |
|---|---|---|---|
| account | str | ACTIVE | Human-readable account name (e.g. "Current Account"). Written to the STD_ACCOUNT standard field in the output. |
| company_key | str | ACTIVE | Key into companies.toml identifying the issuing bank. Used to build ID_ACCOUNT and to look up the Company object at load time. |
| company | Company | ACTIVE | Resolved at load time from company_key. Provides the company name and company-level identification config. |
| account_type_key | str | ACTIVE | Key into account_types.toml (e.g. "CRD", "CUR", "SAV"). Used to look up the AccountType object at load time. |
| account_type | AccountType | STUB | Resolved at load time from account_type_key. The AccountType object is populated but never subsequently read by any pipeline consumer. |
| statement_type_key | str | ACTIVE | Key into statement_types.toml identifying the extraction layout for this account's statements. Used to look up the StatementType object at load time. |
| statement_type | StatementType | ACTIVE | Resolved at load time from statement_type_key. Provides the header and lines ConfigGroups used during extraction. |
| exclude_last_n_pages | int | ACTIVE | Number of trailing pages to skip when cloning per-page locations. Set to 1 (or more) when the final page(s) contain terms & conditions or other non-transaction content that would otherwise be passed to the extraction pipeline. |
| currency | str | ACTIVE | ISO 4217 currency code for all monetary fields on this account (e.g. "GBP", "USD", "PHP"). Must be a key in currency_spec in currency.py; validated at config load time. Used by the extraction pipeline to resolve the CurrencySpec for fields of type "currency". |
| config | Config | ACTIVE | Account-level identification config. A lightweight extraction step run to confirm a PDF belongs to this account before the full extraction pass. Defined inline under [ACCOUNT_KEY.config] in accounts.toml. |
Step 7: Register Standard Field Mappings¶
Finally, add entries for your new statement type(s) to the shared
standard_fields.toml. This file maps bank-specific raw field names to
standardised output columns (STD_*).
For each STD_* field, add a new std_refs entry with your statement type's
name and the corresponding raw field name from your statement_tables.toml.
File: project/config/import/standard_fields.toml
Example (showing STD_OPENING_BALANCE with entries for multiple banks):
```toml
[STD_OPENING_BALANCE]
section = "header"
type = "numeric"
vital = true
std_refs = [
    {statement_type="HSBC UK Current Account", field="opening_balance"},
    {statement_type="HSBC UK Saving Account", field="opening_balance"},
    {statement_type="HSBC UK Credit Card", field="previous_balance", multiplier=-1.0000},
    {statement_type="TSB UK Current Account", field="opening_balance"},
]
```
Standard fields reference¶
The following standard fields are defined for every statement type. Fields
marked Vital = Yes (vital = true in standard_fields.toml) raise a ConfigError
if no mapping is found for a statement type; the remaining fields are optional.
| Standard Field | Section | Type | Vital | Purpose |
|---|---|---|---|---|
| STD_STATEMENT_DATE | header | date | Yes | Statement period end date |
| STD_SORTCODE | header | string | No | Bank sort code |
| STD_ACCOUNT_NUMBER | header | string | Yes | Account or card number |
| STD_ACCOUNT_HOLDER | header | string | No | Account holder name |
| STD_OPENING_BALANCE | header | numeric | Yes | Opening balance |
| STD_CLOSING_BALANCE | header | numeric | Yes | Closing balance |
| STD_PAYMENTS_IN | header | numeric | Yes | Total credits in period |
| STD_PAYMENTS_OUT | header | numeric | Yes | Total debits in period |
| STD_TRANSACTION_DATE | lines | date | Yes | Individual transaction date |
| STD_TRANSACTION_TYPE | lines | string | Yes | Payment type code |
| STD_TRANSACTION_DESC | lines | string | Yes | Transaction description |
| STD_PAYMENT_IN | lines | numeric | Yes | Credit amount per transaction |
| STD_PAYMENT_OUT | lines | numeric | Yes | Debit amount per transaction |
std_refs entry options¶
Each std_refs entry supports the following options:
StdRefs¶
Mapping rule that promotes a raw extracted field to a standard output column.
| Field | Type | Status | Description |
|---|---|---|---|
| statement_type | str | ACTIVE | Key used to select this rule; matched against the statement type string of the PDF being processed (e.g. "HSBC UK Current Account"). |
| field | str | ACTIVE | Name of the raw extracted column to promote. Set to None (or omit) when a literal default value should be used instead of a column value. |
| format | str | ACTIVE | strptime format string applied when StandardFields.type == "date" (e.g. "%-d %B %Y"). Ignored for numeric and string types. |
| default | str | ACTIVE | Literal string value used as the output when field is None/absent. Useful for injecting constant metadata (e.g. transaction_type = "CC"). |
| multiplier | float | ACTIVE | Scalar applied to the value after casting when StandardFields.type == "numeric". Use -1 to invert sign (e.g. to convert a credit amount stored as positive into a negative figure). |
| exclude_positive_values | bool | ACTIVE | When True, any positive numeric value is replaced with 0 after casting. Used to isolate debit-side figures from a combined amount column. |
| exclude_negative_values | bool | ACTIVE | When True, any negative numeric value is replaced with 0 after casting. Used to isolate credit-side figures from a combined amount column. |
| terminator | str | ACTIVE | Regex pattern; when present the string value is truncated at the first match position before being written to the standard column. Useful for stripping trailing boilerplate appended by merge_fields (e.g. " \| BALANCE CARRIED FORWARD"). |
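
The exclusion flags and terminator are easiest to understand on sample data. The sketch below is a plain-Python illustration of the behaviour described in the table, assuming a combined amount column where credits are positive and debits are negative. The helper names and sample values are hypothetical.

```python
import re


def apply_exclusions(value, exclude_positive_values=False, exclude_negative_values=False):
    """Replace values on the excluded side of zero with 0."""
    if exclude_positive_values and value > 0:
        return 0.0
    if exclude_negative_values and value < 0:
        return 0.0
    return value


def apply_terminator(text, terminator):
    """Truncate text at the first match of the terminator regex, if any."""
    match = re.search(terminator, text)
    return text[: match.start()] if match else text


amounts = [120.00, -45.50]  # one credit, one debit
credits = [apply_exclusions(a, exclude_negative_values=True) for a in amounts]
debits = [apply_exclusions(a, exclude_positive_values=True) for a in amounts]
description = apply_terminator(
    "COFFEE SHOP | BALANCE CARRIED FORWARD", r" \| BALANCE CARRIED FORWARD"
)
print(credits, debits, description)
```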
StandardFields¶
Declaration of a single standard output column and how to derive it.
| Field | Type | Status | Description |
|---|---|---|---|
| section | str | ACTIVE | Pipeline section this field belongs to: "header" (statement-level metadata extracted once per statement) or "lines" (per-transaction data). Used to dispatch the field to the correct extraction pass. |
| type | str | ACTIVE | Data type of the standard column: "string", "numeric", or "date". Controls casting, multiplier application, and date parsing in get_standard_fields(). |
| vital | bool | ACTIVE | When True a ConfigError is raised if no matching StdRefs entry is found for the current statement type, halting processing. Set False for optional fields that not all statement types provide. |
| std_refs | list[StdRefs] | ACTIVE | One entry per supported statement type. The correct entry is selected at runtime by matching StdRefs.statement_type. |
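
The interaction between vital and std_refs can be sketched as below. The ConfigError class is a local stand-in so the snippet runs on its own, and `select_std_ref` plus the "Demo Bank Savings" statement type are hypothetical. The real lookup lives in the parser and may differ in detail.

```python
class ConfigError(Exception):
    """Local stand-in for the ConfigError named in the table above."""


def select_std_ref(std_field_name, spec, statement_type):
    """Return the matching std_refs entry, or fail if a vital field has none."""
    for ref in spec["std_refs"]:
        if ref["statement_type"] == statement_type:
            return ref
    if spec["vital"]:
        raise ConfigError(f"{std_field_name}: no std_refs entry for {statement_type!r}")
    return None


sort_code = {"vital": False, "std_refs": []}
opening = {"vital": True, "std_refs": []}

# Optional field with no mapping: nothing is returned, processing continues.
optional_result = select_std_ref("STD_SORTCODE", sort_code, "Demo Bank Savings")

# Vital field with no mapping: the lookup fails loudly.
try:
    select_std_ref("STD_OPENING_BALANCE", opening, "Demo Bank Savings")
    vital_raised = False
except ConfigError:
    vital_raised = True
print(optional_result, vital_raised)
```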
Configuration Checklist¶
Use this checklist to verify your configuration is complete:
- [ ] Account type registered in account_types.toml (or existing type reused)
- [ ] Bank config folder created: project/config/import/<BANK_COUNTRY>/
- [ ] companies.toml: company key, name, and PDF detection rule
- [ ] statement_tables.toml: all table extraction rules (summary, detail, transaction)
- [ ] statement_types.toml: header and lines config groups referencing your table keys
- [ ] accounts.toml: account entries linking company, type, and statement type
- [ ] standard_fields.toml: std_refs entries added for all 13 standard fields
- [ ] Test with a real PDF: bsp process --pdfs /path/to/statements
- [ ] Verify checks & balances pass (opening + payments_in - payments_out = closing)
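
The checks & balances rule is simple arithmetic, so it is easy to verify by hand against the figures printed on a statement. A minimal sketch, using Decimal to avoid floating-point rounding (the function name and the sample figures are made up for illustration):

```python
from decimal import Decimal


def balances_reconcile(opening, payments_in, payments_out, closing):
    """Check opening + payments_in - payments_out == closing."""
    return opening + payments_in - payments_out == closing


ok = balances_reconcile(
    Decimal("1500.25"), Decimal("2000.00"), Decimal("1250.75"), Decimal("2249.50")
)
bad = balances_reconcile(
    Decimal("1500.25"), Decimal("2000.00"), Decimal("1250.75"), Decimal("2249.00")
)
print(ok, bad)
```

If the check fails for a new bank, a sign convention is a common first suspect: a total stored with the opposite sign usually needs a multiplier of -1 in its std_refs entry.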
Dataclass Reference¶
Complete reference for all configuration dataclasses defined in
bank_statement_parser.modules.data. Fields marked STUB are declared but
not currently read by the pipeline — they are reserved for future use.
Company¶
Configuration for a financial institution (bank/provider).
| Field | Type | Status | Description |
|---|---|---|---|
| company | str | ACTIVE | Human-readable company name (e.g. "HSBC UK"). Used to populate the STD_COMPANY standard field. |
| config | Config | ACTIVE | Extraction config used during the company-identification pass. Extracts a discriminating field (e.g. a bank-specific header string) to confirm the PDF belongs to this company before attempting account matching. |
| accounts | dict | STUB | Declared but never accessed by the pipeline after load. Intended as a lookup from account key to Account object but currently unused. |
Account¶
Full runtime configuration for one bank account.
| Field | Type | Status | Description |
|---|---|---|---|
| account | str | ACTIVE | Human-readable account name (e.g. "Current Account"). Written to the STD_ACCOUNT standard field in the output. |
| company_key | str | ACTIVE | Key into companies.toml identifying the issuing bank. Used to build ID_ACCOUNT and to look up the Company object at load time. |
| company | Company | ACTIVE | Resolved at load time from company_key. Provides the company name and company-level identification config. |
| account_type_key | str | ACTIVE | Key into account_types.toml (e.g. "CRD", "CUR", "SAV"). Used to look up the AccountType object at load time. |
| account_type | AccountType | STUB | Resolved at load time from account_type_key. The AccountType object is populated but never subsequently read by any pipeline consumer. |
| statement_type_key | str | ACTIVE | Key into statement_types.toml identifying the extraction layout for this account's statements. Used to look up the StatementType object at load time. |
| statement_type | StatementType | ACTIVE | Resolved at load time from statement_type_key. Provides the header and lines ConfigGroups used during extraction. |
| exclude_last_n_pages | int | ACTIVE | Number of trailing pages to skip when cloning per-page locations. Set to 1 (or more) when the final page(s) contain terms & conditions or other non-transaction content that would otherwise be passed to the extraction pipeline. |
| currency | str | ACTIVE | ISO 4217 currency code for all monetary fields on this account (e.g. "GBP", "USD", "PHP"). Must be a key in currency_spec in currency.py; validated at config load time. Used by the extraction pipeline to resolve the CurrencySpec for fields of type "currency". |
| config | Config | ACTIVE | Account-level identification config. A lightweight extraction step run to confirm a PDF belongs to this account before the full extraction pass. Defined inline under [ACCOUNT_KEY.config] in accounts.toml. |
AccountType¶
Simple lookup label for an account type category.
| Field | Type | Status | Description |
|---|---|---|---|
| account_type | str | ACTIVE | Account type label (e.g. "CRD" for credit card, "CUR" for current account). Populated at load time but not subsequently consumed by the pipeline; present for potential reporting or routing use. |
StatementType¶
Full extraction specification for one statement layout variant.
| Field | Type | Status | Description |
|---|---|---|---|
| statement_type | str | ACTIVE | Human-readable label matching the value used in StdRefs.statement_type (e.g. "HSBC UK Current Account"). Used to select the correct StdRefs mapping when promoting raw fields to standard columns. |
| header | ConfigGroup | ACTIVE | Config steps that extract statement-level metadata: dates, account numbers, opening/closing balances, etc. |
| lines | ConfigGroup | ACTIVE | Config steps that extract per-transaction data from the body of each page. |
ConfigGroup¶
An ordered list of Config extraction steps for one pipeline section.
| Field | Type | Status | Description |
|---|---|---|---|
| configs | list[Config] | ACTIVE | Ordered list of Config steps. Executed in sequence during extraction; results are stacked into the section's results DataFrame. |
Config¶
A single extraction step: one table (or one standalone field) from one location.
| Field | Type | Status | Description |
|---|---|---|---|
| config | str | ACTIVE | Human-readable label for this extraction step (e.g. "Statement Balances"). Written into the "config" column of the long-format results DataFrame for traceability. |
| statement_table_key | str | ACTIVE | Key into statement_tables.toml that identifies the StatementTable to use. Resolved to statement_table at load time. Set to None for inline single-field configs. |
| statement_table | StatementTable | ACTIVE | Resolved at load time from statement_table_key. The StatementTable object used during extraction. Not set directly in TOML. |
| locations | list[Location] | ACTIVE | Used only for inline single-field configs (where statement_table is None). Defines where on the page to find the field value. |
| field | Field | ACTIVE | Used only for inline single-field configs. Defines the extraction spec for the single value to read from the location. |
StatementTable¶
Full configuration for extracting one table from a PDF statement.
| Field | Type | Status | Description |
|---|---|---|---|
| type | str | STUB | Table type label: "transaction", "summary", or "detail". Loaded from TOML but not currently read by the pipeline; the extraction path is determined by whether transaction_spec is present rather than this field. |
| statement_table | str | STUB | Human-readable table label (e.g. "Transactions", "Account Summary"). Loaded from TOML for documentation purposes but not consumed by the pipeline. |
| header_text | str | ACTIVE | When set, the first table row whose text matches this string is stripped before extraction. Use when pdfplumber includes the column header row in the extracted data. |
| remove_header | bool | ACTIVE | When True the first table row is unconditionally stripped. Use when the header row is always present but its text varies (making header_text impractical). |
| locations | list[Location] | ACTIVE | One or more Location entries describing where on the page to find this table. Locations without a page_number are cloned for every page. |
| fields | list[Field] | ACTIVE | Ordered list of field extraction specs. For transaction tables each field must have a column; for summary/detail tables each field must have a cell. |
| table_columns | int | ACTIVE | Expected minimum number of columns in the extracted table. Passed to pdfplumber as min_words_horizontal and used to validate column count after extraction. Also triggers allow_text_failover retry logic. |
| table_rows | int | ACTIVE | Expected minimum number of rows in the extracted table. Passed to pdfplumber as min_words_vertical. |
| row_spacing | int | ACTIVE | pdfplumber snap_y_tolerance in PDF points. Rows whose top edges fall within this distance of each other are merged into the same table row. Increase if the statement uses tight line spacing that splits a single visual row across multiple pdfplumber rows. |
| tests | list[Test] | STUB | Declarative post-extraction assertions. Declared and accepted in TOML but no pipeline code evaluates them. Reserved for a future config validation pass. |
| delete_success_false | bool | STUB | Intended to drop rows where any field extraction returned success = False. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. |
| delete_cast_success_false | bool | STUB | Intended to drop rows where numeric casting failed. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. |
| delete_rows_with_missing_vital_fields | bool | STUB | Intended to drop rows where any vital field is missing after extraction. Declared and set in TOML (typically True) but no pipeline code currently reads or acts on this flag. Note: vital-field hard-failure logic exists in validate() but is separate from this flag. |
| transaction_spec | TransactionSpec | ACTIVE | When set, the table is processed as a transaction table using the bookend-based multi-row extraction path. Must be None for summary/detail tables. |
Location¶
Describes a rectangular region on a PDF page from which a table or text is extracted.
| Field | Type | Status | Description |
|---|---|---|---|
| page_number | int | ACTIVE | 1-based page number. When set the location is used only on that page. When None the location is cloned for every page (spawn_locations()). |
| top_left | list[int] | ACTIVE | [x, y] coordinates of the top-left corner of the crop rectangle. Must be set together with bottom_right. When both are None the full page is used. |
| bottom_right | list[int] | ACTIVE | [x, y] coordinates of the bottom-right corner of the crop rectangle. Must be set together with top_left. |
| vertical_lines | list[int] | ACTIVE | Explicit x-coordinates of vertical column dividers supplied to pdfplumber as explicit_vertical_lines. Pairs of identical values create a zero-width gap that forces a column boundary (e.g. [100, 100, 200, 200]). When set, pdfplumber's automatic column detection is disabled for this region. |
| dynamic_last_vertical_line | DynamicLineSpec | ACTIVE | When set, the final value in vertical_lines is replaced at runtime with an x-coordinate derived from a PDF image's bounding box. See DynamicLineSpec. Used where the rightmost column boundary floats with a logo. |
| allow_text_failover | bool | ACTIVE | When True and the extracted table has the wrong number of columns, the extraction is retried without vertical_lines, falling back to pdfplumber's text-based column detection. Useful as a safety net for pages where the explicit dividers produce a malformed table. |
| try_shift_down | int | ACTIVE | Number of PDF points to shift the crop rectangle downward (applied to both top_left[1] and bottom_right[1]) when the initial extraction returns an empty region. Handles statements where the table top boundary varies slightly between pages. |
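
Two of the behaviours above, per-page cloning and try_shift_down, are easier to picture with a small model. The sketch below uses a simplified stand-in Location dataclass with hypothetical helper names. It is not the parser's spawn_locations(), and treating exclude_last_n_pages as a simple cut-off on the page range is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Simplified stand-in for the Location dataclass described above."""
    page_number: Optional[int] = None
    top_left: Optional[list] = None
    bottom_right: Optional[list] = None


def clone_for_pages(location, total_pages, exclude_last_n_pages=0):
    """Copy a page-less location onto every page that should be extracted."""
    if location.page_number is not None:
        return [location]
    last_page = max(total_pages - exclude_last_n_pages, 0)
    return [replace(location, page_number=p) for p in range(1, last_page + 1)]


def shift_down(location, points):
    """Move the crop rectangle down by `points` (both y coordinates)."""
    x0, y0 = location.top_left
    x1, y1 = location.bottom_right
    return replace(location, top_left=[x0, y0 + points], bottom_right=[x1, y1 + points])


template = Location(top_left=[30, 100], bottom_right=[560, 780])
clones = clone_for_pages(template, total_pages=4, exclude_last_n_pages=1)
shifted = shift_down(clones[0], 12)
print([c.page_number for c in clones], shifted.top_left, shifted.bottom_right)
```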
DynamicLineSpec¶
Locates the position of the last vertical column divider from an embedded PDF image.
| Field | Type | Status | Description |
|---|---|---|---|
| image_id | int | ACTIVE | Zero-based index into the list of images on the page, identifying which image provides the boundary coordinate. |
| image_location_tag | str | ACTIVE | Bounding-box attribute of the image to use as the x-coordinate (e.g. "x0" for left edge, "x1" for right edge). |
Field¶
Extraction specification for a single column or cell within a PDF table.
| Field | Type | Status | Description |
|---|---|---|---|
| field | str | ACTIVE | Output column name for this field (e.g. "date", "£_paid_out"). Used as the field identifier throughout the pipeline and in the output Parquet files. |
| cell | Cell | ACTIVE | Row/column address for summary or detail table extraction. Mutually exclusive with column; set to None for transaction tables. |
| column | int \| None | ACTIVE | Zero-based column index for transaction table extraction. Mutually exclusive with cell; set to None for summary/detail tables. |
| vital | bool | ACTIVE | When True, extraction failure for this field causes the row to be flagged as a hard failure and excluded from output. When False, failure is recorded but the row is retained. |
| type | str | ACTIVE | Data type: "string", "numeric", or "currency". "string": raw text extraction; pattern matching and trimming applied. "numeric": numeric extraction with optional explicit currency stripping via currency_override. "currency": identical to "numeric" but inherits the CurrencySpec from the account's Account.currency rather than requiring an explicit currency_override on every field. Use "currency" for all monetary amount fields; reserve "numeric" for non-monetary numerics (e.g. APR, sort code). |
| strip_characters_start | str | ACTIVE | Characters to strip from the start of the raw string before pattern matching (passed to Polars str.strip_chars_start()). Useful for leading currency symbols not covered by the account currency spec. |
| strip_characters_end | str | ACTIVE | Characters to strip from the end of the raw string before pattern matching (passed to Polars str.strip_chars_end()). |
| currency_override | str \| None | ACTIVE | Explicit ISO 4217 currency key (e.g. "GBP") used when type == "numeric" and currency stripping is needed but should differ from the account-level Account.currency. Ignored when type == "currency" (which always uses the account-level currency). Omit for non-monetary numeric fields (e.g. APR, sort code) where no currency stripping is required. |
| numeric_modifier | NumericModifier | ACTIVE | Sign/multiplier transformation applied after numeric casting. See NumericModifier. Omit for straightforward positive numeric values. |
| string_pattern | str | ACTIVE | Regex pattern the extracted string must match. Extraction is marked as failed (success = False) if the value does not match. Used to validate field contents (e.g. date format) and to skip blank or irrelevant rows. |
| string_max_length | int | ACTIVE | Maximum character length for string values; longer strings are truncated via str.head(). Useful for capping free-text description fields. Defaults to 999 if not set. |
| date_format | str | STUB | Intended strptime format for date parsing at the Field level. Declared but never read by the pipeline; date format parsing is handled via StdRefs.format in get_standard_fields() instead. |
| value_offset | FieldOffset | ACTIVE | When set, reads the field's value from an adjacent column (Field.column + FieldOffset.cols_offset) using the type and currency rules defined in the FieldOffset rather than those on this Field. The primary field column is still extracted normally; the offset column value replaces it in the output. See FieldOffset. |
Cell¶
Zero-based row and column address of a cell within a PDF table.
| Field | Type | Status | Description |
|---|---|---|---|
| row | int | ACTIVE | Zero-based row index within the extracted table. |
| col | int | ACTIVE | Zero-based column index within the extracted table. |
FieldOffset¶
Reads a field's value from an adjacent column rather than the field's own column.
| Field | Type | Status | Description |
|---|---|---|---|
| rows_offset | int | STUB | Intended row offset for reading the value from a different row. Declared and accepted in TOML but never read by the pipeline; only cols_offset is currently consumed. Always set to 0 in TOML examples. |
| cols_offset | int | ACTIVE | Column offset applied to Field.column to locate the source cell (e.g. 1 reads from the column immediately to the right). |
| vital | bool | ACTIVE | Passed to the extraction pipeline for the offset field; when True extraction failure is treated as a hard failure for that row. |
| type | str | ACTIVE | Data type for the offset value: "string", "numeric", or "currency". Overrides the parent Field.type for this value read. |
| currency_override | str \| None | ACTIVE | Explicit currency key (e.g. "GBP") for numeric stripping of the offset value when type == "numeric". Overrides the account-level currency. When type == "currency" the account-level currency is used and this is ignored. |
| numeric_modifier | NumericModifier | ACTIVE | Sign/multiplier modifier for the offset value. Overrides the parent Field.numeric_modifier. |
NumericModifier¶
Optional sign/multiplier transformation applied after numeric casting.
| Field | Type | Status | Description |
|---|---|---|---|
| prefix | str | ACTIVE | If the raw value starts with this string the prefix is stripped and the multiplier applied. Use for formats like "(123.45)" where "(" signals a negative value. |
| suffix | str | ACTIVE | If the raw value ends with this string the suffix is stripped and the multiplier applied. Use for formats like "123.45 CR" or "123.45D". |
| multiplier | float | ACTIVE | Scalar applied to the cast value when the prefix/suffix matches, or unconditionally if neither prefix nor suffix is set. Typically -1 to invert sign. |
| exclude_negative_values | bool | ACTIVE | When True, any negative result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column. |
| exclude_positive_values | bool | ACTIVE | When True, any positive result after casting and multiplier application is replaced with 0. Useful for isolating one side of a combined debit/credit column. |
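
The prefix, suffix and multiplier rules combine as sketched below. This is a plain-Python reading of the table, with a hypothetical `apply_modifier` helper and made-up sample values. The pipeline itself works on Polars columns and may handle edge cases differently.

```python
def apply_modifier(raw, prefix=None, suffix=None, multiplier=1.0):
    """Strip a matching prefix/suffix, cast to float, then apply the multiplier.

    The multiplier applies when a prefix or suffix matched, or always when
    neither prefix nor suffix is configured.
    """
    text = raw.strip()
    matched = False
    if prefix and text.startswith(prefix):
        text, matched = text[len(prefix):], True
    if suffix and text.endswith(suffix):
        text, matched = text[: -len(suffix)], True
    value = float(text.strip())
    if matched or (prefix is None and suffix is None):
        value *= multiplier
    return value


debit = apply_modifier("123.45D", suffix="D", multiplier=-1.0)     # suffix matches
plain = apply_modifier("50.00", suffix="D", multiplier=-1.0)       # no suffix on value
credit = apply_modifier("75.10 CR", suffix="CR", multiplier=-1.0)  # suffix after a space
always = apply_modifier("20.00", multiplier=-1.0)                  # unconditional
print(debit, plain, credit, always)
```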
CurrencySpec¶
Currency formatting rules used to strip symbols and separators before numeric casting.
| Field | Type | Status | Description |
|---|---|---|---|
| name | str | ACTIVE | Human-readable currency name (e.g. "British Pound Sterling"). |
| symbols | list[str] | ACTIVE | List of currency symbol strings to strip from the raw value before casting (e.g. ["£", "$"]). Replaced with empty string via str.replace_many(). |
| seperator_decimal | str | STUB | Intended decimal separator character (e.g. "."). Declared but never read by the pipeline; decimal handling is implicit after symbols and thousands separators are stripped. |
| seperators_thousands | list[str] | ACTIVE | List of thousands-separator strings to strip (e.g. [","]). Replaced with empty string via str.replace_many() before casting. |
| round_decimals | int | STUB | Intended rounding precision after casting. Declared but never read by the pipeline; no rounding is currently applied. |
| pattern | str | ACTIVE | Regex pattern used to extract the numeric substring from the raw cell text before symbol/separator stripping. Passed to patmatch() via build_pattern(). |
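
As a rough illustration of the stripping step, the sketch below removes the configured symbols and thousands separators and then casts the remainder. It skips the pattern extraction stage and uses plain str.replace rather than Polars. The `to_number` helper and sample values are hypothetical.

```python
def to_number(raw, symbols, seperators_thousands):
    """Strip currency symbols and thousands separators, then cast to float."""
    text = raw.strip()
    for token in [*symbols, *seperators_thousands]:
        text = text.replace(token, "")
    return float(text)


# Parameter names mirror the CurrencySpec fields (including their spelling).
gbp_amount = to_number("£1,234.56", symbols=["£"], seperators_thousands=[","])
small_amount = to_number(" 7.00 ", symbols=["£"], seperators_thousands=[","])
print(gbp_amount, small_amount)
```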
TransactionSpec¶
Full specification for extracting transactions from a transaction-type table.
| Field | Type | Status | Description |
|---|---|---|---|
| transaction_bookends | list[TransactionBookend] | ACTIVE | One or more bookend definitions that identify transaction boundaries. Evaluated in order; a row matched by an earlier bookend is not re-matched by a later one. At least one bookend is required. |
| fill_forward_fields | list[str] | ACTIVE | Field names whose null values should be forward-filled across rows within the same page after pivot. Use for sparse columns where a value (e.g. a date or payment type) appears only on the first row of a multi-row block and needs propagating to the end row. |
| merge_fields | MergeFields | ACTIVE | When set, collapses multi-row text fields within each transaction into a single joined string. See MergeFields. |
| exclude_rows | list[FieldValidation] | ACTIVE | Rows where any rule's field value matches its pattern are removed from the results before bookend detection runs. Use to suppress known non-transaction rows (e.g. a closing balance summary line) that would otherwise interfere with transaction counting or checks & balances. Each rule is a {field, pattern} pair; a row is excluded if any rule matches. |
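
Forward-filling is the least obvious of these options, so here is a minimal sketch of the idea on plain dictionaries. The `fill_forward` helper and sample rows are hypothetical, and the page-scoping mentioned in the table is left out for brevity.

```python
def fill_forward(rows, fields):
    """Replace None in the named fields with the last non-null value seen."""
    last_seen = {}
    filled = []
    for row in rows:
        new_row = dict(row)
        for name in fields:
            if new_row.get(name) is None:
                new_row[name] = last_seen.get(name)
            else:
                last_seen[name] = new_row[name]
        filled.append(new_row)
    return filled


# The date is printed only on the first row of each block.
rows = [
    {"date": "01 Jan", "details": "CARD PAYMENT"},
    {"date": None, "details": "ACME STORES"},
    {"date": "02 Jan", "details": "DIRECT DEBIT"},
]
dates = [row["date"] for row in fill_forward(rows, ["date"])]
print(dates)
```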
TransactionBookend¶
Defines how the start and end of a single transaction are detected within a table.
| Field | Type | Status | Description |
|---|---|---|---|
| `start_fields` | `list[str]` | ACTIVE | Field names checked to identify the first row of a transaction. A row qualifies as a start row when at least `min_non_empty_start` of these fields extracted successfully (`success = True`). |
| `min_non_empty_start` | `int` | ACTIVE | Minimum number of `start_fields` that must have extracted successfully for a row to be flagged as `transaction_start = True`. |
| `end_fields` | `list[str]` | ACTIVE | Field names checked to identify the last row of a transaction. A row qualifies as an end row when at least `min_non_empty_end` of these fields extracted successfully. |
| `min_non_empty_end` | `int` | ACTIVE | Minimum number of `end_fields` that must have extracted successfully for a row to be flagged as `transaction_end = True`. |
| `extra_validation_start` | `FieldValidation` | ACTIVE | When set, any row where the named field's value does not match the pattern is disqualified as a start row for this bookend. A row disqualified here may still be captured by another bookend in the list. Useful for bookends that should only trigger on a specific row shape (e.g. an interest charge line identified by its details text). |
| `extra_validation_end` | `FieldValidation` | STUB | Symmetric counterpart to `extra_validation_start` for end rows. Declared but not yet implemented; no pipeline code currently reads this field. Reserved for future use. |
| `sticky_fields` | `list[str]` | STUB | Intended to forward-fill named fields from the start row of a transaction down to its end row, scoped within a single transaction (unlike `fill_forward_fields`, which fills across transactions). Declared but not implemented; no pipeline code reads this field. |
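The counting rule behind the four ACTIVE threshold fields can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `BookendSketch`, `flag_row`, and the field names (`date`, `paid_out`, and so on) are hypothetical, and each row is reduced to a dict recording whether a field extracted successfully.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class BookendSketch:
    """Hypothetical stand-in for the four counting fields of TransactionBookend."""
    start_fields: List[str]
    min_non_empty_start: int
    end_fields: List[str]
    min_non_empty_end: int


def flag_row(success: Dict[str, bool], bookend: BookendSketch) -> Tuple[bool, bool]:
    """Return (transaction_start, transaction_end) for one extracted row."""
    start_hits = sum(1 for name in bookend.start_fields if success.get(name, False))
    end_hits = sum(1 for name in bookend.end_fields if success.get(name, False))
    return (
        start_hits >= bookend.min_non_empty_start,
        end_hits >= bookend.min_non_empty_end,
    )


# Hypothetical field names; each dict records whether a field extracted successfully.
rows = [
    {"date": True, "details": True, "paid_out": False, "paid_in": False},
    {"date": False, "details": True, "paid_out": True, "paid_in": False},
]
bookend = BookendSketch(["date"], 1, ["paid_out", "paid_in"], 1)
flags = [flag_row(row, bookend) for row in rows]
```

Here the first row is flagged as a start because its `date` extracted, and the second as an end because one of the two amount fields extracted.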
FieldValidation
A field-name/regex-pattern pair used as a row filter or row qualification rule.
| Field | Type | Status | Description |
|---|---|---|---|
| `field` | `str` | ACTIVE | Name of the extracted field (output column name) whose value is tested against the pattern. |
| `pattern` | `str` | ACTIVE | Regex pattern tested via Polars `str.contains()`. For `exclude_rows` a match causes exclusion; for `extra_validation_start` a non-match causes exclusion. |
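The two opposite uses of a `{field, pattern}` rule can be sketched as follows. This is an illustration under stated assumptions: Python's `re.search` stands in for Polars `str.contains()`, missing values are treated as empty strings, and the helper names and sample rows are hypothetical.

```python
import re
from typing import Dict, List


def matches(row: Dict[str, str], rule: Dict[str, str]) -> bool:
    """True when the rule's pattern is found anywhere in the named field."""
    value = row.get(rule["field"]) or ""  # assumption: missing value behaves like ""
    return re.search(rule["pattern"], value) is not None


def apply_exclude_rows(rows: List[Dict[str, str]], rules: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """exclude_rows semantics: a row is dropped if ANY rule matches."""
    return [row for row in rows if not any(matches(row, rule) for rule in rules)]


def passes_start_validation(row: Dict[str, str], rule: Dict[str, str]) -> bool:
    """extra_validation_start semantics: a NON-match disqualifies the row."""
    return matches(row, rule)


rows = [
    {"details": "COFFEE SHOP"},
    {"details": "CLOSING BALANCE"},
    {"details": "INTEREST CHARGED"},
]
kept = apply_exclude_rows(rows, [{"field": "details", "pattern": "^CLOSING BALANCE"}])
interest_rule = {"field": "details", "pattern": "INTEREST"}
start_candidates = [row for row in kept if passes_start_validation(row, interest_rule)]
```

Because the pattern is searched rather than matched in full, anchor it with `^` or `$` when only a whole-value or prefix match should count.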
MergeFields
Specifies how multi-row text fields are collapsed into a single output value.
| Field | Type | Status | Description |
|---|---|---|---|
| `fields` | `list[str]` | ACTIVE | Names of the fields whose per-row values should be joined. |
| `separator` | `str` | ACTIVE | Delimiter inserted between joined values (e.g. `" \| "`). |
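As a rough sketch of the join, the rows of one transaction can be collapsed like this. The helper name and sample data are hypothetical, and skipping empty values is an assumption; the library may handle blanks differently.

```python
from typing import Dict, List


def merge_fields(rows: List[Dict[str, str]], fields: List[str], separator: str) -> Dict[str, str]:
    """Join each named field's per-row values into one string per transaction."""
    merged = {}
    for name in fields:
        # Assumption: empty or missing values are skipped rather than joined.
        parts = [row[name] for row in rows if row.get(name)]
        merged[name] = separator.join(parts)
    return merged


# One transaction whose description spans three physical rows.
transaction_rows = [
    {"details": "CARD PAYMENT"},
    {"details": "ACME LTD"},
    {"details": ""},
]
merged = merge_fields(transaction_rows, fields=["details"], separator=" | ")
```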
StandardFields
Declaration of a single standard output column and how to derive it.
| Field | Type | Status | Description |
|---|---|---|---|
| `section` | `str` | ACTIVE | Pipeline section this field belongs to: `"header"` (statement-level metadata extracted once per statement) or `"lines"` (per-transaction data). Used to dispatch the field to the correct extraction pass. |
| `type` | `str` | ACTIVE | Data type of the standard column: `"string"`, `"numeric"`, or `"date"`. Controls casting, multiplier application, and date parsing in `get_standard_fields()`. |
| `vital` | `bool` | ACTIVE | When `True`, a `ConfigError` is raised if no matching `StdRefs` entry is found for the current statement type, halting processing. Set to `False` for optional fields that not all statement types provide. |
| `std_refs` | `list[StdRefs]` | ACTIVE | One entry per supported statement type. The correct entry is selected at runtime by matching `StdRefs.statement_type`. |
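The interaction between `vital` and `std_refs` can be sketched as a simple lookup. This is only an illustration: the `ConfigError` class below is a local stand-in for the library's exception, `select_std_ref` is a hypothetical helper, and returning `None` for a non-vital miss is an assumption about how optional fields are skipped.

```python
from typing import Dict, List, Optional


class ConfigError(Exception):
    """Local stand-in for the library's ConfigError."""


def select_std_ref(std_refs: List[Dict[str, str]], statement_type: str, vital: bool) -> Optional[Dict[str, str]]:
    """Pick the StdRefs entry whose statement_type matches the current statement."""
    for ref in std_refs:
        if ref["statement_type"] == statement_type:
            return ref
    if vital:
        raise ConfigError(f"no std_refs entry for statement type {statement_type!r}")
    return None  # assumption: a non-vital field is simply left unmapped


# Hypothetical raw field name; the statement type string is the document's own example.
std_refs = [{"statement_type": "HSBC UK Current Account", "field": "closing_balance"}]
chosen = select_std_ref(std_refs, "HSBC UK Current Account", vital=True)
optional_miss = select_std_ref(std_refs, "Some Other Statement", vital=False)
```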
StdRefs
Mapping rule that promotes a raw extracted field to a standard output column.
| Field | Type | Status | Description |
|---|---|---|---|
| `statement_type` | `str` | ACTIVE | Key used to select this rule; matched against the statement type string of the PDF being processed (e.g. `"HSBC UK Current Account"`). |
| `field` | `str` | ACTIVE | Name of the raw extracted column to promote. Set to `None` (or omit) when a literal default value should be used instead of a column value. |
| `format` | `str` | ACTIVE | `strptime` format string applied when `StandardFields.type == "date"` (e.g. `"%-d %B %Y"`). Ignored for numeric and string types. |
| `default` | `str` | ACTIVE | Literal string value used as the output when `field` is `None` or absent. Useful for injecting constant metadata (e.g. `transaction_type = "CC"`). |
| `multiplier` | `float` | ACTIVE | Scalar applied to the value after casting when `StandardFields.type == "numeric"`. Use `-1` to invert sign (e.g. to convert a credit amount stored as positive into a negative figure). |
| `exclude_positive_values` | `bool` | ACTIVE | When `True`, any positive numeric value is replaced with 0 after casting. Used to isolate debit-side figures from a combined amount column. |
| `exclude_negative_values` | `bool` | ACTIVE | When `True`, any negative numeric value is replaced with 0 after casting. Used to isolate credit-side figures from a combined amount column. |
| `terminator` | `str` | ACTIVE | Regex pattern; when present, the string value is truncated at the first match position before being written to the standard column. Useful for stripping trailing boilerplate appended by `merge_fields` (e.g. `" \| BALANCE CARRIED FORWARD"`). |
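The per-value effect of `multiplier`, the two `exclude_*` flags, and `terminator` can be sketched with three small helpers. These are hypothetical functions written for illustration; the order in which the library combines the numeric rules is not stated here, so each rule is shown on its own. Python's `re` module stands in for the pipeline's regex engine, and the pipe in the terminator is escaped because an unescaped `|` means alternation in a regex.

```python
import re
from typing import Optional


def apply_multiplier(value: float, multiplier: Optional[float]) -> float:
    """Scale a numeric value; -1 inverts the sign."""
    return value if multiplier is None else value * multiplier


def zero_excluded(value: float, exclude_positive_values: bool = False,
                  exclude_negative_values: bool = False) -> float:
    """Replace values on the excluded side of zero with 0."""
    if exclude_positive_values and value > 0:
        return 0.0
    if exclude_negative_values and value < 0:
        return 0.0
    return value


def truncate_at_terminator(text: str, terminator: Optional[str]) -> str:
    """Cut the string at the first position where the terminator regex matches."""
    if terminator is None:
        return text
    match = re.search(terminator, text)
    return text if match is None else text[: match.start()]


inverted = apply_multiplier(25.0, -1)                          # sign flipped
debit_only = zero_excluded(12.5, exclude_positive_values=True)  # positive dropped to 0
credit_only = zero_excluded(12.5, exclude_negative_values=True)  # positive kept
details = truncate_at_terminator(
    "ACME LTD | BALANCE CARRIED FORWARD", r" \| BALANCE CARRIED FORWARD"
)
```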
Test
Declarative test assertion attached to a StatementTable.
| Field | Type | Status | Description |
|---|---|---|---|
| `test_desc` | `str` | STUB | Human-readable description of the test assertion. |
| `assertion` | `str` | STUB | The assertion expression to evaluate (format TBD). |