
The beanhub-extract library

We have already obtained the bank transactions as CSV files from the previous step, either by manually downloading them from the bank's website or by using BeanHub Direct Connect. Now what? If we look at our transaction data carefully, we will always find repeating transactions, be it rent, an internet service fee, or a mobile data plan; this kind of transaction appears again and again periodically. We also usually categorize purchases from the same merchant as the same type of expense, and businesses run payroll for their employees regularly. In the end, only very few transactions are unexpected or one-time. The key to making your accounting book as automatic as possible is to have the software run through all those transactions with pre-defined rules and create the corresponding accounting entries automatically from the data imported from the bank.

Different tools in the plaintext accounting community can help you import transactions from CSV files and other sources. But the data usually comes in different shapes, making it hard to work with, and many tools couple data extraction and transaction generation in the same tool, making it very hard to reuse the same logic elsewhere. To solve those problems, when building our open-source tools for importing Beancount transactions, we split the responsibilities of extracting and importing. For the extracting part, we built beanhub-extract: a simple library that reads CSV files, and potentially files in other formats, and provides a standardized data structure for beanhub-import or other import engines to consume.
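As a quick illustration, here is a minimal sketch of how a consumer might iterate over the standardized records. The MercuryExtractor class name, its module path, and the convention of calling an extractor constructed from an open CSV file are assumptions based on the description above, so check the beanhub-extract README for the exact API.

```python
# A minimal sketch of consuming beanhub-extract output (assumed API).
from beanhub_extract.extractors.mercury import MercuryExtractor  # assumed module path

with open("mercury.csv") as csv_file:
    extractor = MercuryExtractor(csv_file)
    # Every extractor is assumed to yield the same standardized Transaction
    # records, so downstream code does not care which bank the CSV came from.
    for txn in extractor():
        print(txn.date, txn.amount, txn.currency, txn.desc)
```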

Diagram showing how beanhub-extract reads CSV files from different banks and produces uniform transaction records

Here are the currently available fields in the Transaction data structure that beanhub-extract provides (a rough sketch of the data structure follows the list):

  • extractor - name of the extractor
  • file - the filename of import source
  • lineno - the entry line number of the source file
  • reversed_lineno - the entry line number of the source file counted in reverse order; comes in handy for CSV files sorted in descending datetime order
  • transaction_id - the unique id of the transaction
  • date - date of the transaction
  • post_date - date when the transaction posted
  • timestamp - timestamp of the transaction
  • timezone - timezone of the transaction; needs to be one of the timezone values supported by pytz
  • desc - description of the transaction
  • bank_desc - description of the transaction provided by the bank
  • amount - transaction amount
  • currency - ISO 4217 currency code
  • category - category of the transaction, like Entertainment, Shopping, etc.
  • subcategory - subcategory of the transaction
  • pending - pending status of the transaction
  • status - status of the transaction
  • type - type of the transaction, such as Sale, Return, Debit, etc.
  • source_account - source account of the transaction
  • dest_account - destination account of the transaction
  • note - note or memo for the transaction
  • reference - reference value of the transaction
  • payee - payee of the transaction
  • gl_code - general ledger (GL) code
  • name_on_card - Name on the credit/debit card
  • last_four_digits - last four digits of the credit/debit card
  • extra - all the columns the extractor did not handle and map into the Transaction's other attributes go here
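
To make the shape of the record more concrete, here is a rough sketch of the Transaction data structure as a frozen Python dataclass. The field names mirror the list above, but the types and defaults are assumptions inferred from the descriptions and may differ from the actual beanhub_extract.data_types.Transaction definition.

```python
import dataclasses
import datetime
import decimal
import typing


# A rough sketch of the standardized record; types and optionality are
# assumptions inferred from the field descriptions above, not a verbatim
# copy of the library's definition.
@dataclasses.dataclass(frozen=True)
class Transaction:
    extractor: str
    file: typing.Optional[str] = None
    lineno: typing.Optional[int] = None
    reversed_lineno: typing.Optional[int] = None
    transaction_id: typing.Optional[str] = None
    date: typing.Optional[datetime.date] = None
    post_date: typing.Optional[datetime.date] = None
    timestamp: typing.Optional[datetime.datetime] = None
    timezone: typing.Optional[str] = None
    desc: typing.Optional[str] = None
    bank_desc: typing.Optional[str] = None
    amount: typing.Optional[decimal.Decimal] = None
    currency: typing.Optional[str] = None
    category: typing.Optional[str] = None
    subcategory: typing.Optional[str] = None
    pending: typing.Optional[bool] = None
    status: typing.Optional[str] = None
    type: typing.Optional[str] = None
    source_account: typing.Optional[str] = None
    dest_account: typing.Optional[str] = None
    note: typing.Optional[str] = None
    reference: typing.Optional[str] = None
    payee: typing.Optional[str] = None
    gl_code: typing.Optional[str] = None
    name_on_card: typing.Optional[str] = None
    last_four_digits: typing.Optional[str] = None
    extra: typing.Optional[dict] = None
```

If most fields are indeed optional like this, an extractor for a minimal CSV format can fill in only what the bank provides, while richer sources can populate more of the record, and downstream tools such as beanhub-import still see one uniform shape.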