
The beanhub-extract library

We have already obtained the bank transactions as CSV files from the previous step, either by manually downloading them from the bank's website or by using BeanHub Direct Connect. Now what? If we look at our transaction data carefully, we will always find repeating transactions, be it rent, an internet service fee, or a mobile data plan; this kind of transaction appears again and again periodically. We also usually categorize purchases from the same merchant as the same type of expense, and businesses run payroll for their employees regularly. In the end, only very few transactions are unexpected or one-time. The key to making your accounting book as automatic as possible is to have the software run through all those transactions with pre-defined rules and create the corresponding accounting entries automatically from the data imported from the bank.

Different tools in the plaintext accounting community can help you import transactions from CSV files and other sources. But the data usually comes in different shapes, making it hard to work with, and many tools couple data extraction and transaction generation in the same tool, making it very hard to reuse the same logic elsewhere. To solve those problems, when building our open-source tools for importing Beancount transactions, we split the responsibilities of extracting and importing. For the extracting part, we built beanhub-extract: a simple library that reads CSV files, and potentially files in other formats, and provides a standardized data structure for beanhub-import or other import engines to consume.
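As a quick illustration, here is a minimal sketch of how a consumer might iterate over the standardized records. The MercuryExtractor class name, its module path, and the convention of calling an extractor constructed from an open CSV file are assumptions based on the description above, so check the beanhub-extract README for the exact API.

```python
# A minimal sketch of consuming beanhub-extract output (assumed API).
from beanhub_extract.extractors.mercury import MercuryExtractor  # assumed module path

with open("mercury.csv") as csv_file:
    extractor = MercuryExtractor(csv_file)
    # Every extractor is assumed to yield the same standardized Transaction
    # records, so downstream code does not care which bank the CSV came from.
    for txn in extractor():
        print(txn.date, txn.amount, txn.currency, txn.desc)
```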

Diagram showing how beanhub-extract reads CSV files from different banks and produces uniform transaction records

Here are the currently available fields in the Transaction data structure that beanhub-extract provides (a rough sketch of the data structure follows the list):

  • extractor - name of the extractor
  • file - the filename of import source
  • lineno - the entry line number of the source file
  • reversed_lineno - the entry line number of the source file counted in reverse order; comes in handy for CSV files sorted in descending datetime order
  • transaction_id - the unique id of the transaction
  • date - date of the transaction
  • post_date - date when the transaction posted
  • timestamp - timestamp of the transaction
  • timezone - timezone of the transaction; needs to be one of the timezone values supported by pytz
  • desc - description of the transaction
  • bank_desc - description of the transaction provided by the bank
  • amount - transaction amount
  • currency - ISO 4217 currency code
  • category - category of the transaction, like Entertainment, Shopping, etc.
  • subcategory - subcategory of the transaction
  • pending - pending status of the transaction
  • status - status of the transaction
  • type - type of the transaction, such as Sale, Return, Debit, etc.
  • source_account - source account of the transaction
  • dest_account - destination account of the transaction
  • note - note or memo for the transaction
  • reference - reference value of the transaction
  • payee - payee of the transaction
  • gl_code - general ledger (GL) code
  • name_on_card - Name on the credit/debit card
  • last_four_digits - last four digits of the credit/debit card
  • extra - all the columns the extractor did not handle and map into the Transaction's other attributes go here
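
To make the shape of the record more concrete, here is a rough sketch of the Transaction data structure as a frozen Python dataclass. The field names mirror the list above, but the types and defaults are assumptions inferred from the descriptions and may differ from the actual beanhub_extract.data_types.Transaction definition.

```python
import dataclasses
import datetime
import decimal
import typing


# A rough sketch of the standardized record; types and optionality are
# assumptions inferred from the field descriptions above, not a verbatim
# copy of the library's definition.
@dataclasses.dataclass(frozen=True)
class Transaction:
    extractor: str
    file: typing.Optional[str] = None
    lineno: typing.Optional[int] = None
    reversed_lineno: typing.Optional[int] = None
    transaction_id: typing.Optional[str] = None
    date: typing.Optional[datetime.date] = None
    post_date: typing.Optional[datetime.date] = None
    timestamp: typing.Optional[datetime.datetime] = None
    timezone: typing.Optional[str] = None
    desc: typing.Optional[str] = None
    bank_desc: typing.Optional[str] = None
    amount: typing.Optional[decimal.Decimal] = None
    currency: typing.Optional[str] = None
    category: typing.Optional[str] = None
    subcategory: typing.Optional[str] = None
    pending: typing.Optional[bool] = None
    status: typing.Optional[str] = None
    type: typing.Optional[str] = None
    source_account: typing.Optional[str] = None
    dest_account: typing.Optional[str] = None
    note: typing.Optional[str] = None
    reference: typing.Optional[str] = None
    payee: typing.Optional[str] = None
    gl_code: typing.Optional[str] = None
    name_on_card: typing.Optional[str] = None
    last_four_digits: typing.Optional[str] = None
    extra: typing.Optional[dict] = None
```

If most fields are indeed optional like this, an extractor for a minimal CSV format can fill in only what the bank provides, while richer sources can populate more of the record, and downstream tools such as beanhub-import still see one uniform shape.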