Python, Files, and Databases

What's a Database?

A database is an organized collection of data. Data is stored on disk and in memory in a manner optimized for reading and querying it in complex ways.

Why a Database?

Databases are extremely good at complex queries. They also excel at managing concurrency, data integrity and redundancy.

Why Not a Database?

Storing data in a file (referred to as a flat file) is fine. In fact, in some situations storing data in a flat file is better than using a database; in others it's worse. Flat files shouldn't be considered verboten but rather just another tool in your tool chest. You wouldn't use a hammer to tighten a bolt, would you?

Let's look at a couple of examples:

Example 1
Volker is writing a program that interacts with hardware via a serial port. He wants to be able to configure the port and communication settings without having to change his program.

This is a great opportunity to use a flat file to store his settings. Configuration files are a very common use case for flat files and easy to do in a number of ways. The easiest ways all rely on a known format to read and write data in. This abstracts away reading and writing the data, getting out of the developer's way and letting them focus on using the data.

Two very common formats are YAML and JSON.

Here's YAML; you'll notice it uses a human-readable syntax.

Config:
    SubItem1: SomeValue
    SubItem2:
        Key: Value 1
        Elements:
            SomeKey: Value
    SubItem3:
        Name: Node 2
        Elements:
            SomeKey: AnotherValue

Let's store this in a file called "config.yaml" and load it using Python. We'll first need to install PyYAML, the library that does the work for us.

$ pip install PyYAML

And then we read the file in Python:

import yaml

with open('config.yaml') as config_file:
    config = yaml.safe_load(config_file)

print(config['Config']['SubItem1'])

config['Config']['SubItem2']['Key'] = 'A different value'

with open('config.yaml', 'w+') as config_file:
    yaml.safe_dump(config, config_file, default_flow_style=False)
Note
We pass default_flow_style=False to yaml.safe_dump so nested collections are written in the indented block style rather than compacted onto one line in the inline (flow) style. Try removing that parameter and seeing what the output looks like!
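
To see the difference, here's a tiny sketch of the two styles side by side:

import yaml

data = {'Config': {'SubItem1': 'SomeValue'}}

# Flow style compacts nested collections onto one line:
print(yaml.safe_dump(data, default_flow_style=True))
# Config: {SubItem1: SomeValue}

# Block style keeps the indented, human-readable layout:
print(yaml.safe_dump(data, default_flow_style=False))
# Config:
#   SubItem1: SomeValue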

If you run this program you'll see:

$ python yamlConfig.py
SomeValue

And your config.yaml file now looks like:

Config:
  SubItem1: SomeValue
  SubItem2:
    Elements:
      SomeKey: Value
    Key: A different value
  SubItem3:
    Elements:
      SomeKey: AnotherValue
    Name: Node 2
Note
YAML is incredibly powerful and very extensible. You can learn more about the format at yaml.org.

How about JSON? Here's the same example YAML document in JSON:

{
    "Config": {
        "SubItem1": "SomeValue",
        "SubItem2": {
            "Key": "Value 1",
            "Elements": {
                "SomeKey": "Value"
            }
        },
        "SubItem3": {
            "Name": "Node 2",
            "Elements": {
                "SomeKey": "AnotherValue"
            }
        }
    }
}

Eh. JSON is a little more verbose than YAML, or, depending on your taste, a little more structured. To each their own. Here's how you would do the same thing using our JSON file.

import json

with open('config.json') as config_file:
    config = json.load(config_file)

print(config['Config']['SubItem1'])

config['Config']['SubItem2']['Key'] = 'A different value'

with open('config.json', 'w+') as config_file:
    json.dump(config, config_file, indent=4)
Note
We're doing the same thing with indent=4 in the json.dump(...) call as we did above with YAML. This tells the JSON library to use indenting rather than compacting the output onto one line.
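
For comparison, a quick sketch of compact versus indented output:

import json

data = {'Config': {'SubItem1': 'SomeValue'}}

print(json.dumps(data))            # {"Config": {"SomeValue" nested on one line}}
print(json.dumps(data, indent=4))
# {
#     "Config": {
#         "SubItem1": "SomeValue"
#     }
# }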

You'll see the exact same output as our YAML script and your JSON file will now look like:

$ python jsonConfig.py
SomeValue
{
    "Config": {
        "SubItem1": "SomeValue", 
        "SubItem3": {
            "Elements": {
                "SomeKey": "AnotherValue"
            }, 
            "Name": "Node 2"
        }, 
        "SubItem2": {
            "Elements": {
                "SomeKey": "Value"
            }, 
            "Key": "A different value"
        }
    }
}
Note
You'll notice we didn't have to install a JSON library. JSON became so ubiquitous that support for reading and writing it is built into most languages; Python ships the json module in its standard library.
You can learn more about the JSON format at www.json.org.

Now let's go back to our example with Volker and his serial port configuration. Let's use YAML to put together a configuration file.

Hardware:
    Serial:
        Port: /dev/ttyUSB1
        BaudRate: 9600
        DataBits: 6
        Parity: N
        StopBits: 2

Then he would load in this configuration file and use it like:

import yaml
import serial  # pyserial, installed with: pip install pyserial

def load_config(filepath):
    with open(filepath) as config_file:
        return yaml.safe_load(config_file)

def setup_serial(config):
    serial_config = config['Hardware']['Serial']
    return serial.Serial(
        port=serial_config['Port'],
        baudrate=serial_config['BaudRate'],
        parity=serial_config['Parity'],
        stopbits=serial_config['StopBits'],
        bytesize=serial_config['DataBits']
    )

def main():
    config = load_config('config.yaml')
    port = setup_serial(config)  # named port so it doesn't shadow the serial module
    # ...

Why Not a File?

As data grows more complex a flat file can quickly become a burden for the developer. Consider the following example:

Example 2
The hardware Volker is interacting with is a primitive Point of Sale (POS) machine. He wants to ingest purchase data being sent from the POS machine to create his own rewards system for his business. Additionally, he wants to be able to provide special reward bonuses on certain products to incentivize their purchase. Ideally, Volker would like to be able to see a report on purchase history so he can develop a strategy for setting reward perks.

Let's look at some sample data he expects to need to be able to write, read and manipulate.

Customers

| Customer ID | Name | Phone | Address | State | City | Zip |
|---|---|---|---|---|---|---|
| 1 | T. Jefferson | 555-555-5555 | 1540 Halberd St | SD | Sioux Falls | 57103 |
| 2 | G. Washington | 555-555-5556 | 647 Dupont Ct | SD | Sioux Falls | 57108 |

Inventory

| SKU | Name | Price | Stock |
|---|---|---|---|
| 123-12345 | Gavel | 24.99 | 14 |
| 123-12346 | Name Placard | 19.99 | 12 |

Purchase Receipts

| Receipt ID | Customer ID | Date | Total Price |
|---|---|---|---|
| 1 | 1 | 2017-03-19 14:37 | 69.97 |
| 2 | 2 | 2017-03-19 15:02 | 19.99 |

Purchased Items

| Receipt ID | SKU | Quantity | Total Price |
|---|---|---|---|---:|
| 1 | 123-12345 | 2 | 49.98 |
| 1 | 123-12346 | 1 | 19.99 |
| 2 | 123-12346 | 1 | 19.99 |

Generated Reward Points

| Customer ID | Rewarded Points | Receipt ID | Received On | Expires On |
|---|---|---|---|---|
| 1 | 69 | 1 | 2017-03-19 | 2017-04-19 |
| 2 | 19 | 2 | 2017-03-19 | 2017-04-19 |

The data in the tables above is pretty straightforward. There are two sample customers registered in the rewards system and two items currently in the store's stock. The POS machine reports two purchases, the first for $69.97 and the second for $19.99. As these purchases are reported, the reward data is generated, timestamping the transaction and setting the reward points to expire a month later.
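
The tables don't spell out how points are calculated, but from the sample data it looks like one point per whole dollar spent. A hypothetical sketch (the field names and rounding rule are inferred from the tables, not given):

from datetime import timedelta

def generate_reward(receipt):
    # Inferred rule: one point per whole dollar (69.97 -> 69), expiring
    # a month (31 days here) after the purchase date.
    purchased_on = receipt['Date'].date()
    return {
        'Customer ID': receipt['Customer ID'],
        'Rewarded Points': int(receipt['Total Price']),
        'Receipt ID': receipt['Receipt ID'],
        'Received On': purchased_on,
        'Expires On': purchased_on + timedelta(days=31),
    }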

We can do the above with a flat file pretty easily:

Customers:
    - Customer ID: 1
      Name: T. Jefferson
      # ...
    - Customer ID: 2
      # ...
Inventory:
    - SKU: 123-12345
      Name: Gavel
      # ...
    - SKU: 123-12346
      # ...
Purchase Receipts:
    - # ...
Purchased Items:
    - # ...
Reward Points:
    - # ...

And we would load the transaction data in with our YAML library:

import yaml

def load_transactions():
    with open('transactions.yaml') as transaction_file:
        transactions = yaml.safe_load(transaction_file)
        return transactions

Now let's generate our purchase history report. Suppose we want a report that looks like this:

| SKU | Start Date | End Date | # Purchased | Avg. / Customer |
|---|---|---|---|---|
| 123-12345 | 2017-03-01 | 2017-04-01 | ... | ... |
| 123-12346 | 2017-03-01 | 2017-04-01 | ... | ... |
| ... | ... | ... | ... | ... |

This is where using a flat file becomes cumbersome. In truth, it's not so much the flat file as it is the representation of the data. Let's look at how we might build this report with the data we loaded with load_transactions().

import yaml
from datetime import date, datetime

def load_transactions():
    with open('transactions.yaml') as transaction_file:
        transactions = yaml.safe_load(transaction_file)
        return transactions

def save_report(filename, report):
    with open(filename, 'w+') as report_file:
        yaml.safe_dump(report, report_file, default_flow_style=False)

def generate_report(transactions, start_date, end_date):
    # Build a mapping of receipts for fast lookup
    receipts = {}
    for item in transactions['Purchase Receipts']:
        item['Purchase Date'] = datetime.strptime(item['Purchase Date'], '%Y-%m-%d %H:%M').date()
        # Exclude out of band purchases
        if start_date <= item['Purchase Date'] <= end_date:
            receipts[item['Receipt ID']] = item
            receipts[item['Receipt ID']]['Items'] = {}

    # Iterate and build the report
    report = {}
    for item in transactions['Purchased Items']:
        # Exclude out of band purchases
        if item['Receipt ID'] in receipts:
            if item['SKU'] not in report:
                report[item['SKU']] = {
                    'SKU': item['SKU'],
                    'Start Date': start_date.strftime('%Y-%m-%d'),
                    'End Date': end_date.strftime('%Y-%m-%d'),
                    '# Purchased': item['Quantity'],
                    'Customers': {receipts[item['Receipt ID']]['Customer ID']: 1}
                }
            else:
                report_item = report[item['SKU']]
                report_item['# Purchased'] += item['Quantity']

                # De-dupe purchases from the same customer
                if receipts[item['Receipt ID']]['Customer ID'] not in report_item['Customers']:
                    report_item['Customers'][receipts[item['Receipt ID']]['Customer ID']] = 1

    # Calculate the Avg. / Customer and remove Customers
    for sku, item in report.items():
        item['Avg. / Customer'] = item['# Purchased'] / len(item['Customers'])
        del item['Customers']

    return list(report.values())

def main():
    transactions = load_transactions()
    report = generate_report(transactions, date(2017, 3, 1), date(2017, 4, 1))
    save_report('report.yaml', report)

if __name__ == '__main__':
    main()
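
Run against the sample data above, report.yaml should come out roughly like this (YAML sorts the keys, and the list order may vary):

- '# Purchased': 2
  Avg. / Customer: 2.0
  End Date: '2017-04-01'
  SKU: 123-12345
  Start Date: '2017-03-01'
- '# Purchased': 2
  Avg. / Customer: 1.0
  End Date: '2017-04-01'
  SKU: 123-12346
  Start Date: '2017-03-01'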
Note
We built a "lookup table", a mapping from receipt ID to receipt, to quickly fetch a receipt by its ID. We could have simply written a function to iterate through the list looking for the ID we wanted, but that doesn't scale.
Right now our report is trivial, searching for three items amongst two receipts. If we used a function that iteratively searched, we'd iterate over that list of two receipts three times, making 6 comparisons. If we had 350 items across 100 receipts we'd scan the 100-receipt list 350 times, making 35,000 comparisons. The work grows with the product of the two sizes, N items times M receipts, which is quadratic growth, not linear.
On the other hand, our lookup table takes advantage of hash tables to find a value in constant time on average, regardless of the list size. Sweet!
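
A quick sketch of the difference, with throwaway data:

# 100 receipts in a list versus the same receipts keyed by ID
receipt_list = [{'Receipt ID': i} for i in range(1, 101)]
receipt_map = {r['Receipt ID']: r for r in receipt_list}

def find_receipt(receipt_id):
    # Linear scan: walks the list on every call
    for receipt in receipt_list:
        if receipt['Receipt ID'] == receipt_id:
            return receipt

print(find_receipt(42) is receipt_map[42])  # True; the dict just gets there faster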

Nothing too complicated. We iterate through purchased items that fall between start_date and end_date, summing purchases and de-duping on customers. We have to do a reverse lookup on the receipt data to get each item's customer ID, but that's about as complicated as it gets.

Still, the code seems a little... bulky. What happens if we need to start grouping based on customer? Or reporting on information spanning multiple relationships (items -> reward points, etc.)? We'd end up building more lookup tables and iterating over and over. Sure, it'd be straightforward, but it would be pretty tedious to write and very easy to mess up.

This scenario is where databases shine - generating reports over complex data. Let's look at the same thing above using a common database, SQLite.

from datetime import datetime, date
import sqlite3
import yaml

def open_database(filename):
    def dict_factory(cursor, row):
        data = {}
        for idx, col in enumerate(cursor.description):
            data[col[0]] = row[idx]
        return data

    database = sqlite3.connect(filename)
    database.row_factory = dict_factory
    return database

def save_report(filename, report):
    with open(filename, 'w+') as report_file:
        yaml.safe_dump(report, report_file, default_flow_style=False)

def generate_report(database, start_date, end_date):
    cursor = database.cursor()
    cursor.execute("""
        select
            sku as 'SKU',
            '{start_date}' as 'Start Date',
            '{end_date}' as 'End Date',
            sum(total) as '# Purchased',
            avg(total) as 'Avg. / Customer'
        from (
            select sku, sum(quantity) as total
            from receipt_items as items
            left join receipts on items.receipt_id = receipts.receipt_id
            where purchase_date >= '{start_date}' and purchase_date <= '{end_date}'
            group by customer_id, sku
        )
        group by sku;""".format(start_date=start_date, end_date=end_date))
    return cursor.fetchall()

def main():
    with open_database('transactions.db') as db:
        report = generate_report(db, date(2017, 3, 1), date(2017, 4, 1))
        save_report('dbReport.yaml', report)

if __name__ == '__main__':
    main()
Note
Where did sqlite3 come from in our import sqlite3? It ships with Python's standard library, so there's nothing to install.
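One caveat worth flagging: the query above splices start_date and end_date into the SQL with str.format. That's harmless for dates we generate ourselves, but for anything user-supplied you'd want sqlite3's parameter binding instead. A minimal sketch:

cursor.execute(
    "select * from receipts where purchase_date >= ? and purchase_date <= ?",
    (str(start_date), str(end_date))  # values are bound, never spliced into the SQL
)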
Note
What's going on in open_database(...)? We're setting a custom row_factory, a transformer function that converts each query result row into a dict, something yaml.safe_dump(...) can print out.
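
If you'd rather not write your own factory, sqlite3 also ships with sqlite3.Row, which gives name-based access out of the box; you just have to convert rows to plain dicts before handing them to yaml.safe_dump(...). A sketch:

database.row_factory = sqlite3.Row

row = database.execute("select * from customers").fetchone()
print(row['name'])                              # columns accessible by name
plain = {key: row[key] for key in row.keys()}   # plain dict for yaml.safe_dump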

The above resembles our manual version in function signatures alone. Notice how the body of generate_report(...) has slimmed down to three statements. Sure, one of them is pretty bulky, but that's because we're doing a fairly complex query. Let's pick it apart:

select
    sku as 'SKU',
    '{start_date}' as 'Start Date',
    '{end_date}' as 'End Date',
    sum(total) as '# Purchased',
    avg(total) as 'Avg. / Customer'
from (
    select sku, sum(quantity) as total
    from receipt_items as items
    left join receipts on items.receipt_id = receipts.receipt_id
    where purchase_date >= '{start_date}' and purchase_date <= '{end_date}'
    group by customer_id, sku
)
group by sku;

The first thing you see is select. This is asking for the SKU, Start Date, End Date, sum(total), and avg(total) from an inner query. Let's quickly look at the inner query - it's actually the most complicated thing we're doing.

select sku, sum(quantity) as total
    from receipt_items as items
    left join receipts on items.receipt_id = receipts.receipt_id
    where purchase_date >= '{start_date}' and purchase_date <= '{end_date}'
    group by customer_id, sku

We're asking for the SKU and sum(quantity) from receipt_items, our Purchased Items list, and performing a left join on receipts, our Purchase Receipts list. You can learn more about SQL joins at en.wikipedia.org/wiki/Join_(SQL), but the quick and dirty is that our left join adds receipt data to each of our purchased-items rows where items.receipt_id = receipts.receipt_id, in other words, whenever the receipt corresponds to the item.

After the join we're specifying a where clause stipulating that we only want rows whose Purchase Date falls between our start and end dates.

Next we're grouping the data together by the Customer ID and the SKU. This enables us to use aggregate functions, like sum(...), to combine data mathematically. In sum(...)'s case, we're adding up the Quantity of every row in each group.
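
To make that concrete, here's roughly what the inner query returns for our sample data: one row per (customer, SKU) pair, with quantities summed.

| sku | total |
|---|---|
| 123-12345 | 2 |
| 123-12346 | 1 |
| 123-12346 | 1 |

The two 123-12346 rows belong to different customers, which is exactly the distinction the outer query's avg needs.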

And that's it for the complicated bits. Now let's look again at the outer query:

select
    sku as 'SKU',
    '{start_date}' as 'Start Date',
    '{end_date}' as 'End Date',
    sum(total) as '# Purchased',
    avg(total) as 'Avg. / Customer'
from (
    -- Inner Query
)
group by sku;

This one is pretty straightforward. We're selecting SKU, Start Date, End Date, sum(total), and avg(total) from our inner query's results, grouping by the SKU. This grouping lets us combine the per-customer totals: the sum is a straight summation of each group's totals, and the average is that sum divided by the count of rows in the group.

Why does our inner query group on Customer ID and SKU while our outer query groups on only the SKU?

The answer lies in the purpose of the inner query. We need to determine the average number of products purchased per customer, but our Receipt Items data can contain multiple entries for the same item and customer. For example, if yesterday we bought a placard, and today we decided we needed another one, we would have two entries for the placard.

If we only grouped by SKU, then each entry would be treated as a different customer, skewing our average. If we grouped only by Customer ID, our data would no longer be specific to an individual SKU.

The inner query allows us to first combine purchases for the same SKU and the same Customer ID. The outer query then does what we actually want, averages the total items purchased by the total number of customers.

The beauty of SQL and databases is that querying data in complex ways goes from writing tedious boilerplate to writing only what matters: the querying bits.

Where did transactions.db come from? You create databases using SQL schema. The schema below creates the database used above. Each database has its own flavor of schema. You can learn more about SQLite's SQL syntax at sqlite.org/lang.html and, in particular, its create table syntax at sqlite.org/lang_createtable.html.

create table customers (
    customer_id integer primary key autoincrement,
    name text,
    phone text,
    address text,
    state text,
    city text,
    zip text
);
create table inventory (
    sku text primary key,
    name text,
    price number,
    stock integer
);
create table receipts (
    receipt_id integer primary key autoincrement,
    customer_id integer,
    purchase_date date,
    total_price number
);
create table receipt_items (
    receipt_id integer,
    sku text,
    quantity integer,
    total_price number
);
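
Assuming you've saved the schema above to a file, say schema.sql (the filename is my choice, not a convention), a few lines of Python will create transactions.db:

import sqlite3

# Connecting creates transactions.db if it doesn't exist yet
database = sqlite3.connect('transactions.db')
with open('schema.sql') as schema_file:
    database.executescript(schema_file.read())
database.close()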
Note
The above schema works but it can be improved a lot using what are called foreign keys. Just as we built lookup tables in our flat-file implementation, databases build indexes around key columns. You use foreign keys to tell the database which table columns reference other tables. You can learn more about SQLite's support for foreign keys at sqlite.org/foreignkeys.html.
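
As a sketch of what that looks like, here's the receipts table again with a foreign key declared (note that SQLite only enforces foreign keys when PRAGMA foreign_keys = ON is set on each connection):

create table receipts (
    receipt_id integer primary key autoincrement,
    customer_id integer,
    purchase_date date,
    total_price number,
    foreign key (customer_id) references customers (customer_id)
);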

Ok, so we've covered creating tables and querying from tables (very quickly) but how do you insert, delete, and update rows in the tables? Let's look at a couple examples:

First, let's insert our customers:

insert into customers
    (name, phone, address, state, city, zip)
values
    ('T. Jefferson', '555-555-5555', '1540 Halberd St', 'SD', 'Sioux Falls', '57103');
insert into customers
    (name, phone, address, state, city, zip)
values
    ('G. Washington', '555-555-5556', '647 Dupont Ct', 'SD', 'Sioux Falls', '57108');

I'll leave inserting the inventory items up to you, but for another example, here's the SQL for inserting Receipts and Purchased Items.

insert into receipts
    (customer_id, purchase_date, total_price)
values
    (1, '2017-03-19 14:37', 69.97);
insert into receipt_items
    (receipt_id, sku, quantity, total_price)
values
    (1, '123-12345', 2, 49.98);
insert into receipt_items
    (receipt_id, sku, quantity, total_price)
values
    (1, '123-12346', 1, 19.99);
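
From Python you'd usually run inserts like these with parameter binding; executemany(...) handles a whole batch in one call. A minimal sketch against our transactions.db:

import sqlite3

database = sqlite3.connect('transactions.db')
database.executemany(
    "insert into receipt_items (receipt_id, sku, quantity, total_price) values (?, ?, ?, ?)",
    [
        (1, '123-12345', 2, 49.98),
        (1, '123-12346', 1, 19.99),
        (2, '123-12346', 1, 19.99),
    ]
)
database.commit()
database.close()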

Next, George has let us know that his phone number has changed. Let's quickly update it:

update customers set phone = '555-555-5565' where customer_id = 2;

Finally, let's delete the Gavels from our inventory as we're dropping them from our stock:

delete from inventory where sku = '123-12345';
Note
Whitespace doesn't matter in SQL (outside of string literals). The above uses line breaks for easy reading but they're not needed.

Database or Flat File?

Flat files are quick: quick to create, quick to write, quick to read, and quick to change. Databases are quick to write, read, and change, but their creation can be time consuming. There is an abundance of tools that make creating and editing databases and their tables easier. Check out SQLite Browser for help creating and manipulating SQLite databases.

Flat File

  • I don't need complex queries
  • My data is not related
  • E.g. serial port settings and logging configuration (log = true, filename = 'somelog.txt')
  • I don't plan on changing data structure (schema) much
  • I don't have a lot of data

Database

  • I need complex queries
  • My data is relational
  • E.g. transaction data coupled with customer data
  • I need to be able to manipulate the data structure (schema) easily
  • I have a lot of data

At the end of the day, and for the general layperson, databases are there to make querying data easier. They handle creating indices for fast lookups, munging together disjoint blobs of data, and making the results easy to manipulate. Don't use a hammer to turn a ratchet, and keep it simple. If a flat file is a simple solution then it's probably the right one. If you find yourself creating lookup tables to build your own relationships within your data, think about using a database.