🐍 Python Once a Week
Primer to SQL Databases in Python2019-11-19

Let's go over a primer into relational databasing in python and give you a few opinions on how to approach database interaction into your application. This is aimed at newcomers to python who have covered some of the fundamentals (classes, basic data structures, control loops) and want a deep introduction to useful and popular tools in python.

I welcome all constructive feedback on how to make these topics more useful to you! There was a lot of material to cover, but I feel this particular post went on a bit long. My intention is that nothing in this post should appear intimidating or mysterious. Let me know how I can do better!

What are databases for?

Databases are designed to store large amounts of data (think on the order of GBs). If you've dealt with excel or google sheets, you've already interacted with a kind of database: tabular relational data. Tabular meaning organized in rows and columns, and relational meaning that a cell could reference the value in another. Applications like this have their own language to ask questions about the data like (like typing =SUM(A:B) in a cell); SQL is a another kind of language used about data through a database. And, as we'll see, can be really powerful to add and modify data as well.

I love examples, so let's start with one! Let's say we downloaded a data set from Spotify or another music application into google sheets. We'd probably have a few columns that look like this:

Record:

  • Artist
  • Album
  • Song
  • Song Duration
  • Song Plays

This kind of data model likely extends to a similar problem you may be facing. Instead of records, maybe we have facilities, and instead of plays, maybe it's number of employees. The actual application matters, no one data structure will fit all problems, but the concepts carry over.

If we had only a hundred artists, this probably would only be a couple of thousand of rows. However, large companies deal with tens of thousands of artist data structure that grow in complexity. One of the essential questions we can ask about this data is: which artist has the most total number of plays. That can be a pretty tricky equation in google sheets, we'll have to aggregate all plays along songs and group then by artists. But more: what if we have more than one sheet. Maybe another sheet capturing listening times by user.

Listener:

  • User
  • Record
  • Date User Listened to Record
  • Amount of Time They Listened to the Record
  • User's Country

Here our questions can become more complex. Which song is most on a a Tuesday, or in the evening? Which country is an artist most popular in? Can we get a playlist of a particular users favorite songs? Are longer songs more popular than shorter songs? You can see how doing this in a tool like google sheets can become unwieldy, especially if you have tens of thousands of rows. SQL gets out of your way and let's get answers to these questions.

Quick-start

Before we dive in, let's get you an environment where we can start using a SQL database in python. While we can't program python in "SQL" we use python to send SQL to a database. There are other, more direct, methods to send SQL to a database. However, if we want to generally interact with it beyond just querying the database, getting the data into python let's us all the regular tools of python to further process the data. And as you'll see, python can be used to make writing SQL much simpler.

There are many different kinds of databases that let you interact with it using the SQL language. Right now, we'll use one called SQLite. As the name implies, it's a simple database to get up and running. Many databases need a place to store the data, but one of SQLite's powerful feature is that it can run fully in memory, meaning that it requires no additional set-up to get up and going. In fact, we can get it running directly from python. SQLAlchemy is the library we'll be using to get us there.

Define the Data Model

import sqlalchemy
import sqlalchemy.ext.declarative
from sqlalchemy import Column, Integer, String, Float, Integer, Date, Time

# database uri will tell SQLAlchemy that we're going to connect 
# to our in-memory SQLite instance
database_uri = 'sqlite://'
# An "engine" will help us create connections to this database
engine = sqlalchemy.create_engine(database_uri)
# We'll register our table against this "Base" instance. 
# Declarative base, means that we can declare what the data should look like
# SQLAlchemy can take this definition and turn it into SQL
Base = sqlalchemy.ext.declarative.declarative_base()

# My personal naming is to use the singular noun as the class name, and the plural noun for the table name.
class Record(Base):
    # This is the name of the table or "sheet" where the data will be stored.
    __tablename__ = 'records'
    id   = Column(Integer, primary_key=True)
    artist = Column(String)
    album  = Column(String)
    song = Column(String)
    duration_s = Column(Float)
    plays = Column(Integer)

# Now that we've registered are data structures, 
# we can ask SQLAlchemy to create the tables on our behalf.
# SQLAlchemy will create a SQL statement that SQLite will understand.
# If you're working with a database that has tables that already exist
# Then, this is unnecessary.
Base.metadata.create_all(bind=engine)

id wasn't in our original data model, and why are we saying it's a primary key? Essentially, SQL databases require that each row as a unique way to identify that particular row of data. This of this as the row number in google sheets or excel. Maybe we want to say an artist's song name is what will be unique. But what if an another artist creates a song with that same name? The uniqueness property is lost. Thankfully, by setting aside another field and marking that as the primary_key and giving it a meaningless integer value, we don't have to worry about that. In fact, most SQL databases will automatically generate this value for you, so we don't need to specify it at all. We just need to specify primary_key=True to tell the SQL database that we set aside a field for it.

 ## Let's take a peek and see what _exactly_ 
# the SQL that will be sent to the database is
import sqlalchemy.schema 
# CreateTable takes the table as defined by our class and converts it to SQL. 
# We call `.compile(engine)` that will convert that generic SQL
# to use any engine-specific syntax the "engine", our SQLite instance, will need.
# In this case, `.compile` is unnecessary, SQLite will accept the generic SQL.
print(sqlalchemy.schema.CreateTable(Record.__table__).compile(engine))

"""
Our output looks like:
`CREATE TABLE songs (
    id INTEGER NOT NULL, 
    artist VARCHAR, 
    album VARCHAR, 
    duration FLOAT, 
    plays INTEGER, 
    PRIMARY KEY (id)
)`
"""

We could copy and paste this to any SQL database and it would work. However, SQLAlchemy let's us think in "python-land" (that is, in terms of classes and functions) and does the heavy-lifting of translating that logic into SQL for us.

Datasets and Sessions

We'll need to create a session in order to add data to the database. You see, whenever we create or query data SQL will perform a "transaction". A session manages these transactions for us, letting us group our queries in efficient ways. And, importantly, it enables SQL databases to be ACID. ACID is an acronym, but the promise there is that by scoping our queries in a transaction we know our data will have integrity. If you've ever used a program that crashed on you while you were working and opened it back up to an unintelligible mess, then you have suffered from data integrity issues. The ACID property of SQL databases work to prevent that from ever happening.

from sqlalchemy.orm import sessionmaker

# An easy way to get a "session" from our "engine".
Session = sessionmaker(bind=engine)
# Capital-S session is what we call a factory, it helps us create more sessions to use.
# In large application we would be more careful about how we handle this,
# but here and in most simple cases we can just create one and use is right away.
session = Session()

I've included a function to help us generate some fake data, but you can imagine this coming from a CSV or an online data set. You don't have to concern yourself too much with the inner workings of generate_dataset. Essentially, I'm just pulling some public word choices and generating random data from it. This random data will look like a dataset, and we can fill in the pseudo-dataset into SQLAlchemy.

import random
import datetime
import urllib.request

def download_plain_list(src, __cache__={}):
    if src in __cache__:
        return __cache__[src]
    
    response = urllib.request.urlopen(src)
    long_txt = response.read().decode()
    data = long_txt.splitlines()
    __cache__[src] = data
    
    return __cache__[src]

words = download_plain_list("http://svnweb.freebsd.org/csrg/share/dict/words?view=co&content-type=text/plain")
countries = download_plain_list("https://gist.githubusercontent.com/kalinchernev/486393efcca01623b18d/raw/daa24c9fea66afb7d68f8d69f0c4b8eeb9406e83/countries")
names = download_plain_list("https://raw.githubusercontent.com/dominictarr/random-name/master/names.txt")

def generate_dataset(n=100, __cache__={}):
    records = []
    listeners = []


    for _ in range (n):
        artist =  ' '.join(map(str.capitalize, random.choices(words, k=random.randint(1, 2))))
        for _ in range(random.randint(1, 10)):
            album = random.choice(words).capitalize()
            for _ in range(random.randint(3, 20)):
                song = ' '.join(random.choices(words, k=random.randint(1, 5)))
                duration = random.uniform(60, 360)
                plays = random.randint(0, 1000)
                records.append((artist, album, song, duration, plays))

                
    return records
            
records = generate_dataset(n=2)
session = Session()
for artist, album, song, duration, plays in records:
    session.add(Record(artist=artist, album=album, song=song, duration_s=duration, plays=plays))

session.commit()

Calling session.commit() above tells SQLAlchemy that we are at a good saving point and it can close and "commit" the transaction to the database. After we do this, SQLAlchemy will create the relevant SQL statement and send it. If anything went wrong in that transaction, then an Exception would be thrown at this point. For example, if we were working with a database over the internet and our connection went down. Because of most databases ACID property, if something like this happened, then our transaction wouldn't have gone through. It's all or nothing (that's the 'A' in ACID, all transactions are all atomic).


Before we continue I want to add a quick note about how I'm using SQLAlchemy here. This is just one way to perform database transactions with SQLAlchemy, and it's the one that handles most of SQL for us. We could be manually writing SQL by hand and sending it over with SQLAlchemy or using SQLAlchemy create_table function instead of a declarative_base. Look at the first few examples over at https://www.pythonsheets.com/notes/python-sqlalchemy.html to see what I mean. I think any of those methods work well and are simple. However, using let's us use python expressiveness directly, while letting us never touch a line of SQL. The method we'reusing is called Object-Relational Mapping (or ORM). My one big word of warning is that using an ORM hides a lot of the fine details of what's going on in SQL. That can be a great thing for getting off the ground; however, when our queries become more complex then using an ORM will become obtuse than it's worth. Sometimes taking the time to write the raw SQL, though arduous sometimes, is time better spent. My goal to differentiate this blog post over others is to not shy away from those low-level details, but illuminate them, and show them why these abstract tools exist.


Asking Questions of your Data

At this point we have data in our database and we can actually start asking our original questions against it! In your day-to-day work there may already be a database that your interacting with. Let's learn how to set-up asking those questions.

We have two ways to start querying: writing raw SQL or using the ORM. Let's do both, and see why using the ORM is faster.

For this example, I want to see which artist is the most popular. That will be the artist with the most aggregate total plays. But first, let's start with some SQL basics: just listing the artists we have in our database.

# We're telling our engine to give us the "artist" column from our "records" table and 
# just show us the first ten results (LIMIT 10)
raw_sql = 'SELECT artist from records LIMIT 10'
# Here, .execute gives us a ResultProxy that we call fetchall, meaning that we want to get all the data it returns. 
results = engine.execute(raw_sql).fetchall()

# Results proxy also works like a sequence, so we can also do ...
for result in engine.execute(raw_sql):
    pass

print(results)

I'm a big fan of going straight to documentation in order to learn what other methods certain objects have. I believe one of the best things you can learn is how to search things effectively. If I didn't know what I could do with the results object returned by engine.execute(...) I would use print(results, type(results)) to see what kind of class it is and then try to find it in the SQLAlchemy docs or google by typing "resultproxy sqlalchemy docs".

You'll always see a bunch of tutorial websites or off-brand doc pages. Those are find to for reference, but you typically won't go wrong looking right at the docs. https://docs.sqlalchemy.org/en/13/core/connections.html#sqlalchemy.engine.ResultProxy. If you're just starting, don't worry if it looks intimidating! SQLAlchemy has some dense documentation that only sometimes do a good job of pointing you to what the right thing to do is. Here, it's okay to leverage other sites. For example, above I had to search how to get the raw SQL when creating a table and it brought me to https://stackoverflow.com/questions/2128717/sqlalchemy-printing-raw-sql-from-create. That gave me a good starting point to dive-deeper in the docs.

Let's carry forward!

In the output above you likely saw the same artist multiple time. For every single row in our "spreadsheet", our SQL database will return the artist column. We could take off our LIMIT of 10 to see all of them duplicated, or we could tell our database to only return DISTINCT (or unique) results.

raw_sql = 'SELECT DISTINCT artist from records'
results = engine.execute(raw_sql).fetchall()
print(results)

I removed the LIMIT because I wanted to see all the artists, but you could add it back no problem. Now, how did I know that the DISTINCT keyword went right after the SELECT? Experience, that's all. A quick google search leads me to this example https://www.w3schools.com/sql/sql_distinct.asp.

This is the beauty of having a nice sandbox environment to work in. You can just search and try things and make progress. Alright, now let's answer our original question. I don't know how to do this off the top of the my head, but I know we want to group by artists and sum the plays column. This search works "sum on one column and group by another sql". One of the first results seemed pretty clear. Let's try it!

raw_sql = 'SELECT artist, SUM(plays) from records GROUP BY artist'
results = engine.execute(raw_sql).fetchall()
print(results)

That's super close! But I want to get the one with the greatest number of views. We'll need to add some sorting.

raw_sql = 'SELECT artist, SUM(plays) as sum_plays from records GROUP BY artist ORDER BY sum_plays DESC LIMIT 1'
results = engine.execute(raw_sql).fetchall()
print(results[0])

I could change LIMIT 1 to LIMIT 3 if I wanted the top three for example. Note the sum_plays label is arbitrary. Call it zap_foo and the call will still work!

Writing this SQL is cool, but let's see the alternative method: using our ORM and the Record class we wrote above.

from sqlalchemy import func, desc
# start a new transaction
session = Session()

session.query(Record.artist, func.SUM(Record.plays).label('sum_plays')).group_by(Record.artist).order_by(desc('sum_plays')).first()

In SQLAlchemy, we tell session.query the methods the "columns" we want and the function SUM we want to call. On first glance, there is no distinction to using the raw SQL. But let's make a key distinction.

from sqlalchemy import func, desc
session = Session()

record, plays = session.query(Record, func.SUM(Record.plays)).group_by(Record.artist).order_by(func.SUM(Record.plays).desc()).first()
print(record)
print(record.artist)
print(record.plays, plays)
print(type(record))

Here we got our fully-fledged Record class instance back, with all the relevant fields! No need to ask for specific ones. You'll notice that print(record) returns something vaguely intelligible. Let's change that and make it display more useful information.

Base = sqlalchemy.ext.declarative.declarative_base()

class Record(Base):
    __tablename__ = 'records'
    id   = Column(Integer, primary_key=True)
    artist = Column(String)
    album  = Column(String)
    song = Column(String)
    duration_s = Column(Float)
    plays = Column(Integer)
    
    def __str__(self):
        return f'<Record artist={self.artist} album={self.album} song={self.song} duration={self.duration_s} plays={self.plays}>'

We just over-rode the __str__ method to print something more descriptive in the future. Re-run the previous example to see how our print changes.

However, there's a problem with the result we have. Record.plays refers to the specific artist-song combination that has the number of plays, not the sum total aggregate number of plays. That's what func.SUM(Record.plays) is. SQLAlchemy will only return a record that was in the group that had the top number of plays, that means only the artist field is really correct because of our group_by. This example is a bit of a word-of-warning of performing these kinds of queries with the ORM. The details can get you! However, let's keep with the example. Say I wanted to get all the songs in by the most popular artist. I can look up the SQL to do that ... OR I could use a bit of python to help me out.

# I modified the query to show you how to the direct value without any intermediate objects
artist = session.query(Record.artist).group_by(Record.artist).order_by(func.SUM(Record.plays).desc()).limit(1).scalar()
print(f'Top artist is {artist}')
# Filter by lets us selects records meeting some criteria
records = session.query(Record).filter_by(artist=artist).all()
songs = {record.song for record in records}
print(f'Songs of the artist are {songs}')

# Let's add some more python methods
Base = sqlalchemy.ext.declarative.declarative_base()

class Record(Base):
    __tablename__ = 'records'
    id   = Column(Integer, primary_key=True)
    artist = Column(String)
    album  = Column(String)
    song = Column(String)
    duration_s = Column(Float)
    plays = Column(Integer)
    
    @property
    def describe_song_length(self):
        if self.duration_s <= 2 * 60:
            return 'short'
        elif self.duration_s <= 4 * 60:
            return 'medium'
        elif self.duration_s <= 8 * 60:
            return 'long'
        else:
            return 'prog rock'
        
    
    def __str__(self):
        return f'<Record artist={self.artist} album={self.album} song={self.song} duration={self.duration_s} plays={self.plays}>'
    
records = session.query(Record).filter_by(artist=artist).all()
medium_songs = [r for r in records if r.describe_song_length == 'medium']
print(f'{artist} has {len(medium_songs)} medium length songs out of {len(records)} total songs')

What's next?

Where the SQLAlchemy ORM really shines is helping us query between the relationships of two separate structures. I talked about having a second sheet that captured how listeners interacted with records to answer more complicated questions about our dataset. We'll cover how to make querying relationships between data easy with SQLAlchemy in a following python tip of the week!

Resources

Advanced Topics to Cover

Index/Indices

A mystical, but very practical way to optimize queries on your data.

NoSQL

A new family of modern databases.

Session state (local vs remote)

How does SQLAlchemy's session object keeps track of local vs remote state and how to avoid getting in trouble.

Keeping a single source of truth

And other philosophies.