🐍 Python Once a Week
CSV Parsing Demystified2019-11-26

A lot of my friends just come into python usually deal with CSV files as a file format for their datasets. Today, we'll look on how to pull CSV data into python!

The first thing to understand is how CSV files are formatted. CSV stands for Comma Separated Values; if you open a CSV file up in notepad or your favorite text editors, you'll see just that: all the fields in this tabular data being separated by commas. It will look something like this:

name,header 1, header 2
Test name,10.001,2019-11-20
Foo bar,3.1415,2019-10-19

If we think of this as a table where we have rows and columns, then each column is separated by a comma and new rows are separated by new line characters \n . A new line character is how the computer knows to put the next bit of text on a new line. To the computer, the text above really looks something like this:

name,header 1, header 2\nTest name,10.001,2019-11-20\nFoo bar,3.1415,2019-10-19\n

But when we open it in a text editing application, the new line character \n is displays as a separate line.

Now that we know how are data is coming in, the next important bit of design is to make a decision about how we want the data to appear in python. Do we want our CSV to appear as a list of tuples, so that data[1][2] gives us 2019-11-20 ; or do we want a list of dictionaries so that we can do data[1]['header 2'] and get the same result. There are a number of ways to format this data for your usage, any of them valid for their own reason. For this example, let's keep it simple and use a list of tuples.

# Generate some fake data
# path_to_csv should be an actual path in your file system
import tempfile
path_to_csv = tempfile.NamedTemporaryFile().name
with open(path_to_csv, 'w') as fd:
    fd.write('name,header 1, header 2\nTest name,10.001,2019-11-20\nFoo bar,3.1415,2019-10-19\n')
### Example begins below ###

data = []

with open(path_to_csv) as fd:
    # We assume the first line in your CSV is a header
    # In this case, we're not using for anything
    header = fd.readline()
    # readlines will read the file line by line, meaning that even if we had a massive CSV file
    # Our python script only reads one line at a time, consuming a lot less memory!
    for line in fd.readlines():
        # we need to parse the raw line into a proper data structure
        # At this point line may look like "Test name,  10.001,2019-11-20\n"
        # After calling `.split` elements will look like:
        # elements == ['Test name', '  10.001', '2019-11-20\n']
        elements = line.split(',')
        # as you're experimenting, 
        # feel free to drop a print statement to examine what the line looks like
        # print(elements)
        
        # Now we need to clean up elements to remove the whitespace and new line characters
        # After this elements will look like:
        # elements == ['Test name', '10.001', '2019-11-20']
        elements = [e.strip() for e in elements]
        # Perform any necessary type conversions
        name = elements[0]
        # We know that this column will always be a float
        # And if it's not, we'll get an exception and be able to check our data integrity!
        header_1 = float(elements[1])
        header_2 = elements[2]
        
        data.append((name, header_1, header_2))
        
print(data)

Now, for your own edification, try making it a list of dictionaries. Or, for a challenge, use my preferred data structure for this, a namedtuple!

Of course, sometimes there's already a python library to help you read and write CSVs, helpfully named the csv) module. Now you know roughly what it does internally, and so it should appear less intimidating. One of its main utilities is that it features support for non-standard CSV formats or as the module puts it:

It allows programmers to say, “write this data in the format preferred by Excel,” or “read data from this file which was generated by Excel,” without knowing the precise details of the CSV format used by Excel. Programmers can also describe the CSV formats understood by other applications or define their own special-purpose CSV formats

NamedTuple Solution

Here's a quick solution for this using a namedtuple

# generate some fake data
# path_to_csv should be an actual path in your file system
import tempfile
path_to_csv = tempfile.NamedTemporaryFile().name
with open(path_to_csv, 'w') as fd:
    fd.write('name,header 1, header 2\nTest name,10.001,2019-11-20\nFoo bar,3.1415,2019-10-19\n')

from collections import namedtuple
data = []

with open(path_to_csv) as fd:
    header = fd.readline()
    Row = namedtuple('Row', [h.strip().replace(' ', '_') for h in header.split(',')])

    for line in fd.readlines():
        elements = [e.strip() for e in line.split(',')]
        row = Row(elements[0], float(elements[1]), elements[2])
        
        data.append(row)
        
print(data)
# namedtuple supports both attribute and index look-up
print(data[1].name, data[1].header_1)
print(data[1][0], data[1][1])
# as well as tuple unpacking
first, second, third = data[0]
print(first, second, third)