La Tinaja

Connecting to Google Sheets with Python

27 Oct 2017

I recently had cause to look into how difficult it would be to use a Python script to connect to a Google Sheets spreadsheet and pull down some data. I quickly found the gspread package, which makes everything very simple. The most complicated part of the process is getting set up with authorization credentials to access the relevant spreadsheet remotely.

The Problem

I’m working on a project that catalogs and maps live jazz performances around Austin, TX in the 1920s and 1930s. In these early stages, my partner on the project is doing a lot of the archival research and data collection while I work on developing scripts for processing our data and techniques we’ll use to present it on the web. We may eventually also have a research assistant or two, so having a distributed place to input our first-run archival notes is helpful.

So we have a tension: we want a ubiquitous, low-friction place to input our data, but we also need that data to be well structured so it can be aggregated by venue, date, performers, and so on.

For ubiquity and low friction, we’re using Google Sheets for input. As a web-based tool, it is available anywhere for input, whether at library workstations or our personal machines. As a simple spreadsheet, data entry mostly involves typing notes naturally, within the limits of a few light formatting conventions, rather than futzing with a more structured interface.

To be able to use the data, we’ll want it rationalized into some predictably structured JSON files against which I can write scripts that will aggregate, map, and visualize these performance data in different ways.

Using gspread gave us a very simple, Pythonic way to get stuff out of the Google Sheets spreadsheet and into a form that we could work with easily in Python.

Configuring

gspread provides the ability to connect to Google Sheets using a username and password, but particularly on a shared project, it’s going to be safer to use Google’s service account keys. The interface Google provides for setting up new credentials is not the simplest or most intuitive I’ve seen, but you only need to do it once.

First, sign in to the Google Developers Console and select ‘Credentials’ from the sidebar.

Clicking the ‘Create Credentials’ button will display a dropdown menu from which you will select ‘Service account key’. On the following screen, you’ll want to use the dropdown menu to create a new service account for which you’ll need to choose a Name and a Role. The name can be whatever makes the most sense for your project, and the Role should be ‘Project’ → ‘Owner’. Once you’ve configured these settings, you can download the generated JSON file.

The file you’ve just downloaded contains information that will provide unrestricted access to your spreadsheet, so keep it out of public source control repositories and the like.

When you open the JSON file, you will see a line that will look something like:

"client_email": "test-147@centering-abode-111415.iam.gserviceaccount.com"

You’ll give your new service account access by going to the spreadsheet and using the ‘Share’ button in the upper right to share the spreadsheet with the email address listed on the "client_email" line.
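
If you’d rather not dig through the file by hand, a couple of lines of Python will print the address to share with. This is just a convenience sketch; the filename here is a placeholder for wherever you saved your downloaded key.

import json

# Placeholder filename; substitute the path to your downloaded key file
with open('my-service-account-key.json') as keyfile:
    key_data = json.load(keyfile)

print(key_data['client_email'])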

Authorizing

We’ll need to import gspread to work with our spreadsheet, and to authorize our script’s access to Google, we’ll need to import ServiceAccountCredentials from the oauth2client package.

import gspread
from oauth2client.service_account import ServiceAccountCredentials

Then we’ll authorize with Google using the JSON key file we downloaded, open our spreadsheet using its URL, and use gspread to download the spreadsheet worksheet in which we’re interested into a Worksheet object.

credentials = ServiceAccountCredentials.from_json_keyfile_name(
                path_to_keyfile,
                ['https://spreadsheets.google.com/feeds'])
client = gspread.authorize(credentials)

sheet = client.open_by_url(url_for_google_sheet)
worksheet = sheet.get_worksheet(0)

In worksheet, we now have an object with which we can manipulate all of the rows in our shared spreadsheet.
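
As a quick sanity check that the connection works, you can ask the worksheet for its title and its header row (this assumes, as in our sheet, that the first row holds the column titles):

print(worksheet.title)          # the name of the worksheet tab
print(worksheet.row_values(1))  # the column titles from the first row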

Ingesting and Processing

Our sheet has nine columns: some specifics about the performance, metadata about the archival source, the person who made the record, and a column called ‘Processed’ that this script will mark to indicate that it has ingested a record. The Processed column lets me set up conditional formatting rules in the spreadsheet that display already-processed lines in grey, struck-through text, so we can retain our data in its entirety as it was originally entered while minimizing visual distraction when inputting new material.

Our ingestion script will go through the rows that have not been marked as processed, add those rows to a list to be saved as JSON, and add the indexes of any rows we have processed to another list that will allow us to clean up after ourselves and mark newly processed rows in the shared spreadsheet.

all_rows = worksheet.get_all_records()

performances = []
processed_rows = []

# Spreadsheet columns to keep in the output (everything except 'Processed')
keys = ('Artist', 'Date', 'Venue', 'Source',
        'Notes', 'Important', 'Ad', 'Contributor')

for idx, row in enumerate(all_rows):
    if row['Processed'] != '*':
        # Create a new Dict containing only the keys listed above
        new_row = {key: row[key] for key in keys}
        performances.append(new_row)

        # Add 2 to the index to skip the header row *and* convert from
        # zero-based indexing to the sheet's 1-based row numbers
        processed_rows.append(idx + 2)

The Worksheet model in gspread has a method get_all_records() that will give us all of the rows in our sheet. We then iterate through those rows, taking both the row’s index and the data from the row as a Dict where the keys are the titles of the columns in our spreadsheet.
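
For example, a single record comes back as a Dict keyed by our column titles, along these lines (the values here are invented purely for illustration):

{'Artist': 'Example Orchestra',
 'Date': '1926-03-05',
 'Venue': 'Example Ballroom',
 'Source': 'Example source citation',
 'Notes': '',
 'Important': '',
 'Ad': '',
 'Contributor': 'Example contributor',
 'Processed': ''}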

If the row’s Processed column does not contain the * character indicating that we have processed it before, we make a new Dict from the row data (excluding the Processed column) and append it to the performances list to be saved. We also append the row’s index to processed_rows so we can mark it in the spreadsheet.

Finishing Up

The output of this script just saves off this data to be processed further at another time. That processing step is more manual, so we might be working with a specific generation of ingested data over multiple sessions. We name our output files with a unique timestamp to mark the generations.

First, we output the performance data itself, pulling in datetime, json, and time from the standard library for these output steps.

import datetime
import json
import time

# Get a timestamp to mark output files uniquely
ts = time.time()
st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d-%H-%M-%S')

# Output JSON for ingested data
with open('{}-performances.json'.format(st), 'w') as output_file:
    output_file.write(json.dumps(performances))

Then we output the list of row indexes that have been processed, in case we need this history to update the Processed status in our spreadsheet manually.

# Output row numbers for processed data in case automated
# spreadsheet update fails
with open('{}-processed-rows.json'.format(st), 'w') as output_file:
    output_file.write(json.dumps(processed_rows))

And, finally, we step through that list of row indexes and use gspread to put a star in the Processed column on each of our rows processed in this session, so that they will appear greyed out in our spreadsheet and will be ignored on future runs of this ingestion script.

for row in processed_rows:
    worksheet.update_cell(row, 9, '*')

These update_cell() calls happen one by one, each with an individual request to the server, which makes the process both relatively slow and relatively fragile. If the script hangs because of poor connectivity somewhere while updating dozens (or hundreds) of individual rows in the spreadsheet, we have our record of the rows processed in this session saved off to a JSON file to help us update the spreadsheet manually if necessary.
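
If that slowness or fragility ever becomes a real problem, gspread also offers a batch path: fetch the relevant cells with range(), modify them locally, and push them back in a single update_cells() request. Here is a rough sketch, assuming the Processed column is column I (the ninth column) and that the sheet has a single header row.

# Fetch every Processed cell below the header row in one request
last_row = len(all_rows) + 1  # +1 accounts for the header row
cells = worksheet.range('I2:I{}'.format(last_row))

# Flag the rows processed this session; untouched cells are written
# back with the values they already hold
for cell in cells:
    if cell.row in processed_rows:
        cell.value = '*'

worksheet.update_cells(cells)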

Next Steps

This is a small script with two purposes: to save data from a spreadsheet shared on Google Sheets into a local JSON file, and to update the shared spreadsheet to reflect the fact that the data has been processed. Now that we have the data locally, we can write other scripts to organize, structure, and aggregate it in ways that make it more useful to us, without sacrificing the convenience of data entry in Google Sheets. Those scripts will do things like sanitize and standardize Venue and Performer names, establish relationships between connected Performers and Venues, and archive research notes so they can be called up later in connection with a particular performance.