I recently had cause to look into how difficult it would be to use a Python script to connect to a Google Sheets spreadsheet and pull down some data. I quickly found the gspread package, which makes everything very simple. The most complicated part of the process is getting set up with authorization credentials to access the relevant spreadsheet remotely.
I’m working on a project that catalogs and maps live jazz performances around Austin, TX in the 1920s and 1930s. In these early stages, my partner on the project is doing a lot of the archival research and data collection while I work on developing scripts for processing our data and techniques we’ll use to present it on the web. We may eventually also have a research assistant or two, so having a distributed place to input our first-run archival notes is helpful.
So, we have the problem of wanting a ubiquitous and low-friction place for us to input our data while also needing that data to be well-structured so that it can be aggregated according to venue, date, performers, etc.
For ubiquity and low friction, we’re using Google Sheets for input. As a web-based tool, it is available anywhere for input, whether at library workstations or our personal machines. As a simple spreadsheet, data entry mostly involves typing notes naturally, within the limits of a few light formatting conventions, rather than futzing with a more structured interface.
To be able to use the data, we’ll want it rationalized into some predictably structured JSON files against which I can write scripts that will aggregate, map, and visualize these performance data in different ways.
Using gspread gave us a very simple, Pythonic way to get data out of the Google Sheets spreadsheet and into a form we could work with easily.
gspread provides the ability to connect to Google Sheets using a username and password, but particularly on a shared project, it’s going to be safer to use Google’s service account keys. The interface Google provides for setting up new credentials is not the simplest or most intuitive I’ve seen, but you only need to do it once.
First, sign in to the Google Developers Console and select ‘Credentials’ from the sidebar.
Clicking the ‘Create Credentials’ button will display a dropdown menu from which you will select ‘Service account key’. On the following screen, you’ll want to use the dropdown menu to create a new service account for which you’ll need to choose a Name and a Role. The name can be whatever makes the most sense for your project, and the Role should be ‘Project’ → ‘Owner’. Once you’ve configured these settings, you can download the generated JSON file.
The file you’ve just downloaded contains information that will provide unrestricted access to your spreadsheet, so keep it out of public source control repositories and the like.
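One simple way to keep the key’s location out of the repository (a sketch, not part of the original script; the environment variable name is just a placeholder) is to store the file somewhere outside the project and point to it with an environment variable, which is also where the path_to_keyfile value used later in this post could come from:
import os

# Read the key file's location from the environment rather than hard-coding it;
# 'GOOGLE_SHEETS_KEYFILE' is an arbitrary name chosen for this example
path_to_keyfile = os.environ['GOOGLE_SHEETS_KEYFILE']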
When you open the JSON file, you will see a line that will look something like:
"client_email": "test-147@centering-abode-111415.iam.gserviceaccount.com"
You’ll give your new service account access by going to the spreadsheet and using the ‘Share’ button in the upper right to share the spreadsheet with the email address listed on the "client_email" line.
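If you’d rather not open the key file by hand to find that address, a minimal sketch for printing it (reusing the hypothetical environment variable from above) looks like this:
import json
import os

# Load the downloaded key file and print the address to share the sheet with
path_to_keyfile = os.environ['GOOGLE_SHEETS_KEYFILE']
with open(path_to_keyfile) as keyfile:
    key_data = json.load(keyfile)
print(key_data['client_email'])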
We’ll need to import gspread to work with our spreadsheet, and to authorize access to Google from our script we’ll need to import ServiceAccountCredentials from the oauth2client package. We’ll also pull in a few standard library modules that the rest of the script uses.
# Standard library modules for timestamps and JSON output later in the script
import datetime
import json
import time

import gspread
from oauth2client.service_account import ServiceAccountCredentials
Then we’ll authorize with Google using the JSON key file we downloaded, open our spreadsheet using its URL, and use gspread to download the spreadsheet worksheet in which we’re interested into a Worksheet object.
credentials = ServiceAccountCredentials.from_json_keyfile_name(
    path_to_keyfile,
    ['https://spreadsheets.google.com/feeds'])
client = gspread.authorize(credentials)
sheet = client.open_by_url(url_for_google_sheet)
worksheet = sheet.get_worksheet(0)
In worksheet, we now have an object with which we can manipulate all of the rows in our shared spreadsheet.
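As a quick sanity check (not part of the ingestion script itself), we can ask the Worksheet object a few questions to confirm we opened the sheet we expected:
# Confirm we're looking at the right worksheet
print(worksheet.title)                           # name of the worksheet tab
print(worksheet.row_count, worksheet.col_count)  # sheet dimensions
print(worksheet.row_values(1))                   # header row with our column titles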
Our sheet has nine columns: specifics about the performance, metadata about the archival source, the person who made the record, and a column called ‘Processed’ that this script will mark to indicate that it has ingested a record. The Processed column lets me set up conditional formatting rules in our spreadsheet that display processed lines in grey, struck-through text, so we retain our data in its entirety as originally entered while minimizing visual distraction when inputting new material.
Our ingestion script will go through the rows that have not been marked as processed, add those rows to a list to be saved as JSON, and add the indexes of any rows we have processed to another list that will allow us to clean up after ourselves and mark newly processed rows in the shared spreadsheet.
all_rows = worksheet.get_all_records()
performances = []
processed_rows = []
for idx, row in enumerate(all_rows):
    if row['Processed'] != '*':
        keys = ('Artist', 'Date', 'Venue', 'Source',
                'Notes', 'Important', 'Ad', 'Contributor')
        # Create new Dict with only keys from list (Drop 'Processed')
        new_row = {key: row[key] for key in keys}
        performances.append(new_row)
        # add 2 to index to skip header row *and* convert from
        # zero-based indexing to 1-based
        processed_rows.append(idx + 2)
The Worksheet model in gspread has a method get_all_records() that will give us all of the rows in our sheet. We then iterate through those rows, taking both the row’s index and the data from the row as a Dict where the keys are the titles of the columns in our spreadsheet.
If the Processed column on the row does not have our * character indicating that we have processed it before, we make a new Dict with the row data (excluding our Processed column) and add it to the list of performances to be saved, and add the index of the row to the list of processed_rows to be marked in the spreadsheet.
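For illustration, a single row from get_all_records() comes back as a Dict shaped roughly like the one below (the values are invented, but the keys match our column titles); the comprehension above simply keeps every key except 'Processed'.
# A hypothetical row as returned by get_all_records(); the values are made up
row = {
    'Artist': 'Example Trio',
    'Date': '1932-06-14',
    'Venue': 'Example Ballroom',
    'Source': 'Example newspaper, 1932-06-13',
    'Notes': 'Late show, advertised as a dance',
    'Important': '',
    'Ad': 'Y',
    'Contributor': 'RS',
    'Processed': ''
}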
The output of this script just saves off this data to be processed further at another time. That processing step is more manual, so we might be working with a specific generation of ingested data over multiple sessions. We name our output files with a unique timestamp to mark the generations.
First, we output the performance data itself.
# Get a timestamp to mark output files uniquely
ts = time.time()
st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d-%H-%M-%S')
# Output JSON for ingested data
with open('{}-performances.json'.format(st), 'w') as output_file:
    output_file.write(json.dumps(performances))
And then output the list of row indexes that have been processed in case we need this history to update the Processed status in our spreadsheet manually.
# Output row numbers for processed data in case automated
# spreadsheet update fails
with open('{}-processed-rows.json'.format(st), 'w') as output_file:
    output_file.write(json.dumps(processed_rows))
And, finally, we step through that list of row indexes and use gspread to put a star in the Processed column on each of our rows processed in this session, so that they will appear greyed out in our spreadsheet and will be ignored on future runs of this ingestion script.
for row in processed_rows:
    worksheet.update_cell(row, 9, '*')
These update_cell() calls happen one by one, each with an individual request to the server. Not only does this make the process relatively slow, but it also makes it relatively fragile. If the script hangs because of poor connectivity somewhere during the process of updating dozens (or hundreds) of individual rows in the spreadsheet, we have our record of rows that we processed in this session saved off to a JSON file to help us update the spreadsheet manually if necessary.
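If the row-by-row updates ever become too slow or too fragile, gspread can also batch cell updates with range() and update_cells(). A sketch of that approach (assuming column 9 is column ‘I’ and reusing the all_rows and processed_rows lists from above; it’s not what our script currently does) might look like:
# Fetch every 'Processed' cell (column I) in one request, flag the rows we
# processed this session, and write them all back in a single second request
processed = set(processed_rows)
cells = worksheet.range('I2:I{}'.format(len(all_rows) + 1))
for cell in cells:
    if cell.row in processed:
        cell.value = '*'
worksheet.update_cells(cells)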
This is a small script with two purposes: to save data from a spreadsheet shared on Google Sheets into a local JSON file and to update the shared spreadsheet to reflect the fact that the data has been processed. Now that we have the data locally, we can do all sorts of things to organize, structure, and aggregate it in ways that make it more useful to us, without sacrificing the convenience of data entry in Google Sheets. These other scripts will do things like sanitize and standardize Venue and Performer names, establish relationships between connected Performers and Venues, and archive research notes so they can be called up in connection with a particular performance later.