A Tour of the Python Collections Library: Part 1

One of my favorite things about working with Python on a daily basis is how robust the Python standard library is. In particular, Python has a huge range of prebuilt data types that can significantly enhance productivity, code quality and complexity management. Many of my favorites live in the collections module. Today, I want to present three of them: Deques, NamedTuples and DefaultDicts.

Deques

Deques serve as great go-to data structures for either stacks or queues (although it’s worth noting that queue.Queue is built for multithreading/multiprocessing and implements locking, making it the better choice for those applications). Because a deque is implemented as a doubly linked list, its first and last elements can be accessed in constant time.

I’m a strong believer that code examples are the best way to communicate about code ideas, so let’s build a deque here:

from collections import deque
deq = deque()

# Append to end of queue
deq.append("potato")
deq.append("other potato")
deq.append("last potato")

# Remove and return the first and last values
first = deq.popleft() # "potato"
last = deq.pop() # "last potato"

# Add to front of queue
deq.appendleft("new potato")

# Create a bounded deque; once full, appends discard items from the opposite end
short_deque = deque(maxlen=5)

While the standard library docs are very detailed and well-organized, they tend not to cover example use cases. Deques are useful any time you need to read data in a first in, first out (FIFO) or last in, first out (LIFO) manner. For instance, a deque can hold a list of undo operations, as in the sketch below.
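
Here’s a minimal sketch of that idea, built around a hypothetical UndoHistory wrapper of my own invention; maxlen keeps only the most recent operations, and pop() retrieves them in LIFO order:

from collections import deque

class UndoHistory:
    """Keep only the most recent operations, discarding the oldest when full."""

    def __init__(self, limit=100):
        self._ops = deque(maxlen=limit)

    def record(self, operation):
        # Newest operation goes on the right end
        self._ops.append(operation)

    def undo(self):
        # Pop from the right end: last in, first out
        return self._ops.pop() if self._ops else None

history = UndoHistory(limit=3)
history.record("type 'potato'")
history.record("bold selection")
history.record("delete line")
history.undo() # "delete line"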

Namedtuples

Tuples are great for constant-time lookups, unpacking return values and holding immutable data, but they introduce some cognitive overhead through their obscurity. For instance, consider this tuple:

data = ('Falafel Gyro', 17, False, 191)
data[0]
> 'Falafel Gyro'

While it’s clear what the values here ARE, it’s certainly not clear what they represent. Namedtuples elegantly solve this problem:

from collections import namedtuple

MenuItem = namedtuple('MenuItem', ['name', 'price', 'gluten_free', 'calories'])

falafel = MenuItem('Falafel Gyro', 17, False, 191)

falafel.name
> 'Falafel Gyro'

falafel.calories
> 191

Without changing any of the basic functionality, we have improved our code tremendously. Raymond Hettinger, inventor of the namedtuple and one of the best Python teachers out there, provides other great example use cases in this excellent video.

If you know of a good falafel sandwich with less than 200 calories, please email me immediately.

DefaultDicts

We’re accustomed to dicts that raise a KeyError when we access a nonexistent key, e.g.:

In [1]: potato_tracker = {'home': 55, 'work': 12}

In [2]: potato_tracker['home']
Out[2]: 55

In [3]: potato_tracker['garden']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-86a01747b0ab> in <module>()
----> 1 potato_tracker['garden']

KeyError: 'garden'

This exception serves us well in a lot of cases; it ensures that we actually have the data we’re trying to access. However, in my potato inventory system, I can simply assume that I don’t have any potatoes anywhere I don’t have an entry. It would make sense, then, to just return a zero value in these cases.

In [1]: from collections import defaultdict

In [2]: default_tracker = defaultdict(int)

In [3]: default_tracker['home'] = 55

In [4]: default_tracker['work'] = 12

In [5]: default_tracker['home']
Out[5]: 55


In [6]: default_tracker['Dublin']
Out[6]: 0
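
Because missing keys start at zero, a defaultdict(int) also makes a tidy counter. Here’s a minimal sketch along those lines (the delivery data is made up):

from collections import defaultdict

potato_tracker = defaultdict(int)
deliveries = [("home", 5), ("work", 2), ("garden", 3), ("home", 1)]

for location, count in deliveries:
    # No need to check whether the key exists first; missing keys start at 0
    potato_tracker[location] += count

potato_tracker["home"]   # 6
potato_tracker["cellar"] # 0 (note: the lookup also inserts the key)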

That’s a good overview of the basics of these data types. Soon, I’ll be discussing the rest of the collections module. In the meantime, thanks for stopping by!

Handling Simple Messaging with Redis

There is no shortage of message brokers and queueing solutions available to software engineers. Full-featured enterprise solutions like RabbitMQ offer developers tremendous flexibility with which to solve their messaging needs. While lots of use cases require these advanced features, a solution like RabbitMQ, even without a support license, comes with significant costs. Running a RabbitMQ server adds operational overhead, and requires developers to fully understand the feature set available to them in order to make good engineering decisions.

Proprietary cloud solutions like AWS SQS also come with their own tradeoffs. SQS offers an excellent feature set, and reduces operational overhead by delegating service administration to Amazon. However, making use of these proprietary solutions introduces vendor lock-in; it will dramatically increase the complexity of migrating your system to a different cloud provider or a private cloud solution.

For simple message passing to one or more recipients, Redis is an excellent message broker. Let’s take a closer look at three possible use cases to illustrate this.

Simple Message Queueing

Consider the following situation: Service A, a public-facing API server, needs to pass messages to Service B, which integrates certain API data with a suite of third-party business tools. Order is important here; we want to ensure that all customer data is as correct and up-to-date as possible. To meet this need, we can make use of an intuitive Redis data type: the list.

Lists are ordered data types containing string elements. We can treat a list as a queue with two commands: LPOP and RPUSH. LPOP removes the first element from the list and returns it; RPUSH adds a new element to the end of the list. To pass messages, have Service A RPUSH each message onto the list and have Service B LPOP from it, as in the sketch below.
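
Here’s a minimal sketch of both sides, assuming the redis-py client and a Redis server on localhost; the queue name and message payload are made up:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Service A: push each message onto the right end of the list
r.rpush("service_b_queue", '{"customer_id": 42, "action": "update_email"}')

# Service B: pop messages off the left end, preserving order
message = r.lpop("service_b_queue") # None if the list is empty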

Lots of open source solutions already exist for polling Redis lists, and the underlying implementation is fairly simple. In Python, we can build a generator that polls the queue for new data indefinitely, as sketched below. If you favor off-the-shelf solutions (as you probably should), RQ is a very mature project.
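
A sketch of that polling generator, again assuming redis-py and a local Redis server; BLPOP blocks until a message arrives or the timeout expires, so the loop doesn’t busy-wait:

import redis

r = redis.Redis(decode_responses=True)

def consume(queue_name, timeout=5):
    """Yield messages from a Redis list indefinitely."""
    while True:
        item = r.blpop(queue_name, timeout=timeout)
        if item is not None:
            _key, message = item # blpop returns a (key, value) pair
            yield message

# Usage:
# for msg in consume("service_b_queue"):
#     handle(msg)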

PubSub for Messaging Multiple Recipients

Let’s complicate this picture a little bit. Now, in addition to passing messages to Service B, Service A must pass identical messages to Service C, which transforms the data and sends it to a data warehouse for further processing and analysis. Here, we encounter a limitation of using lists to pass messages across services: once one service pops a message off the list, it is gone for every other consumer.

One potential solution is to use two separate lists, but this introduces additional complexity and doesn’t scale well. Fortunately, Redis supports the PubSub messaging paradigm, which allows multiple subscribers to listen on a given channel. Simply have each service subscribe to whichever channel(s) are relevant to it, and use the PUBLISH command to distribute messages to every subscriber, as in the sketch below.
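
A minimal sketch with redis-py, again assuming a local Redis server; the channel name and payload are arbitrary:

import redis

r = redis.Redis(decode_responses=True)

# Services B and C each subscribe to the channels they care about
subscriber = r.pubsub(ignore_subscribe_messages=True)
subscriber.subscribe("api_events")

# Service A publishes once; every current subscriber receives a copy
r.publish("api_events", '{"customer_id": 42, "action": "update_email"}')

# Each subscriber reads messages off its own connection
message = subscriber.get_message(timeout=1.0) # None if nothing arrived in time
if message:
    print(message["data"])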

PubSub is a well-established and powerful messaging pattern, but it’s important to remember that these messages won’t persist after distribution to subscribers. If persistence is a requirement, check out Redis’ new stream data type, which I’ll cover in detail in a later post.

Other Benefits and Downsides

It can often be a nontrivial task to get two services communicating locally, especially if you’re working outside of a containerized and orchestrated environment. Redis is simple to run locally and doesn’t require any special configuration to get working. Redis is also very fast; while raw performance isn’t critical for the use cases outlined here, that speed makes Redis a good fit for more performance-sensitive messaging as well.

That said, there are definitely use cases in which Redis is the inferior choice. It doesn’t offer delayed message processing out of the box, and its single-threaded nature makes it a poor choice if you need massively concurrent message processing.

For simple message passing without incurring a lot of operational overhead, consider Redis. It’s performant, easy to use and FOSS. It can reduce your operational overhead without incurring vendor lock-in.

Serving API Responses Faster with Edge Caching

API Scaling Challenges

The only thing worse than having to solve problems of scale is not having to solve problems of scale. You’re doing at least one thing right: you’ve made an API other people want to use. Now, however, you’re likely to face slowdowns during peak traffic, especially for endpoints that require multiple round trips to the database or are computationally expensive. Horizontal and vertical scaling can be expensive, so what’s the best way to serve API requests under load?

While there are a variety of optimizations we can perform, I would like to assert that one often-neglected strategy is making use of edge caching. By instructing a Varnish cache or other CDN to hold onto an API response and purging the cache when the data mutates, you can simultaneously decrease response latency and reduce load on the API server. I highly recommend using a Varnish-based cache. For one, Varnish is very fast. Additionally, the VCL language gives engineers the capacity to do lots of routine operations at the cache layer, before any traffic is directed to the API server. You can even do authorization and feature flagging at the edge. The sketch below shows the origin-side headers that drive this behavior.
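
As a hedged illustration of the origin side of this strategy, here’s a minimal Flask sketch; it assumes your CDN or Varnish layer honors the standard Cache-Control s-maxage directive, and the endpoint and TTL values are made up:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/menu")
def menu():
    response = jsonify({"items": [{"name": "Falafel Gyro", "price": 17}]})
    # Edge caches may hold this response for 5 minutes; browsers for 1 minute
    response.headers["Cache-Control"] = "public, max-age=60, s-maxage=300"
    return response

# When the underlying data mutates, purge /menu at the cache layer
# (the exact purge mechanism depends on your CDN or Varnish configuration).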

Don’t forget client-side caching! For people living in remote geographical areas far from a CDN Point of Presence (PoP), client-side caching can significantly improve perceived latency and performance. The two most common approaches here are the ETag and Last-Modified headers. Both allow browsers and other clients to make a conditional request (using the If-None-Match or If-Modified-Since header) to check whether the response has changed since it was last fetched, as sketched below.
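
And here’s a minimal sketch of the ETag flow on the server side, again using Flask as an assumed framework; a real implementation would likely derive the tag from the underlying data rather than hashing the serialized body:

import hashlib
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/menu")
def menu():
    body = jsonify({"items": [{"name": "Falafel Gyro", "price": 17}]})
    # Derive an ETag from the response body
    etag = hashlib.sha256(body.get_data()).hexdigest()

    # If the client already has this version, return an empty 304 instead of the payload
    if request.headers.get("If-None-Match") == etag:
        return "", 304

    body.headers["ETag"] = etag
    return body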

How We Did It

For more technical information on how we’ve done that at my current company, check out this blog post. For the TL;DR, we used a set of headers to define a cache TTL for several of our most-accessed endpoints. This has allowed us to reduce load significantly while serving requests much more quickly: the average time to first byte on a cache hit is less than 10 milliseconds. Our client side caching strategy has improved performance for users in remote locations as well.

API caching is a powerful strategy to alleviate a lot of the performance problems associated with scaling a service. While it won’t act as a silver bullet, caching can significantly reduce load on GET requests to your service while serving responses much faster than even the most optimized web server could.