A Tour of the Python Collections Library: Part 1

One of my favorite things about working with Python on a daily basis is how robust the standard library is. In particular, Python ships with a huge range of prebuilt data types that can significantly improve productivity, code quality and complexity management. Many of my favorites live in the collections module. Today, I want to present three of them: deques, namedtuples and defaultdicts.

Deques

Deques serve as great go-to data structures for either stacks or queues (though it’s worth noting that queue.Queue is built for multithreaded use and implements its own locking, making it the better choice for those applications). Because a deque is implemented as a doubly-linked list, its first and last elements can be accessed in constant time.

I’m a strong believer that code examples are the best way to communicate about code ideas, so let’s build a deque here:

from collections import deque
deq = deque()

# Append to end of queue
deq.append("potato")
deq.append("other potato")
deq.append("last potato")

# Get first and last value
first = deq.popleft() # "potato"
last = deq.pop() # "last potato"

# Add to front of queue
deq.appendleft("new potato")

# Create a new deque with maxlen
short_deque = deque(maxlen=5)
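
Once a deque with maxlen is full, appending to one end silently discards an element from the other, which makes it handy for keeping only the most recent values. A quick illustration:

# Once full, each append drops the oldest element from the opposite end
for i in range(7):
    short_deque.append(i)

# short_deque is now deque([2, 3, 4, 5, 6], maxlen=5)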

While the standard library docs are very detailed and well-organized, they tend not to cover example use cases. Deques are useful any time you need to read data in a first in, first out (FIFO) or last in, first out (LIFO) manner. For instance, a deque can hold a list of undo operations, as in the sketch below.
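
Here’s a minimal sketch of that idea (the operation names are just placeholders):

# A simple undo history: push each action, pop to undo the most recent one
undo_history = deque()
undo_history.append("insert 'hello'")
undo_history.append("delete word")

last_action = undo_history.pop()  # "delete word" is undone first (LIFO)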

Namedtuples

Tuples are great for constant-time lookups, destructuring return values and holding immutable data, but they introduce some cognitive overhead through their obscurity. For instance, consider this tuple:

data = ('Falafel Gyro', 17, False, 191)
data[0]
> 'Falafel Gyro'

While it’s clear what the values here ARE, it’s certainly not clear what they represent. Namedtuples elegantly solve this problem:

from collections import namedtuple

MenuItem = namedtuple('MenuItem', ['name', 'price', 'gluten_free', 'calories'])

falafel = MenuItem('Falafel Gyro', 17, False, 191)

falafel.name
> 'Falafel Gyro'

falafel.calories
> 191
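
And because a namedtuple is still a tuple, positional access and unpacking keep working exactly as before:

# Namedtuples still behave like regular tuples
falafel[1]
> 17

name, price, gluten_free, calories = falafel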

Without changing any of the basic functionality, we have improved our code tremendously. Raymond Hettinger, inventor of the namedtuple and one of the best Python teachers out there, provides other great example use cases in this excellent video.

If you know of a good falafel sandwich with less than 200 calories, please email me immediately.

DefaultDicts

We’re accustomed to dicts raising a KeyError when we access a nonexistent key, e.g.:


In [1]: potato_tracker = {'home': 55, 'work': 12}

In [2]: potato_tracker['home']
Out[2]: 55

In [3]: potato_tracker['garden']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-86a01747b0ab> in <module>()
----> 1 potato_tracker['garden']

KeyError: 'garden'

This exception serves us well in a lot of cases; it ensures that we actually have the data we’re trying to access. In my potato inventory system, however, I can safely assume that anywhere without an entry simply has zero potatoes. It would make sense, then, to just return a zero value in those cases.

 
In [1]: from collections import defaultdict

In [2]: default_tracker = defaultdict(int)

In [3]: default_tracker['home'] = 55

In [4]: default_tracker['work'] = 12

In [5]: default_tracker['home']
Out[5]: 55


In [6]: default_tracker['Dublin']
Out[6]: 0
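
This zero default also makes defaultdict a natural fit for counting things. Here’s a small sketch that tallies potato deliveries without any membership checks (the delivery data is invented, of course):

from collections import defaultdict

deliveries = ['home', 'work', 'home', 'garden', 'home']

counts = defaultdict(int)
for location in deliveries:
    counts[location] += 1  # missing keys start at 0, so no KeyError

# counts now holds {'home': 3, 'work': 1, 'garden': 1}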

That’s a good overview of the basics of these data types. Soon, I’ll be discussing the rest of the collections module. In the meantime, thanks for stopping by!

Handling Simple Messaging with Redis

There is no shortage of message brokers and queueing solutions available to software engineers. Full-featured enterprise solutions like RabbitMQ offer developers tremendous flexibility with which to solve their messaging needs. While lots of use cases require these advanced features, a solution like RabbitMQ, even without a support license, comes with significant costs. Running a RabbitMQ server adds operational overhead, and requires developers to fully understand the feature set available to them in order to make good engineering decisions.

Proprietary cloud solutions like AWS SQS also come with their own tradeoffs. SQS offers an excellent feature set, and reduces operational overhead by delegating service administration to Amazon. However, making use of these proprietary solutions introduces vendor lock-in; it will dramatically increase the complexity of migrating your system to a different cloud provider or a private cloud solution.

For simple message passing to one or more recipients, Redis is an excellent message broker. Let’s take a closer look at three possible use cases to illustrate this.

Simple Message Queueing

Consider the following situation: Service A, a public-facing API server, needs to send messages to Service B, which integrates certain API data with a suite of third-party business tools. Order is important here; we want to ensure that all customer data is as correct and up to date as possible. To meet this need, we can make use of an intuitive Redis data type: the list.

Lists are ordered data types containing string elements. We can treat a list as a queue with two commands: RPUSH and LPOP. RPUSH adds a new element to the end of the list; LPOP removes the first element from the list and returns it. To pass messages, have Service A RPUSH each message onto the list and have Service B LPOP messages off of it.
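
On the producer side, Service A only needs to push each message onto a shared list. Here’s a minimal sketch using the redis-py client; the queue name and payload are made up for illustration:

import json
import redis  # assumes the redis-py client is installed

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Service A pushes each message onto the end of the shared list
message = json.dumps({"customer_id": 42, "action": "update_email"})
client.rpush("service_a_events", message)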

Lots of open source solutions already exist for polling Redis lists; if you favor off-the-shelf tools (as you probably should), RQ is a very mature project. That said, the underlying implementation is fairly simple: in Python, we can build a generator that polls the queue for new data indefinitely, as in the sketch below.
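
Here’s a rough sketch of that polling generator, again using redis-py; the queue name and handler are hypothetical:

import time
import redis  # assumes the redis-py client is installed

def poll_queue(queue_name, delay=0.5):
    """Yield messages from a Redis list indefinitely."""
    client = redis.Redis(host="localhost", port=6379, decode_responses=True)
    while True:
        message = client.lpop(queue_name)
        if message is not None:
            yield message
        else:
            time.sleep(delay)  # back off briefly when the list is empty

# Service B then consumes messages as they arrive:
# for message in poll_queue("service_a_events"):
#     sync_with_business_tools(message)  # hypothetical handler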

PubSub for Messaging Multiple Recipients

Let’s complicate this picture a little bit. Now, in addition to passing messages to Service B, Service A must pass identical messages to Service C, which transforms the data and sends it to a data warehouse for further processing and analysis. Here, we encounter a limitation of using lists to pass messages across services: once a service pops a message off the list, it’s gone, so each message can reach only one consumer.

One potential solution is to use two separate lists, but this introduces additional complexity and doesn’t scale well. Fortunately, Redis supports the PubSub messaging paradigm, which allows multiple subscribers to listen on a given channel. Simply have each service subscribe to whichever channel(s) are relevant to it, and use the PUBLISH command to distribute each message to every subscriber.
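
Here’s a minimal sketch of both sides using redis-py (the channel name is made up); in practice the publisher and each subscriber would run in separate services:

import redis  # assumes the redis-py client is installed

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Services B and C each subscribe to the channel they care about
pubsub = client.pubsub()
pubsub.subscribe("customer_events")

# Service A publishes once; every current subscriber receives a copy
client.publish("customer_events", "customer 42 updated their email")

# Each subscriber reads messages off its own subscription
for message in pubsub.listen():
    if message["type"] == "message":
        print(message["data"])
        break  # a real service would keep listening instead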

PubSub is a well-established and powerful messaging protocol, but it’s important to remember that these messages won’t persist after distribution to subscribers. If that’s a requirement, you can check out Redis’ new stream data type, which I’ll cover in detail in a later post.

Other Benefits and Downsides

It can often be a nontrivial task to get two services communicating locally, especially if you’re working outside of a containerized and orchestrated environment. Redis is simple to run locally and doesn’t require any special configuration to get working. Redis is also very fast. While performance isn’t critical for the use cases we outlined, more performance-critical messaging cases do exist. For these cases, using Redis might be a good choice.

That said, there are definitely use cases in which Redis is the inferior choice. Redis doesn’t allow for delayed message processing, and its single-threaded nature makes it a poor fit if you need massively concurrent message processing.

For simple message passing without incurring a lot of operational overhead, consider Redis. It’s performant, easy to use and FOSS. It can reduce your operational overhead without incurring vendor lock-in.