In my previous post, as part of some Python refactoring, I found myself implementing a WriteProtectedDict – a wrapper around a Python dict, using a write_lock to make updates atomic. This triggered some warning bells.
Most people, when they speak of Python, are actually speaking about CPython (I know I am). And current CPython implementations include a quirky component called the Global Interpreter Lock (GIL).
Since the GIL is not part of the language specification, it’s generally unwise to rely on its existence. That said, it comes with some useful side effects, and, in practice, programmers ignore this bit of advice and rely on the side effects more often than not.
For example, in a situation like the one I was faced with:
- a shared dictionary
- multiple, concurrent, threads accessing the dictionary
- guaranteed single writes to each dictionary key
- multiple concurrent reads from the dictionary
The GIL guarantees that, even though a race condition exists – multiple threads are writing to the same dictionary – each individual dictionary update is actually an atomic operation within the same process.
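As a rough illustration of that guarantee, here’s a minimal sketch (the thread count and key scheme are made up for the example): fifty threads write to a shared dictionary without any lock, each to its own key, and under CPython no update is lost.

```python
import threading

shared = {}  # no lock protecting updates

def writer(i):
    # a single write to a key no other thread touches; under the GIL,
    # the item assignment itself is atomic
    shared[i] = "message %d" % i

threads = [threading.Thread(target=writer, args=(i,)) for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(shared) == 50  # every write made it through
```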
So, while the write_lock is required when strictly adhering to the Python language specification, in practice it’s superfluous. Why, then, was I seeing behaviour, in my application, which made it look like the write operation was not atomic? The answer, if you’ve been reading the source, sits in one line of code:
for published_at in sorted(self.messages)
Turns out that the code I was testing against, when I wrote the tutorial, was not sorting self.messages. Without the sort, the message update loop always returns the first new message it finds – effectively ignoring messages stored in non-chronological order. As a side effect, all the longpolling chat clients will be missing messages from their chat history (in my basic tests, the same messages, to boot) – behaviour consistent with data being lost due to a race condition.
(I updated the original post to correct my error, leaving the related locking exercise intact)
So, what’s the appropriate course of action here? There are two schools of thought:
- Since the write_lock isn’t strictly necessary, remove it. The GIL’s behaviour is confusing enough, and an unexpected lock can make the related code harder to read. Veteran Python programmers may even find the lock offensive. The first rule of Python is: one does not talk about the GIL.
- Optimizations which rely on implicit assumptions about the interpreter you’re running on are incredibly obfuscated, and should generally be avoided. Implement the lock, and explicitly optimize it away on systems with an appropriate GIL implementation.
For now, I went with the first option, and removed the write lock related code; I think that most Python programmers would probably do the same.
I reserve the right to change my mind, if I can figure out a practical implementation of option 2 (there’s no easy way I know of to detect whether a GIL exists).
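The closest thing to option 2 I can imagine is a crude proxy based on the interpreter’s name – folklore about specific implementations, not a real GIL detector (`probably_has_gil` is a made-up helper name):

```python
import platform

def probably_has_gil():
    # CPython and PyPy use a GIL; Jython and IronPython do not.
    # This is an assumption about implementations, not a language
    # guarantee, and a free-threaded CPython build would break it too.
    return platform.python_implementation() in ("CPython", "PyPy")
```

Even with such a check, toggling locks based on it would make the code harder, not easier, to reason about – which is part of why I went with option 1.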
Some more discussion, in case what happened isn’t quite clear:
There’s definitely a race condition there – multiple chat clients might all be sending messages at roughly the same time. Without a write lock, they might interrupt each other, and not all of the updates would make it through, causing data loss. The GIL already locks the data on write, though, as a side-effect, guaranteeing that none of it is lost.
However, since the updates are concurrent, there’s no guarantee that messages are stored in the shared messages dictionary in the order they are received in.
Imagine, for example, that two clients are sending a message at the same time. Each will be handled by a separate twisted thread, and both threads will try to execute the following line of code:
self.messages[float(time.time())] = args['new_message'][0]
In a single threaded system, this is what the Python interpreter will do, in order:
- Step A: get the current time – float(time.time())
- Step B: store the new_message in the dictionary, at the index retrieved in Step A.
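The two steps are visible in the bytecode: the single assignment compiles to several separate operations, and the interpreter is free to switch threads between them (the `messages` and `new_message` names below are stand-ins for the real attributes):

```python
import dis

# Disassemble a stand-alone version of the assignment.
ops = [instr.opname for instr in dis.get_instructions(
    "messages[float(time.time())] = new_message")]

# Step A (the time.time() call) and Step B (the dictionary store) are
# distinct instructions, with room for a thread switch in between.
assert any("CALL" in op for op in ops)  # the function calls
assert "STORE_SUBSCR" in ops            # the dictionary store
```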
When there are two (or more) threads, each handling a different message, there’s no telling what order they’ll be going through this part of the code in. They might follow each other, like so:
Thread 1: Step A - get timestamp1
Thread 1: Step B - store message1 at timestamp1
Thread 2: Step A - get timestamp2
Thread 2: Step B - store message2 at timestamp2
Or go in reversed order:
Thread 2: Step A - get timestamp1
Thread 2: Step B - store message2 at timestamp1
Thread 1: Step A - get timestamp2
Thread 1: Step B - store message1 at timestamp2
leading to a dictionary which might look like this:
{ timestamp1:message1, timestamp2:message2 }
or like this:
{ timestamp1:message2, timestamp2:message1 }
Or, the threads might execute in a mixed up order like so:
Thread 1: Step A - get timestamp1
Thread 2: Step A - get timestamp2
Thread 2: Step B - store message2 at timestamp2
Thread 1: Step B - store message1 at timestamp1
leading to a dictionary which looks like this:
{ timestamp2:message2, timestamp1:message1 }
Here timestamp1 < timestamp2, so message2 is seen by the message update loop and sent out before message1. Chat clients are then updated to think they have all the messages up to timestamp2, and never request to be updated with message1.
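This failure mode can be reproduced without any threads at all, by storing two messages out of order and running a simplified stand-in for the update loop (`first_new_message` and the timestamps are illustrative, not code from the tutorial; reproducing the storage order this way relies on modern CPython’s insertion-ordered dictionaries):

```python
messages = {}
messages[2.0] = "message2"  # thread 2's write landed first
messages[1.0] = "message1"  # thread 1's write landed second, earlier timestamp

def first_new_message(keys, last_seen):
    # Return the first (timestamp, message) pair newer than last_seen,
    # scanning the keys in the order given.
    for published_at in keys:
        if published_at > last_seen:
            return published_at, messages[published_at]
    return None

# Unsorted iteration returns message2 first; the client advances its
# last_seen to 2.0, and message1 is never delivered.
seen_at, msg = first_new_message(list(messages), last_seen=0.0)
assert msg == "message2"
assert first_new_message(list(messages), last_seen=seen_at) is None

# Sorted iteration delivers message1 first, so nothing is lost.
seen_at, msg = first_new_message(sorted(messages), last_seen=0.0)
assert msg == "message1"
```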
The lack of a sort there is more than just a concurrency-related problem – dictionary data structures explicitly do not guarantee that elements will be iterated over in sorted key order – keys can, over time, change location within the data structure. Even if the messages are inserted in order by their received time, an iteration over the dictionary can return them out of order (dictionary implementations rely on this freedom for various optimizations).
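A two-line demonstration of the gap between iteration order and key order (under modern CPython, where a dict iterates in insertion order; older versions promised even less):

```python
messages = {2.5: "later", 1.5: "earlier"}  # inserted out of chronological order

assert list(messages) == [2.5, 1.5]    # iteration follows storage order
assert sorted(messages) == [1.5, 2.5]  # chronological order requires sorted()
```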
A friend recommended reading Python in Practice, to help me be more Pythonic when I approach these kinds of problems.