Be Careful About What You Dislike

Over the last few months I have kept making the same observation in various different contexts: whenever you are confronted with a very strong opinion about a topic, reasonable discussions about it often involve arguments that have long become outdated or are no longer strictly relevant to the conversation.

What I mean by that is that, given a controversial topic, a valid argument for one side or the other keeps being repeated by a crowd of people who once heard it, even after that argument stops being valid. This happens because the general situation has often changed and the argument references a reality that no longer exists in the same form. Instead of reevaluating the environment, however, the goalposts are moved to restore the general sentiment of the opinion.

To give you a practical example of this problem I can go by a topic I have a very strong opinion about: Python 3. When Python 3 was not a huge thing yet, I started having conversations with people in the community about the problems I saw with splitting the community and the complexity of porting. Not just that, I also kept bringing up general questions about some of the text and byte decisions. I started doing talks about the topic and writing blog articles that kept being shared. Nowadays when I go to a conference I very quickly end up in conversations where other developers come to me and see me as the "does not like Python 3" guy. While I am still not a friend of some of the decisions in Python 3, I am very much aware that Python 3 in 2016 is a very different Python 3 than 6 years ago or earlier.

In fact, I myself campaigned for some changes to Python 3 that made better ports possible (like the reintroduction of the u prefix on Unicode string literals), and the bulk of my libraries have worked on Python 3 for many years now. It's a fact that in 2016 the problems people have with Python 3 are different from the ones they had before.

This leads to very interesting situations where I can be having a highly technical conversation about a very specific issue with Python 3 and thoughts about how to do it differently or deal with it (like some of the less obvious consequences of the new text storage model), and another person joins the conversation with an argument against Python 3 that has long stopped being valid. Why? Because there is a cost to porting to Python 3 and the benefit is not seen. This means that a person with a general negativity towards Python 3 will seek me out to reaffirm their opposition to porting to it.

The same thing is happening with JavaScript, where there is a general negative sentiment about programming in it, but not everybody has good arguments for it. There are some who actually program a lot in it and dislike specific things about the current state of the ecosystem but generally acknowledge that the language is evolving, and then there are those who take advantage of the unhappiness and bring their heavily outdated opposition to JavaScript into a conversation just to reaffirm their own opinion.

This is hardly confined to the programming world. I made the same discovery about CETA. CETA is a free trade agreement between the European Union and Canada and it had the misfortune of being negotiated at the same time as the more controversial TTIP with the US. The story goes roughly like this: TTIP was negotiated in secrecy (as all trade agreements are) and there were strong disagreements between what the EU and what the US thought trade should look like. Those differences were about food safety standards and other highly sensitive topics. Various organizations on both the left and right extremes of the political scale started to grab any remotely controversial information that leaked out to shift public opinion towards negativity to TTIP. Then the entire thing spiraled out of control: people not only railed against TTIP but took their opposition, looked for similar agreements and found CETA. Since both are trade agreements there is naturally a lot of common ground between them. The subtleties were quickly lost. Where the initial arguments against TTIP were food standards, public services and opaque ISDS courts, many of the critics failed to realize that CETA was fundamentally a different beast. Not only was it a much improved agreement from the start, but it kept being modified from the initial public version to the one that was finally sent to national parliaments.

However, contrary to what I would have expected, namely that critics would acknowledge that their criticism had been heard, the goalposts instead slowly moved. At this point there is so much emotion and misinformation in the general community that the goalposts have moved all the way to not supporting further free trade at all. In the general conversation about ISDS and standards, many people introduced their own opinions about free trade and their dislike of corporations and multinationals.

This, I assume, is human behavior. Admitting that you might be wrong is hard enough, but it's even harder when you had validation that you were right in the past. It is particularly hard to accept that an argument against something might no longer be valid because that something has changed in the meantime. I'm not sure what the solution to this is, but over the last few years I have definitely realized from my own behavior that one needs to be more careful about stating strong opinions in public. At the same time, I think we should all be more careful about dispelling misinformation in conversations, even if the general mood supports our opinion. As an example, while emotionally I like hearing stories about how JavaScript's packaging causes pain to developers, since I experienced it first hand, I know from a rational point of view that the ecosystem is improving at tremendous speed. Yes, I have been burned by npm, but the situation is improving tremendously.

Something that has been put to paper once is hard to remove from people's minds. In the technological context in particular, things move so fast that something you read once might no longer be up to date as little as six months later.

So I suppose my proposal to readers is not to fall into that trap and to assume that the environment around oneself keeps on changing.

Be Careful with Python's New-Style String Format

This should have been obvious to me for a long time, but until earlier today I did not really realize the severity of the issues caused by str.format on untrusted user input. It came up as a way to bypass the Jinja2 Sandbox in a way that would permit retrieving information you should not have access to, which is why I just pushed out a security release for it.

However I think the general issue is quite severe and needs to be discussed, because most people are most likely not aware of how easy it is to exploit.

The Core Issue

Starting with Python 2.6 a new format string syntax landed, inspired by .NET, which is also the same syntax supported by Rust and some other programming languages. It's available behind the .format() method on byte and unicode strings (on Python 3 just on unicode strings) and it's also mirrored in the more customizable string.Formatter API.

One of its features is that you can address both positional and keyword arguments of the string formatting and explicitly reorder items at all times. The bigger feature however is that you can access attributes and items of objects. The latter is what causes the problem here.

Essentially one can do things like the following:

>>> 'class of {0} is {0.__class__}'.format(42)
"class of 42 is <class 'int'>"

In essence: whoever controls the format string can access potentially internal attributes of objects.

Where does it Happen?

The first question is why anyone would be able to control the format string. There are a few places where it shows up:

  • untrusted translators on string files. This is a big one because many applications that are translated into multiple languages will use new-style Python string formatting and not everybody will vet all the strings that come in.
  • user exposed configuration. On some systems users might be permitted to configure some behavior and that might be exposed as format strings. In particular I have seen it where users can configure notification mails, log message formats or other basic templates in web applications (a sketch of such a pattern follows below).
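
To make the second case concrete, here is a hedged sketch of what such a vulnerable pattern can look like; the names are made up and only serve to illustrate the shape of the problem:

class User(object):
    def __init__(self, name):
        self.name = name

def render_notification(template, user):
    # If `template` comes from user-editable configuration, whoever writes
    # the template can reach through `user` into its attributes and beyond,
    # for instance with '{user.__class__.__init__.__globals__}'.
    return template.format(user=user)

print(render_notification('Hello {user.name}!', User('joe')))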

Levels of Danger

For as long as only C interpreter objects are passed to the format string you are somewhat safe, because the worst you can discover is some internal reprs, like the integer class in the example above.

However it becomes tricky once Python objects are passed in, because the amount of stuff that is exposed from Python functions is pretty crazy. Here is an example from a hypothetical web application setup that would leak the secret key:

CONFIG = {
    'SECRET_KEY': 'super secret key'
}

class Event(object):
    def __init__(self, id, level, message):
        self.id = id
        self.level = level
        self.message = message

def format_event(format_string, event):
    return format_string.format(event=event)

If the user can inject format_string here they could discover the secret string like this:

{event.__init__.__globals__[CONFIG][SECRET_KEY]}

Sandboxing Formatting

So what do you do if you do need to let someone else provide format strings? You can use the somewhat undocumented internals to change the behavior.

from string import Formatter
try:
    from collections.abc import Mapping  # Python 3.3+
except ImportError:
    from collections import Mapping  # Python 2

class MagicFormatMapping(Mapping):
    """This class implements a dummy wrapper to fix a bug in the Python
    standard library for string formatting.

    See http://bugs.python.org/issue13598 for information about why
    this is necessary.
    """

    def __init__(self, args, kwargs):
        self._args = args
        self._kwargs = kwargs
        self._last_index = 0

    def __getitem__(self, key):
        if key == '':
            idx = self._last_index
            self._last_index += 1
            try:
                return self._args[idx]
            except LookupError:
                pass
            key = str(idx)
        return self._kwargs[key]

    def __iter__(self):
        return iter(self._kwargs)

    def __len__(self):
        return len(self._kwargs)

# This is a necessary API but it's undocumented and moved around
# between Python releases
try:
    from _string import formatter_field_name_split
except ImportError:
    formatter_field_name_split = lambda \
        x: x._formatter_field_name_split()

class SafeFormatter(Formatter):

    def get_field(self, field_name, args, kwargs):
        first, rest = formatter_field_name_split(field_name)
        obj = self.get_value(first, args, kwargs)
        for is_attr, i in rest:
            if is_attr:
                obj = safe_getattr(obj, i)
            else:
                obj = obj[i]
        return obj, first

def safe_getattr(obj, attr):
    # Expand the logic here.  For instance on 2.x you will also need
    # to disallow func_globals, on 3.x you will also need to hide
    # things like cr_frame and others.  So ideally have a list of
    # objects that are entirely unsafe to access.
    if attr[:1] == '_':
        raise AttributeError(attr)
    return getattr(obj, attr)

def safe_format(_string, *args, **kwargs):
    formatter = SafeFormatter()
    kwargs = MagicFormatMapping(args, kwargs)
    return formatter.vformat(_string, args, kwargs)

Now you can use the safe_format method as a replacement for str.format:

>>> '{0.__class__}'.format(42)
"<type 'int'>"
>>> safe_format('{0.__class__}', 42)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: __class__

Diversity in Technology and Open Source

It's a topic I generally do not talk much about, but a recent Wired article made me think about finally writing down my thoughts on it. The title of that article was “Diversity in Open Source Is Even Worse Than in Tech Overall” and that is undoubtedly true.

When you start an Open Source project today, in particular one that is further disconnected from frontend technologies, there is a very high chance the organic community development will be everything but diverse. The highest form of diversity you can naturally expect to form is people from different countries, but even there you might have a bias.

There are many arguments that can be had about this, but it's my personal opinion that, at least in the longer run, it's not healthy for a project or a community to lack diversity. I think it's natural for like-minded people to group together, but the longer that process continues the more of an echo chamber it becomes. What's worse, the longer you wait to try to involve people who would not naturally try to join the project, the harder it will be. When your team is 4 men, the first woman who joins will make a significant impact. When your team is already 20 men you need to get a lot more women on board to have the same impact. But it's not just gender that makes a difference, it's in particular cultural backgrounds. The reason Unicode is hard is not because Unicode is hard, but because a lot of projects start out with a lack of urgency, since many of the original developers might live in ASCII constrained environments (it took emojis becoming popular for people in the western world to develop a general understanding of why Unicode is useful).

A lot of the criticism that comes against the diversity movement is that it undermines the idea of “meritocracy” and that it does not mirror the realities of the real world by artificially balancing teams. Both of those arguments are weird in a way, because they are very hard to defend if you look at larger parts of society. Tech, for recent historical reasons, is very male heavy, but society is not. Meritocracy in many ways is just sourcing the best from the pool of naturally available people in your environment. Sure, by some measurements you will get the best, but is the best really what is lacking in an Open Source project? We don't need more of the best, we need more of what is actually missing, and what is missing in many ways is not more strong alpha males but people who are good at de-escalating arguments in bug trackers and mailing lists, people who take care of documentation, people who make software work in new cultural contexts (localization, globalization, internationalization, etc.), people who care about user experience, etc.

If you look at Open Source projects in comparison with commercial software you can quickly see where this lack of diversity is most noticeable: consumer applications. While we're doing reasonably well with low level technology, that never translated well to things consumers care about. The most successful consumer products that came out of the Open Source community are probably things like The Gimp: a project that not only has a ridiculous name for a consumer product, but is also everything but user friendly. If you do a Google autocomplete search for “Why is Gimp” it completes to “slow [on mac]”, “bad”, “complicated” and “unintuitive”. In many ways I think the answer is probably a reflection of the developer community lacking focus in some key areas. There is no reason that Open Source software has to be user unfriendly.

In particular, some of the infamous Open Source communities, like the Linux one, are (almost?) proud of their harsh culture. Often the documentation is so bad that it has become a rite of passage to decipher it or to fill in the blanks by reading the code.

The only way to achieve the (in my mind) necessary change in Open Source and tech in general is to go out of one's way to involve people that do not come naturally.

So when someone cancels a conference because the speaker lineup after a blind selection was 100% male, it just shows how bad the imbalance in the developer community is. It's not unfair to try to artificially bring balance to a conference, because the fact that the community is imbalanced to that extent is a problem that needs fixing and will not fix itself naturally.

You can't Rust that

The last year has been fun because I could build a lot of really nice stuff for Sentry in Rust and for the first time the development experience was without bigger roadblocks. While we have used Rust before, it now feels different because the ecosystem is so much more stable and we ran into fewer language or tooling issues.

However, talking to people new to Rust (and even brainstorming APIs with coworkers) it's hard to get rid of the feeling that Rust can be a mind bending adventure and that the best way to have a stress free experience is knowing upfront what you cannot (or should not attempt to) do. Knowing that certain things just cannot be done helps put your mind back on the right track.

So here are some things not to do in Rust, and what to do instead, which I think should be better known.

Things Move

The biggest difference between Rust and C++ for me is the address-of operator (&). In C++ (like in C) that thing just returns the address of whatever it's applied to, and while the language might put some restrictions on when doing so is a good idea, there is generally nothing stopping you from taking the address of a value and then using it.

In Rust this is usually just not useful. First of all, the moment you take a reference in Rust, the borrow checker looms over your code and prevents you from doing anything stupid. More importantly, even if it's safe to take a reference, it's not nearly as useful as you might think. The reason for this is that objects in Rust generally move around.

Just take how objects are typically constructed in Rust:

struct Point {
    x: u32,
    y: u32,
}

impl Point {
    fn new(x: u32, y: u32) -> Point {
        Point { x, y }
    }
}

Here the new method (not taking self) is a static method on the implementation. It also returns Point by value. This is generally how values are constructed. Because of this, taking a reference in the function does not do anything useful, as the value is potentially moved to a new location when it is returned. This is very different from how this whole thing works in C++:

struct Point {
    uint32_t x;
    uint32_t y;
};

Point::Point(uint32_t x, uint32_t y) {
    this->x = x;
    this->y = y;
}

A constructor in C++ already operates on an allocated piece of memory. Before the constructor even runs, something has already provided the memory where this points to (typically either somewhere on the stack or through the new operator on the heap). This means that C++ code can generally assume that an instance does not move around. It's not uncommon for C++ code to do really stupid things with the this pointer as a result (like storing it in another object).

This difference might sound very minor but it's one of the most fundamental ones, and it has huge consequences for Rust programmers. In particular, it is one of the reasons you cannot have self referential structs. While there is talk about expressing types that cannot be moved in Rust, there is no reasonable workaround for this at the moment (the future direction is the pinning system from RFC 2349).

So what do we currently do instead? This depends a bit on the situation, but generally the answer is to replace pointers with some form of handle. Instead of storing an absolute pointer in a struct, one stores an offset to some reference value. Later, if the pointer is needed, it is calculated on demand.

For instance we use a pattern like this to work with memory mapped data:

use std::{marker, mem::{transmute, size_of}, slice, borrow::Cow};

#[repr(C)]
struct Slice<T> {
    offset: u32,
    len: u32,
    phantom: marker::PhantomData<T>,
}

#[repr(C)]
struct Header {
    targets: Slice<u32>,
}

pub struct Data<'a> {
    bytes: Cow<'a, [u8]>,
}

impl<'a> Data<'a> {
    pub fn new<B: Into<Cow<'a, [u8]>>>(bytes: B) -> Data<'a> {
        Data { bytes: bytes.into() }
    }
    pub fn get_target(&self, idx: usize) -> u32 {
        self.load_slice(&self.header().targets)[idx]
    }

    fn bytes(&self, start: usize, len: usize) -> *const u8 {
        self.bytes[start..start + len].as_ptr()
    }
    fn header(&self) -> &Header {
        unsafe { transmute(self.bytes(0, size_of::<Header>())) }
    }
    fn load_slice<T>(&self, s: &Slice<T>) -> &[T] {
        let size = size_of::<T>() * s.len as usize;
        let bytes = self.bytes(s.offset as usize, size);
        unsafe { slice::from_raw_parts(bytes as *const T, s.len as usize) }
    }
}

In this case Data<'a> only holds a copy-on-write reference to the backing byte storage (an owned Vec<u8> or a borrowed &[u8] slice). The byte slice starts with the bytes of the Header, and they are resolved on demand when header() is called. Likewise a single slice is resolved by the call to load_slice(), which takes a stored slice and looks it up by offsetting on demand.

To recap: instead of storing a pointer to an object itself, store some information so that you can calculate the pointer later. This is also commonly called using “handles”.

Refcounts are not Dirty

Another quite interesting case that is surprisingly easy to run into also has to do with the borrow checker. The borrow checker doesn't let you do stupid things with data you do not own, and sometimes that can feel like running into a wall because you think you know better. In many of those cases, however, the answer is just one Rc<T> away.

To make this less mysterious let's look at the following piece of C++ code:

thread_local struct {
    bool debug_mode;
} current_config;

int main() {
    current_config.debug_mode = true;
    if (current_config.debug_mode) {
        // do something
    }
}

This seems pretty innocent but it has a problem: nothing stops you from borrowing a field from current_config and then passing it somewhere else. This is why in Rust the direct equivalent of that looks significantly more complicated:

#[derive(Default)]
struct Config {
    pub debug_mode: bool,
}

thread_local! {
    static CURRENT_CONFIG: Config = Default::default();
}

fn main() {
    CURRENT_CONFIG.with(|config| {
        // here we can *immutably* work with config
        if config.debug_mode {
            // do something
        }
    });
}

This should make it immediately obvious that this API is not fun. First of all, the config is immutable. Secondly, we can only access the config object within the closure passed to the with function. Any attempt to borrow from this config object and have it outlive the closure will fail (probably with something like “cannot infer an appropriate lifetime”). There is no way around it!

This API is clearly objectively bad. Imagine we wanted to look up more of those thread local variables; it would get unwieldy quickly. So let's look at both of those issues separately. As hinted above, reference counting is generally a really nice solution to the underlying issue here: it's unclear who the owner is.

Let's imagine for a second that this config object just happens to be bound to the current thread but is not really owned by it. What happens if the config is passed to another thread and the current thread shuts down? This is a typical example where one can think of the config as logically having multiple owners. Since we might want to pass it from one thread to another, we want an atomically reference counted wrapper for our config: an Arc<Config>. This lets us increase the refcount in the with block and return the config. The refactored version looks like this:

use std::sync::Arc;

#[derive(Default)]
struct Config {
    pub debug_mode: bool,
}

impl Config {
    pub fn current() -> Arc<Config> {
        CURRENT_CONFIG.with(|c| c.clone())
    }
}

thread_local! {
    static CURRENT_CONFIG: Arc<Config> = Arc::new(Default::default());
}

fn main() {
    let config = Config::current();
    // here we can *immutably* work with config
    if config.debug_mode {
        // do something
    }
}

The change here is that the thread local now holds a reference counted config. As such we can introduce a function that returns an Arc<Config>. In the closure of the TLS we increment the refcount with the clone() method on the Arc<Config> and return it. Now any caller of Config::current gets that refcounted config and can hold on to it for as long as necessary. For as long as there is code holding the Arc, the config within it is kept alive, even if the originating thread died.

So how do we make it mutable like in the C++ version? We need something that provides us with interior mutability. There are two options for this. One is to wrap the Config in something like an RwLock. The second one is to have the Config use locking internally. For instance one might want to do this:

use std::sync::{Arc, RwLock};

#[derive(Default)]
struct ConfigInner {
    debug_mode: bool,
}

struct Config {
    inner: RwLock<ConfigInner>,
}

impl Config {
    pub fn new() -> Arc<Config> {
        Arc::new(Config { inner: RwLock::new(Default::default()) })
    }
    pub fn current() -> Arc<Config> {
        CURRENT_CONFIG.with(|c| c.clone())
    }
    pub fn debug_mode(&self) -> bool {
        self.inner.read().unwrap().debug_mode
    }
    pub fn set_debug_mode(&self, value: bool) {
        self.inner.write().unwrap().debug_mode = value;
    }
}

thread_local! {
    static CURRENT_CONFIG: Arc<Config> = Config::new();
}

fn main() {
    let config = Config::current();
    config.set_debug_mode(true);
    if config.debug_mode() {
        // do something
    }
}

If you do not need this type to work with threads you can also replace Arc with Rc and RwLock with RefCell.
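
A minimal single-threaded sketch of that substitution (my example, not from the original post): Rc takes the place of Arc and RefCell the place of RwLock, while the structure stays the same:

use std::cell::RefCell;
use std::rc::Rc;

#[derive(Default)]
struct ConfigInner {
    debug_mode: bool,
}

struct Config {
    inner: RefCell<ConfigInner>,
}

impl Config {
    pub fn new() -> Rc<Config> {
        Rc::new(Config { inner: RefCell::new(Default::default()) })
    }
    pub fn current() -> Rc<Config> {
        CURRENT_CONFIG.with(|c| c.clone())
    }
    pub fn debug_mode(&self) -> bool {
        self.inner.borrow().debug_mode
    }
    pub fn set_debug_mode(&self, value: bool) {
        // RefCell panics instead of blocking if a borrow is still active,
        // which is fine in single-threaded code.
        self.inner.borrow_mut().debug_mode = value;
    }
}

thread_local! {
    static CURRENT_CONFIG: Rc<Config> = Config::new();
}

fn main() {
    let config = Config::current();
    config.set_debug_mode(true);
    if config.debug_mode() {
        // do something
    }
}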

To recap: when you need to borrow data that outlives the lifetime of something you need refcounting. Don't be afraid of using `Arc` but be aware that this locks you to immutable data. Combine with interior mutability (like `RwLock`) to make the object mutable.

Kill all Setters

But the above pattern of effectively having Arc<RwLock<Config>> can be a bit problematic and swapping it for RwLock<Arc<Config>> can be significantly better.

Rust done well is a liberating experience, because if programmed well it's shockingly easy to parallelize your code after the fact. Rust encourages immutable data and that makes everything so much easier. However, in the previous example we just introduced interior mutability. Imagine we have multiple threads running, all referencing the same config, but one flips a flag. What happens to concurrently running code that is not expecting the flag to randomly flip? Because of that, interior mutability should be used carefully. Ideally an object, once created, does not change its state in such a way. In general I think such a type of setter should be an anti pattern.

So instead of doing this, what if we take a step back to where we were earlier, where configs were not mutable? What if we never mutate the config after we created it, but add an API to promote another config to be the current one? This means anyone who is currently holding on to a config can safely know that the values won't change.

use std::sync::{Arc, RwLock};

#[derive(Default)]
struct Config {
    pub debug_mode: bool,
}

impl Config {
    pub fn current() -> Arc<Config> {
        CURRENT_CONFIG.with(|c| c.read().unwrap().clone())
    }
    pub fn make_current(self) {
        CURRENT_CONFIG.with(|c| *c.write().unwrap() = Arc::new(self))
    }
}

thread_local! {
    static CURRENT_CONFIG: RwLock<Arc<Config>> = RwLock::new(Default::default());
}

fn main() {
    Config { debug_mode: true }.make_current();
    if Config::current().debug_mode {
        // do something
    }
}

Now configs are still initialized automatically by default but a new config can be set by constructing a Config object and calling make_current. That will move the config into an Arc and then bind it to the current thread. Callers to current() will get that Arc back and can then again do whatever they want.

Likewise you can again switch Arc for Rc and RwLock for RefCell if you do not need this to work with threads. If you are just working with thread locals you can also combine RefCell with Arc, as in the sketch below.
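
Here is a sketch of that last combination (again my own example, purely as an illustration): the thread local holds a RefCell<Arc<Config>>, which drops the lock entirely while readers still get a cheap Arc handle:

use std::cell::RefCell;
use std::sync::Arc;

#[derive(Default)]
struct Config {
    pub debug_mode: bool,
}

impl Config {
    pub fn current() -> Arc<Config> {
        // Cloning the Arc bumps the refcount; the RefCell borrow ends
        // before the closure returns.
        CURRENT_CONFIG.with(|c| c.borrow().clone())
    }
    pub fn make_current(self) {
        CURRENT_CONFIG.with(|c| *c.borrow_mut() = Arc::new(self));
    }
}

thread_local! {
    static CURRENT_CONFIG: RefCell<Arc<Config>> = RefCell::new(Default::default());
}

fn main() {
    Config { debug_mode: true }.make_current();
    if Config::current().debug_mode {
        // do something
    }
}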

To recap: instead of using interior mutability where an object changes its internal state, consider using a pattern where you promote new state to become current by putting an `Arc` into an `RwLock`; current consumers of the old state simply continue to hold on to it.

In Conclusion

Honestly, I wish I had learned the above three things earlier than I did, mostly because even if you know the patterns you might not necessarily know when to use them. So I guess the following mantra is now what I want to print out and hang somewhere:

  • Handles, not self referential pointers
  • Reference count your way out of lifetime / borrow checker hell
  • Consider promoting new state instead of interior mutability

Python

Guido van Rossum announced that he's stepping down as BDFL. It made me think. The Python programming language has left a profound impact on my life. It's my home, it gave me many of my friendships and acquaintances. It gave me my work, supplied me with many invaluable experiences and it even made me meet my now wife.

As most readers of this blog might know, I have had an ambivalent relationship with the language for a few years now. I learned a lot through Python, and one of the things I learned is which mistakes one can make in language and interpreter design. Since I know Python in and out, it's not hard for me to see all the things that did not go well. However, nothing is perfect. The things that might be ugly in the language or implementation also have some unexpected benefits. Python has a pretty weak story on package distribution and imports, yet at the same time this has made the Python community more cautious about API breakage. The simplistic nature of the interpreter has cultivated an environment of countless C extensions that expanded the Python community in ways few people would have expected.

Python is Guido van Rossum. While there have been many contributors over the years, it is without doubt his creation. You can go back to the earliest versions of the language and it still feels similar. The interpreter design is still the same, and so are the influences on the language. Python has achieved something that few languages have: it enables absolute beginners to start with a language that is fun to pick up and it stays relevant and useful into one's professional life.

In case you are reading this Guido: I cannot express enough how much I owe to you. For all the strong disagreements I had with some of your decisions over the years please do not forget that I always appreciated it.

Updated Thoughts on Trust Scaling

A few years back I wrote down my thoughts on the problem of micropackages and trust scaling. In the meantime the problem has only gotten worse. Unfortunately my favorite programming language, Rust, is also starting to suffer from dependency explosion and from how risky dependencies have become. Since I last wrote about this I have learned a few more things and I have some new ideas for how this could potentially be managed.

The Problem Summarized

Every dependency comes with a cost. It pulls in code and a license and it needs to be downloaded from somewhere. One of the things that has generally improved over the last few years is that package registries have become largely immutable. Once published, a package is there forever, and at the very least it cannot be replaced by different code. So if you depend on a precise version of a library you are no longer subject to the risk of someone putting something else in its place. We are, however, still dealing with having to download, compile and link the thing. The number and size of dependencies has been particularly frustrating for me in JavaScript, but it's also definitely a concern in Rust, where even the smallest app quickly has north of 100 dependencies.

Our symbolicator project written in Rust currently has 303 unique dependencies. Some of these are duplicates due to different versions being used. For instance we depend on rand 0.4 [1], rand 0.5, rand 0.6 and rand 0.7, and there are a few more cases like this. But even if we remove all of these duplicates we still have 280 unique package names involved.

Currently I'm in the situation that I can just pray that when I run cargo update the release is clean. There is no realistic way for me to audit this at all.

[1]One thing of note here is that rand is a bit special in that some older rand versions will depend on newer ones so that they use the same internals. This is a trick that is also used by the libc library in Rust. For the purpose of the number of dependencies this optimization however does not help much.

Why we have Dependencies

We use dependencies because they are useful in general. For instance symbolicator would not exist if it could not benefit from a huge amount of code written by other people, a lot of which we contribute to. This means the entire community benefits from this. Rust probably has some of the best DWARF and PDB libraries in existence now as a result of many different people contributing to the same cause. Those libraries in turn sit on top of very powerful binary reading and manipulation libraries, which are a good thing not to be reinvented all over the place.

A quite heated discussion [2] emerged on Twitter over the last few days about the danger and cost of dependencies among some Rust developers. One of the arguments brought up in support of dependencies was that software for non English speakers is mostly so terrible because people chose to reinvent the world instead of using third party libraries that handle things like localization and text input. I absolutely agree with this — some problems are just too large not to be put into a common dependency.

So clearly dependencies are something we do not want to get rid of. But we also need to live with the downsides they bring.

[2]The thread on twitter with various different view points on this issue can be found here: https://twitter.com/pcwalton/status/1155881388106821632

The Goal: Auditing

The number of dependencies and the automatic way in which people generally update them through semver minor releases introduce a lot of unchecked code changes. It's not realistic to think that everything can be reviewed, but compared to our Python code base we bump dependencies in Rust (and JavaScript) a lot more freely and with a lot less care, because that's what the ecosystem is optimized towards.

My current proposal to deal with this would be to establish a secondary system of auditors that you can pin groups of packages against. Such an auditor would audit new releases of packages, monitoring primarily for just one property: that what's on Github is what's in the package that made it to the registry.

Here is a practical example of how this could work: symbolicator currently has 18 tokio-* dependencies. Imagine all of these were audited by a "tokio auditor". An imaginary workflow could be something like having a registry of auditors and their packages stored on a registry (in this case crates.io). In addition to a lock file there would be an audit file (eg: Cargo.audit) which contains the list of all used auditors and the packages they are used for. Then whenever the dependency resolution algorithm runs it only accepts packages up to the latest audited version and skips over versions that were never audited.
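
To make the resolution rule concrete, here is a minimal sketch (purely an assumption on my part; no such mechanism exists in cargo today) of how a resolver could filter candidate versions against an auditor's list:

// Given the versions available on the registry (sorted oldest to newest) and
// the versions an auditor signed off on, pick the newest audited version and
// skip everything that was never audited.
fn latest_audited(candidates: &[&str], audited: &[&str]) -> Option<String> {
    candidates
        .iter()
        .copied()
        .filter(|v| audited.contains(v))
        .last()
        .map(|v| v.to_string())
}

fn main() {
    let candidates = ["0.1.20", "0.1.21", "0.1.22"]; // on the registry
    let audited = ["0.1.20", "0.1.21"];              // signed off by the auditor
    // 0.1.22 is skipped because it was never audited.
    assert_eq!(latest_audited(&candidates, &audited), Some("0.1.21".to_string()));
}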

This could reduce the total number of people one needs to trust tremendously. For instance all the tokio packages could be audited by one group. Now how is this different from the current de-facto world where all tokio packages are published by the same group of people anyway? The most immediate difference would be that just because a package starts with tokio- does not mean it comes from the tokio developers. Additionally one does not have to trust just this one group. For instance larger companies could run their own audits centrally for all packages they use, which could then be used across the organization.

What matters here is the user experience. Rust has an amazing packaging tool in cargo and what makes it so convenient are all the helpers around it. If we had an auditing tool that turns auditing our dependencies into an interactive process which shows us all the dependencies currently involved that are not audited, links us to the release on github, shows us the differences between the published cargo package and the source repository and more, I would feel a lot less worried about the dependency count.

Secondary Goal: Understanding Micro-Dependencies

That however is only half the solution in my book. The second half is the cognitive overhead of all those micro-dependencies. They come with an extra problem: every one of them carries a license, even if it is only a single line of code. If you want to distribute code to an end user you need to ship all those licenses, even though it's not quite clear if a function like left-pad even constitutes enough intellectual property to carry a license file.

I wonder if the better way to deal with those micro-dependencies is to call them out for what they are and add a separate category for them. It's quite uninformative to hear that one's application has 280 dependencies, because that does not say much if each of these dependencies can be a single line or a hundred thousand line behemoth. If instead we started breaking down our packages into categories at installation and audit time, this could help us understand our codebases better.

Ideally the audit and installation/compilation process can tell us how many packages are leaf packages, how many are below a certain line count, how many use unsafe in their own codebase and tag them appropriately. This could give us a better understanding of what we're dealing with and how to deal with updates.

Why do we update?

Overall, most of the reasons why I have updated dependencies in Python have been that a bug was fixed or a security issue was encountered. I never proactively upgraded packages. In Rust and JavaScript on the other hand, for some reason I started upgrading all the time. The biggest reason for this has been inter-package dependencies: without upgrading everything to the latest version one ends up dragging multiple versions of the same library around.

This is what worries me the most. We started to update dependencies because it's easy, not because it's a good idea. One should update dependencies but an update should have a cost.

For instance, for micro-dependencies I really do not want to install updates ever. The chance that there is a security vulnerability in isArray that is fixed in an update is vanishingly small. As such I would like to skip them entirely in updates unless a CVE is filed against them, in which case I probably do want to be notified about it.

On the other hand large and very important direct dependencies in my system (like frameworks) I probably do want to update regularly. The thought process here is that skipping versions typically makes it harder to upgrade later and security fixes will only go into some of the newer versions. Staying on old versions for too long has clear disadvantages.

Understanding best practices for reviewing and updating might be interesting to analyze and could help us write better tools to work with dependencies.

Hacking The Package Manager

One of the things that might be interesting to toy around with would be to make the dependency resolution process in package managers hookable. For instance it would be very interesting if cargo or yarn could shell out to a configured tool which takes the resolved dependencies from the registry and can blacklist some of them. That way separate tools could be developed that try various approaches to auditing dependencies without those having to become part of the core package manager until the community has decided on best practices (a sketch of what such a hook could look like follows below).

Theoretically one could do this entirely separately from the package manager by using third party tools to emit lock files, but considering how the main build chain overrides lock files when the source dependencies change, it might be too easy to get this wrong accidentally.

Such a hook for instance could already be used to automatically consult rustsec to blacklist package versions with security vulnerabilities.
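
To make the idea a bit more tangible, here is a hypothetical sketch of such an external hook tool. The protocol (candidate "name version" pairs on stdin, rejected candidates echoed on stdout) is entirely made up; nothing like it exists in cargo or yarn today:

use std::io::{self, BufRead};

// Made-up policy for illustration; a real tool would consult an audit list
// or an advisory database such as rustsec instead.
fn is_acceptable(name: &str, version: &str) -> bool {
    !(name.starts_with("tokio-") && version.ends_with("-alpha.1"))
}

fn main() {
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        let mut parts = line.split_whitespace();
        if let (Some(name), Some(version)) = (parts.next(), parts.next()) {
            if !is_acceptable(name, version) {
                // The package manager would drop every candidate we echo back.
                println!("{} {}", name, version);
            }
        }
    }
}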

Open Source, SaaS and Monetization

By the time you're reading this blog post, Sentry, which I have been working on for the last few years, will have undergone a license change. Making money with Open Source has always been a complex topic and over the years my own ideas of how this should be done have become less and less clear. The following text is an attempt to summarize my thoughts on it, and to clarify how we ended up picking the BSL license for Sentry.

Making Money with Open Source

My personal relationship with Open Source and monetization is pretty clear cut: I never wanted money to be involved in libraries, but I always encouraged people to monetize applications. This is also why I was always very liberal with my own choice of licenses (BSD, MIT, Apache) and encouraged others to do the same. Open Source libraries under permissive licenses help us all as developers.

I understand that there are many developers out there who are trying to monetize libraries, but I have no answer for that. Money and Open Source libraries are tricky territory to which I have nothing to add.

However, when it comes to monetizing Open Source applications I see many different approaches. One of them is what we did at Sentry: we Open Sourced our server and client libraries and monetized our SaaS installation. From where I stand this is a pretty optimal solution, because it allows developers to use the software on their own and contribute to it, but also allows you to monetize the value you provide through the SaaS installation. In the case of Sentry it has worked out very well for us and there is very little I would change about it.

But there is a catch …

The SaaS Problem

Obviously there is an issue with this, which is why we're playing around with changing the license. We love Open Source and continue to love it, but at some point someone has to make money somewhere, and that better be done in the clearest way possible. I don't want a company that runs on donations or has a business model that just happens to work by accident. For SaaS businesses there is always the risk of turning into a margin business. What stops someone from taking the Sentry code and competing with the sentry.io installation without investing any development effort into it?

This is not a new problem and many companies have faced it before. This is where a pragmatic solution is necessary.

The goal is to ensure that companies like Sentry can exist and produce Open Source code, but to prevent competition on their core business from their own forks.

Open Source — Eventually

Open Source is pretty clear cut: it does not discriminate. If you get the source, you can do with it what you want (within the terms of the license) and no matter who you are (within the terms of the license). However as Open Source is defined — and also how I see it — Open Source comes with no strings attached. The moment we restrict what you can do with it — like not compete — it becomes something else.

The license of choice is the BSL. We looked at many options, and the idea of putting a form of natural delay into our releases looked the most appealing. The BSL does that. It ensures that as time passes everything we have becomes Open Source again, but until that point it's almost Open Source, just with strings attached. This means that for as long as we innovate there is some natural disadvantage for someone competing with the core product, while still ensuring that our product stays around and healthy in the Open Source space.

If enough time passes everything becomes available again under the Apache 2 license.

This ensures that no matter what happens to Sentry the company or product, it will always be there for the Open Source community. Worst case, it just requires some time.

I'm personally really happy with the BSL. I cannot guarantee that no better ideas will come around after a few years, but of all the options I have seen this is the one I feel most satisfied with and can stand behind.

Money and Libraries

The situation is much more complex with libraries, frameworks and everything like that. The BSL would not solve anything there; it would cause a lot of friction with reusing code. For instance, if someone wanted to pull reusable code out of Sentry they would have to wait for the license conversion to kick in, find an older version that is already open source, or reach out to us to get a snippet converted earlier. All of this would be a problem for libraries.

At Sentry we very purposefully selected what falls under the license. For instance, we chose not to apply the BSL to components where we believe that pulling efforts together is particularly important. Our native symbolication libraries and the underlying service (symbolicator) will not get the BSL, because we want to encourage others to contribute to them and bundle efforts. Symbolicator, like symbolic, is a component that is very similar to a library. They are not products by themselves. I could not monetize Flask, Jinja or anything like that this way, and I have absolutely no desire to do so.

At the same time, I cannot count how many mails I got over the years from people asking why I don't monetize my stuff, or from people asking how they should go about monetizing their code.

I do not have an answer.

I feel like there is no answer to this. I remember too many cases of people who tried dual licensing their code and ended up regretting it after ownership was transferred or they had a falling out with other partners.

I do however want to continue evaluating whether there are ways libraries can be monetized. For now the best I have is the suggestion to build more Open Source companies with an Open Source (maybe BSL licensed) product and to encourage true open source contributions to underlying libraries that become popular. Open Source companies dedicating some of their revenue to help libraries is a good thing from where I stand. We should do more of that.

I would however love to hear how others feel about money and Open Source. Reach out to me in person, by mail, twitter or whatever else.

Open Source Migrates With Emotional Distress

Legacy code is bad, and if you keep using it, it's really your own fault. There are many variations of this sentiment floating around Open Source communities, and it always comes down to the same thing: at some point something is declared old and it has to be replaced by something newer which is better. That better thing typically has some really good arguments on its side: we learned from our mistakes, it was wrong to begin with, or something along the lines of it being impure or propagating bad ideas. Maybe that new thing only supports the newest TLS/SSL and you really should no longer be using the old versions because they are insecure.

Some communities as a whole, for instance, suffer from this a whole lot. Every few years a library or the entire ecosystem of such a community is thrown away and replaced by something new, and support for the old one ends abruptly and arbitrarily. This has happened to the packaging ecosystem, the interpreter itself, modules in the standard library, etc. How well this works out varies. Zope for instance never really recovered from its Zope 2 / Zope 3 split. Perl didn't manage its 5 / 6 split either. Both of those projects ended up with two communities as a result.

Many open source communities behave exactly the same way: they are replacing something with something else without a clear migration path. However some communities manage to survive some transitions like this.

This largely works because the way open source communities manage migrations is by cheating, and the currency of payment is emotional distress. Since typically money is not involved (at least not in the sense that a user would pay for the product directly), there is no obvious monetary impact of people not migrating. So if you cause friction in the migration process it won't hurt you as a library maintainer. If anything, the churn of some users might actually be better in the long run, because the ones that don't migrate are likely also some of the most annoying ones in the issue tracker. In fact, Open Source ecosystems manage these migrations largely by trading their general clout for the support of a large part of their user base, who become proponents of a migration to the updated ecosystem. Open Source projects nowadays often measure their popularity through package download counts, Github stars or other indicators. All of these generally trend upwards, and it takes a really long time for projects to lose traction, because all users count towards those numbers, even the ones that are frustratedly migrating off.

The cheat is to convince the community as a whole that the migration is very much worth it. However, the under-delivery on what was promised then sets the community up for another one of these experiences later. I have seen how GTK migrated from 1 to 2 and then later to 3. At every point it was painful, and when most apps were finally on the same version, the next big breaking change was coming up.

Since the migration causes a lot of emotional distress, the cheat is carried happily by the entire community. The big Python 3 migration is a good example of this: a lot of users of the language started a community effort to force participants in the ecosystem to migrate. Suffering together does not feel as bad, and putting yourself on the morally right side (the one that migrates versus the ones that are holding off) helps even more. That Python 3 effort was based less on reasonable arguments than on emotions. While the core of the argument was correct and a lot of stuff was better in Python 3, it took many iterations not to regress in many other aspects. Yet websites were started, like a big "wall of shame", for libraries that had not yet undergone the migration. The community is very good at pushing through even the most controversial of changes. This tour de force then became something of a defining characteristic of the community.

A big reason why this all happens in the first place is that as an Open Source maintainer the standard response, which works against almost all forms of criticism, is “I'm not paid for this and I no longer want to maintain the old version of X”. And in fact this is a pretty good argument, because it's both true and because very few projects are actually large enough that a fork by some third party would survive. Python for instance currently has a fork of 2.7 called Tauthon which got very little traction.

There are projects which clearly manage such forceful transitions, but I think what is often forgotten is that with such a transition many people leave the community because they do not want to participate in it or can't. Very often a backwards incompatible replacement without a clear migration path might be able to guide the majority of people, but it will lose many on the fringes, and those people might have been a worthwhile investment in the future. For a start, such a reckless deprecation path will likely alienate commercial users. That might be fine for a project (since many are non profit efforts in the first place), and very successful projects will likely still retain a lot of commercial users, but with that user base reduced there will be reduced investment from them too.

I honestly believe a lot of Open Source projects would have an easier time existing if they acknowledged that these painful migrations are painful for everybody involved. Writing a new version that fixes all known issues might be fun for a developer, but if they then need to spend their mental and emotional capacity on convincing their user base that migrating is worth the effort, it takes all the enjoyment out of the process. I have been a part of the Python 3 migration and I can tell you that it sucked all my enjoyment out of being a part of that community. No matter which side you were on during that migration, I heard very little positive about the experience.

Setting up good migration paths rewards you, and there are many projects to learn from on how to manage this. It's lovely as a user to be able to upgrade to a new version of a project and have the upgrade be smooth. Not only that, it also encourages me as a user to give back valuable contributions, because there is a high chance that I can keep using them without having to be afraid that upgrading is going to break all my stuff.

It's also important to realize that many projects outside the Open Source world just do not have the luxury of breaking backwards compatibility this easily. Especially when you work in an environment where hundreds of systems have to be interoperable, migrations are really hard and you sometimes have to make decisions which seem bad. The open source community was much quicker in dropping support for older TLS standards than many others because it did not really have to live with the consequences of that change: it can force everybody to upgrade. That's just not always possible for everybody else at the speeds envisioned.

I'm writing this because we're a few days away from the end of life of Python 2, at which point the community is also going to stop maintaining a lot of valuable tools like pytest, pip [1] and others for Python 2. Yet the user base of the language has only migrated to around 50%. My own libraries, which are now maintained by the pallets community, are joining in on this, something I can understand but don't agree with. I really wish the Python community all the best, but I hope that someone does a post-mortem on all of this, because there are lots of things to be learned from it.

[1]it has correctly been pointed out that pip is not deprecating Python 2 support any time soon.

I'm not feeling the async pressure

Async is all the rage. Async Python, async Rust, go, node, .NET: pick your favorite ecosystem and it will have some async going. How well this async business works depends quite a lot on the ecosystem and the runtime of the language, but overall it has some nice benefits. It makes one thing really simple: awaiting an operation that can take some time to finish. It makes it so simple that it creates innumerable new ways to blow one's foot off. The one that I want to discuss is the one where you don't realize you're blowing your foot off until the system starts overloading, and that's the topic of back pressure management. A related term in protocol design is flow control.

What's Back Pressure

There are many explanations for back pressure and a great one is Backpressure explained — the resisted flow of data through software which I recommend reading. So instead of going into detail about what back pressure is I just want to give a very short definition and explanation for it: back pressure is resistance that opposes the flow of data through a system. Back pressure sounds quite negative — who does not imagine a bathtub overflowing due to a clogged pipe — but it's here to save your day.

The setup we're dealing with here is more or less the same in all cases: we have a system composed of different components forming a pipeline, and that pipeline has to accept a certain number of incoming messages.

You could imagine this like you would model luggage delivery at airports. Luggage arrives, gets sorted, loaded into the aircraft and finally unloaded. At any point an individual piece of luggage is thrown together with other luggage into containers for transportation. When a container is full it will need to be picked up. When no containers are left, that's a natural example of back pressure. Now the person who wants to throw luggage into a container can't because there is no container. A decision has to be made. One option is to wait: that's often referred to as queueing or buffering. The other option is to throw away some luggage until a container arrives: this is called dropping. That sounds bad, but we will get into why this is sometimes important later. However, there is another thing that plays in here. Imagine the person tasked with putting luggage into a container does not receive a container for an extended period of time (say a week). If they did not end up throwing luggage away they will eventually have an awful lot of luggage standing around. Eventually the amount of luggage they have to sort through will be so enormous that they run out of physical space to store it. At that point they are better off telling the airport not to accept any more incoming luggage until their container issue is resolved. This is commonly referred to as flow control and is a crucial aspect of networking.
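
Translated into code, the same two choices show up directly. Here is a minimal asyncio sketch of my own (not from the post): a bounded queue where the producer waits for room, which is exactly back pressure, while a lossy variant would drop instead:

import asyncio

async def producer(queue):
    for item in range(10):
        # The await is the back pressure: if the queue is full the producer
        # suspends here until the consumer made room.  A dropping variant
        # would use queue.put_nowait() and swallow asyncio.QueueFull instead.
        await queue.put(item)

async def consumer(queue):
    while True:
        item = await queue.get()
        await asyncio.sleep(0.1)  # pretend that processing takes time
        queue.task_done()

async def main():
    queue = asyncio.Queue(maxsize=2)  # the bound is what creates back pressure
    worker = asyncio.create_task(consumer(queue))
    await producer(queue)
    await queue.join()  # wait until everything was processed
    worker.cancel()

asyncio.run(main())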

All these processing pipelines are normally scaled for a certain number of messages (or in this case pieces of luggage) per time period. If that number is exceeded, or worse, if the pipeline stalls, terrible things can happen. An example of this in the real world was the opening of London Heathrow Terminal 5, where 42,000 bags failed to be routed correctly over 10 days because the IT infrastructure did not work correctly. They had to cancel more than 500 flights, and for a while airlines chose to permit carry-on luggage only.

Back Pressure is Important

What we learn from the Heathrow disaster is that being able to communicate back pressure is crucial. In real life, as well as in computing, time is always finite. Eventually someone gives up waiting on something. In particular, even if something would internally wait forever, externally it wouldn't.

A real life example of this: if your bag is supposed to go via London Heathrow to your destination in Paris, but you will only be there for 7 days, then it is completely pointless for your luggage to arrive with a 10 day delay. In fact you want your luggage to be re-routed back to your home airport.

It's in fact better to admit defeat — that you're overloaded — than to pretend that you're operational and keep buffering up forever, because at some point it will only make matters worse.

So why is back pressure suddenly a topic to discuss when we wrote thread based software for years and it did not seem to come up? A combination of many factors, some of which are just how easy it is to shoot yourself in the foot.

Bad Defaults

To understand why back pressure matters in async code I want to give you a seemingly simple piece of code with Python's asyncio that showcases a handful of situations where we accidentally forgot about back pressure:

from asyncio import start_server, run

async def on_client_connected(reader, writer):
    # Echo every line back to the client until it disconnects.
    while True:
        data = await reader.readline()
        if not data:
            break
        writer.write(data)

async def server():
    # Spawns a task running on_client_connected for every connection.
    srv = await start_server(on_client_connected, '127.0.0.1', 8888)
    async with srv:
        await srv.serve_forever()

run(server())

If you are new to the concept of async/await just imagine that at any point where await is called, the function suspends until the expression resolves. Here the start_server function that is provided by Python's asyncio system runs a hidden accept loop. It listens on a socket and spawns an independent task running the on_client_connected function for each socket that connects.

Now this looks pretty straightforward. You could remove all the await and async keywords and you end up with code that looks very similar to how you would write code with threads.

However that hides one very crucial issue, which is the root of all our problems here: function calls that do not have an await in front of them. In threaded code any function can yield. In async code only async functions can. This means for instance that the writer.write method cannot block. So how does this work? It will try to write the data right into the operating system's socket buffer, which is non-blocking. However, what happens if the buffer is full and the socket would block? In the threading case we could just block here, which would be ideal because it means we're applying some back pressure. However, because there are no threads here, we can't do that. So we're left with buffering or dropping data. Because dropping data would be pretty terrible, Python instead chooses to buffer. Now what happens if someone sends a lot of data in but does not read? Well, in that case the buffer will grow and grow and grow. This API deficiency is why the Python documentation says not to use write on its own but to follow up with drain:

writer.write(data)
await writer.drain()

Drain will drain some of the excess from the buffer. It will not cause the entire buffer to flush out, but just enough to prevent things from running out of control. So why is write not doing an implicit drain? Well, it's a massive API oversight and I'm not exactly sure how it happened.
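
One way to make forgetting the drain harder is to wrap the pair in a small helper; this is just a sketch of mine, not an asyncio API:

async def send(writer, data):
    # write() only buffers; drain() is what lets the reader's pace
    # slow us down, so the two calls always belong together.
    writer.write(data)
    await writer.drain()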

An important point here is that most sockets are based on TCP, and TCP has built-in flow control. A writer will only write as fast as the reader is willing to accept (give or take some buffering). This is hidden from you entirely as a developer, because not even the BSD socket libraries expose this implicit flow control handling.

So did we fix our back pressure issue here? Well, let's see how this whole thing would look in a threading world. In a threading world our code most likely would have had a fixed number of threads running, and the accept loop would have waited for a thread to become available to take over the request. In our async example however we now have an unbounded number of connections we're willing to handle. This means we're willing to accept a very high number of connections even if it means that the system would potentially overload. In this very simple example this is probably less of an issue, but imagine what would happen if we were to do some database access.
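
To recover some of that bound, one rough option (a sketch of mine, with 100 picked arbitrarily) is to gate the handler from the earlier example with a semaphore:

from asyncio import Semaphore

# At most 100 connections are served concurrently, roughly
# emulating a fixed-size pool of worker threads.
connection_slots = Semaphore(100)

async def on_client_connected(reader, writer):
    async with connection_slots:
        while True:
            data = await reader.readline()
            if not data:
                break
            writer.write(data)
            await writer.drain()

Note that the excess connections now simply queue up on the semaphore, which is exactly the kind of waiting the next section digs into.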

Picture a database connection pool that will give out up to 50 connections. What good is it to accept 10000 connections when most of them will bottleneck on that connection pool?

Waiting vs Waiting to Wait

So this finally leads me to where I wanted to go in the first place. In most async systems, and definitely in most of what I encountered in Python, even if you fix all the socket level buffering behavior you end up in a world where you chain a bunch of async functions together with no regard for back pressure.

If we take our database connection pool example, let's say there are only 50 connections available. This means at most we can have 50 concurrent database sessions for our code. So let's say we want to let four times as many requests be processed, because we expect that a lot of what the application does is independent of the database. One way to go about it would be to make a semaphore with 200 tokens and to acquire one at the beginning. If we're out of tokens we would wait for the semaphore to release a token.

But hold on. Now we're back to queueing! We're just queueing a bit earlier. If we were to severely overload the system now we would queue all the way at the beginning. So now everybody would wait for the maximum amount of time they are willing to wait and then give up. Worse: the server might still process these requests for a while until it realizes the client has disappeared and is no longer interested in the response.

So instead of waiting straight away we would want some feedback. Imagine you're in a post office and you are drawing a ticket from a machine that tells you when it's your turn. This ticket gives you a pretty good indication of how long you will have to wait. If the waiting time is too long you can decide to abandon your ticket and head out to try again later. Note that the waiting time you have until it's your turn at the post office is independent of the waiting time you have for your request (for instance because someone needs to fetch your parcel, check documents and collect a signature).

So here is the naive version where we can only notice we're waiting:

from asyncio import Semaphore

semaphore = Semaphore(200)

async def handle_request(request):
    # Take one of the 200 tokens; if none are left we wait right here.
    await semaphore.acquire()
    try:
        return generate_response(request)
    finally:
        semaphore.release()

As the caller of the handle_request async function we can only see that we're waiting and that nothing is happening. We can't tell whether we're waiting because we're overloaded or because generating the response just takes that long. We're basically buffering endlessly here until the server finally runs out of memory and crashes.

The reason for this is that we have no communication channel for back pressure. So how would we go about fixing this? One option is to add a layer of indirection. Unfortunately asyncio's semaphore is of no use here, because it only lets us wait. But let's imagine we could ask the semaphore how many tokens are left; then we could do something like this:

from hypothetical_asyncio.sync import Semaphore, Service

semaphore = Semaphore(200)

class RequestHandlerService(Service):
    async def handle(self, request):
        await semaphore.acquire()
        try:
            return generate_response(request)
        finally:
            semaphore.release()

    @property
    def is_ready(self):
        return semaphore.tokens_available()

Now we have changed the system somewhat. We now have a RequestHandlerService which carries a bit more information. In particular it has the concept of readiness: the service can be asked if it's ready. That operation is non-blocking and only a best estimate. It has to be, because readiness is inherently racy.

The caller would now turn from this:

response = await handle_request(request)

Into this:

request_handler = RequestHandlerService()
if not request_handler.is_ready:
    response = Response(status_code=503)
else:
    response = await request_handler.handle(request)

There are multiple ways to skin this cat, but the idea is the same: before we actually commit ourselves to doing something, we have a way to figure out how likely it is that we're going to succeed, and if we're overloaded we communicate this upwards.

Now the definition of the service is not something I came up with. The design comes from Rust's tower and Rust's actix-service, which both have a very similar definition of the service trait.

Now there is still a chance to pile up on the semaphore because of how racy this is. You can either take that risk or still fail when handle is invoked.

A library that has solved this better than asyncio is trio, which exposes the internal counter on the semaphore and provides a CapacityLimiter, a semaphore optimized for the purpose of capacity limiting that protects against some common pitfalls.
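
As a rough sketch of mine (not taken from the trio documentation), a fail-fast variant of the handler above could use the limiter's non-blocking acquire; generate_response and Response are the same placeholders as before, and 200 matches the earlier semaphore:

import trio

limiter = trio.CapacityLimiter(200)

async def handle_request(request):
    try:
        # Grab a token without waiting; refuse the request if none are left.
        limiter.acquire_nowait()
    except trio.WouldBlock:
        return Response(status_code=503)
    try:
        return generate_response(request)
    finally:
        limiter.release()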

Streams and Protocols

Now the example above covers RPC-style situations: for every call we can be informed well ahead of time if the system is overloaded. A lot of these protocols have pretty straightforward ways to communicate that the server is under load. In HTTP for instance you can emit a 503, which can also carry a Retry-After header that tells the client when it's a good idea to retry. This retry adds a natural point to re-evaluate whether what you want to retry is still the same request, or if something changed. For instance, if you can't retry within 15 seconds, maybe it's better to surface this inability to the user instead of showing an endless loading icon.
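
With the hypothetical Response object from the earlier examples, signalling this from the readiness check might look roughly like the following; the headers keyword is an assumption of mine, only the Retry-After header itself is standard HTTP:

request_handler = RequestHandlerService()
if not request_handler.is_ready:
    # Tell the client we're overloaded and when a retry makes sense.
    response = Response(status_code=503, headers={"Retry-After": "15"})
else:
    response = await request_handler.handle(request)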

However request/response style protocols are not the only ones. A lot of protocols keep persistent connections open and let you stream a lot of data through. Traditionally many of these protocols were based on TCP, which as mentioned earlier has built-in flow control. This flow control is however not really exposed through socket libraries, which is why high level protocols typically need to add their own flow control on top. In HTTP/2 for instance a custom flow control protocol exists, because HTTP/2 multiplexes multiple independent streams over a single TCP connection.

Coming from a TCP background, where flow control is managed silently behind the scenes, can send a developer down a dangerous path where one just reads bytes from a socket and assumes this is all there is to know. The TCP API is misleading here, because flow control is, from an API perspective, completely hidden from the user. When you design your own streaming based protocol you absolutely need to make sure that there is a bidirectional communication channel and that the sender is not just sending, but also reading to see if it is allowed to continue.

With streams the concerns are typically different. A lot of streams are just streams of bytes or data frames, and you can't just drop packets in between. Worse: it's often not easy for a sender to check whether it should slow down. In HTTP/2 you need to constantly interleave reads and writes at the user level. You absolutely must handle flow control there: the server will send you WINDOW_UPDATE frames (while you are writing) when you're allowed to continue writing.

This means that streaming code becomes a lot more complex, because you first need to write yourself a framework that can act on incoming flow control information. The hyper-h2 Python library for instance has a surprisingly complex file upload server example with flow control based on curio, and that example is not even complete.
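
To give a feel for what that interleaving looks like, here is a heavily simplified sketch of mine using a blocking socket and hyper-h2's connection object; the real-world versions are async, handle many more events, and the helper name and the 65536 read size are arbitrary choices:

import socket
import h2.connection

def send_with_flow_control(conn: h2.connection.H2Connection,
                           sock: socket.socket,
                           stream_id: int,
                           data: bytes) -> None:
    # Send data on one stream, pausing whenever the peer's window is empty.
    while data:
        window = conn.local_flow_control_window(stream_id)
        if window == 0:
            # Blocked by flow control: read from the peer until it sends
            # frames (typically a WINDOW_UPDATE) that reopen the window.
            payload = sock.recv(65536)
            if not payload:
                raise ConnectionError("peer went away while we were blocked")
            conn.receive_data(payload)
            # Flush any automatic replies h2 queued up (e.g. PING acks).
            sock.sendall(conn.data_to_send())
            continue
        chunk = min(window, len(data), conn.max_outbound_frame_size)
        conn.send_data(stream_id, data[:chunk])
        sock.sendall(conn.data_to_send())
        data = data[chunk:]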

New Footguns

async/await is great, but it encourages writing stuff that will behave catastrophically when overloaded. On the one hand because it's just so easy to queue, but also because making a function async after the fact is an API breakage. I can only assume this is why Python still has a non-awaitable write function on the stream writer.

The biggest reason though is that async/await lets you write code many people wouldn't have written with threads in the first place. That's a good thing, I think, because it lowers the barrier to actually writing larger systems. The downside is that it also means many more developers who previously had little experience with distributed systems now have many of the problems of a distributed system, even if they only write a single program. As an example, HTTP/2 is complex enough, due to its multiplexing nature, that the only reasonable way to implement it is based on async/await.

And it's not just async/await code that suffers from these issues. Dask for instance is a parallelism library for Python used by data science programmers, and despite not using async/await there are bug reports of the system running out of memory due to the lack of back pressure. These issues are rather fundamental.

The lack of back pressure however is a type of footgun that has the size of a bazooka. If you realize too late that you built a monster, it will be almost impossible to fix without major changes to the code base, because you might have forgotten to make some functions async that should have been. A different programming environment does not help here either: people run into the same issues in all programming environments, including the latest additions like Go and Rust. It's not uncommon to find open issues about “handle flow control” or “handle back pressure”, even on very popular projects, that have been open for a lengthy period of time, because it turns out that it's really hard to add after the fact. For instance Go has an open issue from 2014 about adding a semaphore to all filesystem IO because it can overload the host. aiohttp has an issue dating back to 2016 about clients being able to break the server due to insufficient back pressure. There are many, many more examples.

If you look at the Python hyper-h2 docs there is a shocking number of examples that say something along the lines of “does not handle flow control” or “It does not obey HTTP/2 flow control, which is a flaw, but it is otherwise functional”. I believe the fact that flow control is very complex once it shows up on the surface, and that it's easy to just pretend it's not an issue, is why we're in this mess in the first place. Flow control also adds significant overhead and doesn't look good in benchmarks.

So, developers of async libraries, here is a new year's resolution for you: give back pressure and flow control the importance they deserve in documentation and APIs.

App Assisted Contact Tracing(5 days, 2 hours ago)

I don't know how I thought the world would look 10 years ago, but a pandemic that prevents us from going outside was not what I was picturing. It's been about three weeks now that my family and I have been staying at home in Austria instead of going to work or taking the kids to daycare, and two of those weeks were under mandatory social distancing because of SARS-CoV-2.

And as cute as social distancing and “flattening the curve” sound at first, the consequences for our daily lives are beyond anything I could have imagined would happen in my lifetime.

What is still conveniently forgotten is that the curve really only stays flat if we keep this up for a very, very long time. And quite frankly, I'm not sure for how long our society will be able to do this. Even just closing restaurants is costing tens of thousands of jobs, and closing schools is going to set back the lives of many children growing up. Many people are currently separated from their loved ones with no easy way to get to them, because international travel has ground to a halt.

Technology to the Rescue

So to cut a very long story short: with the help of technology we can get away without social distancing. This is why: the most efficient way to fight the outbreak of a pandemic is isolating cases. If you can catch them before they can infect others, you can starve the virus. The issue with this is obviously that people run around with the virus who can infect others but are not yet symptomatic. So we can only do the next best thing: if we can find all the people they had contact with once they finally become symptomatic, we can narrow down the search radius for tests.

So a very successful approach could be:

  1. find a covid-19 suspect
  2. test the person
  3. when they are positive, test all of their close contacts

So how do we find those contacts? The tool of choice in many countries already is an app. Apps send out a beacon signal and collect the beacon signals of other users around them. When someone tests positive, healthcare services can notify their contacts.

Avoiding Orwell

Now this is where it gets interesting. Let's take Austria, where I live, for instance. We have around 9 million residents here. Let's assume we're aiming for 60% of residents using that app, which is roughly 5.4 million people. That sounds like a surveillance state and a scalability nightmare for a country not exactly known for building scalable apps.

But let's think for a moment what is actually necessary to achieve our goal: it turns out we could largely achieve what we want without a centralized infrastructure.

Let's set the window of people we care about to something like 5 days. This means that if someone tests positive, that person's contacts of the last 5 days ideally get informed about a covid case they had contact with. How do we design such a system so that it's not a privacy invading behemoth?

Upon installation the app would roll a random ID and store it. It then encrypts the ID it just created with the public key of a central governmental authority and broadcasts it to other people around via Bluetooth. It cycles this ID at regular intervals.

When another device (say, the one of the person who later turns out to be infected) sees this ID, it measures signal strength and the time observed. When enough time was spent with the other person and that contact was “close enough”, it records the broadcast (the encrypted ID) on the device. The device also deletes records older than 5 days.

When a person is identified as infected, they need to export the contacts from their app and send them to the health ministry. The ministry can use its private key to decrypt the IDs and then get in touch with the potential contacts.
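
To make the shape of this concrete, here is a minimal sketch using the cryptography package and assuming RSA key pairs for the authority; the function names, the tuple layout and the 16-byte ID size are illustrative assumptions of mine, not a protocol specification:

import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def roll_id():
    # A fresh random identifier, cycled at regular intervals.
    return os.urandom(16)

def make_broadcast(current_id, authority_public_key):
    # Only the health authority can recover the ID from the broadcast.
    return authority_public_key.encrypt(current_id, OAEP)

def record_contact(store, broadcast, signal_strength, seen_at):
    # The observing phone only keeps the opaque blob plus metadata;
    # it never learns the other person's actual ID.
    store.append((broadcast, signal_strength, seen_at))

def decrypt_contacts(store, authority_private_key):
    # Done by the health ministry after an infected user uploads their log.
    return [authority_private_key.decrypt(broadcast, OAEP)
            for broadcast, _strength, _seen_at in store]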

How do they get in touch with those contacts? One option involves a system like a push notification service. That would require the device to register its unique ID with a central server and a push notification channel, but this would not reveal much.

Another option could be to do the check manually, which would also work for non-connected, IoT-type solutions. You could implement such a system as a token you regularly bring to a place to check if you are now considered a contact person. For instance one could deploy check-in stations at public transport hubs where you hold your token against a reader, and if one of your contacts was infected it would beep.

Either way the central authority would not know who you are; your only point of contact would be when you become a covid case. Most importantly, this system could be built in a way that makes it completely useless for tracking people but still useful for contact tracing.

The Phone in your Pocket

I had conversations with a lot of people over the last few days about contact tracing apps, and I noticed, particularly from technically minded people, an aversion to the idea of contact tracing via apps. This does not surprise me, because it's an emotional topic. However it does hammer home the point that people are very good at misjudging data privacy.

Almost every person I know uses Google Maps on their phone with location history enabled. With that, they also participate in a large data collection project where their location is constantly being transmitted to Google. Google uses this information to judge how fluid traffic is on the road, how many people are at stores, how busy public transit is, and so on. All that data is highly valuable and people love to use it. I know I do. I'm also apparently entirely okay with that, even though I know there is an associated risk.

The Future

My point here is a simple one: contact tracing, if done well, is significantly less privacy infringing than many things tech companies already do and that we're okay with.

I also believe that contact tracing via apps or hardware tokens is our best chance to return to a largely normal life without giving up all our civil liberties. I really hope that we're going to have informed and reasonable technical discussions about how to do contact tracing right and give this a fair chance.