A Simple Buffering Algorithm

I worked on this over 2 days of a company hackathon. This post is almost verbatim ripped from my notes. The implementation is available here and as a gist.


If we have a producer process (`P`) which sends messages to a consumer (`C`) over an unreliable link, there will occasionally be outage periods where the link between `P` and `C` fails and `P` must either 1) drop messages, 2) buffer messages locally, or 3) stop generating messages.

There can be good reasons for a system to choose **(1)** or **(3)**, but I'm most interested in better ways to implement **(2)**. I noticed that a few popular metrics-streaming clients offered relatively limited options for how to buffer messages. For example, both DataDog's Vector and Fluent only support:

For some applications, recovering roughly evenly spaced data may be preferred to recovering messages from the beginning or end of the outage. I wanted to see if I could write a buffer that uses fixed memory and keeps a random, uniform sample of messages during an outage of arbitrary length.

π„π±πšπ¦π©π₯𝐞:\textbf{Example}: If my buffer is configured to store up to 64 messages, and an outage lasts ≀64\leq 64 messages, every message will be preserved and it acts in the same way as an unbounded buffer. If my buffer is presented with a 1024 message long outage, it preserves 64 random messages with an expected gap of 16 messages between retained messages.

Because there's no way for the client to know in advance how long an outage will last, we cannot simply sample every *n*th message. To sample approximately uniformly, I pulled out an old algorithm called **reservoir sampling**.
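The core idea can be sketched in a few lines of Python (this is the simple variant, Algorithm R; the class and method names here are my own, not from the actual implementation):

```python
import random

class Reservoir:
    """Fixed-capacity buffer holding a uniform random sample of all
    messages offered to it (reservoir sampling, Algorithm R)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total messages offered so far

    def offer(self, msg):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(msg)  # buffer not yet full: keep everything
        else:
            # Keep msg with probability capacity/seen by evicting a
            # uniformly chosen resident; inductively, every message
            # offered so far is equally likely to be in the buffer.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = msg
```

With `capacity=64`, an outage of 64 messages or fewer keeps everything, and a 1024-message outage leaves 64 messages sampled uniformly from the whole outage — matching the example above.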

**N.B.** The Wikipedia page on reservoir sampling mentions a few algorithms, including **Algorithm R** and **Algorithm L**. Both produce a uniform sample; I implemented L, but it's probably overkill unless the stream is exceptionally high volume. Algorithm L minimizes calls to `random()` when outages are long and the buffer is small, while R must call `random()` on each message seen. At low volumes this doesn't make much of a difference, and Algorithm R is much easier to implement and debug. Furthermore, because I take locks to make the buffer safe for concurrent access, I probably wash away any performance improvement we'd get from L anyway.
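For reference, a streaming-form sketch of Algorithm L (again, all names are my own). Instead of drawing a random number per message, it draws one skip count per replacement, so long outages with a small buffer touch the RNG far less often:

```python
import math
import random

class ReservoirL:
    """Uniform reservoir sample via Algorithm L: computes how many
    messages to skip before the next replacement instead of rolling
    a die for every message."""

    def __init__(self, capacity):
        self.k = capacity
        self.items = []
        self.seen = 0
        # w tracks the running acceptance threshold; skip lengths are
        # derived from it (see the Wikipedia pseudocode / Li, 1994).
        self.w = math.exp(math.log(random.random()) / capacity)
        self.next_keep = capacity + self._skip()

    def _skip(self):
        return math.floor(math.log(random.random()) / math.log(1.0 - self.w)) + 1

    def offer(self, msg):
        self.seen += 1
        if len(self.items) < self.k:
            self.items.append(msg)
        elif self.seen == self.next_keep:
            # Replace a uniformly chosen resident, then schedule the
            # next replacement after a freshly drawn skip.
            self.items[random.randrange(self.k)] = msg
            self.w *= math.exp(math.log(random.random()) / self.k)
            self.next_keep += self._skip()
```

Between replacements, `offer` does nothing but an increment and a comparison, which is where the savings over R come from.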

I was able to get this working locally, but was not able to implement it in `$PROD_SYSTEM` in time because of `$MISC_REASONS`.

During the local test, I had a client send messages every 5 ms and had the server reject messages for 1 s intervals every 3 s. Observations in black were collected in order during the normal operating periods; the blue observations (notice that these periods are sparser!) represent values backfilled from the buffer after the "outages" cleared.
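That setup can be reproduced as a small self-contained simulation (using a simulated clock rather than real sleeps; the constants and variable names here are mine, not from the actual test harness):

```python
import random

CAPACITY = 64     # reservoir size
TICK_MS = 5       # client produces one message every 5 ms
CYCLE_MS = 3000   # outage pattern repeats every 3 s
OUTAGE_MS = 1000  # link is down for the first 1 s of each cycle

delivered = []    # "black": sent in order while the link is up
backfilled = []   # "blue": flushed from the reservoir after each outage

reservoir, seen = [], 0
for t in range(0, 30_000, TICK_MS):  # simulate 30 s of traffic
    msg = t
    if t % CYCLE_MS < OUTAGE_MS:
        # Link down: feed the message to the reservoir (Algorithm R).
        seen += 1
        if len(reservoir) < CAPACITY:
            reservoir.append(msg)
        else:
            j = random.randrange(seen)
            if j < CAPACITY:
                reservoir[j] = msg
    else:
        # Link up: first backfill anything buffered, then send normally.
        if reservoir:
            backfilled.extend(reservoir)
            reservoir, seen = [], 0
        delivered.append(msg)

# Each 1 s outage sees 200 messages but retains only 64, which is why
# the backfilled periods look sparser than the live ones.
```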

There are a few places where this should be tidied up: