This post was written by Will from Bump’s server team.
We’ve been using Redis at Bump for a number of things since we launched our 2.0 platform last July. After reading Francesco’s recent post about Redis I was inspired to enumerate the many different roles Redis plays here. At the end, I’ve also included a few best practices we’ve learned along the way.
Message Passing
We originally looked into using AMQP for communicating between multiple servers and processes, but were frustrated by its complexity: we’re not a bank, nor do we want the overhead required for complex transactional messages. Enter Redis lists. The combination of lpush and brpop creates simple queuing and message-passing behavior. BSON-encoding the data we send over the wire provides a simple translation between Python dictionaries and a reasonably efficient wire format. (Note: I use “queue” and “list” here to describe the user-level functionality being provided; to Redis, they are all just lists.)
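A minimal sketch of the pattern, using the redis-py client. I’ve used JSON here as a stand-in for the BSON encoding described above, and the queue and function names are illustrative, not our actual code:

```python
import json


def push_message(client, queue, message):
    """Producer: serialize a dict and LPUSH it onto the queue."""
    client.lpush(queue, json.dumps(message))


def pop_message(client, queue, timeout=0):
    """Consumer: BRPOP blocks until a message arrives, then we decode it.

    redis-py returns a (queue_name, payload) tuple, or None if the
    timeout expires with nothing to pop.
    """
    popped = client.brpop(queue, timeout)
    if popped is None:
        return None
    _, payload = popped
    return json.loads(payload)
```

Because lpush appends at the head and brpop pops from the tail, messages come out in FIFO order, and any number of producers and consumers can share one queue.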
Adding an RPC mechanism simply requires sending a header that names a callback queue to listen on (a uuid4, or anything else suitably collision-free, works as the queue name) along with the original message. The sender calls lpush, followed by brpop on the callback queue, and the receiver, whenever it’s ready to reply, lpushes its response onto the callback queue. Once the reply is popped, nothing is left in the callback queue, so Redis deletes the empty list for you.
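The RPC dance above might look like the following sketch (again redis-py style, with hypothetical names; a real caller would rely on brpop blocking until the worker replies):

```python
import json
import uuid


def rpc_call(client, request_queue, message, timeout=0):
    """Caller: attach a unique callback queue, send, then block for the reply."""
    callback_queue = "callback:%s" % uuid.uuid4()  # collision-free reply address
    message = dict(message, reply_to=callback_queue)
    client.lpush(request_queue, json.dumps(message))
    # brpop blocks until the worker lpushes a reply onto our callback queue
    _, payload = client.brpop(callback_queue, timeout)
    return json.loads(payload)


def rpc_serve_one(client, request_queue, handler, timeout=0):
    """Worker: pop one request, compute a reply, LPUSH it to the callback queue."""
    _, payload = client.brpop(request_queue, timeout)
    request = json.loads(payload)
    reply = handler(request)
    client.lpush(request["reply_to"], json.dumps(reply))
```

Nothing here ever deletes the callback queue explicitly; popping its last element is enough, since Redis removes empty lists automatically.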
Logging
We run a number of different processes on a number of different servers to keep Bump up and running. Instead of trying to keep track of text logs spread across multiple machines, we just push any log line we’re interested in onto a single Redis queue. Then one process pops from that queue and writes those lines to disk. If that process or the logging server goes down, no big deal: log lines just queue up in Redis until the logging service is back online.
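The same push/pop pattern, specialized for logging, might be sketched like this (key and function names are illustrative):

```python
import json

LOG_QUEUE = "log:lines"  # hypothetical key name shared by all processes


def log_event(client, line, **fields):
    """Any process, on any machine: push a log line onto the shared queue."""
    client.lpush(LOG_QUEUE, json.dumps(dict(fields, line=line)))


def drain_logs(client, out):
    """The single log-writer process: pop queued lines and write them out.

    If this process dies, lines simply accumulate in Redis until it returns.
    """
    while True:
        popped = client.brpop(LOG_QUEUE, timeout=1)
        if popped is None:
            break  # queue drained (a real writer would keep blocking)
        entry = json.loads(popped[1])
        out.write(entry["line"] + "\n")
```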
Social Graphs
We recently released the ability to make Bump friends without actually bumping, and with that came the problem of constructing and storing a social graph of our users. Already knowing Redis well (and knowing it was fast), we turned to Redis sets. Want to know what friends I have in common with Jamie? We keep all of Jamie’s friends and all of my friends in separate sets in Redis, so the friends we have in common are just a set intersection, a function Redis already provides on sets.
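A sketch of the friend-set idea, assuming one set per user keyed by a hypothetical `friends:<user>` naming scheme:

```python
def add_friendship(client, user_a, user_b):
    """Friendship is symmetric: add each user to the other's friend set."""
    client.sadd("friends:%s" % user_a, user_b)
    client.sadd("friends:%s" % user_b, user_a)


def mutual_friends(client, user_a, user_b):
    """Friends in common are just SINTER of the two users' sets."""
    return client.sinter("friends:%s" % user_a, "friends:%s" % user_b)
```

The intersection happens server-side in Redis, so the members never have to be shipped to the application just to be compared.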
Asset Caching
Like most companies these days, we’ve gotten out of the storage business, leaving the heavy lifting to Amazon. Unfortunately, the latency to S3 wasn’t as low as we wanted for our most active assets. At this point I imagine you’ve guessed the trend, but Redis has a solution for this too: maxmemory with LRU eviction. Redis 2.2 (just recently released) can be run in a mode where you specify the maximum amount of memory Redis may use, along with the record-eviction policy. When an asset gets set up through Bump we put it into the Redis cache and eagerly store it to S3 as well. A fetch simply checks Redis; if the asset isn’t there, we repopulate the cache from S3 and try the fetch again. LRU mode takes care of evicting items that haven’t been used in a while. Perfect.
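This is the classic cache-aside pattern. A minimal sketch, assuming a redis-py-style cache client and a hypothetical `s3` object with `get`/`put` methods standing in for a real S3 client (with `maxmemory` and an LRU `maxmemory-policy` set in redis.conf, eviction is Redis’s problem, not ours):

```python
def store_asset(cache, s3, asset_id, data):
    """On upload: write eagerly to both durable S3 storage and the Redis cache."""
    s3.put(asset_id, data)
    cache.set(asset_id, data)


def fetch_asset(cache, s3, asset_id):
    """Cache-aside read: check Redis first; on a miss, repopulate from S3.

    We never delete cache entries ourselves; the LRU eviction policy
    quietly drops assets that haven't been fetched in a while.
    """
    data = cache.get(asset_id)
    if data is None:
        data = s3.get(asset_id)    # slow path: fall back to S3
        cache.set(asset_id, data)  # repopulate so the next fetch is fast
    return data
```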
We have ended up using Redis for a host of other smaller things as we continue to grow. Although this wasn’t an explicit decision, Redis provides enough basic data types and functionality to solve most problems, and you’d be hard pressed to find something that can do it faster.
That said, we’ve run into a few issues along the way. None of these problems dampens my enthusiasm for Redis, but they are worth being aware of, especially if you are planning on using Redis for latency-sensitive operations.
Persistence
This is something that every database has trouble with, and Redis is no exception. Antirez is hard at work on a new disk-based back end for Redis that looks promising, but for current production deployments there are two options: periodic background saves and writing to an append-only file (AOF). Since periodic background saves can result in significant data loss if the server crashes, that wasn’t a great option, so we settled on the append-only file. This works without a hitch until you need to compact the AOF. Due to the behavior of fsync and the fact that Redis uses a single event loop, at some point in the process of compacting the AOF the main event loop can become blocked on disk I/O. Since we use Redis for time-sensitive things like message passing, we can’t afford anything that hangs the entire server. Unfortunately, there’s no silver bullet here: you just can’t persist an active, latency-sensitive production instance of Redis. We can’t live without persistence, so we have only the slaves run in AOF persistence mode. The momentary hiccup on a slave just leads to a minor backup of the replication stream; not ideal, but certainly not a terrible solution to a hard problem.
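The arrangement above can be expressed with a few standard redis.conf directives; this is a sketch of the idea, not our actual configuration:

```conf
# On the master: no persistence at all, so the event loop
# never blocks on disk I/O during an AOF rewrite.
save ""
appendonly no

# On each slave: AOF persistence; an fsync hiccup here only
# delays replication, not live traffic on the master.
appendonly yes
appendfsync everysec
```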
Running with Mongo
Very early in our deployment we had Redis and Mongo deployed on the same machine. The whole story will have to wait for a post on Mongo, but the takeaway is definitely that Mongo should always run on a server by itself. That said, it’s worth taking a minute to think about the implications of Redis’s single-threaded model. If, for example, malloc takes a few hundred ms to return (the problem we were seeing on the instance shared with Mongo), or if something else is doing heavy disk I/O and writing a log line ends up being disk-bound, the entire server stops responding. Depending on your use case this may or may not be acceptable, but it’s something to keep in mind.
If you’re looking at a new problem and it involves a data type that Redis supports (and really, what problems can’t be solved with lists, sets, and hashes?), I strongly recommend giving Redis a long look. It’s become the Swiss Army knife here at Bump, except that it opens the bottle of wine faster than anything else, instead of leaving half the cork in there.
Want to work with Will and the rest of the Bump team? We are hiring: http://bu.mp/jobs.