
Using `mem::take` to reduce heap allocations

This is the first post of a series where we go through the performance work we did in rustls, a modern TLS library in Rust. Today we tame the borrow checker using `mem::take`.

14 min read

    Background

    Ferrous Systems was contracted by the Internet Security Research Group (ISRG) to work on rustls, a modern TLS library in Rust, in 2023. We worked on adding no-std support to rustls, as well as designing and implementing an alternative API to the Connection API that lets the user manage the buffers used by rustls. We also did some performance improvements on the side. You can view ISRG's rustls work plan on their website.

    I'd like to walk through some of the work I did on the performance side: I ran into some interesting borrow checker problems along the way, and I think it would be instructive to share how I dealt with them.

    I'll cover three borrow checker problems, my solutions to those problems, and the limitations of those solutions. The content is spread across three posts: one borrow checker problem per post.

    NOTE: most snippets shown here are simplified versions of the rustls codebase as of version v0.22.0. Links are provided to the relevant parts of the rustls codebase should you wish to look at the full version.

    Processing incoming TLS data

    First, let's cover at a high level how rustls (v0.22.0) handles incoming TLS data.

    The Connection API has an internal contiguous buffer of bytes that it uses to read and buffer TLS data from an IO object, usually a network socket. Let's call this internal buffer incoming_tls_data. As a first approximation, you can think of this buffer as being a Vec<u8>. When you call read_tls, data is read from the IO object and appended to this internal buffer.

    // first approximation
    struct Connection {
        incoming_tls_data: Vec<u8>,
        // ..
    }
    

    After reading data with read_tls, the end user will call process_new_packets to complete the handshake process and to decrypt any application data that read_tls may have read from the network socket. process_new_packets does many things but it starts by extracting TLS records out of the incoming_tls_data buffer.

    TLS data is framed in records. Each record has a header that indicates what kind of payload it contains – for example, the payload could be a certificate – and how big the payload is. In addition to this, the payload of some TLS record types is encrypted; this is the case with records that carry application data.

         <--------------------- TLS data --------------------->
          <----- record 1 -----> <----- record 2 ----->
         +----------+-----------+----------+-----------+------+
         | Header 1 | Payload 1 | Header 2 | Payload 2 | (..) |
         +----------+-----------+----------+-----------+------+
    encr?|    no    |    no     |    no    |    yes    |
    size |    5B    |    1B     |    5B    |   122B    |
    type |   ChangeCipherSpec   |  EncryptedExtensions |
    

    Figure 1. ASCII diagram of TLS records as they appear in a TLS data stream. Diagram indicates the record types; header and payload sizes in bytes; and whether the payload is encrypted or not.
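    The header layout described above can be sketched in code. Here is a minimal, hypothetical parser (not rustls's actual decoder) for the 5-byte record header: one byte of content type, two bytes of protocol version, and a two-byte big-endian payload length. The header bytes are taken from the example in the figures; the function name is made up for illustration.

```rust
/// Hypothetical sketch of parsing a 5-byte TLS record header:
/// content type (1 byte), protocol version (2 bytes) and
/// payload length (2 bytes, big-endian).
fn parse_record_header(buf: &[u8]) -> Option<(u8, u16, usize)> {
    if buf.len() < 5 {
        return None; // incomplete header
    }
    let content_type = buf[0];
    let version = u16::from_be_bytes([buf[1], buf[2]]);
    let payload_len = u16::from_be_bytes([buf[3], buf[4]]) as usize;
    Some((content_type, version, payload_len))
}

fn main() {
    // the header bytes 0x160304007a that appear later in Figure 3:
    // a handshake record (0x16) with a 122-byte (0x007a) payload
    let header = [0x16, 0x03, 0x04, 0x00, 0x7a];
    let (ty, version, len) = parse_record_header(&header).unwrap();
    assert_eq!(ty, 0x16);
    assert_eq!(version, 0x0304);
    assert_eq!(len, 122);
}
```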

    The object used to extract records, and potentially decrypt their payloads, from this linear buffer is called the MessageDeframer. A second, better approximation of the Connection object then becomes:

    // NOTE: the types that sit between `MessageDeframer` and
    // `Connection` have been omitted
    struct Connection {
        incoming_tls: MessageDeframer,
        // ..
    }
    
    struct MessageDeframer {
        buffer: Vec<u8>,
        // ..
    }
    

    MessageDeframer's most important method is pop. This method extracts one record out of its internal buffer and decrypts its payload if it's encrypted.

    impl MessageDeframer {
        fn pop(
            &mut self,
            /* .. */,
        ) -> Result<Option<Deframed>, /* .. */> {
            // ..
        }
    }
    
    struct Deframed {
        message: PlainMessage,
        // ..
    }
    
    struct PlainMessage {
        payload: Payload,
        // ..
    }
    
    struct Payload(Vec<u8>);
    

    PlainMessage is a TLS record with decrypted Payload.

    deframer.buffer =
        +----------+-----------+----------+-----------+-----+
        | Header 1 | Payload 1 | Header 2 | 0xd4e48.. | Hea |
        +----------+-----------+----------+-----------+-----+
    
    deframer.pop() ->
        +----------+-----------+
        | Header 1 | Payload 1 |
        +----------+-----------+
    
    deframer.pop() ->
        +----------+-----------+
        | Header 2 | Payload 2 |
        +----------+-----------+
    
    deframer.pop() -> None
    

    Figure 2. ASCII diagram of the operation of the pop method. The first record has an unencrypted payload; the second, an encrypted payload; the third record is incomplete.

    The main takeaway here is that PlainMessage is an "owning" object: it does not borrow data from the MessageDeframer's buffer. This means that data is eagerly copied out of the deframer's buffer and placed into a fresh heap allocation. That's the kind of heap allocation we want to eliminate so the first step is to tweak PlainMessage internals so it can borrow from MessageDeframer's buffer instead of cloning part of it.
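    The owning-versus-borrowing distinction can be made concrete with a toy example. The two payload types below are hypothetical, not rustls types: one copies the record bytes into a fresh heap allocation, the other is just a pointer and a length into the existing buffer.

```rust
// Hypothetical illustration of the two approaches.
struct OwnedPayload(Vec<u8>);         // eager copy: one heap allocation
struct BorrowedPayload<'a>(&'a [u8]); // borrow: no allocation

fn main() {
    let deframer_buffer: [u8; 5] = [0x14, 0x01, 0x16, 0xd4, 0xe4];
    let record_payload = &deframer_buffer[1..2];

    // owning: `to_vec` copies the bytes into a new heap allocation
    let owned = OwnedPayload(record_payload.to_vec());
    // borrowing: no copy, just a view into `deframer_buffer`
    let borrowed = BorrowedPayload(record_payload);

    assert_eq!(owned.0.as_slice(), borrowed.0);
}
```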

    Inside MessageDeframer::pop

    As I mentioned above, MessageDeframer::pop performs two operations: it extracts one record out of a linear buffer and then decrypts the payload of said record, if necessary. The first operation is handled by an object called the codec::Reader; the second, by an object called the RecordLayer.

    If we look at how MessageDeframer uses codec::Reader, we see that codec::Reader is initialized with a slice into MessageDeframer's buffer: good, no allocation or cloning there. Then, MessageDeframer uses the OpaqueMessage::read method to get one OpaqueMessage message out of the Reader.

    impl MessageDeframer {
        fn pop(/* .. */) -> /* .. */  {
            // ..
    
            let mut rd = codec::Reader::init(
                &self.buffer[start..self.used],
            );
            let m = match OpaqueMessage::read(&mut rd) {
                // ..
            };
            
            let msg = match record_layer.decrypt_incoming(m) {
                // ..
            };
    
            // ..
        }
    }
    
    struct OpaqueMessage {
        payload: Payload,
        // ..
    }
    
    buffer = 
        +--------------+----------------+----+
        | 0x160304007a | 0xd4e4877bac.. |(..)|
        +--------------+----------------+----+
    
    rd = codec::Reader::init(&buffer[some_range]) ->
        +--------------+----------------+
        | 0x160304007a | 0xd4e4877bac.. |
        +--------------+----------------+
    
    m = OpaqueMessage::read(rd) ->
        +--------+----------------------+
        | Header | 0xd4e4877bac978b79.. |
        +--------+----------------------+
    
    if m.has_encrypted_payload() then
      msg = RecordLayer::decrypt(m) ->
        +--------+----------------------+
        | Header | b"plaintext payload" |
        +--------+----------------------+
    

    Figure 3. ASCII diagram that depicts the operations that happen inside MessageDeframer::pop. OpaqueMessage::read decodes the raw bytes that make up the header; RecordLayer::decrypt decrypts the encrypted payload.

    OpaqueMessage is a version of PlainMessage that contains a potentially encrypted payload; it is also an owning object. This means that the OpaqueMessage::read method is performing the heap allocation we want to remove.

    If we trace down where OpaqueMessage is converted into a PlainMessage, we arrive at the MessageDecrypter::decrypt method:

    trait MessageDecrypter {
        fn decrypt(
            &mut self,
            msg: OpaqueMessage,
            /* .. */
        ) -> Result<PlainMessage /* .. */>;
    }
    

    There is not a lot of detail there as that's a trait method declaration.

    If we look at actual implementations of this trait (for example, the ring implementation), we discover that they perform the decryption "in-place". This means OpaqueMessage's payload is mutated as part of the decryption process and then the OpaqueMessage object is effectively "cast" (zero-cost converted) into a PlainMessage object. No heap allocation occurs in this process; that is good.
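    A simplified sketch of that in-place pattern, using made-up stand-in types (a real MessageDecrypter uses an AEAD cipher, not the XOR "cipher" shown here): the payload bytes are mutated where they already live, and the conversion to the plaintext type moves the buffer instead of cloning it.

```rust
// Simplified stand-ins, not the actual rustls types.
struct OpaqueMessage {
    payload: Vec<u8>, // possibly encrypted bytes
}

struct PlainMessage {
    payload: Vec<u8>, // decrypted bytes
}

impl OpaqueMessage {
    // toy stand-in "cipher": XOR each byte with a fixed key
    fn decrypt_in_place(mut self, key: u8) -> PlainMessage {
        for byte in &mut self.payload {
            *byte ^= key; // mutate the payload where it is
        }
        // moves the Vec: its heap buffer is reused, not cloned
        PlainMessage { payload: self.payload }
    }
}

fn main() {
    let msg = OpaqueMessage { payload: vec![b'h' ^ 0x42, b'i' ^ 0x42] };
    let plain = msg.decrypt_in_place(0x42);
    assert_eq!(plain.payload, b"hi");
}
```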

    There's an important takeaway from looking at these MessageDecrypter implementations: OpaqueMessage's payload needs to be mutable.

    Mutable codec::Reader

    Based on what we have learned, we need to make these changes:

    • OpaqueMessage needs to hold a mutable slice into MessageDeframer's buffer
    • PlainMessage needs to hold an immutable slice into MessageDeframer's buffer
    • OpaqueMessage::read must be tweaked to not allocate / clone

    For now, let's ignore the Payload newtype and assume that the new types will look like this:

    struct OpaqueMessage<'a> { // now has a lifetime parameter
        payload: &'a mut [u8],
        // ..
    }
    
    struct PlainMessage<'a> { // same here
        payload: &'a [u8],
        // ..
    }
    

    Next, let's look at the updated OpaqueMessage::read signature:

    impl<'a> OpaqueMessage<'a> {
    //   ^^                ^^
        fn read(
            r: &mut codec::Reader<'a>,
            //                    ^^
        ) -> Result<Self, /* .. */> {
            // ..
        }
    }
    

    Those would be the right lifetime constraints but there's a problem: codec::Reader contains an immutable slice, from which we cannot create the mutable slice that we need to put in OpaqueMessage.

    struct Reader<'a> {
        buffer: &'a [u8],
        cursor: usize,
    }
    

    We'll deal with the issue by creating a mutable version of Reader – we'll call it ReaderMut – that contains a mutable slice. Only OpaqueMessage::read will use this new type, so this approach causes little churn elsewhere in the codebase.

    struct ReaderMut<'a> {
        buffer: &'a mut [u8],
        cursor: usize,
    }
    
    impl<'a> OpaqueMessage<'a> {
        fn read(
            r: &mut codec::ReaderMut<'a>,
        ) -> Result<Self, /* .. */> {
            // ..
        } 
    }
    

    The reborrow problem

    As I copied over methods from Reader to ReaderMut, I hit a borrow checker problem in the take method:

    impl<'a> ReaderMut<'a> {
        // lifetime 's made explicit for clarity
        fn take<'s>(
            &'s mut self,
            length: usize,
        ) -> Option<&'a mut [u8]> {
            if self.left() < length {
                return None;
            }
            let current = self.cursor;
            self.cursor += length;
            Some(&mut self.buffer[current..current + length])
            //^~ error: lifetime may not live long enough
        }
    }
    

    The error message is: "method (take) was supposed to return data with lifetime 'a but it is returning data with lifetime 's".

    Because the self variable has type &'s mut ReaderMut<'a>, mutably accessing its buffer field through self limits the slice reborrow to &'s mut [u8]. Let's desugar the last line to make that clearer:

    self.cursor += length;
    
    // NOTE: 's, not 'a
    let reborrow: &'s mut [u8] = &mut *self.buffer;
    
    let retval: &'s mut [u8] =
        reborrow.index_mut(current..current + length);
    
    Some(retval)
    

    This had me thinking for a bit: "why is this not a problem in the immutable case?" because this works fine:

    impl<'a> Reader<'a> {
        fn take(
            &mut self,
            length: usize
        ) -> Option<&'a [u8]> {
            if self.left() < length {
                return None;
            }
            let current = self.cursor;
            self.cursor += length;
            Some(&self.buffer[current..current + length])
        }
    }
    

    "Isn't the same constrained reborrow happening in this case?" The answer is "no" because immutable slices are Copy so the compiler can copy the buffer field instead of borrowing it. In the Reader case, the compiler can desugar the last line into:

    self.cursor += length;
    
    // NOTE: 'a! this coerce-copies the `buffer` field
    let copy: &'a [u8] = self.buffer;
    
    let retval: &'a [u8] = copy.index(current..current + length);
    
    Some(retval)
    

    This last desugaring is not possible in the mutable case because it would require moving the non-Copy buffer field out of self and that can't happen through the mutable reference specified in the function signature of take (&mut self).
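    The Copy behavior of shared slices is easy to demonstrate in isolation. In this sketch, reading the buffer field out of the struct copies the &[u8], so both the copy and the struct's field remain usable afterwards; with a &mut [u8] field, the equivalent line would be a move and would not compile.

```rust
// Shared references are `Copy`; mutable references are not.
struct Reader<'a> {
    buffer: &'a [u8],
}

fn main() {
    let data = [1u8, 2, 3, 4];
    let reader = Reader { buffer: &data };

    // this *copies* the `&[u8]`; `reader.buffer` stays valid
    let copy: &[u8] = reader.buffer;

    // both are usable at the same time
    assert_eq!(copy, reader.buffer);
}
```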

    Solution: mem::take

    We can't move buffer out of self but we can mem::take it like this:

    let buffer: &'a mut [u8] = core::mem::take(&mut self.buffer);
    

    and now we have the 'a lifetime we wanted!
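    Before looking at the full fix, here is what mem::take does in isolation: it moves the value out through a mutable reference and leaves T::default() behind. For a &mut [u8], the default is the empty mutable slice. This is a standalone sketch, separate from the ReaderMut code.

```rust
use core::mem;

fn main() {
    let mut data = [1u8, 2, 3];
    let mut slot: &mut [u8] = &mut data;

    // moves the slice out of `slot`…
    let taken: &mut [u8] = mem::take(&mut slot);

    assert_eq!(taken, [1, 2, 3]);
    // …and leaves the empty mutable slice behind
    assert!(slot.is_empty());
}
```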

    The side effect is that self.buffer becomes an empty mutable slice. Leaving the field empty like that would break subsequent calls to the take method, so more changes are needed to preserve the semantics of Reader::take. The final version looks like this:

    struct ReaderMut<'a> {
        buffer: &'a mut [u8],
        // renamed because it has a different role now
        used: usize, 
    }
    
    impl<'a> ReaderMut<'a> {
        fn take<'s>(
            &'s mut self,
            length: usize,
        ) -> Option<&'a mut [u8]> {
            if self.left() < length {
                return None;
            }
            
            // `self.buffer` becomes `&mut []`
            let buffer = mem::take(&mut self.buffer);
            let (taken, rest) = buffer.split_at_mut(length);
    
            self.used += length;
            self.buffer = rest; // <- repopulate the field
    
            Some(taken)
        }
    }
    

    A closer look

    I think it's important to highlight that take does not freeze ReaderMut. You can tell from the method signature: the input parameter &'s mut self and the output parameter Option<&'a mut [u8]> have different lifetime parameters.

    Because ReaderMut is not frozen, you can call take again while the value returned by take is in scope:

    let mut buffer = [0, 1, 2, 3];
    let mut reader = ReaderMut::init(&mut buffer);
    let first = reader.take(2).unwrap(); // start of 'first
    
    // borrow checker accepts this while `first` is alive
    let second = reader.take(2).unwrap(); 
    
    drop(first); // end of 'first
    

    Another insight is that the take method modifies the buffer field of ReaderMut: it chops length bytes off the front of buffer and returns them.

    Let's have a closer look using a hypothetical dbg2! macro that preserves more type information (e.g. mutability) than dbg!:

    let mut buffer = [0, 1, 2, 3];
    let mut reader = ReaderMut::init(&mut buffer);
    let first = reader.take(2).unwrap(); 
    
    dbg2!(first, reader);
    

    That prints:

    first = &mut [0, 1]
    reader = ReaderMut {
        buffer: &mut [2, 3],
        used: 2,
    }
    

    There you can see that there are two mutable, non-overlapping slices. The fact that they are non-overlapping is important to adhere to Rust's aliasing rules.

    Aliasing rules are why the first version of ReaderMut::take was rejected. That version of take left buffer unchanged and did not freeze ReaderMut. If that version had been accepted by the compiler then the dbg2! snippet would print:

    first = &mut [0, 1]
    reader = ReaderMut {
        buffer: &mut [0, 1, 2, 3],
        used: 2,
    }
    

    There you can see that first and reader.buffer are mutable slices that overlap and that breaks Rust's aliasing rules.
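    split_at_mut is what makes the non-overlapping version legal: it divides one mutable slice into two disjoint halves that can both be alive, and mutated, at the same time. A minimal demonstration:

```rust
fn main() {
    let mut buffer = [0u8, 1, 2, 3];

    // two disjoint mutable views into the same array
    let (first, rest) = buffer.split_at_mut(2);
    first[0] = 10; // mutate both halves while both are alive
    rest[0] = 20;

    assert_eq!(buffer, [10, 1, 20, 3]);
}
```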

    On the other hand, the Reader::take implementation does not modify the buffer field and that produces overlapping immutable slices:

    let buffer = [0, 1, 2, 3];
    let mut reader = Reader::init(&buffer);
    let first = reader.take(2).unwrap(); 
    
    dbg2!(first, reader);
    
    first = &[0, 1]
    reader = Reader {
        buffer: &[0, 1, 2, 3],
        cursor: 2,
    }
    

    and that's fine as it's in line with Rust's aliasing rules.

    Alternative: the Option dance

    mem::take was not there in Rust 1.0; in fact, it was only stabilized in Rust 1.40. So, what did people do to deal with this problem before mem::take came into existence? They did "the Option dance".

    I'll now demonstrate the Option dance (🕺) for the new Rustaceans who don't know about it.

    struct ReaderMut<'a> {
        buffer: Option<&'a mut [u8]>, // <- wrap in Option
        used: usize,
    }
    
    impl<'a> ReaderMut<'a> {
        fn take<'s>(
            &'s mut self,
            length: usize,
        ) -> Option<&'a mut [u8]> {
            if self.left() < length {
                return None;
            }
            
            let buffer = self.buffer.take().unwrap(); // <-
            let (taken, rest) = buffer.split_at_mut(length);
    
            self.used += length;
            self.buffer = Some(rest); // <-
    
            Some(taken)
        }
    }
    

    Not too different except for the addition of some unwelcome unwraps.

    Even accessing buffer without the intention of taking it requires an unwrap; here's the Option dance version of the ReaderMut::left method.

    fn left(&self) -> usize {
        self.buffer.as_ref().unwrap().len()
    }
    

    Although mem::take is the better option (no pun intended), the Option dance pattern still has its place. mem::take has the signature fn take<T: Default>(dest: &mut T) -> T: it requires that the type T being taken implement the Default trait, so that a default value can be left behind in dest. If you are working with types that do not implement this trait then dancing is your only option (pun intended).
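    For example, with a field type that deliberately has no Default impl (Token and Connection below are made-up names, not rustls types), wrapping the field in Option is what lets you move the value out:

```rust
// A type with no `Default` impl, so `mem::take` is off the table.
struct Token(u32);

struct Connection {
    token: Option<Token>, // <- wrapped in Option for the dance
}

impl Connection {
    // the Option dance: move the token out, leave `None` behind
    fn take_token(&mut self) -> Option<Token> {
        self.token.take()
    }
}

fn main() {
    let mut conn = Connection { token: Some(Token(7)) };
    let token = conn.take_token().expect("token present");
    assert_eq!(token.0, 7);
    assert!(conn.take_token().is_none()); // already taken
}
```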

    Wrap-up

    That concludes the first blog post of this series. Thus far, we have removed one heap allocation from the process of deframing (decoding) and decrypting TLS records out of a linear buffer. As all TLS records go through that process, that's one less heap allocation per processed record.

    In the next post, we'll continue working our way through the rustls pipeline that processes incoming TLS records. The next borrow checker problem will involve lifetime parameters in traits and Higher-Rank Trait Bounds (HRTB).


    If the borrow checker has you in a bind or if you want to optimize the memory usage of your project by switching from heap allocated objects to references, our consulting service can help you with that! Get in touch through our contact form.