cat ./notes/thread-local-storage-matrix.md

Thread Local Storage Matrix

Ordinum's implementation of Thread Local Storage and how the storage engine interacts with it

This journal walks through the design challenges and decisions we faced when using thread local storage in the Ordinum storage engine.

To preface, Ordinum does NOT re-implement thread local storage, and this journal does not go into the low-level details of how thread local storage works. It covers purely how Ordinum uses the existing thread local storage mechanisms in Rust for subsystems of its storage engine.

At a high level, thread local storage is quite simply what its name suggests: storage that is local to a thread. Each thread gets its own copy of the value, which can be accessed from that thread only, independently of every other thread. In Rust this comes in the form of the thread_local! macro:

macro_rules! thread_local (rust)

macro_rules! thread_local {
    () => { ... };
    ($($tt:tt)+) => { ... };
}

The macro declares statics of type std::thread::LocalKey, which is a key into the underlying TLS storage on the target platform. For example, on Linux that may be the TLS support of the ELF ABI (..).

LocalKey uses the fastest implementation available on the target platform and is constructed with the thread_local! macro as described above.

Some interesting points when it comes to TLS and LocalKey:

  • Initialisation is done dynamically/lazily, per thread, on the first access, e.g. the first call to X.with(..)
  • Although TLS is a per-thread primitive, thread local state can still end up shared with other threads, and LocalKey only ever hands out shared &T references. Fields that need to be mutated must therefore be wrapped in an interior mutability primitive such as Cell<>, RefCell<>, UnsafeCell<> etc.
  • Destructors are 'best effort' and platform specific. There are a number of known cases in which destructors are not run (..)
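The first two points can be seen in a small sketch. The names INIT_CALLS, COUNTER, bump and bump_on_new_thread are made up for illustration and are not part of Ordinum. The initialiser runs once per thread on first access, and mutation goes through a Cell because only a &T is ever handed out.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Counts how many times the thread-local initialiser has run, across all threads.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Non-const initialiser: runs lazily, once per thread, on first access.
    static COUNTER: Cell<u32> = {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        Cell::new(0)
    };
}

/// Bumps the calling thread's counter and returns the new value.
/// `with` only yields `&Cell<u32>`, so mutation relies on interior mutability.
fn bump() -> u32 {
    COUNTER.with(|c| {
        c.set(c.get() + 1);
        c.get()
    })
}

/// Spawns a fresh thread, bumps its counter `n` times and returns the last value it saw.
fn bump_on_new_thread(n: u32) -> u32 {
    thread::spawn(move || {
        let mut last = 0;
        for _ in 0..n {
            last = bump();
        }
        last
    })
    .join()
    .unwrap()
}

fn main() {
    let first = bump();
    // The new thread starts from its own freshly initialised counter.
    assert_eq!(bump_on_new_thread(3), 3);
    // The spawning thread's counter is unaffected by the other thread.
    assert_eq!(bump(), first + 1);
    println!("initialiser runs so far: {}", INIT_CALLS.load(Ordering::SeqCst));
}
```

Each thread that touches COUNTER pays for initialisation exactly once, and no thread ever observes another thread's count.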

Having said that, here is an example of how we would initialise and interact with thread local storage in Rust:

thread_local!() (rust)

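use std::cell::{Cell, RefCell};

thread_local! {
    // `const { .. }` initialiser: may let the implementation skip lazy initialisation.
    pub static FOO: Cell<u32> = const { Cell::new(1) };

    // Regular initialiser: runs lazily on the first access in each thread.
    static BAR: RefCell<Vec<f32>> = RefCell::new(vec![1.0, 2.0]);
}

fn main() {
    // `get` is available on LocalKey<Cell<T>>, `with_borrow` on LocalKey<RefCell<T>>.
    assert_eq!(FOO.get(), 1);
    BAR.with_borrow(|v| assert_eq!(v[1], 2.0));
}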

Now this is just a reiteration of the Rust standard library documentation, but the context is useful because Ordinum makes heavy use of thread local storage, not just for capturing metrics but also in subsystems that exist for optimisation and efficiency.

Problem Statement

Ordinum has a number of subsystems which require thread local storage to speed up processing and to reduce strain on global structures. It also needs thread local storage to capture per-thread metrics and local state which live for the lifetime of the thread. These two uses are orthogonal to each other. The former must interact with the program, with state that has different lifetimes and accessors depending on the program logic, for example state scoped to a single database instance. The latter lasts for the lifetime of the thread and is purely local to that thread.

For the state which must interact with the program, the problem becomes: how do we effectively separate the state of different instances within the program, and how do we keep cross-thread interaction safe?

Those are the two axes we will focus on.

[Figure: tls_axis]
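The first axis, separating state per instance, can be pictured with a thread-local map keyed by an instance identifier. This is only a minimal sketch under assumptions: DbId, SCRATCH, append and clear are hypothetical names, and it is not a description of how Ordinum actually scopes its state.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Hypothetical identifier for a database instance (not Ordinum's actual type).
type DbId = u64;

thread_local! {
    // One map per thread; each entry belongs to exactly one database instance.
    static SCRATCH: RefCell<HashMap<DbId, Vec<u8>>> = RefCell::new(HashMap::new());
}

/// Appends bytes to the calling thread's buffer for `db` and returns its new length.
fn append(db: DbId, bytes: &[u8]) -> usize {
    SCRATCH.with_borrow_mut(|map| {
        let buf = map.entry(db).or_default();
        buf.extend_from_slice(bytes);
        buf.len()
    })
}

/// Drops the calling thread's state for `db`; returns whether any state existed.
fn clear(db: DbId) -> bool {
    SCRATCH.with_borrow_mut(|map| map.remove(&db).is_some())
}

fn main() {
    assert_eq!(append(1, b"abc"), 3);
    // Instance 2 does not see instance 1's bytes.
    assert_eq!(append(2, b"z"), 1);
    assert_eq!(append(1, b"de"), 5);
    // After clearing, state for instance 1 starts fresh on this thread.
    assert!(clear(1));
    assert_eq!(append(1, b"x"), 1);
}
```

Note that clear only reaches the calling thread's map. Any other thread that has touched the same instance still holds its own entry, which is exactly where the second axis comes in.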

If we cast our minds back to the TLS/LocalKey invariants, we are only ever given &T references back from TLS, and we must carefully address the fact that other threads can (and in our case, will) touch thread local storage and, more importantly, may mutate local state in certain cases.
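One way this kind of cross-thread access can arise is when the thread-local value is only a handle to shared data. The sketch below assumes a hypothetical global REGISTRY and per-thread SLOT, neither of which comes from Ordinum. The owning thread uses the fast thread-local path, while any other thread can reach every slot through the registry.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical global registry of every thread's slot, so other threads can reach them.
// In this sketch entries are never pruned when a thread exits.
static REGISTRY: Mutex<Vec<Arc<AtomicU64>>> = Mutex::new(Vec::new());

thread_local! {
    // The thread-local value is only a handle; the data it points to is shared.
    static SLOT: Arc<AtomicU64> = {
        let slot = Arc::new(AtomicU64::new(0));
        REGISTRY.lock().unwrap().push(Arc::clone(&slot));
        slot
    };
}

/// Fast path: the owning thread stores a value through the `&T` it gets from `with`.
fn cache_locally(value: u64) {
    SLOT.with(|s| s.store(value, Ordering::Release));
}

/// Reads the calling thread's slot.
fn read_local() -> u64 {
    SLOT.with(|s| s.load(Ordering::Acquire))
}

/// Slow path: any thread can reset every registered slot, e.g. to invalidate caches.
/// Returns the number of slots touched.
fn invalidate_all() -> usize {
    let slots = REGISTRY.lock().unwrap();
    for s in slots.iter() {
        s.store(0, Ordering::Release);
    }
    slots.len()
}

fn main() {
    cache_locally(7);
    assert_eq!(read_local(), 7);
    // A different thread mutates this thread's "local" state via the registry.
    let touched = thread::spawn(invalidate_all).join().unwrap();
    assert!(touched >= 1);
    assert_eq!(read_local(), 0);
}
```

Because the slot is an atomic behind an Arc, the &T restriction is respected and the mutation from the other thread is still well defined. A Cell or RefCell could not be shared this way, since neither is Sync.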

Ordinum has subsystems of varying complexity and needs which all rely on TLS, and this problem space is what we'll address in the rest of the journal.

Naive Implementation

Where possible, it is recommended to start with the naive implementation, although my perfectionism often overrules this and forces me into an optimisation blender. For thread local storage, we truly started with the basic implementation and discovered along the way what needed to change as the problem evolved and we introduced more complex subsystems and invariants.

We will walk through three subsystems (a non-exhaustive list), each with its own needs and each covering a different part of the problem described above.

  1. PerfContext (per thread local scope)
  2. BatchCache (per database instance thread local)
  3. SuperVersion Cache (per database, per column family, cross thread state)