This is a short blog covering the changes made to huus to give it a 300-400x speedup.

The changes are two-part and simple. These changes seem trivial but often go unnoticed when building a prototype. Before we see these changes let us look at the subpar design decisions that lead to the slow (200-800 ops/sec based on hardware) speeds.

1. Lack of cache and its consequences.

The current implementation has no mechanism to hold pages in memory. This results in the system needing to load a page from disk, make changes, and write it back to disk. Even if a page only needs to be read it is loaded into memory, read, and then discarded. Hence we see that for all operations, insertion, updates, reading and deletion, we hit the disk every time while a lot of this can be avoided by caching.

func (s *storage) writePage(pageId uint32, data []byte) error {

  // ...

  page := make([]byte, s.pageSize)
  copy(page, data)
  
  offset := int64(int(pageId) * int(s.pageSize) + metadataSize)
  _, err := s.fo.WriteAt(page, offset) // Costs a write into disk
  
  // ...

  return nil
}

Every writePage writes to the file after a change is made.

The consequence of this is seen in the next problem.

2. Multiple disk flushes.

Since the file is written to on each page change, a reader reading the file might not see the changes if the content is not flushed onto the disk. This can cause incorrect reads which violate basic consistency guarantees.

A plain WriteAt is cheap on its own, it only lands the bytes in the OS page cache. What actually costs us is the flush (fsync) that forces the disk to physically commit those bytes. To stay durable the old design flushed after every node update, and since a single Put can update several nodes on the way down the tree, these flushes stack up. That is what balloons the cost from a few us to ms per operation. The storage was extremely durable, but it paid for it in ms / op.

func (s *storage) updateNode(cur *node) error {

oldData, err := s.loadNodeRaw(curPageId)

// ... Update Node ...

err = s.flush()

// ... return ...

}

See Every node update causes a flush and since a single Put can often update multiple nodes in a tree this stacks

A single solution that solves both constraints

Introducing a cache solves both problems. Pages now live in memory, so we no longer read a page from disk, mutate it, and immediately fsync it back on every change. Instead we mutate the cached buffer and defer the flush, so a Put that touched five nodes no longer means five fsyncs, it means one flush later on. Any reader running simultaneously first hits the cache and then falls back to disk, ensuring the most recent changes are read.

This implementation features an unbounded cache with two major functions

// Return cached buffer, If it is not in memory then read it from disk
func (s *storage) getPage(pageId uint32) ([]byte, error) {
  if buf, ok := s.cache[pageId]; ok {
    return buf, nil
  }

  buf := make([]byte, s.pageSize)
  offset := int64(int(pageId) * int(s.pageSize) + metadataSize)
  if _, err := s.fo.ReadAt(buf, offset); err != nil {
    return nil, fmt.Errorf("Error reading page %d : %w", pageId, err)
  }

  s.cache[pageId] = buf
  return buf, nil
}
func (s *storage) setPage(pageId uint32, data []byte) error {
  buf, ok := s.cache[pageId]
  if !ok {
    buf = make([]byte, s.pageSize)
    s.cache[pageId] = buf
  }

  // data may be shorter than a full page, so clear before copying.
  clear(buf)
  copy(buf, data)

  offset := int64(int(pageId) * int(s.pageSize) + metadataSize)
  if _, err := s.fo.WriteAt(buf, offset); err != nil {
    return fmt.Errorf("Error writing page %d : %w", pageId, err)
  }

  return nil
}

See These two functions provide cache access. Note that setPage still issues a WriteAt, but that write is buffered by the OS, the expensive fsync is what we removed from the hot path and now defer.

Results

Speedup (× faster with cache)
Keys Insert Read Delete Total Wall Time
500 474.8× 6.3× 286.3× 372.9×
5 000 698.1× 3.5× 346.6× 495.3×
50 000 731.6× 1.7× 132.3× 276.6×

Total wall-time = INSERT + READ + DELETE phase durations summed:

  • 500: 7.22 s → 0.0194 s = 372.9×
  • 5 000: 73.72 s → 0.149 s = 495.3×
  • 50 000: 940.35 s (15.7 min) → 3.40 s = 276.6×
RAM cost (× more heap with cache)
keys main cache cost
500 0.2 MB 0.8 MB
5 000 0.2 MB 6.7 MB 33.5×
50 000 0.2 MB 65.5 MB 327×
Disk cost
keys main cache cost
500 0.6 MB 0.6 MB 1.0×
5 000 6.4 MB 6.4 MB 1.0×
50 000 64.0 MB 64.0 MB 1.0×
Observations
  • Writes get ~470–730× faster (INSERT) and ~130–350× faster (DELETE).
  • Reads improve only mildly (1.7–6.3×), and the advantage shrinks with scale.
  • The trade is RAM: the cache holds the whole tree in heap, so memory grows ~linearly with the dataset (327× at 50k; heap ≈ disk, e.g. 650 MB at 500k keys) while main stays flat at 0.2 MB.

Next steps

An unbounded cache is infeasible, so the next step is to introduce a bounded cache with an eviction policy. This will cap the amount of RAM used and ensure it doesn’t balloon from holding every page in memory until close.

References