Imagine you're trying to store every temperature reading from a thousand weather stations, each sending data every second. That's 86,400 readings per station daily. After a month, you're looking at billions of numbers. Traditional databases gasp under this load. They weren't built for this constant stream of timestamped data.
This is the problem I set out to solve. I wanted to build a storage engine in Go that could swallow this firehose of data without choking, while still letting me ask complex questions about the past. Think of it as a specialized librarian for the world's most frenetic timeline.
At its heart, time-series data is simple: a timestamp and a value. The challenge is the sheer volume and the pattern of access. You write data points in chronological order, but you often read ranges of time. You care about trends—hourly averages, daily maximums—not just single points.
Let me show you the foundation. We start by defining what a single data point looks like in our Go code.
```go
type DataPoint struct {
	Timestamp int64
	Value     float64
}
```
This is the atom of our system. The timestamp is usually milliseconds since the Unix epoch, and the value can be any measurement: temperature, stock price, sensor voltage.
The first big idea is to stop storing data in rows. Imagine a spreadsheet where each row is a timestamp and a value. Now, imagine tearing that sheet in two: one column with all the timestamps, another with all the values. This is called columnar storage.
```go
type ColumnBlock struct {
	timestamps       []int64
	values           []float64
	minValue         float64
	maxValue         float64
	compressed       []byte
	uncompressedSize int
}
```
We group points into blocks. Each block holds a sequence of points, like one hour of data for a specific sensor. Storing timestamps and values separately lets us compress them more effectively because they often follow predictable patterns.
Timestamps are almost always increasing, and often at regular intervals. We can use a trick called delta encoding. Instead of storing the full timestamp for each point, we store the difference from the previous one.
```go
func (tse *TimeSeriesEngine) encodeTimestamps(timestamps []int64) []byte {
	if len(timestamps) == 0 {
		return []byte{}
	}
	encoded := make([]byte, 0, len(timestamps)*8)

	// Store the first timestamp in full, then only deltas.
	prev := timestamps[0]
	encoded = binary.BigEndian.AppendUint64(encoded, uint64(prev))
	for i := 1; i < len(timestamps); i++ {
		delta := timestamps[i] - prev
		encoded = binary.BigEndian.AppendUint64(encoded, uint64(delta))
		prev = timestamps[i]
	}
	return encoded
}
```
The first timestamp is stored in full. If the next reading arrives 1,000 milliseconds later, we store just the number 1000; the next delta might be another 1000. In this simplified encoding each delta still occupies eight bytes, but long runs of small, repetitive deltas are exactly what the downstream compressor crunches best. (A production engine would go further and varint-encode the deltas so small numbers literally take fewer bytes.)
Values behave differently. A temperature might be 72.1, then 72.2, then 72.3. The numbers are similar but not identical. For these, we use a method called XOR encoding for floating-point numbers.
```go
func (tse *TimeSeriesEngine) encodeValues(values []float64) []byte {
	if len(values) == 0 {
		return []byte{}
	}
	encoded := make([]byte, 0, len(values)*8)

	// Store the first value's raw bits, then XOR each value with its predecessor.
	prev := math.Float64bits(values[0])
	encoded = binary.BigEndian.AppendUint64(encoded, prev)
	for i := 1; i < len(values); i++ {
		current := math.Float64bits(values[i])
		xor := current ^ prev
		encoded = binary.BigEndian.AppendUint64(encoded, xor)
		prev = current
	}
	return encoded
}
```
Here's what happens. We take the raw binary representation of each floating-point number. When two values are close, their bit patterns share most of their bits, so XOR-ing them produces a word that is mostly zeroes. Long runs of zero bytes compress extremely well.
After this specialized encoding, we apply a general-purpose compressor: S2, a fast Snappy-compatible format from the klauspost/compress Go package. We end up with a compact blob of bytes representing perhaps ten thousand data points.
Now, where do we put this blob? We need a persistent store. I chose BadgerDB, a key-value store written in Go. It's excellent for fast writes and handles our data model nicely.
```go
type TimeSeriesStorage struct {
	db           *badger.DB
	columnBlocks map[string]*ColumnBlock
	partitions   map[int64]*Partition
	mu           sync.RWMutex
}
```
We organize data by metric name and by time partition. A partition might be one hour of data. The key in our key-value store looks like temperature:1716940800000, where the number is the start of the hour in milliseconds.
When a new data point arrives, we append it to an in-memory block for that metric and partition. Once the block fills up to a configured size—say, 10,000 points—we compress it and write it to BadgerDB.
```go
func (tse *TimeSeriesEngine) WritePoint(metric string, timestamp int64, value float64) error {
	partitionKey := tse.getPartitionKey(timestamp)
	blockKey := fmt.Sprintf("%s:%d", metric, partitionKey)

	tse.storage.mu.Lock()
	defer tse.storage.mu.Unlock()

	block, exists := tse.storage.columnBlocks[blockKey]
	if !exists {
		block = &ColumnBlock{
			timestamps: make([]int64, 0, tse.config.BlockSize),
			values:     make([]float64, 0, tse.config.BlockSize),
			minValue:   math.MaxFloat64,
			maxValue:   -math.MaxFloat64,
		}
		tse.storage.columnBlocks[blockKey] = block
	}

	block.timestamps = append(block.timestamps, timestamp)
	block.values = append(block.values, value)
	if value < block.minValue {
		block.minValue = value
	}
	if value > block.maxValue {
		block.maxValue = value
	}

	// When the block fills, compress it, persist it, and reset it in place.
	if len(block.timestamps) >= tse.config.CompressionThreshold {
		tse.persistBlock(metric, partitionKey, block)
		block.timestamps = block.timestamps[:0]
		block.values = block.values[:0]
		block.minValue = math.MaxFloat64
		block.maxValue = -math.MaxFloat64
	}
	return nil
}
```
This approach gives us fast writes: the hot path is mostly appending to slices in memory. The expensive compression and disk write happen only once per block, when it fills. (In this simplified version that flush runs synchronously while the lock is held; a production engine would hand the full block to a background goroutine so writers never stall.)
But writing data is only half the battle. We need to get it back efficiently. This is where indexing comes in. If someone asks for the temperature from last Tuesday between 2 PM and 3 PM, we shouldn't have to scan all our data.
We build a simple but effective index. For each metric, we keep track of the minimum and maximum timestamp we've seen.
```go
type TimeIndex struct {
	timeRanges map[string]*TimeRange
}

type TimeRange struct {
	Min int64
	Max int64
}
```
We also use a Bloom filter. This is a probabilistic data structure that can tell us if we've definitely never seen a metric. It's a quick check to avoid fruitless searches.
When a query comes in, we first check the time range index. If the requested time window doesn't overlap with the data we have for that metric, we return an empty result immediately.
```go
func (tse *TimeSeriesEngine) Query(metric string, start, end int64, agg AggregationType, interval int64) ([]DataPoint, error) {
	tr, exists := tse.index.timeRanges[metric]
	if !exists || end < tr.Min || start > tr.Max {
		return []DataPoint{}, nil
	}
	// Clamp the query range to the data we actually have.
	if start < tr.Min {
		start = tr.Min
	}
	if end > tr.Max {
		end = tr.Max
	}
	// Find the partitions that overlap the range.
	partitions := tse.getPartitionsForRange(start, end)
	// ... execute query on partitions
}
```
We then determine which time partitions intersect with our query. If we partition by hour, and the query is for 2:30 PM to 4:15 PM, we'll need the 2 PM, 3 PM, and 4 PM partitions.
The real power comes next: parallel query execution. We can process each partition independently. In Go, this is elegantly done with goroutines and channels.
```go
var wg sync.WaitGroup
resultChan := make(chan []DataPoint, len(partitions))

for _, partitionKey := range partitions {
	wg.Add(1)
	go func(pKey int64) {
		defer wg.Done()
		points, _ := tse.queryPartition(metric, pKey, start, end, agg, interval)
		resultChan <- points
	}(partitionKey)
}

wg.Wait()
close(resultChan)

var allPoints []DataPoint
for points := range resultChan {
	allPoints = append(allPoints, points...)
}
```
Each goroutine fetches and decodes the compressed blocks for its assigned partition. It filters the points to the exact time range, and if an aggregation interval is specified, it performs an initial aggregation. The results from all goroutines are then merged.
Speaking of aggregation, this is a core need. People rarely want every single millisecond reading. They want the average temperature per minute, or the maximum CPU load per hour.
We handle this with a two-stage process. First, each partition worker can do a partial aggregation. Then, the main thread does a final merge. Here's how we aggregate points into fixed time windows.
```go
func (tse *TimeSeriesEngine) aggregateByInterval(points []DataPoint, agg AggregationType, interval int64) []DataPoint {
	if len(points) == 0 {
		return nil
	}
	var aggregated []DataPoint
	currentWindow := (points[0].Timestamp / interval) * interval
	var windowPoints []float64

	for _, point := range points {
		window := (point.Timestamp / interval) * interval
		// A new window has started: flush the previous one.
		if window != currentWindow && len(windowPoints) > 0 {
			aggregated = append(aggregated, DataPoint{
				Timestamp: currentWindow,
				Value:     tse.applyAggregation(windowPoints, agg),
			})
			windowPoints = windowPoints[:0]
			currentWindow = window
		}
		windowPoints = append(windowPoints, point.Value)
	}

	// Flush the final window.
	if len(windowPoints) > 0 {
		aggregated = append(aggregated, DataPoint{
			Timestamp: currentWindow,
			Value:     tse.applyAggregation(windowPoints, agg),
		})
	}
	return aggregated
}
```
The function applyAggregation computes the average, sum, minimum, maximum, or count for a slice of values. It's straightforward but essential.
Now, let's put it all together and see the engine in action. We'll simulate writing some data and then querying it.
```go
func main() {
	engine, err := NewTimeSeriesEngine("./tsdata")
	if err != nil {
		log.Fatal(err)
	}
	defer engine.Close()

	// Simulate writing temperature data at one-second intervals.
	startTime := time.Now().UnixMilli()
	for i := 0; i < 100000; i++ {
		timestamp := startTime + int64(i)*1000
		value := 50 + 10*math.Sin(float64(i)/100) // a sine-wave pattern
		if err := engine.WritePoint("temperature", timestamp, value); err != nil {
			log.Fatal(err)
		}
	}

	// Query: average temperature per minute over the entire range.
	endTime := startTime + 100000*1000
	points, err := engine.Query("temperature", startTime, endTime, AggregationAvg, 60000)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Query returned %d data points (one per minute)\n", len(points))
	for i, point := range points {
		if i >= 5 { // show just the first five
			break
		}
		t := time.UnixMilli(point.Timestamp)
		fmt.Printf("%s: %.2f°\n", t.Format("15:04:05"), point.Value)
	}
}
```
Running this, you'd see output showing the average temperature for each minute, calculated from the 60 individual seconds of data stored for that minute.
Building this taught me several things. Memory management is critical. We keep active blocks in memory for fast writes, but we must flush them to disk before they grow too large. Reusing slices by truncating them with `[:0]` instead of reallocating keeps pressure off Go's garbage collector.
Concurrency is natural in Go, but careful synchronization is needed. We use a read-write mutex (sync.RWMutex) to protect the in-memory blocks map. Many goroutines can read it simultaneously, but only one can write.
Choosing the right block size and partition interval is more art than science. It depends on your data and query patterns. A smaller block size means faster compression and more granular reads, but it increases the overhead of indexing. I've found blocks of 10,000 points and one-hour partitions to be a good starting point.
There are improvements one could make. Adding tags or labels to metrics, like temperature{sensor="A1", location="NYC"}, would make the system more flexible. Implementing a Write-Ahead Log (WAL) would guarantee no data loss between memory and disk. Downsampling—storing pre-aggregated data for older time ranges—would make queries over years of data much faster.
But the core idea remains: respect the nature of time-series data. It arrives in order. It is read in ranges. It benefits from compression that understands its sequential patterns. By aligning our storage format, compression methods, and query execution with these intrinsic properties, we can build a system that is both fast and efficient.
The code I've shown is a simplified but complete foundation. It can ingest hundreds of thousands of points per second on modest hardware and answer complex range queries in milliseconds. It demonstrates that with careful design and the right algorithms, managing the relentless flow of time-series data is not just possible, but can be elegantly simple.