Faster Video DataLoaders for Faster Training

TL;DR

We optimized a PyTorch DataLoader for fixed-size videos using a small cache. This reduced the training time for a single epoch from 4 hours to less than 30 minutes, about an 8x improvement.

Introduction

At Thelios, sports video processing using computer vision is our primary task. This involves training and performing inference on frame-level, frame-sequence-level, and even larger video-level deep learning models.

SlowFast is one of the common models we use for detecting actions in sequences of frames. We train it on custom data and achieve fairly accurate results. Primarily, we relied on the mmaction framework. However, the OpenMMLab team appears to have stopped maintaining the broader mm* ecosystem, which includes mmcv, mmdetection, and mmaction. Additionally, while training on existing models in mmaction is straightforward, modifying the head or model body is cumbersome. Consequently, we decided to move to pure PyTorch-based implementations of SlowFast and its sibling models.

The Problem

Recently, when training SlowFast on our data, we observed that a single epoch was estimated to take over 4 hours for a dataset of approximately 35,000 samples. This seemed excessively high, especially since a dry run on 1,000 samples took less than a minute per epoch. Even accounting for overhead, it was hard to imagine the epoch time exceeding one hour.

We waited for the training to stabilize, hoping the estimated time would drop to a reasonable level. Instead, it continued to climb, rising from an initial 2 hours to 4 hours after just 15 minutes of training. `nvitop` confirmed a bottleneck: GPU memory utilization was only 25 – 30%, and average GPU utilization was less than 50%.

Problem Analysis

1. Video Loading Functions

We had previously trained a SlowFast model on 5,000 samples using individual JPG files for each frame, and that process was smooth. Our first guess was that OpenCV video loading might be the culprit. Research led us to alternatives like pyav, pytorchvideo, and decord. However, when we benchmarked video loading times across frameworks, the results were nearly identical:

| Loading Method | Total Time (100 Videos) | Avg Time per Video |
| --- | --- | --- |
| cv2 load | 10.20s | 0.10s |
| cv2 (preallocated numpy buffer) | 10.43s | 0.10s |
| pyav | 10.40s | 0.10s |
| pytorchvideo | 10.69s | 0.10s |
| torchvision video reader | 10.56s | 0.10s |
| decord | 10.52s | 0.10s |

It became obvious that when loading short, fixed-frame videos (< 100 frames), no framework has a clear advantage. The bottleneck was likely the blocking disk read that loads the video file into memory.
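For reference, the cv2 half of the benchmark looked roughly like this (a sketch; `video_paths` stands in for our list of 100 clips):

```python
import time

import cv2

def load_video_cv2(path):
    """Decode every frame of a short clip into a list of numpy arrays."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

start = time.perf_counter()
for path in video_paths:  # assumed: 100 short, fixed-frame clips
    load_video_cv2(path)
print(f"total: {time.perf_counter() - start:.2f}s")
```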

2. DataLoader Knob Tuning

We performed a systematic analysis of PyTorch DataLoader parameters, including batch size, number of workers, pinned memory, and persistent workers. We ran a grid search to find the optimal configuration.
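For illustration, a minimal sketch of that kind of grid search (`video_dataset` and the timing harness are stand-ins, not our exact code):

```python
import itertools
import time

from torch.utils.data import DataLoader

def time_first_batches(dataset, batch_size, num_workers, pin_memory, persistent, n_batches=50):
    """Time the first n_batches for one DataLoader configuration."""
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=pin_memory,
        persistent_workers=persistent and num_workers > 0,  # only valid with workers
        shuffle=True,
    )
    start = time.perf_counter()
    for i, _ in enumerate(loader):
        if i >= n_batches:
            break
    return time.perf_counter() - start

# Grid over the knobs we tuned; video_dataset is assumed to be our Dataset.
for bs, nw, pin, persist in itertools.product([4, 8, 16], [0, 2, 4, 8], [False, True], [False, True]):
    t = time_first_batches(video_dataset, bs, nw, pin, persist)
    print(f"bs={bs} workers={nw} pin={pin} persistent={persist}: {t:.1f}s")
```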

Observations:
1. The script crashed when using 8 or 16 workers.
2. The best combination was 4 workers with a batch size of 4 or 8.
3. No combination significantly reduced the epoch time.

We still hadn’t identified the core bottleneck.

3. Separate Frame-Based DataLoader

Since our previous JPG-based training was fast, we tested extracting 20% of our dataset as individual frames. Extrapolating from those earlier runs, this should have taken 10 – 15 minutes per epoch; it actually took 50 minutes. Frames were slightly faster than video files, but disk I/O was still blocking the process most of the time.

4. Transforms

We then looked at our transforms. We were using custom, LLM-generated transforms instead of the standard equivalents in `pytorchvideo`. Rewriting the dataset to use built-in transforms improved the estimated epoch time slightly, but it remained over 3 hours with low GPU utilization.

We noticed that the DataLoader only prefetches 2 – 4 batches in advance (by default, each worker prefetches `prefetch_factor = 2` batches). While this makes sense for dynamic datasets, our data consists of fixed-size videos with minimal transforms. Since a single sample of 32 frames doesn’t consume much memory, we decided to focus on complete precaching.
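For reference, this lookahead is governed by the DataLoader's `prefetch_factor` argument, which defaults to 2 batches per worker (`dataset` here is a placeholder):

```python
from torch.utils.data import DataLoader

# With 4 workers and the default prefetch_factor of 2, at most
# 4 * 2 = 8 batches are ever decoded ahead of the training loop.
loader = DataLoader(dataset, batch_size=8, num_workers=4, prefetch_factor=2)
```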

The Solution

Cached Dataset

To bypass the limitations of PyTorch DataLoader worker processes, we implemented a single-threaded DataLoader paired with a separate thread that precaches data.

**The Challenge:** To precache the next 50 – 200 batches, we needed to know the sampling order in advance. PyTorch’s default DataLoader doesn’t expose the predecided order for an epoch.

**The Workaround:** We wrote a **custom sampler** that determines and provides the sample order ahead of time.
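A minimal sketch of such a sampler (the class name and the `set_epoch` convention are illustrative, not our exact implementation):

```python
import torch
from torch.utils.data import Sampler

class PredecidedSampler(Sampler):
    """A sampler that fixes the shuffle order up front, so a prefetcher
    can read the indices the DataLoader will request next."""

    def __init__(self, data_source, seed=0):
        self.data_source = data_source
        self.seed = seed
        self.order = None

    def set_epoch(self, epoch):
        # Reshuffle once per epoch; self.order is now visible to the prefetcher.
        g = torch.Generator().manual_seed(self.seed + epoch)
        self.order = torch.randperm(len(self.data_source), generator=g).tolist()

    def __iter__(self):
        if self.order is None:
            self.set_epoch(0)
        return iter(self.order)

    def __len__(self):
        return len(self.data_source)
```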

Overall Architecture:

1. Cache Data Structure: A queue-based structure with a set upper limit on element count.
2. Cache Prefetcher: A background thread that fills the queue with future objects resolved via a thread pool (see the sketch below).
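A minimal sketch of this queue-plus-prefetcher pair, assuming the sampler above exposes its `order` list (names are illustrative):

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from threading import Thread

class CachePrefetcher:
    """Background thread that keeps a bounded queue of in-flight samples,
    following the sample order exposed by the custom sampler."""

    def __init__(self, dataset, order, max_items=200, io_workers=8):
        self.dataset = dataset
        self.order = order                            # indices known ahead of time
        self.cache = queue.Queue(maxsize=max_items)   # bounded: put() blocks when full
        self.pool = ThreadPoolExecutor(max_workers=io_workers)
        self.thread = Thread(target=self._fill, daemon=True)

    def start(self):
        self.thread.start()
        return self

    def _fill(self):
        for idx in self.order:
            future = self.pool.submit(self.dataset.__getitem__, idx)
            self.cache.put(future)                    # back-pressure via maxsize

    def __iter__(self):
        for _ in range(len(self.order)):
            yield self.cache.get().result()           # resolve the future lazily
```

The bounded queue gives natural back-pressure: when the consumer falls behind, the prefetcher simply blocks instead of exhausting RAM.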

Observability:

We needed to monitor how much data was being prefetched without creating a new bottleneck. Our requirements were:

- Minimal changes to existing code.
- Functionality within a separate process.
- Ability to toggle the monitor on/off.

We opted for a simple Pipe-based IPC (Inter-Process Communication) to dump line-delimited JSON samples to a pipe. On the receiver side, we built a small dashboard to display queue occupancy and history.
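A minimal sketch of that monitor, assuming the prefetcher calls an `emit()` helper every few batches (the dashboard itself is omitted):

```python
import json
import multiprocessing as mp

MONITOR_ENABLED = True  # toggle the monitor on/off

def monitor(conn):
    """Receiver process: read line-delimited JSON samples, show occupancy."""
    while True:
        stats = json.loads(conn.recv())
        print(f"queue: {stats['occupancy']}/{stats['capacity']}")

sender, receiver = mp.Pipe()
mp.Process(target=monitor, args=(receiver,), daemon=True).start()

def emit(cache):
    """Called from the prefetcher; `cache` is its bounded queue.Queue."""
    if MONITOR_ENABLED:
        sender.send(json.dumps({"occupancy": cache.qsize(), "capacity": cache.maxsize}))
```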

Impact

The results were significant. Within seconds of starting training, the estimated epoch time stabilized at ~30 minutes, a 6-to-8x reduction. We initially used a cache size of 200 batches, which kept the queue consistently full (195+ entries) but resulted in a high RAM overhead of ~15 GB.

Memory Leaks

Although the speed was consistent, we noticed RAM usage creeping up as training progressed. The process would eventually crash overnight. Since the crashes didn’t happen at a predictable location or during the first epoch, we suspected a memory leak.

Finding the Leak

**Attempt 1:** We measured the total memory used by the Python process. The data confirmed a slow, steady leak.
**Attempt 2:** We added manual `gc.collect()` calls every N batches. This helped slightly, but consumption still rose.
**Attempt 3:** We tracked object ages using `weakref`. This revealed that the number of tensors in memory was constantly increasing. Specifically, during the Resize and Crop transforms, the original (pre-transform) tensor was being retained in the cache.
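A sketch of the kind of `weakref`-based tracking we mean (simplified; real object-age tracking would also record when each tensor was first seen):

```python
import gc
import weakref

import torch

_live = weakref.WeakSet()  # entries vanish automatically when a tensor is freed

def count_live_tensors():
    """Register every GC-tracked tensor, then report how many are still alive."""
    for obj in gc.get_objects():
        try:
            if isinstance(obj, torch.Tensor):
                _live.add(obj)
        except Exception:
            pass  # some GC-tracked objects misbehave under isinstance
    return len(_live)

# Logged every few batches; a count that only ever rises points at a leak.
```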

**The Fix:** Instead of creating new copies during transformation, we used permuted views and passed those to the `pytorchvideo` transforms.
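A minimal illustration of the view-versus-copy distinction (shapes and the leaky variant are illustrative):

```python
import torch

# Frames decoded as (T, H, W, C) uint8, e.g. 32 frames of 256x256 RGB.
video_thwc = torch.randint(0, 256, (32, 256, 256, 3), dtype=torch.uint8)

# Leaky variant: .contiguous() materializes a full copy per sample while
# the cache still holds a reference to the original tensor.
# video_cthw = video_thwc.permute(3, 0, 1, 2).contiguous()

# Fix: permute() alone returns a zero-copy view over the same storage,
# which is then handed to the pytorchvideo transforms.
video_cthw = video_thwc.permute(3, 0, 1, 2)  # (C, T, H, W) view
```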

Reducing Cache Size

We eventually reduced the cache size to 50 batches. This had no negative impact on training speed but made the memory footprint negligible.
