Optimizing PyTorch VRAM Usage for McByte tracker: From 16GB to Under 8GB

Optimizing McByte tracker GPU memory usage by tackling large matrix operations in the get_similarity function

The Problem

We are building an object tracking system with the goal of running in real time. Our trackers are typically layered on top of existing ones such as McByte, which relies on a similarity computation at its core. This computation matches query keys from the current frame against memory keys accumulated from past frames, producing similarity scores that indicate how closely different regions of the frame match what the tracker has seen before.

This similarity function is executed twice per frame and is extremely memory-hungry. In its current form, it consumes roughly 16 GB of VRAM. This is not a theoretical upper bound — in practice, the system fails to run on any GPU with less than 16 GB of memory. For a real-time pipeline, this leaves no room for the rest of the model, intermediate buffers, or downstream processing.

The high memory usage comes primarily from large intermediate tensors created during the similarity computation, where features from the current frame are compared against a growing memory bank. While this approach works well from an accuracy standpoint, it makes the system impractical to deploy on commonly available GPUs.

Our objective was therefore straightforward: reduce VRAM usage to under 8 GB without giving up a meaningful amount of speed, while keeping the overall tracking behavior unchanged.

Iteration 0: Baseline VRAM Usage (16 GB)

Below is the baseline similarity function as it existed originally. For clarity, non-essential handling (argument normalization, optional branches that don’t affect allocation behavior) has been removed. What remains are the operations that dominate VRAM allocation.

import math

def get_similarity(mk, ms, qk, qe):
    CK = mk.shape[1]

    mk = mk.flatten(start_dim=2)          # [B, C, M]
    qk = qk.flatten(start_dim=2)          # [B, C, Q]
    qe = qe.flatten(start_dim=2)          # [B, C, Q]

    mk = mk.transpose(1, 2)               # [B, M, C]

    a_sq   = mk.pow(2) @ qe               # [B, M, Q]
    two_ab = 2 * (mk @ (qk * qe))         # [B, M, Q]
    b_sq   = (qe * qk.pow(2)).sum(1, keepdim=True)   # [B, 1, Q]

    similarity = -a_sq + two_ab - b_sq    # [B, M, Q]
    similarity = similarity * ms / math.sqrt(CK)   # ms broadcasts as [B, M, 1]

    return similarity

What’s happening here

  • The function computes dense similarity scores between all memory locations (M) and all query locations (Q).
  • Each matrix multiplication (@) produces a full [B, M, Q] tensor.
  • Intermediate tensors such as mk.pow(2), qk * qe, a_sq, two_ab, and b_sq are each allocated separately on the GPU.

Because PyTorch eagerly materializes these intermediates and keeps them alive until the function exits, peak memory usage balloons quickly. With realistic M and Q sizes, this pushes total VRAM consumption to ~16 GB, causing immediate OOM failures on GPUs with smaller memory budgets.
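
To make the footprint concrete, a small measurement harness along these lines records the peak allocation of a single call. The tensor sizes below are illustrative placeholders, not the actual sizes used in the tracker, which depend on frame resolution and on how many past frames the memory bank holds:

import torch

# Illustrative, scaled-down sizes -- the real M and Q are much larger.
B, C, M, Q = 1, 64, 30_000, 10_000

mk = torch.randn(B, C, M, device="cuda")   # memory keys
ms = torch.randn(B, M, 1, device="cuda")   # memory shrinkage term
qk = torch.randn(B, C, Q, device="cuda")   # query keys
qe = torch.randn(B, C, Q, device="cuda")   # query selection

torch.cuda.reset_peak_memory_stats()
similarity = get_similarity(mk, ms, qk, qe)
torch.cuda.synchronize()
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")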

Result:

Stable and correct, but memory usage is far beyond what a real-time pipeline can tolerate.

Iteration 1: The Cache-Clearing Approach (12GB VRAM, 20% Slower)

Our first instinct was to manually clean up GPU memory. We added explicit del statements for intermediate tensors and forced cache eviction using torch.cuda.empty_cache():

def get_similarity(...):
    torch.cuda.empty_cache()  # clear at the start

    # ... similarity computation ...

    del a_sq
    del two_ab
    # ... more cleanup ...

    torch.cuda.empty_cache()  # clear at the end
    return similarity

This did reduce peak VRAM usage to under 12GB, which initially looked promising. However, it introduced two major issues.

First, performance regressed noticeably. The function became about 20% slower, largely due to the overhead of repeatedly clearing and rebuilding CUDA memory caches.

Second, memory usage became unstable. Instead of maintaining a steady footprint, VRAM usage oscillated sharply — tensors would allocate up to the peak, get cleared, then immediately reallocate on the next operation. PyTorch’s allocator ended up doing more work, not less.

In effect, memory usage followed a sawtooth pattern: spike, flush, spike again — an inefficient cycle that hurt both performance and predictability.
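
A quick way to observe this (sketch only, reusing the inputs from the measurement snippet above) is to print the allocator's counters around the call. memory_allocated tracks bytes held by live tensors, while memory_reserved tracks what the caching allocator has claimed from the driver:

import torch

def report(tag):
    alloc = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{tag:>20}: allocated={alloc:.2f} GB, reserved={reserved:.2f} GB")

report("before frame")
similarity = get_similarity(mk, ms, qk, qe)   # allocations spike here
report("after similarity")

del similarity
torch.cuda.empty_cache()                      # hands cached blocks back to the driver
report("after empty_cache")
# On the next frame the same blocks must be requested from the driver all
# over again -- the sawtooth described above, paid for in extra latency.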

Result:

~12GB VRAM (unstable) with a ~20% performance regression.

Iteration 2: In-Place Operations (Under 8GB VRAM, 13% Faster)

Instead of fighting PyTorch’s memory allocator, we decided to work with it. The key insight was simple: in-place operations reuse existing memory instead of allocating new tensors.

Below is the optimized version of the similarity function, rewritten to aggressively minimize intermediate allocations:

import math
from typing import Optional

import torch

def get_similarity(mk: torch.Tensor,
                   ms: Optional[torch.Tensor],
                   qk: torch.Tensor,
                   qe: Optional[torch.Tensor],
                   add_batch_dim: bool = False) -> torch.Tensor:

    if add_batch_dim:
        mk = mk.unsqueeze(0)
        ms = ms.unsqueeze(0) if ms is not None else None
        qk = qk.unsqueeze(0)
        qe = qe.unsqueeze(0) if qe is not None else None

    CK = mk.shape[1]
    mk = mk.flatten(start_dim=2)
    ms = ms.flatten(start_dim=1).unsqueeze(2) if ms is not None else None
    qk = qk.flatten(start_dim=2)
    qe = qe.flatten(start_dim=2) if qe is not None else None

    if qe is not None:
        mk_T = mk.transpose(1, 2)

        # Compute smaller tensors first
        b_sq = (qe * qk.pow(2)).sum(1, keepdim=True)

        # Fuse elementwise ops
        qk_qe = qk.mul(qe)

        # Build similarity tensor once, then modify in-place
        similarity = torch.matmul(mk_T, qk_qe)
        similarity.mul_(2)
        similarity.sub_(torch.matmul(mk_T.pow(2), qe))
        similarity.sub_(b_sq)

    else:
        similarity = torch.matmul(mk.transpose(1, 2), qk)
        similarity.mul_(2)
        a_sq = mk.pow(2).sum(1).unsqueeze(2)
        similarity.sub_(a_sq)

    if ms is not None:
        similarity.mul_(ms)

    similarity.div_(math.sqrt(CK))
    return similarity
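
For reference, a usage sketch with illustrative shapes (the real key dimensions depend on the backbone and the memory-bank length): with add_batch_dim=True the function accepts unbatched keys, and ms and qe are only used if provided.

# Illustrative shapes: C=64 channels, 10 memory frames, 30x54 feature maps.
mk = torch.randn(64, 10, 30, 54, device="cuda")   # memory keys
ms = torch.randn(10, 30, 54, device="cuda")       # memory shrinkage term
qk = torch.randn(64, 1, 30, 54, device="cuda")    # query keys
qe = torch.randn(64, 1, 30, 54, device="cuda")    # query selection

similarity = get_similarity(mk, ms, qk, qe, add_batch_dim=True)
print(similarity.shape)                           # [1, M, Q] = [1, 16200, 1620]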

What changed

1. In-place operations everywhere

Operations like:

  • similarity.mul_(2)
  • similarity.sub_(...)
  • similarity.div_(...)

modify the existing tensor instead of allocating a new one. This alone eliminated several large temporary tensors that previously overlapped in memory.
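
The effect is easy to verify in isolation. A minimal sketch (tensor size is arbitrary) comparing an out-of-place multiply with its in-place counterpart:

import torch

x = torch.randn(4096, 4096, device="cuda")           # ~64 MB of float32

torch.cuda.reset_peak_memory_stats()
y = x * 2                                             # out-of-place: allocates a second 64 MB buffer
print(torch.cuda.max_memory_allocated() / 1024**2)   # peak ~128 MB

del y
torch.cuda.reset_peak_memory_stats()
x.mul_(2)                                             # in-place: rewrites x's own storage
print(torch.cuda.max_memory_allocated() / 1024**2)   # peak stays at ~64 MB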

2. Fewer intermediate tensors

Elementwise operations were fused where possible (for example, qk.mul(qe)), reducing the number of short-lived allocations that contribute to peak VRAM usage.

3. Ordering computations by size

Smaller tensors such as b_sq are computed before the large matrix multiplications. This helps keep peak memory lower by avoiding overlap between multiple large intermediates.
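
A rough back-of-the-envelope comparison (sizes assumed purely for illustration) shows the scale difference: the small terms and their temporaries are negligible next to a single [B, M, Q] product, so computing them while no large intermediate is alive keeps them from stacking on top of the peak.

# Back-of-the-envelope sizes, purely illustrative.
B, C, M, Q = 1, 64, 60_000, 20_000
bytes_f32 = 4

big   = B * M * Q * bytes_f32 / 1024**3   # one [B, M, Q] tensor: ~4.5 GB
small = B * 1 * Q * bytes_f32 / 1024**2   # b_sq at [B, 1, Q]: well under 1 MB
print(f"[B, M, Q] ~ {big:.1f} GB, b_sq ~ {small:.2f} MB")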

Outcome

  • Peak VRAM usage dropped below 8GB, and remained stable.
  • Performance improved by ~13% compared to the baseline.

Avoiding unnecessary allocations not only reduced memory pressure, but also lowered allocator overhead and improved kernel scheduling — a win on both fronts.
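
For reproducibility, a minimal benchmarking sketch along these lines can measure both latency and peak VRAM on identical inputs. Here get_similarity_baseline is a placeholder name for the Iteration 0 implementation, not an actual function in the codebase:

import time
import torch

def benchmark(fn, *args, warmup=3, iters=20):
    # Warm-up runs so one-time allocator and kernel-launch costs don't skew timing.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    ms_per_call = (time.perf_counter() - start) * 1000 / iters
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return ms_per_call, peak_gb

# get_similarity_baseline is a placeholder for the Iteration 0 version.
# print(benchmark(get_similarity_baseline, mk, ms, qk, qe))
# print(benchmark(get_similarity, mk, ms, qk, qe))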

The Results

Iteration         VRAM Usage   Performance       Stability
Baseline          16GB         Baseline (100%)   Stable
Manual Cleanup    ~12GB        20% slower        Unstable (oscillating)
In-place Ops      <8GB         13% faster        Stable

Not only did we hit our memory target, but we also ended up with a measurable performance gain. The most “obvious” optimization—manually clearing memory—turned out to be the wrong one.

Key Takeaways

Measure, don’t assume.
Our initial fix (manual cleanup) looked reasonable but actually made things worse. Memory optimizations should always be backed by profiling and benchmarks, not intuition.
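
In practice that means reaching for the profiler before reaching for del. A minimal sketch using torch.profiler with memory tracking enabled (inputs as in the earlier measurement snippet):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    get_similarity(mk, ms, qk, qe)

# Rank ops by the CUDA memory they allocate themselves; the large matmuls
# and the pow/mul intermediates surface at the top of the table.
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))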

Conclusion

Reducing VRAM usage by more than half while improving performance wasn’t about clever tricks or aggressive cleanup. It came from understanding how PyTorch allocates and reuses GPU memory, and then structuring the code to work with that model instead of fighting it.

If there’s one lesson here, it’s this: GPU memory optimization usually starts before del statements and cache clearing. Profiling where memory is actually spent often leads to simpler, more stable solutions—like writing operations that avoid unnecessary allocations in the first place.
