Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation

1Shanghai AI Laboratory, OpenGVLab, 2Snap Inc.
*First author, Co-corresponding authors
[Teaser figure]

We addressed velocity ambiguity in video frame interpolation, leading to:
Clearer anytime frame interpolation & Manipulated interpolation of anything.


Clearer anytime frame interpolation


When our plug-and-play training strategies ([D,R]) are integrated into state-of-the-art learning-based models such as RIFE [1], IFRNet [2], AMT [3], and EMA-VFI [4], these models produce markedly sharper outputs and superior perceptual quality for arbitrary-time interpolation.
(Here, we use RIFE as an illustrative example, generating 128 interpolated frames from just two input images.)

Manipulated interpolation of anything



Additionally, our strategies enable independent temporal manipulation of each object at inference time, offering a novel tool for video editing tasks such as re-timing. Our app also accepts a video as input for classic video frame interpolation.


Abstract

Existing video frame interpolation (VFI) methods blindly predict where each object is at a specific timestep t ("time indexing"), which struggles to predict precise object movements. Given two images of a ⚾, there are infinitely many possible trajectories: accelerating or decelerating, straight or curved. This often results in blurry frames as the method averages out these possibilities. Instead of forcing the network to learn this complicated time-to-location mapping implicitly together with predicting the frames, we provide the network with an explicit hint on how far the object has traveled between start and end frames, a novel approach termed "distance indexing". This method offers a clearer learning goal for models, reducing the uncertainty tied to object speeds. We further observed that, even with this extra guidance, objects can still be blurry especially when they are equally far from both input frames (i.e., halfway in-between), due to the directional ambiguity in long-range motion. To solve this, we propose an iterative reference-based estimation strategy that breaks down a long-range prediction into several short-range steps. When integrating our plug-and-play strategies into state-of-the-art learning-based models, they exhibit markedly sharper outputs and superior perceptual quality in arbitrary time interpolations, using a uniform distance indexing map in the same format as time indexing. Additionally, distance indexing can be specified pixel-wise, which enables temporal manipulation of each object independently, offering a novel tool for video editing tasks like re-timing.

Velocity ambiguity problem

The current mainstream approach to arbitrary-time frame interpolation, denoted \(\mathcal{F}\), predicts the target frame \(I_t\) from the starting frame \(I_0\), the ending frame \(I_1\), and a time index \(t\): $$I_t = \mathcal{F}\left(I_0, I_1, t\right)$$ However, the unknown motion velocity of each independently moving object introduces "velocity ambiguity": the same inputs can map to multiple different target locations: $$\left\{I_t^1, I_t^2, \ldots, I_t^n\right\} = \mathcal{F}(I_0, I_1, t)$$ Taking a ⚾ as an example, there are countless potential mid-air positions for the ball, which creates conflicting supervision during training. In short, the algorithm cannot discern which scenario it should learn, so it settles for an average state: $$ \hat{I}_t = \mathbb{E}_{I_t \sim \mathcal{F}(I_0, I_1, t)}[I_t] $$ As a result, the prediction \(\hat{I}_t\) is blurry at test time:

[Figure: Velocity ambiguity]
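To make the ambiguity concrete, here is a small, purely illustrative sketch (not from the paper): a 1-D ball that starts at x = 0 and ends at x = 1 under different constant accelerations passes through different midpoints, so a model regressing the conditional mean collapses all of them into a blurry average.

```python
import numpy as np

# Toy illustration of velocity ambiguity (1-D ball, constant acceleration).
# Every trajectory starts at x(0) = 0 and ends at x(1) = 1, so the two input
# frames are identical, yet the position at t = 0.5 differs per trajectory.
def position(t, a):
    v0 = 1.0 - 0.5 * a          # initial velocity chosen so that x(1) = 1
    return v0 * t + 0.5 * a * t * t

accelerations = np.linspace(-1.5, 1.5, 7)            # unknown but equally plausible motions
midpoints = np.array([position(0.5, a) for a in accelerations])

print(midpoints)         # spread of valid answers, from 0.3125 to 0.6875
print(midpoints.mean())  # 0.5: the "average state" a time-indexed model regresses toward
```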


Disambiguation

To resolve velocity ambiguity, a new paradigm for index-based learning is required: we need to give the algorithm an explicit hint about how each object moves, rather than leaving it to guess: $$I_{t} = \mathcal{F}\left(I_0, I_1, \text{motion hint}\right)$$

Distance indexing

In fact, training with time indexing requires the algorithm not only to learn how to interpolate frames but also to guess the mapping from time to position, denoted \(\mathcal{D}\): $$I_t = \mathcal{F}(I_0, I_1, t) \to I_t = \mathcal{F}(I_0, I_1, \mathcal{D}(t))$$ Our solution computes a path distance ratio map \(D_t\) that replaces the time index \(t\) for index-based learning: $$I_t = \mathcal{F}(I_0, I_1, \mathcal{D}(t)) \to I_t = \mathcal{F}(I_0, I_1, D_t)$$

[Figure: Distance indexing]

We first compute the optical flows from \(I_0\) to \(I_t\) and from \(I_0\) to \(I_1\), denoted \(\mathbf{V}_{0\to t}\) and \(\mathbf{V}_{0\to 1}\), respectively. Then, for each pixel location \((x,y)\), we calculate the proportion of \(\mathbf{V}_{0\to t}\) projected onto \(\mathbf{V}_{0\to 1}\), termed the "path distance ratio": $$D_t(x,y) = \frac{\left\Vert \mathbf{V}_{0\to t}(x,y)\right\Vert \cos{\theta}}{\left\Vert \mathbf{V}_{0\to 1}(x,y) \right\Vert}$$ With \(D_t\), the algorithm avoids the ambiguous time-to-position mappings caused by varying velocities during training, allowing for clearer predictions at test time. Importantly, even though the exact \(D_t\) cannot be computed from ground-truth labels at inference time, supplying a uniform index map in the same format as time indexing, i.e., \(D_t = t\) (which simulates uniform motion of objects), still lets the algorithm predict clearer images.
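A minimal sketch of this computation follows; the function name and the use of NumPy arrays of shape (H, W, 2) from an off-the-shelf optical flow estimator are assumptions for illustration, not the released code.

```python
import numpy as np

def path_distance_ratio(flow_0t: np.ndarray, flow_01: np.ndarray,
                        eps: float = 1e-6) -> np.ndarray:
    """Per-pixel path distance ratio D_t = |V_0->t| * cos(theta) / |V_0->1|.

    flow_0t, flow_01: optical flow fields of shape (H, W, 2), e.g. from an
    off-the-shelf flow estimator. Returns an (H, W) map where 0 means the
    pixel is still at its I_0 position and 1 means it has reached I_1.
    """
    # The scalar projection of V_0->t onto V_0->1, divided by |V_0->1|,
    # simplifies to (V_0->t . V_0->1) / |V_0->1|^2.
    dot = np.sum(flow_0t * flow_01, axis=-1)
    denom = np.sum(flow_01 * flow_01, axis=-1) + eps
    return np.clip(dot / denom, 0.0, 1.0)   # clipping to [0, 1] is a simplification
```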


Iterative reference-based estimation

While path distance indexing helps us sidestep ambiguity in speed, it does not resolve directional ambiguity. We apply a classic divide-and-conquer strategy to minimize its impact and further improve the predicted outcome.
In practice, we break a long-range inference down into a series of short-range inferences from near to far, using the previous prediction, together with the starting and ending frames, as a reference to avoid accumulated errors: $$I_t = \mathcal{F}(I_0, I_1, D_t, I_{\text{ref}}, D_{\text{ref}})$$ For example, dividing the inference into two steps looks like: $$I_{t/2} = \mathcal{F}(I_0, I_1, D_{t/2}, I_{0}, D_{0})$$ $$I_{t} = \mathcal{F}(I_0, I_1, D_{t}, I_{t/2}, D_{t/2})$$ Again taking a ⚾ as an example, our proposed strategies are illustrated in the following figure:

[Figure: Disambiguation]
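Below is a minimal sketch of the iterative scheme with uniform distance maps (\(D_t = t\)); the `model(I0, I1, D_t, I_ref, D_ref)` callable and the NumPy tensor layout are placeholders for whatever reference-based interpolator is used, not the released API.

```python
import numpy as np

def iterative_interpolate(model, I0, I1, t, num_steps=2):
    """Break one long-range prediction at distance index t into short steps.

    `model(I0, I1, D_t, I_ref, D_ref)` is assumed to return the frame at the
    pixel-wise distance map D_t, conditioned on a reference frame/index pair.
    With num_steps=2 this reproduces the two-step example above.
    """
    h, w = I0.shape[:2]
    I_ref = I0
    D_ref = np.zeros((h, w), dtype=np.float32)                       # D_0 = 0 at the start frame
    for k in range(1, num_steps + 1):
        D_t = np.full((h, w), t * k / num_steps, dtype=np.float32)   # uniform map D = t * k / n
        I_ref = model(I0, I1, D_t, I_ref, D_ref)                     # short step from the previous result
        D_ref = D_t
    return I_ref
```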


Editable interpolation

Beyond supplying a uniform index map as in time indexing, we can also exploit the fact that the path distance index is an editable 2D, pixel-wise map. We first obtain masks for objects of interest using the Segment Anything Model (SAM) [5], and then customize the path distance curve for each object region to achieve manipulated interpolation of anything.

[Figure: Manipulation]
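As a hedged sketch of how per-object re-timing could be expressed (the mask source, the ease-in curve, and the helper names are illustrative assumptions, not the actual tool):

```python
import numpy as np

def build_distance_map(t, masks, curves, shape):
    """Compose a pixel-wise distance index map for timestep t in [0, 1].

    masks:  boolean (H, W) arrays, e.g. object masks produced by SAM.
    curves: one callable per mask, mapping t to a distance ratio in [0, 1];
            pixels outside every mask fall back to uniform motion (D_t = t).
    """
    D_t = np.full(shape, t, dtype=np.float32)   # background: uniform motion
    for mask, curve in zip(masks, curves):
        D_t[mask] = curve(t)                    # custom re-timing inside each object mask
    return D_t

# Hypothetical usage: keep one object frozen until t = 0.5, then let it catch up.
ease_in = lambda t: max(0.0, 2.0 * t - 1.0)
# D_half = build_distance_map(0.5, [ball_mask], [ease_in], ball_mask.shape)
```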


Ending

This work presents the next generation of video frame interpolation technology, aiming to inspire readers and contribute to fields like video enhancement, editing, and generation! 🔥🔥🔥
Feel free to 🌟 this project and follow the author's GitHub.



BibTeX

@article{zhong2023clearer,
    title={Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation},
    author={Zhong, Zhihang and Krishnan, Gurunandan and Sun, Xiao and Qiao, Yu and Ma, Sizhuo and Wang, Jian},
    journal={arXiv preprint arXiv:2311.08007},
    year={2023}}

Acknowledgements

We thank Dorian Chan, Zhirong Wu, and Stephen Lin for their insightful feedback and advice. Our thanks also go to Vu An Tran for developing the web application, and to Wei Wang for coordinating the user study.

Reference

[1] Huang, Zhewei, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou. "Real-time intermediate flow estimation for video frame interpolation." In European Conference on Computer Vision, pp. 624-642. Cham: Springer Nature Switzerland, 2022.
[2] Kong, Lingtong, Boyuan Jiang, Donghao Luo, Wenqing Chu, Xiaoming Huang, Ying Tai, Chengjie Wang, and Jie Yang. "Ifrnet: Intermediate feature refine network for efficient frame interpolation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1969-1978. 2022.
[3] Li, Zhen, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. "AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9801-9810. 2023.
[4] Zhang, Guozhen, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, and Limin Wang. "Extracting motion and appearance via inter-frame attention for efficient video frame interpolation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5682-5692. 2023.
[5] Kirillov, Alexander, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. "Segment Anything." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015-4026. 2023.