
Consistency-Preserving Text-Based One-Shot Video Tuning with Segmentation Mask Guidance

EasyChair Preprint 14969

6 pages · Date: September 21, 2024

Abstract

Recent text-to-video (T2V) techniques have achieved remarkable success by building on the text-to-image (T2I) diffusion-based generation paradigm. These methods have also been extended to tune video content with an existing T2I model in a one-shot manner. However, such models still struggle to preserve temporal consistency and tend to produce severe jitter, especially for moving objects. To address this issue, we propose incorporating segmentation guidance into the diffusion pipeline to promote temporal stability. Specifically, we first extract the positions of user-specified objects in each frame using an object segmentation model and generate a sequence of mask images. We then use the features of the mask image sequence as the query in the cross-attention mechanism of the diffusion model, while the content features of the original video serve as the key and value, to generate the edited image sequence. In this way, the object position information carried by the mask guidance effectively steers the video generation process and reduces jitter. Experiments demonstrate that our method improves video quality over prior video tuning methods in terms of temporal smoothness.
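The following is a minimal sketch of the mask-guided cross-attention described above: mask-sequence features form the query, and original-video content features form the key and value. The tensor names, dimensions, and module structure are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskGuidedCrossAttention(nn.Module):
    """Cross-attention where segmentation-mask features query video content features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)  # query: mask image sequence features
        self.to_k = nn.Linear(dim, dim, bias=False)  # key: original video content features
        self.to_v = nn.Linear(dim, dim, bias=False)  # value: original video content features
        self.to_out = nn.Linear(dim, dim)

    def forward(self, mask_feats: torch.Tensor, content_feats: torch.Tensor) -> torch.Tensor:
        # mask_feats:    (batch, n_mask_tokens, dim)    -- flattened mask-sequence features
        # content_feats: (batch, n_content_tokens, dim) -- flattened original-frame features
        b, n_q, d = mask_feats.shape
        h = self.num_heads
        q = self.to_q(mask_feats).view(b, n_q, h, d // h).transpose(1, 2)
        k = self.to_k(content_feats).view(b, -1, h, d // h).transpose(1, 2)
        v = self.to_v(content_feats).view(b, -1, h, d // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, n_q, d)
        return self.to_out(out)

# Example with hypothetical sizes: a 16-frame clip, 64 tokens per frame, feature width 320.
mask_feats = torch.randn(1, 16 * 64, 320)
content_feats = torch.randn(1, 16 * 64, 320)
edited_feats = MaskGuidedCrossAttention(dim=320)(mask_feats, content_feats)
```

Because the query is derived from the per-frame object masks, attention weights follow the object's position across frames, which is the intended mechanism for suppressing jitter in the edited sequence.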

Keyphrases: one-shot video tuning, segmentation guidance, text-to-video generation

BibTeX entry
BibTeX does not have an entry type for preprints; the following workaround produces the correct reference:
@booklet{EasyChair:14969,
  author    = {Yue Zhuo and Chunzhi Gu and Shigeru Kuriyama},
  title     = {Consistency-Preserving Text-Based One-Shot Video Tuning with Segmentation Mask Guidance},
  howpublished = {EasyChair Preprint 14969},
  year      = {EasyChair, 2024}}