Consistency-Preserving Text-Based One-Shot Video Tuning with Segmentation Mask Guidance

EasyChair Preprint 14969
6 pages • Date: September 21, 2024

Abstract

Recent text-to-video (T2V) techniques have achieved remarkable success by building on the text-to-image (T2I) diffusion-based generation paradigm. These methods have also been extended to tune video content with an existing T2I model in a one-shot manner. However, such models still struggle to preserve temporal consistency and tend to produce severe jitter, especially for moving objects. To address this issue, we propose incorporating segmentation guidance into the diffusion pipeline to promote temporal stability. Specifically, we first extract the positions of user-specified objects in each frame with an object segmentation model and generate a sequence of mask images. We then use the features of the mask image sequence as the query for the cross-attention mechanism in the diffusion model, while the content features of the original video serve as the key and value to generate the edited image sequence. In this way, the object position information in the mask guidance effectively steers the video generation process and reduces jitter. Experiments demonstrate that our method improves video quality over prior video tuning methods in terms of temporal smoothness.

Keyphrases: one-shot video tuning, segmentation guidance, text-to-video generation
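The abstract describes a cross-attention variant in which features of the segmentation-mask sequence act as the query while content features of the original video supply the key and value. The sketch below (PyTorch) illustrates that attention pattern only; the module name, tensor shapes, and projection layout are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of mask-guided cross-attention, assuming per-frame token
# features for both the mask sequence and the source video. Names and shapes
# are hypothetical; the paper's actual implementation is not shown here.
import torch
import torch.nn as nn


class MaskGuidedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)  # query from mask-sequence features
        self.to_k = nn.Linear(dim, dim, bias=False)  # key from source-video content features
        self.to_v = nn.Linear(dim, dim, bias=False)  # value from source-video content features
        self.to_out = nn.Linear(dim, dim)

    def forward(self, mask_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # mask_feats:  (batch*frames, mask_tokens, dim)  features of the mask image sequence
        # video_feats: (batch*frames, video_tokens, dim) content features of the original video
        b, n, d = mask_feats.shape
        h = self.num_heads
        q = self.to_q(mask_feats).view(b, n, h, d // h).transpose(1, 2)
        k = self.to_k(video_feats).view(b, -1, h, d // h).transpose(1, 2)
        v = self.to_v(video_feats).view(b, -1, h, d // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # attention scores over video tokens
        out = attn.softmax(dim=-1) @ v                     # aggregate video content per mask token
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)  # fed back into the diffusion denoiser in place of standard cross-attention
```

Under these assumptions, the object-position information carried by the mask queries determines where video content is attended to in each frame, which is the mechanism the abstract credits with reducing jitter.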