MeDM utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observed-space scores in latent-space Diffusion Models. Notably, MeDM does not require fine-tuning or test-time optimization of the Diffusion Models.
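The core of this coding is simple to sketch: pixels linked by optical flow across frames form a trajectory, and the least-squares assignment that makes the per-frame estimates on a trajectory agree is their mean. Below is a minimal NumPy sketch of such a trajectory-averaging step, under our own simplifying assumptions; the function name, the trajectory-id representation, and the occlusion handling are illustrative, not the paper's actual implementation.

```python
import numpy as np

def mediate_frames(frames, traj_ids, occluded):
    """Enforce temporal consistency as a closed-form least-squares step.

    frames:   (T, H, W, C) per-frame denoised pixel estimates
    traj_ids: (T, H, W) integer id of the optical-flow trajectory each
              pixel belongs to (pixels linked by flow share an id)
    occluded: (T, H, W) boolean mask; occluded pixels keep their own value

    The least-squares solution that makes all pixels on a trajectory agree
    is the mean of their current estimates, so we scatter-add values per
    trajectory and read the averages back.
    """
    T, H, W, C = frames.shape
    n_traj = int(traj_ids.max()) + 1
    sums = np.zeros((n_traj, C))
    counts = np.zeros(n_traj)

    visible = ~occluded
    ids = traj_ids[visible]                   # (N,) trajectory id per visible pixel
    np.add.at(sums, ids, frames[visible])     # accumulate values per trajectory
    np.add.at(counts, ids, 1.0)

    means = sums / np.maximum(counts, 1.0)[:, None]
    out = frames.copy()
    out[visible] = means[traj_ids[visible]]   # broadcast the means back
    return out
```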
MeDM can efficiently render high-quality videos solely from 3D assets, including optical flows, occlusions, and position information (depth and normal maps). We use lineart derived from the normal maps as the input condition to ControlNet. 3D assets are from MPI Sintel.
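As a concrete illustration of this conditioning setup, the sketch below drives a single frame through a lineart ControlNet with the diffusers library. The checkpoint names, file path, and prompt are placeholders, and MeDM's cross-frame mediation step is omitted; this shows only the per-frame conditioning.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Lineart-conditioned ControlNet; checkpoint names are illustrative choices.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_lineart", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# A lineart map extracted from the frame's normal G-buffer (placeholder path).
lineart = load_image("frame_0001_lineart.png")

frame = pipe(
    "a snowy mountain village, cinematic lighting",  # hypothetical prompt
    image=lineart,
    num_inference_steps=20,
).images[0]
```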
In addition, MeDM can start by adding noise to pre-rendered videos and perform the denoising process from the 0.5T step, following SDEdit. The generated video stays close to the original animation while incorporating the realistic prior from the pre-trained DM (the difference is especially noticeable in complex textures, such as hair). More quantitative results, including comparisons with prior works, can be found in our paper.
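For a single frame, this SDEdit step maps directly onto the standard diffusers img2img pipeline, where strength=0.5 corresponds to noising up to the 0.5T step. The sketch below covers only the per-frame step; in MeDM the frame-wise denoising is additionally mediated across frames, and the model name and path here are placeholders.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

rendered = load_image("rendered_frame.png")  # a pre-rendered animation frame

# strength=0.5 noises the frame to the 0.5T step and denoises from there,
# keeping the layout while injecting the model's realistic texture prior.
frame = pipe(
    "a photorealistic scene",  # hypothetical prompt
    image=rendered,
    strength=0.5,
    num_inference_steps=50,
).images[0]
```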
MeDM also performs well without high-precision optical flows. We demonstrate this by applying text-guided video editing to real-world videos from DAVIS 2016.
Finally, we demonstrate the versatility of MeDM. For example, MeDM can perform video anonymization out of the box. We leverage the fact that human visual perception is remarkably sensitive to human faces, while our ability to detect and recognize other objects is far less specialized. We add noise to a video with a strength of 0.5T, which is strong enough to erase the identity while preserving other objects and the background scene, and then denoise with MeDM to obtain the anonymized video. Text conditioning can also be injected to enable more targeted identity modification. Celebrity videos are from CelebV-HQ.
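The anonymization recipe is the same SDEdit step with a prompt steering the replacement identity. Reusing the img2img pipeline from the earlier sketch (the prompt and file name are hypothetical):

```python
# Reusing `pipe` from the img2img sketch above.
anonymized = pipe(
    "a person with short dark hair",   # steers the replacement identity
    image=load_image("celebv_frame.png"),
    strength=0.5,                      # 0.5T noise: erases the face, keeps the scene
    num_inference_steps=50,
).images[0]
```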