Anonymous Rebuttal

Figure R1. Scaling Diffusion Forcing Transformer and History Guidance. We showcase the video generation capabilities of our larger DFoT model, obtained by fine-tuning the Wan2.1 T2V-1.3B model, which was never trained to condition on images. With only 20k steps of fine-tuning under limited resources, the model can already generate image-conditioned videos (traditionally difficult at this model size) with varying history lengths. These results validate the scalability and effectiveness of our approach, including long video generation, flexible-length history conditioning, and improved quality and consistency in more dynamic, complex, and diverse scenes at a higher resolution of 832x480.

(a) Diffusion Forcing Transformer's long video generation capability scales and generalizes well. We present six 217-frame videos generated by Diffusion Forcing Transformer via 5x sliding window rollout, given a single random image sourced from the web (not in the training data) and a text prompt.


(b) History Guidance improves the quality and consistency of the generated videos. We compare frame interpolation results, given two frames (first and last) taken from random web videos and a text prompt, with (top) and without (bottom) History Guidance.

Figure R2. History Guidance efficiently utilizes the sampling compute budget. We compare the FVD of videos generated with the history guidance variants (HG-v and HG-f) against videos generated without HG, across varying sampling compute budgets, measured by the total number of function evaluations (NFE). Under the same sampling budget, HG-v and HG-f achieve significantly better (lower) FVD than w/o HG. This suggests that although history guidance requires multiple forward passes per sampling timestep (HG-v: 2, HG-f: 3), this cost can be effectively compensated for by reducing the number of sampling timesteps. The experiment setup is identical to Section 6.3.
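
To make the budget accounting concrete, below is a minimal sketch of how a guidance variant fits into a fixed NFE budget, assuming the usual classifier-free-guidance-style combination of a history-conditioned and a history-free prediction. The function names, toy denoiser, and guidance weight are illustrative stand-ins, not our released implementation.

```python
import numpy as np

def denoise(x_t, history):
    """Toy stand-in for one network forward pass (one function evaluation)."""
    return x_t * 0.9  # a real model would also condition on `history`

def hg_vanilla_step(x_t, history, w=1.5):
    """One HG-v sampling step (2 forward passes): extrapolate from the
    history-free prediction toward the history-conditioned one."""
    pred_cond = denoise(x_t, history)   # history frames given as clean context
    pred_uncond = denoise(x_t, None)    # history dropped
    return pred_uncond + w * (pred_cond - pred_uncond)

x_t = np.random.rand(8, 3, 64, 64)       # noisy latents for 8 frames (toy shape)
history = np.random.rand(2, 3, 64, 64)   # 2 clean history frames
x_guided = hg_vanilla_step(x_t, history)

# Under a fixed NFE budget, extra passes per timestep are paid for by using
# fewer sampling timesteps.
budget_nfe = 120
for name, passes in [("w/o HG", 1), ("HG-v", 2), ("HG-f", 3)]:
    print(f"{name}: {passes} passes/step -> {budget_nfe // passes} sampling steps")
```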

Figure R3. HG-t trades off long-term consistency and quality in long-context generation. This figure illustrates the trade-off between long-term consistency (measured by LPIPS) and quality (measured by FID) of Minecraft videos generated using HG-t, obtained by varying the weights ωshort and ωlong that combine the short and the full long history. As ωlong increases, long-term memory improves while quality degrades. This clear negative correlation between the two factors suggests that HG-t can effectively trade off long-horizon memory against quality by mixing the two histories. The experiment setup is identical to Section 6.4, Task 2.
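
For intuition about the ωshort/ωlong trade-off, here is a minimal sketch of one way such a composition can be written, assuming a compositional guidance form over the two history conditions; the toy denoiser and function names are hypothetical, not our actual code.

```python
import numpy as np

def denoise(x_t, history):
    """Toy stand-in for one forward pass; a real model would condition on `history`."""
    return x_t * 0.9

def hg_time_step(x_t, short_hist, long_hist, w_short, w_long):
    """One HG-t style step: mix the guidance direction from the short recent
    history with the direction from the full long history. A larger w_long
    favors long-term consistency; a larger w_short favors per-frame quality."""
    pred_uncond = denoise(x_t, None)
    pred_short = denoise(x_t, short_hist)
    pred_long = denoise(x_t, long_hist)
    return (pred_uncond
            + w_short * (pred_short - pred_uncond)
            + w_long * (pred_long - pred_uncond))

x_t = np.random.rand(8, 3, 64, 64)
out = hg_time_step(x_t, short_hist=x_t[:2], long_hist=x_t, w_short=2.0, w_long=0.5)
```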

Anonymous Submission

Five samples generated by Diffusion Forcing Transformer from a single image. The model is trained only on the RealEstate10K dataset, yet it can roll out much longer videos than prior state-of-the-art methods [1][2]. We highlight samples with challenging motions (e.g., zooming out, large rotations).

Ultra Long Video Generation

Diffusion Forcing Transformer (DFoT), together with History Guidance Across Time and Frequency, can stably roll out extremely long videos, such as the following 862-frame video generated from a single test image from the RealEstate10K dataset.
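
For readers curious about the mechanics, the sketch below outlines the kind of sliding-window rollout that produces such long videos: keep the most recent frames as history, sample the next chunk conditioned on them, and repeat. The helper names and the toy sampler are hypothetical placeholders, not our actual pipeline.

```python
import numpy as np

def sample_window(history, n_new, frame_shape=(3, 8, 8)):
    """Toy stand-in for sampling `n_new` new frames conditioned on `history`."""
    return np.random.rand(n_new, *frame_shape)

def rollout(first_frame, total_frames, window=16, history_len=4):
    """Autoregressive sliding-window rollout from a single image."""
    frames = [first_frame]
    while len(frames) < total_frames:
        history = np.stack(frames[-history_len:])            # latest frames as context
        new = sample_window(history, n_new=window - history_len)
        frames.extend(list(new))
    return np.stack(frames[:total_frames])

video = rollout(np.random.rand(3, 8, 8), total_frames=862)   # e.g. an 862-frame rollout
print(video.shape)                                            # (862, 3, 8, 8)
```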

Compositionality and Flexibility

DFoT learns the distribution of all sub-sequences rather than just the full sequence, allowing it to condition on a history of any length. Temporal History Guidance composes long-horizon behavior and local reactive behavior to enable new capabilities.
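
One way to picture the flexible-length conditioning, under the assumption that each frame carries its own noise level (as in Diffusion Forcing), is that history frames are simply clamped to noise level 0 while the remaining frames are denoised. The toy snippet below only illustrates that idea and is not the actual implementation.

```python
import numpy as np

def make_noise_levels(num_frames, history_len, t):
    """Per-frame noise levels: 0 for the clean history, the current timestep t elsewhere."""
    levels = np.full(num_frames, t, dtype=float)
    levels[:history_len] = 0.0
    return levels

print(make_noise_levels(num_frames=8, history_len=3, t=0.7))
# -> [0.  0.  0.  0.7 0.7 0.7 0.7 0.7]  (any history_len from 0 to 8 works)
```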

Qualitative Comparisons

On standard benchmarks, the Diffusion Forcing Transformer (DFoT) not only matches or surpasses industry closed-source models trained with large-scale compute but also enables long rollouts far beyond the test lengths of these datasets. We can perform rollouts of 60 frames on the Kinetics-600 dataset, compared to the previous benchmark of 11 frames, and at least 276 frames on the RealEstate10K dataset, significantly exceeding the previous limit of around 16 frames.

The figures below present qualitative samples generated by different diffusion methods using the same architecture. Standard Diffusion refers to the conditional diffusion baseline trained for a specific test history length (in contrast to DFoT's support for any history length). Binary Dropout is an ablative baseline that drops out frames during training to allow for flexible history conditioning. Full-sequence Diffusion is the traditional video diffusion method from Ho et al. 2022, which uses reconstruction guidance to enable flexible conditioning.

Samples on the Kinetics-600 dataset in a challenging setting: predicting the next 60 frames given 5 initial frames.


More samples on the Kinetics-600 dataset in the same challenging setting of predicting the next 60 frames given 5 initial frames.


Samples on the RealEstate10K dataset conditioned on the first frame and a camera pose sequence. This task is usually considered much harder than interpolating between two frames, the traditional video generation task on this dataset. In addition, we deliberately choose challenging motions, such as large rotations or zooming out, and a long rollout length of 276 frames.


More samples on the RealEstate10K dataset conditioned on the first frame and a camera pose sequence, under the same challenging setting as above (large rotations, zooming out, and a long rollout length of 276 frames).