Harvest Video Foundation Models via Efficient Post-Pretraining

arXiv:2310.19554v2 Announce Type: replace Building video-language foundation models is costly and difficult because video data is highly redundant and high-quality video-language datasets are scarce. In this paper, we propose an efficient framework to harvest video foundation models from image ones. Our method is intuitively simple: it randomly drops input video patches and masks out input text during the post-pre
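
The two input-side operations named in the abstract, random patch dropping and text masking, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep_ratio` and `mask_ratio` values, the `[MASK]` placeholder, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
import random


def drop_video_patches(patches, keep_ratio, rng):
    """Keep a random subset of patch tokens, preserving their original order.

    keep_ratio is a hypothetical hyperparameter, not a value from the paper.
    """
    num_keep = max(1, round(len(patches) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(patches)), num_keep))
    return [patches[i] for i in kept_idx], kept_idx


def mask_text_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Replace a random subset of text tokens with a placeholder token.

    mask_ratio and mask_token are hypothetical choices for this sketch.
    """
    num_mask = round(len(tokens) * mask_ratio)
    masked_idx = set(rng.sample(range(len(tokens)), num_mask))
    masked = [mask_token if i in masked_idx else t for i, t in enumerate(tokens)]
    return masked, sorted(masked_idx)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Stand-ins for flattened video patch tokens and a tokenized caption.
patches = [f"patch_{i}" for i in range(16)]
tokens = "a dog catches a frisbee in the park".split()

kept, kept_idx = drop_video_patches(patches, keep_ratio=0.25, rng=rng)
masked, masked_idx = mask_text_tokens(tokens, mask_ratio=0.5, rng=rng)

print(len(kept), "of", len(patches), "patches kept")
print(masked)
```

Under these assumptions, the video encoder would process only the kept patches, which is where the compute saving comes from, while the masked positions in the text would serve as prediction targets.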