LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression
Authors: Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille
Abstract: While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we present a study analyzing the redundancy of visual tokens and efficient training within these models. Our preliminary experiments show that eliminating up to 70% of visual tokens at test time by simple average pooling leads to only a minimal 3% drop in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in the visual context. Addressing this, we introduce the Visual Context Compressor, which reduces the number of visual tokens during training to improve training efficiency without sacrificing performance. To minimize the information loss caused by compressing visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression to compress the visual tokens progressively, from heavily to lightly, and finally applies no compression at the end of training, yielding no loss of information at test time. Extensive experiments demonstrate that our approach improves the performance of MLLMs on both image-language and video-language understanding, while also substantially reducing training costs. Code is available at https://github.com/Beckschen/LLaVolta
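As a rough illustration of the average-pooling compression and the stage-wise schedule described above, the PyTorch sketch below pools a sequence of visual tokens along the token dimension and maps training stages to pooling strides. This is a minimal sketch under our own assumptions, not the repository's implementation; the function names (`compress_visual_tokens`, `stride_for_stage`) and the specific stride values are illustrative only.

```python
import torch
import torch.nn.functional as F

def compress_visual_tokens(visual_tokens: torch.Tensor, stride: int) -> torch.Tensor:
    """Average-pool visual tokens along the sequence dimension (hypothetical helper).

    visual_tokens: (batch, num_tokens, hidden_dim)
    stride: pooling stride; e.g. stride=3 keeps roughly a third of the tokens
            (on the order of the ~70% reduction mentioned in the abstract).
    Returns: (batch, ceil(num_tokens / stride), hidden_dim)
    """
    if stride <= 1:
        return visual_tokens  # stride 1 means no compression
    x = visual_tokens.transpose(1, 2)  # (B, D, N) so we can pool over tokens
    x = F.avg_pool1d(x, kernel_size=stride, stride=stride, ceil_mode=True)
    return x.transpose(1, 2)           # back to (B, N', D)

def stride_for_stage(stage: int) -> int:
    """Hypothetical stage-wise schedule: heavy -> light -> no compression."""
    return {0: 8, 1: 4, 2: 1}.get(stage, 1)

# Usage sketch: 576 visual tokens (e.g. a 24x24 patch grid) with hidden size 4096.
tokens = torch.randn(2, 576, 4096)
for stage in range(3):
    compressed = compress_visual_tokens(tokens, stride_for_stage(stage))
    print(stage, compressed.shape)  # token count shrinks less at each later stage
```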