Salesforce AI unveils 'ThinK',...
Business Fortune
02 August, 2024
Researchers from The Chinese University of Hong Kong and Salesforce AI Research have developed a novel KV cache pruning technique called ThinK.
ThinK treats pruning as an optimization problem whose goal is to minimize the loss in attention caused by pruning. It selectively retains the most crucial channels, using a query-dependent criterion to determine channel importance. The approach is grounded in observations from visualizations of the LLaMA3-8B model: the value cache shows no discernible pattern, while key cache channels display varying degrees of relevance. Singular value decomposition of the attention matrices shows that most of the energy is concentrated in a few singular values, indicating that the attention mechanism is essentially low-rank. Together, these findings imply that key caches can be efficiently approximated with low-dimensional vectors. ThinK builds on them to create an effective pruning method that targets the channel dimension of the key cache, potentially lowering memory consumption without compromising model performance.
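As a rough illustration of the query-dependent criterion described above, the sketch below scores each key-cache channel by how strongly it contributes to the query-key product over a recent window of queries. This is a minimal sketch, not the paper's exact implementation: the function name, the NumPy setup, and the use of a Frobenius-norm score are assumptions for illustration.

```python
import numpy as np

def channel_importance(Q_window: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Score each key-cache channel by its contribution to Q @ K.T.

    Q_window: (num_recent_queries, head_dim) queries from an observation window
    K:        (seq_len, head_dim) cached keys for one attention head
    Returns:  (head_dim,) importance score per channel.
    """
    # For channel i, its rank-1 contribution Q[:, i] (outer) K[:, i] has
    # Frobenius norm ||Q[:, i]|| * ||K[:, i]||, so the score factorizes.
    q_norms = np.linalg.norm(Q_window, axis=0)  # (head_dim,)
    k_norms = np.linalg.norm(K, axis=0)         # (head_dim,)
    return q_norms * k_norms
```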
ThinK is a novel technique for optimizing the KV cache in LLMs by reducing the channel dimension of the key cache. The pruning task is formulated as an optimization problem that minimizes the difference between the original and pruned attention weights. ThinK introduces a new query-driven pruning criterion that assesses channel relevance by examining how the query and key vectors interact. A greedy algorithm then selects the most crucial channels, preserving the main information flow through the attention computation and strengthening AI-driven business solutions.
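The snippet below sketches the selection step under the assumption that the greedy choice amounts to keeping the highest-scoring channels and recording them in a binary mask, as the article describes; the `keep_ratio` knob and function names are hypothetical, not taken from the paper.

```python
import numpy as np

def select_channels(scores: np.ndarray, keep_ratio: float = 0.6) -> np.ndarray:
    """Greedily keep the highest-scoring key channels.

    scores:     (head_dim,) importance per channel (e.g. from channel_importance)
    keep_ratio: fraction of channels to retain (hypothetical knob)
    Returns a boolean mask marking the channels that are kept.
    """
    head_dim = scores.shape[0]
    num_keep = max(1, int(round(keep_ratio * head_dim)))
    kept = np.argsort(scores)[-num_keep:]     # indices of the most important channels
    mask = np.zeros(head_dim, dtype=bool)     # binary mask tracking pruned channels
    mask[kept] = True
    return mask

def prune_keys(K: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Drop pruned channels: (seq_len, head_dim) -> (seq_len, num_keep)."""
    return K[:, mask]
```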
On the implementation side, ThinK uses an observation window and concentrates on long-context cases to keep computing costs low. The KV cache holds two types of keys: pruned keys with a reduced channel size and unpruned keys at their original size, with a binary mask tracking which channels have been pruned. During decoding, pruning is applied either to the query before it is multiplied with the corresponding pruned keys, or by zero-filling the pruned keys and concatenating them with the unpruned keys. Combining this strategy with optimization methods such as FlashAttention may yield additional computational efficiency without sacrificing model performance.
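To make the two decoding paths concrete, here is a hedged sketch built on the same toy setup as the snippets above: in one path the new query is restricted to the kept channels before the dot product with the pruned keys; in the other, pruned keys are zero-filled back to full width and concatenated with unpruned keys so a standard dense attention kernel (such as a FlashAttention-style kernel) can run unchanged. Shapes and function names are illustrative assumptions.

```python
import numpy as np

def scores_with_pruned_query(q: np.ndarray, K_pruned: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Path 1: prune the query's channels, then multiply with the pruned keys."""
    q_pruned = q[mask]          # keep only the retained channels of the new query
    return K_pruned @ q_pruned  # (seq_len,) unnormalized attention scores

def zero_filled_keys(K_pruned: np.ndarray, K_unpruned: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Path 2: zero-fill pruned channels and concatenate with unpruned keys,
    restoring a dense (total_seq_len, head_dim) layout for standard kernels."""
    filled = np.zeros((K_pruned.shape[0], mask.shape[0]), dtype=K_pruned.dtype)
    filled[:, mask] = K_pruned  # zeroed channels contribute nothing to q @ k
    return np.concatenate([filled, K_unpruned], axis=0)
```

Zero-filling works because a dot product against an all-zero channel contributes nothing, so the scores match the query-pruned path while keeping the tensor shapes that existing attention kernels expect.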