A group of software engineers from the University of California, working with one colleague from Soochow University and another from LuxiTec, has devised a way to run AI language models without matrix multiplication. The team describes the approach and its test results in a paper posted to the arXiv preprint server.
The increasing power of LLMs like ChatGPT has led to a rise in the computational resources they demand. Matrix multiplication (MatMul) is a crucial step in running LLMs, where data is combined with weights in neural networks to generate probable answers to queries.
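As a rough illustration of where the cost comes from, the sketch below multiplies a small weight matrix by an input vector in plain Python. The function name `matvec` and the toy numbers are assumptions made for this example; real LLM layers perform the same kind of operation on matrices with millions or billions of entries.

```python
def matvec(weights, x):
    """Multiply a weight matrix (a list of rows) by an input vector."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi  # one multiply and one add for every single weight
        out.append(acc)
    return out


# Toy layer: 2 outputs, 3 inputs (illustrative values only).
weights = [[0.5, -1.25, 2.0],
           [1.5, 0.0, -0.75]]
x = [2.0, 4.0, 1.0]

y = matvec(weights, x)
print(y)  # [-2.0, 2.25]
```

Every weight costs one floating-point multiplication, so the work grows with the number of weights in the model.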
Initially, AI researchers recognized that graphics processing units (GPUs) were well-suited for neural network tasks due to their ability to run multiple processes concurrently, including multiple MatMuls. However, even with extensive GPU clusters, MatMuls have become bottlenecks as LLMs grow in power and popularity.
In their new study, the research team claims to have developed a way to run AI language models without MatMuls while remaining just as efficient.
To accomplish this, the team took a new approach to how weights are represented, replacing the usual 16-bit floating-point values with just three values: {-1, 0, 1}. They also introduced new functions that carry out the same types of operations as the previous method, along with new quantization techniques to boost performance. Because each weight can only be -1, 0, or 1, multiplying by a weight reduces to adding an input, subtracting it, or skipping it, which lowers computing demands. Additionally, they changed how the model processes sequences by using a MatMul-free linear gated recurrent unit (MLGRU) in place of the self-attention used in traditional transformer blocks.
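The effect of three-valued weights can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the team's implementation: the function names `ternarize` and `ternary_matvec` are hypothetical, and the rounding rule used here (divide by the mean absolute weight, round, clip to [-1, 1]) is one common ternary quantization scheme assumed for the example.

```python
def ternarize(weights, eps=1e-8):
    """Map real-valued weights to {-1, 0, 1} plus one shared scale.

    Assumed scheme: divide by the mean absolute weight, round, then clip.
    Note that Python's round() rounds exact halves to the nearest even integer.
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    tern = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return tern, scale


def ternary_matvec(tern, scale, x):
    """Apply ternary weights to x using only additions and subtractions."""
    out = []
    for row in tern:
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi      # weight +1: add the input
            elif t == -1:
                acc -= xi      # weight -1: subtract the input
            # weight 0: skip the input entirely
        out.append(acc * scale)  # a single rescale per output, not per weight
    return out


# Toy layer: 2 outputs, 3 inputs (illustrative values only).
weights = [[0.9, -0.05, -1.1],
           [0.4, 0.8, -0.6]]
x = [2.0, 4.0, 1.0]

tern, scale = ternarize(weights)
print(tern)  # [[1, 0, -1], [1, 1, -1]]
print(ternary_matvec(tern, scale, x))
```

The inner loop contains no multiplications at all. The only multiplication left in this sketch is the single rescaling of each output, which is an artifact of the assumed quantization scheme.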
In testing, the researchers found that their system performed on par with current state-of-the-art systems while using fewer computing resources.