Microsoft has unveiled significant updates to Bing's search infrastructure, incorporating both large language models (LLMs) and small language models (SLMs) alongside advanced optimisation techniques. The announcement underscores the company's commitment to improving performance and lowering the cost of delivering search results.
In the announcement, Microsoft stated, "At Bing, we are always pushing the boundaries of search technology. Leveraging both Large Language Models (LLMs) and Small Language Models (SLMs) marks a significant milestone in enhancing our search capabilities." This reflects an awareness that increasingly complex user queries require more sophisticated models to produce effective results.
Integrating LLMs into search systems presents challenges, particularly around speed and operational expense. To address these, Microsoft trained SLMs that reportedly deliver roughly 100 times the throughput of their larger counterparts. The announcement elaborated, "LLMs can be expensive to serve and slow. To improve efficiency, we trained SLM models (~100x throughput improvement over LLM), which process and understand search queries more precisely."
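Microsoft has not published its models or code, but the general pattern of running a compact model over search queries is easy to illustrate. The sketch below uses a small, publicly available zero-shot classifier as a stand-in for a purpose-trained SLM; the model name and intent labels are illustrative assumptions, not Bing's actual pipeline.

```python
# Illustrative sketch only: Bing's models are not public. A small open
# classifier stands in for a purpose-trained SLM doing query understanding.
from transformers import pipeline

# Compact model choice is an assumption; small models keep per-query
# latency low enough to sit in the search hot path.
classifier = pipeline(
    "zero-shot-classification",
    model="typeform/distilbert-base-uncased-mnli",
)

queries = [
    "best budget laptops 2024 under $700",
    "how does photosynthesis work step by step",
    "flights from london to tokyo in march",
]
intents = ["shopping", "informational", "travel"]  # hypothetical labels

# Passing the whole list batches the queries, amortising overhead
# across many requests, which is where throughput gains come from.
for result in classifier(queries, candidate_labels=intents):
    print(result["sequence"], "->", result["labels"][0])
```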
Additionally, Bing is employing NVIDIA's TensorRT-LLM technology to optimise the performance of these SLMs. TensorRT-LLM is an open-source library that accelerates inference for large models on NVIDIA GPUs, reducing both response time and serving cost.
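NVIDIA's quickstart documents a high-level Python LLM API for TensorRT-LLM; a minimal sketch of serving a small model with it might look as follows. The public model name is a stand-in for whatever SLM Bing actually deploys, and the prompt and sampling settings are assumptions for illustration.

```python
# Minimal sketch of TensorRT-LLM's high-level LLM API (per NVIDIA's
# quickstart). The model below is a public stand-in; Bing's actual SLM
# and serving configuration are not published.
from tensorrt_llm import LLM, SamplingParams

# Constructing the LLM compiles the model into an optimised TensorRT
# engine, which is where the inference-time speedups come from.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(max_tokens=64, temperature=0.0)  # deterministic output
for output in llm.generate(["rewrite this search query: cheep laptops"], params):
    print(output.outputs[0].text)
```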
A technical report released by Microsoft shared insights into the advancements made with the Deep Search feature, which now leverages SLMs in real time to enhance the relevance of web results. The report detailed that prior to these optimisations, Bing's original transformer model displayed a latency of 4.76 seconds for batched queries and a throughput of 4.2 queries per second per instance. Post-optimisation, latency improved to 3.03 seconds and throughput rose to 6.6 queries per second per instance: a 36% reduction in latency and a 57% increase in throughput, which the company says translates into substantially lower operational costs.
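The reported percentages follow directly from the raw figures, as a quick computation confirms:

```python
# Quick check of the reported figures (values from Microsoft's report).
lat_before, lat_after = 4.76, 3.03   # seconds per batched query
tps_before, tps_after = 4.2, 6.6     # queries per second per instance

latency_reduction = (lat_before - lat_after) / lat_before
throughput_gain = (tps_after - tps_before) / tps_before

print(f"latency reduction:   {latency_reduction:.0%}")  # ~36%
print(f"throughput increase: {throughput_gain:.0%}")    # ~57%
```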
The company reinforced its commitment to quality, stating, "…our product is built on the foundation of providing the best results, and we will not compromise on quality for speed. This is where TensorRT-LLM comes into play, reducing model inference time and, consequently, the end-to-end experience latency without sacrificing result quality."
These updates offer Bing users several advantages: faster search results from optimised inference, improved accuracy from the more capable SLMs, and greater cost efficiency, which may free resources for continued innovation and enhancements.
Taken together, these advancements could signal a pivotal evolution in how search engines operate, especially as users pose increasingly intricate queries. The shift to LLM and SLM models, combined with advanced optimisation techniques, could reshape the landscape of online search, underscoring the importance of delivering timely and relevant results to users.
Source: Noah Wire Services