AWS brings prompt routing and caching to its Bedrock LLM service | TechCrunch
As businesses move from trying out generative AI in limited prototypes to putting it into production, they are becoming increasingly price-conscious. Using large language models isn’t cheap, after all. One way to reduce cost is to go back to an old concept: caching. Another is to route simpler queries to smaller, more cost-efficient models. At its re:Invent conference in Las Vegas, AWS today announced both of these features for its Bedrock LLM hosting service.
Let’s talk about the caching service first. “Say there is a document, and multiple people are asking questions on the same document. Every single time you’re paying,” Atul Deo, the director of product for Bedrock, told me. “And these context windows are getting longer and longer. For example, with Nova, we’re going to have 300k [tokens of] context and 2 million [tokens of] context. I think by next year, it could even go much higher.”
Caching essentially ensures that you don’t have to pay for the model to do repetitive work and reprocess the same (or substantially similar) queries over and over again. According to AWS, this can reduce cost by up to 90%. A byproduct is that the latency for getting an answer back from the model is also significantly lower (by up to 85%, AWS says). Adobe, which tested prompt caching for some of its generative AI applications on Bedrock, saw a 72% reduction in response time.
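To make the economics concrete, here is a minimal sketch of the idea in Python. It is a toy cost model, not Bedrock's API: the `PromptCache` class, the whitespace "tokenizer" and the `cached_rate` of 0.1 (chosen to mirror the "up to 90%" figure) are all assumptions made for illustration.

```python
import hashlib


class PromptCache:
    """Toy cost model for prompt caching (hypothetical, not a Bedrock API).

    The first request that includes a document pays full price for it.
    Later requests that reuse the exact same document pay a reduced rate.
    """

    def __init__(self, cached_rate: float = 0.1):
        # Assumed price multiplier for tokens served from the cache.
        self.cached_rate = cached_rate
        self._seen = set()

    @staticmethod
    def _count_tokens(text: str) -> int:
        # Crude stand-in for a real tokenizer: one token per word.
        return len(text.split())

    def billable_tokens(self, document: str, question: str) -> float:
        """Return the number of full-price token equivalents for one request."""
        key = hashlib.sha256(document.encode("utf-8")).hexdigest()
        doc_tokens = self._count_tokens(document)
        question_tokens = self._count_tokens(question)

        if key in self._seen:
            # Cache hit: the shared document is billed at the reduced rate.
            doc_cost = doc_tokens * self.cached_rate
        else:
            # Cache miss: pay in full and remember the document.
            self._seen.add(key)
            doc_cost = float(doc_tokens)

        # The question is new each time, so it is always billed in full.
        return doc_cost + question_tokens


if __name__ == "__main__":
    cache = PromptCache()
    document = "word " * 1000  # a 1,000-"token" document
    print(cache.billable_tokens(document, "What is this about?"))
    print(cache.billable_tokens(document, "Who wrote it?"))
```

In this simplified model, only an exact repeat of the document counts as a cache hit, and the question itself is always billed in full, which is why the savings grow with the size of the shared context.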
The other major new feature is intelligent prompt routing for Bedrock. With this, Bedrock can automatically route prompts to different models in the same model family to help businesses strike the right balance between performance and cost. The system automatically predicts (using a small language model) how each model will perform for a given query and then routes the request accordingly.
“Sometimes, my query could be very simple. Do I really need to send that query to the most capable model, which is extremely expensive and slow? Probably not. So basically, you want to create this notion of ‘Hey, at run time, based on the incoming prompt, send the right query to the right model,’” Deo explained.
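The routing decision Deo describes can be sketched roughly as follows. This is an illustration under stated assumptions, not AWS's implementation: the model names, prices and `capability` scores are invented, and `predict_quality` is a simple word-count heuristic standing in for the small language model that Bedrock uses to predict performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                  # hypothetical model name
    cost_per_1k_tokens: float  # hypothetical price
    capability: float          # hypothetical score between 0 and 1


def predict_quality(model: Model, prompt: str) -> float:
    """Stand-in for the small predictor model (an assumption).

    Longer prompts are treated as harder, and less capable models
    lose more quality as difficulty rises.
    """
    difficulty = min(1.0, len(prompt.split()) / 200)
    return 1.0 - difficulty * (1.0 - model.capability)


def route(prompt: str, models: list[Model], tolerance: float = 0.05) -> Model:
    """Pick the cheapest model whose predicted quality is close to the best."""
    scores = {m.name: predict_quality(m, prompt) for m in models}
    best = max(scores.values())
    eligible = [m for m in models if scores[m.name] >= best - tolerance]
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


# Two models from the same (made-up) family.
FAMILY = [
    Model("family-small", cost_per_1k_tokens=0.25, capability=0.5),
    Model("family-large", cost_per_1k_tokens=3.00, capability=0.9),
]

if __name__ == "__main__":
    print(route("What is 2 + 2?", FAMILY).name)
    print(route("analyze " * 400, FAMILY).name)
```

The `tolerance` parameter captures the trade-off: a larger value sends more traffic to the cheaper model at the risk of slightly worse answers, while a value of zero always picks the model with the highest predicted quality.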
LLM routing isn’t a new concept, of course. Startups like Martian and a number of open source projects also tackle this, but AWS would likely argue that what differentiates its offering is that the router can intelligently direct queries without a lot of human input. But it’s also limited, in that it can only route queries to models in the same model family. In the long run, though, Deo told me, the team plans to expand this system and give users more customizability.
Lastly, AWS is also launching a new marketplace for Bedrock. The idea here, Deo said, is that while Amazon is partnering with many of the larger model providers, there are now hundreds of specialized models that may only have a few dedicated users. Since those customers are asking the company to support them, AWS is launching a marketplace for these models. The only major difference is that users will have to provision and manage the capacity of their infrastructure themselves, something that Bedrock typically handles automatically. In total, AWS will offer about 100 of these emerging and specialized models, with more to come.