The Language Model Serving Company

I realized recently that I had read the phrase "language model serving will become commoditized" without actually thinking particularly deeply about why this might be the case. As I started to explore this further, it transformed into more thoughts around building software for language model serving that isn't the actual API service, and how this correlates with a much more developed industry for developer tools: web applications.

Background

Despite the explosion of demand around large language models, and the proliferation of companies building them, there is a surprising lack of companies building developer tools for serving them in production. This is in stark contrast to the web application space (a much more developed space), where there are many such companies building developer tools for deploying, monitoring and administrating web applications.

In addition, these companies are not any of the large cloud providers, who are content to provide the infrastructure for these companies to build on top of (perhaps with the exception of Cloudflare's surgical execution on edge computing). Will this dynamic be the same for serving large language models?

If you are not already familiar with Vercel, they are a Platform as a Service provider, specializing in deploying frontend applications. In particular, they author the NextJS framework, which, amongst other usability and speed features, implements "Server Side Rendering" for React applications. This can make some applications much faster, as html is rendered on a server, and sent back to the client. On its own, this does not necessarily make applications faster, but it enables features like pre-loading of pages linked from a given page, and enhanced security. Additionally, they have a strong focus on developer friendly workflows, such as preview URLs for PRs, git integration, automatic speed testing etc.

A comparison with Vercel

Vercel differentiates itself from the large cloud providers by ease of use, making money by selling high margin, value-added features which are almost guaranteed to be used as companies scale web-facing applications (KV stores, databases, CDNs). Curiously, they are not interested in vertical integration, in the sense of owning their own hardware. Why?

One of Vercel's top line features is a global "edge" caching mechanism, which caches data requests for serving web applications close to a user's location. Vercel depends on globally distributed CPU servers to provide their locality based serving cache. The "minimum viable infrastructure" for this caching feature would be a full, global network of datacenters, so the capital cost of replicating this functionality with their own infrastructure makes this almost impossible.

The complexity and overhead of setting up this infrastructure provides little incentive to the cloud providers to compete with them, as Vercel still rely on cloud provided compute. Vercel cannot compete on operating cost for running CPU instances with GCP/AWS etc, because of their scale, and the difficulty and non-homogenous nature of providing general purpose compute (different machine types, storage, etc). Cloud providers have no incentive to provide an enriched developer experience similar to Vercel's, because they can't use their scale to compete on price.

Is deploying backend LLM services different to deploying web applications?

The economics of language model serving is different, because the cost of running GPU servers (cost of ownership) is much smaller compared to the extremely high cost of capital for GPUs (i.e NVIDIA's margins). This is why there are so many competing GPU specific cloud providers - being a "good" or "well-optimized" GPU server provider is of low marginal utility compared to the cost of initial capital.

Even as technological improvements make LLMs cheaper to run on CPUs (and hence, commodity goods), the cost of running the largest, state of the art LLMs is still commoditized by how much the costs of serving are dominated by GPU access. This has been well demonstrated by the price war for serving the new Mixtral 8x7b/8x13b models.

Despite the market dynamics differing substantially from the Vercel/GCP/AWS relationship, it demonstrates several similar traits that allow software companies to add value on top of the serving companies themselves. For example, the aggressive market competition for serving LLMs via APIs is causing the standardization of client libraries. Companies are desperate to make the cost of switching to their new, hot LLM essentially zero, and a part of this is writing API specs which are identical to OpenAI. E.g:

Together AI
Replicate Lifeboat
Perplexity: "Our supported models are listed on the "Supported Models" page. The API is conveniently OpenAI client-compatible for easy integration with existing applications."
Anyscale
Fireworks AI

For companies building developer tools for LLMs, this is great - the overhead of integrating with a wide range of API providers is low, and making it easy for users to switch is also low. Some logging/monitoring companies take advantage of this already - llm.report's integration requires changing the OpenAI client's base URL only, and similarly BrainTrust Data allows consolidated access to multiple model APIs using a single key/service. With model routing services like WithMartian, this can even be automated.

Some of these attributes show similarities to the web application ecosystem. The standardization in client code and access patterns looks similar to React emerging as the dominant JS library for dynamic HTML rendering. React allowed Vercel to elegantly solve the problem of slow client rendering without alienating existing React developers, but one very important difference is the substantial switching cost from one JS framework to another. If you write code in NextJS in order to get the benefits it provides, switching to use Vue would be a large amount of work. Building plugins or libraries across 2 React frameworks is 2x the work, whereas for LLMs, supporting multiple model APIs is quite easy.

This means there is almost an anti network effect for LLM serving APIs - users discover alternatives via API aggregators, and then switch to the cheapest/best one. This happens for 2 key reasons - LLMs have the simplest IO of any new technology in history, and they are also (currently) completely stateless. This is not the case for React frameworks, where the network effect is positive - if library developers make plugins for your framework, they are unlikely to be developed for other frameworks as well.

This is one of the strongest arguments for an OSS model strategy. Mistral's Apache 2.0 licensed models are already proliferating, and as LLMs become a more pervasive part of applications, on-premise hosting will become more common for companies, for the same reasons that self-serve SaSS models also have "enterprise" plans (SOC-2, data-retention etc etc). Custom deployments involve a more substantial investment (time and code) than calling an API, which gives Mistral an (eventual) edge in retaining users. Whether they can find services or new models which those users would pay for is another question.

Although there are some companies that are providing these developer products for LLM APIs (e.g Replicate), these companies typically focus on APIs and deployment rather than developer experience. Of course, this is a prerequisite to many of the additional features one can imagine building into a developer product for LLMs in production (logging, monitoring etc).

"The Wedge" for LLM serving

What is the key value proposition "(wedge)", for LLM serving? The product wedge for Vercel was the NextJS framework, which provided a compelling developer experience and otherwise unattainable web speed for building React applications. Currently for LLMs, the focus is on price and speed/throughput, but that will soon be driven down by a combination of commodity access to hardware and technological innovation. BERT, an encoder model from 5 years ago, can be trained in 1hr on 8 GPUs at a total cost of $20. There is no reason not to expect a similar level of technological innovation in larger language models, as they are exactly the same technology. Assuming companies will not be able to compete on price, what is the wedge for LLM serving?

Determining the best wedge for LLMs is an interesting question, because the separate pieces of an effective LLM solution are considerably more interconnected than for a web application. The ultimate value creation in an LLM serving application is the feedback loop between the served model, the generated text and implicit user feedback. This means that a potential wedge product feature designed around building and improving LLMs needs to cross multiple parts of a "typical" software application stack, e.g logging AND evaluation. This is not the case for a web application, where the value creation is in the initial user experience, and the feedback loop is much more indirect/guided by product sense. I think it is likely that the most effective wedge for LLMS (for B2B companies) will be one of either Evaluation or The Feedback Loop.

Value-add features for LLM serving

If LLM serving is becoming commoditized, how can a "Vercel of LLMs" provide services which create value? Here are some predictions/ideas:

Structured generation

LLMs are crying out for structured generation, and there are several open source libraries which accommodate this, either through function calling APIs or grammar based generation. The standout options include:

Outlines, from parent company .txt
Instructor
Jsonformer
llama.cpp

Although each of these tools has their eccentricities, when combined with tools which produce GBNF grammars from typescript interfaces, this becomes very powerful. One could imagine deploying source code which defines a typescript interface/Pydantic model by converting it to a grammar and automatically deploying structured LLM endpoints based on the generated grammars, similar to Vercel's serverless function deployments.

APIs with integrated search

The LLM startup with the most compelling B2C product (discounting Cohere/Anthropic etc) currently seems to be Perplexity, who's focus is integrating search into LLMs at a large scale. Retrieval Augmented Generation (RAG) adds complexity to LLM serving requirements and suddenly adds a layer of statefulness and customization to deployed applications.

Logging

This is the current focus of many companies/startups, but there is still a lot of room for improvement, especially around integrating the feedback loop (discussed below). Logging is also a good example of a standard application feature that looks quite different due to the prompt/response format. Smarter interfaces, which themselves involve ML, could be powerful differentiators for companies in the space (full text search over logs, clustering by prompt/response, UMAP plots of the last N days of responses).

Watermarking

As LLMs proliferate, it will become increasingly important to be able to identify the source of a given generation. This is an active research area, some of the recent papers include:

This can easily be integrated as a paid feature of an LLM serving company. This one is quite interesting, as often the watermarks require access to the model or model logits, which is not always possible without integrating with the model provider. Extensions to this, such as canary monitoring services, are also possible. These methods can be subverted, but for the average user, they are likely to be sufficient. I think the scope of use cases beyond "identifying nefarious use" is quite large; filtering training data, identifying model theft, even identifying the model source/version of a given generation are all possible.

Function calling

Some API providers already provide function calling APIs, but these are extremely ad-hoc, and require manual type inference to extract a e.g OpenAPI compatible schema. Similar to Vercel's automatic deployment of lambda functions specified in a api/ directory, a compelling feature here would be automatically making registered functions available to LLM calls. An additional problem to solve here is function search - which functions should you expose based on the prompt/query?

Local development experience

Developing locally with LLMs is quite painful - the APIs are expensive, and running models locally is hard; difficult to install, computationally expensive to run, not particularly stable software. You want caching, key use statistics, you want to log your test generations, you want to test streaming responses etc etc. Simon Willison's llm is in the right direction, but there's plenty of room for improvement here. The way NextJS makes running backend functions locally easy is a good example of how this could be done.

Security and Prompt Injection

One of the largest barriers to widespread adoption of LLM technology in commercial settings will be it's vulnerability to prompt injection attacks. Prompt injection attacks are difficult to fix, because by their nature they are just parts of a prompt. ML attempts to detect such inputs will always be vulnerable to new and creative ways to do the prompt injection.

Federated Finetuning

Some data is sensitive enough that language models should not be trained or fine-tuned on it directly. However, there is a substantial body of research around privacy prerserving computation, as well as ML frameworks that have invested in it, like Tensorflow.

Evaluation

Language models are hard to evaluate in a rigorous way. A multitude of benchmarks have been formulated to attempt to automatically evaluate models on tasks. Additionally, for LLM developers, there is another element of variability - the prompt itself. This is the route being taken by many startups.

This is only one piece of the puzzle when it comes to creating new language model applications. Just as important is collecting structured feedback and responses in a way that is amenable to further preference finetuning.

Grounded generation

Current LLM generation is hard to evaluate and verify. One way to improve this is to provide a "grounding" for the generation; conceptually related to structured prompts, or database/function access. For example, you can generate text from JSON fetched from a DB, where the LLM generates symbolic references to the DB entries. The best paper exploring this idea is Towards Verifiable Text Generation with Symbolic References. I think this area has the potential to influence many other areas of LLM generation, particularly evaluation (symbolic references are very easy to verify) and function calls/structured generation.

The feedback loop

A key differentiator with LLMs is the ease of improving a model simply by having people use your product or service. Finetuning on new data, if you have collected, cleaned and analysed it, is not a hugely difficult proposition for an ML Engineer. The difficult pieces of this are coordinating what data you need to collect, actually collecting that data, tying it back to LLM generations, and evaluating improved models. Many companies are tackling the individual pieces of this, viewing them as individual pieces of a software stack.

I think a key value proposition for a "Vercel for LLMs" might be to solve several pieces of this at once, creating an "off the shelf" solution for feedback loops. This would involve:

Logging hooks for existing LLM serving frameworks
A javascript LLM analytics library, allowing users to define "events" which correspond to finetuning signals, tied to model generations
Evaluation and filtering tools for reviewing collected data

I have not seen a convincing version of this "Flywheel" for production systems. Pieces of it exist, but without being focused at LLMs singularly.

ML Differentiated products

One interesting thing about LLM add-on products is that their inputs and outputs are, at a base level, purely text. This enables new features beyond typical solutions, gated by the ML expertise required to build them.

For example, take LLM logging. At one level, a LLM logging solution can provide metrics you might expect from Datadog - time to first token, throughput, cache usage, billing statistics etc. Of course, this information is extremely useful, and provides visibility to LLM users.

However, there is another set of analysis techniques that only really apply to the text in, text out interface of LLMs; text search over logs, clustering (e.g T-SNE, or UMAP) of recent responses and identifying erroneous or fraudulent usage all require the same language processing competency as building the system itself. These ML features require quite different architectural choices (and ML expertise) than a typical logging solution, making it hard for "general purpose" logging solutions to compete with a LLM specific solution.

Looking forward

Some of the features here do require direct access to models (or at least, model logits) to be implemented. But I think the majority of them can be abstracted around model proxies until LLM serving is cheap enough to be run on commodity hardware. Either way, lots of exciting stuff to build!

Thanks to Delip Rao for feedback on this blog post!