

Ntropy caching framework


Author

Ilia Zintchenko

Co-founder and CTO

Training small models from scratch, fine-tuning smaller pre-trained models using human labels and, most recently, low-rank adaptation using outputs from larger LLMs have become the industry standard for making language models viable at scale.

Less intuitively, better performance on the training data can come at the cost of lower accuracy on instances outside of the main distribution.

For complex tasks, for example in finance, banking or accounting, a single error can lead to large sums of wrongly allocated funds, compliance issues and missing cases of fraud. The cost of an error can be significantly larger than the value of a correct output. In such cases, it is critical to deliver high accuracy on both the easy and hard instances of a task, while also keeping cost and latency low. This level of performance is only accessible to the largest models. 

Unlike smaller models, which need to be specialized for the average instance of a task, large models with succinct prompting can perform well even without specialization. As this performance comes from their emergent generalization ability, accuracy can be high on both easy and hard instances of a task. 

At Ntropy, we have learned this first-hand while training multiple generations of smaller language models for financial transactions.

Mission

We want to make the largest models viable at scale, across industries and use cases.  

Horizontal efforts, which have dominated recent progress, focus on hardware and software enhancements such as specialized chips for transformer architectures, algorithmic tricks to speed up inference, and general model caching mechanisms. These optimizations require deep expertise at the model and infrastructure level, and are not tied to any particular use case or product.

Vertical optimizations, on the other hand, involve problem-specific APIs, prompting and specialized caching. They are tailored to particular tasks, requiring domain knowledge, access to relevant data, and customer insights. The setup can, however, be reused between similar companies. This makes it increasingly cost-effective and speedy to serve new companies with analogous needs, creating a compounding advantage.  

Over the past year, we have been working on vertically scaling LLMs by combining caching with denoising models and segmentation of tasks into simpler, independent parts. This has enabled us to reduce the average number of queries to an LLM by 3-5 orders of magnitude and cost per datapoint by 2-3 orders of magnitude, without impacting accuracy, even on the hardest instances of a task. This approach is a task-specific optimization and integrates vertically. It is agnostic to the type of LLM we are using and complementary to any other horizontal inference enhancement, including prompt optimizations, specialized hardware, software tricks and model routing.

How are such efficiency gains possible? Let’s make a few observations:

  1. Real-world tasks are often composed of multiple independent subtasks: for example, extracting key fields from invoices, medical records or legal proceedings, answering customer support messages that contain multiple questions, assessing mechanical damage from images for insurance claims, and many more.

  2. Real-world subtasks are often overdefined, i.e. the input contains redundant information that does not meaningfully affect the output of the model.

  3. What matters is the average cost per task, not the cost of every single task instance.

  4. Real-world throughputs are large, so equivalent subtasks occur frequently in production.

For such a task, we can illustrate the flow for solving it with an LLM as follows:

[Diagram: flow for solving such a task with an LLM]

Combined with denoising models for all parts of the input and a cache for each subtask, we can reduce the average cost and latency of running the largest LLMs by orders of magnitude. Let's unpack below.

Formalism

To compare the overall cost of solving a number of instances of a task with an LLM, we first need to set up some tooling:

  • decomposing the input space into parts and the output space into subtasks is only performed once when setting up a task. This can therefore be done manually using all available domain knowledge of the problem.

  • any call to the LLM is cached by storing (key, LLM_output) pairs, i.e. we retrieve the output of the model from memory every time a call with the same input key is made.

  • cache keys are computed from the raw inputs by concatenating the outputs of denoisers across all parts of the input.

  • denoising can be done with classifiers that assign a category to each part of the input space and are orders of magnitude cheaper than the base LLM. The list of categories should be chosen such that it contains just enough information to fully resolve the output of the task. We use encoder-only transformer models for this across all our tasks.

  • when computing the cache keys, to avoid affecting accuracy, we use the output of the denoiser on each part when its confidence is high (above a threshold) and the raw input of that part when its confidence is low (below the threshold). The more parts can be replaced with their category, the higher the overall cache hit rate will be (a minimal cache-key sketch follows this list).

  • for simplicity, we assume the same LLM is used to solve both the full task and all subtasks individually. If more optimized models or prompts are used for each subtask, the cost advantage will be even larger.

  • for simplicity, we will only focus on the division of the input space into parts and assume there is only a single task to solve. If we can also divide it into subtasks that each use a fraction of the input space, the costs are even smaller.

  • for simplicity, we will assume the cost of running the denoising models is negligible. In reality, these models are orders of magnitude cheaper to run than the base LLM and will be part of the baseline cost for the full pipeline that scales linearly with throughput.

  • we assume throughputs are large enough for equivalent instances of a subtask to occur frequently in production.

  • variables:

    • N: number of parts into which we divide the input space. We assume all parts have the same number of possible states and are equally hard for the denoiser.

    • P: probability that the confidence of the denoiser is high (above the threshold)

    • Mt: number of possible states per part of the input

    • Mp: number of possible categories that the denoiser outputs per part

    • Mp is significantly smaller than Mt, i.e. Mt = K·Mp with K large
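To make this concrete, below is a minimal sketch of the cache-key construction and lookup described above, assuming a hypothetical denoise function that returns a (category, confidence) pair per part and a plain in-memory dict as the cache. It is an illustration of the scheme, not Ntropy's actual implementation.

```python
from hashlib import sha256

CONFIDENCE_THRESHOLD = 0.4  # illustrative value; in practice tuned per task


def cache_key(parts, denoise):
    """Build a cache key from the parts of one input instance.

    `parts` is a list of raw input parts (strings). `denoise` maps a part to a
    (category, confidence) pair; it stands in for the cheap encoder-only
    classifier described above.
    """
    key_fields = []
    for part in parts:
        category, confidence = denoise(part)
        if confidence >= CONFIDENCE_THRESHOLD:
            # High confidence: replace the raw part with its category,
            # collapsing Mt raw states into Mp key states.
            key_fields.append(category)
        else:
            # Low confidence: keep the raw part so accuracy is not affected.
            key_fields.append(part)
    return sha256("|".join(key_fields).encode()).hexdigest()


def solve(parts, denoise, call_llm, cache):
    """Return the LLM output for this input, calling the LLM only on a cache miss."""
    key = cache_key(parts, denoise)
    if key not in cache:
        cache[key] = call_llm(parts)
    return cache[key]
```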

The data flow with this setup is the following:

When caching based on the full input state, the LLM needs to be called

C1 = Mt^N

times to run through all possible input states. If we instead divide the input state into parts and run the denoiser on each, the LLM needs to run at most

C2 = Mp^(P·N) · Mt^((1-P)·N) = Mt^N / K^(P·N) = C1 / K^(P·N)

times.

Hence, by splitting the input space into parts, denoising, and caching, the base LLM is called

Q = C1 / C2 = K^(P·N)

fewer times than when caching based on the full input state. Q grows exponentially with P, the probability that the denoiser is confident, and with N, the number of parts we divide the input state into, and polynomially with K, the factor by which denoising reduces the number of possible states per part.
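As a hypothetical worked example of this scaling, with purely illustrative numbers rather than production figures:

```python
# Illustrative numbers only: N parts, each with Mt raw states collapsed by the
# denoiser to Mp categories (so K = Mt / Mp), and the denoiser confident with
# probability P.
N, Mt, Mp, P = 4, 1000, 10, 0.9
K = Mt / Mp  # 100

C1 = Mt ** N                                    # distinct keys without denoising
C2 = (Mp ** (P * N)) * (Mt ** ((1 - P) * N))    # distinct keys with denoising
Q = C1 / C2                                     # reduction factor, = K ** (P * N)

print(f"C1 = {C1:.2e}, C2 = {C2:.2e}, Q = {Q:.2e}")  # Q = 100 ** 3.6 ≈ 1.6e7
```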

Caching for bank transaction enrichment

We now have the tools we need to apply this formalism to the real world. The use case we have currently deployed in production is enrichment of bank transactions for fintechs, banks and accounting companies. This problem requires high precision and has a high cost of mistakes. 

The pipeline for this is composed of multiple steps and models, including entity extraction, transfer classification, organization categorization, transaction labeling and other steps. Parity with human accuracy can currently only be reached using the largest LLMs available at all steps, coupled with optimal prompting logic (minimal ambiguity in system prompt, pre/post-processing of input, domain-knowledge injection). Running this pipeline directly costs around 10 cents / transaction (with GPT-4-0314 as the base model). Enriching all electronic transactions globally (~2B per day) at this rate would cost $200M / day, or $73B / year. Costs need to be reduced by ~1000x (cache hit rates > 99.9%) to make this viable. Optimal caching at all steps of the pipeline is therefore critical.
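A quick sanity check of these figures, ignoring the (much smaller) cost of the denoisers and cache infrastructure:

```python
cost_per_txn = 0.10        # USD, ~10 cents per transaction with GPT-4-0314
txns_per_day = 2e9         # ~2B electronic transactions globally per day

cost_per_day = cost_per_txn * txns_per_day      # 2.0e8 -> $200M / day
cost_per_year = cost_per_day * 365              # ~7.3e10 -> ~$73B / year

target_reduction = 1000                          # ~1000x cost reduction needed
required_hit_rate = 1 - 1 / target_reduction     # 0.999 -> >99.9% cache hit rate
print(cost_per_day, cost_per_year, required_hit_rate)
```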

The following diagram illustrates the end-to-end enrichment process, from raw transaction data to extracting information about all parties involved and inferring the purpose of a transaction.

The most interesting parts of the pipeline are entity recognition and transaction categorization.

Entity recognition

The cache key is created by replacing entities in the transaction description with their tags. Only entities on which the hasher has high confidence can be safely replaced, i.e. replaced without affecting the tags of other entities. The following is an analysis of cache hit rates vs. transaction volume:

This data is from the last month (around 400M transactions after deduplication) of our production environment. Each curve corresponds to a different per-entity-tag confidence threshold: an entity in the transaction is only replaced by its tag if the hasher's confidence in that tag is above the threshold. The cache hit rate at the end of the period is 97.5% and 98.7% at confidence thresholds of 0.4 and 0.0, respectively. This is significantly higher than the 36.5% hit rate achieved by plain input caching with digit masking, which we use as a baseline. Note that in the confidence-threshold range [0.4, 1.0), accuracy is not affected relative to the base model, but it starts to degrade at lower thresholds.
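A minimal sketch of how such a key could be built; the tagger interface (returning entity text, tag and confidence) and the default threshold are assumptions for illustration, not the production hasher:

```python
def entity_cache_key(description, tagger, threshold=0.4):
    """Build a cache key by replacing high-confidence entities with their tags.

    `tagger` returns (entity_text, tag, confidence) triples for one transaction
    description; it stands in for the entity-recognition hasher.
    """
    key = description
    for entity_text, tag, confidence in tagger(description):
        if confidence >= threshold:
            # e.g. "AMZN MKTP US*1A2B3" -> "<organization>", "SEATTLE WA" -> "<location>"
            key = key.replace(entity_text, f"<{tag}>")
        # Low-confidence entities are left as raw text so that their own tags,
        # and those of neighbouring entities, are not affected.
    return key
```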

Transaction categorization

The cache key is created by discretizing each part of the input space with enough granularity to resolve the full output category taxonomy. The parts include the amount, account holder organization, counterparty organization, intermediary organization, transaction description, etc. Some parts are discretized trivially, while discretizing others, such as the organization or transfer categories, is a much harder task and is therefore done with the base LLM. The more categories used for each part, the lower the hit rate of the transaction categorization cache at a given volume. The following is an analysis of cache hit rates vs. transaction volume:

As the key multiplicity is lower than for the entity recognition cache, the hit rate grows more rapidly with volume: 96.2% at 100k transactions, compared to 0.08% for the plain transaction cache with digit masking. Accuracy is again not affected by the denoising, and the cache is far more efficient than the plain transaction cache with digit masking.
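For illustration, a cache key for this step could be assembled roughly as follows; the field names, amount buckets and example category values are hypothetical, not the production taxonomy:

```python
def categorization_cache_key(amount, entry_type, account_holder_category,
                             counterparty_category, transfer_category):
    """Concatenate discretized parts of a transaction into a cache key.

    The organization and transfer categories come from upstream steps (cheap
    denoisers where possible, the base LLM for the harder parts).
    """
    # Discretize the amount with just enough granularity to resolve the
    # output taxonomy; bucket boundaries here are purely illustrative.
    amount_bucket = "small" if amount < 50 else "medium" if amount < 1000 else "large"
    return "|".join([
        amount_bucket,
        entry_type,                 # e.g. "debit" / "credit"
        account_holder_category,    # e.g. "consumer"
        counterparty_category,      # e.g. "online retailer"
        transfer_category,          # e.g. "card payment"
    ])
```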

Cost projection

Below is a 36-month cost projection for enriching bank transactions of US consumers using the full pipeline with task-specific caching for all steps. Numbers are based on stats from the Ntropy API, including mean token usage per transaction, transaction-distribution of merchants in the US, number of possible task states, and baseline costs (1 cent / 1k transactions).

As a result of caching, we achieve a 50-150x reduction in cost per transaction within 6-18 months. Larger throughputs populate the cache and converge to baseline costs faster. Note that we have assumed the cache is empty at month 0, which will only be the case for a completely new task. The length of the warmup period depends on the particular task structure, but at high enough throughputs we expect new tasks to take 2-3 months to be cached beyond 95%. For financial transactions, we set a refresh period of 12-18 months for all entries in the cache, depending on a number of factors. For other tasks and data domains, the update period can vary.

Versioning

There are a number of transitive effects to consider when caching past model outputs: LLM prompt changes, changes in the pre/post-processing logic, changes in the base model itself, external changes (markets, organizations, websites, locations, etc.). The cadence, magnitude and compatibility of these changes from previous versions, as well as latency, cost and compute constraints, will affect the optimal strategy to take when an existing cache entry is hit. There are a few choices:

  1. use existing entry

  2. get new entry from base model and overwrite the existing one

  3. refresh parts of the existing entry using a smaller LLM or a subset of the prompt

  4. call the LLM at non-zero temperature and average the output with the existing entry. This can be used to increase accuracy over a single call to the base model.

Another important transitive factor is the unit economics of an entry in the cache over its lifetime. If the expected time between hits for an entry is T, the storage cost per unit time is S, the cost of running the denoising functions to obtain its hash key is H, the expected number of hits before refreshing the entry is N and the cost of calling the base model is B, the cost advantage of caching the entry across N hits is

N·B / (N·(S·T + H) + B).

Storage costs can be neglected in most situations, and assuming the base model is significantly more expensive than the denoising functions, the cost advantage approaches N, the expected number of hits per entry.
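The same relationship, as a small helper that plugs illustrative numbers into the formula above:

```python
def cache_cost_advantage(n_hits, base_cost, storage_rate, time_between_hits, hash_cost):
    """Cost advantage of caching one entry over its lifetime: N·B / (N·(S·T + H) + B)."""
    n, b = n_hits, base_cost
    s, t, h = storage_rate, time_between_hits, hash_cost
    return (n * b) / (n * (s * t + h) + b)

# Illustrative numbers: with negligible storage and hashing costs the advantage
# approaches N; with a non-negligible hashing cost it saturates below N.
print(cache_cost_advantage(n_hits=1000, base_cost=0.10,
                           storage_rate=0.0, time_between_hits=30.0,
                           hash_cost=1e-4))  # ≈ 500, vs. the N = 1000 ceiling
```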

Remarks

In this post, we provided a high-level overview of the task-specific caching infrastructure that we use at Ntropy to make the largest LLMs viable at scale. Many technical details have been left out. As we further deploy the caching infrastructure across our network of customers and extend it to other tasks, we will provide regular updates.
