Azure API Management Quota: Policies, Limits & Cost

What Is the Azure API Management Quota Policy?
The Azure API Management quota policy is a configurable enforcement mechanism that controls how many API calls or how much bandwidth a subscription can consume over a defined period. As Microsoft's official documentation confirms, the policy enforces a renewable or lifetime call volume and/or bandwidth quota on a per-subscription basis. Understanding this policy is foundational to any serious cost optimization strategy built around Azure API Management.
The policy supports two measurable dimensions: call volume (the number of requests a subscriber can make) and bandwidth (the total data transferred). You configure a renewal window using the renewal-period attribute, which specifies the fixed window length in seconds after which the quota resets. Setting renewal-period to zero means the quota never resets, making it a true lifetime limit.
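In policy XML, a minimal renewable quota might look like this (placed at product scope; the figures are placeholders, not recommendations):

```xml
<policies>
    <inbound>
        <base />
        <!-- Renewable per-subscription quota: 10,000 calls per 30 days (2,592,000 s) -->
        <quota calls="10000" renewal-period="2592000" />
    </inbound>
</policies>
```

Swapping `renewal-period="2592000"` for `renewal-period="0"` would turn the same block into a lifetime cap.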
Quota vs. Rate Limit: Key Differences
These two mechanisms solve different problems, and conflating them leads to misconfigured APIs. According to Microsoft's throttling guidance, rate limits protect against short and intense volume bursts, while quotas control call rates over a longer period, such as capping a subscriber at a set number of monthly requests.
In practical terms, rate limits work at the requests-per-second level and reset frequently. Quotas accumulate usage across hours, days, or months. If your goal is protecting backend stability from sudden spikes, a rate limit is the right tool. If your goal is enforcing subscription tiers or managing long-term token consumption budgets, the quota policy is what you need.
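To make the contrast concrete, the two policies can sit side by side in a product's inbound section; the numbers here are illustrative:

```xml
<inbound>
    <base />
    <!-- Burst protection: at most 10 calls in any 60-second window -->
    <rate-limit calls="10" renewal-period="60" />
    <!-- Long-term budget: 100,000 calls per 30-day billing window -->
    <quota calls="100000" renewal-period="2592000" />
</inbound>
```

A request must clear both gates: the rate limit guards seconds-scale bursts, while the quota tracks the monthly total.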
How the 403 Forbidden Response Works
When a subscriber exhausts their quota, Azure API Management returns a 403 Forbidden HTTP status code. The response also includes a Retry-After header, which tells the caller how many seconds to wait before attempting another request. This behavior applies across all service tiers, including Developer, Basic, Standard, and Premium, giving teams a consistent contract to code against regardless of the environment.
There is a subtlety here worth understanding: if underlying compute resources restart, the platform may continue handling requests briefly even after the quota ceiling is reached. Building client-side retry logic that respects the Retry-After value is the practical safeguard against this edge case.
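A minimal client-side sketch of that safeguard, in Python with only the standard library (it assumes the Retry-After value arrives in delta-seconds form; the function and limit names are illustrative):

```python
from __future__ import annotations

import time
import urllib.error
import urllib.request


def seconds_to_wait(retry_after: str | None, default: int = 60) -> int:
    """Parse a Retry-After header given in delta-seconds form, with a fallback."""
    try:
        return max(0, int(retry_after))
    except (TypeError, ValueError):
        return default


def call_with_quota_backoff(url: str, max_attempts: int = 3) -> bytes:
    """GET a URL, sleeping for the advertised Retry-After on 403 quota responses."""
    for _ in range(max_attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 403:
                raise
            # Quota exhausted: honor the gateway's hint before retrying.
            time.sleep(seconds_to_wait(err.headers.get("Retry-After")))
    raise RuntimeError("quota still exhausted after retries")
```

The fallback default matters: a client that retries immediately when the header is missing can hammer a gateway that is already rejecting it.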
How Does the quota-by-key Policy Differ from the Standard quota Policy?
The standard quota policy enforces call volume and bandwidth limits on a per-subscription basis, meaning every subscriber under a product shares a single counter tied to their subscription key. The quota-by-key policy takes a different approach: it enforces the same kind of renewable or lifetime limits against an arbitrary key you define at runtime, such as an IP address, a user ID, or a claim extracted from a JWT token. This distinction matters enormously for teams building multi-tenant SaaS products where a single subscription might serve dozens of distinct tenants, each needing their own isolated quota.
Choosing the Right Key Expression
The counter-key attribute is where the real flexibility lives. Because policy expressions are fully supported in the counter-key attribute, you can compute the key dynamically at request time using C# expressions against the request context. A few practical examples:
- IP-based isolation: `@(context.Request.IpAddress)` gives each client IP its own counter, useful for public APIs with anonymous callers.
- JWT claim: `@(context.Request.Headers.GetValueOrDefault("Authorization","").AsJwt()?.Claims.GetValueOrDefault("tenant_id", "unknown"))` isolates each tenant without requiring separate subscriptions.
- Custom header: `@(context.Request.Headers.GetValueOrDefault("X-Tenant-Id","default"))` works well when tenants pass an identifier in a header.
The optional increment-condition attribute adds another layer of control. It accepts a Boolean policy expression that decides whether a given request should count toward the quota at all. For example, you could set it to count only responses with a 200 status code, so failed or unauthorized calls do not erode a tenant's monthly allowance. This kind of conditional counting is difficult to replicate with the standard subscription-scoped policy.
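A hypothetical quota-by-key fragment combining both ideas (the header name and limits are assumptions for illustration):

```xml
<inbound>
    <base />
    <!-- Per-tenant quota keyed on an X-Tenant-Id header;
         only 200 responses consume the monthly allowance -->
    <quota-by-key calls="50000"
                  renewal-period="2592000"
                  counter-key='@(context.Request.Headers.GetValueOrDefault("X-Tenant-Id", "default"))'
                  increment-condition="@(context.Response.StatusCode == 200)" />
</inbound>
```

Every distinct `X-Tenant-Id` value gets its own 50,000-call bucket, and failed responses leave the counter untouched.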
Tier Compatibility for quota-by-key
The quota-by-key policy runs on Developer, Basic, Standard, and Premium tiers, but the Consumption tier does not support it. If your architecture relies on Consumption for cost optimization through serverless scaling, you will need to handle per-key enforcement at the application layer or restructure your tier choice. The standard quota policy, by contrast, covers all classic tiers without this restriction. Teams running AI coding agents or other high-throughput workloads on Consumption should plan for this gap early, since retrofitting tier changes after launch carries real migration cost.
What Are the Core Policy Attributes You Need to Configure?
Honestly, getting these attributes right separates a policy that works as intended from one that silently misbehaves. The Azure API Management quota policies expose five key attributes: calls, bandwidth, renewal-period, counter-key, and increment-condition. Each serves a distinct role, and most production deployments use several of them together.
The calls attribute sets the maximum number of requests a subscription or key can make within one renewal window. The bandwidth attribute caps the total kilobytes transferred during that same period, which matters when your API returns large payloads that could drive up costs even at low call volumes. The renewal-period attribute specifies the length in seconds of the fixed window after which the counter resets; setting it to 0 makes the quota apply for the lifetime of the subscription with no reset.
For quota-by-key policies, the counter-key attribute is critical. It accepts any string expression, including policy expressions that reference headers, JWT claims, or IP addresses, so you can create a separate quota bucket per tenant, user, or API key without issuing separate subscriptions. That flexibility makes counter-key the primary tool for multi-tenant cost optimization scenarios.
The increment-condition attribute is a boolean policy expression that decides whether a particular request counts toward the quota at all. You could, for example, exclude requests that return a 5xx error, ensuring clients are not penalized for backend failures.
Renewable vs. Lifetime Quotas
Setting renewal-period to a positive integer creates a renewable quota that resets on a fixed schedule. Setting it to 0 converts the policy into a lifetime quota, useful for trial API keys that should have a hard cap across their entire active period, with no possibility of a monthly refresh.
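As a sketch, the only difference between the two is the renewal-period value (the call counts are placeholders):

```xml
<!-- Renewable: the counter resets every 30 days -->
<quota calls="10000" renewal-period="2592000" />

<!-- Lifetime: a hard cap for a trial subscription, never resets -->
<quota calls="1000" renewal-period="0" />
```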
Combining calls and bandwidth Limits
You can specify both calls and bandwidth in a single policy block; the quota is considered exhausted as soon as either limit is reached. This pairing is particularly useful for AI coding agents or data-export APIs where a small number of calls can still transfer enormous amounts of data, making call count alone an insufficient control for token consumption and transfer costs.
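A sketch of the pairing, with illustrative numbers:

```xml
<!-- Exhausted as soon as EITHER limit is hit: 5,000 calls or 512,000 KB (~500 MB) -->
<quota calls="5000" bandwidth="512000" renewal-period="2592000" />
```

A subscriber making 50 calls that each return 10 MB would exhaust the bandwidth limit long before the call limit.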
How Do Azure API Management Tiers Affect Quota Behavior?
Your choice of Azure API Management tier directly shapes which quota policies are available and how quota counters behave at runtime. The basic quota policy works across all tiers, but more granular controls are locked to specific service levels, and the Consumption tier operates under a fundamentally different execution model.
As a starting point, the quota policy applies across Developer, Basic, Standard, and Premium tiers, giving teams a consistent per-subscription call volume cap regardless of where they sit in the tier hierarchy. The more powerful quota-by-key policy, however, is not available on the Consumption tier, which restricts flexible, per-tenant quota scenarios to paid classic tiers. This distinction matters significantly for teams building multi-tenant SaaS APIs on a cost-sensitive plan.
Quota counters are stored per service instance and carry over after a tier change, so upgrading your tier does not reset them. Any existing call volumes remain intact.
Multi-Region Quota Counter Synchronization
Premium tier supports multi-region deployments, and this introduces a real complexity for Azure API Management quota enforcement. Quota counters are not instantly synchronized across regional units. A caller hitting your West Europe gateway and your East US gateway could, in theory, exceed their intended limit before the counter state propagates. Teams relying on strict enforcement for high-value API products should account for this lag in their policy design, either by building in conservative thresholds or by routing quota-sensitive traffic to a single region.
Consumption Tier Considerations
The Consumption tier uses a per-execution billing model rather than a reserved capacity model. This affects how quota counters interact with the service lifecycle. Because Consumption instances can spin up and down rapidly, there is a known behavior where compute restarts may cause brief periods of continued request handling after a quota is reached. Teams using Consumption for cost optimization should treat quota enforcement here as "best effort" rather than hard-stop, and pair it with external monitoring for precise control over token consumption and call volume.
How Can You Monitor and Query Quota Counters via the REST API?
You can query live quota counter values programmatically through two dedicated REST API endpoints, giving your team visibility into consumption before consumers hit the 403 threshold. The Quota By Period Keys GET endpoint (REST API version 2024-05-01) retrieves the current counter value for a specific period, while the Quota By Counter Keys List By Service endpoint returns all counter values for a given quota key across your entire service instance.
Quota By Period Keys GET Request Structure
The GET request requires five path parameters: subscriptionId, resourceGroupName, serviceName, quotaCounterKey, and quotaPeriodKey. Each parameter narrows the query scope from your Azure subscription down to the exact counter you want to inspect. This precision matters when you have multiple products with distinct Azure API Management quota policies running simultaneously.
A typical use case is building an alerting pipeline around this endpoint. You can poll it on a schedule, compare the returned counter value against your configured limit, and trigger a notification when consumption crosses, say, 80% of the ceiling. Because callers receive a 403 Forbidden response once the quota is exceeded, proactive monitoring prevents your consumers from experiencing silent failures with no warning.
Key path parameters at a glance:
- `subscriptionId`: your Azure subscription identifier
- `resourceGroupName`: the resource group hosting the APIM instance
- `serviceName`: the APIM service name
- `quotaCounterKey`: matches the `counter-key` attribute in your quota-by-key policy
- `quotaPeriodKey`: the specific renewal window you want to inspect
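Assembled into a request against the 2024-05-01 API version, the call takes this shape (the path follows the Quota By Period Keys reference; all bracketed values are yours to supply):

```http
GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.ApiManagement/service/{serviceName}/quotas/{quotaCounterKey}/periods/{quotaPeriodKey}?api-version=2024-05-01
Authorization: Bearer {azure-ad-token}
```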
Logging Quota Events with Azure Monitor
Look, one detail that consistently catches teams off guard: Azure API Management does not retain historical quota data indefinitely. Only current counter values are queryable through the REST API, which means if you need an audit trail of token consumption or call volume over time, you must route that data externally.
Azure Monitor Diagnostic Settings and Event Hub are the two standard paths for this. Configure your APIM instance to stream gateway logs to a Log Analytics workspace or an Event Hub sink, then build queries or downstream processors that capture counter snapshots on whatever cadence your cost optimization strategy requires. For AI coding agents or any service with heavy API usage, this external log store becomes the source of truth for billing reconciliation and capacity planning. Skipping this step leaves you with no way to reconstruct consumption patterns after the fact, which is a significant gap for any team that needs to audit consumption across multiple consumers.
What Is the llm-token-limit Policy and How Does It Relate to Quota?
The llm-token-limit policy is a specialized Azure API Management control that caps token consumption per minute for large language model backends, working alongside (not replacing) the standard quota policy. While quota governs call volume over a renewal period, llm-token-limit targets the cost dimension unique to LLM APIs: how many tokens flow through each request. Together, they give teams a complete picture of both call frequency and context window expenditure.
As Microsoft's throttling documentation confirms, the llm-token-limit policy limits the number of tokens processed per minute by your backend to help protect against sudden spikes in token usage. That protection matters enormously when AI coding agents are hitting Azure OpenAI endpoints at high frequency, because a single agent session can exhaust a token budget in minutes if nothing intervenes.
Token Consumption vs. Call Volume: Why Both Matter
Standard quota policies count calls. A request that sends a 200-token prompt and one that sends a 4,000-token prompt both register as a single call, yet their backend costs differ by an order of magnitude. This is why token consumption needs its own governing layer.
Rate limits handle short bursts while quotas control call rates over longer periods, but neither of those mechanisms accounts for the token depth of each request. The llm-token-limit policy fills that gap by operating on a per-key basis, matching how quota-by-key scopes call limits to individual subscribers or tenants.
For teams running AI coding agents, this combination prevents two failure modes: a subscriber making too many calls (caught by quota) and a subscriber making fewer calls but with enormous context windows that drain the token budget early (caught by llm-token-limit).
Configuring llm-token-limit alongside quota
When placing both policies in the same inbound pipeline, the recommended pattern is to apply the quota policy first at the product or API scope, then apply llm-token-limit at the operation scope where the LLM endpoint lives. This way, a request blocked by quota never reaches the token-counting logic, keeping processing overhead low.
The policy applies across Developer, Basic, Basic v2, Standard, Standard v2, Premium, and Premium v2 tiers, covering the full range of production-grade deployments. Key configuration points to watch:
- Set `tokens-per-minute` conservatively at first, then tune upward based on observed usage patterns from your monitoring logs.
- Use the same counter key expression in both `quota-by-key` and `llm-token-limit` so limits align to the same identity (API key, subscription ID, or tenant header).
- Account for prompt tokens and completion tokens separately if your LLM backend reports them; some models weight completion tokens at a higher cost ratio.
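A sketch of the pairing, shown in one inbound section for brevity (attribute names follow Microsoft's llm-token-limit reference, but the limits are illustrative assumptions; verify against your APIM version):

```xml
<inbound>
    <base />
    <!-- Long-term call budget per subscription -->
    <quota calls="100000" renewal-period="2592000" />
    <!-- Per-minute token ceiling for the LLM backend -->
    <llm-token-limit counter-key="@(context.Subscription.Id)"
                     tokens-per-minute="20000"
                     estimate-prompt-tokens="true" />
</inbound>
```

Keying both policies to `context.Subscription.Id` means a subscriber exhausts either budget independently: too many calls trips the quota, too many tokens trips the per-minute ceiling.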
For cost optimization, treating token consumption as a first-class quota dimension is no longer optional when AI coding agents are part of the architecture. A misconfigured or absent llm-token-limit can silently turn a controlled monthly quota into a runaway billing event before any alert fires.
How Do You Design API Product Tiers Using Quota Policies?
You design API product tiers in Azure API Management by grouping APIs into products and attaching quota policies at the appropriate scope for each tier. This gives you a clean mechanism to offer differentiated service levels, such as a starter plan with 100 calls per month and a professional plan with 500 calls per month, all sharing the same backend API. The policy scope hierarchy is what makes this work without requiring separate infrastructure for each tier.
Policy Scope Hierarchy in APIM
Azure API Management evaluates policies at four scopes: global, product, API, and operation. Lower scopes override higher ones, so an operation-level policy takes precedence over a product-level policy when both apply to the same request. This matters because a product-level quota and an operation-level rate-limit can coexist and both enforce independently. As Microsoft's documentation confirms, product, API, and operation call quotas are applied independently, meaning a subscriber can hit an operation ceiling without exhausting their monthly product quota, and vice versa.
When structuring tiers, think of the layered policy scopes as a set of gates. A request passes through each gate in sequence. You can apply a monthly call cap at the product level, then apply tighter per-operation limits for expensive endpoints, and the two constraints coexist without one canceling the other.
Structuring Products for Cost Optimization
For practical cost optimization, map each product to a subscription tier and attach a quota policy scoped at the product level. A common pattern mirrors what Microsoft illustrates with monetized APIs: a basic plan might allow 10,000 calls per month, while a premium plan scales to 100,000,000 calls per month, both pointing at identical backend APIs.
Key considerations when building your product tiers:
- Set the `renewal-period` attribute to match your billing cycle (for example, 2,592,000 seconds for a 30-day window).
- Use `quota-by-key` for multi-tenant scenarios where a single subscription must track quotas per tenant identifier rather than per subscription key.
- Pair product-level quotas with operation-level rate-limits to protect high-cost endpoints even within a generous plan.
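Using the monetization numbers above, the two product-scope policies differ only in their calls value:

```xml
<!-- "Basic" product policy: 10,000 calls per 30-day cycle -->
<inbound>
    <base />
    <quota calls="10000" renewal-period="2592000" />
</inbound>

<!-- "Premium" product policy: same backend APIs, 100,000,000 calls per cycle -->
<inbound>
    <base />
    <quota calls="100000000" renewal-period="2592000" />
</inbound>
```

Moving a subscriber between tiers is then just a product assignment change, not a backend or infrastructure change.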
When a subscriber exceeds their quota, the caller receives a 403 Forbidden response along with a Retry-After header indicating when to retry. This behavior is consistent across Developer, Basic, Standard, and Premium service tiers, which means your tier design transfers across environments without policy changes. Planning the policy hierarchy carefully upfront reduces the rework needed as your product catalog grows.
What Are Common Quota Configuration Mistakes and How Do You Avoid Them?
Misconfigured Azure API Management quota policies are surprisingly easy to ship to production, and they tend to surface at the worst possible moment. Most mistakes fall into a handful of repeatable patterns, all of which are avoidable with deliberate policy design and proper testing before promotion.
Setting `renewal-period` too short. The renewal-period attribute controls how long the fixed window runs before the quota resets, so setting it to something like 60 seconds turns a quota into a de facto rate limiter. That creates confusion for consumers who expect quota to govern monthly or daily call volumes, not burst windows. Rate limits protect against short intense spikes; quotas govern longer periods. Keep those responsibilities separate.
Omitting `increment-condition`. Without this attribute, every request counts toward the quota, including ones that fail because your backend returned a 5xx. Consumers get penalized for errors outside their control. The increment-condition attribute accepts a Boolean policy expression so you can restrict counting to successful responses only, for example @(context.Response.StatusCode < 400).
Using a non-unique `counter-key` expression. If every tenant resolves to the same key string, they all share one quota bucket. Multi-tenant platforms must produce a distinct key per tenant, typically from a subscription ID, JWT claim, or header value, so verify before deployment that your counter-key expression produces unique, non-colliding values across all expected callers.
Not exposing remaining quota in response headers. Consumers cannot self-manage their token consumption or context window usage when they have no visibility into how much quota remains. Add outbound policy logic to surface remaining call counts so clients can back off gracefully rather than hitting a hard 403.
Skipping Developer tier testing. Counter resets are difficult to trigger manually in Production. Test quota exhaustion and recovery behavior in the Developer tier first, where the environment is isolated and stakes are low.
How Do You Retrieve and Report Historical Quota Usage?
APIM does not natively store historical quota limit configurations or past counter values; only the current counter state for an active period is queryable through the REST API. To reconstruct usage history, teams need to combine Azure Monitor Diagnostic Logs with an external storage strategy. This is a known gap that affects any team running compliance audits or usage billing workflows.
Azure Monitor Diagnostic Settings for APIM
The Quota By Period Keys - Get endpoint retrieves the current counter value for a specific period, but it tells you nothing about what happened last month or last quarter. For historical reconstruction, Azure Monitor Diagnostic Settings are the practical starting point. Enabling the GatewayLogs diagnostic category sends request-level data to a Log Analytics workspace, where you can write Kusto queries to aggregate call volumes by subscription, API, or time window.
If your team needs to pipe this data into Datadog, Splunk, or a custom dashboard, Event Hub integration is the right path. APIM can stream gateway events to an Event Hub, and your downstream consumers handle aggregation and retention on their own schedule. This approach keeps token consumption metrics and quota-related signals flowing into whatever observability stack you already operate.
Snapshot Pattern for Period Boundaries
For AI coding agents or automated pipelines, a practical pattern is to snapshot quota counter values at the boundary of each renewal period and write them to Azure Table Storage. Because the renewal-period attribute defines the fixed window after which a quota resets, you know exactly when counters will zero out. Capturing a snapshot just before that reset gives you a durable record of peak consumption per period without relying on any built-in APIM history feature.
A simple implementation looks like this:
- Call the Quota By Period Keys REST endpoint near the end of each window.
- Write the counter value, the subscription key, and the timestamp to Azure Table Storage or Cosmos DB.
- Query those snapshots for cost optimization reports or customer-facing usage dashboards.
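The first two steps can be sketched in Python with only the standard library. The URL shape follows the Quota By Period Keys reference; the function names and all placeholder values are illustrative, and authentication plus the Table Storage write are left to the caller:

```python
import json
import urllib.request

API_VERSION = "2024-05-01"


def counter_url(subscription_id: str, resource_group: str, service: str,
                counter_key: str, period_key: str) -> str:
    """Build the management-plane URL for the Quota By Period Keys GET call."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ApiManagement/service/{service}"
        f"/quotas/{counter_key}/periods/{period_key}"
        f"?api-version={API_VERSION}"
    )


def fetch_snapshot(url: str, bearer_token: str) -> dict:
    """GET the current counter value; the caller persists it externally."""
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {bearer_token}"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Run this from a timer-triggered job near each window boundary and append the returned counter value, the key, and a timestamp to your storage of choice.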
This pattern is especially useful when your setup spans multiple subscriptions or products, since each combination tracks its quota counter independently and needs its own snapshot record.
Frequently Asked Questions
What HTTP status code does Azure API Management return when a quota is exceeded?
A 403 Forbidden, accompanied by a Retry-After header telling the caller how many seconds to wait before retrying.

Can the quota policy be applied at the operation level instead of the product level?
Yes. Policies can be scoped globally or at the product, API, or operation level, and product, API, and operation call quotas are applied independently.

Does the quota counter reset automatically after the renewal period?
Yes, when renewal-period is a positive number of seconds. Setting it to 0 creates a lifetime quota that never resets.

How does quota-by-key handle distributed or multi-region APIM deployments?
Counters are not instantly synchronized across regional units, so a caller hitting multiple regional gateways can briefly exceed the intended limit. Use conservative thresholds or route quota-sensitive traffic to a single region.

Is there a way to programmatically reset a quota counter before the renewal period ends?
Not through the policy itself; counters reset only at the renewal boundary, or never for lifetime quotas. If you need an early reset, look at the Quota By Counter Keys REST operations, which work against current counter values.

What is the difference between the quota policy and the rate-limit policy in Azure API Management?
Rate limits protect against short, intense bursts and reset frequently; quotas accumulate usage over hours, days, or months to enforce subscription tiers and long-term budgets.

Does the llm-token-limit policy replace the quota policy for AI API endpoints?
No. It complements it: quota governs call volume over a renewal period, while llm-token-limit caps the tokens processed per minute.

How do I expose remaining quota to API consumers in the response headers?
Add outbound policy logic that surfaces the remaining call count in a response header so clients can back off gracefully instead of hitting a hard 403.

What is the quota-by-key policy and when should I use it instead of the standard quota policy?
quota-by-key enforces the same renewable or lifetime limits against an arbitrary runtime key, such as an IP address, user ID, or JWT claim. Use it whenever one subscription must track quotas for multiple distinct callers.

What are the core attributes needed to configure the quota policy?
calls, bandwidth, and renewal-period, plus counter-key and increment-condition for the quota-by-key variant.

Which Azure API Management tiers support the quota-by-key policy?
Developer, Basic, Standard, and Premium. The Consumption tier does not support it.
Nicola
Developer and creator of vexp — a context engine for AI coding agents. I build tools that make AI coding assistants faster, cheaper, and actually useful on real codebases.