Rate Limits

Qubax enforces rate limits per API key to protect shared infrastructure and ensure fair access for all users. This page explains how limits are applied, how to interpret a 429 response, and how to build a reliable retry strategy.

Per-Key Limits

Limits are tracked against your API key (the qbx_live_... token), not against individual requests or sessions. Every request made with a key draws from that key's shared quota. Two common dimensions are enforced together:

Requests per minute (RPM) — the number of calls allowed in a rolling 60-second window.
Tokens per minute (TPM) — the total number of input plus output tokens allowed per minute across all requests.

Your specific limits depend on your plan and tier. Every response includes headers that tell you exactly where you stand:

Header	Meaning
x-ratelimit-limit-requests	Max requests per minute for your key.
x-ratelimit-remaining-requests	Requests remaining in the current window.
x-ratelimit-limit-tokens	Max tokens per minute for your key.
x-ratelimit-remaining-tokens	Tokens remaining in the current window.
x-ratelimit-reset-requests	Time until the request window resets.

ℹ️

Inspect these headers on every response to throttle proactively and avoid hitting the hard limit. When x-ratelimit-remaining-* approaches zero, pause briefly before sending the next request.
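
The check described above can be sketched as a small helper. This is a minimal sketch, assuming the headers arrive as a mapping with lowercase names; the function name proactive_pause, both thresholds, and the pause length are illustrative choices, not values defined by Qubax.

```python
from typing import Mapping


def proactive_pause(
    headers: Mapping[str, str],
    min_requests: int = 2,       # assumed threshold, tune for your workload
    min_tokens: int = 500,       # assumed threshold, tune for your workload
    pause_seconds: float = 1.0,  # assumed pause length, not a Qubax value
) -> float:
    """Return how many seconds to pause before sending the next request."""
    thresholds = {
        "x-ratelimit-remaining-requests": min_requests,
        "x-ratelimit-remaining-tokens": min_tokens,
    }
    for name, floor in thresholds.items():
        raw = headers.get(name)
        if raw is None:
            continue  # header absent: nothing to act on
        try:
            remaining = int(raw)
        except ValueError:
            continue  # unexpected format: ignore it rather than guess
        if remaining <= floor:
            return pause_seconds
    return 0.0
```

Call it with the headers of the most recent response and sleep for the returned duration before the next request. A plain dict is case-sensitive, so normalize header names to lowercase first if your HTTP client does not do so.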

Handling 429 Responses

When you exceed a limit, Qubax responds with HTTP 429 Too Many Requests and a JSON body describing the violation. The response also includes a Retry-After header giving the recommended wait time in seconds.

JSON

{
  "error": {
    "message": "Rate limit reached for qbx_live_... at 60 requests per minute.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Treat a 429 as a signal to slow down, not as a fatal error. The request was never processed, so it is safe to retry once the window resets.
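
If you are working with raw HTTP responses rather than an SDK, the wait time and error message can be extracted roughly as follows. This is a sketch under stated assumptions: the function name parse_rate_limit_response and the 1-second fallback are hypothetical, and the body is assumed to follow the JSON shape shown above.

```python
import json
from typing import Mapping, Optional, Tuple


def parse_rate_limit_response(
    status: int, headers: Mapping[str, str], body: str
) -> Optional[Tuple[float, str]]:
    """Return (wait_seconds, message) for a 429, or None for any other status."""
    if status != 429:
        return None
    try:
        wait = float(headers.get("retry-after", ""))
    except ValueError:
        wait = 1.0  # assumed fallback when the header is missing or not numeric
    try:
        message = json.loads(body)["error"]["message"]
    except (ValueError, KeyError, TypeError):
        message = "rate limited"  # body did not match the expected shape
    return max(wait, 0.0), message
```

The caller sleeps for wait_seconds and then resends the same request; the message is useful for logging which limit was hit.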

Retry Strategies

A robust client retries transient failures automatically. The recommended approach is exponential backoff with jitter: double the delay after each failure and add a small random offset so that many clients do not retry in lockstep.

Python

import random
import time
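from openai import APIConnectionError, APITimeoutError, InternalServerError, OpenAI, RateLimitError

# max_retries=0 turns off the SDK's built-in retries so they do not stack with the loop below.
client = OpenAI(api_key="qbx_live_...", base_url="https://api.qubax.ai/v1", max_retries=0)

MAX_RETRIES = 5
RETRYABLE = (RateLimitError, InternalServerError, APIConnectionError, APITimeoutError)

def retry_after_seconds(error):
    """Return the Retry-After value in seconds, or None if absent or not numeric."""
    # Connection errors and timeouts carry no HTTP response, hence the defaults.
    headers = getattr(getattr(error, "response", None), "headers", {})
    try:
        return float(headers.get("retry-after"))
    except (TypeError, ValueError):
        return None

def chat_with_retry(**kwargs):
    delay = 1.0
    for attempt in range(MAX_RETRIES):
        try:
            return client.chat.completions.create(**kwargs)
        except RETRYABLE as e:
            # Out of attempts: surface the last error to the caller.
            if attempt == MAX_RETRIES - 1:
                raise
            # Respect Retry-After when available, otherwise back off with jitter.
            wait = retry_after_seconds(e)
            if wait is None:
                wait = delay + random.uniform(0, 0.5)
            time.sleep(wait)
            delay *= 2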

response = chat_with_retry(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
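
To make the wait schedule concrete, the delay calculation can be isolated in a pure function. This is an illustrative sketch: the name backoff_delays, the 30-second cap, and the injectable rng parameter are assumptions added here for demonstration and testing, not part of the Qubax API.

```python
import random
from typing import Callable, List


def backoff_delays(
    retries: int,
    base: float = 1.0,
    jitter: float = 0.5,
    cap: float = 30.0,  # assumed upper bound so waits do not grow without limit
    rng: Callable[[], float] = random.random,
) -> List[float]:
    """Return the wait before each retry: a doubling delay plus random jitter."""
    delays = []
    delay = base
    for _ in range(retries):
        # rng() is in [0, 1), so the jitter term is in [0, jitter).
        delays.append(min(delay, cap) + rng() * jitter)
        delay *= 2
    return delays
```

With the jitter term removed, four retries wait 1, 2, 4, and 8 seconds. The jitter adds up to half a second to each wait so that clients that failed at the same moment do not all retry at the same moment.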

Best practices for production traffic:

Retry only on 429, 5xx, timeouts, and connection errors — never on 4xx client errors like 400 or 401.
Honor the Retry-After header whenever it is present.
Cap the number of retries (typically 3–5) to avoid runaway retry loops.
Add jitter to prevent synchronized retry storms when many requests fail at once.
For high throughput, add client-side throttling so you stay just below your RPM/TPM limits instead of relying on 429s.
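
Client-side throttling, the last practice above, can be sketched as a rolling-window limiter. This is a minimal sketch under stated assumptions: the class name SlidingWindowLimiter is hypothetical, it covers only the request count (not TPM), and it is not thread-safe. The clock and sleep parameters exist only so the behavior can be tested without real waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls calls in any rolling window of period seconds."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def acquire(self):
        """Block until a call is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Window is full: wait until the oldest call leaves it.
            self._sleep(self.period - (now - self._calls[0]))
```

Create one limiter per API key with max_calls set slightly below the value reported in x-ratelimit-limit-requests, and call acquire() immediately before each request.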

⚠️

A retried request that succeeds consumes input and output tokens from your quota like any other request. Keep your retry cap modest, and consider idempotency keys for write-style operations to avoid duplicate side effects.
