Groq API Free Tier Limits in 2026: What You Actually Get

Groq API free tier limits in 2026: exact RPM, RPD, TPM, and TPD for every model including Llama 3.3, Llama 4 Scout, and Whisper.

I've been integrating Groq into a few side projects lately, and the first question I always ask about any new API is: what's the catch? With Groq's free tier, the answer is surprisingly simple: rate limits, and not punishing ones.

Here's a complete breakdown of what Groq's free tier gives you in 2026, model by model, so you can decide whether it fits your project before writing a single line of code.

What Is Groq, Exactly?

Groq is an AI inference company that built custom chips called LPUs (Language Processing Units) specifically designed to run large language models fast. Not "fast" in the marketing sense. Genuinely fast. We're talking 500+ tokens per second on some models, which is an order of magnitude beyond what you get from most hosted APIs.

Their free API tier exists because Groq's actual business is selling LPU hardware to enterprises. The free API is essentially a showcase. That's good news for developers: the incentive to keep it generous is real, and the free tier has been available since launch with no signs of going away.

No credit card required. Sign up at console.groq.com with an email or Google account and you're making API calls within minutes.

The Free Tier Rate Limits (March 2026)

These are pulled directly from Groq's official documentation. Rate limits apply at the organization level, not per user.

| Model | RPM | RPD | TPM | TPD |
|---|---|---|---|---|
| llama-3.1-8b-instant | 30 | 14,400 | 6,000 | 500,000 |
| llama-3.3-70b-versatile | 30 | 1,000 | 12,000 | 100,000 |
| meta-llama/llama-4-scout-17b-16e-instruct | 30 | 1,000 | 30,000 | 500,000 |
| moonshotai/kimi-k2-instruct | 60 | 1,000 | 10,000 | 300,000 |
| qwen/qwen3-32b | 60 | 1,000 | 6,000 | 500,000 |
| openai/gpt-oss-120b | 30 | 1,000 | 8,000 | 200,000 |
| openai/gpt-oss-20b | 30 | 1,000 | 8,000 | 200,000 |
| groq/compound | 30 | 250 | 70,000 | — |
| groq/compound-mini | 30 | 250 | 70,000 | — |
| allam-2-7b | 30 | 7,000 | 6,000 | 500,000 |
| whisper-large-v3 (speech-to-text) | 20 | 2,000 | — | — |
| whisper-large-v3-turbo | 20 | 2,000 | — | — |

RPM = requests per minute, RPD = requests per day, TPM = tokens per minute, TPD = tokens per day.

One important nuance: you hit whichever limit arrives first. If you send 30 small requests in under a minute, you've hit your RPM cap even if you've barely touched your TPM. Design your code around both axes.

Also worth knowing: cached tokens don't count toward your rate limits. If you're using a consistent system prompt across requests, prompt caching can meaningfully extend how far your free tier goes.

Which Model Should You Use?

The right model depends on what you're building.

For prototyping and high request volume: llama-3.1-8b-instant is the workhorse. At 14,400 requests per day, it's the most permissive model on the free tier by a wide margin. It's fast, capable enough for most tasks, and the daily token budget of 500,000 is generous. Start here.

For quality over volume: llama-3.3-70b-versatile is the step up. You're capped at 1,000 requests per day, but the output quality is substantially better for complex reasoning, summarization, and generation tasks. Good for workflows where you're not hammering the API.

For newer architecture: meta-llama/llama-4-scout-17b-16e-instruct has the highest TPM on the free tier at 30,000. Useful if your use case involves long contexts or large prompts. The 1,000 RPD cap applies here too.

For speech-to-text: Both Whisper models give you 2,000 requests per day and 7,200 audio seconds per hour. That's roughly two hours of audio per hour of clock time. Workable for transcription pipelines that don't need to operate continuously.

What Happens When You Hit a Limit?

The API returns a 429 Too Many Requests status code. Your code should handle this.

The response headers tell you exactly where you stand before you hit the wall:

x-ratelimit-limit-requests: 14400       // your RPD cap
x-ratelimit-remaining-requests: 14370   // how many you have left today
x-ratelimit-limit-tokens: 6000          // your TPM cap
x-ratelimit-remaining-tokens: 5997      // tokens remaining this minute
x-ratelimit-reset-tokens: 7.66s         // when TPM resets

Read these headers in your application and you can implement backoff logic or queuing before you ever see a 429.

Is the Free Tier Enough for Your Use Case?

Here's how it maps to common scenarios:

Personal projects and side projects: Almost certainly yes. 14,400 requests per day on the 8B model is more than most hobby projects will ever touch. If you're building a personal tool, a CLI utility, or experimenting with AI integration, you won't hit the ceiling.

Prototyping and demos: Yes. The free tier is designed for exactly this. Spin up a proof of concept, validate the idea, then decide whether to upgrade.

Production applications with real user traffic: It depends on load. If you're serving more than a few dozen active users making real-time API calls, you'll run into the per-minute limits fairly quickly. At 30 RPM, you have one request every two seconds. That's a tight budget once concurrent usage kicks in.

Batch processing jobs: Probably not the right fit. The daily limits will constrain large batch workloads. Groq offers a Batch API at 50% off paid rates, which is a better path once you need volume.

Audio transcription pipelines: The 2,000 RPD limit on Whisper is modest but workable for internal tooling or low-frequency transcription tasks.

How the Free Tier Compares to Developer Plan

The Developer plan gives you roughly 10x higher token consumption and unlocks Batch and Flex processing. Rate limits scale up substantially: for example, llama-3.1-8b-instant on the Developer plan goes from 14,400 to 500,000 RPD.

If you're past the prototype stage and building something with real traffic, the upgrade math is usually straightforward. But the free tier is genuinely useful as a starting point, not just a trial with an aggressive expiration forcing you to pay.

The One Thing Most Developers Miss

The rate limits are per organization, not per API key. If you have multiple keys under the same organization, they share the same bucket. This trips people up when they think they can multiply their limits by creating additional keys. You can't.

If you need separate rate limit pools, you need separate organizations.

Getting Started

Sign up at console.groq.com. Generate an API key. The endpoint is OpenAI-compatible, so if you're already using the OpenAI SDK you can point it at Groq's base URL with a one-line change:

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.GROQ_API_KEY,
  baseURL: 'https://api.groq.com/openai/v1'
});

const response = await client.chat.completions.create({
  model: 'llama-3.3-70b-versatile',
  messages: [{ role: 'user', content: 'Hello' }]
});

That's it. No SDK to install beyond what you already have, no new authentication scheme to learn. If you've built on OpenAI before, you're 30 seconds away from running on Groq.

The free tier is real, it's no-credit-card-required, and for most development use cases it's more than enough to build something worth building. The limits are there, but they're not punishing. And the response headers give you everything you need to handle them gracefully.