
OpenAI WebSocket Mode for Responses API: How to Use It and Why It's a Game-Changer for AI Agents (2026)

OpenAI just launched WebSocket mode for the Responses API — a persistent connection that cuts latency by up to 40% in agentic workflows. Learn how it works, how to integrate it in voice agents, and the top use cases you need to know.


OpenAI has officially launched WebSocket mode for its Responses API (wss://api.openai.com/v1/responses) — a persistent, low-latency connection designed specifically for long-running agentic workflows. If you're building AI agents that loop through dozens of tool calls, this is the most impactful infrastructure update OpenAI has shipped in recent months.


What Is WebSocket Mode for the Responses API?

Unlike the traditional HTTP REST approach where every turn opens a brand-new connection, WebSocket mode lets your agent maintain a single persistent connection to /v1/responses across the entire workflow.

Each new turn sends only incremental inputs (new user messages or tool outputs) along with a previous_response_id reference — no need to resend full conversation history. This is made possible by a connection-local in-memory cache that the server keeps for your most recent response on that socket.

Key distinction: This is different from the existing OpenAI Realtime API (wss://api.openai.com/v1/realtime), which handles speech-to-speech audio. The new WebSocket mode is for the text/chat Responses API, aimed at orchestration, agentic coding, and tool-heavy pipelines.


Why This Matters: Performance Gains

The old pattern — HTTP polling with full context resent each turn — adds significant overhead in agents that call many tools. WebSocket mode directly fixes this.

| Workflow Type | HTTP REST Pattern | WebSocket Mode |
| --- | --- | --- |
| Single turn Q&A | Fine | Fine |
| 5–10 tool call loop | Moderate overhead | Faster |
| 20+ tool call chain | High overhead, slow | Up to ~40% faster |
| ZDR / store=false | Works | Fully compatible |
| Parallel runs | N/A | Multiple connections needed |

The in-memory cache is the key. Instead of re-hydrating context from disk on every turn, the server reuses connection-local state for continuation — making each round-trip meaningfully faster in long agent loops.


How to Connect: Step-by-Step

Step 1 — Open the Connection

Install the websocket-client library if using Python (pip install websocket-client), then connect with your API key:

from websocket import create_connection
import json, os

ws = create_connection(
    "wss://api.openai.com/v1/responses",
    header=[
        f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
    ],
)

Step 2 — Send Your First response.create

Fire the first turn with the full system prompt, tools, and the user's initial message:

ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-5.2",
    "store": False,
    "input": [
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Find the fizz_buzz() function in my codebase."}]
        }
    ],
    "tools": [
        # your tool definitions here
    ]
}))
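
After each `response.create`, your agent should drain streamed events off the socket until the server signals completion. The event names below (`response.output_text.delta`, `response.completed`) are modeled on OpenAI's streaming conventions and are an assumption in this sketch; adapt them to the events your connection actually emits:

```python
import json

def collect_response(event_stream):
    """Accumulate streamed events into (response_id, full_text).

    Event names here are assumptions modeled on OpenAI's streaming
    conventions; adjust them to what your connection actually emits.
    """
    response_id, chunks = None, []
    for raw in event_stream:
        event = json.loads(raw)
        if event.get("type") == "response.output_text.delta":
            chunks.append(event.get("delta", ""))
        elif event.get("type") == "response.completed":
            response_id = event["response"]["id"]
            break
    return response_id, "".join(chunks)

# Usage against a live socket (untested sketch):
# response_id, text = collect_response(iter(ws.recv, None))
```

Capturing the `response_id` here is what lets the next turn chain via `previous_response_id`.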

Step 3 — Continue Turns Incrementally

For every follow-up turn, only send new inputs + chain from the previous response ID. Never resend full conversation history.

ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-5.2",
    "store": False,
    "previous_response_id": "resp_abc123",
    "input": [
        {
            "type": "function_call_output",
            "call_id": "call_xyz",
            "output": "{ 'result': 'function found at line 42' }",
        },
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Now optimize it for performance."}],
        }
    ],
    "tools": []
}))
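
A small helper keeps incremental turns honest: it packages only the new items plus the chain pointer, and serializes tool results as proper JSON strings. `build_turn` is our own illustrative name, not an SDK function:

```python
import json

def build_turn(previous_response_id, tool_outputs=None, user_text=None,
               model="gpt-5.2", store=False):
    """Build an incremental response.create payload: only new items
    plus the chain pointer, never the full conversation history."""
    items = []
    for call_id, output in (tool_outputs or {}).items():
        items.append({
            "type": "function_call_output",
            "call_id": call_id,
            # Tool outputs must be strings; serialize as real JSON
            "output": json.dumps(output),
        })
    if user_text:
        items.append({
            "type": "message", "role": "user",
            "content": [{"type": "input_text", "text": user_text}],
        })
    return {
        "type": "response.create",
        "model": model,
        "store": store,
        "previous_response_id": previous_response_id,
        "input": items,
        "tools": [],
    }

# ws.send(json.dumps(build_turn("resp_abc123",
#     tool_outputs={"call_xyz": {"result": "ok"}})))
```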

Step 4 — Warm Up for Faster First-Turn Response

Pre-warm the connection with generate: false to load context into cache before the user speaks:

ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-5.2",
    "store": False,
    "generate": False,
    "input": [
        {"type": "message", "role": "system",
         "content": [{"type": "input_text", "text": "You are a helpful booking assistant."}]}
    ],
    "tools": []
}))

Integrating With Voice Agents

The WebSocket Responses API is the orchestration brain of your voice agent pipeline. Here's the full architecture:

User speaks
    ↓
[STT — Whisper / Deepgram]
    ↓  (transcript text)
[Responses API WebSocket] ← persistent connection
    ↓  (text + tool calls)
[Tool Execution Layer]  (calendar, CRM, search, etc.)
    ↓  (tool result)
[Responses API WebSocket] ← incremental continuation
    ↓  (final text response)
[TTS — OpenAI TTS / ElevenLabs]
    ↓
User hears response
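
A thin routing function can glue the stages together: each output item from the model either goes to the tool execution layer or on to TTS. The event shapes used here are assumptions for illustration:

```python
def next_pipeline_step(item):
    """Route one model output item to the next stage of the voice
    pipeline. Pure routing logic; the item shapes are assumptions."""
    if item["type"] == "function_call":
        # Hand off to the tool execution layer (calendar, CRM, search...)
        return ("tool_execution", item["name"], item.get("arguments"))
    if item["type"] == "message":
        # Final assistant text: concatenate and send to TTS
        text = "".join(part["text"] for part in item["content"]
                       if part["type"] == "output_text")
        return ("tts", text)
    return ("ignore", None)
```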

Why not just use the Realtime API for everything? The Realtime API (/v1/realtime) is best for native speech-to-speech without intermediate text. But if you need custom tool execution logic, text processing middleware, or store=false ZDR compliance, the Responses API WebSocket + STT + TTS pattern gives you far more control.


Key Use Cases

1. Agentic Coding Assistants

An AI coding agent that runs read_file → analyze → edit → run_tests → fix → run_tests in a loop is exactly what this is built for. With 20+ tool call chains running up to 40% faster, Cursor-style coding agents benefit enormously.

2. Voice-Based Customer Support Bots

Phone bots (built with Twilio, Plivo, or Exotel) can now use the Responses API WebSocket as the brain — keeping one persistent connection open per call session, handling CRM lookups, booking confirmations, and escalation logic through tool calls, all over a single socket.

3. Real-Time Orchestration Pipelines

Multi-agent orchestration systems — where a supervisor agent delegates tasks to sub-agents — benefit from incremental input continuation. Each delegation round trip doesn't re-upload the full context.

4. Long-Running Research Agents

An agent that browses the web, reads documents, calls search APIs, and synthesizes answers can now run a full 30–50 step pipeline without latency overhead accumulating at every turn.

5. AI Tutors and Learning Bots

Educational platforms running multi-turn Socratic dialogue with code execution and adaptive questioning can maintain session state on one persistent connection per student, with clean ZDR compliance for student data privacy.


How It Improves Existing Agents

  • No repeated context uploads — only new items are sent per turn, not the full thread

  • Connection-local cache — the server reuses in-memory state instead of loading from disk on each turn

  • ZDR-compatible — works with store=false, so no conversation data is persisted to OpenAI servers

  • Warmup support — pre-load tools and instructions before the user's first message to eliminate cold-start latency

  • Sequential safety — runs are executed one at a time on a connection, preventing race conditions
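
Since the server executes runs one at a time per connection, it helps to enforce the same rule client-side rather than silently queueing. A minimal sketch (the wrapper and its method names are our own, not an SDK API):

```python
import json
import threading

class SerializedSocket:
    """Guard a WebSocket so only one run is in flight at a time,
    mirroring the server's one-run-per-connection execution model."""

    def __init__(self, ws):
        self._ws = ws
        self._lock = threading.Lock()
        self._busy = False

    def send_json(self, payload):
        # Refuse a second concurrent run on the same socket; true
        # parallelism requires opening another connection.
        with self._lock:
            if self._busy:
                raise RuntimeError(
                    "a run is already in flight; open another connection")
            self._busy = True
        self._ws.send(json.dumps(payload))

    def mark_done(self):
        # Call this when response.completed (or an error) arrives.
        with self._lock:
            self._busy = False
```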


Connection Limits and Error Handling

  • Max 60 minutes per WebSocket connection — implement a reconnect handler that resumes from the last response_id

  • No multiplexing — if you need parallel agent runs, open separate connections

  • previous_response_not_found — returned when the cached ID is missing; handle by sending full context again or using /responses/compact first

def reconnect_and_continue(last_response_id, full_context):
    ws = create_connection(
        "wss://api.openai.com/v1/responses",
        header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
    )
    # With store=True, response IDs persist server-side, so the chain can
    # resume from last_response_id. Under store=False the cached chain died
    # with the socket, so replay the full history instead.
    payload = {"type": "response.create", "model": "gpt-5.2",
               "store": True, "tools": []}
    if last_response_id:
        payload["previous_response_id"] = last_response_id
        payload["input"] = []  # nothing new yet; just re-anchor the chain
    else:
        payload["input"] = full_context
    ws.send(json.dumps(payload))
    return ws
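
To decide between the two recovery paths — resending full context or compacting first — a small dispatcher can inspect the error and the size of the local history. The error-event shape and the threshold below are illustrative assumptions:

```python
def recovery_action(error_event, history_items, compact_threshold=200):
    """Pick a recovery strategy for a failed chained turn.

    The error code comes from the article; the event dict shape and
    the item-count threshold are illustrative assumptions.
    """
    code = error_event.get("error", {}).get("code")
    if code != "previous_response_not_found":
        return "raise"  # some other failure; surface it
    if len(history_items) > compact_threshold:
        return "compact_then_resend"  # long run: shrink via /responses/compact
    return "resend_full_context"
```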

/responses/compact — Your Context Window Safety Net

For very long agent runs that approach context limits, use /responses/compact to compress history, then start a fresh chain:

# The compaction call goes over plain HTTP via the SDK client
# (e.g. `from openai import OpenAI; client = OpenAI()`):
compacted = client.responses.compact(model="gpt-5.2", input=long_input_array)

ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-5.2",
    "store": False,
    "previous_response_id": None,
    "input": [
        *compacted.output,
        {"type": "message", "role": "user",
         "content": [{"type": "input_text", "text": "Continue from here."}]}
    ],
    "tools": []
}))
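
Rather than waiting for a hard context-limit error, you can trigger compaction proactively with a rough token estimate. The window size, chars-per-token ratio, and headroom below are illustrative assumptions, not published limits:

```python
import json

def should_compact(history_items, context_window_tokens=400_000,
                   chars_per_token=4, headroom=0.8):
    """Rough compaction trigger: estimate tokens from serialized length
    and compare against a fraction of the context window. All three
    default numbers are illustrative assumptions."""
    approx_tokens = len(json.dumps(history_items)) / chars_per_token
    return approx_tokens > context_window_tokens * headroom
```

Check this before each turn in long runs; when it fires, call `/responses/compact` and start a fresh chain as shown above.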

Quick Reference: Which Transport to Use

| Scenario | Best Transport |
| --- | --- |
| Browser voice app (mic input) | WebRTC (/v1/realtime) |
| Server-to-server speech-to-speech | WebSocket Realtime API |
| Server agent with many tool calls | WebSocket Responses API (new) |
| Simple single-turn chat | HTTP REST /v1/responses |
| Long agentic coding / research runs | WebSocket Responses API (new) |

OpenAI's new WebSocket mode for the Responses API marks a clear architectural shift — from stateless HTTP calls to stateful, session-aware agent connections. For any developer building production AI agents in 2026, this is the right transport layer to adopt now.