A client came to us with a simple brief: their support team was drowning. 200+ tickets a day, 80% of them asking the same 15 questions. They'd tried a basic FAQ bot before — the kind that pattern-matches on keywords — and users hated it. They wanted something smarter. We had 3 weeks. Here's what we shipped and how we built it.
The Constraint That Shaped Everything
The client's engineering team was two people maintaining a legacy PHP monolith. There was zero appetite for a new backend service, a new database, or any infrastructure they'd have to own. The deliverable had to be a single script tag — drop it into the site, it works. That constraint killed several ideas immediately: no server-side session storage, no custom authentication flow, no webhooks the client would need to handle. Everything had to live in a system we controlled.
The Stack
- Next.js API routes for the widget backend — serverless, no infra to maintain
- Supabase for conversation history and escalation state (pgvector for embeddings)
- OpenAI text-embedding-3-small for document vectorization
- Claude (claude-sonnet-4-6) as the reasoning model — better instruction-following than GPT-4o on our eval set
- Vanilla JS widget bundle (~18KB gzipped) — no React in the client embed
- Cloudflare for edge caching of the widget script itself
Knowledge Ingestion: The Part Nobody Talks About
The AI is only as good as what it knows. We built a lightweight ingestion pipeline: the client pastes a list of URLs (their docs site, FAQ pages, product pages) and we crawl, chunk, and embed them. After testing, we settled on 512-token chunks with 10% overlap — smaller chunks made retrieval more precise, but below that size the retrieved context was often too incomplete to answer from. We also added a manual override layer: a simple CMS where the client can write 'golden answers' to their top 20 questions. These get pinned to the top of the retrieval results regardless of vector similarity score.
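The chunking step is mechanical enough to sketch. Here a plain token array stands in for real model tokens (the production pipeline counts tokenizer tokens, not words), and the constants mirror the 512-token / 10%-overlap settings above:

```typescript
const CHUNK_SIZE = 512;                        // tokens per chunk
const OVERLAP = Math.floor(CHUNK_SIZE * 0.1);  // 10% overlap between neighbors

// Split a token array into fixed-size chunks where each chunk repeats the
// last OVERLAP tokens of the previous one, so no sentence is cut off cold
// at a chunk boundary. Illustrative sketch, not the production crawler.
function chunkTokens(tokens: string[]): string[][] {
  const chunks: string[][] = [];
  const step = CHUNK_SIZE - OVERLAP; // advance less than a full chunk
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + CHUNK_SIZE));
    if (start + CHUNK_SIZE >= tokens.length) break; // final chunk emitted
  }
  return chunks;
}
```

Each chunk then gets embedded with text-embedding-3-small and stored in the pgvector column alongside its source URL.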
“RAG pipelines fail silently. The model confidently answers from slightly wrong context and the user gets plausible-sounding misinformation. The golden answers layer was the most important thing we built.”
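The pinning itself is a few lines. In this sketch the `golden` flag and the chunk shape are illustrative, not the production schema — the point is only that a golden answer outranks any vector hit:

```typescript
interface RetrievedChunk {
  text: string;
  score: number;    // cosine similarity from the vector search
  golden?: boolean; // true if this came from the client's golden-answer CMS
}

// Sort golden answers ahead of everything, then fall back to similarity.
function rankChunks(chunks: RetrievedChunk[], topK = 3): RetrievedChunk[] {
  return [...chunks]
    .sort((a, b) => {
      if (!!a.golden !== !!b.golden) return a.golden ? -1 : 1; // pin golden first
      return b.score - a.score;                                // else by similarity
    })
    .slice(0, topK);
}
```

A golden answer with a weak similarity score still wins, which is exactly the behavior you want when the vector index has confidently retrieved the wrong page.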
The Escalation Logic
This was the hard part. 'Smart escalation' sounds simple — if the bot can't answer, hand off to a human. In practice: how do you know when the bot can't answer? A low retrieval similarity score is one signal, but not enough. A user expressing frustration is another. A question touching billing, cancellations, or anything with legal exposure should always escalate regardless of confidence.
We ended up with a three-signal system: (1) retrieval confidence below threshold, (2) a secondary LLM classifier call that checks if the user message matches a blocklist of high-stakes topics, and (3) sentiment analysis on the last 3 messages. If two of three signals fire, we escalate. On escalation, the widget creates a support ticket pre-populated with the full conversation transcript and the user's contact info — zero re-typing for the human agent.
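The decision itself reduces to a small vote. This sketch combines the two-of-three rule with the hard override for high-stakes topics described above; the threshold value and field names are illustrative, not the tuned production values:

```typescript
interface EscalationSignals {
  retrievalConfidence: number; // best similarity among retrieved chunks, 0..1
  highStakesTopic: boolean;    // classifier flagged billing/cancellation/legal
  negativeSentiment: boolean;  // sentiment check over the last 3 messages
}

const CONFIDENCE_THRESHOLD = 0.75; // hypothetical cutoff

function shouldEscalate(s: EscalationSignals): boolean {
  // High-stakes topics always go to a human, regardless of the vote.
  if (s.highStakesTopic) return true;
  const votes =
    (s.retrievalConfidence < CONFIDENCE_THRESHOLD ? 1 : 0) +
    (s.highStakesTopic ? 1 : 0) +
    (s.negativeSentiment ? 1 : 0);
  return votes >= 2; // two of three signals fire
}
```

Keeping the signals independent made tuning easy: each one could be adjusted (or logged and inspected) without touching the others.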
What Broke First
Token costs. In testing, we were streaming full page content into context without thinking. A single conversation was hitting 8K tokens of context within four turns. At production volume — 200 tickets/day — that was a real budget problem. The fix: strict context trimming that keeps the 3 most recent exchanges plus the top-3 retrieved chunks, never more. We also added response caching for common questions at the Redis layer. After tuning, cost per conversation landed at $0.003 — under the client's budget target.
The Numbers After 6 Weeks
- 73% of tickets fully resolved by the AI — no human needed
- Average first-response time dropped from 4 hours to under 3 seconds
- Human support tickets down from 200/day to 54/day
- CSAT score held at 4.2/5 — the same as the previous human-only baseline
The widget is now one of our live demos at widget.easydevs.xyz — fully interactive, no signup. If you want to see the escalation flow, try asking it something it can't answer. The handoff is seamless.