What is content architecture for agentic products?

Content architecture for agentic products is the discipline of defining how an AI agent should behave, then holding it to that standard with measurable evaluation in production. It treats content as a system — not just copy — covering behavioral doctrine, interaction philosophy, voice and behavior specifications, and scorable quality dimensions that can be evaluated at scale.

What is Project Voyager at Zillow?

Project Voyager is Zillow's first agentic search experience, built and evaluated as a content system. Nicole LeBlanc's team owned the user-facing content layer — suggested prompts, the Omnibar launch, and multi-surface scaling — and created the governing behavioral system: an AI Behavioral Guidance doctrine and the Voice & Behavior Spec. Nicole personally reviewed 15,632 Amplitude sessions to build the prompt-change backlog. Her team achieved +340% velocity during embedded search work.

What is the Voice and Behavior Spec?

The Voice & Behavior Spec (V&B Spec) is a working group document Nicole LeBlanc stood up and ran to a VP review at Zillow. Its purpose is to define correct AI agent behavior and provide a repeatable way to evaluate it — specifically resolving conflicts where guardrails compete, such as 'answer in the first sentence' versus 'hedge when you can't cite.' It grew into a cross-cutting UX Quality Baseline.

What is the UX Quality Baseline?

The UX Quality Baseline is a cross-cutting quality framework with four scorable dimensions: Answer Clarity & Directness, Information Usefulness, Context Sensitivity, and Tone & Language. Each dimension becomes a reusable test set, turning a written spec into a benchmark the product is held to. It became Zillow's first universal behavioral guardrails.

What is the Content Design Hub at Zillow?

The Content Design Hub is an AI-governed internal platform Nicole LeBlanc built at Zillow using Replit and the Writer API. It includes Content Studio (AI-assisted authoring) and Content Quality Checkup (8 criteria with an 80% pass bar). It replaced manual triage for ad hoc content requests. Results: 60% of users saved 30+ minutes per task, 80% adoption of the Content Improver, 80% said collaboration felt faster and clearer, and $85K+ estimated annual savings.

Content Architecture for Agentic Products

§ 01 / Core — agentic work

Project Voyager

Zillow's first agentic search experience, built and evaluated as a content system.

Voyager content system · ownership

Core working team kept deliberately small

Accountable

Nicole L. (this is me)

Content Design · team lead

Driver

Angela G.

Content Design · my report

Contributor

Natasha K.

Model Design

Collaborator

Xan V.

AI Data Science

Collaborator

Ondrej L.

AI & Engineering

How it came together

From embedded search work to a system other teams design against

Mid-24 → Q3 25

Search Tiger Team. My team was embedded in fast-shipping search work (+340% velocity). We learned behavioral content patterns in production before there was a name for them.

Q3–Q4 25

Voyager front-end experience. My team of CDs owned the user-facing layer: suggested prompts, launching the Omnibar, and scaling Voyager across more surfaces and contexts. This became the core focus of my content designers in Core Zillow.

Q4 25

In parallel, Angela partnered with the VP of AI Product on explainability, fair-housing disclosures, and error states. That work was the first evidence that we needed a behavioral system.

Dec 2, 25

Published the AI Behavioral Guidance doctrine in the relaunched style guide: Charter, Interaction Philosophy, Product Character Model, Intent Map.

Feb 12, 26

Stood up the V&B Spec working group and ran it to a VP review. Its real job: resolving where guardrails compete — e.g. "answer in the first sentence" vs. "hedge when you can't cite."

Q2 26

Grew the spec into a cross-cutting UX Quality Baseline, four scorable dimensions that feed test sets. They became Zillow's first universal behavioral guardrails.

May 26

Went through 15,632 Amplitude sessions on my own, then wrote the prompt-change backlog the team is working through now.

Now · 2026

Turning everything the spec taught us on Voyager into a cross-cutting quality framework other teams design against. The manual triage we used to run for ad hoc requests is mostly behind us. Still very much in motion, not finished.

How I stood it up

I didn't wait for a mandate. Seeing the drift risk in an unstructured system, I audited early production outputs myself to find the real quality gaps, then stood up and ran the cross-functional Voice & Behavior Spec Working Group to define correct AI behavior and a repeatable way to evaluate it.

How I influenced without authority

None of the core technical partners reported to me, so I made our standards the path of least resistance. I published the doctrine where teams already worked: Glean-searchable and in the GitLab AI Marketplace. When fixes needed engineering priority I couldn't command, I partnered with Design leadership to make the case to the Voyager product team on one principle: in an agentic experience, the language layer is the core product logic. It earned a slot on the roadmap.

How I built the partnerships

I earned trust across Model Design, Data Science, and Engineering by doing the translation work between design intuition and machine evaluation myself, and showing up with data instead of opinions: an "Evidence Spine" of curated, annotated model outputs. With a clear split (Content Design defines the behavior, Model Design owns execution), we built a tight loop to solve the highest-stakes calls: compliance disclosures, uncertainty, and financial-stress markers.

The problem · 00

What Voyager was actually saying

Before any framework, this is what showed up in real sessions. The model was capable, but in the moments that mattered most it would contradict itself, bury the answer, or land on something discouraging. These are the failures the language layer exists to catch.

Mine · I went through these sessions for one thing: which failures were language problems we could fix in the spec, and which were model problems we couldn't.

Agent says it can help, reverses itself, then re-offers the queries it just declined — quality gap · contradiction

Agent buries the answer instead of leading with the takeaway — quality gap · buried answer

Zero-results reply uses discouraging, passive-aggressive language — quality gap · discouraging tone

Three real sessions, each a different way the agent lost the person it was trying to help.

↓ so we built the layer

The response · 01

The language & behavior layer

Once we knew what was breaking, the answer wasn't more rules. It was a shared layer that tells the agent how to behave: a behavioral doctrine, a spec, and response standards that hold one voice across every surface and failure state.

Mine · I'm accountable for this layer. I set the operating model we work from, where Content Design defines the behavior and the reasoning, and Model Design makes it work in the system.

My team · Angela, my principal CD, drives the spec day to day. Natasha on Model Design turns it into the system prompts the agent actually runs on.

How I arrived at the doctrine: I started from the failures, not a blank page. I read where the agent broke down, grouped the breakdowns by the kind of decision it kept getting wrong, and each group became a part of the guidance. Charter answers why the layer exists, Interaction Philosophy how the agent engages, the Product Character Model who it is, and the Intent Map what the person needs in the moment.

AI Behavioral Guidance · doctrine I authored

Governs humans · agents · failure states

"Without this layer, decision rights live in people's heads. With it, they live in artifacts that scale."

Charter

Why the layer exists: language drives behavior when certainty is low, so it makes the calls explicit (explore vs. decide, clarify vs. direct, confidence vs. pressure).

Example: On BuyAbility, language turns affordability math into confidence, without implying approval or inevitability.

Interaction Philosophy

How the agent engages: when it asserts vs. asks, how it expresses uncertainty, and where its authority ends.

Example: Leads with a recommendation when intent is clear; asks one clarifying question only when the answer would change the result.

Product Character Model

Who the agent is: five character attributes, each paired with explicit anti-patterns.

Example: Attribute: a knowledgeable local, never a salesperson. Anti-pattern: manufacturing urgency ("act now — it won't last").

Emotional & Cognitive Intent Map

What the user is feeling and needing in the moment, mapped to the behavior that fits it.

Example: In affordability stress, the user needs calm and options, so the agent slows down and reframes tradeoffs, not upsells.

Without it, teams answer these locally and in conflict — the thing that fragments a multi-product org.

Standard 1 · lead with what matters

Response Standards v1

✗ Before · guidance buried

"Both homes have strong advantages. Home A offers more space and a larger yard, while Home B is closer to downtown and has a shorter commute. Depending on your priorities, either could be a good fit."

✓ After · decision-first

"If a shorter commute matters most, Home B is the stronger fit — but if you're prioritizing space and outdoor area, Home A is the better option. Home A offers a larger yard and more square footage, while Home B is closer to downtown and easier for daily travel."

Prompt library · UI copy → query sent to Voyager

Moira O. · senior CD, my team

UI copy (≤40 ch)	Query sent to Voyager
What stands out?	What makes this home stand out?
How does the price compare?	Is the asking price comparable to similar homes?
Tradeoffs to consider	What tradeoffs should I consider when looking at this home?
Can I afford to buy?	Can I afford to buy this home right now?
How accurate is the Zestimate?	How confident are you in the value of the Zestimate for this property?

20+ static prompts shipped 2/17/2026. The UI label and the underlying query differ by design — the label is tappable triage; the query carries context. Produced by a senior CD on my team.
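
The label-to-query split can be sketched as a small data structure. This is a minimal Python illustration, not the shipped implementation: the `SuggestedPrompt` class, the `PROMPT_LIBRARY` list, and the `query_for` helper are hypothetical names, and the only rule enforced is the 40-character label limit taken from the table header.

```python
from dataclasses import dataclass

# Limit taken from the "UI copy (≤40 ch)" column header above.
MAX_LABEL_CHARS = 40


@dataclass(frozen=True)
class SuggestedPrompt:
    """Hypothetical record pairing tappable UI copy with the fuller query."""

    label: str  # short UI copy the person taps
    query: str  # context-carrying query sent to the agent

    def __post_init__(self) -> None:
        if len(self.label) > MAX_LABEL_CHARS:
            raise ValueError(
                f"label exceeds {MAX_LABEL_CHARS} chars: {self.label!r}"
            )


# The five rows shown in the table above.
PROMPT_LIBRARY = [
    SuggestedPrompt("What stands out?", "What makes this home stand out?"),
    SuggestedPrompt(
        "How does the price compare?",
        "Is the asking price comparable to similar homes?",
    ),
    SuggestedPrompt(
        "Tradeoffs to consider",
        "What tradeoffs should I consider when looking at this home?",
    ),
    SuggestedPrompt(
        "Can I afford to buy?", "Can I afford to buy this home right now?"
    ),
    SuggestedPrompt(
        "How accurate is the Zestimate?",
        "How confident are you in the value of the Zestimate for this property?",
    ),
]


def query_for(label: str) -> str:
    """Return the query behind a tapped label; raises KeyError if unknown."""
    lookup = {p.label: p.query for p in PROMPT_LIBRARY}
    return lookup[label]
```

Validating the label length at construction time means an over-long label fails when the library is edited, not when a person taps it.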

Search patterns content system

Angela G. · drives this

How the agent narrates a mover's search — separating signal strength so it reads as accurate and collaborative, never assumptive.

What we know

Confirmed patterns: strong, repeated signals. Written with confidence.

What we think

Emerging signals: soft language ("may," "recently"), no conclusions.

What we ask

Missing, high-impact input: framed as helpful, never extractive.

✓ Good · grounded in behavior

"You consistently prioritize homes with parking and updated interiors."

✗ Bad · unverified inference

"You love homes with character and unique charm."

Angela's system, in draft with Model Design, Data Science, and Engineering — the kind of in-flight work I support rather than author.
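
A rough Python sketch of the three-tier idea follows. It is illustrative only: the source does not say how signal strength is measured, so the observation-count thresholds, the `classify_signal` and `narrate` functions, and the "think" and "ask" sentence templates are all assumptions. Only the confident template echoes the "Good" example above.

```python
from enum import Enum


class SignalTier(Enum):
    KNOW = "what we know"
    THINK = "what we think"
    ASK = "what we ask"


# Hypothetical thresholds: the source does not define "strong, repeated".
CONFIRMED_MIN_OBSERVATIONS = 5
EMERGING_MIN_OBSERVATIONS = 1


def classify_signal(observations: int) -> SignalTier:
    """Map how often a preference was observed to a narration tier."""
    if observations >= CONFIRMED_MIN_OBSERVATIONS:
        return SignalTier.KNOW
    if observations >= EMERGING_MIN_OBSERVATIONS:
        return SignalTier.THINK
    return SignalTier.ASK


def narrate(feature: str, observations: int) -> str:
    """Pick wording whose confidence matches the strength of the signal."""
    tier = classify_signal(observations)
    if tier is SignalTier.KNOW:
        # Confident wording, grounded in repeated behavior.
        return f"You consistently prioritize homes with {feature}."
    if tier is SignalTier.THINK:
        # Soft wording, no conclusion drawn.
        return f"You may be interested in homes with {feature}."
    # Nothing observed: ask instead of inferring.
    return f"Is {feature} important to you?"
```

Keeping the tier decision separate from the wording makes it possible to test the classification on its own and to change the copy without touching the thresholds.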

↓ ships to production

The feedback loop · 02

How the loop runs

The language layer shipped, and then I could watch it work. I used Amplitude to evaluate real Voyager sessions, turned what I found into a prioritized prompt-change summary, and handed it to the cross-cutting quality eval team, who turn each change into a scored eval: real sessions with the correct behavior defined, so every future release is graded on it. I've since taught that Amplitude process to Angela and the team, so the loop runs without me.

Shared · The Amplitude instrumentation and the eval team's test harness are shared. I didn't build the pipeline. I read what it was telling us and fed it back in.

Mine · I built the evaluation process and taught it to Angela and the team, so the loop runs without me.

514K+

unique users · Omnibar

19.3%

had 4+ conversation turns

0.94/1.0

avg response quality

1.81/2.0

task completion

01 · Evaluate

Amplitude session review

15,632 real Voyager sessions, read and scored for where the agent lost the user.

→

02 · Synthesize

Prompt-change summary

Failures grouped into 15 specific changes, ranked P0–P2.

→

03 · Evaluate

Cross-cutting quality eval team

Angela (Content Design) and Natasha (Model Design) sit on it. Each change becomes a scored eval, real sessions paired with the correct behavior, that every release is graded against.

Evaluated changes feed back into the model — that's the loop

The artifact at the center of it: fifteen specific changes pulled from those sessions, ranked P0 to P2, precise enough for the eval team to turn straight into scored evals.

Master prompt-change summary · P0–P2

Language & Prompt Recs

#	Prompt change	Addresses	Pri
1	Pre-tool acknowledgment for address inputs	Silent failures	P0
2	Fallback message on null / error tool result	Silent failures	P0
3	Address intent classifier → route to lookup	Silent failures	P0
4	Forced-choice clarification (a / b / c + invite)	Clarification drop-off	P1
7	Frustration acknowledgment before content	Retry storms	P1
14	Response-completion guardrail (no mid-sentence cut)	Truncation	P1
9	Empathy-first structure for troubleshooting	Troubleshooting	P2
12	Listing-specific vs. general Q&A classifier	Question answering	P2
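
For handoff to an eval team, a backlog like this can be held as structured records rather than a document table. The Python sketch below is one possible shape, assuming hypothetical names (`PromptChange`, `BACKLOG`, `by_priority`). It contains only the eight rows shown above, not the full fifteen.

```python
from collections import defaultdict
from typing import NamedTuple


class PromptChange(NamedTuple):
    number: int
    change: str
    addresses: str
    priority: str  # "P0", "P1", or "P2"


# The eight rows visible in the table above.
BACKLOG = [
    PromptChange(1, "Pre-tool acknowledgment for address inputs", "Silent failures", "P0"),
    PromptChange(2, "Fallback message on null / error tool result", "Silent failures", "P0"),
    PromptChange(3, "Address intent classifier → route to lookup", "Silent failures", "P0"),
    PromptChange(4, "Forced-choice clarification (a / b / c + invite)", "Clarification drop-off", "P1"),
    PromptChange(7, "Frustration acknowledgment before content", "Retry storms", "P1"),
    PromptChange(14, "Response-completion guardrail (no mid-sentence cut)", "Truncation", "P1"),
    PromptChange(9, "Empathy-first structure for troubleshooting", "Troubleshooting", "P2"),
    PromptChange(12, "Listing-specific vs. general Q&A classifier", "Question answering", "P2"),
]


def by_priority(changes: list[PromptChange]) -> dict[str, list[int]]:
    """Group change numbers by priority, highest priority first."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for c in sorted(changes, key=lambda c: (c.priority, c.number)):
        grouped[c.priority].append(c.number)
    return dict(grouped)
```

Calling `by_priority(BACKLOG)` groups the change numbers under "P0", "P1", and "P2" in that order, which is the order an eval team would likely work through them.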

↓ where we're taking it

What's next · 03

From Voyager to eval-led design

This is the chapter my team is writing now. We're taking what the spec taught us on Voyager and turning it into a cross-cutting quality framework, so every team designs against the same scorable definition of good. To get it on the roadmap, I made the case in leadership pitch decks that tied each quality gap to a business risk — then turned that buy-in into a dedicated eval workstream.

Mine · My team is embedded in the cross-cutting eval team, shepherding the four quality dimensions at its center: clarity, usefulness, context sensitivity, and tone. Each becomes a reusable test set, turning a written spec into something you can measure against.

[Figure: UX quality baseline · cross-cutting framework. Answer clarity & directness, information usefulness, context sensitivity, and tone & language, each becoming a test set.]

The four dimensions we're standardizing across teams. Each one becomes a test set, which is the bridge from a spec people read to a benchmark the product is held to.
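
To make "a spec becomes a test set" concrete, here is a minimal Python sketch of scoring a release against the four dimensions. Everything beyond the dimension names is an assumption: the 0.0 to 1.0 scoring scale, the 0.8 pass bar, and the `EvalCase` and `score_release` names are hypothetical and do not describe the eval team's actual harness.

```python
from dataclasses import dataclass

# The four dimensions named in the UX Quality Baseline.
DIMENSIONS = (
    "answer_clarity_directness",
    "information_usefulness",
    "context_sensitivity",
    "tone_language",
)


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical test-set entry: a real session plus the correct behavior."""

    session_id: str
    user_message: str
    expected_behavior: str


def score_release(
    scored: list[tuple[EvalCase, dict[str, float]]],
    pass_bar: float = 0.8,  # assumed threshold, not from the source
) -> dict[str, dict[str, object]]:
    """Average each dimension over all cases and flag whether it meets the bar."""
    if not scored:
        raise ValueError("at least one scored case is required")
    report: dict[str, dict[str, object]] = {}
    for dim in DIMENSIONS:
        values = [scores[dim] for _case, scores in scored]
        mean = sum(values) / len(values)
        report[dim] = {"mean": round(mean, 2), "passes": mean >= pass_bar}
    return report
```

Reporting each dimension separately, instead of one blended score, keeps a tone regression from being hidden by strong clarity scores.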

How I got it funded: a written spec doesn't move a roadmap. I built the case as pitch decks that tied each quality gap to a business risk — buried answers and discouraging tone read as trust and conversion exposure — and walked them through leadership. That's what turned "good content" into prioritized, scorable work: the four dimensions landed on the roadmap with a dedicated eval workstream behind them.

[Figure: pitch deck · the four dimensions. Pitch slide showing four foundational UX quality dimensions, each with a today vs. target example.]

[Figure: pitch deck · leading with clarity. Pitch slide on leading with clarity as a core conversation design principle, with supporting sources.]

Two slides from the pitch. Left: the four dimensions framed as today-vs-target. Right: grounding the first dimension in established conversation-design research so the bar reads as principled, not preference.

A trustworthy agent is governed.
