Content architecture for agentic products
Nicole LeBlanc — Senior Manager, Content Design
§ 00 / Thesis

A trustworthy agent is governed.

I lead a team that treats content as a system you can evaluate: defining how an agent should behave, then holding it to that standard with real scores, in production, while it ships. I stood this up for Voyager, Zillow's first agentic search experience, without a mandate and without authority over the partners. It is now a governance system other teams design against.

The language layer is what lets a non-expert operate an expert tool: a homebuyer who shouldn't have to learn the MLS, an engineer who shouldn't have to learn the CAD stack. The domain changes; the governance problem doesn't.

How I operate
The org I lead, and the partners I work across
lead
Nicole LeBlanc
Sr Manager, Content Design · Core Zillow + Rentals
Core Zillow
AI Language Layer
Handles AI product experiences
Principal · dips in
Senior · dips in
Product Comms + Emerging Experiences
Mid-Career
works with Product · Eng · Research · Brand
Agentic · horizontal lead
Agentic content, cross-org
standards · evaluation · governance
Principal · from AI Language Layer
works with Model Design · Brand · AI Science & Eng · Research · Evals
Rentals · standalone
B2B + Consumer
Senior
Senior
works with Product · Eng · Research · Brand · Partnerships
§ 01 / Core — agentic work

Project Voyager

Zillow's first agentic search experience, built and evaluated as a content system.
Voyager content system · ownership
Core working team kept deliberately small
Accountable
Nicole L. · this is me
Content Design · team lead
Driver
Angela G.
Content Design · my report
Contributor
Natasha K.
Model Design
Collaborator
Xan V.
AI Data Science
Collaborator
Ondrej L.
AI & Engineering
How it came together
From embedded search work to a system other teams design against
Mid-24 → Q3 25
Search Tiger Team. My team was embedded in fast-shipping search work (+340% velocity). We learned behavioral content patterns in production before there was a name for them.
Q3–Q4 25
Voyager front-end experience. My team of CDs owned the user-facing layer: suggested prompts, launching the Omnibar, and scaling Voyager across more surfaces and contexts. This became the core focus of my content designers in Core Zillow.
Q4 25
In parallel, Angela partnered with the VP of AI Product on explainability, fair-housing disclosures, and error states. That work was the first evidence that we needed a behavioral system.
Dec 2, 25
Published the AI Behavioral Guidance doctrine in the relaunched style guide: Charter, Interaction Philosophy, Product Character Model, Intent Map.
Feb 12, 26
Stood up the V&B Spec working group and ran it to a VP review. Its real job: resolving where guardrails compete — e.g. "answer in the first sentence" vs. "hedge when you can't cite."
Q2 26
Grew the spec into a cross-cutting UX Quality Baseline, four scorable dimensions that feed test sets. They became Zillow's first universal behavioral guardrails.
May 26
Went through 15,632 Amplitude sessions on my own, then wrote the prompt-change backlog the team is working through now.
Now · 2026
Turning everything the spec taught us on Voyager into a cross-cutting quality framework other teams design against. The manual triage we used to run for ad hoc requests is mostly behind us. Still very much in motion, not finished.
How I stood it up

I didn't wait for a mandate. Seeing the drift risk in an unstructured system, I audited early production outputs myself to find the real quality gaps, then stood up and ran the cross-functional Voice & Behavior Spec Working Group to define correct AI behavior and a repeatable way to evaluate it.

How I influenced without authority

None of the core technical partners reported to me, so I made our standards the path of least resistance. I published the doctrine where teams already worked: Glean-searchable and in the GitLab AI Marketplace. When fixes needed engineering priority I couldn't command, I partnered with Design leadership to make the case to the Voyager product team on one principle: in an agentic experience, the language layer is the core product logic. It earned a slot on the roadmap.

How I built the partnerships

I earned trust across Model Design, Data Science, and Engineering by doing the translation work between design intuition and machine evaluation myself, and showing up with data instead of opinions: an "Evidence Spine" of curated, annotated model outputs. With a clear split (Content Design defines the behavior, Model Design owns execution), we built a tight loop to solve the highest-stakes calls: compliance disclosures, uncertainty, and financial-stress markers.

The problem · 00

What Voyager was actually saying

Before any framework, this is what showed up in real sessions. The model was capable, but in the moments that mattered most it would contradict itself, bury the answer, or land on something discouraging. These are the failures the language layer exists to catch.

Mine: I went through these sessions for one thing: which failures were language problems we could fix in the spec, and which were model problems we couldn't.
quality gap · contradiction
Agent says it can help, reverses itself, then re-offers the queries it just declined
quality gap · buried answer
Agent buries the answer instead of leading with the takeaway
quality gap · discouraging tone
Zero-results reply uses discouraging, passive-aggressive language
Three real sessions, each a different way the agent lost the person it was trying to help.
so we built the layer
The response · 01

The language & behavior layer

Once we knew what was breaking, the answer wasn't more rules. It was a shared layer that tells the agent how to behave: a behavioral doctrine, a spec, and response standards that hold one voice across every surface and failure state.

Mine: I'm accountable for this layer. I set the operating model we work from, where Content Design defines the behavior and the reasoning, and Model Design makes it work in the system.
My team: Angela, my principal CD, drives the spec day to day. Natasha on Model Design turns it into the system prompts the agent actually runs on.

How I arrived at the doctrine: I started from the failures, not a blank page. I read where the agent broke down, grouped the breakdowns by the kind of decision it kept getting wrong, and each group became a part of the guidance. Charter answers why the layer exists, Interaction Philosophy how the agent engages, the Product Character Model who it is, and the Intent Map what the person needs in the moment.

AI Behavioral Guidance · doctrine I authored
Governs humans · agents · failure states
"Without this layer, decision rights live in people's heads. With it, they live in artifacts that scale."
Charter
Why the layer exists: language drives behavior when certainty is low, so it makes the calls explicit (explore vs. decide, clarify vs. direct, confidence vs. pressure).
Example: On BuyAbility, language turns affordability math into confidence, without implying approval or inevitability.
Interaction Philosophy
How the agent engages: when it asserts vs. asks, how it expresses uncertainty, and where its authority ends.
Example: Leads with a recommendation when intent is clear; asks one clarifying question only when the answer would change the result.
Product Character Model
Who the agent is: five character attributes, each paired with explicit anti-patterns.
Example: Attribute: a knowledgeable local, never a salesperson. Anti-pattern: manufacturing urgency ("act now, it won't last").
Emotional & Cognitive Intent Map
What the user is feeling and needing in the moment, mapped to the behavior that fits it.
Example: In affordability stress, the user needs calm and options, so the agent slows down and reframes tradeoffs, not upsells.
Without it, teams answer these locally and in conflict — the thing that fragments a multi-product org.
Standard 1: lead with what matters · Response Standards v1
Before · guidance buried
"Both homes have strong advantages. Home A offers more space and a larger yard, while Home B is closer to downtown and has a shorter commute. Depending on your priorities, either could be a good fit."
After · decision-first
"If a shorter commute matters most, Home B is the stronger fit — but if you're prioritizing space and outdoor area, Home A is the better option. Home A offers a larger yard and more square footage, while Home B is closer to downtown and easier for daily travel."
Prompt library: UI copy → query sent to Voyager
Moira O. · senior CD, my team
UI copy (≤40 ch) → Query sent to Voyager
What stands out? → What makes this home stand out?
How does the price compare? → Is the asking price comparable to similar homes?
Tradeoffs to consider → What tradeoffs should I consider when looking at this home?
Can I afford to buy? → Can I afford to buy this home right now?
How accurate is the Zestimate? → How confident are you in the value of the Zestimate for this property?
20+ static prompts shipped 2/17/2026. The UI label and the underlying query differ by design — the label is tappable triage; the query carries context. Produced by a senior CD on my team.
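The label-to-query split above can be pictured as a small lookup structure. The Python sketch below is only an illustration under assumptions: the names `SuggestedPrompt`, `PROMPTS`, `validate`, and `query_for` are hypothetical and do not come from the Voyager codebase. The only parts taken from this page are the five label and query pairs and the 40-character limit on UI copy.

```python
from dataclasses import dataclass

# Limit taken from the "UI copy (≤40 ch)" column header above.
MAX_LABEL_CHARS = 40


@dataclass(frozen=True)
class SuggestedPrompt:
    label: str  # short, tappable UI copy
    query: str  # fuller query that carries the context


# The five pairs listed above (the shipped library has 20+).
PROMPTS = [
    SuggestedPrompt("What stands out?", "What makes this home stand out?"),
    SuggestedPrompt("How does the price compare?",
                    "Is the asking price comparable to similar homes?"),
    SuggestedPrompt("Tradeoffs to consider",
                    "What tradeoffs should I consider when looking at this home?"),
    SuggestedPrompt("Can I afford to buy?",
                    "Can I afford to buy this home right now?"),
    SuggestedPrompt("How accurate is the Zestimate?",
                    "How confident are you in the value of the Zestimate for this property?"),
]


def validate(prompts):
    """Return every label that exceeds the UI length limit."""
    return [p.label for p in prompts if len(p.label) > MAX_LABEL_CHARS]


def query_for(label, prompts=PROMPTS):
    """Look up the query that a tapped label should send."""
    for p in prompts:
        if p.label == label:
            return p.query
    raise KeyError(label)


print(validate(PROMPTS))              # []
print(query_for("What stands out?"))  # What makes this home stand out?
```

Keeping the pair in one record means a label can be shortened for the UI without changing what the agent is asked.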
Search patterns content system
Angela G. · drives this

How the agent narrates a mover's search — separating signal strength so it reads as accurate and collaborative, never assumptive.

What we know: Confirmed patterns. Strong, repeated signals, written with confidence.
What we think: Emerging signals. Soft language ("may," "recently"), no conclusions.
What we ask: Missing, high-impact input. Framed as helpful, never extractive.
Good · grounded in behavior
"You consistently prioritize homes with parking and updated interiors."
Bad · unverified inference
"You love homes with character and unique charm."
Angela's system, in draft with Model Design, Data Science, and Engineering — the kind of in-flight work I support rather than author.
ships to production
The feedback loop · 02

How the loop runs

The language layer shipped, and then I could watch it work. I used Amplitude to evaluate real Voyager sessions, turned what I found into a prioritized prompt-change summary, and handed it to the cross-cutting quality eval team, who turn each change into a scored eval: real sessions with the correct behavior defined, so every future release is graded on it. I've since taught that Amplitude process to Angela and the team, so the loop runs without me.

Shared: The Amplitude instrumentation and the eval team's test harness are shared. I didn't build the pipeline. I read what it was telling us and fed it back in.
Mine: I built the evaluation process and taught it to Angela and the team, so the loop runs without me.
514K+
unique users · Omnibar
19.3%
had 4+ conversation turns
0.94/1.0
avg response quality
1.81/2.0
task completion
01 · Evaluate
Amplitude session review
15,632 real Voyager sessions, read and scored for where the agent lost the user.
02 · Synthesize
Prompt-change summary
Failures grouped into 15 specific changes, ranked P0–P2.
03 · Evaluate
Cross-cutting quality eval team
Angela (Content Design) and Natasha (Model Design) sit on it. Each change becomes a scored eval, real sessions paired with the correct behavior, that every release is graded against.
Evaluated changes feed back into the model — that's the loop

The artifact at the center of it: fifteen specific changes pulled from those sessions, ranked P0 to P2, precise enough for the eval team to turn straight into scored evals.

Master prompt-change summary · P0–P2 · Language & Prompt Recs
# · Prompt change · Addresses · Pri
1 · Pre-tool acknowledgment for address inputs · Silent failures · P0
2 · Fallback message on null / error tool result · Silent failures · P0
3 · Address intent classifier → route to lookup · Silent failures · P0
4 · Forced-choice clarification (a / b / c + invite) · Clarification drop-off · P1
7 · Frustration acknowledgment before content · Retry storms · P1
14 · Response-completion guardrail (no mid-sentence cut) · Truncation · P1
9 · Empathy-first structure for troubleshooting · Troubleshooting · P2
12 · Listing-specific vs. general Q&A classifier · Question answering · P2
where we're taking it
What's next · 03

From Voyager to eval-led design

This is the chapter my team is writing now. We're taking what the spec taught us on Voyager and turning it into a cross-cutting quality framework, so every team designs against the same scorable definition of good. To get it on the roadmap, I made the case in leadership pitch decks that tied each quality gap to a business risk — then turned that buy-in into a dedicated eval workstream.

Mine: My team is embedded in the cross-cutting eval team, shepherding the four quality dimensions at its center: clarity, usefulness, context sensitivity, and tone. Each becomes a reusable test set, turning a written spec into something you can measure against.
UX quality baseline · cross-cutting framework
UX quality baseline — answer clarity & directness, information usefulness, context sensitivity, tone & language — each becoming a test set
The four dimensions we're standardizing across teams. Each one becomes a test set, which is the bridge from a spec people read to a benchmark the product is held to.

How I got it funded: a written spec doesn't move a roadmap. I built the case as pitch decks that tied each quality gap to a business risk — buried answers and discouraging tone read as trust and conversion exposure — and walked them through leadership. That's what turned "good content" into prioritized, scorable work: the four dimensions landed on the roadmap with a dedicated eval workstream behind them.

pitch deck · the four dimensions
Pitch slide — four foundational UX quality dimensions, each with a today vs. target example
pitch deck · leading with clarity
Pitch slide — leading with clarity as a core conversation design principle, with supporting sources
Two slides from the pitch. Left: the four dimensions framed as today-vs-target. Right: grounding the first dimension in established conversation-design research so the bar reads as principled, not preference.
§ 02 / Transferable (not agentic) · Measurement

Experience Health Score

A shared language for quality — design intuition turned into a number the business can act on.
How it kicked off

Born in the Q1 2026 Rentals metrics workstream. I co-lead the pilot with Mark P., senior director of design for Rentals. The premise: make quality observable instead of intuited.

Why it matters

PMs and execs get one number they can read at a glance — comparable across steps and legible to partners outside the design org.

Where it is now

Pilot v2, calibrating against a tenant management workflow. Phase 2 correlates rubric scores to real funnel data with Data Science.

Critical
< 50
At risk
50–69
Caution
70–84
Healthy
85–100
Clarity & structure · 33 pts
Does this step help users move forward with confidence?
Readability & plain language · 5 · C
Information hierarchy · 4 · C
Labels & affordance language · 6 · C
Visual hierarchy · 4 · D
Cognitive load & density · 7 · D
Design-system adherence · 4 · D
Tone & voice · 3 · C
Conversion readiness · 33 pts
Does it behave like high-converting steps?
Interaction affordance clarity · 7 · D
CTA prominence & placement · 7 · D
Competing actions · 7 · C+D
Abandonment content patterns · 6 · C
Consistent navigation · 6 · D
Risk signals · 34 pts
Does it introduce hesitation or loss of trust?
Anxiety-inducing language · 7 · C
Reassurance & confirmation copy · 7 · C
Irreversibility communication · 6 · C
Error recoverability · 7 · D
Trust & anxiety signals (visual) · 7 · D
Met = full points · Partial = half · Not Met = 0 · N/A = excluded from max · C = Content · D = Design
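The scoring legend and the four bands can be worked through as a short calculation. The Python sketch below is a minimal illustration, not the pilot's actual implementation. It assumes the score is rescaled to 100 after N/A criteria are removed from the maximum, and that a fractional score falls into the band whose lower bound it has reached. The function names and the ratings in the example are hypothetical; the point weights and band edges come from the rubric above.

```python
from dataclasses import dataclass

# Multipliers from the legend: Met = full points, Partial = half, Not Met = 0.
# N/A is handled separately because it removes the criterion from the maximum.
MULTIPLIER = {"met": 1.0, "partial": 0.5, "not_met": 0.0}


@dataclass
class Criterion:
    name: str
    points: int  # weight from the rubric
    rating: str  # "met", "partial", "not_met", or "na"


def ehs_score(criteria):
    """Return a 0-100 score, or None when every criterion is N/A."""
    applicable = [c for c in criteria if c.rating != "na"]
    max_points = sum(c.points for c in applicable)
    if max_points == 0:
        return None
    earned = sum(c.points * MULTIPLIER[c.rating] for c in applicable)
    # Assumption: rescale to 100 so excluded criteria do not lower the score.
    return round(100 * earned / max_points, 1)


def ehs_band(score):
    """Map a score to the bands above (< 50, 50-69, 70-84, 85-100)."""
    if score < 50:
        return "Critical"
    if score < 70:
        return "At risk"
    if score < 85:
        return "Caution"
    return "Healthy"


# Illustrative ratings for the Risk signals bucket (weights are from the rubric).
example = [
    Criterion("Anxiety-inducing language", 7, "met"),
    Criterion("Reassurance & confirmation copy", 7, "partial"),
    Criterion("Irreversibility communication", 6, "not_met"),
    Criterion("Error recoverability", 7, "na"),
    Criterion("Trust & anxiety signals (visual)", 7, "met"),
]

score = ehs_score(example)
print(score, ehs_band(score))  # 64.8 At risk
```

In the example, the N/A criterion drops the maximum from 34 to 27, so 17.5 earned points score 64.8 rather than 51.5.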
How EHS & Price of Friction come together → revenue
Price of Friction — a model I coined

Puts a dollar figure on UX friction: step value × drop-off %. The question funnel data can't answer — what is this friction actually costing us?

Experience Health Score

Diagnoses why a step leaks — which quality criteria fail there, scored from a screenshot.

The loop: PoF locates where revenue is at risk; EHS explains why; a Critical EHS score auto-triggers a PoF calc, and high-PoF steps jump the EHS queue. Phase 2 correlates rubric scores to real funnel conversion with Data Science — connecting design quality → user outcomes → revenue.
Buckets, weights, and criteria are verbatim from the EHS rubric (Pilot v2).
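The Price of Friction formula and the loop between the two measures can be sketched in a few lines of Python. This is an illustration under assumptions, not the model as built: it treats drop-off as a fraction between 0 and 1, treats step value as a dollar amount, and uses hypothetical function names and made-up step figures. Only the formula (step value × drop-off %), the Critical threshold (below 50), and the rule that high-PoF steps jump the queue come from the text above.

```python
def price_of_friction(step_value, drop_off_rate):
    """Step value multiplied by drop-off, with drop-off as a fraction (0 to 1)."""
    if not 0.0 <= drop_off_rate <= 1.0:
        raise ValueError("drop_off_rate must be between 0 and 1")
    return step_value * drop_off_rate


def needs_pof_calc(ehs_score):
    """A Critical EHS score (below 50) triggers a PoF calculation."""
    return ehs_score < 50


def ehs_queue(steps):
    """Order steps so the highest Price of Friction is reviewed first."""
    return sorted(
        steps,
        key=lambda s: price_of_friction(s["step_value"], s["drop_off_rate"]),
        reverse=True,
    )


# Hypothetical steps and figures, for illustration only.
steps = [
    {"name": "application", "step_value": 120_000, "drop_off_rate": 0.25},
    {"name": "tour request", "step_value": 80_000, "drop_off_rate": 0.5},
]

print(price_of_friction(120_000, 0.25))      # 30000.0
print([s["name"] for s in ehs_queue(steps)])  # ['tour request', 'application']
print(needs_pof_calc(42))                     # True
```

The lower-value step leads the queue here because its drop-off is twice as high, which is the kind of ordering a funnel chart alone would not surface.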
§ 03 / Transferable (not agentic) · Leadership · tooling

The CD Hub

Content quality as a self-serve layer — piloted Q4 2025, now running across teams.
The same pattern, outside the agent: I turn content judgment into infrastructure people can use. Piloted in Q4 2025, the Hub now runs across teams as a functioning system — replacing the manual triage we used to do for ad hoc content requests, so partners self-serve quality before human review.
60%
of users saved 30+ min per task
80%
adoption of the Content Improver
80%
said collaboration felt faster & clearer
CD Hub · Content quality checkup
Content quality checkup — 8 criteria with an 80% pass bar
CD Hub · Content Studio
Content Studio — create or improve copy, auto-checked against standards
The Hub itself: a quality scorecard with an explicit 80% bar (8 criteria, 2 contextual and N/A-able) and a Create / Improve studio.
How it's built, and how it works

It lives in Replit and runs on the Writer API. A draft goes into the Content Improver and comes back scored and rewritten for clarity, voice, and tone against our codified standards, inline, before it ever reaches a human reviewer.

At a Replit Showcase, our SVP of Engineering took an interest and wanted to integrate it deeper into Zillow's systems. It's now the official self-serve content infrastructure for the Rentals and MoverXP (Core Zillow) teams. Same move as Voyager: encode content judgment into a machine-usable layer, then put it where the work happens.

Replit · Writer API · Core Zillow infrastructure
"After testing with a draft UX copy and running it through the improved version, I could get better copy super easily — and it was pretty easy to navigate."
— Senior Product Designer, Core Zillow
"It took me less than a minute to figure out where I could get that help. When you don't have dedicated content support for a project, it can really make the content quality go up."
— Senior Product Designer, Rentals
"Thanks for the tool! When you don't have content resources to provide strategic input — just UX copy for feature enhancements — great tool!"
— Product Manager, Integrated Experiences
§ 04 / Reflections
What this work taught me

Judgment is what makes a product good, and judgment doesn't scale. The constraint here was never talent or tooling. It was governance: who owns a behavior, where the decision lives, and how it holds when no one is watching. The work that matters most isn't authoring the spec. It's building the team and the operating model that make these calls well at scale.

What I'd do differently

I've proven the model three times now, building the case project by project before pushing for content to be resourced as a system rather than a service. That's felt like diligence, but it's slow. I'd make the case on day one for content evaluated as core product infrastructure, with the ownership and standards that implies, and let the projects be the evidence. Getting faster at that kind of foresight is the part of the job I'm focused on now.

What I'm still working out

How much judgment to encode, and how much to leave to people. The highest-stakes calls resist being fully specified: disclosure, deference, how an agent behaves under a user's financial stress. The answer feels as organizational as it is editorial: it depends on where this discipline sits, and who it answers to, as it grows from one team's practice into a company standard.

What I'm building toward

Content and behavioral quality as company infrastructure: a standard teams design against by default, with the evaluation, ownership, and leadership to keep raising the bar. I want to lead that function, not just author it. The next chapter is less about what I produce and more about the capability I leave behind.

Nicole LeBlanc — Content Design Leadership