I lead a team that treats content as a system you can evaluate: we define how an agent should behave, then hold it to that standard with real scores, in production, while it ships. I stood this up for Voyager, Zillow's first agentic search experience, without a mandate and without authority over the partners. What I stood up is now a governance system other teams design against.
That language layer is what lets a non-expert operate an expert tool: a homebuyer who shouldn't have to learn the MLS, an engineer who shouldn't have to learn the CAD stack. The domain changes; the governance problem doesn't.
I didn't wait for a mandate. Seeing the drift risk in an unstructured system, I audited early production outputs myself to find the real quality gaps, then stood up and ran the cross-functional Voice & Behavior Spec Working Group to define correct AI behavior and a repeatable way to evaluate it.
None of the core technical partners reported to me, so I made our standards the path of least resistance. I published the doctrine where teams already worked: searchable in Glean and listed in the GitLab AI Marketplace. When fixes needed engineering priority I couldn't command, I partnered with Design leadership to make the case to the Voyager product team on one principle: in an agentic experience, the language layer is the core product logic. It earned a slot on the roadmap.
I earned trust across Model Design, Data Science, and Engineering by doing the translation work between design intuition and machine evaluation myself, and showing up with data instead of opinions: an "Evidence Spine" of curated, annotated model outputs. With a clear split (Content Design defines the behavior, Model Design owns execution), we built a tight loop to solve the highest-stakes calls: compliance disclosures, uncertainty, and financial-stress markers.
Before any framework, this is what showed up in real sessions. The model was capable, but in the moments that mattered most it would contradict itself, bury the answer, or land on something discouraging. These are the failures the language layer exists to catch.
Once we knew what was breaking, the answer wasn't more rules. It was a shared layer that tells the agent how to behave: a behavioral doctrine, a spec, and response standards that hold one voice across every surface and failure state.
How I arrived at the doctrine: I started from the failures, not a blank page. I read where the agent broke down, grouped the breakdowns by the kind of decision it kept getting wrong, and each group became a part of the guidance. Charter answers why the layer exists, Interaction Philosophy how the agent engages, the Product Character Model who it is, and the Intent Map what the person needs in the moment.
| UI copy (≤40 ch) | Query sent to Voyager |
|---|---|
| What stands out? | What makes this home stand out? |
| How does the price compare? | Is the asking price comparable to similar homes? |
| Tradeoffs to consider | What tradeoffs should I consider when looking at this home? |
| Can I afford to buy? | Can I afford to buy this home right now? |
| How accurate is the Zestimate? | How confident are you in the value of the Zestimate for this property? |
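
One way to keep a mapping like the one above honest is to check it mechanically. The sketch below is illustrative only: the pairs are copied from the table, but the names `CHIPS`, `over_limit`, and `expand` are hypothetical and are not part of Voyager's codebase.

```python
# Short UI labels mapped to the fuller query sent to the agent.
# Pairs are copied from the table above; all names here are hypothetical.
CHIPS = {
    "What stands out?": "What makes this home stand out?",
    "How does the price compare?": "Is the asking price comparable to similar homes?",
    "Tradeoffs to consider": "What tradeoffs should I consider when looking at this home?",
    "Can I afford to buy?": "Can I afford to buy this home right now?",
    "How accurate is the Zestimate?": (
        "How confident are you in the value of the Zestimate for this property?"
    ),
}

MAX_LABEL_CHARS = 40  # the "≤40 ch" limit from the table header


def over_limit(chips, limit=MAX_LABEL_CHARS):
    """Return every UI label that is longer than the character limit."""
    return [label for label in chips if len(label) > limit]


def expand(label, chips=CHIPS):
    """Look up the full query for a UI label (raises KeyError if unknown)."""
    return chips[label]


if __name__ == "__main__":
    print(over_limit(CHIPS))
    print(expand("Can I afford to buy?"))
```

A check like this could run wherever the copy is stored, so a label that grows past the limit is flagged before it ships.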
How the agent narrates a mover's search — separating signal strength so it reads as accurate and collaborative, never assumptive.
The language layer shipped, and then I could watch it work. I used Amplitude to evaluate real Voyager sessions, turned what I found into a prioritized prompt-change summary, and handed it to the cross-cutting quality eval team, who turn each change into a scored eval: real sessions with the correct behavior defined, so every future release is graded on it. I've since taught that Amplitude process to Angela and the team, so the loop runs without me.
The artifact at the center of it: fifteen specific changes pulled from those sessions, ranked P0 to P2, precise enough for the eval team to turn straight into scored evals.
| # | Prompt change | Addresses | Pri |
|---|---|---|---|
| 1 | Pre-tool acknowledgment for address inputs | Silent failures | P0 |
| 2 | Fallback message on null / error tool result | Silent failures | P0 |
| 3 | Address intent classifier → route to lookup | Silent failures | P0 |
| 4 | Forced-choice clarification (a / b / c + invite) | Clarification drop-off | P1 |
| 7 | Frustration acknowledgment before content | Retry storms | P1 |
| 14 | Response-completion guardrail (no mid-sentence cut) | Truncation | P1 |
| 9 | Empathy-first structure for troubleshooting | Troubleshooting | P2 |
| 12 | Listing-specific vs. general Q&A classifier | Question answering | P2 |
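
For a handoff like this, the ranked list can also be carried as structured data. The following is a minimal sketch, assuming a simple record per change: the rows are the eight shown in the table above, and `PromptChange` and `by_priority` are hypothetical names, not the eval team's actual format.

```python
from collections import defaultdict
from typing import NamedTuple


class PromptChange(NamedTuple):
    """One row of the prompt-change summary (hypothetical record shape)."""
    number: int
    change: str
    addresses: str
    priority: str  # "P0", "P1", or "P2"


# The rows shown in the table above.
CHANGES = [
    PromptChange(1, "Pre-tool acknowledgment for address inputs", "Silent failures", "P0"),
    PromptChange(2, "Fallback message on null / error tool result", "Silent failures", "P0"),
    PromptChange(3, "Address intent classifier -> route to lookup", "Silent failures", "P0"),
    PromptChange(4, "Forced-choice clarification (a / b / c + invite)", "Clarification drop-off", "P1"),
    PromptChange(7, "Frustration acknowledgment before content", "Retry storms", "P1"),
    PromptChange(14, "Response-completion guardrail (no mid-sentence cut)", "Truncation", "P1"),
    PromptChange(9, "Empathy-first structure for troubleshooting", "Troubleshooting", "P2"),
    PromptChange(12, "Listing-specific vs. general Q&A classifier", "Question answering", "P2"),
]


def by_priority(changes):
    """Group change numbers by priority, with priorities in ascending order."""
    groups = defaultdict(list)
    for c in changes:
        groups[c.priority].append(c.number)
    return {p: groups[p] for p in sorted(groups)}


if __name__ == "__main__":
    print(by_priority(CHANGES))
```

Grouping by priority makes it easy to see which failure modes, here the silent failures at P0, should become scored evals first.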
This is the chapter my team is writing now. We're taking what the spec taught us on Voyager and turning it into a cross-cutting quality framework, so every team designs against the same scorable definition of good. To get it on the roadmap, I made the case in leadership pitch decks that tied each quality gap to a business risk — then turned that buy-in into a dedicated eval workstream.
How I got it funded: a written spec doesn't move a roadmap. I built the case as pitch decks that tied each quality gap to a business risk — buried answers and discouraging tone read as trust and conversion exposure — and walked them through leadership. That's what turned "good content" into prioritized, scorable work: the four dimensions landed on the roadmap with a dedicated eval workstream behind them.
Born in the Q1 2026 Rentals metrics workstream. I co-lead the pilot with Mark P., senior director of design for Rentals. The premise: make quality observable instead of intuited.
PMs and execs get one number they can read at a glance — comparable across steps and legible to partners outside the design org.
Pilot v2, calibrating against a tenant management workflow. Phase 2 correlates rubric scores to real funnel data with Data Science.
Puts a dollar figure on UX friction: step value × drop-off %. The question funnel data can't answer — what is this friction actually costing us?
Diagnoses why a step leaks — which quality criteria fail there, scored from a screenshot.
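
The friction formula above (step value × drop-off %) is simple enough to sketch directly. The figures below are invented for illustration and are not pilot data. The function name `friction_cost` and the reading of "step value" as the dollar value attributed to a funnel step over some period are both assumptions.

```python
def friction_cost(step_value, drop_off_pct):
    """Estimate the dollar cost of friction at one funnel step.

    step_value   -- dollar value attributed to the step (assumed definition)
    drop_off_pct -- share of users lost at the step, as a percentage (0-100)
    """
    if not 0 <= drop_off_pct <= 100:
        raise ValueError("drop_off_pct must be between 0 and 100")
    return step_value * drop_off_pct / 100


if __name__ == "__main__":
    # Hypothetical numbers: a step valued at $50,000 with a 12% drop-off.
    print(friction_cost(50_000, 12))  # 6000.0
```

The arithmetic is trivial by design: the point of the metric is that the result is a single dollar figure a PM can compare across steps.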
It lives in Replit and runs on the Writer API. A draft goes into the Content Improver and comes back scored and rewritten for clarity, voice, and tone against our codified standards, inline, before it ever reaches a human reviewer.
At a Replit Showcase, our SVP of Engineering took an interest and wanted to integrate it deeper into Zillow's systems. It's now the official self-serve content infrastructure for the Rentals and MoverXP (Core Zillow) teams. Same move as Voyager: encode content judgment into a machine-usable layer, then put it where the work happens.
Judgment is what makes a product good, and judgment doesn't scale. The constraint here was never talent or tooling. It was governance: who owns a behavior, where the decision lives, and how it holds when no one is watching. The work that matters most isn't authoring the spec. It's building the team and the operating model that make these calls well at scale.
I've proven the model three times now, building the case project by project before pushing for content to be resourced as a system rather than a service. That felt like diligence, but it was slow. Next time, I'd make the case on day one for content to be evaluated as core product infrastructure, with the ownership and standards that implies, and let the projects be the evidence. Getting faster at that kind of foresight is the part of the job I'm focused on now.
How much judgment to encode, and how much to leave to people. The highest-stakes calls resist being fully specified: disclosure, deference, how an agent behaves under a user's financial stress. The answer feels as organizational as it is editorial: where this discipline sits, and who it answers to, as it grows from one team's practice into a company standard.
Content and behavioral quality as company infrastructure: a standard teams design against by default, with the evaluation, ownership, and leadership to keep raising the bar. I want to lead that function, not just author it. The next chapter is less about what I produce and more about the capability I leave behind.