Work

Case study

Defending a conversational AI

I was hired to keep a conversational AI from learning the wrong things from the internet, before anyone called that work AI safety.

The system was retrieval-based, not generative: it looked answers up rather than making them up, and a classifier layer decided what reached the user. That layer was mine.

The first problem was not the AI. It was that nobody could see what the AI was doing. I joined a four-person team with no dev environment and no way to tell good output from bad. So I built the instrumentation before the intervention: golden datasets, precision and recall tracking, a human-labeling pipeline, a live feedback loop from user behavior back into the classifiers. You cannot make a system safer until you can measure where it is going wrong.

Once we could measure, the second problem came into focus. The system was being attacked by coordinated groups who planned their attacks before executing them. Our defense was reactive: an incident happened, we wrote the post-mortem, the post-mortem became a new classifier. Good discipline. Always late.

So I proposed embedding analysts inside the communities where the attacks were being planned, to turn the planning itself into classifier training data. Prospective signal instead of post-hoc analysis. The internal resistance was real, and it was rational: "send our people into 4chan to watch the attackers work" is a sentence that has earned its skeptics.

When resistance is rational, the right response is a controlled test, not a better argument. I designed the pilot to answer the skeptics on their own terms: a treatment group with embedded analysts, a control group on standard reactive detection, with precision and recall on adversarial vectors measured before and after. The pilot showed measurable improvement. The program expanded.

The team that ran it did not look like an engineering org, on purpose. A monitoring program is only as good as its instruments, and ours were people: psychologists and political scientists, hired alongside the engineers and taught SQL. The attacks were planned by people, and a defense that only read strings could not see a plan forming.

Not every signal was an attack signal. The same system had to catch users in crisis, and my first self-harm classifier over-fired, burying the team in false alarms. The fix was not a louder alarm but a more careful one, built with crisis-support partners, that treated a person in distress as a person, not a classification event. Measurement done carelessly is just a different kind of noise.

Years later, the same system was the subject of peer-reviewed research at CHI 2024: a study of why two million people talked to it in the first place. What I keep is the move underneath, because it was the same move every time: output we could not grade, attacks we could not see coming, doubt we could not argue down. Three problems, one fix. Make it measurable first.

The role. Senior Product Manager, Conversational AI. Microsoft (AI + Research), Jul 2016 – Feb 2018.
The scale. 2M+ users.
The index. 110M+ conversation pairs, indexed through Bing.
The team. Nine analysts, later eleven.
The watch. 24/7 coverage, one-hour shift overlaps for live incident handoff.
The panic button. Built for all-out coordinated attacks. Triggered 2-3 times.
The classifiers the post-mortems produced. A leetspeak detector, a homonym detector, a grammatical-pattern classifier (pronoun + act of violence + named target).
The paper. "Learning from a Generative AI Predecessor: The Many Motivations for Interacting with Conversational Agents." arxiv.org/abs/2401.02978.

Case study

Enforcement people could trust

When I arrived at Mojang, the Minecraft Marketplace could not tell the difference between a bad creator and a good creator having a bad day, and it was punishing both.

The obvious fix was tougher enforcement. The actual fix was the opposite. I built Minecraft's first creator-enforcement system as a feedback loop rather than a hammer. One direction of the loop looked familiar: graded strikes, consequences a studio could predict, and a probationary tier that tried rehabilitation before anyone reached for offboarding. The other direction was the unusual part. When the violation data showed the same policy tripping honest studios over and over, the system treated that as evidence against the policy. I took the worst offenders, the rules themselves, to the Chief Creative Officer, and several were relaxed. The system caught its own bad rules.

The loop ran on data nobody had unified. Creator records lived in five systems that did not talk to each other, so I fused them into one scorecard and made it the operating basis for every enforcement decision we took. Then I pointed the same data outward: I put the scorecard in front of the studios. Every month I sat down with the highest-violation partners and benchmarked them against the platform average. No threats, no lectures. Most of them had no idea they were outliers. Once they could see it, they fixed it themselves.

A hammer would have been cheaper to build. It also would have manufactured adversaries, and a marketplace cannot afford to be at war with the people who stock its shelves.

Rejections fell fast, inside the first quarter. Partner confidence in the Marketplace's direction nearly quadrupled across three years. The exact figures are in the margin; the reason they moved together is the point. Enforcement creators trust is enforcement that admits when the rule, not the creator, is the problem.

The role. Director, Creator Partner Program. Mojang Studios (Microsoft), Sep 2022 – Dec 2025.
The ecosystem. A $260M creator economy; 300+ partner studios.
The framework. Weighted strikes across 100+ policy categories, tiered by severity.
The scorecard's plumbing. SQL, Kusto, and Power BI.
Rejections. Down 26% in the first 90 days.
Partner confidence in the Marketplace's direction. From 23% to 87% over three years.

Case study

The acquisition nobody wanted twice

The first time I proposed acquiring Smash.gg, an esports tournament platform, the answer was no. Leadership passed on cost, and at that price it was a defensible no. Most people let a killed proposal stay dead. I kept the file open. Keeping it open did not mean asking again every quarter until somebody tired of me. It meant watching the conditions instead of the calendar, so that the second ask could be specific on the day the conditions moved.

Then the pandemic shut down in-person esports, and the platform's valuation fell with the scene it served. The proposal I had been told no on was, suddenly, a different proposal. The hard part of reviving a dead deal is not the economics. It is the organizational memory of the no, which outlives the conditions that produced it. So I did not argue with the old decision. I brought a new one: the same thesis, updated numbers, a price the moment had changed. This time the answer was yes. We acquired it at a fraction of its earlier valuation, after the pandemic reset the market.

The yes was the smaller half. I drove the deal through its full machinery and then ran the integration as its operational lead. We moved from AWS to Azure, rebranded, and piped our data to Bing, MSN, and Windows. We leveraged machine vision tech from MSR and API connections with major games to pivot to online competition, because the thing it had been built for had just been closed indoors. The scope ran across four functions at once, through a year when every plan was provisional.

What I keep from it is not the discount. It is the shape of the patience. The asset never changed between the no and the yes; the timing did. Patience, in deals, is not the ability to wait. It is the discipline of keeping a dead thesis current, so that when the price finally agrees with you, you can move while everyone else is still re-reading the old memo.

The role. Principal Group Program Manager, Esports. Microsoft, Mar 2018 – Sep 2022.
The deal's machinery. Audit, negotiation, contracts, then integration.
Integration scope. 18 FTEs: engineering, product, marketing, operations.
The rebuild. AWS to Azure; a rebrand; a full platform rebuild.

Case study

The deal with no money in it

The team needed conversational data, more of it and fresher than what we had. The ordinary way to get data is to buy it, and I had no budget to buy it, nor any precedent inside the company for the kind of agreement I had in mind. So I did not begin with a contract. I began with a room.

I used my own network to reach one of Reddit's co-founders, and invited them to give a talk in Redmond. The room filled past capacity with engineers who used the site every day and wanted to meet the person who had helped build it. The talk was not the deal. It was the evidence that the two sides had something to say to each other. Afterward I sat the co-founder across a dinner table from the executive who ran Bing's search and AI work, and let the conversation find its own shape.

What it found was a trade in which no money needed to change hands. Reddit wanted reach, and Microsoft could put Reddit threads on the front page of Bing. Microsoft wanted conversational data, and Reddit had more of it, and fresher, than almost anyone. Each side already held the thing the other was missing, so the agreement that followed needed no money in it at all.

People assume a deal with no money in it is a small deal. This one opened a supply of conversational data that fed model training well past the program I ran. The reason it closed was not a number I put on the table. It was that I had found the single structure in which both sides were already paying each other, in the only currency either of them wanted. Money would have made it smaller, and slower, and worse.

The role. Senior Product Manager, Conversational AI. Microsoft, Jul 2016 – Feb 2018.
The structure. Revenue-neutral: Reddit threads on the front page of Bing, conversational data to Microsoft Research.
The breadth. One agreement fed the conversational AI, Bing, and Microsoft Research's LLM-precursor programs.
The close. Opened at the executive level with no budget and no precedent; business development and procurement brought it under contract.

Case study

Fixing the search nobody owned

Marketplace search had been broken for years, and it was owned by no one. The two facts are related. A system without an owner does not fail loudly; it fails politely, in the background, while every team assumes it is somebody else's.

This one matched exact strings and returned them in no meaningful order. Studios had learned to game it: flood the index with near-duplicate listings and you owned the results page. One studio ran more than 140 Skyblock variants. The front door of a nine-figure marketplace was a search box the community had complained about for years, and partners had learned to route around.

Nobody assigned me the problem. I pulled the search logs and the purchase data and wrote the case nobody had written: what broken search was costing in revenue and in creator trust, with a first sketch of what a real stack would look like. The document had two jobs and had to do both: convince the business that the cost was real, and convince engineering that the fix was tractable. Then I recruited the people who would actually know, senior Bing search engineers, including architects of the engine Mojang's search ran on, to advise the team that rebuilt it. I did not build the new search. My job was making sure it got built, by people better at building it than me.

The first win was the simplest one available: we turned spell correction on. Years of complaints started reversing within weeks, in office hours and in social sentiment. I will not put a number on the recovery, because I do not have one I would defend. What I have is a before and an after, and the after is a search box that returns what people meant.

Ownership is the cheapest infrastructure a system can have. The search did not need a genius. It needed someone to decide it was theirs.

The role. Director, Creator Partner Program. Mojang Studios (Microsoft), Sep 2022 – Dec 2025.
The before-state, precisely. Exact-match only; no spell correction; GUID-order ranking.
The sketch. Spell correction, query expansion, multi-query fusion, re-ranking.
The case. Written from search-log and purchase analysis; business and technical in one document.

Case study

The speller that read the books

Office's spell correction was an incumbent in the fullest sense: good enough that nobody thought about it, and failing exactly where nobody was measuring.

At Bing I built a contextual web speller, and it taught me the distinction I have carried into every evaluation system since: the aggregate and the tail are different animals. On aggregate accuracy, the new speller's improvement was respectable. On tail queries, the long, strange, human misspellings where existing systems consistently failed, it was not even close.

A respectable aggregate does not unseat an incumbent. So I built the case the way you build a case against a system everyone trusts: with experiments instead of opinions. The first controlled runs demonstrated the aggregate gain; then I kept iterating where the argument lived, on the tail, until the gap was too wide to explain away and too consistent to be luck. When a competing team produced numbers that said otherwise, I answered with a counter-analysis that exposed the flaws in their measurement.

The argument that ended the argument was a demo. I had the speller correct proper nouns from Game of Thrones, names that lived in no dictionary, and it got them right. It had not read the books. It had read the web, where everyone who had read the books could not stop talking about them. That is what contextual means, and the tail is where context earns its keep: the dictionary already covers the easy part.

The speller shipped into Microsoft Office, replacing the incumbent outright, which is how work built for a search box ended up under almost everything people type. The principle shipped with me: every evaluation I have designed since, for spellers, for classifiers, for a conversational AI, starts at the tail. The aggregate tells you the system is fine. The tail tells you who it is failing.

The role. Senior Program Manager, Bing Relevance & AI. Microsoft, Jan 2014 – Jul 2016.
The figures. 6% improvement in aggregate; 37%+ on tail queries.
The reach. Shipped into Microsoft Office: 100M+ users.