Case study
Defending a conversational AI
I was hired to keep a conversational AI from learning the wrong things from the internet, before anyone called that work AI safety.
The system was retrieval-based, not generative: it looked answers up rather than making them up, and a classifier layer decided what reached the user. That layer was mine.
The first problem was not the AI. It was that nobody could see what the AI was doing. I joined a four-person team with no dev environment and no way to tell good output from bad. So I built the instrumentation before the intervention: golden datasets, precision and recall tracking, a human-labeling pipeline, a live feedback loop from user behavior back into the classifiers. You cannot make a system safer until you can measure where it is going wrong.
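A minimal sketch of what precision and recall tracking against a golden dataset can look like. Every name here (`Example`, `precision_recall`, the toy keyword classifier) is a hypothetical stand-in for illustration, not the production code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Example:
    """One golden-dataset row: a message plus its human-assigned label."""
    text: str
    is_harmful: bool


def precision_recall(
    golden: Iterable[Example],
    classify: Callable[[str], bool],
) -> Tuple[float, float]:
    """Score a classifier against human labels; returns (precision, recall)."""
    tp = fp = fn = 0
    for ex in golden:
        predicted = classify(ex.text)
        if predicted and ex.is_harmful:
            tp += 1  # flagged, and the labelers agreed
        elif predicted and not ex.is_harmful:
            fp += 1  # flagged, but the labelers said it was fine
        elif not predicted and ex.is_harmful:
            fn += 1  # missed something the labelers caught
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data and a toy classifier, both invented for the example.
    golden = [
        Example("this is bad", True),
        Example("not bad at all", False),
        Example("quietly harmful", True),
        Example("hello there", False),
    ]
    print(precision_recall(golden, lambda text: "bad" in text))
```

Rerunning the same golden set after every classifier change is what turns "it seems better" into a number that can be compared across releases.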
Once we could measure, the second problem came into focus. The system was being attacked by coordinated groups who planned their attacks before executing them. Our defense was reactive: an incident happened, we wrote the post-mortem, the post-mortem became a new classifier. Good discipline. Always late.
So I proposed embedding analysts inside the communities where the attacks were being planned, to turn the planning itself into classifier training data. Prospective signal instead of post-hoc analysis. The internal resistance was real, and it was rational: "send our people into 4chan to watch the attackers work" is a sentence that has earned its skeptics.
When resistance is rational, the right response is a controlled test, not a better argument. I designed the pilot to answer the skeptics on their own terms: a treatment group with embedded analysts, a control group on standard reactive detection, with precision and recall on adversarial vectors measured before and after. The pilot showed measurable improvement. The program expanded.
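The account above does not say how the two groups were compared. One common way to read a before-and-after design with a control group is a difference-in-differences. The sketch below assumes that approach, and the function names and figures in it are placeholders, not the pilot's results.

```python
def recall(caught: int, total: int) -> float:
    """Share of known adversarial messages that the classifiers caught."""
    return caught / total if total else 0.0


def difference_in_differences(
    treat_before: float,
    treat_after: float,
    control_before: float,
    control_after: float,
) -> float:
    """Change in the treatment group minus change in the control group.

    Subtracting the control group's change removes improvement that would
    have happened anyway (for example, attackers going quiet for a month).
    """
    return (treat_after - treat_before) - (control_after - control_before)


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    effect = difference_in_differences(
        treat_before=recall(50, 100),
        treat_after=recall(75, 100),
        control_before=recall(50, 100),
        control_after=recall(50, 100),
    )
    print(f"Estimated lift in recall attributable to the program: {effect:.2f}")
```

The same calculation applies to precision. A positive value means the treatment group improved by more than the control group did over the same period.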
The team that ran it did not look like an engineering org, on purpose. A monitoring program is only as good as its instruments, and ours were people: psychologists and political scientists, hired alongside the engineers and taught SQL. The attacks were planned by people, and a defense that only read strings could not see a plan forming.
Not every signal was an attack signal. The same system had to catch users in crisis, and my first self-harm classifier over-fired, burying the team in false alarms. The fix was not a louder alarm but a more careful one, built with crisis-support partners, that treated a person in distress as a person, not a classification event. Measurement done carelessly is just a different kind of noise.
Years later, the same system was the subject of peer-reviewed research at CHI 2024: a study of why two million people talked to it in the first place. What I keep is the move underneath, because it was the same move every time: output we could not grade, attacks we could not see coming, doubt we could not argue down. Three problems, one fix. Make it measurable first.
- The role. Senior Product Manager, Conversational AI. Microsoft (AI + Research), Jul 2016 – Feb 2018.
- The scale. 2M+ users.
- The index. 110M+ conversation pairs, indexed through Bing.
- The team. Nine analysts, later eleven.
- The watch. 24/7 coverage, one-hour shift overlaps for live incident handoff.
- The panic button. Built for all-out coordinated attacks. Triggered 2–3 times.
- The classifiers the post-mortems produced. A leetspeak detector, a homonym detector, a grammatical-pattern classifier (pronoun + act of violence + named target).
- The paper. "Learning from a Generative AI Predecessor: The Many Motivations for Interacting with Conversational Agents." arxiv.org/abs/2401.02978.
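
To give a sense of the kind of classifier the post-mortems produced, here is a minimal sketch of a leetspeak detector. It normalizes common character substitutions, then checks against a term list. The substitution table, the function names, and the example term are all assumptions made for illustration, and the actual detector may have worked quite differently.

```python
# Hypothetical substitution table. Real obfuscation is messier: "1" can
# stand for "i" or "l", and attackers invent new substitutions constantly.
LEET_MAP = str.maketrans({
    "0": "o",
    "1": "i",
    "3": "e",
    "4": "a",
    "5": "s",
    "7": "t",
    "@": "a",
    "$": "s",
})


def normalize_leet(text: str) -> str:
    """Lowercase the text and undo the substitutions in LEET_MAP."""
    return text.lower().translate(LEET_MAP)


def contains_blocked_term(text: str, blocked_terms: set) -> bool:
    """True if any blocked term appears in the text after normalization."""
    normalized = normalize_leet(text)
    return any(term in normalized for term in blocked_terms)


if __name__ == "__main__":
    blocked = {"attack"}  # placeholder term list
    print(contains_blocked_term("pl4n the 4TT4CK", blocked))
    print(contains_blocked_term("hello there", blocked))
```

A plain substring match like this will over-fire on innocent words that happen to contain a blocked term, which is why a detector of this kind needs to be scored against labeled data rather than trusted on its own.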