#221 Building an AI tutor with Google DeepMind, with Bibi Groot (Eedi’s Chief Impact Officer)


Episode details

In this episode of the Mr Barton Maths podcast, Craig sits down with Bibi Groot, behavioural scientist at Eedi, to unpack the rigorous research behind their ed-tech work. Bibi traces her journey from the UK’s Behavioural Insights Team — where she applied frameworks like EAST (Easy, Attractive, Social, Timely) to public policy — to completing a PhD at UCL, having twins, and becoming Eedi’s first behavioural scientist.

The conversation builds methodically from the fundamentals of randomised controlled trials (and why they’re so notoriously difficult to run well in schools), through the headline results of Eedi’s two-year, 20-school RCT showing that students using the platform gained the equivalent of two to four extra months of progress, before diving into the much-publicised Google DeepMind collaboration. That study, run with LearnLM and a human-in-the-loop safety net, found that an AI tutor matched a human tutor on immediate question success and actually outperformed humans on short-term transfer questions — likely because the AI was relentlessly Socratic, where time-pressured human tutors tended to short-circuit students’ metacognition.

Bibi closes by previewing Eedi’s much larger four-arm follow-up trial (running until July 2026), which tests whether deep student context beats strong pedagogy alone, plus new pilots bringing DQR and WhatsApp-delivered AI tutoring to learners in Guyana, India, and Sub-Saharan Africa.

Talking points

  1. Bibi’s background: From the Behavioural Insights Team (joined in 2014 as employee #30) to a PhD in Behavioural Public Policy at UCL, to becoming Eedi’s first behavioural scientist in 2022.
  2. The EAST framework: Easy, Attractive, Social, Timely — decades of behavioural science compressed into four design principles that apply equally to students and teachers.
  3. What an RCT actually is: Random allocation to control vs. treatment groups, why randomisation is the whole game, and why so few ed-tech companies bother running them (cost, time, ~2 years end-to-end).
  4. Why RCTs are hard in schools: Contamination within classes, ethical concerns about withholding treatment, large sample sizes needed when you randomise at school or year-group level, and waitlist designs as a workaround.
  5. Eedi’s 20-school RCT (2023–2025): 3,448 Year 7 students, randomised at school level, independently evaluated by Prof. Steve Higgins / What Works Education. Effect sizes grew over time — Cohen’s d of 0.17 at 12 months, 0.46 at 18 months, stabilising around 0.30 at 24 months (≈ 2–4 months of additional progress).
  6. Research integrity: The “file drawer problem,” pre-registration of analyses, the importance of independent evaluation, and using standardised assessments (NWEA, STAR Maths) so you can’t accidentally teach to the test.
  7. Implementation matters: Platform usage data combined with standardised outcomes reveals which teacher behaviours actually drive student gains — real-world field experiments beat lab conditions.
  8. “Kill your darlings”: The tension between adding cool features and stripping the platform back so it works for the median user, not just power users.
  9. The Google DeepMind collaboration — origins: Started with Simon bonding with someone at Google over folk music at a conference.
  10. The first DeepMind trial design: 165 students, three arms — static hint (control), human tutor interactive dialogue, and LearnLM AI tutor with human-in-the-loop safety review.
  11. Headline finding #1: Interactive dialogue smashed static hints — 90%+ retry success vs. 65% for static hints, because dialogue is harder to ignore than a paragraph.
  12. Headline finding #2 (the one that made the news): On short-term transfer questions, the AI tutor beat the human tutor (AI 66% vs. human 60% vs. static hint 56%). Hypothesis: the AI was rigorously Socratic, while time-pressured human tutors tended to skip the metacognitive step and just diagnose-and-fix.
  13. Safeguarding results: Out of 3,000+ conversations, zero safeguarding issues and only five maths hallucinations (mostly misreading diagram images). LLMs have got genuinely good at maths.
  14. Where the AI struggled: Reading social cues — it would keep pushing Socratically when students had clearly understood and wanted to move on.
  15. The fair comparison: AI tutors aren’t competing with the gold-standard 1:1 human tutor who knows a child intimately — they’re competing with the realistic scenario of one tired teacher trying to support 30 kids.
  16. The new four-arm trial (running until July 2026): 1,200 students, 10 schools, comparing static control, human tutor, pedagogy-only AI, and pedagogy + hyper-context AI — to isolate how much benefit comes from deep student context vs. strong pedagogy alone.
  17. Bias risks in AI feedback: Stanford research showing AI gave Black students softer writing feedback when given demographic data — why Eedi deliberately withholds gender, ethnicity, age, and name from its AI and feeds it learning history instead.
  18. Bibi’s predictions for the new trial: Static content last; human tutor and pedagogy-only AI roughly tied; pedagogy + hyper-context hopefully out in front. Results expected by start of academic year 2026.
  19. What’s next for Eedi globally: DQR pilots with the Inter-American Development Bank in Guyana (turning a 30-day pen-and-paper assessment cycle into a 5-minute one); AI tutors for severely low-literacy contexts in Latin America, India, and Sub-Saharan Africa; and WhatsApp-delivered AI tutoring sent to parents’ phones in India, with behavioural prompts designed (using David Yeager’s mentor mindset research) to get parents to hand over the phone in the first place.
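The effect sizes in point 5 above are reported as Cohen’s d, the standardised mean difference between treatment and control groups. As a reminder of what that statistic actually measures, here is a minimal sketch in plain Python — the scores are made up for illustration and have nothing to do with the trial’s data:

```python
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Cohen's d: the gap between two group means, expressed in units
    of the pooled standard deviation rather than raw score points."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    # Pooled SD weights each group's variance by its degrees of freedom.
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Illustrative assessment scores only -- not Eedi's data.
treatment = [72, 68, 75, 80, 71, 77, 69, 74]
control = [70, 65, 71, 73, 66, 72, 64, 69]
print(round(cohens_d(treatment, control), 2))  # prints 1.21
```

A d of 0.30, as in the trial, means the average treated student scored about a third of a standard deviation above the average control student; effect sizes like this are commonly translated into “months of additional progress” via conversion tables such as the EEF’s, which is where figures like the 2–4 months quoted above come from.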

Video:

Links from Bibi

  1. Follow Eedi on LinkedIn here.
  2. Self-Regulated AI Use Hinders Long-Term Learning: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5604932
  3. Generative AI without guardrails can harm learning: Evidence from high school mathematics: https://www.pnas.org/doi/10.1073/pnas.2422633122
  4. Marked Pedagogies: Examining Linguistic Biases in Personalized Automated Writing Feedback: https://dl.acm.org/doi/10.1145/3785022.3785113

New stuff I have been working on:

  1. My Tips for Teachers Guides to… series
  2. My updated mrbartonmaths website

My usual plugs
