
Datathon@IndoML 2025: Evaluating LLM-Powered AI Tutors

💰 Exciting Prize Alert: Win a total of ₹2,00,000 in cash prizes!

Welcome

Welcome to the Datathon at IndoML 2025. As in previous years, the datathon will be held in conjunction with IndoML 2025. We invite participation from students as well as early-career professionals. Top-performing teams will be invited to attend IndoML 2025 and present their solutions to leading researchers and professionals from both academia and industry. These teams will also share cash prizes totaling INR 2,00,000.

Organizers

Kaushal Kumar Maurya
MBZUAI

Mahesh Mohan
IIT Kharagpur

Aritra Mukherjee
BITS Pilani, Hyderabad Campus

Announcements

Registration is Open!

Interested in participating? You can now register for the datathon using the button or QR code below:

📝 Register Now
OR

Scan to Register

Task Description: Evaluating LLM-Powered AI Tutors

Motivation and Objective

The rapid development of Large Language Models (LLMs) has created new opportunities for scalable and personalized AI-driven education. With their growing integration into educational applications, AI tutors are increasingly supporting student learning in subjects such as mathematics (Macina et al., 2023; Wang et al., 2024). While these systems can generate fluent and context-aware responses, their pedagogical effectiveness—specifically the ability to correctly identify student mistakes and provide meaningful guidance—remains underexplored. Building on the work of Maurya et al. (2025) and the BEA Shared Task 2025 (Kochmar et al., 2025), this datathon invites the community to develop models that evaluate the meta-reasoning capabilities of AI tutor responses, focusing on mistake identification and pedagogically sound guidance.

Task Overview

Participants will receive annotated educational dialogues between students and tutors, primarily in mathematics. In each dialogue, the final student utterances exhibit a mistake or confusion, followed by a tutor response (human- or LLM-generated). The task is to determine whether the tutor’s response is pedagogically appropriate in two respects: identifying the student’s mistake and providing effective guidance, as defined by Maurya et al. (2025).

Evaluation Tracks

The task consists of two tracks. In each track, participants classify tutor responses into one of three labels—Yes, No, or To some extent.

Track 1: Mistake Identification

Has the tutor identified/recognized a mistake in a student’s response?

Yes: The tutor correctly identifies the mistake with high precision.
To some extent: The tutor partially identifies the mistake but lacks precision.
No: The tutor fails to identify the mistake or misidentifies it.

Track 2: Providing Guidance

Does the tutor offer correct and relevant guidance, such as an explanation, elaboration, hint, or example?

Yes: The tutor offers clear and accurate guidance to assist the student.
To some extent: The tutor provides partial guidance but lacks clarity or accuracy.
No: The tutor fails to provide guidance or offers misleading information.
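
To make the label space concrete, here is a minimal sketch of how a team might frame either track as a three-way text classification problem. The TF-IDF features, logistic-regression classifier, and toy examples are illustrative assumptions only, not an official or recommended baseline.

# Toy three-way classification baseline for either track (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

LABELS = ["Yes", "To some extent", "No"]  # label space shared by both tracks

# Stand-ins for real training examples: dialogue history + "[SEP]" + tutor response.
train_texts = [
    "Student: 9 [SEP] 3^3 means 3 x 3 x 3, which equals 27.",
    "Student: 9 [SEP] Great job, that is correct!",
    "Student: 9 [SEP] Hmm, are you sure? Take another look at your answer.",
]
train_labels = ["Yes", "No", "To some extent"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

print(model.predict(["Student: 62 [SEP] Remember, 12 x 6 means 12 groups of 6. Try again."]))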

Competition and Submission Guidelines

  1. Participants must form a team consisting of one or more members. Each team is required to register through the Registration Form. All future communications will be sent to registered participants via their registered email addresses.
  2. Each team must include at least one member affiliated with an Indian university or research institution.
  3. All evaluations and submissions will be managed through CodaBench. Detailed instructions for submissions will be provided closer to the competition start date.
  4. For questions and discussions, please visit the official competition forum: Discussion Forum.

Dataset Description

We constructed the train, dev-test, and test splits from the MRBench dataset, which was used in the BEA Shared Task 2025 (Kochmar et al., 2025). The dataset builds on mathematical dialogues from the MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024) datasets, representing secondary and primary school-level mathematics, respectively. It comprises conversation histories between students and tutors, where the last few student utterances often contain confusion or mistakes. Tutor responses are generated by seven LLMs acting as AI tutors (GPT-4, Sonnet, Llama3.1-8B, Llama3.1-405B, Gemini, Mistral, Phi-3) and two human tutors (Expert and Novice), aiming to mitigate student confusion or mistakes. Novice tutor responses are available only for the Bridge dialogues. Each response is annotated with one of three labels — Yes, No, or To some extent — to assess its pedagogical appropriateness for Mistake Identification and Providing Guidance. For more details on the annotation guidelines and the initial release of MRBench, please refer to Maurya et al., 2025.

Dataset Statistics

Split          #Dialogues   #Responses
Training Set   300          2,476
Dev-Test Set   41           333
Test Set       150          1,214

Sample Example from Training Set

[
  {
    "conversation_id": "258053434",
    "conversation_history": "Tutor: What is the value of 3^3? \nStudent: 9",
    "tutor_responses": {
      "Sonnet": {
        "response": "Let's double-check that: 3^3 means 3 multiplied by itself three times, so it's 3 x 3 x 3, which equals 27.",
        "annotation": {
          "Mistake_Identification": "Yes",
          "Providing_Guidance": "Yes"
        }
      },
      "GPT4": {
        "response": "That's a good effort, but actually, 3^3 means 3 multiplied by itself three times, which equals 27.",
        "annotation": {
          "Mistake_Identification": "Yes",
          "Providing_Guidance": "Yes"
        }
      },
      ...
    }
  }
]

Sample Example from Dev-Test/Test Set

[
  {
    "conversation_id": "613640346",
    "conversation_history": "Tutor: What is the product of 12 and 6? \nStudent: 62",
    "tutor_responses": {
      "Novice": {
        "response": "It seems like your answer is incorrect."
      },
      "Gemini": {
        "response": "Remember, when we multiply, we're combining groups. Let's try that again: How many groups of 6 are there in 12?"
      }
    },
    ...
  }
]

Key Descriptions

  • conversation_id: A unique identifier for each dialogue instance.
  • conversation_history: The conversation history between tutor and student.
  • response: The reply generated by either an LLM or human tutor intended to address the student's misunderstanding.
  • annotation: Labels that indicate whether the tutor's response correctly identifies the student's mistake or whether it provides pedagogical guidance. Only present for training data.

All datasets will be available for download closer to the start date of the competition.
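
To make the data format concrete, here is a minimal loading sketch, assuming the released files follow the JSON structure shown in the samples above; the file name and the "[SEP]" separator are hypothetical choices for illustration.

import json

def load_track_examples(path, track="Mistake_Identification"):
    """Flatten a split file into one (input_text, label) pair per tutor response for one track."""
    with open(path, encoding="utf-8") as f:
        dialogues = json.load(f)
    examples = []
    for dialogue in dialogues:
        history = dialogue["conversation_history"]
        for tutor_name, entry in dialogue["tutor_responses"].items():
            input_text = f"{history} [SEP] {entry['response']}"
            # The 'annotation' field is present only in the training split,
            # so dev-test/test examples get label = None here.
            label = entry.get("annotation", {}).get(track)
            examples.append((input_text, label))
    return examples

# Example usage (the file name is hypothetical):
# train_examples = load_track_examples("mrbench_train.json", track="Providing_Guidance")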

Evaluation

All submissions will be evaluated using both Macro F1-Score and Accuracy. A public leaderboard hosted on the CodaBench platform will display both metrics. Final rankings will be determined by the Macro F1-Score, the primary evaluation metric.
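
For clarity, here is a minimal sketch of how these two metrics can be computed with scikit-learn on toy labels; the official CodaBench scoring script may differ in its exact implementation.

from sklearn.metrics import f1_score, accuracy_score

gold = ["Yes", "No", "To some extent", "Yes"]
pred = ["Yes", "To some extent", "To some extent", "No"]

macro_f1 = f1_score(gold, pred, average="macro")  # primary ranking metric
accuracy = accuracy_score(gold, pred)
print(f"Macro F1: {macro_f1:.3f}  Accuracy: {accuracy:.3f}")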

Prizes

Development Stage Prize Distribution

A total of INR 40,000 is reserved for teams actively contributing during the development stage:

Category           Reward per Team   # Teams     Track 1: Mistake Identification   Track 2: Providing Guidance   Total
Ranks 1–3          INR 5,000         3*2 = 6     INR 15,000                        INR 15,000                    INR 30,000
Ranks 4–8          INR 1,000         5*2 = 10    INR 5,000                         INR 5,000                     INR 10,000
Total Prize Pool                                                                                                 INR 40,000

Evaluation Stage Prize Distribution

The total prize pool of INR 1,50,000 will be distributed across two tracks as follows:

Category           # Teams     Track 1: Mistake Identification    Track 2: Providing Guidance        Total
1st Rank           1*2 = 2     INR 25,000                         INR 35,000                         INR 60,000
2nd Rank           1*2 = 2     INR 15,000                         INR 25,000                         INR 40,000
3rd Rank           1*2 = 2     INR 8,000                          INR 12,000                         INR 20,000
Ranks 4–8          5*2 = 10    INR 12,500 (per team INR 2,500)    INR 17,500 (per team INR 3,500)    INR 30,000
Total Prize Pool                                                                                     INR 1,50,000

*Prizes will be awarded to top-performing teams based on final rankings and a comprehensive evaluation by the organizing committee. The committee reserves the right to make final decisions regarding prize distribution and any adjustments to the evaluation criteria. An amount of INR 10,000 from the total budget will be allocated to support the presentation logistics for top-performing teams.

Important Dates

Event                                        Dates
Registration                                 15th June – 15th August 2025
Development Phase                            15th August – 26th September 2025
Test Phase                                   26th September – 12th October 2025
Final Result Announcement                    12th – 19th October 2025
Report Submission for Top Teams (2 pages)    19th October – 9th November 2025
Presentation at IndoML'25                    19th – 21st December 2025

All deadlines are at 12:00 Noon IST (Indian Standard Time).

Volunteers

Jayesh Agarwal
BITS Pilani, Hyderabad Campus

Hemanth Karthikeya Ganti
BITS Pilani, Hyderabad Campus

Contact Us

If you have any questions or need assistance, feel free to reach out through the following channels:

  • Registration: Google Form
  • Email: datathon@indoml.in
  • Discussion Forum: Visit Forum

References

  1. Ekaterina Kochmar, Kaushal Kumar Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, and Justin Vasselli. 2025. Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications.
  2. Jakub Macina, Nico Daheim, Sankalan Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems. Findings of EMNLP 2023, pages 5602–5621, Singapore.
  3. Rose Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, and Dorottya Demszky. 2024. Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes. Proceedings of NAACL 2024, pages 2174–2199.
  4. Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. Proceedings of NAACL 2025, pages 1234–1251.

Acknowledgement & Sponsors

We express our sincere gratitude to the organizers of the BEA Shared Task 2025 for providing access to their datasets and task structure, which we have used and adapted for this datathon. Their efforts in curating high-quality annotated data form the foundation for organizing this datathon.

Any publications or derivative works resulting from this datathon should properly acknowledge the original sources associated with the BEA Shared Task 2025, including the shared task findings paper (Kochmar et al., 2025) and dataset papers (Maurya et al., 2025; Macina et al., 2023; Wang et al., 2024).

© IndoML 2025 Datathon. All rights reserved.