
Datathon@IndoML 2025: Evaluating LLM-Powered AI Tutors

💰 Exciting Prize Alert: Win a total of ₹2,00,000 in cash prizes!

Welcome

Welcome to the Datathon at IndoML 2025. As in previous years, the datathon will be held in conjunction with IndoML 2025. We invite participation from students as well as early-career professionals. Top-performing teams will be invited to attend IndoML 2025 and present their solutions to leading researchers and professionals from both academia and industry. These teams will also share cash prizes totaling INR 2,00,000.

Organizers

Kaushal Kumar Maurya
MBZUAI

Mahesh Mohan
IIT Kharagpur

Aritra Mukherjee
BITS Pilani, Hyderabad Campus

Announcements

Registration is Open!

Interested in participating? You can now register for the datathon using the button or QR code below:

📝 Register Now
OR

Scan to Register

Task Description: Evaluating LLM-Powered AI Tutors

Motivation and Objective

The rapid development of Large Language Models (LLMs) has created new opportunities for scalable and personalized AI-driven education. With their growing integration into educational applications, AI tutors are increasingly supporting student learning in subjects such as mathematics (Macina et al., 2023; Wang et al., 2024). While these systems can generate fluent and context-aware responses, their pedagogical effectiveness—specifically the ability to correctly identify student mistakes and provide meaningful guidance—remains underexplored. Building on the work of Maurya et al. (2025) and the BEA Shared Task 2025 (Kochmar et al., 2025), this datathon invites the community to develop models that evaluate the meta-reasoning capabilities of AI tutor responses, focusing on mistake identification and pedagogically sound guidance.

Task Overview

Participants will receive annotated educational dialogues between students and tutors, primarily in mathematics. In each dialogue, the final student utterances exhibit a mistake or confusion, followed by a tutor response (human- or LLM-generated). The task is to determine whether the tutor’s response is pedagogically appropriate in two respects: identifying the student’s mistake and providing effective guidance, as defined by Maurya et al. (2025).

Evaluation Tracks

The task consists of two tracks. In each track, participants classify tutor responses into one of three labels—Yes, No, or To some extent.

Track 1: Mistake Identification

Has the tutor identified/recognized a mistake in a student’s response?

Yes: The tutor correctly identifies the mistake with high precision.
To some extent: The tutor partially identifies the mistake but lacks precision.
No: The tutor fails to identify the mistake or misidentifies it.

Track 2: Providing Guidance

Does the tutor offer correct and relevant guidance, such as an explanation, elaboration, hint, or example?

Yes: The tutor offers clear and accurate guidance to assist the student.
To some extent: The tutor provides partial guidance but lacks clarity or accuracy.
No: The tutor fails to provide guidance or offers misleading information.
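
To make the label space concrete, here is a minimal sketch of how a team might frame either track as a three-way text classification problem. The TF-IDF features, logistic-regression classifier, and toy examples are illustrative assumptions only, not an official or recommended baseline.

# Toy three-way classification baseline for either track (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

LABELS = ["Yes", "To some extent", "No"]  # label space shared by both tracks

# Stand-ins for real training examples: dialogue history + "[SEP]" + tutor response.
train_texts = [
    "Student: 9 [SEP] 3^3 means 3 x 3 x 3, which equals 27.",
    "Student: 9 [SEP] Great job, that is correct!",
    "Student: 9 [SEP] Hmm, are you sure? Take another look at your answer.",
]
train_labels = ["Yes", "No", "To some extent"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

print(model.predict(["Student: 62 [SEP] Remember, 12 x 6 means 12 groups of 6. Try again."]))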

Competition and Submission Guidelines

  1. Participants must form a team consisting of one or more members. Each team is required to register through the Registration Form. All future communications will be sent to registered participants via their registered email addresses.
  2. Each team must include at least one member affiliated with an Indian university or research institution.
  3. All evaluations and submissions will be managed through CodaBench. Detailed instructions for submissions will be provided closer to the competition start date.
  4. For questions and discussions, please visit the official competition forum: Discussion Forum.

Dataset Description

We constructed the train, dev-test, and test splits from the MRBench dataset, which was used in the BEA Shared Task 2025 (Kochmar et al., 2025). The dataset builds on mathematical dialogues from the MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024) datasets, representing secondary and primary school-level mathematics, respectively. It comprises conversation histories between students and tutors, where the last few student utterances often contain confusion or mistakes. Tutor responses are generated by seven LLMs acting as AI tutors (GPT-4, Sonnet, Llama3.1-8B, Llama3.1-405B, Gemini, Mistral, Phi-3) and two human tutors (Expert and Novice), aiming to mitigate student confusion or mistakes. Novice tutor responses are available only for the Bridge dialogues. Each response is annotated with one of three labels — Yes, No, or To some extent — to assess its pedagogical appropriateness for Mistake Identification and Providing Guidance. For more details on the annotation guidelines and the initial release of MRBench, please refer to Maurya et al., 2025.

Dataset Statistics

Split          #Dialogues   #Responses
Training Set   300          2,476
Dev-Test Set   41           333
Test Set       150          1,214

Sample Example from Training Set

[
  {
    "conversation_id": "258053434",
    "conversation_history": "Tutor: What is the value of 3^3? \nStudent: 9",
    "tutor_responses": {
      "Sonnet": {
        "response": "Let's double-check that: 3^3 means 3 multiplied by itself three times, so it's 3 x 3 x 3, which equals 27.",
        "annotation": {
          "Mistake_Identification": "Yes",
          "Providing_Guidance": "Yes"
        }
      },
      "GPT4": {
        "response": "That's a good effort, but actually, 3^3 means 3 multiplied by itself three times, which equals 27.",
        "annotation": {
          "Mistake_Identification": "Yes",
          "Providing_Guidance": "Yes"
        }
      },
      ...
    }
  }
]

Sample Example from Dev-Test/Test Set

[
  {
    "conversation_id": "613640346",
    "conversation_history": "Tutor: What is the product of 12 and 6? \nStudent: 62",
    "tutor_responses": {
      "Novice": {
        "response": "It seems like your answer is incorrect."
      },
      "Gemini": {
        "response": "Remember, when we multiply, we're combining groups. Let's try that again: How many groups of 6 are there in 12?"
      }
    },
    ...
  }
]

Key Descriptions

  • conversation_id: A unique identifier for each dialogue instance.
  • conversation_history: The conversation history between tutor and student.
  • response: The reply generated by either an LLM or human tutor intended to address the student's misunderstanding.
  • annotation: Labels that indicate whether the tutor's response correctly identifies the student's mistake or whether it provides pedagogical guidance. Only present for training data.

All datasets will be available for download closer to the start date of the competition.
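
To make the data format concrete, here is a minimal loading sketch, assuming the released files follow the JSON structure shown in the samples above; the file name and the "[SEP]" separator are hypothetical choices for illustration.

import json

def load_track_examples(path, track="Mistake_Identification"):
    """Flatten a split file into one (input_text, label) pair per tutor response for one track."""
    with open(path, encoding="utf-8") as f:
        dialogues = json.load(f)
    examples = []
    for dialogue in dialogues:
        history = dialogue["conversation_history"]
        for tutor_name, entry in dialogue["tutor_responses"].items():
            input_text = f"{history} [SEP] {entry['response']}"
            # The 'annotation' field is present only in the training split,
            # so dev-test/test examples get label = None here.
            label = entry.get("annotation", {}).get(track)
            examples.append((input_text, label))
    return examples

# Example usage (the file name is hypothetical):
# train_examples = load_track_examples("mrbench_train.json", track="Providing_Guidance")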

Evaluation

All submissions will be evaluated using both Macro F1-Score and Accuracy. A public leaderboard hosted on the CodaBench platform will display both metrics. Final rankings will be determined by the Macro F1-Score, the primary evaluation metric.
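
For clarity, here is a minimal sketch of how these two metrics can be computed with scikit-learn on toy labels; the official CodaBench scoring script may differ in its exact implementation.

from sklearn.metrics import f1_score, accuracy_score

gold = ["Yes", "No", "To some extent", "Yes"]
pred = ["Yes", "To some extent", "To some extent", "No"]

macro_f1 = f1_score(gold, pred, average="macro")  # primary ranking metric
accuracy = accuracy_score(gold, pred)
print(f"Macro F1: {macro_f1:.3f}  Accuracy: {accuracy:.3f}")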

Prizes

Development Stage Prize Distribution

A total of INR 40,000 is reserved for teams actively contributing during the development stage:

Category           Reward per Team   # Teams     Track 1: Mistake Identification   Track 2: Providing Guidance   Total
Ranks 1–3          INR 5,000         3*2 = 6     INR 15,000                        INR 15,000                    INR 30,000
Ranks 4–8          INR 1,000         5*2 = 10    INR 5,000                         INR 5,000                     INR 10,000
Total Prize Pool                                                                                                 INR 40,000

Evaluation Stage Prize Distribution

The total prize pool of INR 1,50,000 will be distributed across two tracks as follows:

Category           # Teams     Track 1: Mistake Identification    Track 2: Providing Guidance        Total
1st Rank           1*2 = 2     INR 25,000                         INR 35,000                         INR 60,000
2nd Rank           1*2 = 2     INR 15,000                         INR 25,000                         INR 40,000
3rd Rank           1*2 = 2     INR 8,000                          INR 12,000                         INR 20,000
Ranks 4–8          5*2 = 10    INR 12,500 (per team INR 2,500)    INR 17,500 (per team INR 3,500)    INR 30,000
Total Prize Pool                                                                                     INR 1,50,000

*Prizes will be awarded to top-performing teams based on final rankings and a comprehensive evaluation by the organizing committee. The committee reserves the right to make final decisions regarding prize distribution and any adjustments to the evaluation criteria. An amount of INR 10,000 from the total budget will be allocated to support the presentation logistics for top-performing teams.

Important Dates

Event                                        Dates
Registration                                 15th June – 15th August 2025
Development Phase                            15th August – 26th September 2025
Test Phase                                   26th September – 12th October 2025
Final Result Announcement                    12th – 19th October 2025
Report Submission for Top Teams (2 pages)    19th October – 9th November 2025
Presentation at IndoML'25                    19th – 21st December 2025

All deadlines are at 12:00 Noon IST (Indian Standard Time).

Volunteers

Jayesh Agarwal
BITS Pilani, Hyderabad Campus

Hemanth Karthikeya Ganti
BITS Pilani, Hyderabad Campus

Contact Us

If you have any questions or need assistance, feel free to reach out through the following channels:

  • Registration: Google Form
  • Email: datathon@indoml.in
  • Discussion Forum: Visit Forum

References

  1. Ekaterina Kochmar, Kaushal Kumar Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, and Justin Vasselli. 2025. Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications.
  2. Jakub Macina, Nico Daheim, Sankalan Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems. Findings of EMNLP 2023, pages 5602–5621, Singapore.
  3. Rose Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, and Dorottya Demszky. 2024. Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes. Proceedings of NAACL 2024, pages 2174–2199.
  4. Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. Proceedings of NAACL 2025, pages 1234–1251.

Acknowledgement & Sponsors

We express our sincere gratitude to the organizers of the BEA Shared Task 2025 for providing access to their datasets and task structure, which we have used and adapted for this datathon. Their efforts in curating high-quality annotated data form the foundation for organizing this datathon.

Any publications or derivative works resulting from this datathon should properly acknowledge the original sources associated with the BEA Shared Task 2025, including the shared task findings paper (Kochmar et al., 2025) and dataset papers (Maurya et al., 2025; Macina et al., 2023; Wang et al., 2024).

© IndoML 2025 Datathon. All rights reserved.