Iteratively engineered system prompts significantly improve an AI chatbot’s ability to guide students in constructing scientific arguments.
Objective:
The purpose of this study was to design, refine, and evaluate an AI chatbot capable of supporting middle school students as they engage in scientific argumentation, specifically by providing high-quality, pedagogically grounded feedback aligned with established learning theories. The study aimed to determine how different versions of a system prompt influence the chatbot’s relevance, clarity, motivational quality, and ability to promote reflective thinking.
Methods:
The researchers used a human-in-the-loop design-based research approach, iteratively developing three versions of the chatbot’s system prompt. Each successive version incorporated more detailed instructional guidance rooted in learning theory, scientific argumentation frameworks, and formative feedback principles. The chatbot, powered by GPT-4o, was embedded in a science assessment task involving ecosystem simulations. To evaluate performance, 18 synthetic student inputs were created across six use cases (e.g., clarifying claims, clarifying evidence, clarifying reasoning). Each input was tested with each prompt version, generating 162 chatbot responses. Researchers coded each response using a rubric assessing relevance, clarity, engagement/motivation, and reflective thinking. Statistical analyses, including ANOVA and Tukey post hoc tests, were conducted to compare performance across prompt versions, input types, and use cases.
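The core statistical comparison described above (one-way ANOVA across prompt versions) can be sketched in pure Python. The rubric scores below are invented for illustration only, not the study's data, and the Tukey post hoc step the paper also ran is omitted here:

```python
# Illustrative one-way ANOVA comparing rubric scores (e.g., relevance on a
# 0-3 scale) across three system-prompt versions. Scores are hypothetical.

def one_way_anova(groups):
    """Return the F statistic for a one-way ANOVA across the given groups."""
    k = len(groups)                      # number of groups (prompt versions)
    n = sum(len(g) for g in groups)      # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: variation of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: variation of scores around their group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical relevance scores for prompt Versions 0, 1, and 2
v0 = [1, 1, 2, 1, 2, 1]
v1 = [2, 3, 2, 3, 2, 2]
v2 = [3, 3, 3, 2, 3, 3]

F = one_way_anova([v0, v1, v2])
print(f"F = {F:.2f}")  # F = 15.00
```

A large F statistic like this would then be checked against the F distribution for significance, with Tukey's HSD identifying which version pairs differ, mirroring the analysis pipeline the study reports.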
Key Findings:
- All four feedback categories (relevance, clarity, engagement/motivation, and reflective thinking) showed significant improvement as the system prompt evolved.
- The largest gains occurred between the minimally engineered prompt (Version 0) and the improved instructional prompt (Version 1), while Version 2 provided additional refinements and consistency.
- The final version produced shorter, clearer responses that avoided giving direct answers and instead encouraged students to elaborate, reflect, and connect evidence to claims.
- Earlier prompt versions showed uneven quality depending on input type (high-quality, low-quality, or typo-filled), whereas Version 2 responded consistently across all input qualities.
- Response quality also became more consistent across the different use cases, indicating improved adaptability and instructional alignment.
- The study identified specific patterns that characterize high-quality chatbot behavior: relevance to the argumentation framework, developmental appropriateness, concise phrasing, and intentional prompting for reflection.
Implications:
The study demonstrates that high-quality chatbot performance depends heavily on strong system-prompt design grounded in learning science. The resulting design principles show how AI tools can be shaped to scaffold scientific reasoning, model effective argumentation, and promote metacognitive engagement. These findings offer actionable guidance for educators, researchers, and developers building AI-enhanced learning environments. The work also establishes a replicable model for using human-in-the-loop design to align LLM outputs with instructional goals, potentially improving equitable access to high-quality feedback at scale.
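As an illustration of how such design principles might be encoded, the sketch below builds a chat-message payload around a hypothetical system prompt. The prompt text is invented to reflect the reported principles (framework alignment, no direct answers, concise age-appropriate phrasing, reflection prompts); it is not the authors' actual prompt, and `build_messages` is an assumed helper name:

```python
# Hypothetical system prompt reflecting the design principles the study
# reports. This is NOT the authors' actual prompt, only a sketch.

SYSTEM_PROMPT = """\
You are a feedback assistant for middle school science students building
claim-evidence-reasoning (CER) arguments about an ecosystem simulation.
Rules:
- Never give the answer or write the argument for the student.
- Keep each reply to 2-3 sentences, in age-appropriate language.
- Name which CER component (claim, evidence, or reasoning) needs work.
- End with one reflective question asking the student to connect their
  evidence to their claim.
"""

def build_messages(student_input: str) -> list[dict]:
    """Assemble a chat-completion message list pairing the instructional
    system prompt with a student's input."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": student_input},
    ]

msgs = build_messages("My claim is the fish died because of the algae.")
print(msgs[0]["role"])  # system
```

In the study's setup, a message list like this would be sent to GPT-4o, with each revision of the system prompt changing only the first message while the student-facing task stayed fixed.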
Limitations:
The study relied on synthetic student inputs, limiting ecological validity. The rubric used to evaluate chatbot responses demonstrated only moderate initial inter-rater agreement, indicating the need for further refinement. The findings are specific to middle-school science argumentation tasks and may not generalize to other subjects or grade levels. Additionally, ethical issues such as role transparency, equity, and data privacy were not the primary focus and remain areas for further attention.
Future Directions:
Future research should validate these findings with real students and teachers, refine evaluation frameworks for AI feedback, and explore chatbots that support additional components of argumentation such as counterclaims or alternative explanations. Researchers should also investigate co-design approaches that involve educators and learners in constructing system prompts, examine long-term classroom integration, and address key ethical considerations in AI-supported learning.
Title and Authors:
“A Framework for Designing an AI Chatbot to Support Scientific Argumentation” by Field M. Watts, Lei Liu, Teresa M. Ober, Yi Song, Euvelisse Jusino-Del Valle, Xiaoming Zhai, Yun Wang, and Ninghao Liu.
Published On:
8 November 2025
Published By:
Education Sciences (MDPI)