Article Summary

Sep 30, 2025

Decision tree classifiers using student interaction features can accurately predict programming module outcomes with 85-91% accuracy while providing interpretable visualizations that identify key course materials, offering educators actionable insights for timely intervention with at-risk students.

Objective: This study developed a machine learning approach to predict student progress at the module level in large-scale online programming courses for K-12 students. The primary goal was to predict whether students would pass, fail, or not submit the final problem in each course module based on their interactions with preceding course materials. The researchers aimed to create accurate predictions while maintaining interpretability for educators, enabling them to identify struggling students, understand which course materials are most critical for success, and intervene before students drop out.

Methods: The researchers analyzed log data from four online Python programming courses conducted over five weeks in 2018, targeting different age groups and skill levels: Novice Blockly (ages 10-14, no experience), Beginners Blockly (ages 12-16, no experience), Beginners Python (same as Beginners Blockly but in Python), and Intermediate Python (ages 14-18, with experience). Across all courses, approximately 35,000 students were enrolled, with active participation rates ranging from 20% (Blockly courses) to 45-50% (Python courses).

Each course contained 10 modules with content slides (providing instructions, hints, and interactive examples) and problem slides (requiring code submission for auto-grading). The researchers extracted student interaction features from log data, recording nine types of events including slide visits, code runs, problem attempts, and pass/fail outcomes. For each module, they created interaction vectors where content slide elements took values "Completed"/"Not completed" and problem slide elements took values "Passed submission"/"Failed submission"/"No submission."
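
As a rough illustration of this encoding (a minimal sketch, not the authors' code; the event types and slide names below are hypothetical stand-ins for the platform's actual log schema), a per-module interaction vector could be assembled like this:

```python
# Minimal sketch, not the authors' implementation: assembling one student's
# per-module interaction vector from raw log events. Event types and slide
# names are hypothetical.

CONTENT_STATES = {True: "Completed", False: "Not completed"}

def build_module_vector(events, content_slides, problem_slides):
    """events: list of dicts such as {"slide": "m3_p1", "type": "pass"}."""
    visited = {e["slide"] for e in events if e["type"] in ("visit", "run")}
    outcomes = {}
    for e in events:
        if e["type"] == "pass":
            outcomes[e["slide"]] = "Passed submission"        # a pass always wins
        elif e["type"] == "fail":
            outcomes.setdefault(e["slide"], "Failed submission")

    vector = {s: CONTENT_STATES[s in visited] for s in content_slides}
    vector.update({s: outcomes.get(s, "No submission") for s in problem_slides})
    return vector

events = [
    {"slide": "m3_s1", "type": "visit"},
    {"slide": "m3_p1", "type": "fail"},
    {"slide": "m3_p1", "type": "pass"},
]
print(build_module_vector(events, ["m3_s1", "m3_s2"], ["m3_p1", "m3_p2"]))
# {'m3_s1': 'Completed', 'm3_s2': 'Not completed',
#  'm3_p1': 'Passed submission', 'm3_p2': 'No submission'}
```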

The predictive task was framed as a three-class classification problem predicting performance on the last problem slide in each module. Two feature selection algorithms were applied: Correlation-Based Feature Selection (CFS), which selects features highly correlated with the target but uncorrelated with each other, and Information Gain Ratio (GR), which ranks features by their informativeness. Decision tree (DT) classifiers were built under three conditions: without feature selection, with CFS, and with GR.
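
The paper ran CFS and GR in Weka; as an illustrative sketch only, a loosely analogous information-theoretic ranking can be obtained with mutual information in scikit-learn (this is not the paper's exact feature selection, and the data below is synthetic):

```python
# Illustrative sketch only: Weka's CFS and Gain Ratio are not reproduced here.
# Mutual information serves as a related information-theoretic ranking.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 12))   # 500 students, 12 slide features
y = rng.integers(0, 3, size=500)         # 0 = Passed, 1 = Failed, 2 = No submission

scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
ranking = np.argsort(scores)[::-1]
top_k = ranking[:4]                      # keep only a handful of slides, as CFS/GR did
print("top-ranked feature indices:", top_k, "scores:", scores[top_k].round(3))
```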

The researchers chose decision trees for their intrinsic interpretability—providing rule-based explanations and tree visualizations—and computational efficiency. They compared DT performance against a majority class baseline and three other algorithms: Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM). All models were implemented using Weka with default hyperparameters and evaluated using 10-fold stratified cross-validation. Performance metrics included overall accuracy, precision, recall, F1 score, and for DTs, the number of leaves and total tree size as interpretability indicators.
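
A minimal sketch of this evaluation protocol, with scikit-learn defaults and synthetic data standing in for the paper's Weka models and course logs, might look like the following:

```python
# Rough sketch of the evaluation setup: majority baseline, DT, LR, RF, and SVM
# compared under 10-fold stratified cross-validation. Data is synthetic.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(600, 12))   # synthetic interaction features
y = rng.integers(0, 3, size=600)         # Passed / Failed / No submission

models = {
    "Majority baseline": DummyClassifier(strategy="most_frequent"),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    f1 = cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
    print(f"{name:20s} accuracy={acc:.3f}  macro-F1={f1:.3f}")
```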

Key Findings:

Exploratory Analysis: Heat map visualizations revealed three distinct student groups at the course level. First, a large group of disengaged, at-risk students (< 10% content slide completion, 0-30% problem slide completion) appeared in bright yellow/green regions of bottom-left corners across all courses. Second, disengaged but successful students (> 90% problem slides passed, < 10% content slides completed) were visible in top-left corners, particularly in advanced Python courses targeting students with prior experience. Third, content slide completion was generally low (most students completing 0-20%), suggesting selective engagement.

At the module level, an additional group emerged: engaged, high-performing students (> 90% problem slides, 40-60% content slides) visible near top-right corners in several modules. This pattern was most apparent in earlier modules of Novice Blockly and specific modules in Beginners Python. More advanced courses showed higher proportions of disengaged high-performers compared to at-risk students, while beginner block-based courses showed more even representation.
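
For illustration only, a heat map of this kind amounts to a 2D histogram of per-student content completion against problem pass rates (synthetic data below; the paper's figures were drawn per course and per module from the real logs):

```python
# Illustrative sketch of the exploratory heat maps: each student contributes a
# (content completion %, problem pass %) point, binned into a 2D histogram.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
content_pct = rng.beta(1, 6, size=2000) * 100                 # most complete little content
problem_pct = np.clip(rng.normal(40, 35, size=2000), 0, 100)  # wide spread of pass rates

plt.hist2d(content_pct, problem_pct, bins=10, cmap="viridis")
plt.xlabel("% content slides completed")
plt.ylabel("% problem slides passed")
plt.colorbar(label="number of students")
plt.title("Engagement vs. performance (synthetic data)")
plt.show()
```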

Predictive Performance: Decision trees achieved consistently high performance across all courses: 85.2-90.9% accuracy, 0.88-0.92 precision, 0.78-0.98 recall, and 0.83-0.92 F1 score. The comparison algorithms performed slightly better—SVM achieved 83.9-94.4% accuracy, LR achieved 83.8-94.5%, and RF achieved 83.3-94.5%—but lacked DT's interpretability and were slower to train. All methods substantially surpassed majority class baselines (58.5-71.8% accuracy).

Class-specific performance showed DTs excelled at predicting "Passed" and "No submission" outcomes with high precision, recall, and F1 scores. However, predicting "Failed" remained challenging, with significantly lower metrics (precision 0.00-0.61, recall 0.00-0.08, F1 0.00-0.28) across all courses. This was attributed to severe class imbalance—failed submissions represented at most 4.3% of any module's submissions. Accuracy remained consistently high (80%+) through Module 6 before declining in later modules, corresponding to increased student dropout.

Feature Selection Impact: Both CFS and GR feature selection dramatically reduced tree complexity while maintaining or even improving accuracy. Trees without feature selection averaged 6.1-13.9 leaf nodes and 9.9-23.6 total nodes. With CFS, these reduced to 3.4-4.7 leaves and 4.7-6.8 nodes; with GR, to 3.2-4.1 leaves and 4.3-7.0 nodes. GR produced slightly smaller trees than CFS. This demonstrated that compact, highly interpretable models could achieve competitive performance.
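
The two interpretability indicators (leaf count and total tree size) can be read directly off a fitted tree; the sketch below uses scikit-learn as a stand-in for the Weka trees, with synthetic data and an arbitrary depth cap:

```python
# Sketch: reporting leaf count and total node count, plus a textual rule view,
# for a compact decision tree. Synthetic data; depth cap chosen arbitrarily.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(400, 6))
y = rng.integers(0, 3, size=400)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print("leaves:", tree.get_n_leaves(), " total nodes:", tree.tree_.node_count)
print(export_text(tree, feature_names=[f"slide_{i}" for i in range(6)]))
```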

Important Course Materials: Feature selection revealed that problem slide completion was the most critical predictor, accounting for approximately 75% of selected features across modules. Students who completed the preceding problem slides, particularly those nearest the end of the module, were significantly more likely to pass the final problem. This aligned with the "doer effect," which emphasizes active practice over passive consumption.

Interactive content slides, though less common among selected features, played important roles in certain courses—representing 25% of features in Novice Blockly and 40% in Intermediate Python. Content slides with runnable code yielded better outcomes when students actively ran the code rather than just viewing it. Newly introduced platform features like interactive steps and paired problems appeared frequently among high-ranked features.

Course Differences: Comparing Beginners Blockly with Beginners Python (identical problems, different programming environments) revealed important differences. Feature selection chose mostly problem slides (23/27) for Beginners Python but primarily content slides (18/29) for Beginners Blockly. This suggested that the more experienced students in the Python course relied on prior knowledge and skipped the content slides, while the less experienced Blockly students relied more heavily on the foundational content.

Intervention Window: Analysis showed a substantial prediction window for educator intervention. The median time gap between students' last preceding slide visit and the final problem deadline ranged from 176.2 hours (7.3 days, Beginners Blockly) to 196.1 hours (8.2 days, Intermediate Python). Most students (78-96%) reached the second-to-last slide, and most then attempted the final problem, providing ample opportunity for real-time intervention.

Implications: The study demonstrates that student interaction features from course logs are valuable predictors of module-level outcomes, enabling accurate identification of at-risk students with sufficient time for intervention. Decision trees provide an optimal balance between predictive accuracy and interpretability for educational contexts, achieving performance competitive with black-box models while offering actionable insights through tree visualizations and feature rankings.

The approach helps educators understand student learning pathways, identify critical course materials that predict success, detect points where students struggle and potentially give up, and recognize redundant or low-value content for improvement. The strong performance in predicting "No submission" outcomes is particularly valuable for preventing dropout. The method's computational efficiency—DTs are fast to train and evaluate—enables real-time predictions during classroom sessions or as students use the platform.

The findings support a data-driven approach to course improvement, where educators can: (1) emphasize completion of key problem slides students tend to skip, (2) enhance interactive content slides that appear in decision paths, (3) redesign or remove materials with low pedagogical value, (4) differentiate support based on student group (at-risk vs. high-performing), and (5) adapt content complexity based on student experience levels (Blockly vs. Python courses).

Limitations: The study acknowledges several constraints. Predicting failed submissions remains highly challenging due to severe class imbalance (< 5% of submissions in most modules), meaning models lack sufficient examples to generalize well for this important outcome category. The dataset represents convenience sampling from a single platform in 2018, potentially limiting generalizability to other programming education contexts, platforms, or time periods.

Participants were primarily from Australia, with small numbers from New Zealand and the UK, limiting geographic and cultural diversity. The courses ran over only five weeks, preventing analysis of longer-term learning trajectories or retention patterns. Student dropout in later modules (evidenced by increasingly skewed class distributions) reduced prediction accuracy and F1 scores toward the end of each course.

The analysis focused on module-level predictions rather than finer-grained problem-by-problem progression, potentially missing intermediate struggle points. The study did not examine the order in which students attempted slides, which might provide additional insights into effective learning sequences. The evaluation by educators, while positive, involved only two reviewers assessing 14 modules, representing limited external validation.

The decision trees, while interpretable, may mislead educators into thinking that slides absent from a tree should be removed, when those slides might serve supporting roles the algorithm does not capture. The approach predicts outcomes but does not explain why particular materials matter beyond correlation patterns, limiting pedagogical insight into the underlying learning mechanisms.

Future Directions: The researchers propose several extensions. Examining slide progression: Investigating the value of slides preceding key decision tree slides and those ranked lower would provide a more complete picture of how each module step influences the next, enabling better understanding of complete learning pathways rather than just pivotal moments.

Classifying problem slide types: Further categorizing problems (e.g., Parsons problems where students arrange code blocks, paired problems with connected solutions) would help understand which problem formats are most effective for different learning objectives and student populations, encouraging innovation in assessment design.

Incorporating slide sequence data: Adding temporal information about the order students attempted slides could reveal whether certain sequences are more effective than others, informing recommendations about optimal learning paths and identifying students who might benefit from reviewing earlier materials.

Progressive prediction windows: Investigating how early accurate predictions can be made by using the first n slides to predict the final outcome, progressively increasing n, would help determine the earliest reliable intervention points and how prediction confidence evolves as students progress.
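
A simple way to prototype this idea (a sketch under the assumption that feature columns are ordered by slide position within the module; synthetic data) is to retrain on progressively longer feature prefixes:

```python
# Sketch of progressive prediction: train on the first n slide features and
# observe how cross-validated accuracy evolves as n grows. Synthetic data.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(600, 12))
y = rng.integers(0, 3, size=600)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for n in range(2, X.shape[1] + 1, 2):
    acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, :n], y, cv=cv, scoring="accuracy").mean()
    print(f"first {n:2d} slides -> accuracy {acc:.3f}")
```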

Addressing class imbalance: Exploring techniques like oversampling, synthetic data generation (SMOTE), or cost-sensitive learning to improve prediction of failed submissions would make the approach more valuable for identifying students who attempt but struggle with problems, not just those who don't attempt at all.
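
Both remedies are available off the shelf; the sketch below shows cost-sensitive class weights in scikit-learn and SMOTE oversampling from the imbalanced-learn package on synthetic, deliberately imbalanced labels (SMOTENC would be the more appropriate variant for purely categorical interaction features):

```python
# Sketch of two imbalance remedies mentioned above. Assumes imbalanced-learn
# is installed for SMOTE; all data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(4)
X = rng.integers(0, 3, size=(600, 12)).astype(float)
y = rng.choice([0, 1, 2], size=600, p=[0.55, 0.04, 0.41])   # "Failed" (1) is rare

# Option 1: cost-sensitive learning via class weights
weighted_tree = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)

# Option 2: oversample the minority class before fitting a plain tree
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
plain_tree = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)
print("class counts before:", np.bincount(y), " after SMOTE:", np.bincount(y_res))
```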

Longitudinal studies: Following student cohorts across multiple courses or academic years would reveal how programming skills develop over time, whether early struggles predict later success, and how intervention effectiveness compounds across educational trajectories.

Generalization testing: Applying the approach to programming courses on other platforms, in other countries, teaching other languages (JavaScript, Java, C++), or using different pedagogical models would validate whether findings transfer and identify context-specific versus universal patterns.

Comparison with newer models: Evaluating performance against recent deep learning approaches (LSTMs, Transformers) and advanced explainable AI methods (SHAP, LIME applied to ensemble models) would clarify whether the interpretability-accuracy tradeoff of decision trees remains favorable as machine learning advances.

Real-world deployment: Implementing the system in live courses with randomized controlled trials comparing student outcomes with and without the intervention would provide causal evidence of effectiveness and identify practical implementation challenges.

Title and Authors: "A Machine Learning Approach for Predicting Student Progress in Online Programming Education" by Vincent Zhang, Bryn Jeffries, and Irena Koprinska from the School of Computer Science, The University of Sydney, Sydney, NSW 2006, Australia.

Published On: Received October 3, 2024; Accepted August 12, 2025; Published online September 25, 2025.

Published By: International Journal of Artificial Intelligence in Education (Int J Artif Intell Educ), published by Springer. DOI: https://doi.org/10.1007/s40593-025-00510-9. This is an open access article licensed under Creative Commons Attribution 4.0 International License, with funding enabled and organized by CAUL and its Member Institutions.
