Conference Presentations by Nik Muhammad Aiman

VIRTUAL LANGUAGE AND COMMUNICATION POSTGRADUATE INTERNATIONAL SEMINAR 2024 (VLCPIS 2024) PROCEEDINGS, 2024
The emergence of Artificial Intelligence (AI) technology has made automated writing evaluation (AWE) accessible to everyone, not only as a tool for evaluating writing tasks but also for providing feedback on students’ writing. It is therefore important to establish how reliable these AI tools are in conducting such evaluations. Previous research has demonstrated that GPT-4-powered ChatGPT outperformed human raters in internal consistency when evaluating writing tasks, while its inter-rater reliability with human raters was relatively comparable. As Copilot provides free access to GPT-4, this paper investigated whether GPT-4-powered Copilot could evaluate writing tasks with the same intra-rater and inter-rater consistency as GPT-4-powered ChatGPT. The study used 51 college students’ essays, which were evaluated by two human raters and assessed twice using GPT-4-powered Copilot. Intraclass Correlation Coefficient (ICC) scores were calculated to obtain and compare the intra-rater and inter-rater reliability of the human raters and Copilot, along with the means of the marks awarded. The results revealed that the mean marks assigned by Copilot were notably higher than those awarded by the human raters. Additionally, GPT-4-powered Copilot not only demonstrated significantly lower internal consistency than the human raters but was also poorly consistent with them. In conclusion, GPT-4-powered Copilot was not as reliable as GPT-4-powered ChatGPT in evaluating writing tasks. Using GPT-4-powered Copilot as an evaluator of writing tasks is therefore not recommended, as it risks producing invalid and unreliable evaluations.
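The abstract does not state which ICC model the study used, so the following is only an illustrative sketch of how such a reliability score might be computed. It assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The function name `icc_2_1` and the marks in the usage example are hypothetical and are not taken from the paper.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows, one row per essay, one column per rating
    (e.g. two human raters, or two Copilot runs on the same essays).
    Assumption: this ICC form is chosen for illustration only.
    """
    n = len(scores)          # number of essays (subjects)
    k = len(scores[0])       # number of ratings per essay

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between essays
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical marks for five essays, each scored in two separate runs.
two_runs = [[72, 80], [65, 61], [88, 79], [70, 84], [59, 66]]
print(round(icc_2_1(two_runs), 3))
```

Passing the two runs of one rater gives an intra-rater estimate, while passing one column per rater gives an inter-rater estimate. Because this form measures absolute agreement, a rater who marks consistently higher than another lowers the score even when the rank order of the essays is the same.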