ChatGPT is still no match for humans when it comes to accounting

Last month, OpenAI launched its newest AI chatbot product, GPT-4. According to the folks at OpenAI, the bot, which uses machine learning to generate natural language text, scored in the 90th percentile on the bar exam, passed 13 of 15 AP exams and earned a near-perfect score on the GRE Verbal exam.

Inquisitive minds at BYU and 186 other universities wanted to know how OpenAI's technology would fare on accounting exams. So they tested the original version, ChatGPT. The researchers say that while there is still work to be done in accounting, the technology is a game changer that will change the way everyone teaches and learns for the better.

“When this technology first emerged, everyone was concerned that students could now use it to cheat,” said lead study author David Wood, a BYU accounting professor. “But the opportunity to cheat is always there. So for us, we’re trying to focus on what we can do with this technology now that we weren’t able to do before to improve teaching processes for faculty and learning processes for students. Testing it was very eye-opening.”

Since its debut in November 2022, ChatGPT has become the fastest-growing technology platform ever, reaching 100 million users in less than two months. In response to heated debate about the place a model like ChatGPT should have in education, Wood decided to recruit as many professors as he could to see how the AI stacked up against real university accounting students.

His co-author recruitment pitch on social media took off: 327 co-authors from 186 educational institutions in 14 countries participated in the study, contributing 25,181 classroom accounting exam questions. They also recruited BYU undergraduate students (including Wood’s daughter, Jessica) to feed another 2,268 textbook test-bank questions to ChatGPT. The questions covered accounting information systems (AIS), auditing, financial accounting, managerial accounting and tax, and varied in difficulty and type (true/false, multiple choice, short answer, etc.).
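
The study's grading pipeline isn't described here, and the students fed questions to ChatGPT by hand. But for a sense of the mechanics, here is a minimal, hypothetical harness that poses test-bank-style questions to a chat model over the OpenAI API and tallies accuracy by question type. The model name, sample questions and string-matching grader are all illustrative assumptions, not details from the study.

```python
# Hypothetical harness: pose exam-style questions to a chat model and
# tally accuracy by question type. Assumes the OpenAI Python SDK
# (openai>=1.0) and an OPENAI_API_KEY in the environment.
from collections import defaultdict

from openai import OpenAI

client = OpenAI()

# Illustrative stand-ins for test-bank items; the study's actual
# question banks are not public.
QUESTIONS = [
    {"type": "true/false",
     "prompt": "True or false: Depreciation is a cash expense. "
               "Answer with one word.",
     "answer": "false"},
    {"type": "multiple choice",
     "prompt": "Which financial statement reports revenues and expenses? "
               "(a) balance sheet (b) income statement (c) cash flow "
               "statement. Answer with the letter only.",
     "answer": "b"},
]


def ask(prompt: str) -> str:
    """Send one exam question to the model and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # rough stand-in for the original ChatGPT
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep grading runs as repeatable as possible
    )
    return resp.choices[0].message.content.strip().lower()


def main() -> None:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for q in QUESTIONS:
        reply = ask(q["prompt"])
        total[q["type"]] += 1
        # Crude grading: check that the expected token appears in the
        # reply. Real grading of short-answer items would need a rubric.
        if q["answer"] in reply:
            correct[q["type"]] += 1
    for qtype in total:
        pct = 100 * correct[qtype] / total[qtype]
        print(f"{qtype}: {correct[qtype]}/{total[qtype]} ({pct:.1f}%)")


if __name__ == "__main__":
    main()
```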

While ChatGPT’s performance was impressive, the students outperformed it. Students scored an overall average of 76.7%, compared to ChatGPT’s 47.4%. On 11.3% of the questions, ChatGPT scored higher than the student average, doing especially well in AIS and auditing. But the bot did worse on tax, financial and managerial assessments, possibly because ChatGPT struggles with the mathematical processes those question types require.

In terms of question type, ChatGPT did better on true/false questions (68.7% correct) and multiple-choice questions (59.5%), but struggled with short-answer questions (between 28.7% and 39.1%). In general, higher-order questions were harder for ChatGPT to answer. In fact, ChatGPT sometimes provided an authoritative written explanation for an incorrect answer, or answered the same question in different ways.

“It’s not perfect; you wouldn’t use it for everything,” said Jessica Wood, currently a freshman at BYU. “Trying to learn solely by using ChatGPT is a fool’s errand.”

The researchers also discovered several other interesting trends through this study, including:

  • ChatGPT doesn’t always recognize when it is doing math, and it makes nonsensical errors such as adding two numbers in a subtraction problem or dividing numbers incorrectly.
  • ChatGPT often provides an explanation for its answer even when the answer is wrong. Other times its description is accurate, but it then proceeds to select the wrong multiple-choice answer.
  • ChatGPT sometimes fabricates facts. For example, when asked for a reference, it generates a real-looking citation that is completely made up: the work, and sometimes the authors, don’t even exist.

That said, the authors fully expect GPT-4 to improve exponentially on the accounting questions posed in their study and on the issues noted above. What they found most promising is how the chatbot could help improve teaching and learning, including the ability to design and test assignments, or perhaps to draft portions of a project.

“This is an opportunity to reflect on whether or not we are teaching value-added information,” said study co-author and fellow BYU accounting professor Melissa Larson. “This is a disruption, and we need to assess where we go from here. Of course, I’m still going to have TAs, but this is going to force us to use them in different ways.”

