The Road to Enhanced Theory of Mind (ToM) Performance in Large Language Models (LLMs)

Abstract: Despite the impressive advancements of Large Language Models (LLMs) as of 2023, they continue to struggle with complex reasoning tasks, particularly Theory of Mind (ToM) tasks, which require understanding agents’ beliefs, goals, and mental states. This study investigates the ToM performance of GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo) to evaluate the efficacy of in-context learning in enhancing their ToM comprehension. Using two-shot chain-of-thought reasoning and step-by-step thinking instructions, it was found that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) can improve their ToM accuracy via in-context learning. GPT-4 exhibited superior performance in zero-shot settings, achieving nearly 80% ToM accuracy, but still fell short of the 87% human accuracy on the test set. When provided with prompts for in-context learning, however, all RLHF-trained LLMs exceeded 80% ToM accuracy, with GPT-4 reaching 100%. These findings underscore the potential of appropriate prompting to enhance LLM ToM reasoning and highlight the context-dependent nature of LLM cognitive capacities.

Introduction: Large language models (LLMs) have demonstrated remarkable success across a variety of tasks, yet they still face significant challenges on tasks requiring complex reasoning. In particular, “theory of mind” (ToM) reasoning, which involves understanding the mental states of agents, including their goals and beliefs, remains a complex cognitive capacity that LLMs have yet to master. Despite LLMs’ considerable advances in responding accurately to a broad range of everyday questions, their performance on ToM tasks has been comparatively poor.

ToM is a crucial component of social understanding, enabling intricate social exchanges and the anticipation of others’ actions and responses. It is considered a complex cognitive capacity, most highly developed in humans and a select few other species. Because ToM tasks demand inferential reasoning, assessing and improving LLMs’ proficiency on them could provide valuable insight into their potential for the broader range of tasks that require such inference.
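To make the task concrete, consider the classic unexpected-transfer (“Sally-Anne”) vignette, used here purely as an illustration rather than as an item from the study’s test set: Sally puts her ball in a basket and leaves the room; while she is away, Anne moves the ball to a box. Asked where Sally will look for her ball, a reasoner with ToM answers “the basket”, because Sally’s belief was formed before the transfer and she has no way of knowing the ball was moved.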

There is evidence that in-context learning can enhance the reasoning capacity of LLMs. For sufficiently large language models, a few task demonstrations at inference time can boost performance, a process often referred to as “few-shot learning”. LLMs’ capacity for complex reasoning improves further when the few-shot examples in the prompt spell out the reasoning steps leading to the conclusion (“chain-of-thought reasoning”). Even in the absence of exemplar demonstrations, simply instructing a language model to think “step-by-step” enhances its reasoning performance.
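As an illustration, the sketch below assembles a two-shot chain-of-thought prompt with a step-by-step cue and sends it to a model, assuming the 2023-era (pre-1.0) `openai` Python client; the vignettes, exemplar wording, and model settings are illustrative placeholders, not the study’s actual materials.

```python
# Minimal sketch of the prompting conditions described above, assuming the
# pre-1.0 `openai` Python client (the 2023-era ChatCompletion interface).
import openai

# Two worked exemplars whose answers spell out the reasoning steps
# ("chain-of-thought"); both vignettes are illustrative placeholders.
FEW_SHOT_EXEMPLARS = """\
Story: Sally puts her ball in the basket and leaves the room. While she is
away, Anne moves the ball to the box.
Question: Where will Sally look for her ball?
Reasoning: Sally last saw the ball in the basket. She did not see Anne move
it, so her belief is unchanged.
Answer: the basket

Story: Tom's keys are moved from the drawer to the shelf while he is out.
Question: Where does Tom think his keys are?
Reasoning: Tom was absent when the keys were moved, so he still believes
they are where he left them.
Answer: the drawer
"""

def build_prompt(story: str, question: str) -> str:
    """Assemble a two-shot chain-of-thought prompt plus a step-by-step cue."""
    return (
        FEW_SHOT_EXEMPLARS
        + f"\nStory: {story}\nQuestion: {question}\n"
        + "Let's think step by step.\n"  # the zero-shot "step-by-step" cue
    )

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": build_prompt(
        story="Maria leaves her sandwich on the table; her brother moves it "
              "to the fridge while she is gone.",
        question="Where will Maria look for her sandwich?",
    )}],
    temperature=0,  # greedy-ish decoding, typical for evaluation runs
)
print(response["choices"][0]["message"]["content"])
```

Dropping `FEW_SHOT_EXEMPLARS` and the final cue from `build_prompt` recovers the zero-shot condition, so the same harness can compare all of the prompting regimes discussed here.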

Despite some studies supporting LLMs’ capability to perform ToM reasoning, others have questioned this ability. Quantitative evaluations of ToM performance in previous studies have mainly focused on single-word or multiple-choice completion, and most criticisms of LLMs’ ToM abilities have relied on zero-shot testing or on examples lacking step-by-step reasoning towards an answer. This study examines whether recent LLMs exhibit improved ToM performance when provided with suitable prompts, focusing specifically on ToM comprehension questions and exploring the potential of step-by-step thinking, few-shot learning, and chain-of-thought reasoning; a simple way to score such comparisons is sketched below.
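The snippet below shows one plausible (and deliberately lenient) scoring scheme for comprehension-question accuracy across prompting conditions; `ask_model`, the item format, and the string-match criterion are assumptions for illustration, not the study’s actual protocol.

```python
from typing import Callable

def accuracy(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """Fraction of items whose gold answer appears in the model's reply.

    Each item is a dict with a "prompt" and a gold "answer" string
    (hypothetical format). `ask_model` stands in for any model call,
    e.g. the prompting sketch above.
    """
    correct = 0
    for item in items:
        reply = ask_model(item["prompt"]).lower()
        if item["answer"].lower() in reply:  # lenient substring matching
            correct += 1
    return correct / len(items)
```

Substring matching is forgiving of free-form chain-of-thought answers, but it can over-credit replies that mention both locations, so a stricter final-answer parse may be preferable in practice.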

The results reveal that LLMs can leverage chain-of-thought reasoning and step-by-step thinking to significantly enhance their ToM performance. In zero-shot ToM settings, only GPT-4 approached 80% accuracy. With appropriate prompting, however, all RLHF-trained models exceeded 80% accuracy, and GPT-4 reached 100%, surpassing the 87% human accuracy on the test set. This points to the effectiveness of appropriate prompting in boosting the ToM reasoning performance of these context-sensitive models.

Reasoning on the Everyday: The power of Large Language Models (LLMs) lies in their ability to handle an extensive range of everyday questions. Their capabilities become limited, however, on tasks that demand a higher level of reasoning, where their performance has been comparatively poor. ToM reasoning, the ability to understand the mental states, goals, and beliefs of agents, is one such complex cognitive task that has proven challenging for LLMs.

The Role of In-Context Learning: In-context learning, involving few-shot task demonstrations and step-by-step reasoning towards an answer, has emerged as a viable approach for enhancing the reasoning capacities of LLMs. While a theoretical understanding of why these prompting techniques help is still lacking, several recent studies have explored how compositional structure and local dependencies in training data affect their efficacy. These techniques show promise for improving the performance of LLMs on complex reasoning tasks like ToM.

The Utility of ToM: The ability to perform ToM reasoning is essential for any model dealing with social information and human interactions. It enables the model to reason about the mental states and beliefs of agents, which is invaluable in predicting their actions or responses. Additionally, ToM tasks often involve inferential reasoning, where unobservable information must be inferred from context rather than parsed from the surface text. Thus, enhancing the proficiency of these models in ToM tasks could shed light on their potential for a wider range of tasks that require inferential reasoning.

Potential of Prompting: Our research highlights the potential of appropriate prompting to significantly enhance the ToM performance of LLMs. Notably, all RLHF-trained models exceeded 80% ToM accuracy when provided with suitable prompts, with GPT-4 achieving a remarkable 100%. This suggests that the output LLMs generate is highly context-sensitive and that effective prompting techniques can guide them towards higher-quality ToM responses. This not only contributes to the reliability of their reasoning in a wide range of everyday applications but also opens the door to applying such techniques to other complex reasoning tasks.

In conclusion, this study underscores the importance of appropriate prompting and in-context learning for enhancing the ToM reasoning performance of Large Language Models (LLMs). While these models have shown great success in various tasks, they still face significant challenges in complex reasoning tasks, particularly those involving theory of mind. However, our results demonstrate that with suitable prompting, LLMs can significantly improve their ToM performance, thus expanding their potential for a wider range of applications.

1 thought on “The Road to Enhanced Theory of Mind (ToM) Performance in Large Language Models (LLMs)”

  1. John C. says:

    I find this study on the efficacy of in-context learning in enhancing LLMs’ ToM comprehension fascinating. The findings suggest that, while LLMs have come a long way in their ability to respond accurately to a broad range of tasks, they still struggle with complex reasoning tasks that require understanding agents’ beliefs, goals, and mental states. This research highlights the potential of using step-by-step thinking instructions and chain-of-thought reasoning to improve LLMs’ ToM reasoning performance via in-context learning.

    It’s interesting to note that GPT-4 exhibited superior performance in zero-shot settings, but still fell short compared to human accuracy on the test set. However, when provided with prompts for in-context learning, all RLHF-trained LLMs exceeded 80% ToM accuracy, with GPT-4 reaching 100%. These findings underscore the context-dependent nature of LLM cognitive capacities and the potential of appropriate prompting to enhance their ToM reasoning.

    One question that comes to mind is whether the in-context learning approach used in this study could be extended to other areas of LLM performance, such as common-sense reasoning or emotional intelligence. Additionally, it would be interesting to see how these findings translate to other LLMs beyond GPT-4 and the three GPT-3.5 variants tested in this study.

    Overall, this study provides valuable insights into the potential of in-context learning to enhance LLMs’ ToM comprehension. It sheds light on the complex reasoning tasks that LLMs still struggle with and offers a promising approach for improving their performance in this area.
