Selected Publications | Hyo Jin (Gina) Do

My Google Scholar profile contains a full list of publications.

2025

Generate, Evaluate, Iterate: Synthetic Data for Human-in-the-Loop Refinement of LLM Judges

Hyo Jin Do, Zahra Ashktorab, Jasmina Gajcin, Erik Miehling, Martín Santillán Cooper, and 3 more authors

2025

Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist

Martín Santillán Cooper, Zahra Ashktorab, Hyo Jin Do, Erik Miehling, Werner Geyer, and 4 more authors

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Nov 2025


We present a synthetic data generation tool integrated into EvalAssist. EvalAssist is a web-based application designed to assist human-centered evaluation of language model outputs by allowing users to refine LLM-as-a-Judge evaluation criteria. The synthetic data generation tool in EvalAssist is tailored for evaluation contexts and informed by findings from user studies with AI practitioners. Participants identified key pain points in current workflows including circularity risks (where models are judged by criteria derived by themselves), compounded bias (amplification of biases across multiple stages of a pipeline), and poor support for edge cases, and expressed a strong preference for real-world grounding and fine-grained control. In response, our tool supports flexible prompting, RAG-based grounding, persona diversity, and iterative generation workflows. We also incorporate features for quality assurance and edge case discovery.

EvalAssist: Insights on Task-Specific Evaluations and AI-Assisted Judgment Strategy Preferences

Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martín Santillán Cooper, and 5 more authors

In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, Pusan, South Korea, Nov 2025


With the broad availability of large language models and their ability to generate vast outputs using varied prompts and configurations, determining the best output for a given task requires an intensive evaluation process, one where machine learning practitioners must decide how to assess the outputs and then carefully carry out the evaluation. This process is both time-consuming and costly. As practitioners work with an increasing number of models, they must now evaluate outputs to determine which model performs best for a given task. LLMs are increasingly used as evaluators to filter training data, evaluate model performance or assist human evaluators with detailed assessments. Our application, EvalAssist, supports this process by aiding users in interactively refining evaluation criteria. In our study with machine learning practitioners (n=15), each completing 6 tasks yielding 131 evaluations, we explore how task-related factors and judgment strategies influence criteria refinement and user perceptions. Findings show that users performed more evaluations with direct assessment by making criteria task-specific, modifying judgments, and changing the AI evaluator model. We conclude with recommendations for how systems can better support practitioners with AI-assisted evaluations.

Hide or Highlight: Understanding the Impact of Factuality Expression on User Trust

Hyo Jin Do and Werner Geyer

In Proceedings of the Eighth AAAI/ACM Conference on Artificial Intelligence, Ethics and Society (AIES 2025), Nov 2025


Large language models are known to produce outputs that are plausible but factually incorrect. To prevent people from making erroneous decisions by blindly trusting AI, researchers have explored various ways of communicating factuality estimates in AI-generated outputs to end-users. However, little is known about whether revealing content estimated to be factually incorrect influences users’ trust when compared to hiding it altogether. We tested four different ways of disclosing an AI-generated output with factuality assessments: transparent (highlights less factual content), attention (highlights factual content), opaque (removes less factual content), ambiguity (makes less factual content vague), and compared them with a baseline response without factuality information. We conducted a human subjects research (N = 148) using the strategies in question-answering scenarios. We found that the opaque and ambiguity strategies led to higher trust while maintaining perceived answer quality, compared to the other strategies. We discuss the efficacy of hiding presumably less factual content to build end-user trust.

Highlight All the Phrases: Enhancing LLM Transparency through Visual Factuality Indicators

Hyo Jin Do, Rachel Ostrand, Werner Geyer, Keerthiram Murugesan, Dennis Wei, and 1 more author

In Proceedings of the Eighth AAAI/ACM Conference on Artificial Intelligence, Ethics and Society (AIES 2025), Nov 2025


Large language models (LLMs) are susceptible to generating inaccurate or false information, often referred to as "hallucinations" or "confabulations." While several technical advancements have been made to detect hallucinated content by assessing the factuality of the model’s responses, there is still limited research on how to effectively communicate this information to users. To address this gap, we conducted two scenario-based experiments with a total of 208 participants to systematically compare the effects of various design strategies for communicating factuality scores by assessing participants’ ratings of trust, ease in validating response accuracy, and preference. Our findings reveal that participants preferred and trusted a design in which all phrases within a response were color-coded based on factuality scores. Participants also found it easier to validate accuracy of the response in this style compared to a baseline with no style applied. Our study offers practical design guidelines for LLM application developers and designers, aimed at calibrating user trust, aligning with user preferences, and enhancing users’ ability to scrutinize LLM outputs.

Multi-Level Explanations for Generative Language Models

Lucas Monteiro Paes, Dennis Wei, Hyo Jin Do, Hendrik Strobelt, Ronny Luss, and 6 more authors

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2025


Despite the increasing use of large language models (LLMs) for context-grounded tasks like summarization and question-answering, understanding what makes an LLM produce a certain response is challenging. We propose Multi-Level Explanations for Generative Language Models (MExGen), a technique to provide explanations for context-grounded text generation. MExGen assigns scores to parts of the context to quantify their influence on the model’s output. It extends attribution methods like LIME and SHAP to LLMs used in context-grounded tasks where (1) inference cost is high, (2) input text is long, and (3) the output is text. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and question answering. The results show that our framework can provide more faithful explanations of generated output than available alternatives, including LLM self-explanations. We open-source code for MExGen as part of the ICX360 toolkit: https://github.com/IBM/ICX360.

Navigating Generative AI Disclosure, Ownership, and Accountability in Co-Creative Domains

Hyo Jin Do, Molly Q Feldman, Jessica He, Angel Hsing-Chi Hwang, and Seyun Kim

In Adjunct Proceedings of the 4th Annual Symposium on Human-Computer Interaction for Work, Jul 2025


The increasing integration of generative AI into work has amplified issues of disclosure, ownership, and accountability, including whether and how to acknowledge AI use, who owns AI-generated or co-created work, and who is accountable for risks. In response, governments, organizations, and researchers are introducing new policies, guidelines, and methods for enhanced transparency. However, the complex interplay between multiple stakeholders and technologies, coupled with growing AI agency, continues to spark debates about ownership and accountability of co-created work, leading to open questions about whether, when, and how to disclose and attribute human-AI co-created work. To address these emergent issues, this workshop aims to gather interdisciplinary researchers, practitioners, and experts to discuss key questions from law, technology, design, and HCI research standpoints, with the ultimate goal of promoting responsible generative AI use for work.

Understanding Industry Practitioners’ Experiences in Generative AI Governance

Hyo Jin Do, Swati Babbar, Wenjing Li, Laura Walks, and Shayenna Misko

In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Jul 2025


AI governance has become critical, especially as generative AI technology introduces new complexities and uncertainties that require robust risk management. While the need for frameworks and solutions to support AI governance is widely recognized, understanding and addressing the real-world needs of AI practitioners in operationalizing governance remains underexplored. To bridge this gap, we conducted semi-structured interviews using a design probe with AI governance practitioners across various industry sectors. Our findings provide insights into the experiences and pain points of industry practitioners in AI governance, highlighting key challenges in achieving performance goals, assessing societal impact, securing user data, and navigating technical difficulties. We also identified their technical and explainability needs, including practical guidance on addressing violations, as well as more detailed explanations of AI models, data, and evaluation. We discuss design guidelines for AI governance tools that effectively support practitioners’ needs.

Exploring Industry Practices and Perspectives on AI Attribution in Co-Creative Use Cases

Jessica He and Hyo Jin Do

In ACM International Conference on Intelligent User Interfaces, Jul 2025


The increasing adoption of generative AI in human-AI co-creative workflows has led to the development of new policies and design guidelines for disclosing the usage of AI, promoting transparency and accountability in the collaborative process. However, it remains unclear how these policies are being translated into practice in product development. Through semi-structured interviews with 12 industry practitioners, we investigated current approaches and challenges in implementing AI attribution in business products. Our results reveal high variability in AI attribution approaches across products, as they consider factors such as the type of content produced by AI, the presence of human reviewers, stakeholder needs, and regulatory requirements. We also identified technical, user, and product-level challenges of implementing AI attribution in products, including difficulty tracing and discerning the significance of AI contributions, negative impacts on user experience and sense of ownership, and a lack of precedent in product-specific contexts. Our findings offer practical design implications for effective AI attribution strategies in co-creative business use cases.

2024

Grounding with Structure: Exploring Design Variations of Grounded Human-AI Collaboration in a Natural Language Interface

Hyo Jin Do, Michelle Brachman, Casey Dugan, James M. Johnson, Julia Lauer, and 2 more authors

Proc. ACM Hum.-Comput. Interact., Nov 2024


Selecting an effective utterance among countless possibilities that match a user’s intention poses a challenge when using natural language interfaces. To address the challenge, we leveraged the principle of least collaborative effort in communication grounding theory and designed three grounded conversational interactions: 1) a grounding interface allows users to start with a provisional input and then invite a conversational agent to complete their input, 2) a multiple grounding interface presents multiple inputs for the user to select from, and 3) a structured grounding interface guides users to write inputs in a structure best understood by the system. We compared our three grounding interfaces to an ungrounded control interface in a crowdsourced study (N=80) using a natural language system that generates small programs. We found that the grounding interfaces reduced cognitive load and improved task performance. The structured grounding interface further reduced speaker change costs and improved technology acceptance, without sacrificing the perception of control. We discuss the implications of designing grounded conversational interactions in natural language systems.

Evaluating What Others Say: The Effect of Accuracy Assessment in Shaping Mental Models of AI Systems

Hyo Jin Do, Michelle Brachman, Casey Dugan, Qian Pan, Priyanshu Rai, and 2 more authors

Proc. ACM Hum.-Comput. Interact., Nov 2024


Forming accurate mental models that align with the actual behavior of an AI system is critical for successful user experience and interactions. One way to develop mental models is through information shared by other users. However, this social information can be inaccurate and there is a lack of research examining whether inaccurate social information influences the development of accurate mental models. To address this gap, our study investigates the impact of social information accuracy on mental models, as well as whether prompting users to validate the social information can mitigate the impact. We conducted a between-subject experiment with 39 crowdworkers where each participant interacted with our AI system that automates a workflow given a natural language sentence. We compared participants’ mental models between those exposed to social information of how the AI system worked, both correct and incorrect, versus those who formed mental models through their own usage of the system. Specifically, we designed three experimental conditions: 1) validation condition that presented the social information followed by an opportunity to validate its accuracy through testing example utterances, 2) social information condition that presented the social information only, without the validation opportunity, and 3) control condition that allowed users to interact with the system without any social information. Our results revealed that the inclusion of the validation process had a positive impact on the development of accurate mental models, especially around the knowledge distribution aspect of mental models. Furthermore, participants were more willing to share comments with others when they had the chance to validate the social information. The impact of inaccurate social information on altering user mental models was found to be non-significant, while 69.23% of participants incorrectly judged the social information accuracy at least once. We discuss the implications of these findings for designing tools that support the validation of social information and thereby improve human-AI interactions.

Facilitating Human-LLM Collaboration through Factuality Scores and Source Attributions

Hyo Jin Do, Rachel Ostrand, Justin D. Weisz, Casey Dugan, Prasanna Sattigeri, and 3 more authors

Nov 2024


While humans increasingly rely on large language models (LLMs), they are susceptible to generating inaccurate or false information, also known as "hallucinations". Technical advancements have been made in algorithms that detect hallucinated content by assessing the factuality of the model’s responses and attributing sections of those responses to specific source documents. However, there is limited research on how to effectively communicate this information to users in ways that will help them appropriately calibrate their trust toward LLMs. To address this issue, we conducted a scenario-based study (N=104) to systematically compare the impact of various design strategies for communicating factuality and source attribution on participants’ ratings of trust, preferences, and ease in validating response accuracy. Our findings reveal that participants preferred a design in which phrases within a response were color-coded based on the computed factuality scores. Additionally, participants increased their trust ratings when relevant sections of the source material were highlighted or responses were annotated with reference numbers corresponding to those sources, compared to when they received no annotation in the source material. Our study offers practical design guidelines to facilitate human-LLM collaboration and it promotes a new human role to carefully evaluate and take responsibility for their use of LLM outputs.

Multi-Level Explanations for Generative Language Models

Lucas Monteiro Paes, Dennis Wei, Hyo Jin Do, Hendrik Strobelt, Ronny Luss, and 6 more authors

Nov 2024


Perturbation-based explanation methods such as LIME and SHAP are commonly applied to text classification. This work focuses on their extension to generative language models. To address the challenges of text as output and long text inputs, we propose a general framework called MExGen that can be instantiated with different attribution algorithms. To handle text output, we introduce the notion of scalarizers for mapping text to real numbers and investigate multiple possibilities. To handle long inputs, we take a multi-level approach, proceeding from coarser levels of granularity to finer ones, and focus on algorithms with linear scaling in model queries. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and context-grounded question answering. The results show that our framework can provide more locally faithful explanations of generated outputs.

2023

Inform, Explain, or Control: Techniques to Adjust End-User Performance Expectations for a Conversational Agent Facilitating Group Chat Discussions

Hyo Jin Do, Ha-Kyung Kong, Pooja Tetali, Karrie Karahalios, and Brian P. Bailey

Proc. ACM Hum.-Comput. Interact., Oct 2023


A conversational agent (CA) effectively facilitates online group discussions at scale. However, users may have expectations about how well the CA would perform that do not match with the actual performance, compromising technology acceptance. We built a facilitator CA that detects a member who has low contribution during a synchronous group chat discussion and asks the person to participate more. We designed three techniques to set end-user expectations about how accurately the CA identifies an under-contributing member: 1) information: explicitly communicating the accuracy of the detection algorithm, 2) explanation: providing an overview of the algorithm and the data used for the detection, and 3) adjustment: enabling users to gain a feeling of control over the algorithm. We conducted an online experiment with 163 crowdworkers in which each group completed a collaborative decision-making task and experienced one of the techniques. Through surveys and interviews, we found that the explanation technique was the most effective strategy overall as it reduced user embarrassment, increased the perceived intelligence of the CA, and helped users better understand the detection algorithm. In contrast, the information technique reduced members’ contributions and the adjustment technique led to a more negative perceived discussion experience. We also discovered that the interactions with other team members diluted the effects of the techniques on users’ performance expectations and acceptance of the CA. We discuss implications for better designing expectation-setting techniques for AI-team collaboration such as ways to improve collaborative decision outcomes and quality of contributions.

To Err is AI: Imperfect Interventions and Repair in a Conversational Agent Facilitating Group Chat Discussions

Hyo Jin Do, Ha-Kyung Kong, Pooja Tetali, Jaewook Lee, and Brian P. Bailey

Proc. ACM Hum.-Comput. Interact., Apr 2023


Conversational agents (CAs) can analyze online conversations using natural language techniques and effectively facilitate group discussions by sending supervisory messages. However, if a CA makes imperfect interventions, users may stop trusting the CA and discontinue using it. In this study, we demonstrate how inaccurate interventions of a CA and a conversational repair strategy can influence user acceptance of the CA, members’ participation in the discussion, perceived discussion experience between the members, and group performance. We built a CA that encourages the participation of members with low contributions in an online chat discussion in which a small group (3-6 members) performs a decision-making task. Two types of errors can occur when detecting under-contributing members: 1) false-positive (FP) errors happen when the CA falsely identifies a member as under-contributing and 2) false-negative (FN) errors occur when the CA misses detecting an under-contributing member. We designed a conversational repair strategy that gives users a chance to contest the detection results and the agent sends a correctional message if an error is detected. Through an online study with 175 participants, we found that participants who received FN error messages reported higher acceptance of the CA and better discussion experience, but participated less compared to those who received FP error messages. The conversational repair strategy moderated the effect of errors such as improving the perceived discussion experience of participants who received FP error messages. Based on our findings, we offer design implications for which model should be selected by practitioners between high precision (i.e., fewer FP errors) and high recall (i.e., fewer FN errors) models depending on the desired effects. When frequent FP errors are expected, we suggest using the conversational repair strategy to improve the perceived discussion experience.

Follow the Successful Herd: Towards Explanations for Improved Use and Mental Models of Natural Language Systems

Michelle Brachman, Qian Pan, Hyo Jin Do, Casey Dugan, Arunima Chaudhary, and 9 more authors

In Proceedings of the 28th International Conference on Intelligent User Interfaces, Sydney, NSW, Australia, Apr 2023


While natural language systems continue improving, they are still imperfect. If a user has a better understanding of how a system works, they may be able to better accomplish their goals even in imperfect systems. We explored whether explanations can support effective authoring of natural language utterances and how those explanations impact users’ mental models in the context of a natural language system that generates small programs. Through an online study (n=252), we compared two main types of explanations: 1) system-focused, which provide information about how the system processes utterances and matches terms to a knowledge base, and 2) social, which provide information about how other users have successfully interacted with the system. Our results indicate that providing social suggestions of terms to add to an utterance helped users to repair and generate correct flows more than system-focused explanations or social recommendations of words to modify. We also found that participants commonly understood some mechanisms of the natural language system, such as the matching of terms to a knowledge base, but they often lacked other critical knowledge, such as how the system handled structuring and ordering. Based on these findings, we make design recommendations for supporting interactions with and understanding of natural language systems.

2022

How Should the Agent Communicate to the Group? Communication Strategies of a Conversational Agent in Group Chat Discussions

Hyo Jin Do, Ha-Kyung Kong, Jaewook Lee, and Brian P. Bailey

Proc. ACM Hum.-Comput. Interact., Nov 2022


In online group discussions, balanced participation can improve the quality of discussion, members’ satisfaction, and positive group dynamics. One approach to achieve balanced participation is to deploy a conversational agent (CA) that encourages participation of under-contributing members, and it is important to design communication strategies of the CA in a way that is supportive to the group. We implemented five communication strategies that a CA can use during a decision-making task in a small group synchronous chat discussion. The five strategies include messages sent to two types of recipients (@username vs. @everyone) crossed by two separate channels (public vs. private), and a peer-mediated strategy where the CA asks a peer to address the under-contributing member. Through an online study with 42 groups, we measured the balance of participation and perceptions about the CA by analyzing chat logs and survey responses. We found that the CA sending messages specifying an individual through a private channel is the most effective and preferred way to increase participation of under-contributing members. Participants also expressed that the peer-mediated strategy is a less intrusive and less embarrassing way of receiving the CA’s messages compared to the conventional approach where the CA directly sends a message to the under-contributing member. Based on our findings, we discuss trade-offs of various communication strategies and explain design considerations for building an effective CA that adapts to different group dynamics and situations.

2021

Do You Have Time for a Quick Chat? Designing a Conversational Interface for Sexual Harassment Prevention Training

Hyo Jin Do, Seon Hye Yang, Boo-Gyoung Choi, Wayne T. Fu, and Brian P. Bailey

In Proceedings of the 26th International Conference on Intelligent User Interfaces, College Station, TX, USA, Nov 2021


Sexual harassment (SH) incidents are increasing and call into question the effectiveness of traditional SH prevention training. In this paper, we introduce a proof-of-concept design of a conversational interface (CI) for understanding SH cases. Key features of the interface include that it engages the learner in a dyadic conversation, prompts the learner for guidance, and tells a story of SH from a first-person perspective. From a mixed-methods study (N=32), learners experiencing a SH vignette using the conversational interface reported feeling less overwhelmed with the content, more engaged with the situation, and more comfortable discussing the topic compared to reading the same vignette online. Participants also reported that using a first-person narrative made the vignette feel realistic and relatable. However, there was no difference in empathy between the conditions. We discuss these results and implications for designing effective SH prevention training.

2018

Computational methods for socio-computer interaction

Wai Tat Fu, Mingkun Gao, and Hyo Jin Do

In Computational Interaction, Nov 2018

Burst Your Bubble! An Intelligent System for Improving Awareness of Diverse Social Opinions

Mingkun Gao, Hyo Jin Do, and Wai-Tat Fu

In Proceedings of the 23rd International Conference on Intelligent User Interfaces, Tokyo, Japan, Nov 2018


Social media users are overloaded with diverse opinions by people with opposing stances. Previous research shows that people often look for opinions that reinforce their pre-existing beliefs and stances, which may lead to social polarization. Traditional social media present opinions in a linear list format, which not only lacks structures for people to explore diverse viewpoints but also aggravates their selective exposure to agreeable opinions. To address this problem, we designed an intelligent system that improves awareness of diverse social opinions by providing visual hints and recommendations of opinions (e.g. news articles and comments) on different sides with different indicators. We evaluated our system with news articles about Obamacare repeal issue and their corresponding user comments from Facebook. Results demonstrate that our system could increase people’s awareness of their stances and opinion selection preferences, which mitigates selective exposure and thereby leads to a more balanced perception of social opinions.

2017

Intelligent Interface for Seeing the World Through Different Lenses

Hyo Jin Do

In Companion Proceedings of the 22nd International Conference on Intelligent User Interfaces, Limassol, Cyprus, Nov 2017


Despite the explosive growth of online social media where people can easily share their thoughts, our current society is more divided than before, gridlocked over society, culture, race, and gender issues. Selective exposure, a confirmatory bias of individuals that favors preexisting opinions while avoiding attitude-inconsistent views, impedes the balanced insight of a controversial issue, which thereby would account for the societal division. In this research proposal, we introduce an intelligent interface that automatically clusters and visualizes diverse opinions about a controversial topic. First, we collect controversial posts from Facebook and its comments. Then, the comments are automatically clustered using a machine-learning algorithm based on features that reflect its contents and the writer’s stance. Lastly, we propose an intelligent user interface with controversial posts and opinion clusters where users would be motivated to hunt for opinion groups that are different from their own perspective.

An Intelligent Interface for Organizing Online Opinions on Controversial Topics

Mingkun Gao, Hyo Jin Do, and Wai-Tat Fu

In Proceedings of the 22nd International Conference on Intelligent User Interfaces, Limassol, Cyprus, Nov 2017


An enormous amount of posts and comments are shared in online social forums, which often organize these online social opinions based on semantic contents. However, for controversial topics, people with different attitudes and stances often have very distinct perspectives, reactions, and emotions to the same post. Organization by semantic contents often encourages selective exposure to information, which may exacerbate opinion polarization. To address this problem, we design a novel interface that allows people to better understand and appreciate people with different stances in social forums. Our interface was developed to allow interactive visualization and categorization of original posts about a controversial topic with crowd workers’ reactions and emotions from different stances. We evaluated the interface using Reddit posts about US presidential candidates. Results demonstrate that the interface can mitigate selective exposure and help users to adopt a broader spectrum of opinions than the traditional Reddit interface.

2016

Analyzing emotions in twitter during a crisis: A case study of the 2015 Middle East Respiratory Syndrome outbreak in Korea

Hyo Jin Do, Chae-Gyun Lim, You Jin Kim, and Ho-Jin Choi

In 2016 International Conference on Big Data and Smart Computing (BigComp), Nov 2016


2015

Korean twitter emotion classification using automatically built emotion lexicons and fine-grained features

Hyo Jin Do and Ho-Jin Choi

In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters, Nov 2015
