Combatting algorithmic bias: a journey in user-centred research, evaluation, and development
Overview
In this project, we attempted to harness everyday users’ collective power to audit algorithmic bias in AI/ML systems. Specifically, we asked how we could leverage everyday users to identify and generalise cases of bias and unfairness in ML systems, and then synthesise those findings into a form that is readily actionable by ML teams. In doing so, we also examined the barriers to achieving this goal.
This project was the perfect opportunity for us to practise and improve our ability to:
- capture and understand tasks and goals
- generate ideas to support said tasks and goals
- measure success of those ideas and their realisation
Introduction to Design Processes
Of course, a prerequisite to analysis and conception is an organised and effective manner of inquiry and development. Below, I discuss several established design process models in depth and examine their benefits and tradeoffs:
All of these systems share the same underlying purpose — to facilitate creative, relevant ideation and its subsequent execution. So, why are there so many, if they all do the same thing?
As we discussed in lecture, one possible reason is branding. Different organisations develop their own model to foster their brand identity and be unique. Another reason is that these models all have their shortcomings, which can affect different organisations differently. For example, the analysis-synthesis bridge model is very linear, much like previous ideation frameworks that are more rigid. Similarly (or perhaps differently, depending on your perspective), while IDEO’s process model allows for some exploration further on, it still feels quite linear, much like NN/g’s lifecycle. The loop and double diamond, however, are more tolerant of revisiting problems and reflecting. Another idiosyncrasy about the loop is that it mentions momentum — a team’s ability to keep moving; this is not a focus for the other models.
UCRE’s model seems motivated by the same reasons: to provide a more context-specific, more versatile approach to UX research and innovation. Indeed, its examination of exploration, certainty, and effort allows for a robust shift in techniques while maintaining as much breadth and extrapolation power as possible. While less procedural, it is also more versatile.
Background Research
I also conducted background research in order to familiarise myself with relevant terminology, gain a foothold in this problem space, learn to ask more valuable questions, better challenge assumptions, move beyond the obvious, and, last but not least, understand the context within which we explore.
To understand people effectively is to understand contexts — people act differently in different situations, which is the foundation of social psychology.
Context is important to product and service design because our experiences are multidimensional, encompassing both rational decisions and emotional responses, which together shape our identity, feelings, and sensations. More specifically, context affects the usability, compatibility, and effectiveness of a product, and it helps us understand what we can and can’t control.
We also conduct background research to understand context because such research sheds light on the true nature of the environment: the industry, the technology, and current practices. This gives us as designers the colour needed to make accurate decisions, and it leaves us better able to predict and prepare for change.
Relevant sources to consider include specialty media, white papers, industry experts, and research papers.
My approach to this was fairly intuitive. I scrolled through social media (specifically, YouTube and Facebook) and asked myself, “Did I notice anything interesting? Did I notice anything I don’t think other people would see? Did I notice anything suggestive of an algorithm’s judgement?”
In addition, I also read some papers — here, I paid attention to whether I saw anything relevant or anything interesting cited by/citing the paper I was currently perusing (a snowball approach, if you will).
Data Exploration
To better understand the problem space and inform our subsequent work, I also turned to some preliminary data exploration. Ideally, analytics help us identify potential problems, identify the potential causes of those problems, and support qualitative research.
More specifically, I noted the below:
1. Identifying problems entails not only observing what is going on, but also relating it to what should be going on. With very detailed, concrete, measurable factors, UX teams can diagnose specific issues or inform further inquiries, especially if there is a discrepancy between the status quo and the normative ideal.
2. With the information gleaned from issue identification, teams are able to formulate hypotheses and questions about blockers. This is where analytics and experiments can help establish the validity or relevance of each of those questions: for example, by analysing web traffic or event tracking, one can see whether different visual layouts are affecting customers’ flow (a minimal sketch of this kind of analysis follows this list).
3. Using all of the data accumulated so far, teams have some sense of what is going on. The remaining question is what the ideal solution looks like, and this is where triangulation comes in. Not only are we curious about how people are interacting with the platform (e.g. when they abandon a form), but we also want to know why, e.g. because they have privacy concerns or they didn’t notice a link. These answers help define the solution space.
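To make point 2 concrete, here is a minimal sketch of the kind of event-tracking analysis described above. The event log, column names, and layout labels are all hypothetical and invented purely for illustration; the idea is simply to compare how far users get through a form under two different visual layouts.

```python
import pandas as pd

# Hypothetical event log: one row per tracked event, with the layout variant
# each user saw. In practice this would come from an analytics export.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5],
    "layout":  ["A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "event":   ["form_start", "form_abandon",
                "form_start", "form_field_2", "form_abandon",
                "form_start", "form_submit",
                "form_start", "form_submit",
                "form_start", "form_submit"],
})

# For each layout, what fraction of users who started the form went on to submit it?
started = events[events["event"] == "form_start"].groupby("layout")["user_id"].nunique()
submitted = events[events["event"] == "form_submit"].groupby("layout")["user_id"].nunique()
completion_rate = (submitted / started).fillna(0)
print(completion_rate)  # a large gap between layouts flags a flow problem worth probing qualitatively
```

A drop-off like this does not tell us why users abandon the form; it only tells us where to point our qualitative follow-up, which is exactly the triangulation described in point 3.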
We looked at a Twitter dataset containing more than 30,000 tweets about three recent cases of everyday algorithm auditing that everyday users surfaced and discussed on Twitter: (1) ImageNet Roulette, (2) Twitter’s Image Cropping Algorithm, and (3) Portrait AI.
We began by separating the data into the three cases and conducting analyses within each one. We first compiled some descriptive statistics to summarise each subset, and then coded tweet type (humour/funny, research/scientific, call-out/community, misclassified/irrelevant) to discuss the idiosyncrasies of each case.
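As an illustration of this workflow, below is a minimal sketch under assumed names: a hypothetical tweets.csv with one row per tweet and a hand-coded tweet_type column. It is not our actual analysis pipeline, only the shape of the per-case descriptive and categorical summaries we produced.

```python
import pandas as pd

# Assumed columns: case (ImageNet Roulette / image cropping / Portrait AI),
# text, likes, retweets, and a hand-coded tweet_type label.
tweets = pd.read_csv("tweets.csv")

for case, group in tweets.groupby("case"):
    print(f"--- {case}: {len(group)} tweets ---")
    # Simple descriptive statistics summarising engagement within the case
    print(group[["likes", "retweets"]].describe())
    # Share of each coded tweet type (humour, research, call-out, misclassified)
    print(group["tweet_type"].value_counts(normalize=True).round(2))
```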
From these analyses, we realised that people tend to respond differently to different types of algorithmic bias in our systems. In addition, we saw a surprising number of misclassified tweets.
After discussing the findings, we hypothesised that language and culture correlate directly with access to technology, which in turn affects how users interact with these AI systems. In addition, we posited that the general public tends to contribute and interact more with content covering discrimination, including discussions of race and its representation in AI technologies.
Usability Tests
Ah, but alas — we still have no significant understanding of users’ interactions with these systems! To make all of our exploration and background research useful, we need to familiarise ourselves with the context in which we engage with these systems.
I reviewed cognitive walkthroughs and thinking aloud as usability testing techniques:
The cognitive walkthrough method (hereinafter CW) and the thinking aloud method (hereinafter TA) are both very useful for understanding the usability and effectiveness of a given interface. They both provide simple usability feedback and focus on the user’s perspective, while being fairly cost effective.
However, they have some differences in their focus. While CW focuses on industry experts and experienced practitioners, TA focuses on users. Specifically, while CW brings in experts to go through a scenario and then answer questions, TA looks at user experiences and thought processes as they complete tasks. So, while CW teaches researchers about roadblocks or confusion points, TA illuminates users’ mental models, goals, and motivations.
Used together and correctly, these methods should help clarify the needs and shortcomings of an interface.
Therefore, we used think-aloud interviews to examine user interactions with transparency in targeted online ads, specifically on YouTube.
These interviews had users first report an ad that they encountered, and then find out why YouTube was showing them that specific ad. After that, we had users update their preferences to prevent seeing gambling ads.
We learned that the ad centre is difficult to both navigate and use, and that the ads served are often based on inaccurate data.
It was also interesting to note that skipping advertisements was almost second nature to our users. Also, many users had ad-blockers that needed to be disabled for this study.
Furthermore, users did not fully understand what reporting entailed; we came to regard “report” as industry-specific terminology.
Also, data used for ad preferences was often inaccurate. For example, accounts were made with false birthdays. These often led to ads and preferences not reflecting users’ actual habits and interests.
Reframing and Defining
With all of this information, we wanted to narrow our scope and project direction.
We started with a reframing activity, “walking the wall”, which helped us review and examine all of the work and insights we had generated thus far.
We noted that many of our questions revolved around questioning our prior assumptions, such as wondering whether users even cared about algorithmic bias.
Therefore, we thought it helpful to reverse our assumptions. For each of our enumerated assumptions, we brainstormed their opposites and how to address these:
Now, we elaborated on our intended project definition:
How might we improve AI literacy, particularly regarding the transparency of advertising and access to information?
Ideally, this would narrow the gap between the back-end processes that shape online platforms and a user’s understanding of what informs their experience. In addition, understanding this space would help us foster a more interconnected, synergistic society, one better able to collectively drive productive and ethical action across racial and socioeconomic strata.
We anticipated that these findings could impact anyone with internet access who uses social media platforms, the companies running those platforms, the advertisers who use them, and the designers and engineers who build the algorithms that determine how users are targeted and advertised to.
Speed Dating
Of course, we cannot solely rely on the work of others (e.g. existing literature). Rather, we must exercise some research methods to generate novel insights.
Using existing literature allows us to ground our work in humanity’s collective domain knowledge, as the reading asserts. However, the existing literature can be difficult to navigate, especially because the body of prior research is so extensive.
In addition to the (very brief) points raised in the reading, I note that existing literature is helpful because it can provide a starting point and a foundation for further research — it helps us grasp the nuances of a field and understand the specific gaps in the literature such that we can generate novel knowledge instead of being redundant.
That being said, the quality of research is plainly not uniform across projects, nor are the scopes, methods, or fields of previous works necessarily applicable to ours. In addition, the biases of other researchers and publications can influence how different works are represented and how valid they are.
With respect to research methods, we choose a research purpose first. This guides whether we pursue primary or secondary research, which in turn influences whether we look for qualitative or quantitative data. How we engage with people further narrows the choice of suitable methods, and we can run multiple studies with different methods so that they complement one another and yield different kinds of outcomes.
We considered how to most effectively spark a useful discussion with users to surface their authentic needs. We settled on speed dating, a specific instantiation of exploring speculative futures. I wrote a description of our method and our mission:
Speculative futures provoke, imagine, and dream into what lies ahead, and are inspired not only by current trends but also by art, film, and fiction. We design these futures to spark critical reflection, discussion, and reconsideration of an ideal, ethical, and fair society. Through the ensuing debates, we hope to collectively define a preferable future.
We presented our users with 10 speed dating scenarios, all with relevance to YouTube’s existing systems of algorithm auditing and also newly proposed systems of ensuring user-specific content. We encouraged users to explain their reactions and draw connections from their own experiences.
These interviews lent themselves well to synthesis. By clustering our notes, a few recurring motifs emerged. With these, we were able to visualise typical users and understand their needs and habits.
Through this exercise, we learned that:
- users have no faith in other users
- users have no faith in the integrity of the platform
- some users are more altruistic and would contribute to community-serving features
- fairness comes second to efficiency for some users
- increasing perceived control can improve community engagement
Surveys: Validating Qualitative Findings
Interviews, as we have seen, are very helpful for exploration. They are interactive and responsive, which gives us the ability to probe and follow up on interesting points, and they let us build rapport and cooperation with participants.
Unfortunately, this comes at a cost: interviews are very expensive and subjective. Conducting interviews takes money and time, especially because of the lengthy synthesis process. To compound matters, interview findings are also subject to the interviewers’ biases and interpretations.
So, it behooves us to follow up on hypotheses with surveys! These are both time- and cost-effective. With surveys, we can generate substantial and significant troves of informative data. Since the questions are standardised, it is easy to aggregate results and generalise to a target population.
Of course, there are caveats as well: surveys do not help us discover which questions are worth asking in the first place, and it is difficult to clarify questions or follow up on interesting responses.
In our case, surveys can facilitate investigating users’ perceptions and attitudes towards algorithmic systems and identifying potential biases. Surveys can help answer questions related to, say, users’ trust in algorithmic systems, their awareness of biases, and their experiences with biased or unfair systems. However, surveys’ shortcomings lie in contextualising and otherwise exploring these areas of interest, such as uncovering the underlying motivations behind user perceptions and behaviours, as well as providing detailed information about specific instances of bias or unfairness. Therefore, surveys may be best used in conjunction with other research methods, such as A/B testing, observational studies, or interviews, to provide a more comprehensive understanding of user experiences with algorithmic systems.
Some potential questions that surveys would be well-positioned to answer would be along the lines of “How often do [users] interact with [the algorithmic system], and what tasks do [they] typically perform?” Of course, the realised version of this question in an actual survey would be a lot more specific, less open to interpretation, etc.
As mentioned, surveys may fall short in answering questions that require more in-depth information or contextual understanding, such as “How did the algorithmic system’s recommendations or decisions impact your decision-making process?”
After considering researchers (who focus their work on algorithmic bias and lessening its impacts), young adults (who care about the quality of their content recommendations), everyday users (who believe that crowdsourcing solutions can be beneficial), and young adults who use YouTube (and are also aware of the intersection between algorithms and social issues), we decided to work with the last population.
As many of our goals for this survey focus on validating whether or not individuals will take action on algorithmic bias to support the greater good, this demographic of everyday users will provide us with the most insightful findings. They are already aware of the issues that arise from powerful algorithms and, based on their age demographic, probably have grown up with YouTube. Understanding if and how they face bias will allow us to learn how we can turn their awareness into action.
Rescoping
At this point, we had also realised that, given our previous insights, our initial question could not feasibly be addressed as stated.
We revise our project goal:
We want to motivate users to be more intentional and critical of their interactions with algorithmic systems so that any instances of bias or other harmful behaviours are more identifiable and more likely to be acted upon.
Accordingly, we want our survey to help us validate the following:
- We want to validate our assumption that some people already care about the quality of content recommendation algorithms.
- We want to learn about why they have faith in community-driven efforts.
- We want to learn about why they would be willing to sacrifice their short-term utility for public benefit.
- We want to validate our assumption that some people do not care about trivial improvements to recommendation quality if it sacrifices their viewing satisfaction and the smoothness of their experience.
- We want to validate that encouraging community-driven initiatives to improve algorithmic-centred experiences would encourage people to be more cognisant of interactions with algorithmic systems as a whole.
From this survey, we learned that users are cynical about algorithms, and that few participants had an opinion about community engagement initiatives, although respondents were more likely to take action when the focus was non-commercial. As in our qualitative work, we confirmed that a large proportion of people would choose to prioritise their personal YouTube experience; consistent with this, more than half of the participants had not taken any action to improve the YouTube algorithm.
Our efforts with the survey largely looked towards validating our assumptions that people care about the quality of content recommendation algorithms, and exploring the faith (or lack thereof) in the potential benefit derived from introducing community-driven initiatives to improve the algorithm. Due to limitations of the survey as a research tool, we were unable to make progress with our first three goals, since they focus on revealing the underlying motivations of the respondents. We hope to explore these in future work through more suitable contextual research methods.
However, we were able to make substantial progress on goals 4 and 5, which were to validate our assumptions regarding users’ personal satisfaction and their attitudes towards community-driven initiatives. Our results showed that while the majority of users have not taken, and are not willing to take, specific actions against a video or ad, users still tend to support features that are more obviously focused on benefiting the larger YouTube community (such as content tags). Additionally, when respondents who had previously said they would be willing to take some sort of action against bias were confronted with the idea that their actions could affect their own experience on the platform, they became much less sure of their stance and willingness to act.
Speed Dating v2
As discussed above, speed dating allows us to validate user needs. To come up with potential futures to probe, we ran a “Crazy 8s” activity to identify the greatest areas of uncertainty and risk. After we each generated eight ideas, along with the user need motivating each, we voted and picked the top five needs to explore through storyboards.
I designed storyboards motivated by the users’ need for an easy and obvious way to input their opinions.
We conducted the interviews with all 15 storyboards, and once again conducted analyses by affinity diagramming.
We learned that users are motivated to contribute because their actions could positively impact a community. That being said, users do not want to be sidetracked when they want entertainment. Users are also unaware of their lack of knowledge regarding algorithmic bias.
In addition, many users felt that a crowdsourced platform will not meet their needs because it will inevitably be abused by other users. Relatedly, a tagging system feels like an unnecessary added level of “security”.
We also realised that we had some misunderstandings. For example, we assumed incorrectly that our participants might be more optimistic about community-led initiatives on YouTube. We also assumed incorrectly that our participants were aware of algorithmic issues on YouTube. Last but not least, we assumed incorrectly that our participants would be more open to the idea of learning about algorithmic bias on the platform itself.
Moving forward, we realised that we must not interfere with or change users’ experiences on YouTube. Users also want more transparency from YouTube, so that they know, for example, what happens after a video is reported or after they click “not interested”. All in all, users are generally distrustful of platforms and of situations where other users can abuse a system.
So, if we were to design an intervention on YouTube, it would have to add as little friction as possible. And, we should focus on allowing greater transparency between the user and YouTube. Last but not least, when designing a community-led initiative on YouTube, we should allow users to visualise their contribution to the entire community.
Prototype Parameters
Through this project, we have determined the following to be our riskiest assumptions:
- Users are not informed about algorithmic bias.
- Users are not incentivised to learn about algorithmic bias.
- Users will be more incentivised to take action on algorithmic bias if they can see the individual impact of their actions.
- Community-driven initiatives could encourage people to be more conscious of interactions with the system.
- Community-driven initiatives are the most effective way of generating concern regarding algorithmic bias.
For whatever prototype we design to test and address these assumptions, we know that a successful prototype will help users identify and address suspected cases of algorithmic bias. Conversely, failure would mean users caring no more about algorithmic bias, and caring no more to learn about it, than they do now.
In vivo, our success will be measured by the number of installations and the number of interactions. In the meantime, our tests bracket the exercise with pre- and post-surveys containing identical questions so that we can measure changes attributable to the exercise. Through these surveys, we hope to see individuals articulate a richer personal definition of algorithmic bias and make more considered selections of biased scenarios. We also hope to see more individuals choose to “report” or “block” (i.e. take action on) suspected biased content rather than selecting “skip”.
We conceptualized our prototype as a YouTube extension that promotes transparency around audience engagement. It not only shows the number of dislikes a video has received (a feature the platform had removed), but also shows the number of reports for a video and the top reasons for reporting.
In our prototype testing, we targeted college-aged students who are already very familiar with YouTube. By intercepting individuals in lounge areas, we had them explore our prototype and tell us whether our approach to increased transparency would motivate them to think more critically about the content they absorb and how it impacts others. Through this testing, we saw nontrivial increases in both sensitivity to algorithmic bias and proclivity to action following interaction with our lo-fi prototype.
Based on our pre- and post- survey data — as well as synthesis of our prototype testing results — we found that being able to see how others have reported a video did indeed cause more thoughtful discussion about what constitutes algorithmic bias as well as the different ways users can take action. This validates the effectiveness of our prototype at increasing both users’ awareness and critical engagement with the media they consume and encounter.
However, we understand that interactions with the prototype may not entirely reflect in-vivo, typical engagement (or lack thereof) with the proposed solution, which forces us to be more conservative with our inferences. It is also important to recognize the risk that our current prototype biases the reasons users give for reporting.
With that in mind, the think-aloud sessions gave us useful insights into future directions for iterative development. We discovered concrete and tangible changes to implement in both the design and the flow of the prototype.
As a reminder, our honest signal for success was observing whether users can identify and address suspected cases of algorithmic bias; failure would be users caring no more, and caring to know no more, about algorithmic bias than they do now. To quantify this measure, we used our pre- and post-survey data. First, we took the average bias rating for each scenario and compared it before and after the prototype test to see whether sentiment changed. We did, in fact, notice an overall positive trend in the bias ratings, indicating that the increased transparency in reporting data led users to think more critically about the content they were watching (see appendix for specific numbers). As users become more critical of the YouTube algorithm, they are more likely to suspect videos of fitting their definition of algorithmic bias. In terms of taking action, we coded responses as either “action” or “inaction”, and found that the fraction of responses indicating some form of action in the post-test (7/11) was greater than in the pre-test (4/11). Thus, not only are users becoming more critical of the content they watch, they are also more likely to address suspected cases of algorithmic bias.
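For transparency about how these figures were computed, here is a minimal sketch of the pre/post comparison. The file names and column names (participant, scenario, bias_rating, response_code) are assumptions made for illustration; our actual coding was done by hand on the survey exports.

```python
import pandas as pd

# Assumed long-format data: one row per (participant, scenario) with a numeric
# bias rating, plus a hand-coded response_code of "action" or "inaction".
pre = pd.read_csv("pre_survey.csv")
post = pd.read_csv("post_survey.csv")

# Mean bias rating per scenario, before and after interacting with the prototype
ratings = pd.DataFrame({
    "pre": pre.groupby("scenario")["bias_rating"].mean(),
    "post": post.groupby("scenario")["bias_rating"].mean(),
})
ratings["change"] = ratings["post"] - ratings["pre"]
print(ratings)  # a mostly positive "change" column is the upward trend in bias ratings

# Fraction of responses coded as "action" in each wave (we observed 4/11 pre, 7/11 post)
for name, wave in [("pre", pre), ("post", post)]:
    print(name, round((wave["response_code"] == "action").mean(), 2))
```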
In terms of our assumptions, our prototype test 1) solidified the fact that users are indeed not informed about algorithmic bias and 2) showed that community-driven initiatives could indeed encourage people to be more conscious of interactions with the YouTube system. First, when users were giving examples of algorithmic bias in the pre-test, many struggled to come up with an answer or voiced being unconfident with their response. Thus, when it comes to actually coming up with tangible examples of algorithmic bias, many found that they knew less than they thought. Secondly, our prototype test showed that revealing how others have reported a video caused users to think more critically about the content they interacted with. Many users said that seeing the top reasons for other users reporting would cause them to look back at the video and consider how those factors could be seen in the content.
We qualify this report of success, however, by noting two concerning risks. First, we realize that the changes in proclivity to action and in sensitivity might not be statistically significant, so we would need more research to see whether the same trends emerge. In addition (and perhaps more worryingly), we understand that our test does not necessarily reflect in-vivo engagement with the prototype. Our test asked users to step through the entire reporting process and then reflect on hypothetical scenarios, whereas we assume a “normal”, everyday user will not step through the entire reporting process at all, much less with the level of attention and consideration we saw in our tests.
Proposed Changes and Next Steps
After conducting our test and (once again!) synthesising our results with affinity diagramming, it was apparent that we had to delay the presentation of the most popular reporting reasons and modify the report button so that it stands out more.
In order to maintain the honesty and authenticity of each report, we do not want users to see the most popular reporting reasons before they report. However, we recognize (and have validated) that seeing these rankings, for example in the form of “Your report matched 70% of other reports for this video”, validates users and reduces the monolithic stature of YouTube. Therefore, we plan not to remove the top reporting reasons but to delay them, displaying them on the screen that pops up after a user submits a report.
From user testing, we found that the current design of the ‘report’ button was far too ambiguous, especially since its role is to caution users away from a video. By creating more urgency in the design of our extension (for example, a more distinct placement for the report button), we may encourage users to reflect this urgency in their own actions. This could include a button that is more visually distinct from the other statistics and options that typically accompany a YouTube video.
Furthermore, we have to consider the in vivo interactions with the prototype.
However, we know our prototype both generates and takes advantage of recency bias. Since we showed testers a prototype about reporting, the results may be skewed: testers may mention reporting as a solution more often simply because they have just been shown it. As a group, we have to figure out how to achieve the same results without explicitly testing on reporting. Getting users to report more often will require a change of mentality from them, which raises the question, “how can we educate users such that they will report more often in normal circumstances?”
Project Review
We compiled all of the work, experimentation, and insights gathered into a cohesive story.
We investigated how to motivate users to be more critical of their interactions with algorithmic systems so that any instances of harmful behaviors are more identifiable and likely to be acted upon. We chose YouTube as our platform for analysis.
To explore this, we conducted data analysis, think-aloud testing, walking the wall, contextual inquiries, speed dating, affinity diagramming, surveys, and prototype design and testing.
We learned that users heavily prioritize their viewing experience. Users are also generally cynical about and distrustful of platforms and algorithmic systems. When given more transparency into the platform, users perceive a greater sense of control. Last but not least, users feel more empowered when an interface lets them visualise the impact of their actions, especially when those actions positively affect a community.
Therefore, our prototype is a third-party extension that tracks users’ reports and broadcasts them to other users. We anticipate that users will seek access to more information about video reports through our extension, and that increasing the transparency surrounding reported videos will make users more aware of their interactions with algorithmic systems.