The Study on How DEI Causes Hostile Attribution Bias
The NCRI study is a step in the right direction toward identifying issues with DEI training, but, like a lot of psychology studies, it has numerous flaws.
Social Context
The Network Contagion Research Institute (NCRI) is a lab affiliated with Rutgers University’s Social Perception Lab that studies “cyber-social threat identification and forecasting.” Think QAnon and the online versions of neo-Nazis, anarcho-socialist networks, and other fringe movements.
As their website likes to advertise, their studies have gotten a decent amount of coverage in the media. About a month ago, they released a report entitled “Instructing Animosity: How DEI Pedagogy Produces the Hostile Attribution Bias.” This report made the news for not making the news.
For example, an article in National Review and Colin Wright, writing in his Substack newsletter, reported that NCRI had been in talks with reporters at The New York Times and Bloomberg to cover the study, but both The Times and Bloomberg dropped the stories in editorial review. The Bloomberg article was dropped by Bloomberg’s “Equality” editors without explanation to NCRI. The Times reporter was more forthcoming, telling NCRI, “I told my editor I thought if we were going to write a story casting serious doubts on the efficacy of the work of two of the country’s most prominent DEI scholars [Robin DiAngelo and Ibram X. Kendi], the case against them has to be as strong as possible.” NCRI says the Times reporter also told them that the Times might revisit the study if it underwent peer review.
A few days later National Review Online ran a feature editorial reiterating the points made in the earlier articles. The Times for its part reached out to National Review denying that there was an article “ready for publication” and asserting, “Our journalists are always considering potential topics for news coverage, evaluating them for newsworthiness, and often choose not to pursue further reporting for a variety of reasons. Speculative claims from outside parties about The Times’s editorial process are just that.”
In short, NCRI was used to getting coverage of their studies from mainstream media. They did a study that went against one of the biases of mainstream media by criticizing the “diversity, equity, and inclusion (DEI)” industry. Then, NCRI did not get the coverage it wanted or was used to getting. The most comical aspect of this was Bloomberg’s DEI-adjacent “Equality” editorial staff axing a story on the harmful effects of DEI training. It must be nice to have a platform to silence your own critics.
More recently, Jesse Singal tried to downplay the reaction to the lack of news coverage of the study on the podcast he co-hosts. He pointed out that he managed to get an op-ed critical of the DEI industry published in The New York Times in 2023, though he admitted this was partly due to knowing a sympathetic editor at the Times. I re-read the Colin Wright and National Review articles and found them to be mostly fair.1 Granted, Colin Wright’s article has a little hyperbole in it. For instance, it starts with “In a stunning series of events…,” but you don’t build a successful Substack newsletter by reporting on things in a dry and boring way.
There is a point that Jesse Singal and I can agree on: this is not so much “stunning” as just business as usual. It should not be a surprise to anyone who has kept up with the popular culture in 2024 that The New York Times and Bloomberg have editorial biases and that these biases are toward protecting the DEI industry, not criticizing it.
My Background
I have been critical of psychology experiments in the past for not generalizing beyond the university student volunteers that populate them. I have also reported on insights from psychology experiments about flaws in human memory and the fallacy of generic generalization. Psychology often asks interesting questions that are very relevant to our lives, but the actual results of psychological science are often flawed due to methodological or interpretative issues.
I am not surprised there is a replication crisis in psychology and behavioral science more generally, and you can count me among those who are generally skeptical of results from psychology.
On the other hand, I have also been critical of DEI initiatives because they can be used to deplatform viewpoints outside a very narrow orthodoxy, impoverishing the discourse for us all. In my own experience with DEI initiatives in the workplace, I have seen them used to demagogue and deplatform “right-wing” viewpoints, not only on issues plausibly related to “diversity” matters such as race relations, but also on controversial issues as unrelated to “diversity” as abortion.2
I am not surprised there appears to be increasing push-back against DEI initiatives, and you can count me among those who are skeptical of DEI.
With this background as a double-skeptic, I scrutinize the results of “Instructing Animosity.”
The Study
“Instructing Animosity” actually reports on three similar experiments, one with race, one with religion (specifically, Islam), and one with caste. The first experiment was repeated with two different samples. Each experiment used the same basic format.
For instance, in the race experiment, a sample of test subjects was taken. The test subjects were randomly assigned to either an experimental group or a control group. The experimental group was given readings from authors Ibram X. Kendi and Robin DiAngelo. The control group was given a text about corn production in the United States.
After their respective readings, both groups were given an identical scenario about a college application:
A student applied to an elite East Coast university in Fall 2024. During the application process, he was interviewed by an admissions officer. Ultimately, the student’s application was rejected.
The experiment concluded by giving a questionnaire to test subjects asking them about the presence of racial discrimination on the part of the admissions officer, unfairness to the applicant, the commission of “microaggressions,” etc.
Note that the scenario does not mention either the race of the applicant or the race of the admissions officer. Thus, there is no evidence of racial discrimination in this scenario. Any test subject concluding that there is racial discrimination in this scenario is exhibiting what the authors term “hostile attribution bias.” This is a term based on an earlier study from 1995 and refers to the perception of prejudicial hostility where none is present.
In all three experiments, the experimental groups gave more hostile attribution bias responses than did the control groups. These are the main results of the study. They are easy to read about in the publicly available report on NCRI’s website, so I won’t reiterate the results here.
Criticisms
As anyone who has read my article on scrutinizing statistics might guess, my main questions when reading “Instructing Animosity” pertained to the scope of inference (what population can this sample be generalized to?) and practical significance (just how much more hostile attribution bias did the experimental groups have over the control groups?).
Scope of Inference
The initial version of the first experiment (race) involved “423 undergraduates from Rutgers University.” This version has the generalization issues common with psychology experiments that I alluded to earlier. The scope of inference for this initial version isn’t even just “undergraduates at Rutgers,” but “undergraduate students at Rutgers who volunteer for psychology studies” because oftentimes the population of students who volunteer for studies differs substantially from the larger student body.
To the researchers’ credit, once the initial version of the experiment suggested a result, they repeated it with another sample.
The strength of these notable results motivated NCRI to test for replicability with an experiment on a national sample (n=1086 recruited via Amazon Prime Panels) of college/university students to ensure these findings were not an aberration of student attitudes on Rutgers campus.
This is where things get strange. There is a “Prime Panels” service offered by the company CloudResearch. Prime Panels is an online panel aggregator that lets researchers specify a sample of people for studies like this one. The sample can be filled by one or more providers depending on the exact nature of the sample requested (for instance, all Americans versus specific subpopulations like college students).
The strange part is that there is no “Amazon” labeling on this product. There is the Amazon Mechanical Turk service, however. Mechanical Turk is a crowdsourcing service that lets clients outsource work to be done by real human beings. It has been used in the past by behavioral and social science researchers for experiments, even though that is not what it was originally designed for.
I found at least one study showing that Prime Panels performs better than Mechanical Turk for behavioral and social experiments. This should not be surprising, since Prime Panels was designed for research experiments and Mechanical Turk was not. For instance, the Mechanical Turk worker population is on average less religious than the general population, so experiments that rely on the religiosity of test subjects fail to replicate on Mechanical Turk.
Perhaps the NCRI authors confused the two services and conflated “Amazon Mechanical Turk” with “Prime Panels” to create “Amazon Prime Panels”?3
At any rate, more and more surveys and experiments are using online panels because older standard methods, such as telephone surveys, have become less appealing due to declining response rates and rising costs. These online panels can suffer from bias if the population enrolled in the panel differs from the target population. Mechanical Turk’s workers being less religious than the general population is an example of this.
Online panels can be made less biased and more representative either by recruiting a sample that matches known demographic characteristics of the target population (such as age, sex, religion, race/ethnicity, etc.), by weighting the sample to match those characteristics, or both.
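To make the weighting idea concrete, here is a minimal Python sketch of post-stratification on a single attribute (gender). Every number in it is an illustrative placeholder rather than a figure from the study, and a real adjustment would match on several attributes at once.

```python
# Minimal sketch of post-stratification weighting on one attribute (gender).
# All numbers are illustrative placeholders, not values from the study.

sample_shares     = {"female": 0.66, "male": 0.29, "other": 0.05}   # sample composition
population_shares = {"female": 0.44, "male": 0.34, "other": 0.22}   # target composition

# Each respondent in group g gets weight = population share / sample share,
# shrinking over-represented groups and boosting under-represented ones.
weights = {g: population_shares[g] / sample_shares[g] for g in sample_shares}

# Hypothetical per-group mean scores and respondent counts for a 0-6 survey item.
group_means  = {"female": 4.0, "male": 3.0, "other": 3.5}
group_counts = {"female": 660, "male": 290, "other": 50}

unweighted = sum(group_means[g] * group_counts[g] for g in group_means) / sum(group_counts.values())
weighted = (sum(group_means[g] * group_counts[g] * weights[g] for g in group_means)
            / sum(group_counts[g] * weights[g] for g in group_counts))

print(round(unweighted, 2), round(weighted, 2))  # the weighted mean shifts toward the male group's average
```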
It doesn’t appear that NCRI made any such adjustment for the first experiment (on race) in their study. The main report does not mention any. The supplement to the report has a table describing the demographics of the sample: it was 66% female versus 29% male, a ratio of about 2.3 to 1. While there are more female than male college students in the United States, the imbalance is much smaller, around a 44% female to 34% male split, or a ratio of about 1.3 to 1.
I don’t have access to the raw results, so I can’t investigate if the over-representation of female college students in the sample affects the results of the experiment. (I did reach out to NCRI via the “Contact Us” form on their website, but never heard back.)
The larger point is that this dramatic difference indicates that the sample used for the first experiment reported in “Instructing Animosity” is unrepresentative of the target population of college students in the United States. This calls into question how generalizable the results are.
This is a good example of why we should be concerned about the scope of inference of experiments such as these. The results of such experiments are based on how many test subjects respond in certain ways in the experimental group versus the control group. The experiments are therefore sensitive to how many different kinds of people are included in the sample in the first place.
For instance, if the experimental effect — provided there even is one — were greater on average among female college students than among male ones, the over-representation of female college students in the sample would make the effect appear greater. Likewise, if the effect were concentrated among male college students, the over-representation of female college students would make the effect appear lesser.
The proportion of female versus male college students in the sample is just one known attribute on which the sample differs from the target population; the attributes actually biasing the results could be something entirely unrelated to gender. The gender ratio simply tells us that the sample deviates from the demographics of the population. The important takeaway is that because the sample is not representative of a population we care about, the results are not informative about that population.
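Here is a toy calculation, with made-up per-group effects, showing how the sample mix alone can move the headline number:

```python
# Toy illustration (entirely hypothetical numbers) of how over-representing a subgroup
# with a larger treatment effect inflates the overall estimated effect.

effect_female = 1.0   # hypothetical average effect among female students
effect_male   = 0.2   # hypothetical average effect among male students

def overall_effect(share_female: float) -> float:
    """Overall effect implied by a given female share of the sample."""
    return share_female * effect_female + (1 - share_female) * effect_male

print(overall_effect(0.695))  # sample-like mix (66% vs. 29%, normalized) -> about 0.76
print(overall_effect(0.564))  # population-like mix (44% vs. 34%, normalized) -> about 0.65
```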
On the other hand, the samples for the second experiment (religion) and third experiment (caste) were matched to a handful of demographic characteristics of the general population of the United States. From the supplement, it appears both samples were matched on gender, “White” versus “Non White,” five age groups (Gen Z, Millennials, Gen X, Boomers, Post WWI & During WWII), and party identification (Democrat, Republican, Independent, Other).
The samples for the second and third experiments of “Instructing Animosity” are therefore about as representative as the online panel samples behind the survey results you commonly see cited in mainstream media.
In summary, we should be doubtful of the results of the first experiment (race) reported in “Instructing Animosity” because the sample is unrepresentative of any real-world population, but the results for the second (religion) and third (caste) experiments are based on samples that appear to be more representative of the general population of the United States.
Practical Significance
More subjects in the experimental groups exhibited hostile attribution bias than in the control groups in all three experiments. This raises the question: how many more?
There are a few challenges in interpreting the practical significance of the “Instructing Animosity” results. Consider Figure 2a from the report, which summarizes the results of the first (race) experiment.
The bar plot shows the percentage difference in the average (mean) scores of the responses between experimental and control groups. Most of the questions asked the test subjects for a quantity: how many microaggressions, how much harm, how violent, how biased.
Answer Encoding
One issue is that there doesn’t appear to be any consistency in how the test subjects’ responses to these questions are encoded.
The supplemental material contains the actual questions and answer choices. The “how many microaggressions” question had answer options of “0,” “1-2,” “3-4,” “5-6,” or “7 or more.” It is not clear how a mean was calculated from these responses. For instance, how do you score the “7 or more” response?
The “harm” question had answer options of “No harm at all,” “Mild harm,” “Moderate harm,” “Substantial harm,” and “Extreme harm.” Again, it’s not obvious how to score these.
The answer options for the “how biased” question give us a hint as to how the encodings might work, though. They are “0 Not at all biased,” “1,” “2,” “3,” “4,” “5,” and “6 Very biased.” Perhaps the “harm” question was scored on a similar scale, say 0 through 4?
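For what it is worth, here is one plausible way the answer options could be turned into numbers. The report does not state its encodings, so the codes below are guesses, and the awkward cases (like “7 or more”) show why the choice matters.

```python
# Sketch of how ordinal answer options might be turned into numeric scores.
# These encodings are guesses; the report does not state the ones actually used.

harm_codes = {
    "No harm at all": 0, "Mild harm": 1, "Moderate harm": 2,
    "Substantial harm": 3, "Extreme harm": 4,
}

# The microaggression options are ranges, so any single number is a judgment call.
# Here each range is coded by its midpoint, with an arbitrary value for "7 or more".
microaggression_codes = {"0": 0, "1-2": 1.5, "3-4": 3.5, "5-6": 5.5, "7 or more": 7}

def mean_score(responses, codes):
    """Mean of coded responses; a different coding scheme would move this number."""
    return sum(codes[r] for r in responses) / len(responses)

sample_responses = ["1-2", "3-4", "7 or more", "0", "1-2"]
print(mean_score(sample_responses, microaggression_codes))  # 2.7 under these codes
```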
Percentage Differences
Another issue is that many of the results are reported as percentage differences. Let’s consider the “how biased” question since that has the most straightforward encoding.
The formula for percentage difference is

$$\%\ \text{difference} = \frac{\lvert y_1 - y_2 \rvert}{(y_1 + y_2)/2} \times 100\%,$$

where $y_1$ and $y_2$ are the two quantities being compared.
Only the percentage difference is reported, so we don’t know how much hostile attribution bias there was to begin with. Consider two different scenarios.
If the test subjects already have a lot of hostile attribution bias, the control group might have a mean answer score of 4.5 on the “how biased” question. The experimental group could have their hostile attribution bias increased by reading Kendi and DiAngelo and might have a mean answer score of 5.5. This would lead to a percentage difference of

$$\frac{\lvert 5.5 - 4.5 \rvert}{(5.5 + 4.5)/2} \times 100\% = 20\%.$$
If the test subjects are mostly skeptical, evidence-based thinkers, the control group could notice the lack of any racial information in the scenario and might have a mean answer score of 0.30 on the “how biased” question. The experimental group might have one or two test subjects respond to the Kendi and DiAngelo readings with hostile attribution bias and might have a mean answer score of 0.37. This would also lead to a percentage difference of

$$\frac{\lvert 0.37 - 0.30 \rvert}{(0.37 + 0.30)/2} \times 100\% \approx 20.9\%.$$
In the first scenario, the difference in averages is a full 1 point on a 0 to 6 scale. In the second scenario, the difference in averages is a paltry 0.07 on a 0 to 6 scale. The practical significance is much smaller in the second scenario than in the first, but both scenarios produce roughly the 20.6% difference reported in Figure 2a.
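A two-line Python function makes the ambiguity explicit: under the symmetric formula assumed above, these two very different scenarios produce nearly the same headline percentage.

```python
# The symmetric percentage-difference formula assumed above, applied to both scenarios.

def pct_difference(y1: float, y2: float) -> float:
    """|y1 - y2| relative to the average of y1 and y2, expressed as a percentage."""
    return abs(y1 - y2) / ((y1 + y2) / 2) * 100

print(round(pct_difference(4.5, 5.5), 1))    # high-bias scenario -> 20.0
print(round(pct_difference(0.30, 0.37), 1))  # low-bias scenario  -> 20.9
```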
While all the results for the first experiment (race) in “Instructing Animosity” are reported as percentage differences, the results for the second experiment (religion) and some of the results for the third experiment (caste) are reported as direct comparisons of the averages.
In the second experiment (religion), test subjects were given two scenarios, identical except for the name of a fictional man on trial for a bombing attack: a Muslim man named Ahmed Akhtar and a non-Muslim man named George Green. Instead of the DiAngelo and Kendi material used in the first experiment (race), NCRI used material from the Institute for Social Policy and Understanding (ISPU) intended to combat “Islamophobia.”
As we can see in Figure 3, the mean answer scores for the Ahmed Akhtar scenario were 5.25 for the experimental group and 4.92 for the control group. This is a difference of 0.33 on a scale of 1 to 7, which is about 6% of the magnitude of the scale. The ISPU material does have an effect of increasing hostile attribution bias, but the effect is small.
The effect sizes seen in the third experiment (caste) appear somewhat larger: 0.58 and 0.35 on a scale from 0 to 4, which are about 14.5% and 8.75% of the magnitude of the scale, respectively.
Lack of Adjustment for Multiple Comparisons
Looking back at Figure 2a, there are asterisks next to the numbers above only some of the bars. Footnote 13 of “Instructing Animosity” explains that these asterisks mark the statistical significance levels of t-tests comparing the experimental and control groups, with one asterisk for p < 0.05 and two for p < 0.01.
Because there are multiple questions and multiple t-tests, the authors are effectively doing multiple comparisons. The “significance level” of each t-test sets its type 1 error rate. With a significance level of 0.05, which is standard, about 5% of t-tests run when there is no real effect will nonetheless report a statistically significant result.
Only some of the questions led to statistically significant t-tests, and of these, many had marginal results. (You can tell because they only have one asterisk and not two, so they had p-values somewhere between 0.01 and 0.05.) Some of these may have been entirely spurious and the result of statistical error.
One way to compensate for this problem is to adjust for multiple comparisons when doing multiple t-tests. There are several standard approaches, and I won’t bore you with the details, but had any of them been used, some of the more marginal t-test results would likely have been pushed outside of statistical significance.
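As an illustration, here is how one standard correction (the Holm method, via statsmodels) behaves on a handful of invented p-values; none of these are the study’s actual values.

```python
# Sketch of a multiple-comparisons adjustment using the Holm method.
# The p-values are invented placeholders, one per hypothetical survey question.

from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.021, 0.038, 0.047, 0.060, 0.240]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for p, p_adj, r in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  significant after adjustment: {r}")
```

With these placeholder values, only the smallest p-value stays below 0.05 after adjustment, which is exactly the kind of attrition the one-asterisk results in Figure 2a might suffer.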
Just Averages
Like many other studies, the authors of “Instructing Animosity” spend all of their analysis looking into differences in averages. An average — whether it is a mean or a median — tells you something about the center of a distribution. It doesn’t tell you anything about how spread out the distribution is or what shape it has.
In Figure 3, we saw that the average unfairness score for the experimental group in the Ahmed Akhtar case was 5.25 and the average score for the control group was 4.92. But how spread out are the individual scores? Do they range over the full length of the scale from 1 to 7? Or are they mostly concentrated in the middle, around 4 to 5? If it is the latter, then the difference is more practically significant.
Is the difference because there are a few outliers in the experimental group who went all the way up to 7 or because there are a bunch of people who answered “4” when they would have given a “3” had they not been exposed to ISPU material? Is the hostile attribution bias coming from a large, specific effect on a few people or a small, broad effect on many people?
All of these questions could easily be answered just by including two histograms, one for the experimental group’s answers and one for the control group’s answers. Unfortunately, many studies only seem to be concerned about differences in averages.
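For readers who want to see how little it would take, here is a minimal matplotlib sketch of those two histograms. The score lists are invented stand-ins for the raw 1-to-7 responses, which are not publicly available.

```python
# Minimal sketch of side-by-side histograms for the control and experimental groups.
# The score lists are invented placeholders standing in for the raw 1-7 responses.

import matplotlib.pyplot as plt

control_scores      = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]  # hypothetical control responses
experimental_scores = [4, 4, 5, 5, 5, 6, 6, 6, 7, 7]  # hypothetical experimental responses

bins = [b + 0.5 for b in range(0, 8)]  # one bin per point on the 1-7 scale

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
ax1.hist(control_scores, bins=bins, edgecolor="black")
ax1.set_title("Control")
ax2.hist(experimental_scores, bins=bins, edgecolor="black")
ax2.set_title("Experimental")
for ax in (ax1, ax2):
    ax.set_xlabel("Unfairness score (1-7)")
ax1.set_ylabel("Number of respondents")
plt.tight_layout()
plt.show()
```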
Conclusion
“Instructing Animosity” reports a small increase in hostile attribution bias when people in the United States are given materials from the ISPU intended to combat anti-Muslim prejudice or materials from Equality Labs intended to combat caste prejudice.
The effects on hostile attribution bias caused by readings of Ibram X. Kendi and Robin DiAngelo reported in “Instructing Animosity” are obscured by the use of samples that are unrepresentative of any general population. I expect that if the experiment were re-run with a more representative sample, there would be a similar result: a small increase in hostile attribution bias. However, that remains to be seen.
Thus, the results of “Instructing Animosity” are not Earth-shattering, but they do give some preliminary evidence for a small effect.
It may seem like I am very down on “Instructing Animosity.” However, this is my typical reaction after reading a psychology study. Indeed, some professional psychologists would advise extreme caution when interpreting the results of psychology studies. I would have the same or similar criticisms of a lot of studies, in psychology and other subjects. My criticisms of “Instructing Animosity” could be addressed by the use of representative samples, adjustment for multiple comparisons, and a couple of histograms. Sadly, I quite often find myself with similar criticisms after reading a research paper.
Nonetheless, I welcome studies like “Instructing Animosity.” As the authors report,
A 2023 study by the Pew Research Center found that 52% of American workers have DEI meetings or training events at work, and according to Iris Bohnet, a professor of public policy at Harvard Kennedy School, $8 billion is spent annually on such programs. Despite widespread investment in and adoption of diversity pedagogy through lectures, educational resources, and training, assessments of efficacy have produced mixed results.
The massive spike in DEI training in the past few years was a fad adopted without much scrutiny. “Instructing Animosity” is a preliminary study that adds itself to the growing body of evidence that the effects of this fad are, at best, “mixed.”
Reference
Jagdeep, A., Jagdeep, A., Lazarus, S., Zecher, M., Fedida, O., Fihrer, G., Vaska, C., Finkelstein, J., Finkelstein, D. S., Yanovsky, S., Jussim, L., Paresky, P., & Viswanathan, I. (2024, November 25). Instructing Animosity: How DEI Pedagogy Produces the Hostile Attribution Bias. https://networkcontagion.us/reports/instructing-animosity-how-dei-pedagogy-produces-the-hostile-attribution-bias/
It sounds like Jesse Singal was reacting to a tweet by Noah Smith as much as to the Colin Wright article. I am not currently on X (formerly Twitter), and I don’t feel inclined to go through Noah Smith’s backlog to find the tweet in question to evaluate how sensational it was.

Unless otherwise noted, in this article the word “abortion” is used to mean induced abortion of pregnancy.
Technically, the word “abortion” is a generic term. Any process that is aborted before it comes to completion can, in theory, be labeled “abortion.” However, because of its association with abortion of pregnancy and the emotional weight of such occurrence, the word “abortion” is usually used to mean abortion of pregnancy.
Furthermore, even if we just consider “abortion” to mean abortion of pregnancy, there is ambiguity because in the medical literature the word “abortion” is used to mean two different things: spontaneous abortion, which is commonly called “miscarriage” in the vernacular, occurs when a pregnancy terminates without anyone’s intervention; induced abortion occurs when a pregnancy is terminated on purpose. When “abortion” is used in the vernacular it is commonly used to mean induced abortion.
This ambiguity can lead to misinterpretation. For instance, if a study were to report on abortions in a given population, it could include both spontaneous and induced abortions if it were using the medical literature definition, but it could exclude what are commonly called “miscarriages” if it were using the common definition.
This mistaken mixing of the two phrases might have occurred because there is a very popular “Amazon Prime” subscription service.