DIF testing based on the chi-square statistic is highly sensitive to sample size (e.g., Kim, Cohen, Alagoz, & Kim,
2007). When sample size is large, statistical significance can emerge even when the magnitude of DIF is quite small. Examining DIF effect sizes alleviates this concern. We report two DIF effect size measures: the signed and unsigned test differences between the groups (STDS and UTDS, respectively; Meade,
2010). STDS and UTDS quantify the difference in expected total IRT scale scores, across subjects in the focal group, that is attributable to DIF. In DIF analyses, the focal group is the particular group of interest, whereas the reference group is the group with which the focal group is compared (Angoff,
1993). In the current study, one group is arbitrarily chosen as the focal group and the other as the reference group. When DIF affects both item discrimination and item difficulty (nonuniform DIF), some subjects in the focal group have higher expected scores, and others lower expected scores, than reference-group subjects with the same ability. As a result, DIF effect sizes as a whole could equal zero because positive and negative differences average out (called cancellation). STDS allows such cancellation of DIF across both items and subjects, whereas UTDS does not. In addition to STDS and UTDS, we report the expected test score standardized difference (ETSSD; Meade,
2010). This statistic corresponds to Cohen's
d and can therefore be interpreted using the guidelines of Cohen (
1988). The calculation of STDS, UTDS, and ETSSD is carried out using Visual DIF software (Meade,
2010). Below, results of three IRT DIF detection methods and DIF effect sizes are presented for each subgroup.
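To make the aggregation concrete, the following sketch computes STDS, UTDS, and ETSSD for two hypothetical 2PL items. The item parameters, the ability grid standing in for focal-group ability estimates, and the exact formulas are our assumptions based on Meade (2010), not output from the Visual DIF software; the example only illustrates how signed differences can cancel across items and subjects while unsigned differences cannot.

```python
# Hedged sketch of DIF effect sizes in the spirit of Meade (2010).
# All parameter values below are hypothetical, chosen for illustration.
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item parameters (a, b): one row per item,
# estimated separately in the focal and reference groups.
focal_params = np.array([[1.2, 0.0], [0.8, 0.5]])
ref_params   = np.array([[1.0, 0.0], [1.1, 0.3]])

# Ability grid standing in for focal-group ability estimates.
theta_f = np.linspace(-3, 3, 200)

# Expected item scores (items x subjects) under each group's parameters.
es_f = np.stack([p_2pl(theta_f, a, b) for a, b in focal_params])
es_r = np.stack([p_2pl(theta_f, a, b) for a, b in ref_params])

diff = es_f - es_r  # signed per-item, per-subject differences

# STDS: signed differences summed over items, averaged over subjects;
# positive and negative differences can cancel.
stds = diff.sum(axis=0).mean()

# UTDS: absolute differences, so no cancellation across items or subjects.
utds = np.abs(diff).sum(axis=0).mean()

# ETSSD: difference in mean expected test scores in a Cohen's-d metric,
# standardized by the pooled SD of the expected test scores (assumed form).
ets_f = es_f.sum(axis=0)
ets_r = es_r.sum(axis=0)
sd_pooled = np.sqrt((ets_f.var(ddof=1) + ets_r.var(ddof=1)) / 2.0)
etssd = (ets_f.mean() - ets_r.mean()) / sd_pooled

print(f"STDS = {stds:.4f}, UTDS = {utds:.4f}, ETSSD = {etssd:.4f}")
```

Because UTDS takes absolute values before aggregating, it is always at least as large as the magnitude of STDS; a large gap between the two signals substantial cancellation, as occurs with nonuniform DIF.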