An Application of Alternative Weighting Matrix Collapsing Approaches for Improving Sample Estimates

Section on Survey Research Methods

Linda Tompkins (1), Jay J. Kim (2)
(1) Centers for Disease Control and Prevention, National Center for Health Statistics, 3311 Toledo Road, Room 3115, Hyattsville, MD 20782
(2) Centers for Disease Control and Prevention, National Center for Health Statistics, 3311 Toledo Road, Room 3111, Hyattsville, MD 20782

Abstract

When creating sample weights, most U.S. government agencies combine small race groups such as American Indians and Asians with Whites, disregarding the different coverage ratios of the groups. This paper examines this methodology using the 2003 National Health Interview Survey (NHIS) data of the National Center for Health Statistics (NCHS) and reports the effect on the sample weights and estimates, specifically for Whites, American Indians (AI) and Asians. Two alternative weighting approaches are used in an effort to reduce the bias.

KEY WORDS: coverage ratio, sample weighting, cell collapsing.

1. Introduction

Before final weights are developed for survey data, a poststratification (ratio or initial adjustment) factor (PSF) is calculated for each cell (row or column) of a weighting matrix and applied to the cell. However, for some cells, poststratification factors cannot be computed. For example, if the sample count is zero for a cell, it is impossible to calculate the PSF because the denominator of the fraction involved is zero. Also, if the raw sample count for a fraction is small, the fraction is considered unstable. Because of these occurrences, in many surveys cells are checked as to whether they have enough raw sample cases to stand by themselves. Additionally, for most surveys, each cell is checked to see whether its PSF lies within an acceptable range. This ratio criterion ensures that the final weights are not too large or too small. It should be noted that very large or small weights can inflate the variance of estimates. If a cell fails either of the above tests, it is combined with another cell. The cell collapsing strategy described above has merits.
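The two checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NHIS production procedure: the function names are ours, the minimum raw count of 20 is borrowed from the independent weighting described in Section 3, the upper PSF bound of 2 is the collapsing threshold mentioned later in this section, and the lower bound of 0.5 is purely hypothetical.

```python
from typing import Optional


def psf(control_count: float, weighted_count: float) -> Optional[float]:
    """Poststratification factor: control count / initially weighted count."""
    if weighted_count == 0:
        return None  # zero denominator: the PSF cannot be computed
    return control_count / weighted_count


def needs_collapsing(raw_count: int, control_count: float, weighted_count: float,
                     min_raw: int = 20, psf_low: float = 0.5,
                     psf_high: float = 2.0) -> bool:
    """Return True if the cell fails either test and should be combined.

    min_raw, psf_low and psf_high are illustrative thresholds (assumptions).
    """
    factor = psf(control_count, weighted_count)
    if factor is None or raw_count < min_raw:
        return True  # empty cell, or too few raw sample cases to stand alone
    return not (psf_low <= factor <= psf_high)  # PSF outside the acceptable range


# A sparse cell and a well-covered cell (made-up raw counts).
print(needs_collapsing(raw_count=5, control_count=300, weighted_count=250))
print(needs_collapsing(raw_count=400, control_count=20_000, weighted_count=17_000))
```

A cell with a PSF of 6 (for instance a control count of 300 against a weighted count of 50) would also be flagged, regardless of how many raw cases it holds.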
However, Kim (2004) raised a potential problem with combining cells that differ in coverage ratios. Let $N_i$ be the control count for cell $i$ and $\hat{N}_i$ the initially weighted sample count for cell $i$, $i = 1, 2$, and let $f_i = N_i / \hat{N}_i$, $i = 1, 2$, be the Initial Adjustment Factor (IAF) for cells 1 and 2. Then $1/f_i$, $i = 1, 2$, is the coverage ratio for cell $i$. Let $N_2 = c N_1$. The PSF for the combined cell was expressed by Kim (2004) as follows.

For cell 1:

$$ f_1 \, \frac{(1 + c) f_2}{c f_1 + f_2}, \qquad (1) $$

and for cell 2:

$$ f_2 \, \frac{(1 + c) f_1}{c f_1 + f_2}. \qquad (2) $$

Before collapsing, the PSF for cell 1 is $f_1$. However, because of collapsing, as shown in equation (1), $f_1$ is modified by $\frac{(1 + c) f_2}{c f_1 + f_2}$, which is called the Collapsing Adjustment Factor (CAF) for cell 1 by Kim, et al. (2005). Similarly, for cell 2, the CAF is $\frac{(1 + c) f_1}{c f_1 + f_2}$. Using the above formulas, we can make the following observations: when $c = 10$ and $f_1 / f_2 = 4.0$, cell 1 will lose 73 percent of its own weight to cell 2. For the same $c$, if $f_1 / f_2 = .25$, cell 1 will gain an additional 214 percent of its own weight from cell 2. Note that this weight shift is artificial. Thus, Kim (2004) and Kim and Tompkins (2007) claimed that the current approach of cell collapsing can introduce bias, which can often be large.

Most surveys collapse a cell (row or column) with another if the PSF (ratio) for the cell is greater than 2. This standard collapsing procedure allows the PSF of the poorly covered cell to decrease below 2. Hence, Kim (2004) proposed to truncate (censor) the PSF for the cell at 2 to make sure that the PSF for that cell is exactly 2 or at least 2, depending on the method. Kim, et al. (2007) implemented these two approaches of weight truncation in their simulation studies and found that the latter

outperforms the former and the standard collapsing procedure.

When creating sample weights, most U.S. government agencies combine small race groups such as American Indians and Asians with Whites, disregarding the different coverage ratios of the groups. This paper examines this methodology using the 2003 National Health Interview Survey (NHIS) data of the National Center for Health Statistics (NCHS) and reports the effect on the sample weights and estimates, specifically for Whites, American Indians (AI) and Asians. Two alternative weighting approaches are used in an effort to reduce the bias.

2. Cell Collapsing and Alternative Weighting Approaches

The NHIS uses the following weighting matrix:

Table 1. Weighting Matrix

                           Age:   < 1 yr   1-4   5-9   10-14   15-19   ...
Hispanic              M
                      F
Non-Hispanic Black    M
                      F
Non-Hispanic Other    M
                      F

In the above table, M stands for male and F for female. The non-Hispanic other category, as mentioned before, includes all non-Hispanic races other than non-Hispanic Blacks (i.e., it includes Whites, American Indians, Asians, Native Hawaiians and Pacific Islanders and all multiple race groups). It is interesting to see how much the coverage ratios differ among the race groups in the other race category. Tables 2a and 2b present coverage ratios for Whites, American Indians (AI) and Asians by age category from the 2003 NHIS.

Table 2a. Coverage Ratios for 2003 NHIS - Males

Age Group   White    AI    Asian
< 1          .85    .17    .33
1-4          .80    .44    .66
5-9          .79    .88    .59
10-14        .80    .65    .54
15-17        .84    .46    .75
18-19        .61    .26    .55
20-24        .59    .55    .51
25-29        .60    .44    .31
30-34        .67    .54    .65
35-44        .67    .32    .53
45-49        .65    .51    .63
50-54        .67    .54    .57
55-64        .70    .53    .47
65-74        .75    .44    .44
75+          .71    .51    .65

Table 2b. Coverage Ratios for 2003 NHIS - Females

Age Group   White    AI    Asian
< 1          .82     -     .38
1-4          .80    .43    .71
5-9          .84    .70    .78
10-14        .76    .95    .70
15-17        .77    .25    .67
18-19        .72    .10    .50
20-24        .59    .57    .50
25-29        .68    .39    .56
30-34        .75    .46    .57
35-44        .76    .59    .59
45-49        .76    .36    .67
50-54        .80    .31    .53
55-64        .78    .62    .45
65-74        .75    .12    .48
75+          .76    .36    .64

In Table 2a, except for one age group (5-9 years), White males always have higher coverage ratios than American Indian males.
Also, White males always have higher coverage ratios than Asian males, without exception. One extreme case is the age group less than 1, where the coverage ratio for White males is .85, while that for American Indians is .17. The coverage ratio for American Indian males age < 1 is only 1/5 of that for Whites. For the same age group, the Asian coverage rate is less than half that of Whites. Of 15 male age groups, 7 age groups have coverage ratios less than .5 for American Indians. For the 18-19, 20-24 and 25-29 years age groups, coverage ratios for Whites are also low, but those for American Indians and Asians are even lower, sometimes less than half of that for Whites.

As for females in Table 2b, Whites always have higher coverage ratios than American Indians, with one exception (10-14 years of age). Also, Whites are better covered than Asians for all age groups. For the 18-19 years age group, Whites have a coverage rate which is more than 7 times better than that of American Indians. For the 65-74 years age group, the coverage ratio for Whites is more than 6 times that of American Indians. Quite often the White coverage rate is much better than that of American Indians.
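The comparisons drawn from Tables 2a and 2b can be checked directly. The sketch below simply transcribes the published ratios; the variable and function names are ours, and the missing American Indian entry for females under 1 is treated as unavailable rather than as zero (an assumption).

```python
AGE_GROUPS = ["<1", "1-4", "5-9", "10-14", "15-17", "18-19", "20-24", "25-29",
              "30-34", "35-44", "45-49", "50-54", "55-64", "65-74", "75+"]

# Coverage ratios transcribed from Table 2a (males) and Table 2b (females).
MALE_WHITE = [.85, .80, .79, .80, .84, .61, .59, .60, .67, .67, .65, .67, .70, .75, .71]
MALE_AI = [.17, .44, .88, .65, .46, .26, .55, .44, .54, .32, .51, .54, .53, .44, .51]
MALE_ASIAN = [.33, .66, .59, .54, .75, .55, .51, .31, .65, .53, .63, .57, .47, .44, .65]
FEMALE_WHITE = [.82, .80, .84, .76, .77, .72, .59, .68, .75, .76, .76, .80, .78, .75, .76]
FEMALE_AI = [None, .43, .70, .95, .25, .10, .57, .39, .46, .59, .36, .31, .62, .12, .36]
FEMALE_ASIAN = [.38, .71, .78, .70, .67, .50, .50, .56, .57, .59, .67, .53, .45, .48, .64]


def count_below(ratios, threshold=0.5):
    """Number of age groups whose coverage ratio is below the threshold."""
    return sum(1 for r in ratios if r is not None and r < threshold)


def exceptions(white, other):
    """Age groups in which the other group is covered at least as well as Whites."""
    return [age for age, w, o in zip(AGE_GROUPS, white, other)
            if o is not None and o >= w]


print("AI males below .5:", count_below(MALE_AI))
print("AI groups below .5 (both sexes):", count_below(MALE_AI) + count_below(FEMALE_AI))
print("Asian groups below .5 (both sexes):", count_below(MALE_ASIAN) + count_below(FEMALE_ASIAN))
print("AI exceptions:", exceptions(MALE_WHITE, MALE_AI), exceptions(FEMALE_WHITE, FEMALE_AI))
```

Listing the exceptions rather than only counting them makes it easy to see that the only cells where American Indians are not worse covered than Whites are males 5-9 and females 10-14.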

The following example demonstrates the effect on weights and estimates when two cells with very different coverage ratios are combined.

Example 1. Suppose we have the following initially weighted sample counts, control counts and initial adjustment factors for 2 cells, one for Whites and the other for American Indians, as shown in Table 3.

Table 3. Sample Weighting Data

          $\hat{N}_i$     $N_i$      $f_i$
AI             50           300      6
White      17,000        20,000      1.17647

When the White and American Indian cells in the above table are combined, the new PSF for the combined cell is

$$ \frac{300 + 20{,}000}{50 + 17{,}000} = 1.1906158. \qquad (3) $$

The original PSF for American Indians was 6, but the new PSF is 1.1906158. Hence, the new weighted total for American Indians is $1.1906158 \times 50 \approx 60$. Since the control count is 300, we observe an underestimation of 240, which equates to an 80 percent underestimation of American Indians in this cell. On the other hand, the original PSF for Whites is about 1.18, but the new PSF is 1.1906158. Thus, the new weighted total is 20,240, which is greater than the control count (20,000). In other words, Whites picked up an additional weight of 240 due to collapsing. This amount is 1.2 percent of the control count (20,000). Note that a 1.2 percent overestimation for Whites is negligible, but an 80 percent underestimation for American Indians is large.

In fact, the Collapsing Adjustment Factors (CAFs) for cells 1 and 2 from equations (1) and (2) have been implicitly applied to reach 1.1906158 in equation (3). With American Indians as cell 1 ($f_1 = 6$) and Whites as cell 2 ($f_2 = 1.17647$), the CAF for cell 1 is:

$$ \frac{1.17647\,(20{,}000/300 + 1)}{6\,(20{,}000/300) + 1.17647} = .1984358. $$

The new PSF for cell 1 is

$$ f_1 \times .1984358 = 6(.1984358) = 1.1906148. \qquad (4) $$

There is a slight difference between the values in equations (3) and (4), which is due to rounding error.

As mentioned before, the category of White males age <1 has a much higher coverage ratio than American Indians and Asians. The same observation can be made for females. Consequently, both White males and females age <1 were overestimated by 7 percent in 2003.
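The arithmetic of Example 1 can be reproduced with a short Python sketch. The function names are ours, and the CAF expression is the one implied by the worked numbers above, $f_2(1 + c)/(c f_1 + f_2)$ with $c = N_2/N_1$; the sketch is a check on the example, not the survey's weighting software.

```python
def combined_psf(controls, weighted):
    """PSF of a collapsed cell: summed control counts / summed weighted counts."""
    return sum(controls) / sum(weighted)


def caf_cell1(f1, f2, c):
    """Collapsing adjustment factor for cell 1, where c = N2 / N1."""
    return f2 * (1 + c) / (c * f1 + f2)


# Table 3: American Indians are cell 1, Whites are cell 2.
n_ai, nhat_ai = 300, 50
n_white, nhat_white = 20_000, 17_000
f_ai = n_ai / nhat_ai            # initial adjustment factor, 6.0
f_white = n_white / nhat_white   # about 1.17647

new_psf = combined_psf([n_ai, n_white], [nhat_ai, nhat_white])
ai_total = new_psf * nhat_ai         # weighted AI total after collapsing
white_total = new_psf * nhat_white   # weighted White total after collapsing
caf = caf_cell1(f_ai, f_white, n_white / n_ai)

print(f"combined PSF        = {new_psf:.7f}")
print(f"AI weighted total   = {ai_total:.1f} (control {n_ai})")
print(f"White total         = {white_total:.0f} (control {n_white})")
print(f"CAF for AI cell     = {caf:.7f}; f1 * CAF = {f_ai * caf:.7f}")

# The two observations quoted from Kim (2004), with c = 10 and f2 set to 1
# so that f1 equals the ratio f1/f2.
print(f"f1/f2 = 4.0 : weight change {100 * (caf_cell1(4.0, 1.0, 10) - 1):+.0f}%")
print(f"f1/f2 = .25 : weight change {100 * (caf_cell1(0.25, 1.0, 10) - 1):+.0f}%")
```

Because the sketch carries the unrounded value of the White adjustment factor, the product of $f_1$ and the CAF agrees with the combined PSF to machine precision, which is consistent with attributing the small gap between equations (3) and (4) to rounding.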
For both genders, in all except two age groups, Whites are better covered than American Indians, which causes the former to absorb weights from the latter. As a result, American Indians, overall, were underestimated by 29.7 percent, as will be seen in section 3. Similarly, Asians were underestimated by 20.7 percent.

To rectify this problem, we propose two alternative weighting procedures. The first is to weight American Indians and Asians independently. American Indians had 197 raw sample cases, which is enough for independent sample weighting. The number of sample persons is 1,200 for Asians, which is more than enough for independent sample weighting.

The second procedure is to artificially inflate to .5 the coverage ratios which are originally lower than .5. This is to protect the sample cases in the cells whose coverage ratios are too low, or whose PSF is too high. This approach ensures that the final weighted total in the cell is at least half the control count. According to this approach, the PSF can sometimes go much higher than 2. This approach is somewhat consistent with the weight truncation approach of Kim, et al. (2007). They considered two approaches of weight truncation: one allows the PSF to go over the threshold (2), but the other does not. The approach proposed here is similar in spirit to the former. The protection of the weights in the poorly covered cells is greater in the approach proposed here because the PSF for this new approach can increase much more than that considered by Kim, et al. Example 2 (Table 4) numerically illustrates the approach proposed here.

Table 4. Sample Weighting Data

          $\hat{N}_i$   $\hat{N}_i$ (inflated)    $N_i$     $f_i$     $f_i$ (after inflation)
AI             50              150                  300     6          2
White      17,000                                20,000     1.17647

In Table 4, we set $f_i$ for American Indians equal to 2, instead of 6 as in Table 3. To do so, we had to multiply $\hat{N}_i$ (50) by 3 to make it 150. In other words, to make sure that $f_i = 2$, we had to artificially inflate $\hat{N}_i$ by a factor of 3. If the original $f_i$ were 3 (this means $\hat{N}_i = 100$), then we would have had to artificially inflate $\hat{N}_i$ by a factor of 1.5, instead of 3.
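A minimal sketch of the inflation step, using the Table 4 figures, is given here. It assumes that the inflation factor is the smallest multiplier that raises a cell's coverage ratio to .5 and that the same factor is re-applied to the collapsed PSF for the protected cell; the function and variable names are ours.

```python
def inflation_factor(control: float, weighted: float, min_coverage: float = 0.5) -> float:
    """Multiplier that raises the coverage ratio (weighted / control) to min_coverage.

    Cells already at or above min_coverage are left unchanged (factor 1).
    """
    needed = min_coverage * control / weighted
    return max(1.0, needed)


# Table 4: American Indian and White cells.
n_ai, nhat_ai = 300, 50
n_white, nhat_white = 20_000, 17_000

k = inflation_factor(n_ai, nhat_ai)        # inflation factor for the AI cell
nhat_ai_inflated = k * nhat_ai             # artificially inflated weighted count
collapsed_psf = (n_ai + n_white) / (nhat_ai_inflated + nhat_white)
ai_psf = k * collapsed_psf                 # PSF actually applied to AI sample cases
ai_estimate = ai_psf * nhat_ai             # final weighted AI total for the cell
shortfall_pct = 100 * (n_ai - ai_estimate) / n_ai

print(f"inflation factor      = {k:.2f}")
print(f"inflated weighted AI  = {nhat_ai_inflated:.0f}")
print(f"collapsed PSF         = {collapsed_psf:.5f}")
print(f"AI PSF                = {ai_psf:.5f}")
print(f"AI estimate           = {ai_estimate:.2f} of {n_ai} ({shortfall_pct:.0f}% short)")
```

With an original $f_i$ of 3 (a weighted count of 100 against a control count of 300), `inflation_factor(300, 100)` returns 1.5, matching the second case described above.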

When the White and American Indian cells in the above table are combined, the new PSF for the combined cell is

$$ \frac{300 + 20{,}000}{150 + 17{,}000} = 1.18367. \qquad (5) $$

The new PSF for Whites is 1.18367, but that for American Indians is 3(1.18367) = 3.55101. Compare 1.1906158 to 3.55101 for the American Indian cell's PSF. The new cell estimate for American Indians is 50(3.55101) = 177.5505. Since the control count is 300, we observe an underestimation of 122, which equates to an approximate 41 percent underestimation of American Indians in this cell. This is a big improvement in comparison to the result of the original cell collapsing approach.

3. Alternative Sample Weighting

When independently weighting the sample for American Indians and Asians, a minimum raw sample count of 20 was used for cell collapsing. That is, starting with the age group <1 cell, if a raw sample count was less than 20 for a cell, it was combined with the next nearest cell. It should be noted that no artificial inflation of the weights was done while combining cells within each of the race groups. Artificial inflation of the weights was, however, employed in collapsing American Indians and Asians with Whites. After weighting was completed, weights for each sample unit were accumulated for American Indians and Asians; the results are shown in Tables 5 and 6, respectively.

Table 5. American Indian Weighting (in 1,000's)

               Total Weight       Control Count
Current        1,496 (-29.7%)     2,127
Inflated       1,752 (-17.4%)     2,127
Independent    2,127              2,127

As Table 5 shows, when we rely on the current weighting procedure, i.e., when American Indians are collapsed with Whites for weighting, the weight total for American Indians is 29.7 percent lower than its control count. On the other hand, when a special measure was taken to protect the weights in the cells whose coverage ratios were lower than .5, the weight total improved over the current approach by 12.3 percentage points. However, the inflation approach still underestimates the control count by 17.4 percent. There are two reasons for this.
First, we did not take any measure to protect the cells whose coverage ratios were higher than .5, even if the coverage ratio for American Indians was lower than that for Whites. Second, even though we gave higher PSFs to cells whose coverage ratios were lower than .5, we did not raise the ratio all the way to the same level as that for Whites. As can be predicted, when the independent weighting approach was used, the total weight is the same as the control count.

Table 6. Asian Sample Weighting (in 1,000's)

               Total Weight       Control Count
Current        9,369 (-20.7%)     11,817
Inflated       9,753 (-17.5%)     11,817
Independent    11,817             11,817

As shown in Table 6, when Asian cells are collapsed with Whites for weighting, as in the current approach, Asians are underestimated by 20.7 percent. Note that this underestimation is smaller than that for American Indians. This is because Asians, in general, have better coverage ratios than American Indians for both genders. When the inflation approach was used, the weighted total improved over the current approach by only 3.2 percentage points. This improvement is much smaller than that observed for American Indians. The difference is due to the fact that 16 out of 30 American Indian age groups have coverage ratios less than .5, but for Asians, the same observation could be made for only 7 age groups.

Prevalence rates were calculated for 4 health characteristics based on the three cell collapsing approaches: diabetes, health insurance coverage, overnight hospital stay and asthma. It should be noted that one rate for each race was computed, just as in published survey reports. Table 7 presents prevalence rates for American Indians.

Table 7. Prevalence Rates (percent) for American Indians, Weighted Total as Denominator

                           Current    Inflated    Independent
Diabetes                     9.22       9.43        10.28
Insurance                   64.90      63.72        65.33
Overnight Hospital Stay      7.73       8.67         8.25
Asthma                      17.41      16.32        18.04

In Table 7, for all 4 health characteristics, the prevalence rate for the independent weighting approach is higher than that for the current weighting approach. The biggest difference can be observed for diabetes.
The independent weighting approach gives a prevalence rate for diabetes that is more than 1 percentage point higher (in absolute terms) than the current approach. It is 11 percent higher in relative terms. The inflation approach's rate is higher than the current approach's rate for 2 characteristics, but

it is lower than the independent approach's rate for only one of them (diabetes); for overnight hospital stay it is higher (8.67 versus 8.25). However, for the 2 other characteristics, the prevalence rate for the inflation approach is lower than that of the current approach. Table 8 presents prevalence rates for Asians.

Table 8. Prevalence Rates (percent) for Asians, Weighted Total as Denominator

                           Current    Inflated    Independent
Diabetes                     4.35       4.50         4.70
Insurance                   83.49      83.44        83.70
Overnight Hospital Stay      4.85       5.05         5.09
Asthma                       5.96       5.84         5.83

As shown in Table 8, the prevalence rate for the independent weighting approach is higher than that for the current weighting approach except for asthma. The inflation approach provides prevalence rates closer to those of the independent weighting approach for all variables, except for health insurance. The difference in prevalence rates between the current and the independent weighting approach is much smaller for Asians than for American Indians. This may be due to the fact that the coverage ratios for Asians are much more stable than those for American Indians.

Note that in calculating the prevalence rates in Tables 7 and 8, estimated counts were used for both numerators and denominators. However, control (population) counts instead of estimated counts (weighted totals) can be used for the denominator, while estimated counts are still used for the numerator. For example, suppose researchers want to calculate the prevalence rates for American Indians or Asians in certain age groups or regions of the nation, since the NCHS report does not show the rates for regions. To do so, they can cumulate the weights of, for example, diabetic people in the regions and compute the prevalence rates using the cumulated weights as the numerator and the population count as the denominator. The following two tables show the prevalence rates calculated in that manner:

Table 9. Prevalence Rates (percent) for American Indians, Control Count as Denominator

                           Current    Inflated    Independent
Diabetes                     6.48       7.77        10.28
Insurance                   45.65      52.49        65.33
Overnight Hospital Stay      5.44       7.14         8.25
Asthma                      12.25      13.44        18.04

Tables 7 and 9 show the prevalence rates for American Indians.
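The link between the two denominators can be checked numerically. In the sketch below the weighted numerator is not a published figure: it is back-calculated from the Table 7 diabetes rate and the Table 5 weighted total (an assumption made purely for illustration), and the function name is ours.

```python
def rate_with_control_denominator(rate_weighted_pct: float,
                                  weighted_total: float,
                                  control_count: float) -> float:
    """Re-express a prevalence rate using the control count as the denominator.

    The numerator (accumulated weights of affected persons) is implied by
    rate_weighted_pct * weighted_total and is kept unchanged.
    """
    numerator = rate_weighted_pct / 100 * weighted_total
    return 100 * numerator / control_count


CONTROL = 2_127  # American Indian control count, in thousands (Table 5)

# approach: (Table 7 diabetes rate in percent, Table 5 weighted total in thousands)
approaches = {
    "current": (9.22, 1_496),
    "inflated": (9.43, 1_752),
    "independent": (10.28, 2_127),
}

for name, (rate, total) in approaches.items():
    converted = rate_with_control_denominator(rate, total, CONTROL)
    print(f"{name:12s} weighted-denominator {rate:5.2f}% -> control-denominator {converted:5.2f}%")
```

The converted diabetes rates agree with the first row of Table 9 to two decimals, which suggests that the gap between Tables 7 and 9 is driven entirely by the shortfall of the weighted total relative to the control count.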
The rates in Table 7 are computed with the weighted total in the denominator and those in Table 9 with the population count in the denominator. The rates in Table 9 are much lower than those in Table 7, except for those for the independent weighting method, which are the same. The rate for the current approach in Table 9 is 29.7 percent lower than that in Table 7 for each of the four health characteristics. Similarly, the rates for the inflation approach in Table 9 are 17.6 percent lower than those in Table 7. In Table 9, the rates for the current approach are almost one third lower than those for the independent weighting approach. The rates for the inflation approach are between the two approaches.

Table 10. Prevalence Rates (percent) for Asians, Control Count as Denominator

                           Current    Inflated    Independent
Diabetes                     3.45       3.71         4.70
Insurance                   66.19      68.87        83.70
Overnight Hospital Stay      3.85       4.17         5.09
Asthma                       4.73       4.82         5.83

Both Tables 8 and 10 show the prevalence rates for Asians. The relationship between Tables 8 and 10 is the same as that between Table 7 and Table 9. The rates in Table 10 are much lower than the rates in Table 8, except for those for the independent weighting method, which remain the same. The rate for the current approach in Table 10 is 20.7 percent lower than that in Table 8 for each of the four health characteristics. Similarly, the rates for the inflation approach in Table 10 are 17.4 percent lower than those in Table 8. Again, these differences are due to the different denominators, that is, the weighted total or the control count.

Comparisons between the rates in Table 7 and the rates in Table 9 and between the rates in Table 8 and the rates in Table 10 show that when the prevalence rates are calculated it is better to use the weighted totals as the denominator for American Indians and Asians.

4. Concluding Remarks

Thus far, we have observed that combining cells with varying coverage ratios results in under- and overestimation of population (control) counts. In order to

alleviate this problem, we proposed independent weighting and weight inflation approaches for collapsing cells, implemented these approaches using NHIS data and compared them with the current weighting procedure.

Currently, American Indians and Asians are combined with Whites for sample weighting. However, coverage rates for Whites are better, often much better, than those for American Indians in 28 out of 30 age groups. Coverage rates for 3 age groups for American Indians are extremely low, i.e., they are in the 10-17 percent range, while they are at least 72 percent for Whites. Because of this, the current weighting approach underestimated American Indians by 29.7 percent. Also, Whites consistently had better coverage ratios than Asians, and as a result, the current weighting approach underestimated Asians by 20.7 percent.

We also estimated the prevalence rates for diabetes, health insurance coverage, overnight hospital stay and asthma using the weights developed by three different cell collapsing approaches. For all 4 health characteristics, American Indians show higher prevalence rates when they are weighted independently than when they are weighted as a part of the Other race category (i.e., when they are weighted while combined with Whites). The American Indian diabetes prevalence rate is more than 1 percentage point higher when the independent weighting approach is used (10.28%) than when the current weighting approach is used (9.22%). The weight inflation approach shows mixed results for American Indians. For 2 characteristics, the weight inflation approach showed higher prevalence rates than the current weighting approach, whereas for 2 others, the reverse was observed. For Asians, the prevalence rate for the independent weighting approach is higher than that for the current weighting approach, except for asthma. The inflation approach provides prevalence rates closer to those of the independent weighting approach, except for health insurance.

The prevalence rate can be calculated using two methods. One is to use weighted counts for both numerator and denominator, and the other is to use weighted counts for the numerator, but population counts for the denominator. The first approach was used for the tables above. However, if the second approach were to be used, the rates would be underestimated by 29.7 percent for American Indians and by 20.7 percent for Asians with the current collapsing approach and 17.7 percent and 17.4 percent, respectively, with the inflation approach. This is because their weighted totals are lower than their respective population counts. Thus, the first approach is recommended for computing the prevalence rates.

The public use micro data (PUM) file from the survey data we used for this study has been released to the general public. Note that the PUM file contains sample weights for sample persons in the file. Some data users of the PUM file might want to accumulate weights for American Indians or Asians, say with diabetes, to come up with the number of diabetic American Indians or Asians in the nation or some region of the nation. However, the result would be a gross underestimation of the true values for the reason mentioned above. A better approach to getting the number of diabetic American Indians or Asians in the nation or a region would be to calculate the prevalence rate using weighted counts for both the numerator and the denominator and then to multiply the rate by the American Indian or Asian population count, respectively.

In conclusion, the independent weighting approach for American Indians and Asians may produce more realistic weights, and therefore, more accurate estimates. In addition, the current approach appears to underperform when compared to the inflation approach, even though the latter can be further fine tuned.

5. References

Kim, J.J. (2004). Effect of collapsing rows/columns of weighting matrix on weights. Proceedings of the Section on Survey Methods Research, American Statistical Association CD.

Kim, J.J., Li, J., and Valliant, R. (2007). Cell collapsing in poststratification, to be published in Survey Methodology.

Kim, J.J. and Tompkins, L. (2007). Comparisons of current and alternative collapsing approaches for improved health estimates. Paper presented at the 11th Biennial CDC/ATSDR Symposium on Statistical Methods, Atlanta, Georgia, April 17-18, 2007.

DISCLAIMER: The findings and conclusions in this paper are those of the authors and do not necessarily represent the views of the National Center for Health Statistics, Centers for Disease Control and Prevention.