Understanding people: Sample matching

Douglas Rivers

Abstract

Sample matching is a novel methodology for selecting study samples from opt-in respondent pools. It addresses the problem of using large but unrepresentative panels to construct representative samples for target populations. The approach draws on comprehensive consumer and voter databases, improves on existing weighting methods, and is validated by its predictions of the 2005 California special election outcomes. The paper also discusses how population coverage and selection bias affect sample quality, emphasizing the need for representative sampling in Internet-based research.

Key takeaways

Sample matching significantly improves survey sampling by utilizing extensive consumer and voter databases.
The methodology addresses issues of selection bias and population coverage in non-random samples.
Elderly Internet users are overrepresented in panels, affecting sample representativeness.
Sampling methodologies like quota sampling and raking yield unreliable results compared to sample matching.
The study validates sample matching by predicting the 2005 California special election outcome.

Understanding People

Sample Matching: Representative Sampling from Internet Panels

A white paper on the advantages of the sample matching methodology by Douglas Rivers, Ph.D., founder, President and CEO of YouGovPolimetrix, Inc. and Professor of Political Science at Stanford University.

285 Hamilton Avenue, Suite 200, Palo Alto, CA 94301. T 650.462.8000, F 650.462.8422, www.polimetrix.com

Introduction

Sample matching is a new methodology for the selection of study samples from pools of opt-in respondents. This methodology addresses the primary substantive and technical issues of how large, but unrepresentative, panels can be used to construct representative study samples for particular target populations. The procedure uses a listing or enumeration of the population that can be obtained from large scale consumer and voter databases that have been developed in recent years. The existence of such data has not been exploited in previous Internet research. On both a theoretical and a practical level, this approach substantially improves upon existing weighting procedures. As validation, we show how this procedure performed in predicting the outcome of the 2005 California special election.

1. The Web Sampling Problem

Most samples today, whether for phone or the Internet, do not approximate random samples. In the case of phone surveys, where random digit dialing (RDD) or random selection from a list is used to select respondents, typical response rates for media polls or market research surveys are in the range of 20 percent. As a result, sample selection is primarily determined by who chooses to respond, not the random selection mechanism.

In the case of web surveys, most Internet panels do not claim to be randomly selected. Panel members are recruited by a variety of means (banner ads, email lists, promotions, and offers) and those who "opt-in" become the pool of respondents available for sample selection.

A few Internet panels, such as NetRatings and Knowledge Networks, do use random selection. NetRatings uses RDD to recruit a panel of Internet users who allow their Web traffic to be monitored. Knowledge Networks uses RDD to recruit a panel of both existing Internet users and non-users. Those without home Internet access are provided with an inexpensive device that allows them to be interviewed on the Internet. However, both NetRatings and Knowledge Networks have struggled with low response rates, high costs, and limitations imposed by small panel size.

Sample quality is largely a function of two factors: population coverage and selection bias. Population coverage refers to the proportion of the target population that is reachable, while selection bias refers to the willingness of reachable respondents to complete an interview. It would be nonsensical, for example, to use an opt-in Internet panel for a study of non-Internet users, since the panel lacks coverage of that population. On the other hand, even if a population can be reached by RDD, sample quality will still be poor if patterns of respondent cooperation cause selection bias.

1.1 Population Coverage

In the early days of Internet surveys, the primary sampling problem was the "Digital Divide." Internet usage was concentrated in more affluent and better educated segments of the population, while racial minorities, the elderly, and women were substantially underrepresented among Internet users. Today, nearly three quarters of the adult population has access to the Internet, either at home, work, or school, so that most of the population is, at least in principle, reachable by the Internet. Usage rates are lower for African Americans, Latinos, persons with a high school education or less, and the elderly, but none of these groups is excluded altogether.

Figure 1.1: Race and Internet Access

Figure 1.1 provides data on Internet access by race, as measured by the Current Population Survey. Internet usage has grown at about the same rate in all racial groups. The effect of this growth, however, has been to substantially reduce (though not eliminate) the degree to which minority groups are underrepresented among Internet users. In 1997, for example, whites were more than twice as likely to have Internet access as blacks and Hispanics. By 2003, whites were only about a third more likely to have Internet access than blacks. Similar patterns can be found in other groups. The Digital Divide has diminished substantially and will largely disappear in the next decade, as the Internet becomes the vehicle for the delivery of home entertainment and communications services. Even today, Internet coverage is adequate for most types of research. The problem is not coverage (who can be reached on the Internet) but sample selection.

1.2 Selection Bias

Most Internet surveys are not conducted using a random sample of Internet users. Instead, "access panels" have been developed from which samples are selected for individual studies. The properties of these panels vary depending upon how they were recruited. In this section, we compare selection biases in Internet surveys with selection biases in phone surveys.

Different types of people have different propensities for participation in survey research. These propensities lead to under-representation of certain groups in both Internet panels and RDD phone samples. In fact, the degree of under-representation of these groups (except for the elderly, discussed in more detail below) is not much different in an opt-in Internet panel than in an unweighted RDD phone sample. Table 1.1 shows the proportion of several difficult-to-reach groups in national media polls conducted by one of the national television networks during 2004.

Table 1.1: Unweighted Sample Composition of a National Media Poll

Group            Census    Avg. of 11 Surveys    Implied Weight
Blacks           11.0%     7.9%                  1.4
Hispanics        12.4%     4.8%                  2.6
Aged 18-24       12.3%     6.4%                  1.9
HS or less       46.6%     32.7%                 1.4
Postgraduate     8.7%      17.2%                 0.5
Never Married    23.8%     16.2%                 1.5

Table 1.2: Composition of Opt-in Web Panel

Group            Web Panel    Internet Users    Census
Blacks           4.3%         9.3%              11.0%
Hispanics        3.3%         7.2%              12.4%
Aged 18-24       8.7%         16.0%             12.3%
Postgraduate     23.3%        14.7%             8.7%
Married          60.4%        55.3%             54.3%
Male             58.8%        48.7%             48.9%

The conclusion to be drawn from these data is not that opt-in Web panels are representative of any particular population. This is demonstrably false: people who opt in to Web surveys have different demographics than either the population of all Internet users or the population of all adults. But the same is true for RDD telephone samples. In both cases, an appropriate methodology is required to produce usable samples for individual studies. We will discuss various solutions to this problem in sections 2 and 3 below.

1.3 The Elderly on the Internet

The Internet is often viewed as a venue for the young. Among the elderly, there tend to be fewer Internet users and a larger proportion who express no interest in having Internet access. While both statements are true, a lesser known fact is that elderly Internet users are much more likely to participate in web surveys. Therefore, most Internet panels have an excess of elderly participants, not a shortage.

Of course, the relevant question is not whether a panel has too many or too few elderly, but whether its elderly participants are representative or atypical of the elderly population. The evidence suggests that elderly web survey participants are somewhat different (more affluent and more knowledgeable about technology) but, after controlling for these factors, similar to elderly phone respondents. The problem of sampling the elderly using an opt-in Internet panel provides a good illustration of the issues that a valid sample selection procedure must deal with. There are usually some characteristics associated with sample selection that need to be identified to correct sample biases. In many years of experience with phone surveys, these factors have, for the most part, been identified and reasonably satisfactory measures developed for handling them.

1.4 Problems with Phone Samples

The quality of phone samples, however, has been deteriorating for a variety of reasons. First, cell phones have replaced land lines, especially among younger age groups. (Over 25 percent of those between the ages of 18 and 29 are not reachable on land lines.) Because of regulations on outbound calls to cell phones, this population is no longer reachable in an RDD phone sample. Phone coverage, which as recently as five years ago was in excess of 96 percent of the adult population, now appears to be under 90 percent and will continue to fall.

Caller ID and answering machines make it harder to contact respondents as well. In a short field period, it is practically impossible to contact more than half of the working numbers in an RDD sample. This pushes overall response rates to well under 50 percent.

Finally, declining cooperation for all types of surveys (including in-person interviews) has reduced the completion rate among contacted respondents. The overall response rates are so low that few survey organizations publish them for phone studies. To some degree, the growing acceptance of opt-in Internet samples just reflects a realization that most phone samples are opt-in samples too.

2. Current Practice for Selection and Weighting

2.1 Quota Sampling

By far the most common method for sample selection in consumer market research is quota sampling. In quota sampling, one defines a set of groups (e.g., men, women, 18-29 year olds, 30-64 year olds, 65+, etc.) and specifies how many respondents should be recruited for each group. Recruitment is then done on an ad hoc basis and any respondents in excess of the specified quota are turned away.

Needless to say, quota sampling has no basis in sampling theory, since the survey researcher has almost complete discretion in the selection of respondents within the "cells." In practice, the hard-to-fill quotas are the last to be filled and often end up being highly unrepresentative. For example, many phone surveys use explicit or implicit quotas for gender, since men are more difficult to reach by phone than women. Various devices, such as asking for a male respondent first and then, if none are available, accepting a female respondent, are employed to "balance" phone samples. The resulting samples are often very unrepresentative of men, since the available men are less likely to be employed and often older. Some media organizations have tried to address this problem by asking first for the youngest male at home and, if unavailable, then asking for the oldest female. These procedures also do not produce accurate age distributions within gender groups.

Quota sampling is a relic of the 1930s and should not be employed in the twenty-first century. It is, unfortunately, the standard sampling procedure for most web surveys.

2.2 Raking

For samples that have already been selected, the most popular method of weighting is the method of raking, also known as rim-weighting, first proposed by W. E. Deming during the 1940s. In raking, the sample marginals are forced to match the known population marginals (from a census or other source) by an iterative procedure. The primary advantage of raking is that it does not require the joint distribution of the variables to be known. It has a number of serious disadvantages. First, if the population marginals are skewed, the iterative weighting procedure often does not converge. Second, it generally does not find the correct weighting for combinations of variables. It can be shown that the implied joint distribution maximizes the entropy over a certain class of distributions. Since the weighting variables are often expected to be highly inter-correlated (e.g., race, education, and income), this is undesirable behavior. Third, and perhaps most important, raking yields unstable and unreliable estimates when the number of variables used to weight the sample is large. Which variables are used for weighting can often have serious implications for survey estimates. The reliability of these estimates then becomes a subjective judgment about which variables to use in weighting.

2.3 Cell Weighting

An alternative to raking is cell weighting, where the population is divided into a set of mutually exclusive and exhaustive categories (or "cells"). The sample is then weighted by the ratio of the population fraction in each cell to the corresponding sample fraction. This is sometimes called post-stratification. It differs from the usual type of stratification in that the sample observations in each cell are not a sample from the corresponding sub-population because of non-response. The procedure is valid if an ignorability assumption, similar to that described below, holds: the survey measurements need to be conditionally independent of non-response given the variables used for post-stratification.

There are two primary deficiencies of cell weighting. First, if the weights are large, the estimates can be highly inefficient and unstable. It is common practice to trim the weights (so, for example, weights are constrained to lie between, say, ½ and 2), but with current phone and Internet samples, larger weights are often needed to deal with differential nonresponse. Second, usually the cross-classification of only a few variables is available, so cell weighting is only applicable with a small number of variables and categories. This means that the range of nonresponse problems that can be remedied with cell weighting is limited.

3. Sample Selection by Matching

3.1 Description of Sample Matching Methodology

Sample matching is a newly developed methodology for selection of "representative" samples from non-randomly selected pools of respondents. It is ideally suited for Web access panels, but could also be used for other types of surveys, such as phone surveys.

Sample matching starts with an enumeration of the target population. In other contexts, this is known as the sampling frame, though, unlike conventional sampling, the sample is not drawn from the frame. For a study of registered voters, the target population is the set of registered voters, who are enumerated (with some exceptions) in the registered voter list. For general population studies, the target population is all adults, as enumerated (again with some exceptions) in consumer databases maintained by commercial vendors such as Acxiom, Experian, and InfoUSA. The development of comprehensive consumer and voter databases is a relatively recent phenomenon that has important implications for survey sampling.

Sample selection using the matching methodology is a two-stage process. First, a random sample is drawn from the target population. We call this sample the target sample. Details on how the target sample is drawn are provided below, but the essential idea is that this sample is a true probability sample and thus representative of the frame from which it was drawn.

Ideally, we would interview the respondents in the target sample and conventional sampling theory would describe the properties of the sample. However, we have no economical way of contacting most members of the target sample: they have not provided their email addresses to us, many do not have listed phone numbers, and those who do have listed numbers may not agree to be interviewed. Therefore, we do not attempt to interview members of the target sample. Instead, for each member of the target sample, we select one or more matching members from our pool of opt-in respondents. This is called the matched sample. Matching is accomplished using a large set of variables that are available in consumer and voter databases for both the target population and the opt-in panel.

The purpose of matching is to find an available respondent who is as similar as possible to the selected member of the target sample. The result is a sample of respondents who have the same measured characteristics as the target sample. Under certain conditions, described below, the matched sample will have similar properties to a true random sample. That is, the matched sample mimics the characteristics of the target sample. It is, as far as we can tell, "representative" of the target population (because it is similar to the target sample).

3.2 Selection of the Target Sample

In explaining the sample matching methodology, it may be helpful to think of the target sample as a simple random sample (SRS) from the target population. However, the efficiency of the procedure can be improved by using stratified sampling in place of simple random sampling. SRS is generally less efficient than stratified sampling because the size of population subgroups varies in the target sample.

With stratified sampling, we partition the population into a set of categories (or "strata") that are believed to be more homogeneous than the overall population. For example, we might divide the population into race, age, and gender categories. The cross classification of these three attributes divides the overall population into a set of mutually exclusive and exhaustive groups or strata. Then an SRS is drawn from each category and the combined set of respondents constitutes a stratified sample. If the number of respondents selected in each stratum is proportional to the stratum's frequency in the target population, then the sample is self-representing and requires no additional weighting. At YouGovPolimetrix, we usually stratify on race, gender, and age. For political studies, we also stratify on party registration and region. For other types of studies, custom strata can be developed.

3.3 The Distance Function

When choosing the matched sample, it is necessary to find the closest matching respondent in the panel of opt-ins to each member of the target sample. Various types of matching could be employed: exact matching, propensity score matching, and proximity matching. Exact matching is impossible if the set of characteristics used for matching is large and, even for a small set of characteristics, requires a very large panel (to find an exact match). Propensity score matching has the disadvantage of requiring estimation of the propensity score. Either a propensity score needs to be estimated for each individual study, so the procedure is automatic, or a single propensity score must be estimated for all studies. If large numbers of variables are used, the estimated propensity scores can become unstable and lead to poor samples.

At YouGovPolimetrix, we employ a proximity matching method. For each variable used for matching, we define a distance function, d(x, y), which describes how "close" the values x and y are on a particular attribute. For numerical characteristics, such as age, years of schooling, latitude, longitude, income, etc., the distance function is usually just the absolute value of the difference |x - y|, though, occasionally, we use the square of the distance to penalize large discrepancies.

The overall distance between a member of the target sample and a member of the panel is a weighted sum of the individual distance functions on each attribute. The weights can be adjusted for each study based upon which variables are thought to be important for that study, though, for the most part, we have not found the matching procedure to be sensitive to small adjustments of the weights. A large weight, on the other hand, forces the algorithm toward an exact match on that dimension.

3.4 Non-response Adjustments

Not all respondents in a matched sample will respond to a survey invitation. At Polimetrix, we use two procedures to deal with non-response: multiple matching and re-matching. First, instead of selecting a single match for each member of the target sample, we select multiple matches. The number of matches is based on an estimated response probability, using a hazard model to estimate the probability that a panelist responds by the end of the survey field period. The total number of panelists matched to each member of the target sample is determined by matching panelists until the expected number of responses is greater than or equal to one.

Second, we use a second round of matching when respondents begin an interview. Though the expected number of respondents who arrive for each target sample element is approximately one, randomness in response patterns will mean that some target sample elements are matched more than once and some not at all. The best matching respondent is assigned to the matching target element if that element has not already been matched. Otherwise, the responding panelist is compared to the target sample elements across all open studies and assigned to the closest matching target element using a priority assignment algorithm. This minimizes the number of respondents who are turned away (because a match has already been found) and ensures the most accurate matches possible.

3.5 Statistical Theory

The intuition behind sample matching is clear: if respondents who are similar on a large number of characteristics tend to be similar on other items for which we lack data, then substituting one for the other should have little impact upon the sample. Can this intuition be made rigorous? The answer is yes, as we describe below.

The theoretical conditions that guarantee the validity of sample matching are quite technical, but their content is easily understood. There are three main assumptions:

Assumption 1: Ignorability

Panel participation is assumed to be ignorable with respect to the variables measured by the survey, conditional upon the variables used for matching. What this means is that if we examined panel participants and non-participants who have exactly the same values of the matching variables, then on average there would be no difference between how these sets of respondents answered the survey. This does not imply that panel participants and non-participants are identical, but only that the differences are captured by the variables used for matching. Since the set of data used for matching is quite extensive, this is, in most cases, a plausible assumption.

Assumption 2: Smoothness

The expected value of the survey items given the variables used for matching is a "smooth" function. Smoothness is a technical term meaning that the function is continuously differentiable with bounded first derivative. In practice, this means that the expected value function doesn't have any kinks or jumps.

Assumption 3: Common Support

The variables used for matching need to have a distribution that covers the same range of values for panelists and non-panelists. More precisely, the probability distribution of the matching variables must be bounded away from zero for panelists on the range of values (known as the "support") taken by the non-panelists. In practice, this excludes attempts to match on variables for which there are no possible matches within the panel. For instance, it would be impossible to match on computer usage because there are no panelists without some experience using computers.

Under Assumptions 1-3, it can be shown that if the panel is sufficiently large, then the matched sample provides consistent estimates for survey measurements. The sampling variances will depend upon how close the matches are if the number of variables used for matching is large, but Monte Carlo evidence indicates that these adjustments are usually small. The key issues for an application are whether the variables used for matching are adequate controls for panel participation effects and, if they are, whether the panel is large enough to permit close matches.

4. Validation of Sample Matching

2005 California Special Election

During the 2005 California special election, YouGovPolimetrix released survey estimates of the proportion of voters intending to vote for and against seven propositions on the ballot. These estimates were contained in press releases that were published with several public sources (the National Journal's Hotline, www.realclearpolitics.com and www.pollingreport.com). The outcome of all seven propositions was correctly predicted (a record matched by only one other polling organization) and the root mean square error was 3.0% (only slightly larger than what would be expected from random sampling).

Table 3: Survey Accuracy in 2005 California Special Election

               Polimetrix Final Survey        Election Outcome
Proposition    Yes     No      Undecided      Outcome    Error
73             43%     54%     2%             47.4%      -3.1%
74             45%     52%     3%             45.1%      1.3%
75             48%     49%     3%             46.7%      2.8%
76             40%     56%     3%             38.0%      3.7%
77             41%     52%     6%             40.6%      3.5%
78             33%     55%     13%            41.5%      -4.0%
79             38%     46%     16%            39.0%      6.2%

While one (or even seven) estimates do not prove that the methodology "works," these results are very encouraging. In an election in which a number of phone and other Internet surveys provided very misleading estimates, sample matching performed very well.

Summary

Most samples today, whether for phone or the Internet, do not even roughly approximate random samples. The primary sampling problem that researchers face is one of sample selection.

Most Internet surveys are not conducted using a random sample of Internet users. Rather, most employ access panels from which samples are selected for individual studies.

By far the most common method for sample selection in consumer market research (both online and offline) is quota sampling. Quota sampling has no basis in sampling theory, since the survey researcher has almost complete discretion in the selection of respondents within the "cells." Additionally, hard-to-fill quota cells often end up being highly unrepresentative.

Post-selection stratification methods such as raking can be influenced by which variables are used for weighting, and this choice can often have serious implications for survey estimates. Cell weighting also suffers from the fact that only a few variables can be used to weight results and can become highly unstable when large weights are used.

Sample matching is a newly developed methodology for selection of "representative" samples from non-randomly selected pools of respondents. Sample matching results in a sample of respondents who have similar properties to a true random sample. That is, the matched sample mimics the characteristics of the target sample.

A number of side-by-side comparisons of matched samples against other offline and online samples show this new sampling method to be stable and highly accurate.
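The paper does not define the Error column of Table 3. As a reading aid, the sketch below checks one plausible interpretation, which is an assumption rather than something the text states: the error is the "Yes" share among decided respondents (undecided excluded) minus the actual "Yes" percentage. The function names are illustrative only.

```python
# Rows copied from Table 3:
# (proposition, yes %, no %, undecided %, actual yes %, reported error)
TABLE_3 = [
    (73, 43, 54, 2, 47.4, -3.1),
    (74, 45, 52, 3, 45.1, 1.3),
    (75, 48, 49, 3, 46.7, 2.8),
    (76, 40, 56, 3, 38.0, 3.7),
    (77, 41, 52, 6, 40.6, 3.5),
    (78, 33, 55, 13, 41.5, -4.0),
    (79, 38, 46, 16, 39.0, 6.2),
]


def decided_yes_share(yes, no):
    """'Yes' percentage among respondents who stated a preference."""
    return 100.0 * yes / (yes + no)


def recomputed_error(yes, no, outcome):
    """Assumed definition: decided 'Yes' share minus actual 'Yes' share."""
    return round(decided_yes_share(yes, no) - outcome, 1)


for prop, yes, no, _undecided, outcome, reported in TABLE_3:
    print(prop, recomputed_error(yes, no, outcome), reported)
```

Under this assumed definition the recomputed values agree with the reported column to one decimal place, and the survey and the result fall on the same side of 50% for all seven propositions.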

FAQs

What enhancements does sample matching offer over traditional weighting procedures?

The methodology shows substantial improvements in sample representativeness, outperforming existing weighting techniques by ensuring more accurate demographic alignments, particularly for underrepresented groups.
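For comparison, the weighting approach the paper contrasts with sample matching can be sketched in a few lines. The paper describes cell weighting as the ratio of the population fraction to the sample fraction, and mentions trimming weights to a range such as ½ to 2. The sketch below applies that ratio to the single-variable groups of Table 1.1. The function name and its `lower`/`upper` parameters are hypothetical, and the rows of Table 1.1 are overlapping groups rather than true mutually exclusive cells.

```python
# (census share %, average unweighted sample share %), copied from Table 1.1.
TABLE_1_1 = {
    "Blacks": (11.0, 7.9),
    "Hispanics": (12.4, 4.8),
    "Aged 18-24": (12.3, 6.4),
    "HS or less": (46.6, 32.7),
    "Postgraduate": (8.7, 17.2),
    "Never married": (23.8, 16.2),
}


def cell_weight(population_share, sample_share, lower=None, upper=None):
    """Population share divided by sample share, optionally trimmed."""
    weight = population_share / sample_share
    if lower is not None:
        weight = max(weight, lower)
    if upper is not None:
        weight = min(weight, upper)
    return weight


# Untrimmed weights, rounded as in the table's "Implied Weight" column.
implied = {g: round(cell_weight(c, s), 1) for g, (c, s) in TABLE_1_1.items()}

# Weights trimmed to the illustrative range [1/2, 2] mentioned in the paper.
trimmed = {g: round(cell_weight(c, s, 0.5, 2.0), 1) for g, (c, s) in TABLE_1_1.items()}

print(implied)
print(trimmed)
```

With these inputs only the Hispanic weight is affected by trimming, which illustrates the paper's point that trimming discards exactly the large corrections that differential nonresponse calls for.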

How does the selection of the target sample influence sample quality?

Sampling from a well-defined target population using stratified sampling significantly enhances representativeness by correctly grouping demographics, which minimizes selection biases inherent in convenience panels.
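A minimal sketch of proportional stratified sampling follows: partition the frame into strata, allocate the sample in proportion to stratum size, and draw a simple random sample within each stratum. The function name, the `stratum_of` callback, the largest-remainder rounding, and the toy frame are all assumptions made for illustration; the paper does not describe its implementation.

```python
import random
from collections import Counter, defaultdict


def proportional_stratified_sample(frame, stratum_of, n, seed=0):
    """Draw n records, allocating to strata in proportion to their size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in frame:
        strata[stratum_of(record)].append(record)

    # Exact proportional allocation, then largest-remainder rounding
    # so that the integer allocations sum to exactly n.
    total = len(frame)
    exact = {s: n * len(members) / total for s, members in strata.items()}
    alloc = {s: int(q) for s, q in exact.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1

    # Simple random sample without replacement inside each stratum.
    sample = []
    for s, members in strata.items():
        sample.extend(rng.sample(members, alloc[s]))
    return sample


# Toy frame: 1,000 records in three strata of sizes 600, 300 and 100.
frame = [("A", i) for i in range(600)] + [("B", i) for i in range(300)] + [("C", i) for i in range(100)]
target_sample = proportional_stratified_sample(frame, lambda r: r[0], n=50)
counts = Counter(stratum for stratum, _ in target_sample)
print(counts)
```

Because each stratum's share of the sample equals its share of the frame, the resulting target sample is self-representing in the sense the paper describes and needs no additional weighting on the stratification variables.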

What implications does matching methodology have for survey participant demographics?

The study indicates that using comprehensive consumer and voter databases allows researchers to match respondents closely, improving overall representation of demographics not typically captured in standard surveys.
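The proximity matching described in the paper uses a per-attribute distance (typically the absolute difference for numerical variables, occasionally its square) combined into a weighted sum. The sketch below illustrates that idea under several assumptions that are not in the paper: records are plain dictionaries of numerical attributes, matching is greedy and without replacement, and the names `attribute_distance`, `overall_distance` and `match_sample` are hypothetical.

```python
def attribute_distance(x, y, squared=False):
    """|x - y| for a numerical attribute, or its square if requested."""
    diff = abs(x - y)
    return diff * diff if squared else diff


def overall_distance(target, panelist, weights):
    """Weighted sum of per-attribute distances between two records."""
    return sum(
        w * attribute_distance(target[name], panelist[name])
        for name, w in weights.items()
    )


def match_sample(target_sample, panel, weights):
    """Assign each target record its closest still-available panelist.

    Greedy, without replacement (an illustrative assumption). Requires
    at least as many panelists as target records.
    """
    available = dict(enumerate(panel))
    matched = []
    for target in target_sample:
        best = min(
            available,
            key=lambda i: overall_distance(target, available[i], weights),
        )
        matched.append(available.pop(best))
    return matched


targets = [{"age": 30, "income": 40}, {"age": 65, "income": 20}]
panel = [
    {"age": 64, "income": 25},
    {"age": 29, "income": 45},
    {"age": 31, "income": 80},
]
weights = {"age": 1.0, "income": 0.5}

matched = match_sample(targets, panel, weights)
print(matched)
```

Raising the weight on one attribute makes mismatches on it more costly, which is how a large weight pushes the search toward an exact match on that dimension.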

When did the majority of the adult population gain Internet accessibility?

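The paper does not give a specific date. It reports that, at the time of writing (shortly after the 2005 California special election), nearly three quarters of the adult population had Internet access at home, work, or school. It argues that the earlier 'Digital Divide' had diminished substantially, so that coverage was adequate for most research and sample selection was the main remaining problem.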

What are the primary assumptions for validating sample matching's statistical approach?

Sample matching's validity relies on three assumptions: ignorable panel participation, a smooth expected value function, and common support across matching variables among participants and non-participants.
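One way to write the three assumptions symbolically is sketched below. The notation is introduced here for illustration and does not appear in the paper: $Y$ is a survey measurement, $X$ the vector of matching variables, $S \in \{0, 1\}$ an indicator of panel participation, and $f_{X \mid S}$ the conditional density of $X$. Ignorability is stated in its conditional-mean form, following the paper's "on average" wording.

```latex
% Assumption 1 (ignorability): given X, participation does not shift the mean answer.
\mathbb{E}[\,Y \mid X = x,\ S = 1\,] \;=\; \mathbb{E}[\,Y \mid X = x,\ S = 0\,]
\quad \text{for all } x .

% Assumption 2 (smoothness): the regression function is continuously
% differentiable with bounded first derivative.
m(x) = \mathbb{E}[\,Y \mid X = x\,] \in C^{1},
\qquad \sup_{x} \lVert \nabla m(x) \rVert < \infty .

% Assumption 3 (common support): the panelists' density is bounded away
% from zero on the support of the non-panelists.
\exists\, c > 0 : \quad f_{X \mid S = 1}(x) \;\ge\; c
\quad \text{for all } x \in \operatorname{supp}(X \mid S = 0) .
```

Read together, these say that a panelist with nearly the same $x$ as a target record exists (Assumption 3), answers like a non-panelist with that $x$ on average (Assumption 1), and that small gaps in $x$ produce only small gaps in the expected answer (Assumption 2).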
