Samples and Populations

So you've developed your research question, figured out how y'all're going to measure whatever you want to report, and have your survey or interviews ready to go. Now all your demand is other people to become your data.

Y'all might say 'easy!', there's people all around y'all. You lot have a big family tree and surely them and their friends would accept happy to accept your survey. And so at that place's your friends and people yous're in class with. Finding people is way easier than writing the interview questions or developing the survey. That reaction might exist a strawman, perchance you've come up to the conclusion none of this is easy. For your information to be valuable, y'all not only have to ask the right questions, y'all have to ask the right people. The "correct people" aren't the all-time or the smartest people, the right people are driven by what your study is trying to respond and the method yous're using to answer it.

Remember way back in chapter 2 when we looked at this chart and discussed the differences betwixt qualitative and quantitative information.

Qualitative Quantitative
Purpose Understanding underlying motivations or reasons; depth of knowledge Generalize results to the population; make predictions
Sample Small and narrow; non mostly representative Large and broad
Method Interviews, focus groups, example studies Surveys, web scrapping
Analysis Interpretative, content analysis Statistical, numeric

Ane of the biggest differences between quantitative and qualitative data was whether we wanted to exist able to explain something for a lot of people (what percentage of residents in Oklahoma support legalizing marijuana?) versus explaining the reasons for those opinions (why do some people back up legalizing marijuana and others not?). The underlying differences there is whether our goal is explain something about everyone, or whether we're content to explicate information technology about but our respondents.

'Everyone' is called the population. The population in research is whatever grouping the research is trying to answer questions nearly. The population could exist everyone on planet World, everyone in the United States, everyone in rural counties of Iowa, everyone at your university, and on and on. Information technology is merely everyone within the unit you lot are intending to study.

In club to study the population, nosotros typically take a sample or a subset. A sample is simply a smaller number of people from the population that are studied, which nosotros can use to and then empathize the characteristics of the population based on that subset. That'due south why a poll of 1300 likely voters can be used to approximate at who will win your states Governor race. It isn't perfect, and nosotros'll talk nigh the math backside all of it in a afterward affiliate, merely for now we'll just focus on the unlike types of samples yous might apply to written report a population with a survey.

If correctly sampled, we can use the sample to generalize information we get to the population. Generalizability, which we divers earlier, ways we tin assume the responses of people to our study match the responses everyone would have given us. We tin can only do that if the sample is representative of the population, meaning that they are alike on important characteristics such as race, gender, historic period, educational activity. If something makes a big divergence in people's views on a topic in your research and your sample is not balanced, y'all'll go inaccurate results.

Generalizability is more than of a concern with surveys than with interviews. The goal of a survey is to explain something about people beyond the sample you get responses from. You lot'll never see a news headline saying that "53% of 1250 Americans that responded to a poll approve of the President". It's merely worth request those 1250 people if nosotros tin can presume the rest of the Usa feels the same manner overall. With interviews though nosotros're looking for depth from their responses, and and so we are less hopefully that the fifteen people nosotros talk to will exactly lucifer the American population. That doesn't hateful the information nosotros collect from interviews doesn't take value, it just has different uses.

There are two broad types of samples, with several unlike techniques amassed below those. Probability sampling is associated with surveys, and non-probability sampling is often used when conducting interviews. We'll first describe probability samples, before discussing the not-probability options.

The type of sampling you'll use will be based on the type of research you're intending to do. There'south no sample that'due south right or wrong, they can just be more than or less appropriate for the question yous're trying to answer. And if yous use a less appropriate sampling strategy, the answer you get through your research is less probable to be authentic.

Types of Probability Samples

So we just hinted at the idea that depending on the sample yous use, you lot tin can generalize the data you collect from the sample to the population. That volition depend though on whether your sample represents the population. To ensure that your sample is representative of the population, you volition want to use a probability sample. A representative sample refers to whether the characteristics (race, historic period, income, education, etc) of the sample are the aforementioned as the population. Probability sampling is a sampling technique in which every individual in the population has an equal adventure of existence selected every bit a subject for the enquiry.

There are several different types of probability samples you tin can utilize, depending on the resources yous accept available.

Let's start with a simple random sample. In order to utilise a simple random sample all you have to do is accept everyone in your population, throw them in a hat (not literally, yous can just throw their names in a lid), and choose the number of names yous desire to use for your sample. By drawing blindly, you tin can eliminate human bias in constructing the sample and your sample should stand for the population from which it is existence taken.

However, a elementary random sample isn't quite that like shooting fish in a barrel to build. The biggest result is that yous have to know who everyone is in order to randomly select them. What that requires is a sampling frame, a list of all residents in the population. But we don't always have that. There is no list of residents of New York City (or whatever other city). Organizations that do have such a list wont just give it away. Endeavor to ask your university for a list and contact information of anybody at your school and so you can do a survey? They wont give information technology to you, for privacy reasons. It's really harder to call back of popultions y'all could hands develop a sample frame for than those you lot can't. If you can become or build a sampling frame, the piece of work of a simple random sample is fairly simple, but that'due south the biggest claiming.

Most of the time a true sampling frame is impossible to larn, so researcher have to settle for something approximating a consummate list. Earlier generations of researchers could apply the random punch method to contact a random sample of Americans, considering every household had a single phone. To apply it you just pick up the phone and dial random numbers. Assuming the numbers are really random, anyone might exist called. That method actually worked somewhat well, until people stopped having dwelling house phone numbers and somewhen stopped answering the phone. Information technology'south a fun mental exercise to think virtually how you lot would become about creating a sampling frame for different groups though; think through where you would expect to find a list of anybody in these groups:

Plumbers Contempo outset-fourth dimension fathers Members of gyms

The best way to get an bodily sampling frame is likely to buy one from a individual company that buys data on people from all the different websites nosotros employ.

Permit'southward say yous do have a sampling frame though. For example, you might be hired to do a survey of members of the Republican Party in the state of Utah to sympathise their political priorities this yr, and the organization could give y'all a list of their members because they've hired you lot to do the reserach. I method of amalgam a simple random sample would be to assign each name on the listing a number, and and then produce a list of random numbers. In one case yous've matched the random numbers to the list, you lot've got your sample. See the case using the list of 20 names below

and the list of 5 random numbers.

Systematic sampling is like to uncomplicated random sampling in that information technology begins with a listing of the population, but instead of choosing random numbers one would select every kth name on the list. What the heck is a kth? K merely refers to how far autonomously the names are on the list yous're selecting. And so if you lot want to sample one-tenth of the population, you lot'd select every 10th name. In order to know the k for your study you lot need to know your sample size (say g) and the size of the population (75000). You can divide the size of the population by the sample (75000/1000), which will produce your k (750). As long as the list does not incorporate any subconscious lodge, this sampling method is as good as the random sampling method, but its only advantage over the random sampling technique is simplicity. If we used the same list as above and wanted to survey 1/5th of the population, we'd include iv of the names on the listing. It'due south of import with systematic samples to randomize the starting point in the list, otherwise people with A names will be oversampled. If we started with the 3rd proper noun, we'd select Annabelle Frye, Cristobal Padilla, Jennie Vang, and Virginia Guzman, every bit shown beneath. And so in order to use a systematic sample, we demand three things, the population size (denoted as Due north), the sample size we want (north) and one thousand, which we calculate by dividing the population by the sample).

N= 20 (Population Size) northward= 4 (Sample Size) thousand= 5 {20/4 (kth element) selection interval}

We can also employ a stratified sample, but that requires knowing more near the population than simply their names. A stratified sample divides the study population into relevant subgroups, and and so draws a sample from each subgroup. Stratified sampling can be used if yous're very concerned virtually ensuring balance in the sample or there may be a problem of underrepresentation amid certain groups when responses are received. Non everyone in your sample is equally probable to answer a survey. Say for instance we're trying to predict who will win an election in a county with three cities. In metropolis A there are ane million college students, in urban center B there are 2 one thousand thousand families, and in City C at that place are 3 1000000 retirees. You know that retirees are more likely than busy college students or parents to respond to a poll. So yous break the sample into iii parts, ensuring that you get 100 responses from City A, 200 from Metropolis B, and 300 from City C, so the three cities would match the population. A stratified sample provides the researcher command over the subgroups that are included in the sample, whereas uncomplicated random sampling does non guarantee that whatever one type of person will be included in the final sample. A disadvantage is that information technology is more than complex to organize and analyze the results compared to uncomplicated random sampling.

Cluster sampling is an approach that begins by sampling groups (or clusters) of population elements and then selects elements from within those groups. A researcher would apply cluster sampling if getting admission to elements in an entrie population is likewise challenging. For instance, a study on students in schools would probably benefit from randomly selecting from all students at the 36 elementary schools in a fictional city. Just getting contact data for all students would be very hard. And so the researcher might piece of work with principals at several schools and survey those students. The researcher would need to ensure that the students surveyed at the schools are similar to students throughout the unabridged city, and greater admission and participation within each cluster may make that possible.

The prototype below shows how this tin can work, although the example is oversimplified. Say we have 12 students that are in 6 classrooms. The school is in total one/quaternary green (iii/12), i/quaternary yellow (3/12), and half blueish (6/12). By selecting the correct clusters from within the school our sample tin can be representative of the unabridged school, assuming these colors are the just significant divergence between the students. In the real globe, y'all'd want to match the clusters and population based on race, gender, age, income, etc. And I should indicate out that this is an overly simplified example. What if 5/12s of the school was yellowish and 1/12th was green, how would I go the correct proportions? I couldn't, simply you'd do the all-time you lot could. You notwithstanding wouldn't want iv yellows in the sample, you'd just try to approximiate the population characteristics equally best you lot can.

Actually Doing a Survey

All of that probably sounds pretty complicated. Identifying your population shouldn't be too difficult, but how would y'all ever go a sampling frame? And so actually identifying who to include… Information technology's probably a bit overwhelming and makes doing a good survey sound incommunicable.

Researchers using surveys aren't superhuman though. Often times, they utilize a lilliputian aid. Because surveys are really valuable, and because researchers rely on them pretty often, there has been substantial growth in companies that can help to get ane'southward survey to its intended audience.

One popular resources is Amazon'due south Mechanical Turk (more commonly known equally MTurk). MTurk is at its near basic a website where workers look for jobs (chosen hits) to be listed by employers, and choose whether to do the task or not for a set advantage. MTurk has grown over the last decade to exist a common source of survey participants in the social sciences, in function because hiring workers costs very picayune (you can get some surveys completed for penny's). That means you lot tin go your survey completed with a small-scale grant ($1-2k at the low cease) and get the data back in a few hours. Really, it's a quick and like shooting fish in a barrel way to run a survey.

However, the workers aren't perfectly representative of the average American. For instance, researchers accept found that MTurk respondents are younger, better educated, and earn less than the average American.

One style to get effectually that issue, which tin can be used with MTurk or whatever survey, is to weight the responses. Considering with MTurk yous'll get fewer responses from older, less educated, and richer Americans, those responses you lot do give you lot want to count for more to make your sample more than representative of the population. Oversimplified example incoming!

Imagine you're setting upward a pizza political party for your class. There are 9 people in your form, 4 men and five women. You only got four responses from the men, and 3 from the women. All 4 men wanted peperoni pizza, while the 3 women want a combination. Pepperoni wins correct, 4 to 3? Non if you presume that the people that didn't answer are the same as the ones that did. If yous weight the responses to match the population (the full class of nine), a combination pizza is the winner.

Because you know the population of women is 5, you tin weight the 3 responses from women by 5/3 = 1.6667. If nosotros weight (or multiply) each vote we did receive from a woman by one.6667, each vote for a combination now equals 1.6667, meaning that the three votes for combination total 5. Because we received a vote from every man in the form, we simply weight their votes past 1. The big assumption we have to make is that the people we didn't hear from (the ii women that didn't vote) are like to the ones we did hear from. And if nosotros don't get whatsoever responses from a group we don't have annihilation to infer their preferences or views from.

Let'due south go through a slightly more than complex example, still only considering i quality almost people in the class. Let's say your class actually has 100 students, simply you but received votes from l. And, what type of pizza people voted for is mixed, merely men still prefer peperoni overall, and women all the same prefer combination. The class is lx% female person and twoscore% male.

We received 21 votes from women out of the threescore, then we can weight their responses past 60/21 to represent the population. Nosotros got 29 votes out of the 40 for men, and then their responses tin be weighted by 40/29. Run into the math below.

53.8 votes for combination? That might seem a picayune odd, just weighting isn't a perfect science. We can't identify what a non-respondent would accept said exactly, all we can do is use the responses of other similar people to brand a good guess. That issue often comes up in polling, where pollsters have to judge who is going to vote in a given election in order to project who will win. And we can weight on any characteristic of a person nosotros think will be important, lonely or in combination. Mod polls weight on historic period, gender, voting habits, education, and more to brand the results as generalizable as possible.

In that location'south an appendix later on in this book where I walk through the bodily steps of creating weights for a sample in R, if anyone really does a survey. I intended this section to prove that doing a good survey might be simpler than information technology seemed, but now information technology might sound fifty-fifty more than hard. A expert lesson to take though is that there'southward ever another door to go through, another hurdle to improve your methods. Being practiced at research just means beingness constantly prepared to be given a new challenge, and being able to find some other solution.

Not-Probability Sampling

Qualitative researchers' main objective is to gain an in-depth understanding on the subject matter they are studying, rather than attempting to generalize results to the population. As such, non-probability sampling is more common because of the researchers desire to gain information not from random elements of the population, only rather from specific individuals.

Random pick is not used in nonprobability sampling. Instead, the personal judgment of the researcher determines who will be included in the sample. Typically, researchers may base their pick on availability, quotas, or other criteria. Still, not all members of the population are given an equal chance to be included in the sample. This nonrandom arroyo results in not knowing whether the sample represents the entire population. Consequently, researchers are not able to brand valid generalizations nigh the population.

As with probability sampling, there are several types of non-probability samples. Convenience sampling, also known as adventitious or opportunity sampling, is a process of choosing a sample that is hands attainable and readily available to the researcher. Researchers tend to collect samples from user-friendly locations such as their place of employment, a location, school, or other close affiliation. Although this technique allows for quick and easy access to available participants, a large role of the population is excluded from the sample.

For example, researchers (particularly in psychology) often rely on inquiry subjects that are at their universities. That is highly convenient, students are cheap to rent and readily available on campuses. Still, it means the results of the written report may have limited power to predict motivations or behaviors of people that aren't included in the sample, i.e., people outside the age of 18-22 that are going to college.

If I enquire y'all to get discover out whether people approve of the mayor or not, and tell yous I want 500 people's opinions, should you go stand in front of the local grocery shop? That would exist convinient, and the people coming will exist random, right? Not really. If you stand up outside a rural Piggly Wiggly or an urban Whole Foods, do y'all think yous'll meet the aforementioned people? Probably non, people'south chracteristics make the more or less likely to exist in those locations. This technique runs the loftier gamble of over- or under-representation, biased results, too as an inability to make generalizations most the larger population. As the name implies though, it is user-friendly.

Purposive sampling, also known as judgmental or selective sampling, refers to a method in which the researcher decides who will be selected for the sample based on who or what is relevant to the written report'south purpose. The researcher must starting time identify a specific feature of the population that can best assistance answer the research question. Then, they tin can deliberately select a sample that meets that item criterion. Typically, the sample is small with very specific experiences and perspectives. For instance, if I wanted to sympathize the experiences of prominent foreign-built-in politicians in the U.s.a., I would purposefully build a sample of… prominent foreign-born politicians in the United States. That would exclude anyone that was born in the United States or and that wasn't a politician, and I'd have to define what I meant by prominent. Purposive sampling is susceptible to errors in judgment by the researcher and selection bias due to a lack of random sampling, just when attempting to research pocket-sized communities information technology tin can exist effective.

When dealing with small-scale and difficult to reach communities researchers sometimes employ snowball samples, also known as chain referral sampling. Snowball sampling is a process in which the researcher selects an initial participant for the sample, then asks that participant to recruit or refer additional participants who have like traits equally them. The bicycle continues until the needed sample size is obtained.

This technique is used when the study calls for participants who are hard to find because of a unique or rare quality or when a participant does not want to be constitute because they are part of a stigmatized grouping or beliefs. Examples may include people with rare diseases, sex workers, or a kid sex offenders. It would be incommunicable to find an accurate list of sexual practice workers anywhere, and surveying the general population nearly whether that is their job volition produce false responses equally people will exist unwilling to identify themselves. As such, a common method is to gain the trust of one individual inside the community, who can then introduce y'all to others. It is important that the researcher builds rapport and gains trust so that participants can be comfy contributing to the study, but that must also be balanced by mainting objectivity in the research.

Snowball sampling is a useful method for locating hard to reach populations only cannot guarantee a representative sample considering each contact will be based upon your last. For instance, let's say yous're studying illegal fight clubs in your state. Some fight clubs allow weapons in the fights, while others completely ban them; those two types of clubs never interreact because of their disagreement virtually whether weapons should be immune, and there'south no overlap between them (no members in both type of club). If your initial contact is with a club that uses weapons, all of your subsequent contacts will be within that customs so you'll never understand the differences. If you didn't know there were two types of clubs when you started, you lot'll never even know you're only researching one-half of the community. As such, snowball sampling can be a necessary technique when in that location are no other options, but information technology does have limitations.

Quota Sampling is a process in which the researcher must first carve up a population into mutually sectional subgroups, similar to stratified sampling. Depending on what is relevant to the report, subgroups can be based on a known feature such equally age, race, gender, etc. Secondly, the researcher must select a sample from each subgroup to fit their predefined quotas. Quota sampling is used for the same reason as stratified sampling, to ensure that your sample has representation of sure groups. For instance, let'due south say that you're studying sexual harassment in the workplace, and men are much more willing to discuss their experiences than women. Y'all might choose to decide that half of your last sample volition be women, and stop requesting interviews with men in one case you lot fill your quota. The cadre difference is that while stratified sampling chooses randomly from within the different groups, quota sampling does not. A quota sample can either be proportional or not-proportional. Proportional quota sampling refers to ensuring that the quotas in the sample match the population (if 35% of the company is female, 35% of the sample should be female). Non-proportional sampling allows you to select your own quota sizes. If y'all think the experiences of females with sexual harassment are more important to your research, you tin can include whatever pct of females you desire.

Dangers in sampling

Now that nosotros've described all the unlike ways that one could create a sample, we can talk more about the pitfalls of sampling. Ensuring a quality sample ways asking yourself some basic questions:

  • Who is in the sample?
  • How were they sampled?
  • Why were they sampled?

A repast is oft only as good equally the ingredients you employ, and your data volition only exist as expert as the sample. If you collect data from the incorrect people, you'll get the incorrect respond. You'll however get an reply, it'll but be inaccurate. And I want to reemphasize here wrong people just refers to inappropriate for your study. If I want to study bullying in heart schools, but I only talk to people that live in a retirement home, how authentic or relevant volition the data I gather exist? Sure, they might have grandchildren in center school, and they may remember their experiences. Just wouldn't my data be more relevant if I talked to students in middle school, or perhaps a mix of teachers, parents, and students? I'll get an answer from retirees, but it wont be the one I demand. The sample has to be appropriate to the enquiry question.

Is a bigger sample ever better? Not necessarily. A larger sample can be useful, simply a more representative i of the population is better. That was made painfully clear when the mag Literary Digest ran a poll to predict who would win the 1936 presidential ballot between Alf Landon and incumbent Franklin Roosevelt. Literary Digest had run the poll since 1916, and had been correct in predicting the outcome every time. It was the largest poll ever, and they received responses for two.27 million people. They substantially received responses from 1 percent of the American population, while many modern polls employ merely 1000 responses for a much more populous country. What did they predict? They showed that Alf Landon would be the overwhelming winner, yet when the election was held Roosevelt won every state except Maine and Vermont. It was one of the most decisive victories in Presidential history.

So what went wrong for the Literary Digest? Their poll was large (gigantic!), but it wasn't representative of likely voters. They polled their own readership, which tended to be more educated and wealthy on average, forth with people on a list of those with registered automobiles and telephone users (both of which tended to be owned by the wealthy at that time). Thus, the poll largely ignored the bulk of Americans, who concluded up voting for Roosevelt. The Literary Digest poll is famous for being wrong, but led to meaning improvements in the science of polling to avoid similar mistakes in the time to come. Researchers have learned a lot in the century since that mistake, even if polling and surveys still aren't (and can't be) perfect.

What kind of sampling strategy did Literary Digest use? Convenience, they relied on lists they had available, rather than endeavour to ensure every American was included on their list. A representative poll of ii one thousand thousand people will give y'all more accurate results than a representative poll of 2 thou, but I'll take the smaller more representative poll than a larger one that uses convenience sampling any day.

Summary

Picking the correct type of sample is critical to getting an accurate respond to your reserach question. At that place are a lot of differnet options in how you lot can select the people to participate in your research, but typically but one that is both correct and possible depending on the research you're doing. In the adjacent chapter we'll talk about a few other methods for conducting reseach, some that don't include any sampling by you.