Sample Size for Perceptual Studies - Another Question
Doc: I'm trying to figure out what the best sample size is for our perceptual study we are going to do. What is the best size, 400 or 500? - Dean
Dean: Determining the best sample size depends on a variety of things. It's not just about 400 or 500. The main things to consider to determine a sample size for a research project are (1) sampling error; (2) how the data will be analyzed; and (3) cost. I'll explain these in more detail:
Sampling Error. Sampling error is the error introduced by using a sample: it depends on the respondents in a research project, how the respondents were selected, how similar they are to the rest of the population/universe, and more. Any study conducted with a sample (as opposed to a census, where all respondents/elements are tested) has some amount of sampling error—it's always present, along with measurement error and random error. (A census still has measurement error and random error, so the results of any study are never perfect.)
When selecting a sample size, the question is: How much sampling error is acceptable? If you go to the sampling error calculator on my business website, you'll find that, at the 95% confidence level, a sample of 400 has about ±4.9% sampling error. This means that if 50% of your sample agrees or disagrees on something, the "real" percentage is between 45.1% and 54.9%. That isn't too bad for a radio research project.
Now, if you calculate the error for a sample of 500, you'll see that sampling error is about ±4.4%. Is the reduction of ±0.5% important? The "real" answer would fall between 45.6% and 54.4%. Will this significantly improve your analysis and use of the data? If so, then you should use a sample of 500. However, you might also look at other sample sizes—what about 300 (±5.7% error) or 350 (±5.2% error) or 425 (±4.8% error)? There is no rule or guideline stating that you must use 400 or 500 respondents.
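All of these ± figures come from the standard formula for estimated sampling error at the 95% confidence level: 1.96 × √(p(100 − p)/n), with p = 50 for the worst case. A minimal sketch in Python (the function name is mine, not from any calculator):

```python
import math

def sampling_error(n, p=50.0, z=1.96):
    """Estimated sampling error in percentage points for a sample of size n,
    at the confidence level implied by z (1.96 = 95%).
    p = 50 gives the maximum (worst-case) error."""
    return z * math.sqrt(p * (100.0 - p) / n)

for n in (300, 350, 400, 425, 500):
    print(f"n = {n}: about ±{sampling_error(n):.1f}%")
```

The same function reproduces the other figures in this column, e.g., roughly ±13.9% for a sample of 50.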
Data Analysis and Use. The reason sample sizes of 400 and 500 are customary in media research relates to how the data are interpreted. In most situations, media research data are displayed in typical marketing age cells: 18-24, 25-34, 35-44, and 45-54. They are also usually broken down by male and female, which produces a total of 8 cells.
Statistical convention holds that a sample of 50 is the smallest that should be used to provide valid and reliable results, so eight cells of 50 produce a total sample of 400. At some time in the past, someone decided that a sample of 500 sounds better, and that's the reason for the other typically used sample size. A reduction of ±0.5% error in a behavioral research study is not a valid reason to increase the sample by 100 people.
"But wait a minute," you might ask, "What's the deal with 50 people per cell? Isn't that too small?" No, it isn't if you have good screener questions to recruit your respondents. Let me explain.
The cause of sampling error is variance (differences) among the respondents. The more the respondents are alike, the lower the variance, and, therefore, sampling error; the wider the variance, the larger the sampling error. For example, if you conducted a study with people between the ages of 12 and 65, you would find a huge amount of difference among the respondents. The young kids' responses would be nothing like the older folks, and the sampling error would be astronomical. A huge sampling error virtually guarantees that it will be difficult, if not impossible, to generalize the results to the population from which the sample was drawn.
Now, if we introduce screener questions (filter questions), we can reduce the amount of variance. For example, instead of investigating everyone between the ages of 12 and 65, let's say we include only women 18-34 (two filters). Then we could add listening to a specific radio station or radio format (one filter, but a big one). We could also add that the women listen to the station or format for at least one hour each day (four filters total).
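Those four screeners translate directly into a simple qualification check. A minimal sketch, with hypothetical field names of my own choosing:

```python
def qualifies(respondent):
    """The four screeners from the example: women, 18-34, who listen to the
    station, for at least one hour each day. Field names are hypothetical."""
    return (respondent["gender"] == "F"
            and 18 <= respondent["age"] <= 34
            and respondent["listens_to_station"]
            and respondent["hours_per_day"] >= 1.0)

candidate = {"gender": "F", "age": 27,
             "listens_to_station": True, "hours_per_day": 2.0}
print(qualifies(candidate))  # this candidate passes all four screeners
```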
Do you see what happens with screener/filter questions? They naturally reduce variance among the respondents, and this reduces sampling error. In reality, this is done with virtually all research, so even though the sampling error calculator says that a sample of 400 has a sampling error of about ±4.9%, it's actually less than that because of the screeners included in selecting the respondents who participate in research projects. With good screeners, I would imagine that sampling error could probably be reduced by 50%. This means that although the sampling error calculator indicates a sampling error of ±13.9% for a sample of 50, in reality, it is probably about ±7.0%. Good screeners/filters allow you to use smaller samples.
But there is a caveat here. What you don't want to do is include too many screeners/filters. Why? Because if you add too many, you'll wind up looking for the proverbial "needle in a haystack." I have seen research projects include so many screeners/filters that it was virtually impossible to find qualified people. So don't get carried away.
Cost. The last of the three main criteria for selecting a sample size relates to how much money you have to spend for the project. In perceptual studies and callout, costs are based on: (1) sample size; (2) length of interview; and (3) difficulty in finding qualified respondents (known as incidence).
If you decide to use a sample of 500 instead of 400, expect to pay several thousand dollars more for the study. Do you understand the relationship between sampling error and cost? Are you willing to pay several thousand dollars more to decrease error rate by only ±0.5%?
That's a very brief discussion of these three items. There are other things to consider, but I hope this helps you with your decision.
Sample Size and Z-scores
1. What sample size do you recommend for an auditorium music test and if you test less than you recommend, at what point are the data no longer considered reliable?
2. In my format (Modern AC) we typically only get about 30 80's songs to test well. We haven't used Z-Scores to compare (as you say) apples to oranges, or 80's to currents. If a good testing current scores a 3.5 and an average testing 80's record scores a 2.3 but both score a 1.5 in their respective clusters are they to be considered equal?
3. Did you ever get a copy of Ries & Trout's "22 Immutable Laws of Marketing?"- Andy
Andy: Answer 1: Since a music test uses a measurement procedure known as "repeated measures," you can get away with a smaller sample than other types of research. However, regardless of the type of research you conduct, you should "back into" your sample size.
In an auditorium test, use a minimum of 25 respondents per cell. That's a bare minimum now, and if you want less sampling error, then increase the sample to whatever size you want for the sampling error you're willing to accept (and your budget will allow). So, for example, if your target is 18-34, you'd probably divide it 18-24 and 25-34. That's two cells, and 25 X 2 = 50 minimum. If you also split male/female, then you'll have four cells and need 100 respondents.
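The "back into your sample size" arithmetic above can be sketched as a tiny helper (the per-cell minimum of 25 is the default):

```python
from math import prod

def minimum_sample(cell_splits, per_cell=25):
    """Minimum auditorium-test sample: multiply the number of cells in each
    demographic split together, then multiply by the per-cell minimum."""
    cells = prod(cell_splits)
    return cells * per_cell

print(minimum_sample([2]))     # 18-24 / 25-34: two cells, 50 respondents
print(minimum_sample([2, 2]))  # also split male/female: four cells, 100
```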
These are the basic sampling procedures. The sample size depends on how you plan to interpret the data. Remember that a music test (any type) is not hard science because we collect information from human subjects. Don't interpret your music test scores as a message from heaven or somewhere else. Allow yourself some flexibility in interpreting scores, and don't go crazy with too many cells. That's the major problem I see with PDs and many consultants—they divide the data into too many cells, and that's not good.
Answer 2: I can't answer this question because you don't give me enough information. Are the scores from the same group? What type of scale did you use? Are the 3.5 and 2.3 total sample scores or from different cells? Were the songs tested in the same session? Finally, I don't understand anything about your "clusters." How can a 3.5 and a 2.3 both become a 1.5 in their respective clusters? What are the clusters, and how do you construct them?
Give me more information and I'll try to answer your question. However, I can say that you need to convert the 3.5 and 2.3 to Z-scores before you (or I) try to compare them. There is no other legitimate way to do it.
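Converting raw scores to Z-scores requires only each group's mean and standard deviation. A sketch with hypothetical music-test scores (the lists below are illustrative, not Andy's actual data):

```python
import statistics

def z_scores(scores):
    """Convert raw scores to Z-scores (standard scores) so songs tested
    against different clusters or eras can be compared on one scale."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    return [(s - mean) / sd for s in scores]

# Hypothetical: a current that tested 3.5 among currents, and an
# '80s song that tested 2.3 among other '80s songs.
currents = [3.5, 3.1, 2.9, 2.7, 2.5]
eighties = [2.3, 2.1, 2.0, 1.8, 1.6]
print(z_scores(currents)[0], z_scores(eighties)[0])
```

A positive Z-score means the song tested above its own group's average; that is what makes cross-era comparison legitimate.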
Answer 3: No I didn't get a copy of the Ries and Trout book.
Sample Skew - Prize Pigs
Dr. Roger, Sir: I’ve been out of the business for a couple of years now, and I was wondering if anybody has addressed the issues of Prize Pigs skewing the research in various markets. This is something we first touched on during a focus group in Santa Rosa in 2001 when I once again noticed that nearly 50% of the respondents were people documented for years in our winner’s database.
Plus, do you think that this may be a reason that programmers today need not worry about the general audience (97% of the available listening public) and concentrate on what they now call P1s (or what we always called Groupies).
Please don’t take this for a rank on you or your service for I do have the utmost respect for you. I feel that once again a little interjection of objective reasoning is warranted for both the health and welfare of not just your business, but for the industry as a whole. - Beau
Beau: I’m a bit confused with your note, so I’ll do the best I can. The first thing I’m confused about is the respondents in the Santa Rosa focus groups. If 50% of the respondents were in the radio station’s winner’s database, it may be possible that the database was used for screening purposes. Or, it may be that the screener was designed to recruit P1s (station fans), which means that some contest winners will probably be included.
In either case, the design of the screener is going to produce a specific type of respondent—non-listeners, listeners other than P1s, and P1s. If the correct respondents show up for a focus group (or any other research), then there is no problem. However, if you recruit for infrequent listeners and you have a room full (or study full) of P1s, then something is wrong.
Secondly, you then relate the “Prize Pigs” idea to PDs not having to worry about the “general audience.” I don’t see the connection here, because, as I said, focus group (or any other research) recruiting, if conducted correctly, includes a specific type of respondent. I don’t think that PDs ignore the general audience. Or, I should say that I haven’t met any PDs who ignore the general audience.
What I usually see are PDs who concentrate on P1s, but who are always looking at other listeners to convert them to P1s. There is nothing wrong with concentrating on P1s, but I do think it’s wrong to only concentrate on P1s. It’s necessary to continually search for new listeners. If you don’t, the P1s will eventually fade away.
I don’t take your comments as a “rank on me or my service.” I’m too old for that.
Sample - Involuntary
I appreciate your answer regarding Arbitron methodology. You mentioned getting "voluntary" participation; how do you feel about research conducted involuntarily, such as with electronic measuring devices? - DS
DS: I'm guessing that you mean involuntary sampling such as monitoring radios in cars as they drive by a stop sign or something. My problem with this methodology is that just because a monitor indicates that a specific radio station is on in a car doesn't mean that the person or persons in the car are actually listening to the station, or if the station is on voluntarily.
For example, the radio could be broken and only pick up one station. The car could be a rental, and the driver or passenger did not select the station. The driver or passengers may be from the Czech Republic. The driver or passenger could also have hit the scan button, so the radio paused on the station for only a few seconds, just long enough to be counted. And so on.
This would be similar to sitting in a mall and counting how many people are eating a Tootsie Roll Pop based on whether the person has a sucker stick in his or her mouth. You could count all day, but you wouldn't know whether the "stick" was from another type of sucker, a toothpick, a skinny cigarette or marijuana cigarette, or a rolled-up piece of paper.
Sample - Valid
Our state just passed a telemarketing law that allows you to get on a list for free, and not be called (and it works!). Combine that with a large number of people who don’t answer their home phone and I’ve been wondering if Arbitron diary-keepers can still be considered a statistically valid sample? A typical diary-keeper is:
1. One who doesn’t screen calls.
2. One who has the time to accept a diary and log their listening habits.
3. One who will do it for no compensation.
Who is this person? This is not a slam on Arbitron. It’s a question about any research sample being valid in today’s anti-telemarketing environment. - Anonymous
Anon: Excellent question, but before I get to an answer, we need to agree on a few terms. First is “valid.” In research sampling, this term means, “are you testing the type of person you think you’re testing.”
Secondly, you use the word “statistically” in front of “valid” in your second sentence. In this case, “statistically” isn’t needed because we aren’t discussing the validity of the statistics used in surveys. What we are concerned with here is whether the sample involved in Arbitron or in other radio (or media) research is valid. OK?
Thirdly, an important concept you didn’t mention is reliability, which in research means, “are the results consistent over time, or over several studies (or Arbitron surveys)?” OK again?
Now to your question…
In theory, and I want to stress theory, a random sample is supposed to work like this. Let’s say I want to conduct a telephone survey with 400 respondents. I randomly select 400 people using a legitimate procedure and call only those 400 people who would then participate in my telephone survey. That’s the way the theory goes. (The most basic type of probability sampling is the simple random sample where each subject or unit in the population has an equal chance of being selected.)
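In code, the textbook procedure looks like this (the sampling frame of 100,000 units is hypothetical):

```python
import random

# A simple random sample: every unit in the sampling frame has an
# equal chance of being selected.
frame = list(range(1, 100001))       # e.g., 100,000 phone numbers
sample = random.sample(frame, 400)   # draw 400 without replacement

print(len(sample), len(set(sample)))  # 400 distinct selections
```

The theory ends here; as explained next, what happens after the draw is where reality departs from it.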
In reality, the theoretical approach to telephone research, or any research with human subjects, never….EVER…works that way because some percentage (it varies) of the originally selected random sample will not participate for one reason or another. For example, people have caller IDs, states have telemarketing laws (as you explain…and Colorado just passed the same type of law), people hang up immediately, and so on.
So the reality of behavioral research is that a random sample is never used. This is true for all radio research, all TV research, all newspaper research, and all consumer products and services research. Behavioral research uses volunteer samples—people who agree to participate in a research project.
Let me repeat—there is no behavioral research that uses a sample as dictated by sampling theory. As soon as one person from the originally selected random sample…only one…refuses to participate in the study, the complete randomness of the sample goes in the porcelain receptacle.
What does that mean? As I said, all behavioral research uses, and has always used, volunteer samples. That’s the fact, Jack. In reality, a sample involved in a behavioral research study should be described as a “volunteer sample from a randomly selected group of respondents.”
Does that make all research, Arbitron and all other radio research, invalid and unreliable? Nope. On the contrary. And here’s why.
People who voluntarily participate in Arbitron surveys and all other media surveys share the same basic characteristics, such as a higher level of interest in local, national, and international events, higher socio-economic status, higher levels of interest in politics, and more. In other words, research volunteers have been shown to be more active in a wide variety of things. I didn’t find this out alone. This has been documented by other researchers, particularly Rosenthal and Rosnow in their 1969 book, “Artifact in Behavioral Research.”
People who participate in radio and TV (and all media) studies tend to be heavier users of the media and are likely to know more about the media. I have found in numerous studies that these people also tend to perceive that their opinions have an effect on media content.
The same types of samples have been used in media research for several decades. This has allowed researchers to verify both the validity (are these the correct people?) and reliability (are results consistent over time?) of the samples. The research is not conducted with the thought that the samples are invalid or unreliable; there is too much data demonstrating otherwise.
Results from syndicated ratings companies like Arbitron and Nielsen are extremely reliable, especially Nielsen, since it uses a panel sample for its NTI (Nielsen Television Index) numbers. You can check the results yourself by looking at results from several books.
OK, so the data support that media samples are reliable and valid, but will that continue? I say yes, and here are the reasons:
Although several states have passed, or will soon pass, “telemarketing” laws, these laws, as far as I know, do not apply to legitimate research companies like Arbitron, Nielsen, and smaller companies (such as mine) that conduct perceptual studies. The telemarketing laws probably will not reduce the number of people who participate in surveys because my guess is that the people who sign up for the telemarketing exclusion are the same people who would have refused to participate before the law was passed.
However, even if the new telemarketing laws and increases in caller ID do reduce the number of participants, all of the companies involved will continue to deal with the same sample of interested participants.
Just like anything else that changes, researchers and broadcasters will develop new ways to find a sample of respondents. For example, one of the best ways I can think of is to use a radio station’s database as a start for sample selection. Radio station databases are loaded with highly interested radio listeners…perfect for radio stations and researchers.
Finally, you are on target with your comment about compensation. In fact, I can’t think of a researcher who hasn’t discussed this in the past year or so. The answer to this is, “So what? If we have to pay people for their answers, we simply have to pay them. Case closed.” (By the way, everything I know indicates that paid respondents’ answers are no different from those of unpaid respondents.)
In the face of adversity, smart people tend to develop answers and solutions. The situation with sampling in behavioral research is merely a speed bump that needs to be crossed. Does that help?
Sampling Error
I hear the term "sampling error" mentioned a lot. What is that? – Mark
Mark: The definition of sampling error is: the estimated difference between a sample and the population from which the sample was drawn. As you probably know, most research uses a sample selected from the population rather than investigating the entire population (a census). This is done for a few reasons. First, it takes too much time and money to investigate the entire population. Second, research experience in all fields shows that we can generalize the results from a sample to a population with a great deal of accuracy (as long as the sample is selected correctly).
A typical comment from radio people who don't have research experience is, "How can you do a research study in (insert any market) with only 400 or 500 people and say that they represent the entire market?" Well, you can as long as the research follows the guidelines of the scientific method and sampling error is taken into account when analyzing the data.
One of my teachers many years ago said, when asked by a student about sampling from the population, "We live in a world of small sample statistics. Get used to it." My teacher was right, and there are countless examples demonstrating that small sample statistics really "work." For example, the following list shows the final poll results of the five national media-sponsored polls conducted for the 1996 presidential contest between Bill Clinton and Bob Dole. Each of these polls used samples selected from the entire U.S. population.
The actual vote was Clinton 49% and Dole 41%. The five major polls predicted (shown as Clinton/Dole):
CBS/New York Times - 53/35
USA Today/CNN/Gallup - 52/41
Harris - 51/39
ABC News/Washington Post - 51/39
NBC News/Wall Street Journal - 49/37
As you can see, considering sampling error of about ±4%, all of these polls accurately predicted the final outcome of the election with national samples of about 1,000 respondents. Sampling, as long as it is done correctly, really does work.
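A quick check of the figures above, looking at the Clinton share: every poll picked the winner, and every Clinton estimate is within the ±4% margin of the actual 49% (the Dole figures run lower than the actual vote, presumably because of undecided and third-party respondents).

```python
# Final 1996 poll numbers from the text, as (Clinton, Dole) percentages
polls = {
    "CBS/New York Times": (53, 35),
    "USA Today/CNN/Gallup": (52, 41),
    "Harris": (51, 39),
    "ABC News/Washington Post": (51, 39),
    "NBC News/Wall Street Journal": (49, 37),
}
actual_clinton = 49

for name, (clinton, dole) in polls.items():
    picked_winner = clinton > dole
    within_margin = abs(clinton - actual_clinton) <= 4  # ±4% sampling error
    print(f"{name}: winner correct={picked_winner}, within margin={within_margin}")
```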
One more thing—just because a sample is large (let's say in the 1,000s) doesn't mean that it is better than a sample in the 100s. Any sample can be selected incorrectly, and size alone doesn't make it good. It is better to use a correctly selected small sample than an incorrectly selected large sample.
Sampling Error - 2
Is there sampling error in every research project? – Anonymous
Anon: No. There is no sampling error when you conduct a census—where every respondent or element is included. However, in a census, you still have measurement error and random error. A census does not guarantee that your results are 100% accurate.
Sampling Error Calculator
Yo, Doc…We’re thinking about doing a telephone study for our radio station and I’d like to know the sampling error for different sample sizes. Would you tell me what the sampling error is for sample of 200, 300, 400, and 500? - Anonymous
Anon: Yo to you too. Here’s an easy way to get those numbers. Go to my business website and click on “Sample Error” at the left of your screen. You can find the estimated sampling error for any sample size.
Sampling Error - Revisited
Hi, I wrote earlier about sampling error. You mentioned that you would give me the formula if I need it. I need it. The reason I need it is because I was in a meeting with our researcher and our GM asked the researcher, "What is the formula for sampling error?" After looking at the ceiling for a while, the researcher said, "Well, that's a complicated question for this meeting. Let me get back to you on that."
It was very clear that the researcher didn't know the formula and was stalling for time. I would just like to know the formula for "ammunition" for me. By the way, why wouldn't the researcher know the formula? Thanks. - Karyn
Karyn: I don't know why your researcher didn't know the typical formula used for sampling error. I have a suspicion, but I won't get into that here.
The typical sampling error formula used in most research is shown in the next question below. In the formula, SE(p) stands for "standard error" (another term for sampling error), "p" stands for percentage (the percentage of responses to a question), and "N" stands for sample size. If you want to compute the estimated maximum error, use p = 50.
The Z-score value changes depending on the level of confidence you're interested in. Here are the three most popular confidence levels and their corresponding Z-score values:
68% = 1.00
95% = 1.96
99% = 2.57
Here is an example. Assume you're conducting a project with 400 respondents (N = 400) and you want to know the estimated maximum sampling error for each of the three popular confidence levels (use p = 50). The answers are (you should compute these yourself to make sure you know how to get the numbers):
68% = 2.5%
95% = 4.9%
99% = 6.4%
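If you want to check your arithmetic, a minimal sketch (the function name is mine):

```python
import math

def max_sampling_error(n, z):
    """Maximum sampling error (p = 50) for a sample of size n at Z-score z."""
    return z * math.sqrt(50 * 50 / n)

for level, z in ((68, 1.00), (95, 1.96), (99, 2.57)):
    print(f"{level}% confidence: about ±{max_sampling_error(400, z):.1f}%")
```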
Sampling Error Table
I'm a college senior taking a media research class and we're using your book. Would you tell me which formula you used to compute the sampling error table on page 98? Thanks a lot. - Alan
Alan: The formula to compute sampling error is:

SE(p) = √( p(100 − p) / N )

where p is the percentage of respondents giving an answer and N is the sample size. So, if 20% of the respondents give an answer in a sample of 500, the formula would be:

√( 20(100 − 20) / 500 ) = √3.2 ≈ 1.79

(At the 95% confidence level, multiply by 1.96; at the 99% confidence level, multiply by 2.57.)

Therefore, at the 95% confidence level, your sampling error is about ±3.5%.
Memorize this formula, Alan, because I guarantee it will be on a test you take in class. This is one of four formulas I require my students to memorize since it comes up so often in research. You must know this formula. Trust me.
All Content © 2018 - Wimmer Research All Rights Reserved