During the PDA AC Radio Seminar this past fall, you suggested that there was a way to ‘calibrate’ your online callout research so it would be as reliable, or almost as reliable, as true phone-based callout. Could you go back through that process, not only looking at ‘how to,’ but also pointing out the pitfalls that may be easy to fall into? If we’re going to use Internet-based research, we want it to be as accurate as possible, so with that in mind please set the standard high. – Anonymous
Anon: This is not going to be a short answer, so get a six-pack of your favorite beverage.
Before I address your questions, I’d like to briefly discuss two things: (1) the Methods of Knowing; and (2) research in general. These topics are important as background information for your answers.
Methods of Knowing
In our book, Mass Media Research: An Introduction, Joe Dominick and I discuss C. S. Peirce’s summary of the four ways we learn things: tenacity, intuition, authority, and science. These relate to every area of our lives, not just radio research.
A user of the Method of Tenacity follows the logic that something is true because it has always been true. An example is the storeowner who says, “I don’t advertise because my parents did not believe in advertising.” The idea is that nothing changes—what was good, bad, or successful before will continue to be so in the future.
In the Method of Intuition, or the a priori approach, a person assumes that something is true because it is “self-evident” or “stands to reason.” (“It seems like.”) Some people oppose research because they think they know the answers and research is a “waste of time.”
The Method of Authority promotes a belief in something because a trusted source, such as a parent, a teacher, a boss, or programming consultant, says it is true. The emphasis is on the source, not on the methods the source may have used to gain the information.
The Scientific Method is unique among the methods of learning and it includes some unique characteristics. The method is public, objective, empirical (testable), systematic, cumulative, self-correcting, and predictive.
So what? Well, the “so what” is that it’s important for decision-makers to know the source of the information they are presented. Is the information merely an opinion? Is the information an urban legend? What is the source of the information? These questions lead to the second introductory discussion—research in general.
Research in General
I have been involved in research for about 30 years and have learned many things about what the “average” person thinks about research. Some of the things include, but are not limited to:
Most people think that research is a simple process that can be performed by anyone, including people with absolutely no research background.
Most radio people do not have any formal research training, yet many of these people make claims about research and research methodology. Unfortunately, in many situations, other unsuspecting individuals accept these unfounded statements as true. For example, and I don’t know where it started, there is a belief that for a research project to be valid and reliable, virgin respondents (people who have never participated in a research study, as opposed to “professional respondents”) must be used. This is pure trash.
Numerous research urban legends, myths, and fallacies are passed from one person to another—usually from a PD, former PD, or programming consultant, who has no research training. Much of my time in research is spent correcting the research “facts” passed along by non-researchers.
Much of what most people know about research falls into the category of learning I call, “It seems like” (Method of Intuition). That is, consultants, PDs, and others who have no research training believe and pass along information that is based on hearsay. For example, you may hear someone say that music tests need a sample of 150 respondents. Why? Because “it seems like” a smaller sample would not be good. What happens with this type of comment? Many people pass along the “fact” that music tests must include 150 respondents, and many people believe it’s true.
Finally, many non-researchers think that “interpreting” research consists merely of looking at the data. This is particularly true when people try to compare the results of one study to the results of another study, or from one trend to another trend. In music tests or callout, for example, some people may see a song rated as a 5.1 in one test and a 5.9 in another test and think that the song went up—more people like the song, or some similar interpretation. In reality, however, it might be that the 5.9 is actually lower than the 5.1—the song could actually have gone down in the ratings. Listen to me now, and believe me later—in most situations, it is virtually impossible to simply “eyeball” the results of two or more tests and determine if “things” went up or down.
Comparing Callout to the Internet
I listed the “eyeball” comment as the last point because it’s a great segue to your question about developing a way to produce a valid and reliable method to test your music on the Internet.
First, I think the Internet has great potential when it comes to data collection, and my guess is that the Internet will eventually become the primary data collection method for virtually all types of research. However, the problem I have seen is that too many people have accepted the Internet as a valid and reliable methodology without properly investigating the advantages and disadvantages. That’s not the right way to go.
So what do you need to do to ensure that your Internet-collected information is valid (does your procedure test what you think it’s testing?) and reliable (does your method provide consistent results over time?)? My problem is trying to figure out a way to describe what needs to be done when I know that most people who read this don’t have experience with research and statistics. I’ll do my best.
If I were talking to (or writing for) people with research and statistics backgrounds, I would say this about callout and the Internet:
If you want to use the Internet for callout, you must verify the Internet results with those collected via callout, which we already know is valid and reliable if done correctly. For the comparison, you should do these things:
Conduct your callout as you normally would, but at the same time (same days), test the same songs on your Internet site. Be sure to use the same directions, rating scale, and hooks. In addition, make sure the sample compositions (age, sex, etc.) are the same. The sample sizes don’t have to be exactly the same, but they should be close. In other words, both should have about 80 or 100 respondents, or whatever sample size you normally use in your callout. (Conducting the tests simultaneously and using the same procedures controls for several threats to internal validity, such as history and maturation.)
When you have the results, you need to do at least three things to find out if the data are similar (each method can be performed easily on a spreadsheet like Excel):
Compare the standard deviations for each song. Don’t “eyeball” these numbers, because you won’t be able to tell whether they are similar or different. The standard deviation indicates the amount of agreement within each sample, and the two should be statistically the same at a 95% confidence level. (Strictly speaking, a t-test compares means; two variances are compared with an F-test.)
Compare the raw scores for each test using a t-test. Once again, the scores should be the same using a 95% confidence level.
Compare the raw scores for each test with a Pearson Product-Moment Correlation. The correlation should be at least .90.
Do this same procedure for at least two testing sessions (reports). Don’t rely on only one study to determine if the two methods are comparable.
If two separate callout/Internet comparison studies indicate that the two methods are similar, then you can proceed on the assumption that the Internet will produce valid and reliable information.
Here’s the problem…
I know that most people who are reading this don’t know how to do any of the statistical methods I just described. I would imagine that the most logical conclusion is, “That’s too much to do. I don’t know how to do those things, so I’ll just eyeball the data.” Don’t do that! There are alternatives.
One of the best alternatives is to go to a local college or university and get help from a teacher in journalism, mass media, marketing, statistics, psychology, or sociology. The teacher, or maybe a student, can set up the spreadsheets for you so you can analyze your data (each spreadsheet should take about one hour to design).
Final Emphasis: Don’t disregard these procedures because they may look too complicated. Your music is your product, and you must ensure that the data you use to make decisions about your product are valid and reliable. I know it’s easy to forget all this stuff and just eyeball your data, but don’t do that.
Most callout companies score songs from 1 to 5 and you end up with scores like 3.12 or 3.78 and such. Assuming a sample of 100 people testing 30 songs, what is the margin of error on these scores? In other words, if a song scores a 3.12 is it + or - .25? Lower? Higher? - Anonymous
Anon: What you’re looking for here is called the Standard Error of the Mean (SEM) and it’s a little different from calculating error for percentages. To get the error associated with your callout scores, you need to compute the standard deviation (STD) for the group of 30 songs.
If you use Excel, for example, and your song scores are in cells A1 to A30, then use this formula: =stdev(a1:a30).
The next step is to divide the Standard Deviation by the Square Root of the sample size (N), or SEM = STD/(square root of N), or in your case the formula is: SEM = STD/10.
That formula will give you a number such as .32, which is subtracted from and added to your song scores to give you your confidence interval for each song. For example, let’s say that Song #1 has a score of 3.00. Using the hypothetical SEM of .32 means that the song’s “true” score actually lies somewhere between 2.68 and 3.32.
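The whole calculation fits in a few lines of Python. The 30 song scores below are made-up stand-ins for the questioner’s callout scores, with N = 100 respondents as in the question:

```python
# A sketch of the SEM calculation described above. The 30 song scores
# are made-up stand-ins; N = 100 respondents, as in the question.
import math
import statistics

n = 100  # sample size (respondents)
scores = [3.12, 3.78, 2.95, 4.01, 3.33, 2.80, 3.55, 3.10, 4.20, 2.66,
          3.48, 3.90, 2.72, 3.25, 3.60, 4.05, 2.88, 3.40, 3.70, 3.05,
          2.99, 3.81, 3.22, 3.66, 2.75, 4.10, 3.15, 3.52, 2.91, 3.37]

std = statistics.stdev(scores)   # same as Excel's =STDEV(A1:A30)
sem = std / math.sqrt(n)         # SEM = STD / sqrt(N) = STD / 10 here

# Confidence interval for one song, e.g. a score of 3.00:
score = 3.00
low, high = score - sem, score + sem
print(f"STD = {std:.2f}, SEM = {sem:.3f}")
print(f"A score of {score:.2f} lies between {low:.2f} and {high:.2f}")
```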
The problem I have with your question is that you shouldn’t have to do this work. Tell the company that conducts your callout to compute the SEM for you when they complete a report. The song scores are already in a computer somewhere and it would take about 15 seconds for them to include this in your printout.
Since my GM won't do any kind of local music research, the only research tool I have is Critical Mass music research that I get from Mediabase. I asked Mediabase what the margin of error is for the nationwide callout research a couple of months ago and I am still waiting for a response. From your experience, how accurate is nationwide callout? Thank you. - Fred
Fred: My experience has nothing to do with my answer. Your question relates purely to sampling and statistics. You can achieve the same experience by reading a research book or taking a research/statistics class.
Anyway, computing sampling error takes about 1 second if you do it on a computer. In fact, you can compute it yourself on my company's website. Along the left side of the home page is a button named "Sample Error." Click on that, and then follow the directions. The only thing you have to know is the sample size. (I'm not sure if you know the sample size of the callout research.)
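For reference, the standard worst-case formula behind calculators like that one is z × sqrt(p(1−p)/n) at the 95% confidence level. Here is a generic sketch (it may not match the site's calculator exactly):

```python
# The standard worst-case sampling-error formula for a percentage at the
# 95% confidence level. This is a generic sketch; it may not match the
# website's calculator exactly.
import math

def sampling_error(n, p=0.5, z=1.96):
    """Margin of error for a percentage; p = .5 is the worst case."""
    return z * math.sqrt(p * (1 - p) / n)

# Example: a sample of 400 respondents.
print(f"+/- {sampling_error(400) * 100:.1f} percentage points")  # +/- 4.9
```

Notice that quadrupling the sample size only cuts the error in half, which is why national samples are large.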
On to your question about how accurate nationwide callout is. National music research is just that—national. You can compute sampling error for the research, but it will not tell you anything more than what the error is for the national study. In other words, you can't take the national data, compute sampling error, and then say that the results fall within that range for your market. National research provides National results.
Let me compare it to political polls. There are several national polls conducted during every presidential campaign. Historically, the polls conducted by companies that correctly conduct the research are very, very close to the actual vote on Election Day—considering sampling error.
These polls do a fantastic job at predicting the national totals, but they do not necessarily reflect the voting patterns in individual states, counties, towns, cities, parishes, and other areas. For example, the country may elect a Democratic president, but there are several states, cities, etc. that elected the Republican candidate. Get it? The national polls only predicted the total national vote, not what happened in all the smaller areas.
This is the same for national music testing. The numbers are probably fairly accurate in estimating the national rating of songs, but they may not relate to what the listeners in your market think about the songs. If you relate the data from a National music test to the listeners in your market, the margin of error can be anywhere from 0 to 100%—just like a national political poll.
I want to make sure that you understand this clearly, so I will repeat it. Asking the company that conducts the national music tests for sampling error associated with the test allows you only to interpret the National data with that error margin. It says nothing about the error associated with individual markets or stations. Here's another example: What would you say if Arbitron produced your market's book by using a national sample of listeners?
National music data provides national numbers and only indications of what may or may not exist in your market.
(I should add that it's possible that some of your listeners are involved in the national sample, but my guess is that the sample is very small. If you can, find out how many listeners in your market were included in the study. Ask for the data from these people only. Then you can compute sampling error that relates to your market.)
Current to Recurrent
We regularly test our currents in callout. What's the best way to know when to move the song from current to recurrent? What percentage of burn should send a song from current to recurrent? - TG
TG: If you're going to use burn percentage to determine a shift from current to recurrent, here is one suggestion:
First, don't make a decision after only one test. Track your results to determine the "true" burn percentage. An easy way to help you do this is to plot your burn percentages on a sheet, preferably graph paper. This won't take you long to do since you're probably only testing 20 songs each week. (More than that isn't good.)
Plot your burn percentages and look for a sharp up trend. When you see a sharp break up in the burn percentage line, it may be time to consider moving the song to recurrent. You should also compute an average burn percentage so you can compare one song to all the others in the test. Use Z-scores to do this.
I don't think you should set a burn percentage limit, such as 30%, because your sample is small and there is a lot of error involved with callout data.
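A minimal sketch of the Z-score idea, with made-up burn percentages for ten songs (not real data):

```python
# A sketch of the Z-score comparison suggested above, with made-up burn
# percentages for ten songs (not real data). A Z-score shows how far a
# song's burn sits from the average burn, in standard deviations.
import statistics

burn = {  # song -> burn percentage (illustrative)
    "Song A": 18, "Song B": 22, "Song C": 35, "Song D": 15, "Song E": 27,
    "Song F": 20, "Song G": 41, "Song H": 19, "Song I": 24, "Song J": 17,
}

mean = statistics.mean(burn.values())
sd = statistics.stdev(burn.values())

for song, pct in burn.items():
    z = (pct - mean) / sd
    flag = "  <- consider moving to recurrent?" if z > 1.0 else ""
    print(f"{song}: burn {pct}%  z = {z:+.2f}{flag}")
```

A song whose burn sits well above +1 standard deviation from the average is burning noticeably faster than the rest of the list, which is more informative than any fixed percentage cutoff.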
It seems to be a constant problem, but dance music rarely researches well. Yet, there are many examples of dance songs that react and request very well (for example, Darude's "Sandstorm"). Go to the hottest club in your city and watch everyone scream whenever one of the hottest songs of the moment is played. Where is the flaw in the callout research? Songs that create such reaction in a club and that find their way to the top of the nightly radio station countdowns should research! How do you explain this? - Anonymous
Anon: You have a very interesting use of the word research. Interesting question, though, and I'm surprised no one has ever asked this before.
In research terms, what you are describing is known as method-specific results, which means that the results of a test or experiment are a function of the method, or approach, used for investigation. And what you have noticed is not new. It's the same thing that was said about Disco music: "Disco music tests terribly, but if you go to a nightclub where Disco music is played, most people are flopping all over the place going nuts, especially when 'YMCA' comes on; every single person in the crowd (regardless of age) goes through the physical gyrations of spelling the title of the song with their arms and body" (Y-M-C-A). Dumb song? Yup. Tests horribly? Sure does. But people go nuts at a club when the song comes on. Go figure.
OK, I will. In both situations (yours and Disco), the methods of "testing" are different. In callout, you're asking people to rate songs on some type of "Like-Dislike" scale, whereas in the nightclub the "test" is how much a song gets people off their couch-potato butts and hopping around the floor ("Dance like no one is watching").
If you want a more direct comparison of the two methods, you'll need to change your callout question to something like, "On a scale of 1 to 10, where 1 means it's not a good dance song, 10 means it's a great dance song, and 2 through 9 are in between, how would you rate . . .?" You'll probably find that some of the songs that don't "research" well will soon change to good researches. (Or something like that.)
1) First, a follow-up to the dance music topic you covered a few weeks ago. I can’t believe you didn’t touch on this, Doc. The writer wondered why dance music didn’t test well in callout for what I am assuming is a CHR station, even though dance smashes like DaRude’s recent hit never fail to pack a dance floor. You suggested that people should rate whether such songs are “good dance songs.”
Perhaps the fact that dance songs don’t test well for many CHRs (including mine) has more to do with the fact that dance is only one type of music most CHRs play. Callout scores should be lower because you’re seeing a cross section of a station’s listeners, some who like dance and some who don’t. DaRude can get a club moving because everybody at the club is inclined to like dance music. Am I on to something here?
What about a dance music/CHR Rhythmic station’s research results? (The CHR/Rs that don’t lean hip hop that is.) Since their listeners are inclined to like dance music, wouldn’t dance music score better in their research? Any examples you know of?
2) I remember Internet-based research being something you frowned on the last time you spoke of it. (It’s been a while). Can you repeat the reasons why you have some reservations? I think one of them was that there is no way to verify the people on the other end are who they say they are. Well, if you COULD trust your respondents, what other things would need to be taken care of?
3) There’s an assumption that my consultant has about testing music. He suggests that all old songs be tested in an auditorium setting instead of in call out. Do you agree? The vibe I get is that older songs a) supposedly don’t test well in call out; b) one can never test all of their gold library in call out due to most gold categories being larger than the rest.
Are those thoughts along the “It seems like” lines or not? Now go grab a can of tea and get to work! - Anonymous
Anon: Yes, boss! I’m getting to work. My can of iced tea is at hand. I’m not sure if you’re on to something or if your comments belong in the “It seems like” category, but that doesn’t matter. What matters is that you’re following the rules of science and questioning things even though there may already be an answer. That’s good.
Question 1: You may have a point, but I didn’t make that leap because the person who wrote the question did not explain the type of radio station he/she was talking about. I have to answer the questions based on the information I received. The person asked if there was a flaw in callout research and that is what I addressed. (See below, "Callout - Dance Music.")
I can’t agree or disagree with your statement that, “Callout scores should be lower because you’re seeing a cross-section of a station’s listeners, some who like dance and some who don’t.” Although your conclusion seems logical, I don’t like dealing with “seems like” in reference to research. An additional answer might be that the dance songs tested in the callout actually suck and the respondents are merely voicing their opinions.
It wouldn’t be wise to assume that dance music will get lower scores because some people like the music and some don’t. You need to find the answer, not make an assumption.
Question 2: You are correct in saying that my main reservation with Internet-based research is that there is no way to determine who really answers the questions. To my knowledge, this hasn’t been solved yet. However, if the problem were solved (as you suggest), my secondary concern is whether the sample is randomly selected. In other words, relying on volunteers from a radio station’s database to answer survey questions is not appropriate.
While all research uses volunteers (the respondents have to agree to participate), it is important that every respondent in a given population should have an equal chance to get involved. If you rely only on respondents who participate on their own, then the results will be skewed to that type of person, and that type of person may not represent the population from which they came.
I believe that data collection via the Internet is a potentially very useful tool, but from what I have seen so far with many radio stations, the information being collected isn’t valid and reliable. I know several radio people who no longer use telephone research because the Internet approach just seems right and it’s virtually free to conduct (if they use a survey on the radio station’s website). These people don’t control for error, don’t understand the effect of volunteer respondents, and usually design the questions (mostly terrible) themselves. Much of what I have seen is literally a waste of time. (But hey, it’s “research” and “anybody can conduct research.”)
Question 3: I agree with your consultant about testing old songs in an auditorium setting as opposed to on the telephone. There are at least two reasons for this:
The auditorium setting continues to be the best way to test music because of the control over the situation. The callout methodology introduces several sources of error: respondents may or may not listen to the hooks, respondents’ attention may be diverted by other activities they are involved in while on the telephone, fatigue sets in very quickly when rating songs on the telephone, and respondents may or may not understand the rating scale. One of the most important tenets of scientific research is control over the research setting, and the auditorium approach is the only music testing method that allows for such control. Callout is an error crapshoot.
Libraries are too big to test on the telephone; respondents will not sit and listen to 600 songs on the telephone. And you can’t get around this by testing 20 songs at a time until you have tested all 600. There are two reasons for this: (1) The results would be affected by history, meaning that the ratings are not collected at the same time; and (2) You would not be able to test the 600 songs with the same sample. Each callout sample would have its own error (sampling error, measurement error, random error), and you could not combine the results of about 30 callout sessions to produce a test of the entire library. That don’t be right. No way.
By the way, I have never heard the idea that old songs don’t test well in callout.
Roger D. Wimmer, Ph.D. - All Content ©2018 - Wimmer Research All Rights Reserved