Callout Questions Continued


How to Select Songs

Hi again!  When developing a radio station’s format, or adding new music to a station, I’ve seen from your previous answers that it’s important to ask the listeners which songs they like.  However, there are only so many songs you can put into an audience test, out of the million or so songs that are out there on CDs.  Of course, it’s obvious that you’ll want to pick the songs that people know (preferably songs that have been released, or have just come out, as singles) and that fit the format.  However, there may be exceptions to this.  So how do you select songs for callout?  I’ve thought of several ways to pick songs that should be put into callout and played if they test well.

  1. Look at the key (or side) artists of your format and pick all their singles.

  2. Let each employee of your radio station write down their favorite songs and put those into callout.

  3. Put all the songs into callout that listeners request and that haven’t been tested yet, plus those that tested poorly before but are suddenly heavily requested.

  4. Put into callout all the songs that record companies say you should play.

  5. (For stations already on the air only.)  Put untested songs into callout that sound similar to the top songs on the radio station.

Which of those do you think is best?  Or is there any other procedure that comes to mind? - Kurt


Kurt:  I checked with a few top PDs in the country and here are some suggestions:

  1. Songs that are charting and fit the music blend of their own station.

  2. Songs that are playing on the direct competitors (local airplay).

  3. Songs that are selling well in the market.

By the way, in your list, I specifically don’t like #2, #3, and #4.  Too much error involved with each one.

Interpreting Data 1

Thanks for the help you're always providing on this site.  I know you're a busy man so I'll get straight to the point—I need help.  We just started callout research for my radio station (Christian CHR in small market) and I have been given a lot of direction on formulating callouts, but what about results?  I was wondering if there are any books or tools you'd recommend that could help me learn how to evaluate the numbers and put them to use once I have the info.  Any ideas would be greatly appreciated! - Anonymous


Anon:  Thanks for the comment about the column.  I appreciate that.  Tall order here, my friend.  This is similar to asking an architect to explain in writing how to draw an office building (or something).  I'm not sure if my explanation will be too complicated for you, but there isn't any other way to describe things, which, by the way, is the reason why people hire researchers to design research projects.


I don't know what type of rating scale you're using, but that doesn't matter.  You first need software for your data, such as a spreadsheet if you don't have music analysis software.  Once the data are entered, you'll be able to compute the mean, standard deviation, and everything else you'll need.


Most callout data jump around a lot from one sample to the next, so it's best to convert your data to Z-scores.  This will stabilize the data and allow you to make comparisons from one report to another.  If you don't use Z-scores, your comparisons won't be legitimate.  This is true even if you use a panel for your sample.


After your song ratings are converted to Z-scores, you can track the performance from one callout to another to determine if familiarity is improving, or whatever else you're interested in.  However, as I have mentioned many times in this column, don't base your decisions on only one callout report—track a song for a few weeks before you make a decision.
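For readers without music analysis software, here is a minimal sketch of the conversion in Python.  The song titles and ratings are invented for illustration, and the choice of the population standard deviation is an assumption; a spreadsheet would do the same job.

```python
from statistics import mean, pstdev

# Hypothetical mean ratings from ONE callout report (titles and values invented).
report = {"Song A": 7.1, "Song B": 5.4, "Song C": 6.2, "Song D": 4.3}

def z_scores(scores):
    """Convert raw scores to Z-scores: (score - mean) / standard deviation."""
    avg = mean(scores.values())
    sd = pstdev(scores.values())  # population SD; statistics.stdev gives the sample SD
    return {song: (value - avg) / sd for song, value in scores.items()}

z = z_scores(report)
for song, value in z.items():
    print(f"{song}: {value:+.2f}")
```

A positive Z-score means the song rated above the average of that report, and a negative Z-score means it rated below.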


My guess is that after you read this, you'll probably be a bit confused, but there is no simple way to describe the process.  It involves many things, including statistics.

Interpreting Data 2

Hello Roger, Thanks for the column.  I'm new to interpreting song research data and I need some help.  What does it all mean?  If the song numbers read: POS 49.0, NEG 39.0, NET 10.0, Burn 29.6, FAM 100.0.  Familiarity is great!  However, there's only a 10 percent difference between those who like and dislike the song.  Do you keep it for familiarity, or toss it because it has a 39 NEG, which nets to only 10?  This seems low to me.  Is there a rule of thumb for any of these categories?  Thanks in advance for your help. - Anonymous


Anon:  Hello to you, and you're welcome for the column.  On to your question…I have a bit of a problem here because I don't know what type of scale you use to rate your songs.  I don't know if you use a 3-point, 5-point, or some other number of points, so I'm going to try to work around that.


Before I get to your question, I'd like to discuss the scheme you use to interpret your songs, specifically the Positive (POS), Negative (NEG), and NET.  As I said, I don't know what type of scale you use, but this summary isn't a good way to analyze callout scores or any other data for that matter.  The basic reason is that you don't have any idea about the intensity of the Positive and Negative scores—49% are Positive scores, but HOW positive?  Do the 49% really love the song, or do they just "barely" like it?  You don't know.  Let's use an example.


Let's assume that we test songs on a 7-point scale, where the higher the number, the more a respondent likes a song.  Assume that we have 100 respondents who produce the following data for a song (I used your data for percent positive and negative.  I'm sure the scores would be spread out in a real test, but I'm demonstrating what could happen theoretically):


1 = 39

2 = 0

3 = 0

4 = 12

5 = 49

6 = 0

7 = 0


OK, according to the interpretive scheme you use, 39 people rate the song a  "1" (39% negative), 49 people rate the song a "5" (49% positive), and 12 people rate the song a "4" (12% neutral).  Almost one-half of the respondents like the song (barely), and 39% dislike the song (hate it).  This produces a 10% net Positive.  What does this mean?  The problem is that you don't have any indication of the intensity of like or dislike.  You only have a net Positive.  This hides the fact that the Positive ratings aren't very high.


Now, if you forget the positive and negative scheme and compute the mean for the song, you'll find that it's 3.3 on the 7-point scale.  That's not a very good score.  If we take the mean score at face value, we know that it falls below neutral (Score = 4), and that doesn't seem like a winner.


As I said, the distribution I constructed is theoretical.  You could have 49% who rate the song a "7," which would produce a much higher mean score.  Your Positive/Negative approach hides what is actually going on.  In research terms, this is called factor fusion, which means that you are artificially restricting the sensitivity of a rating scale or measurement by squeezing things together.  The approach is a poor way to summarize data, and that's not good.
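To make the point concrete, here is a small Python sketch that summarizes two distributions both ways.  The first is the theoretical distribution from the example; the second is a hypothetical variant in which the 49 positive respondents rate the song a "7."  Treating "4" as neutral, with anything above it Positive and anything below it Negative, follows the example.

```python
# Respondent counts per scale point (1-7).
barely_like = {1: 39, 2: 0, 3: 0, 4: 12, 5: 49, 6: 0, 7: 0}  # from the example
love_it = {1: 39, 2: 0, 3: 0, 4: 12, 5: 0, 6: 0, 7: 49}      # hypothetical variant

def summarize(counts, neutral=4):
    """Return POS, NEG, and NET percentages plus the mean rating."""
    n = sum(counts.values())
    pos = 100 * sum(c for point, c in counts.items() if point > neutral) / n
    neg = 100 * sum(c for point, c in counts.items() if point < neutral) / n
    avg = sum(point * c for point, c in counts.items()) / n
    return {"POS": pos, "NEG": neg, "NET": pos - neg, "mean": avg}

a = summarize(barely_like)
b = summarize(love_it)
print(a)
print(b)
```

Both distributions produce the same POS, NEG, and NET, yet the means differ (about 3.3 versus 4.3), which is exactly the information the Positive/Negative summary hides.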


Now on to your question.  Taking your data at face value, it says that 100% are familiar with the song, and among these people 49% like the song and 39% dislike the song.  In addition, about 30% are tired of hearing it.  You didn't tell me which group (like or dislike) is tired of hearing the song, but it is something to consider.  If the burn percentage comes from the dislike group, then it adds no additional information to your interpretation.  However, if some of the burn percentage comes from the "like" group, then it's a more significant negative.  Do you see what I mean?
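If you have the respondent-level data, the breakdown is easy to check.  The Python sketch below is only an illustration: the group sizes (49, 39, and 12) follow the question, but the way the roughly 30% burn is split across the groups is invented.

```python
from collections import Counter

# Hypothetical respondent-level records: (opinion, tired_of_hearing).
# The burn split across groups is invented for illustration.
respondents = (
    [("like", True)] * 10 + [("like", False)] * 39
    + [("dislike", True)] * 18 + [("dislike", False)] * 21
    + [("neutral", True)] * 2 + [("neutral", False)] * 10
)

group_size = Counter(opinion for opinion, _ in respondents)
burned = Counter(opinion for opinion, tired in respondents if tired)

# Burn percentage within each opinion group.
burn_by_group = {g: 100 * burned[g] / group_size[g] for g in group_size}
for group, pct in burn_by_group.items():
    print(f"{group}: {pct:.1f}% burn ({burned[group]} of {group_size[group]})")
```

In this invented split, 10 of the 49 people who like the song are tired of it, and that is the part of the burn score that deserves attention.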


I'm really at a loss here because I don't have your data.  How does this song compare to the other songs in your test?  What are the average Positive, Negative, and Burn scores?  Is it OK with you to play a song that 39% of your listeners dislike?  Is it OK to play a song that 30% of your listeners are tired of hearing?  You need to look at all the songs you test to help determine these answers.


What I can tell you is that the Positive/Negative approach you're using now is not a good way to analyze your data.  Can you convert your data to a more meaningful scale?

Interpreting Data 3

This is a tough one, at least for me.  Callout research from Mediabase uses Positive, Negative, Net, Burn, and Familiarity.


I often think that we radio people need to take into consideration what our listeners will tolerate, not just what they like or dislike.  Listeners will often tolerate a song they don’t particularly like (or aren’t familiar with) because they know one they’ll enjoy is coming right up.  Or they’ll tolerate commercials because more music is on the other side of them.  That being stated: How should one go about deciding the best use of a song’s score(s)?  Is NET the most important?  At what score (and in which category) should a PD/MD consider dropping (or increasing spins for) a song?  When is a Burn score too high?  What does it all mean?  Is there a standard point where you should increase or decrease a song’s spins (e.g., when the negative score hits 20.0, you drop it altogether)?


Help!  I want to be a better station.  Thanks, you’re awesome! - Anonymous


Anon:  I’m not sure I’m “awesome,” but thanks anyway.  You asked several questions and I’ll try to address all of them.


First, I need to point out that callout research, just like auditorium testing, is not designed to tell you what to play on your radio station.  The data are intended to provide you with indications of the listeners’ perceptions and YOU need to make the decision if the song should go on the air (or be pulled).  Too many people use music research the wrong way—the research is not a bible.  The research is an aid to decision-making.


Part of this “indication” element is the fact that your interpretation of callout research by Mediabase or anyone else must include sampling error.  I don’t know the sample size used by Mediabase, but if a song has a burn percentage of 20%, the “real” percentage probably falls somewhere between 10% and 30%.  (You need to know the sampling error before you start interpreting the results.)
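As a rough illustration of where such a range comes from, here is a Python sketch of the usual margin-of-error formula for a percentage.  The sample size of 100 is purely an assumption (the actual Mediabase sample size isn't known here), and the formula assumes a simple random sample and a 95% confidence level.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.20  # observed burn of 20%
n = 100   # ASSUMED sample size, for illustration only
moe = margin_of_error(p, n)
low, high = p - moe, p + moe
print(f"{p:.0%} +/- {moe:.1%} -> {low:.1%} to {high:.1%}")
```

With these assumed inputs the margin is a little under 8 percentage points, so the "real" burn could be anywhere from about 12% to 28%.  A smaller sample widens the range.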


On to your questions….


You ask about listeners tolerating songs.  Oh, I’m sure that a “tolerate” option could be included in music research, but I’m not sure what that really means or how listeners would answer that question.  Let’s say that you ask, “Even though you dislike this song, would you tolerate it on your favorite radio station?”  I think average listeners would have a tough time with this question.  My guess is the answer would be, “Sometimes,” or “It Depends.”  But that doesn’t help much.


I am a believer in Occam’s Razor, which states that the simplest approach is always the best.  Since the start of music research in the early 1980s, the simplest approach of rating a song on familiarity, some type of “like-dislike” scale, and burn (tired of hearing) has been shown to be both reliable and valid.  Sometimes a test will include a “fit” question, and that’s OK too.  But I think a “tolerate” question is pushing the limits of simplicity.  It’s asking a very ambiguous question and that’s not good.  Unless I can be proved wrong, my reaction is to stay away from “tolerate” questions.


Is the NET score most important?  I would think so because it subtracts the negative scores from the positive.  However, I believe that the average score with standard deviations provides more information.


What score should you use for interpretation?  There are many rating scales for music research, but a good approach to follow is to consider playing songs that are above average and consider dropping songs that are below average.


What are the standards for increasing spins or dropping a song?  Once again, there are no universally accepted standards for interpretation.  However, logic dictates that songs that receive high scores should receive more spins; songs with low scores or high burn should receive fewer spins (or be deleted from the playlist).  Your example of 20% negative means that 80% are positive or neutral.  Do you think it’s logical to drop a song that has 80% positive or neutral responses?  That’s your call.


I know you’re interested in finding some exact guidelines, but there aren’t any.  My problem here is that I don’t have the data you’re using.  If I did, we probably could come up with some guidelines for you to use.  In the meantime, you should be able to come up with a few guidelines based on above or below average scores.
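Here is one way the above-or-below-average guideline might be sketched in Python.  The song titles and scores are invented, and the split is only a starting point for your own judgment, not a rule.

```python
from statistics import mean

# Hypothetical mean ratings on a 10-point scale (titles and values invented).
scores = {"Song A": 7.8, "Song B": 6.9, "Song C": 5.1, "Song D": 6.4, "Song E": 4.2}

test_average = mean(scores.values())

# Songs above the test average are candidates to play (or spin more);
# songs below it are candidates to drop (or spin less).
consider_playing = sorted(s for s, v in scores.items() if v > test_average)
consider_dropping = sorted(s for s, v in scores.items() if v < test_average)

print(f"Test average: {test_average:.2f}")
print("Consider playing:", consider_playing)
print("Consider dropping:", consider_dropping)
```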

Interpretations (Decisions)

Doc:  We're about to start callout at our AC radio station and I'd like to know if you would give me a few tips about how to interpret the results.  We're going to do two reports each month with 100 respondents in each report.  Thanks. - Anonymous


Anon:  As you noticed, I edited your question a bit to remove references to your radio station (as well as a few other things).  I don't think it's necessary for this information to be publicized.  On to your question . . .


If you have been reading this column for a while, you may have seen my answers about comparing the results of one study to another, such as callout, perceptual studies, or Arbitron.  In all of my answers, I have said that comparisons of one study to another can't be made unless the data are converted to Z-scores.  Comparing the raw scores from one study to another isn't appropriate, legitimate, acceptable, or correct.


Why?  Because the data are collected at different times with different respondents (different samples).  If you compare the raw scores from one callout report to the raw scores from another callout report, you are comparing apples to oranges (so to speak) and that's incorrect.  The same holds true for any other type of research, including comparing one Arbitron book to another — it's wrong.


Here is what you need to do:

  1. After you conduct your first callout, and before you do anything with the information, you must convert the song data (familiarity, rating, burn) to Z-scores (also called Standard Scores or Standardized Scores).  You have to do this because each report is in a different metric (as it's called) since the data come from different samples.  For example, one sample may be "easy" graders and another sample may be "difficult" graders, and comparing the results from these two groups isn't valid.

  2. All Z-scores have a mean of "0" and a standard deviation of 1.0.  Z-scores are computed by subtracting the sample's mean from each score and dividing the result by the sample's standard deviation (hence the name "standard scores," or "standardized scores").  The standardization process means that the data are in the same metric, so you can literally compare apples to oranges (or one research study to another).  (Note: The only "raw" information you can compare in one study to another or from one Arbitron book to another is the rank of something, such as a song's rank among all songs tested or a radio station's rank in any demographic.)

  3. I prepared an example of how to compute Z-scores in Excel.  The example is designed to show you the formula for computing Z-scores and how to set up a basic table.  It would be easy to use this template to develop a spreadsheet for as many callout reports as you would like to compare.

  4. Don't make any decisions about any of the songs based on the information in the first two reports.  Wait until the third report because you'll have three "looks" at the data to identify any trends (or establish a direction if one report shows a high score for a song and the second report shows a low score for the same song — the third report will establish which report is correct.)
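The steps above can be sketched in Python as well as in a spreadsheet.  In this illustration the three reports, the song titles, and every rating are invented, and the population standard deviation is an assumption.  Each report is standardized on its own before any song is tracked across reports.

```python
from statistics import mean, pstdev

# Hypothetical mean ratings from three callout reports (all values invented).
reports = [
    {"Song A": 6.8, "Song B": 5.9, "Song C": 4.7, "Song D": 5.4},  # "difficult" graders
    {"Song A": 8.1, "Song B": 7.0, "Song C": 6.5, "Song D": 6.0},  # "easy" graders
    {"Song A": 7.4, "Song B": 6.6, "Song C": 6.3, "Song D": 5.1},
]

def standardize(report):
    """Z-score every song using that report's own mean and standard deviation."""
    avg, sd = mean(report.values()), pstdev(report.values())
    return {song: (score - avg) / sd for song, score in report.items()}

z_reports = [standardize(r) for r in reports]

# One row per song: its Z-score in each report, in order.
trend = {song: [z[song] for z in z_reports] for song in reports[0]}
for song, values in trend.items():
    print(song, " ".join(f"{v:+.2f}" for v in values))
```

In this invented data, Song C's raw score jumps between the first two reports mostly because the second sample grades easier, but its Z-score still rises steadily across all three reports, which is the kind of trend worth waiting for.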

Rating Scale

Great column.  I’ve always had an issue with the 5-point system that callout research uses.  It just seems artificial to me.  People sitting in their car don’t think, “Wow, that song’s such a 4.0!”  They do one of three things—change the station/turn off the radio, leave it on, or turn it up and sing along.


Would a 3-point system be effective, if listeners were given these instructions? The 5-point system does allow for more options, but really, there are only three things that can happen when a song comes on the radio, right?


And I realize that as a programmer, if you only looked at average scores of songs tested in this fashion, you’d probably get a whole mess of tunes clustered around 2.0. I think you’d have to look at each song individually—how many 1s, 2s, and 3s. You’d see your polarizing songs easier, and identify your “solid, but not spectacular” songs more easily, right?  Just wondering.  Thanks for the service. - Young Idealistic Program Director


Young:  Thanks for the comments about the column.  I’m glad you enjoy it.  Good question, but let me see if I can straighten out a few things for you.


First, not all callout research uses a 5-point rating scale.  The type of scale used varies by radio station and/or the research company conducting the callout.  Some use a 3-point scale, while others use 7 points or 10 points.  Which one is correct?


In research, this discussion falls under a category known as “psychometric theory,” or measurement theory.  Psychometricians (what a word, eh?) spend their time trying to determine which type of measurement system is best for the test under consideration.


If you read articles and studies about measurement systems, you’ll find a term called “factor fusion.”  (By the way, beating yourself in the head with a rubber mallet is probably more fun than reading psychometric literature.)  What factor fusion refers to is the situation where a measurement system artificially reduces (squeezes) the sensitivity of the measurements by having too few points.  A simple example is a thermometer with only three points: Cold—Warm—Hot.  If a doctor takes your temperature and says, “Your temperature is hot,” what does that mean?  How hot?  Just a little over 98.6, or am I ready to die?


By the way, you can also go the other way and have more rating points than are actually needed.  I refer to this as “factor dilution.”  A thermometer in tenths is usually all that is required.  It’s usually not necessary to know that your temperature is 98.555432.


OK…with that background, then let’s consider your suggestion of using a 3-point scale where 1 = change the station/turn off the radio, 2 = leave the station on, and 3 = turn up the radio and sing along.


How much information does that give you?  If someone rates a song as a “1” does that mean the person hates the song or just dislikes the song a little?  What does a “2” (leave the station on) actually mean in reference to how much a person likes the song?


OK, so let’s use another option: 1= Hate, 2 = OK/Fair, and 3 = Like a lot (or love, or favorite).  This may be a bit better, but some people may say that people’s perceptions of songs are more complex than Hate—OK—Love.  (A typical respondent would say something like, “I don’t really hate the song and it’s not really OK…it’s OK some of the time.” — Remember, we’re dealing with “average” radio listeners.)


What’s the solution?  Well, you can read the stuff by psychometricians, but I’ll try to save you some time.  You’ll find that, depending on the sensitivity of the measurement required, these people prefer rating scales with 5, 7, 10, and 100 points.  However, during the past few decades, there has been an increased use of the 10-point scale because virtually all people are familiar with it since they use it in their everyday lives (“She/he is a 10,” and the “Top 10” rankings for almost everything.)  In addition, the 10-point scale is universally accepted, demonstrated by the scores used in the Olympic games.


The bottom line?  I don’t agree with your suggestion to use a 3-point scale for callout or any music testing situation.  The scale epitomizes factor fusion and doesn’t provide enough sensitivity to the measurement.  I suggest using a 7- or 10-point scale.  If pressed to choose only one, I suggest using a 10-point scale.  This is not an opinion; it is based on conducting several tests of measurement scales for music testing.

Rating Scale (10-Point)

Can you set up the 10-point research scale (1 = ?, 2 = ?, etc.)?  Does this take too long with callout hooks?  - Anonymous


Anon:  Although you can provide 10 different words to describe each of the 10 points, that really isn’t necessary.  There are two approaches you can use.  In the telephone call, the interviewer would say to the respondent something like:

  1. “Please use a scale of 1 to 10, where 1 means ‘Hate,’ 10 means ‘Favorite,’ and 2 through 9 are in between.”

  2. “Please use a scale of 1 to 10, where the higher the number, the more you like the song.”



Roger D. Wimmer, Ph.D. - All Content ©2018 - Wimmer Research   All Rights Reserved