Auditorium Music Test Questions - Page 6
Song Scores - Predicting
Hi Doc! In answer to a previous question, “Data Analysis,” you wrote:
“For example, auditorium data could be used to find out…If there is a statistical model to allow a PD to predict the score for a song that was not tested in the session.”
Now this makes me curious. What would such a statistical model, in your opinion, look like and how could someone, looking at the raw data, say if there is such a statistical model hidden in the data? By what data points would you try to predict an untested song’s score? - Kurt
Kurt: First, I need to say that I know such a predictive algorithm is possible because I designed one to be used by a PD in “emergencies” between music tests, not as a replacement for a music test. For example, a PD may have forgotten to include a song in a test and doesn’t want to wait for several months (or more) for the next test. The algorithm gives the PD an option to replace guessing (or waiting).
What would such a statistical model look like? The model “looks” like a series of numbers and uses a relatively simple univariate statistic.
How could someone, looking at the raw data, say if there is such a statistical model hidden in the data? Statistical models or algorithms don’t necessarily jump out of the data. These formulas are usually developed because someone asks a question, such as, “Can song scores be predicted?” From there, it’s trial-and-error work—the same type of work used in any scientific endeavor.
By what data points would you try to predict an untested song’s score? This is the heart of the algorithm and involves input from the PD (or someone who knows the music on the radio station). The PD categorizes songs in the test and these categories are used in the algorithm.
I’m not sure if you’re interested in using such an algorithm, but you have a few options:
If you have a background in statistics, you can develop your own predictive model.
If you use a research company to conduct your music test, someone at the company should be able to develop a similar procedure. Just ask them this: “Would you develop a statistical procedure for me to predict scores for songs that I did not include in my most recent music test?” (The problem you may have here is that many radio researchers don’t have a research and/or statistics background.)
Hire a professor from a local college or university to develop the algorithm for you.
I hope you don’t interpret my answer as rude, coy, or anything negative, but I spent a lot of time developing the algorithm, and I’m not too excited about just giving away my work (research is my livelihood). I included the point in the “Data Analysis” answer to suggest that radio research studies offer a great deal of information that is not being utilized or tapped.
Song Rating: What does it mean?
If a person rates a song as a "10" on a 10-point scale, what does that mean? – Anonymous
Anon: I don't know. It depends on what you told the person a "10" means.
Sorting
Given the choice (in a do or die situation), would you choose to sort a music test by Z-scores alone or perhaps with a bit of passion consideration? Also, on a 7-point scale, which is a more clear representation of passion . . . a combined "6 + 7" score or a stand-alone "7" score? - Anonymous
Anon: If I were sorting a music test, I would use only Z-scores because I can
set an almost infinite number of cut-off levels that would define variations of
"passion" for a song.
Your second question . . . A "more clear representation of passion" on
a 7-point scale would be to use only the 7s because you are narrowing your
category to those people who rate it the highest.
An aside about passion scores. You can figure out what percentage of your scale
you devote to passion scores just by dividing the number of scores you are using
for passion by the total number of scores in the scale.
For example, if you use a 5-point scale and use 4 + 5 scores to define passion,
you use 40% of the scale to define passion. If you use a 7-point scale and use 6
+ 7 scores, you use about 29% of your scale to define passion. If you use a 10-point
scale and use 8 + 9 + 10, you use 30% of your scale (or 20% if you use only 9 +
10). And so on.
You need to decide what portion of your scale you're willing to devote to
passion. The smaller the percentage, the greater the passion since a smaller
percentage means fewer points to define the term. Or use Z-scores.
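The division described above is easy to check in a few lines. This is only an illustrative sketch; the function name `passion_share` is made up for the example.

```python
def passion_share(scale_points, passion_points):
    """Return the fraction of a rating scale that is counted as 'passion'."""
    if not 0 < passion_points <= scale_points:
        raise ValueError("passion_points must be between 1 and scale_points")
    return passion_points / scale_points


# (scale size, number of top scores used to define passion)
examples = [(5, 2), (7, 2), (7, 1), (10, 3), (10, 2)]
for scale, top in examples:
    print(f"{scale}-point scale, top {top} score(s): {passion_share(scale, top):.1%}")
```

On a 7-point scale, 6 + 7 works out to 2/7 of the scale (roughly 29%), and using only the 7s drops that to about 14%.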
Standard Deviations and Music Tests
My question relates to music tests. A while ago, you mentioned that in addition to the average score and a few other things, it’s also important to check the standard deviation for each song to see how much the respondents agreed in their evaluation of the songs. Can you explain that in more detail? - Ryan
Ryan: Yes, I can explain that. But more importantly, the question is, “Would I explain that?” OK. Just yanking your chain. I’ll be happy to explain it.
If you simply look at a song’s mean (average) score, you don’t know how that score was achieved. For example, let’s say that on a 10-point rating scale, a song receives a score of 5.5, right in the middle of the scale.
The first reaction to such a score is probably something like, “That’s an average testing song; not a hit; and not a piece of trash.” But is that true? The standard deviation tells you the amount of agreement among the respondents in reference to their song scores. (I know you could look at the raw scores to see how they are distributed, but if you have 100 respondents and test 400+ songs, that’s a lot of data to review. You’d go nuts looking at all those numbers. The standard deviation speeds up the process.)
To make this easier to see, I developed a sample spreadsheet for three songs—each with an average (mean) score of 5.5. While each song’s average is 5.5, the average was arrived at in three unique ways. Look at the bottom of the table to see the standard deviations.
| Respondent | Song 1 | Song 2 | Song 3 |
| 1 | 1 | 1 | 5 |
| 2 | 1 | 3 | 5 |
| 3 | 1 | 3 | 5 |
| 4 | 1 | 3 | 5 |
| 5 | 1 | 4 | 5 |
| 6 | 1 | 4 | 5 |
| 7 | 1 | 4 | 5 |
| 8 | 1 | 5 | 5 |
| 9 | 1 | 5 | 5 |
| 10 | 1 | 6 | 5 |
| 11 | 10 | 6 | 6 |
| 12 | 10 | 6 | 6 |
| 13 | 10 | 6 | 6 |
| 14 | 10 | 7 | 6 |
| 15 | 10 | 7 | 6 |
| 16 | 10 | 7 | 6 |
| 17 | 10 | 8 | 6 |
| 18 | 10 | 8 | 6 |
| 19 | 10 | 8 | 6 |
| 20 | 10 | 9 | 6 |
| Mean | 5.5 | 5.5 | 5.5 |
| Standard Deviation | 4.6 | 2.1 | 0.5 |
Song 1 has a standard deviation of 4.6, which is high for a 10-point scale and shows that there isn’t a lot of agreement among the respondents—the song is highly polarized—respondents either hate it or love it. Song 2 has a standard deviation of 2.1—more agreement than Song 1, but still shows that the song is not universally liked or disliked (there is some polarization). Song 3 has a standard deviation of 0.5, indicating that there is a lot of agreement among the respondents.
Get the idea here? In addition to the average score, you need to look at the standard deviation (standard/average difference from the mean) of the songs to determine how much the respondents agree in their ratings. A song that receives a standard deviation of 0.0 indicates that everyone rated the song the same way. The higher the standard deviation, the more disagreement there is among the respondents. (The standard deviation can’t be higher than the highest number in the rating scale.)
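For anyone who wants to reproduce the figures in the table, here is a minimal sketch using Python's standard `statistics` module. It assumes the sample standard deviation (dividing by n - 1), which is the version that matches the values shown; the population version would give 4.5 for Song 1.

```python
from statistics import mean, stdev

# Ratings from the 20 respondents in the table above.
songs = {
    "Song 1": [1] * 10 + [10] * 10,
    "Song 2": [1, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9],
    "Song 3": [5] * 10 + [6] * 10,
}

for name, ratings in songs.items():
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    print(f"{name}: mean = {mean(ratings):.1f}, SD = {stdev(ratings):.1f}")
```

All three songs print a mean of 5.5, with standard deviations of 4.6, 2.1, and 0.5.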
Standard Deviations and Music Tests - Follow-up
Roger - First off, thank you for all the work you put in to make this one of the most informative corners of the web—it is appreciated, especially by young programmers and air talent. My question is about standard deviation.
The question is, “Okay...aaaannnddd....?” In other words, “So what?” What are we supposed to do with standard deviation? I know a station will play a high testing song but won’t play a low testing TOTAL song. But, what about standard deviation? Should programmers try to play songs that deviate as much or as little as possible or somehow mix it all up? I can see the advantages and risks of doing each. What is your advice? Thanks again! - Dan
Dan: Thanks for the comment about the column. I’m glad it’s useful. On to your question.
Your question…“So what?”…is a good one. I often use the same question when students present ideas for research projects, theses, and dissertations. If the “So what?” question can be answered, then there is a high likelihood that the topic has merit. So here’s the answer to your question.
Although some people criticize research and researchers for a variety of things such as sterilization of radio, eliminating creativity, and so on, the fact is that if research is used correctly, it provides information that helps a decision maker make a decision. Research should not be used to make decisions. I’ll repeat…Research data should be used to help in the decision making process, not make a decision.
The neat thing about research and statistics is that there are an almost unlimited number of approaches and methodologies that can help radio programmers make decisions. However, the problem I have seen in the past 20+ years is that only a handful of decision makers in radio “get it.” That is, only a few have reached the point where they understand how research and statistics can eliminate a lot of ambiguity, confusion, and hesitation about the problem or decision under consideration.
Standard deviation is a good example of additional information that can be used to better understand the data. (I’m getting to your “So what?” question). The original question asked about using standard deviation in music tests, so I’ll address that area.
Music tests are important for music radio stations because they provide a PD with information about the playlist—the radio station’s product. And since music IS the product, it makes sense to have as much information as possible about the songs. When experienced PDs review music test results, they usually ask things like, “Do the songs have polarized ratings (love-hate)?” “How much do the sub-cells (e.g. ages) agree in their ratings?” and “How stable are the results?” These questions can be answered by looking at the standard deviations for each song. For example, let’s say that a song receives a mid-range or average score. Does that mean that all respondents rated the song in the middle of the scale? Could it be that one-half of the respondents rated the song high and the other half rated the song low (to produce the average score)? The standard deviation will give the PD the answer. (A low standard deviation indicates that the respondents agreed in their ratings; a high standard deviation indicates a “love-hate” rating.)
Does that help you understand why standard deviation is necessary? It provides more information about each song’s rating. It gives the PD more ammunition to determine if the song should be included in the playlist.
As you have probably heard, “Information is the key to success.” (Or something like that.) The more information you have about a decision you need to make, the more likely you will be to make the correct decision. Standard deviation information in music tests provides that.
Your question is, “How should standard deviation be used?” You should use it to give you information about the agreement in the respondents’ ratings. The best songs are those that have a high rating and a low (near zero) standard deviation because it shows that the respondents agree in their rating. A song with a high standard deviation (this varies depending on the scale used) shows that the respondents don’t agree in their ratings—the song is polarized (not universally liked) and may cause tune out by some listeners.
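One way to put this to work is to look at the mean and the standard deviation side by side for every song. The sketch below is only an illustration: the ratings are invented, and the two cut-offs (`HIGH_MEAN` and `HIGH_SD`) are hypothetical values for a 10-point scale, not recommended thresholds.

```python
from statistics import mean, stdev

# Hypothetical cut-offs for a 10-point scale; choose your own from your data.
HIGH_MEAN = 7.0
HIGH_SD = 3.0


def summarize(ratings):
    """Return (mean, SD, label) for one song's ratings."""
    m, sd = mean(ratings), stdev(ratings)
    if sd >= HIGH_SD:
        label = "polarized"
    elif m >= HIGH_MEAN:
        label = "high score, respondents agree"
    else:
        label = "respondents agree, but not a top scorer"
    return round(m, 1), round(sd, 1), label


# Invented ratings from 10 respondents per song.
songs = {
    "Song A": [8, 9, 8, 9, 8, 9, 8, 9, 8, 9],
    "Song B": [10, 10, 10, 10, 10, 10, 10, 10, 1, 1],
    "Song C": [5, 6, 5, 6, 5, 6, 5, 6, 5, 6],
}

for name, ratings in songs.items():
    print(name, summarize(ratings))
```

Song B has a high average, but its standard deviation shows that a couple of respondents hate it, which is the kind of detail the average alone hides.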
As long as I’m on the soapbox, let me add this: The radio industry continues to lag behind the other mass media when it comes to using research and statistics for decision making. While some radio people may say that radio is “over-researched,” the contrary is actually true. Radio is under-researched. There are many research methodologies and statistical procedures that radio PDs, MDs, GMs (and others) could use, but know nothing about. Whose fault is that?
It’s the fault of everyone. Radio managers (some of them) are at fault because of their aversion to research, or a failure to ask, “Is there another way to investigate this question?” Researchers (some of them) are at fault because they don’t push radio managers hard enough to consider alternatives. Radio managers usually know only one way to investigate a problem. A good researcher may know 10 ways to investigate the same problem. Both sides need to pursue new things—more effective ways to use research so decision making isn’t such a pain in the neck.
Style Ratings
Hi, Doctor: My English is not the best, but I will try to explain myself the best I can.
I was invited to watch the process of people responding to a Format Finder. When it came to rate different music styles by listening triplets of hooks representing each style, I got the impression that respondents did not wait until all three hooks of every music style were played, but they tended to rate the style based on the perception of the song that came in first.
Can it be that when you play the best and more powerful titles of each music style, the people don't listen to the whole triplet of songs but rather rate each song separately? Is there a way to avoid this?
Thanks in advance for your advice. - Anonymous
Anon: I don't see anything wrong with your English. There is no reason to apologize. Your question is great because it relates to many things about music testing.
First, I need to say that the answer to your question goes back to the early 1980s, but that's OK because my guess is that you're probably new to the radio research field and you may not know where to find some answers. I need to explain a few things in order to answer your question.
Although radio research has been conducted since the 1930s, the research methods we use now didn't start until the mid 1970s, and the significance of research in radio didn't emerge until the early 1980s. Testing music hooks to determine what the audience likes and dislikes started around 1981. After the auditorium procedure was established, the next logical step was to play hooks over the telephone to recruit respondents for music tests, perceptual studies, or other research such as focus groups and group administration studies. Respondents qualified for a research project if they rated a 3-hook sample of music in a certain way, usually 8, 9, or 10 on a 10-point scale (where "10" means "like a lot" or "favorite," or something similar to that).
My first experience with the problem you mentioned dates back to 1982 with a perceptual study I did for an AC radio station. We planned to use the "new" method of playing 3-hook samples to recruit respondents (the radio station's P1s, or fans). We asked the PD to select three songs that best represented his radio station, and used that 3-song group to recruit respondents. The respondents had to rate the AC hooks as an 8, 9, or 10 to qualify for the study. If they rated the hooks lower than an "8," they were terminated at that point; they weren't asked any additional questions. The procedure sounded logical and we thought we were doing the right thing. But we soon found that the approach wasn't the "right thing." Here's why . . .
After about a week of interviewing, the field service doing the telephone calls said they were having problems finding the radio station's P1s. After much discussion, we decided to change the screener to determine if we could discover the problem. What we did was eliminate the termination point if a respondent rated the AC hooks lower than an 8 — regardless of how the respondents rated the songs, they were still asked which radio stations they listen to and which radio station was their favorite. What we found was very interesting and changed forever (at least for me) the use of hooks to recruit respondents for research projects.
We found that about 40% of the radio station's P1s rated the 3-song hook group lower than an 8 on the 10-point rating scale. What? How can that be? To find out what was going on, we called back many of the people (remember, they were P1s of the radio station) and asked about their ratings. We found that even though we asked them to rate the songs as a representative group of songs, they didn't do what we asked. They rated the group low because they didn't like one, two, or even all three of the songs. What we thought were representative songs for the radio station did not match the perceptions of about 40% of the radio station's P1s.
The next logical step was to conduct a test of the 3-hook procedure in an auditorium music test setting. The results supported what we found in the telephone study, and we found what you did: that some people rated the 3-hook group after hearing only one or two of the hooks (even though they were told to wait until they heard all three hooks).
I spent a great deal of time testing this problem in a variety of ways. From all this research, I found two ways to avoid the problem of using 3-song hook groups to test anything or to recruit for research projects:
1. Instead of playing 3-song hook groups, read a list of artists for the respondents. This eliminates the possibility of selecting the wrong hook(s). For example, use a question like this: How would you rate music by artists such as . . . (use three or four artists who represent the format.) Make sure to tell the people not to rate the individual artists themselves, but the type of music the artists represent. This approach works well and the respondents are less likely to rate the group low because of one or two artists they don't like.
2. If you want to have the respondents rate hooks, have them rate several songs, but only one at a time (not as a group). It's best to test 5 to 10 songs that represent a format rather than only three songs. This allows you to drop one or more songs that may not test well (differ significantly from the other songs in the group), but get a good indication of the overall perception of the format.
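The second approach can be sketched roughly as follows. This is not the procedure used in the studies described above; it is a simplified stand-in that treats a hook as "differing" when its average sits more than a hypothetical cut-off (`Z_CUTOFF`) away from the group of hooks, measured in standard deviations. The hook names and averages are invented.

```python
from statistics import mean, stdev

Z_CUTOFF = 1.5  # hypothetical cut-off; a formal significance test could be used instead

# Invented average ratings (10-point scale) for six hooks rated one at a time.
hook_means = {
    "Hook A": 8.1, "Hook B": 7.9, "Hook C": 8.3,
    "Hook D": 8.0, "Hook E": 4.2, "Hook F": 8.2,
}


def format_score(hook_means, z_cutoff=Z_CUTOFF):
    """Drop hooks far from the group, then average the remaining hooks."""
    values = list(hook_means.values())
    group_mean, group_sd = mean(values), stdev(values)
    kept = {
        hook: score
        for hook, score in hook_means.items()
        if group_sd == 0 or abs(score - group_mean) / group_sd <= z_cutoff
    }
    return kept, mean(list(kept.values()))


kept, score = format_score(hook_means)
print("Hooks kept:", sorted(kept))
print(f"Format score: {score:.1f}")
```

Here Hook E is dropped because it tests far below the other five, and the format score is based on the remaining hooks.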
In summary, the problem with using song hooks to recruit respondents for research projects, or to test formats ("format finders" or "format hole studies") was identified over 20 years ago, but is still used by some people for some reason. My guess is that some researchers and non-researchers still use the method because it seems to be the right thing to do. The problem is that it's not the right thing to do, and conducting research based on what seems to be correct is not the way to conduct research.
Testing Intros
This subject came up while my PD and I were talking about music tests. We started talking about how we notice the way people switch between stations (mostly while driving). If they are searching for a song they like, they listen to a couple seconds to determine if they'll stay with that station. But a lot of times if they are listening to a song they like and then a new song starts, it only seems to take a few seconds of the intro for them to punch out and find the next station. I used to build auditorium tests for a living and in nearly ten years I never produced tests with anything other than the "hook" which most of the time comes about :60 or so into the song.
It seems people know long before they get to the hook whether or not they want to hear a particular song. I'm curious. Do you have any thoughts about testing song intros? Do you think that if there is a "punch out" factor early in the song, (like during the intro) that this information would be useful to radio stations? - Bryan
Bryan: The purpose of a music test is to determine listeners' ratings of songs they know. Sure, it would be possible to test song intros, but the design of music test methodology includes producing hooks that will be most relevant to the largest number of respondents.
I can remember in the early 80s testing which portion of a song to test to achieve the greatest recognition from respondents. While you and your PD may see people (some people) hit the radio button after a few seconds of a song's intro, the fact is that some people need to hear the "meat" of a song before they determine if it's familiar to them. That's why a segment from inside the song is usually used for the hook: it is most appropriate for the largest percentage of people.
Another reason to use a portion of the inside of a song is that intros for many songs don't provide enough information. Hooks are limited to about 5 seconds and there are many songs where the first 5 seconds don't provide much information. Now remember, I'm sure there are many people who can identify any song by the first three notes, but we're not interested in that. We are interested in providing the most information for the largest percentage of people rating the songs.
You also ask, "Do you think that if there is a 'punch out' factor early in the song, (like during the intro) that this information would be useful to radio stations?" Yes, I believe that's important and that's the information provided to you when a person rates a song low on the rating scale. Where I disagree with you is when you say, "punch out factor early in the song." The listeners punch out because of the SONG, not the intro to the song. The "early punchers" hit the button quickly because they are in the "I can recognize this song in three notes" category. But these "early punchers" will rate a song's intro the same as they would rate a hook from the inside of a song (low), and therefore, your point is moot. That is, you're already getting that information in your music test and testing the intro adds nothing to your knowledge.
Timing
Speaking of music tests, when is the best time of the year to conduct a music test? Is the best time right before Arbitron starts? - Anonymous
Anon: I am asked this question frequently, and I’ll give you the answer I always give, and that is:
The time to conduct a music test is when you think you need the information from your listeners (or potential listeners). There is no logic at all to conducting music tests only before an Arbitron book. I have heard the same comment millions of times, which is, “We want to have everything right before the start of the book.”
Why in heaven’s name would anyone want to have “things right” before the book starts? Why wouldn’t a person in charge of a radio station’s programming want things “right” 365 days of the year? People don’t make decisions to listen to a radio station the first time or to listen more often to a radio station only before a ratings period. They make these decisions every day of the year. The logic (?) behind having things “right” before the book starts makes absolutely NO sense to me; it's ridiculous.
Unhealthy Music Test Scores
Doctor: I am enjoying my time in your fine state of Colorado. I just moved here last month and I can see why you stay put as a resident.
At this new station I have joined, I have been going over our "Internet Callout" scores from the past couple of years. Before I get to my question, please allow me to state that I know your reservations about relying on these data. Trust me, I don't. I see the information only as one indicator tool out of a few at my disposal. My decisions are made by me, and not by my possibly faulty data.
Sadly, my possibly faulty data is the best I have at the moment (though I am
petitioning our GM to spend . . . wish me luck). Something I have noticed so
far: What defines a "good score" here is quite lower than anywhere I have ever
worked. We do the typical 1 to 5 scale, with 5 meaning "Gold," and 1 meaning
"Gulf of Mexico" water. At every station I have programmed, in multiple
formats, I am used to: (a) The top scores in our target demo landing somewhere
in the 4.0 to 4.5 range; and (b) Seeing maybe 30% of the songs tested land at a
4.0 or better.
But here at this new gig, in our target demo, I'll see maybe one or two songs
land at a 4.0 or higher. I can't decide what this means. Could it be that of
the listeners we have attracted in this demo, we are significantly missing the
mark on playing what they want? (But then why are they taking these surveys
over and over?) Might they simply be the most picky listeners I have ever
programmed for? Was every station I've run before now out of whack, and this is
the first splash of truth I've ever experienced?
Possibly related . . . when I look at the young end of this target, or even
shift my filters to the next demographic group down from them, the scores
resemble what I am used to.
I'm worried that I said all that and you're gonna be like, "Well you should
switch to z-scores. Everything else is crap." And that you'll say "it makes no
sense to compare audience A to audience F." I agree with that, and I don't view
this as a direct comparison. Looking forward to your perspective. - Anonymous
Anon: First, welcome to Colorado. I don't know who you are or where you
live and work, but most of the state is a nice place to be. On to your
questions . . .
1. I'll start with the comment about z-scores in your last paragraph. You are correct in saying that I will tell you not to make data comparisons (music test scores or any other ratings or scores) from one market to another without converting the data to z-scores. Comparing raw scores/ratings from one situation to another is meaningless because each data set has its own "metric." That is, each sample is unique in how it scores or rates anything. For example, while one sample may be "easy" graders and rate everything highly, another sample may be "tough" graders and rarely give the highest score/rating. One absolute rule in research, a rule without exception, is: Data comparisons of any kind require that the data be converted to z-scores. This is true not only for comparing data from one market to another, but also for your analysis of the ratings from one callout to another, or from one age cell to another. An analysis of your Internet Callout scores is meaningless unless you convert the data to z-scores. (For anyone who needs more information about z-scores, there are many questions/answers on the Research Doctor Archive on this page.)
2. Two pieces of information you didn't include are: (1) How many songs do you test in a callout?; and (2) Which songs do you test? Are they songs you're currently playing or songs that you are thinking about playing? These two points are very important, because at face value (only looking at the data you have without any background information), something "don't be right." It might be that you are testing too many songs. It might be that your instructions for your scale aren't clear. If you are testing songs you're currently playing, and only the lower end of your demo rates the songs highly, then at face value, the older people in your demo don't like the songs you're playing. If that's true, and I need a lot more information to verify that, then you have a problem.
3. With that said, there are several items that need to be addressed: (a) How are respondents recruited? (b) What screening questions do you use to qualify respondents? (c) What procedures do you use to verify that the respondents are legitimate?
You said that many people complete your Internet Callout "over and over." Who are these people? Are they legitimate respondents? Are they people just messing around on the Internet? Are they people from your competing radio stations trying to mess up your results? You need to know all those things.
In addition, have the screening requirements changed at all since your station's first test? If the screening requirements have changed in any way, you won't be able to analyze a history of your information unless you convert to z-scores.
4. As you probably already know, I'm not a big fan of collecting research information via the Internet. However, if that is the only data collection option available, there are a few tests that can be conducted to determine if the sample is legitimate (or at least, somewhat legitimate). Your approach to looking at the data only as an indication of reality is good. In addition, as I am sure you do, you should look at the results from several surveys before considering making a decision about a specific song. One test of a song means nothing.
5. If you verify that your sample and measurement instrument are OK, and you still don't get the types of high scores that you expect, then I think you will have to adjust your thinking. It may be that your listeners (or the people taking the survey) are "tough" graders.
6. The fact that the upper ages in your demo don't rate the music very highly bothers me. You need to look very closely at every single aspect of your testing procedure. Something isn't right, but I don't have enough information about the sampling procedures, measurement instrument, and other things, to determine what that "something" is. Look at everything.
However, the lower scores by the upper age listeners may be 100% true. If that's the case, you have a significant problem because, for these people, your radio station is their choice for mediocre music. In other words, the music on your radio station is severely limiting its potential.
7. One final thing: I hope your GM understands that the music on your radio station IS the product. If you don't test the product, how do you know what the listeners want? By not giving you money to test your product, the GM is essentially giving you a shovel to help dig the grave.
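To illustrate the z-score point in item 1, here is a minimal sketch with invented numbers. It assumes each age cell's song scores are converted using that cell's own mean and sample standard deviation; the song names, the scores, and the function name `z_scores` are all made up for the example.

```python
from statistics import mean, stdev


def z_scores(scores):
    """Convert one sample's song scores to z-scores using that sample's own mean and SD."""
    values = list(scores.values())
    m, sd = mean(values), stdev(values)
    return {song: (score - m) / sd for song, score in scores.items()}


# Invented average scores (5-point scale) for the same songs in two age cells.
younger_cell = {"Song A": 4.4, "Song B": 4.0, "Song C": 3.6}  # "easy" graders
older_cell = {"Song A": 3.6, "Song B": 3.2, "Song C": 2.8}    # "tough" graders

for label, cell in [("Younger", younger_cell), ("Older", older_cell)]:
    z = z_scores(cell)
    print(label, {song: round(value, 2) for song, value in z.items()})
```

The raw scores make the older cell look unhappy with every song, but once each cell is put on its own metric, both cells rank the songs the same way: Song A about one standard deviation above that cell's average, Song C about one below.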
When to Test Songs
Roger, you really should be charging me for the info you've provided many times previously. Here's my question: how do you decide if and when a song has been around long enough to get valid response in an auditorium music test (AMT)? I put a few near-currents in a recent auditorium test and the songs came back unfamiliar (no surprise) which negatively affected the overall score. Thanks again. - Anon
Anon: Nice try on sending you a bill. I don’t know who you are. However, you
know me, so just send a check to me for $750. The address is on my website…www.rogerwimmer.com.
But I won’t hold my breath.
On to your question . . . There is no way to know when to put a song in an
auditorium test. You did the right thing . . . just stick the songs in and see
what happens.
Now, you mention that you put a few near-currents in a test and the scores were
negatively affected by the unfamiliar scores. Hold on there . . .
You should do a run that pulls out the unfamiliar scores to see how the songs
perform only with the people who are familiar with the songs. In addition, if
you include the unfamiliar ratings in the score computation, you will affect the
average score for your test and this may change how you use the data. This is
especially true if you use Z-scores for data analysis…which you should.
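A "familiar only" run can be sketched as follows. The coding is an assumption: this example treats an unfamiliar response as 0, which may not match how your test codes it, and the ratings are invented.

```python
from statistics import mean

UNFAMILIAR = 0  # assumed code for "don't know this song"; use whatever your test uses

# Invented ratings for one near-current song on a 5-point scale.
ratings = [0, 0, 0, 0, 4, 5, 4, 5, 4, 4]

# Keep only the respondents who know the song.
familiar = [r for r in ratings if r != UNFAMILIAR]
familiarity = len(familiar) / len(ratings)

print(f"Familiarity: {familiarity:.0%}")
print(f"Mean with unfamiliar responses counted as {UNFAMILIAR}: {mean(ratings):.2f}")
print(f"Mean among familiar respondents only: {mean(familiar):.2f}")
```

In this example the song averages 2.60 when the unfamiliar responses are counted, but 4.33 among the 60% of respondents who know it.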
Z-Scores in Music Research
Great column! My friend and I got into a conversation recently about which music test numbers to use to compute Z-scores. I say use the Likert scale averages to compute standard deviations and Z-scores and he says to use Positive percents (4s & 5s) to make the calculations.
I guess it could be calculated either way and I kind of see his point. It just makes more sense to me not to blow off the negatives and neutrals completely. I'm curious as to which way you do it since you're the “Research Dr.” and obviously a fan of Z-scores. - Anonymous
Anon: I’m glad you enjoy the column. Thanks. About your question…
(For those who don’t know, the Likert scale discussed here is a 5-point scale. It is pronounced LICK-ert.)
I like your approach and here’s why.
Let’s say you have a song with 50% Negative scores (1 + 2) and 50% Positive scores (4 + 5). With your approach, the song average would be around 3.0, which means that its Z-score would be somewhere around 0.0, or average among all songs tested. In your friend’s approach, the song’s Z-score would probably be high because of the high percentage of Positive scores, but this would hide the fact that the other 50 percent of the respondents hate the song.
And there is another problem. If you compute Z-scores on only the Positive scores, a song with a 50% Negative/50% Positive split could rank higher than a song that is liked (overall) by more respondents. For example, you could have a song with 40% 4s and 60% 3s. This would probably be a fairly high ranking song, but it would rank lower than a song with 50% Negatives/50% Positives. That don’t be right.
Your question is a good one because it demonstrates why it’s necessary to know what data you’re using to compute your Z-scores. Z-Scores can be computed for anything, but ya gots ta know the original data or you may interpret something incorrectly.
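The ranking problem described above can be shown with a small, invented example. The sketch below computes Z-scores two ways for the same four songs, once from the average rating and once from the Positive percent (4s and 5s). The song names and ratings are made up, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev


def z_scores(values_by_song):
    """Z-score each song against the mean and SD of all songs tested."""
    values = list(values_by_song.values())
    m, sd = mean(values), stdev(values)
    return {song: (v - m) / sd for song, v in values_by_song.items()}


# Invented ratings (5-point scale) from 10 respondents per song.
ratings = {
    "Split": [1] * 5 + [5] * 5,   # 50% Negative, 50% Positive
    "Steady": [3] * 6 + [4] * 4,  # 60% 3s, 40% 4s
    "Hit": [4] * 5 + [5] * 5,
    "Dud": [1] * 5 + [2] * 5,
}

averages = {song: mean(r) for song, r in ratings.items()}
positives = {song: 100 * sum(x >= 4 for x in r) / len(r) for song, r in ratings.items()}

z_from_average = z_scores(averages)
z_from_positive = z_scores(positives)

for song in ratings:
    print(f"{song}: z (average) = {z_from_average[song]:+.2f}, "
          f"z (Positive %) = {z_from_positive[song]:+.2f}")
```

Using averages, "Steady" ranks above "Split," whose Z-score lands close to 0.0. Using only the Positive percent, the order flips and "Split" ranks higher, even though half the respondents gave it a Negative score.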
All Content © 2012 - Wimmer Research All Rights Reserved