Auditorium Music Test Questions - Page 2
Dial vs. Paper
Hi Doc, love the column. I'm being pitched by two different research companies for our small group's auditorium music test (AMT) business. One company uses the dial method to collect scores, and the other uses traditional paper and pencil. Each claims their method is superior. Any thoughts? - Anonymous
Anon: I'm glad you enjoy the column. Thanks. OK, here we go with a can of worms . . .
Auditorium music testing was started (with paper and pencil) in either 1980 or 1981 by The Research Group (TRG), but other companies got involved immediately. One of these companies was Cox Broadcasting, where I had just started working as a researcher. One of my first responsibilities was to analyze the reliability and validity of TRG's auditorium music testing methodology. I did that using the results from an early 1982 music test TRG conducted for WEAZ in Philadelphia, which was then a Jim Schulke "Beautiful Music" radio station (Cox Broadcasting owned Schulke Radio Productions at that time).
I still remember that test very clearly. Terry Patrick, now a successful programming consultant on his own, worked for The Research Group then, and was the moderator of the test. And what a test that was! Listening to about 300 hooks from groups like "101 Strings" and "The Henry Mancini Orchestra"…kind of like being nibbled to death by ducks!
The reason I say all this is that I know for a fact, since I was involved in the process, that the validity and reliability of the auditorium music testing procedure have been documented since at least 1982. We looked at every step of the process in the 1982 analysis…recruiting, site location, setting, measurement instrument (the scale), and more. Every step was analyzed.
And since that time, I have checked and rechecked the validity and reliability of paper-and-pencil auditorium music testing. I have found that the results are always the same—the procedure is a solid scientific research methodology. In other words, it tests what it's supposed to test and it produces consistent results.
Enter the electronic dial pad procedure in (I'm guessing) the early 1990s.
My problem is that I have never seen an analysis of the dial pad methodology, specifically a comparison of results to a paper-and-pencil test. The data may exist somewhere; I just haven't had the opportunity to review such a study. So my only alternative is to see what is said about the method.
I went to Broadcast Architecture's web site and found the company's description of the dial pad method. Some of the points presented, with my comments following each point, include:
The most accurate and effective auditorium test available. There is no definition or supporting data for "most accurate" and "effective" so I'm not sure what that means.
Real-time video analysis of the effectiveness of a station's music, flow, personality and marketing strategies. I'm not sure what the benefit is of "real-time video analysis." How does this help a programmer? There might be an advantage to seeing the results instantly, but I don't know what that is. If that's important to you, then it's an advantage.
Results are immediate and ready to implement within 24 hours of testing. Results are not immediate with a paper-and-pencil test, but they are available within about 24 hours, give or take a few hours. If a few hours are important to you, then the dial pad has an advantage.
Test participants turn the dial "up" as they hear songs they like, and turn it "down" as they hear songs they don't like. This recreates the radio listening environment, and allows us to measure exactly how participants feel about every song they hear. I'm not sure how the dial pad "recreates the radio listening environment." Maybe because a dial is turned similar to a radio, but there are no supporting data to verify that this is important. In addition, this doesn't mean that the paper-and-pencil test does not measure exactly how participants feel about every song.
The dial also allows the audience to respond instantly to the music they are hearing. In traditional auditorium testing - where pencil and paper are used - participants concentrate more on "dot-filling" and less on listening . . . . resulting in more accurate test results. This "advantage" is, quite frankly, rather silly. Respondents also respond instantly using paper-and-pencil, and how is concentrating on "dot-filling" any different from concentrating on turning a dial up or down? I don't get it. In addition, there are no supporting data for the statement that the dial pad approach produces results that are "more accurate."
The data are processed the night after the test . . . So are the results for paper-and-pencil tests.
This technology also can be used to receive accurate and actionable anslysis (sic) of morning shows, top-of-mind images, advertising and promotions. This also can be accomplished in the paper-and-pencil test.
. . . is truly the most effective testing available. Once again, there is no definition of "effective" and there are no supporting data.
So what? Well, based on the information, the only documented advantage of the dial pad over paper-and-pencil is that you can see the respondents' ratings as they happen. If that's important to you, then, well, that's important.
As you probably can guess, I strongly support scientific research because it is the only Method of Knowing that is objective, cumulative, and self-correcting. I also support refinement and change in research methodologies if the refinement or change is verified to provide better data—especially data that are more valid and reliable than a previously used methodology.
If the dial pad approach to music testing does, in fact, produce data that are more valid and reliable, then I'm all for it because it advances our knowledge in the area. The problem is that I have never seen such information. The only information I have now is that the dial pad is a different approach—not better, not worse—just different. The technology of the dial pad means nothing to me because technology in and of itself does not make the procedure better than "old technology." In some cases, technology may be a gigantic step backwards.
Unless you have access to comparative data that I don't know about (comparing the two approaches using a study following the guidelines of scientific research), the only thing I can tell you is that you must select the procedure that you are most comfortable with. I don't have access to comparative data to provide any more help. However, I would be happy to review such a study if it exists.
Dropping Songs - What is the cutoff level?
If you have been shooting in the dark with your playlist, and you decide to test your large library (auditorium tests, callout, whatever), is it more logical to eliminate a certain percentage of your library? In other words, for example, drop the bottom 25% rated songs or songs that rate below a certain level (every song that scores below a 5 on a 1-10 scale)?
To me, dropping the bottom X% of the songs is effective, but perhaps not as potentially correct as dropping songs that scored below a certain point—but I don’t know where the bar should be set, and I think if you have a crummy playlist, you risk having only 50 songs left to play!
Perhaps a one-two punch: In the first survey, knock out the bottom X% of songs, then look at the scores to see if you can dump all songs below a certain score and still have a rotation that doesn’t burn? (And perhaps find better sources to figure out what songs your listeners DO like?) - Gene
Gene: I have answered many questions about music test scoring and how to select songs and many of the questions and answers are here in the Archive. However, I’ll answer again to make sure I address your specific questions.
If you have been reading this column for a while, you should have noticed that I don’t believe in doing anything with raw data from music tests. Raw data are too arbitrary and they can fluctuate dramatically depending on the respondents you use for your test. For example, a song may receive an average score of 6.0 in two of your tests. In one test, the song may be a “hit” and in the second test, the song may be a “miss.”
You give an example of dropping songs that score below a 5 on a 1-10 scale. This could be a problem if you have a group of “tough scoring” respondents. You may not have many songs rated close to the top of the scale (10); you may have a situation where the top song has an average score of 6.0 or 7.0 (or something substantially less than 10). If you decide to drop songs that score less than 5.0, you may have 20 songs in your playlist.
You should never look at raw scores for music tests for anything. The raw scores aren’t meaningless, but they are close to that. You should always convert the data to Z-scores (discussed at length in “The Research Doctor Archive”). Z-scores standardize the data and eliminate the fluctuations present in raw scores. Z-scores also allow you to compare scores from one test to another, something you can’t do with raw scores. It’s the same reason why it’s not statistically legitimate to compare the raw ratings and shares in Arbitron from one book to another: different samples are used.
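A minimal sketch of the Z-score conversion, assuming each song's average raw score is standardized against the average and standard deviation of all songs in the same test. The song names, scores, and the function name z_scores are hypothetical and only illustrate how the same raw 6.0 can be a "hit" with one panel and a "miss" with another.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Convert {song: average raw score} to {song: Z-score} within one test."""
    values = list(raw_scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # standard deviation across the songs in this test
    if sigma == 0:
        raise ValueError("all songs scored the same; Z-scores are undefined")
    return {song: (score - mu) / sigma for song, score in raw_scores.items()}


# Hypothetical data: "Song A" averages 6.0 in both tests,
# but the two panels score very differently overall.
tough_panel = {"Song A": 6.0, "Song B": 4.0, "Song C": 5.0, "Song D": 3.0, "Song E": 2.0}
easy_panel = {"Song A": 6.0, "Song B": 8.0, "Song C": 9.0, "Song D": 7.0, "Song E": 10.0}

z_tough = z_scores(tough_panel)
z_easy = z_scores(easy_panel)

print(round(z_tough["Song A"], 2))  # 1.41  (well above that panel's average)
print(round(z_easy["Song A"], 2))   # -1.41 (well below that panel's average)
```

With the tough panel, the 6.0 is the best song in the test. With the easy panel, the same 6.0 is the worst.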
I also need to add that Z-scores should be used even if you re-use 50%-75% of the respondents from one music test to another, which is the only way to conduct music tests.
Although dropping the bottom 25% of the songs in your test is better than dropping songs that score below a set level (you suggest 5.0), you still need to look closely at each song to understand why it scored so poorly. For example, your target may love the song, but your P2s hate it. What do you do? You might consider playing the song if your P1s love it. In other words, music test data should be used as a guide, not a “bible” of what you should and should not play.
OK, so what? If you don’t compute Z-scores, then use the “bottom 25%” approach, although I don’t know why you would select 25%…why not the bottom 30% or 45%? If you play any songs under the mean (average) in your test, then at least 50% of the respondents don’t like the song. What I do suggest is to convert your data to Z-scores and consider dropping any song that has a Z-score of –1.0 or lower (one standard deviation or more below the test average). You won’t be setting a bar; the bar is set by your respondents.
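A rough sketch of how a Z-score cutoff differs from a fixed raw-score bar, reading the cutoff as one standard deviation or more below the test average. The data, the function name songs_to_review, and the parameter z_cutoff are hypothetical. Flagged songs are candidates to look at closely, not automatic drops.

```python
from statistics import mean, pstdev


def songs_to_review(raw_scores, z_cutoff=-1.0):
    """Return songs whose Z-score is at or below z_cutoff, lowest first."""
    values = list(raw_scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every song scored the same, so nothing stands out
    z = {song: (score - mu) / sigma for song, score in raw_scores.items()}
    return sorted((song for song in z if z[song] <= z_cutoff), key=z.get)


# Hypothetical "tough scoring" panel: no song averages above 6.5.
test_scores = {
    "Song A": 6.5, "Song B": 6.0, "Song C": 5.5, "Song D": 5.0,
    "Song E": 5.0, "Song F": 4.5, "Song G": 4.0, "Song H": 3.5,
}

# Fixed bar chosen in advance: everything under 5.0 on the raw scale.
raw_rule = [song for song, score in test_scores.items() if score < 5.0]
# Bar set by the respondents: Z-score of -1.0 or lower.
z_rule = songs_to_review(test_scores)

print(raw_rule)  # ['Song F', 'Song G', 'Song H']
print(z_rule)    # ['Song H', 'Song G']
```

In this made-up test the fixed bar removes three of eight songs, while the Z-score rule flags only the two that sit clearly below the rest of the list.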
Find better sources to figure out which songs your listeners like? I don’t know of any better method than the auditorium approach. It has been shown to be both reliable and valid for over two decades.
Doc: Is there one thing you could identify as the biggest fallacy (urban legend?) about music testing? - Anonymous
Anon: I could list several items, but I think the most often mentioned (and passed on) fallacy about music testing is that the order of songs tested has an effect on the scores.
As I have stated many times in this column, I have tested the correlation between song order and song score numerous times and have never found a correlation of any significance. I'll repeat: the order of songs tested in either an auditorium test or callout has no effect on the song scores.
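For anyone who wants to run this check on their own test, here is a minimal sketch that computes the Pearson correlation between a song's position in the test and its average score. The positions and scores are hypothetical, and Pearson's r is an assumption about which correlation to use; the column does not say which one was computed.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the lists has no variation")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: order in which each hook was played, and its average score.
positions = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [6.1, 4.8, 7.2, 5.5, 6.9, 4.9, 7.0, 5.6]

r = pearson_r(positions, scores)
print(f"order vs. score: r = {r:.2f}")
```

A value of r close to 0, as in this made-up example, means position in the test tells you essentially nothing about the score. A real test has several hundred hooks, so the same calculation would run over the full list.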
However, for anyone who must have some order to the songs tested, one excellent way is to put the songs in alphabetical order by song title.
Hey Doc: What percentage score (roughly) is a song still considered unfamiliar in your book? When do you get uncomfortable seeing a survey with that percentage on songs? 2 songs? 4 songs? More? Thanks for your help. - Anonymous
Anon: Let’s work backwards to try to get an answer.
It’s rare to find a song that has 100% familiarity. I know it happens, but there aren’t many that truly fall into the “everybody knows this song” category. When you conduct any type of music test (callout, auditorium, Internet) and look at familiarity percentages, you should deduct 10% for what I call an “idiot factor,” and another 10% for error (sampling error, measurement error, and/or random error).
The 20% reduction means that a song with a familiarity of 80% is about as high as you can expect. As I said, I know there are songs that will approach 100%, but that doesn’t happen often. (You may see high 80s and 90s familiarity percentages if you test only P1s.)
OK. If we take 80% as the cut-off level for “high familiarity,” then anything less than 80% should give an indication that the song hasn’t reached its familiarity potential. I’m not sure about an exact number of songs, but if 20% of the songs you test are unfamiliar (more than 20% of the sample is unfamiliar with the songs), then you may be criticized for playing “too many unfamiliar songs.”
The problem here is that there aren’t any strict guidelines. Most listeners say they like to hear familiar songs, but how does a song become familiar if radio stations don’t play it? A song has to be unfamiliar at some point. You need to watch the familiarity percentage and song score simultaneously.
That is, if a song has a low familiarity percentage, but is rated highly by those who know it, the song will probably be a good one. If a song has low familiarity and a low score by those who know it, the song is probably not a good one.
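The two-way reading described above can be sketched as a small lookup, assuming the 80% familiarity cut-off from this answer and a Z-score computed only among respondents who know the song. The threshold z_good = 0.0 (at or above the test average), the function name classify, and all of the data are hypothetical.

```python
def classify(familiarity_pct, z_among_familiar, familiarity_cutoff=80.0, z_good=0.0):
    """Label a song using familiarity and its score among those who know it."""
    familiar = familiarity_pct >= familiarity_cutoff
    liked = z_among_familiar >= z_good  # z_good is an illustrative threshold
    if familiar:
        return "familiar, liked" if liked else "familiar, not liked"
    return "unfamiliar, promising" if liked else "unfamiliar, weak"


# Hypothetical results: (familiarity %, Z-score among respondents who know the song)
songs = {
    "Song A": (92.0, 1.2),
    "Song B": (65.0, 0.9),
    "Song C": (55.0, -1.1),
    "Song D": (84.0, -0.3),
    "Song E": (88.0, 0.4),
}

labels = {name: classify(fam, z) for name, (fam, z) in songs.items()}
unfamiliar_share = 100.0 * sum(fam < 80.0 for fam, _ in songs.values()) / len(songs)

for name, label in labels.items():
    print(name, "->", label)
print(f"{unfamiliar_share:.0f}% of tested songs are below 80% familiarity")
```

In this made-up list, "Song B" is the kind of song worth watching (low familiarity, liked by those who know it), and 40% of the list falls under the 80% cut-off, which is above the 20% level mentioned earlier.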
I am not the original poster, but your response prompted a related question.
Applying the “idiot factor” and sampling/error margin information you wrote (take 20% off for those two factors), is it safe to assume that burn is also lower among the majority of my audience vs. the P1s or idiots? - Anonymous
Anon: Good question. The answer is “maybe” because it depends on how many other radio stations in the market are playing the same songs. Your P2s, P3s, and other Ps may also have high burn percentages because they may hear the same songs on other similarly formatted stations.
If you don’t share much with similarly formatted radio stations, then your non-P1s' burn percentages will usually be lower than for your P1s.
Familiarity Follow-Up 2
I am the person who submitted the question about music tests and familiarity. I guess I need closure on the topic.
I need to refute or confirm this notion: playing great testing current songs more often (until burn becomes an issue) will increase my cume. With the alternative being playing great testing current songs less often, with more recurrents and gold filling the gaps. (Referring to these options from a CHR point of view.)
I’ve noticed that often when new stations sign on, especially those of a Hot AC, CHR, CHR/Rhythmic, or Urban nature, they cite ultra “tight” playlists of currents and the newest recurrents as one of their biggest tools in building cume quickly. - Anonymous
Anon: Closure, eh? Let’s see if this helps.
Generally speaking, playing the best testing current songs more often on a CHR will increase cume because that’s what that audience (generally younger people) wants to hear. CHR is current based and that’s what the younger people tend to like…the best new songs played frequently.
Field Service Tests
What do you think of the approach of having listeners go to the field service and take the test there at their convenience, as opposed to going to a hotel on one or two nights? - Anonymous
Anon: Music tests at a field service (the company that recruits the respondents) were started about 10 years ago by radio companies that conducted their own music tests. The main reason was to eliminate the travel of company personnel to conduct a large number of tests. The field service test eliminated travel because it can be arranged with one telephone call.
The procedure requires videotaping the introduction to the test to explain the rating scale and the purpose of the test. This controls the introduction by having all respondents exposed to the same introduction and eliminates the participation of field service personnel. A respondent is recruited, shows up at the field service, and is escorted into a small room to take the test. The only responsibilities of the field service are to give the respondent the song scoring materials and start the videotape.
That’s a very simple explanation. What are some of the potential problems? Well, if the procedure is designed correctly and is supervised carefully, there shouldn’t be any problems. However, some problems include:
The test should take no longer than a few days. If it drags on too long, then events from the time when the first respondent takes the test until the day when the last respondent takes the test may affect the results.
All field service music test data must be checked for reliability and validity. In addition, it’s necessary to check for response sets (the same answer) for every respondent. These procedures are easy to do for someone who understands statistics, but are too complicated to explain here.
The field service must be given explicit instructions on how to conduct the test in the event a respondent has a question. This is also important if there is a problem with the introduction videotape.
The respondent must be checked frequently to ensure that he/she is actually taking the test. In initial tests of this system, I found that when left alone, some respondents tended to get off the task at hand and missed scoring some songs.
Frequent breaks are important because some respondents tend to get bored—it’s lonely for some people to sit alone in a room listening to several hundred hooks.
As I said, if the procedure is designed and supervised well, there shouldn’t be any problem. However, it is absolutely necessary to conduct several statistical tests to ensure that the data are good.
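Two of the checks mentioned above are simple enough to sketch: spotting a response set (a respondent who gives the same answer to every song) and counting songs a respondent skipped. This is only an illustration with hypothetical respondent IDs and scores; it does not cover the reliability and validity tests, which need more statistics than fits here.

```python
def screen_respondent(scores):
    """Check one respondent's scores; None marks a song that was not scored."""
    answered = [s for s in scores if s is not None]
    return {
        "missing": len(scores) - len(answered),
        # A response set here means every answered song got the same score.
        "response_set": len(answered) > 1 and len(set(answered)) == 1,
    }


# Hypothetical panel: respondent ID -> scores in test order (1-10 scale).
panel = {
    "R01": [7, 3, 9, 5, 6, 2],
    "R02": [5, 5, 5, 5, 5, 5],
    "R03": [8, None, 4, None, 6, 7],
}

report = {rid: screen_respondent(scores) for rid, scores in panel.items()}
flagged = [rid for rid, r in report.items() if r["response_set"] or r["missing"] > 0]

print(flagged)  # ['R02', 'R03']
```

Here R02 is flagged for scoring every song a 5, and R03 is flagged for missing two songs. What to do with a flagged respondent is a judgment call for whoever analyzes the test.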
I received a multiple-question note from Andy. I'm breaking his questions up to make it easier to follow . . .
(Q. 1) Once you have tested the currents and library and settled on the appropriate tunes how or what kind of test should you conduct (or is there some kind of standard) to see exactly how often you should rotate each category?
Answer: I know you're looking for an exact answer here, but I don't think there is one. The reason is that listeners vary so much. While some listeners say they like to hear their favorite songs over and over again, others do not. Some say that hearing the same song twice in one day is fine and others say that's too much.
You could get answers to these questions by asking your listeners.
(Q. 2) Is there a certain Z-score cutoff below which it wouldn't be beneficial to put a song on the air?
Answer: Generally speaking, any song with a Z-score of –1.0 or lower (one standard deviation or more below the test average) probably isn't a good song to play on your radio station. However, this is only a general rule. You need to look at your data carefully to find out if there is any redeeming quality to a low scoring song.
(Q. 3) What is the median response among test respondents you've seen, in terms of how often a song is played which constitutes 'a song that gets played too often?' Is it more of a scheduling problem? Because even if a song rotates every 2 hours, with our average TSL being 7:00, that listener would only hear that song every 2 days—on average. Or is that too often?
Answer: I don't know of any "median" response for a song that is perceived as "played too often." The best thing for you to do is ask your listeners. I don't want to predict what your listeners think about rotation. That's not appropriate.
Your questions are not easy to answer because the situation is not cut-and-dried. Consider this—people usually listen to 3 or 4 radio stations. You may have a perfect rotation, but your listeners may also hear the same songs played by your competitors. You may be criticized for playing songs too often even though you don't. Therefore, you're going to have to accept some complaints about repetition.
(Q. 4) I know for me as a listener I don't notice tight rotations as a passive listener during the week. But on the weekends, when I make more trips in my car or have the radio on in the garage, when I hear a song repeat I register it pretty quickly. Is attention to these things affected by outside distractions for listeners? Sorry for all the questions, I'm just curious.
Answer: Your own experience raises an excellent question that you should ask your listeners. My experience with radio listeners indicates that they will probably agree with you.
Because there is no hard evidence (it may exist, but I haven't seen it) about song rotations on radio stations, the best you can do is worry only about what you do on your radio station. You can't control what your competitors do and you can't keep your listeners from tuning to other radio stations. It would be great if some company, organization, or individual radio station would be willing to test this question. It would be nice to have answers that are more definitive.
Many research studies have asked listeners what they don't like about radio stations or why they tune from one radio station to another. Repetition usually shows up. However, there is a need to investigate this question in a full-blown (scientific) study to find out what specific effects the repetition perception has on listening.
On the subject of hooks, there are many songs that have several well-known hooks. "Paradise by the Dashboard Light" and "Scenes from an Italian Restaurant" are songs that have "songs within the song." Which hook do you pick? If you pick two hooks and test them both, which score do you believe? - Anonymous
Anon: Good question. To get the best answer, you'll need some knowledge of statistics. When I test which hook is best, I do several things and then look at the results of all of them to select the best hook.
Compute the standard deviation for each hook. What you're looking for is the hook that has the lowest standard deviation because it means that there is more agreement among the respondents.
Compute a t-test for the different hooks to determine if there is a statistically significant difference between the scores.
Convert the scores to Z-scores and then conduct a significance test between the scores.
Ask the respondents which hook best represents the song.
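The first two steps above can be sketched with nothing more than the Python standard library. This is an illustration under stated assumptions: the scores are hypothetical, and a paired t-test is used because the same respondents are assumed to score both hooks (the list above just says "a t-test"). The critical value 2.262 is the two-tailed 5% value for 9 degrees of freedom from a standard t table, so it applies only to this 10-respondent example.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for two hooks, same respondents."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need the same respondents (at least two) for both hooks")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Hypothetical scores from the same 10 respondents for two hooks of one song.
hook_a = [7, 8, 6, 7, 9, 7, 8, 6, 7, 8]
hook_b = [6, 9, 5, 8, 9, 6, 8, 7, 6, 8]

sd_a, sd_b = stdev(hook_a), stdev(hook_b)  # lower SD = more agreement
t, df = paired_t(hook_a, hook_b)
T_CRIT = 2.262  # two-tailed 5% critical value for df = 9, from a t table

print(f"SD hook A = {sd_a:.2f}, SD hook B = {sd_b:.2f}")
print(f"t = {t:.2f}, df = {df}, significant: {abs(t) > T_CRIT}")
```

In this made-up example the two hooks do not differ significantly, and hook A has the smaller standard deviation, so hook A would be the one to lean toward before asking the respondents directly.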
OK, so let's assume you don't know how to do these statistical things. Go ahead and test two or three hooks you think are representative of the song. Then . . .
Prepare a short tape with the hooks for the music test. After the test, play the tape and ask the respondents which hook, if any, best represents the song.
Look at the scores for the hooks you tested, particularly the spread of the scores (how many low, medium, high scores). Use the hook that has the least amount of spread—the one where the scores are most grouped together.
If you still don't know what the heck is going on, then do this: Test both hooks and average the scores together.
I have done these comparisons thousands of times, and usually find that when you consider sampling error, the two hooks will test the same. (I'm assuming that both hooks are representative of the song.)
Which company should I use to produce hook tapes for our music tests? – Thom
Thom: While I do have a preference, I’d prefer not to mention it in this column. Instead, I’ll say this . . . hooks can make or break a music test. A good tape moves like a metronome…hook number, dead air, hook, dead air…hook number, dead air, hook, dead air . . . (about 10 seconds total for each hook). This means that the tape is consistent in hook length, sound (you don’t want the volume to vary from one hook to another), and pacing. In addition, select a company that is consistent from one tape to another—kind of like McDonald’s in that you always know what you’re going to get. Finally, I know there are price differences among the companies. Be sure to check that out.
A little aside here…Music testing started around 1980 or so and there wasn’t much known about the process back then. There were no hook tape companies—the PDs made the tapes (or assigned the task to someone else). We never knew what to expect. In one of the first tests I ever did, I met the PD at the hotel where the test was going to be conducted and asked to hear the tape. He stuck the tape in the machine, hit the play button, and out came . . . hook, dead air, hook, dead air, hook, etc. I said, "Where are the numbers?" His response was something like, "What numbers?"
He had to stand in front of the room the whole night reading the hook numbers out loud. Things have changed since then.
How long should a hook be? - Anonymous
Anon: With a few exceptions, most hooks should be 5 to 7 seconds long. In numerous tests of hook length, we found that most people identify a familiar song in 3 seconds or less. ("I can name the song in one note.")
Some songs and some formats (such as the old Beautiful Music) require hooks up to 10 seconds long. (Aside: I can remember the Beautiful Music tests we conducted in the early 80s—about 300 instrumental songs. Talk about jump in the bathtub and slit your wrists!)
All Content © 2015 - Wimmer Research All Rights Reserved