Auditorium Music Test Questions - Page 1


Auditorium or Callout?

Dr: Enjoy your column.  In your mind, what's a more effective way of music testing, auditorium or callout? - The Great One


TGO: I'm glad you enjoy the column.  Thanks, and on to your question . . .


My comment about which music test approach is better has nothing to do with what's in my mind.  My comments are based on scientific research procedures.


So, with that in mind, there is no question that an auditorium music test is more valid (Does it test what it's supposed to test?) and reliable (Does the test provide consistent results over time and with different samples?) than a callout test.


The main reason why auditorium tests are better than callout is control over the testing situation.  In an auditorium test, the researcher knows: (1) who takes the test; (2) that all respondents were exposed to the hooks; (3) that all respondents were exposed to the same hooks; (4) that each person rated the hooks alone; (5) that all respondents were exposed to the hooks under the same environmental conditions; (6) that all respondents rated the hooks without extraneous intervening variables, or at least were exposed to the same extraneous variables; and (7) that respondents' questions about the procedure can be answered immediately and uniformly.


That's just part of the list, but should provide enough evidence to show that an auditorium music test allows the hooks to be rated under a controlled situation.  This can't be done in the callout methodology.  There is no debate here.  Those are the facts.

Auditorium Sample

I’ve always conducted auditorium music tests with a room of 100% cumers, 50% my P1s and 50% P1s of my top competitor.  One research company I worked with sought participants by making hundreds of blind phone calls to my market.  In recent years, I’ve worked with another company that requests a copy of my massive listener database to assist in contacting participants.  (The database contains “card club” members, who obviously may or may not be my P1s).


Both companies claim their methods produce a valid sample.  I’m concerned that the second method may invite too many “insiders”—frequent contest participants or a wedge of listeners that give us an unusually high TSL. The only differences I’ve noticed are slightly lower song scores or more burned titles when the second company conducts the tests.  Your thoughts? - Anonymous


Anon:  Excellent question.  I need to discuss a little research background stuff before I get to your question.


The ideal sample in behavioral research (that is, research with humans) is a random sample, which is a sample where everyone has an equal chance of being selected.  In theory, we would randomly select a number of respondents and all would agree to participate in the project, whether it’s an auditorium test, callout, focus groups, or perceptual study.  But we know that not all people selected will agree to participate for one reason or another.


The proportion of people who agree to participate in a research project is known as the acceptance rate, which is the percentage of people called or contacted who agree to participate.  In media research studies, the acceptance rate is usually about 60%.  In other words, about 40% of the qualified people contacted do not agree to participate.


What does that mean?  It means that research studies never use a truly random sample.  In effect, the sample is a group of volunteers who agree to participate.  If anyone tells you that their sampling procedures will give you a true random sample, they are blowing smoke.  The only way to have a true random sample is to force the people who are selected to participate in the study, and we can’t do that.


OK, so we know that all samples in radio research (and all behavioral research) are volunteers.  There are several ways to try to recruit these people for a study.  You could:

  1. Make “blind phone calls” (as you call them) and hope to find qualified respondents (by the way, the research term is “cold calls”).  While this process is fine from all statistical perspectives, it is very time consuming and costly.  A cold call recruit may cost as much as three times what other procedures cost because so many calls are wasted (non-working numbers, businesses, and so on).

  2. Buy a list of, say, 25-34 year olds in your market from one of several sampling companies.  These lists aren’t perfect, but they eliminate many of the wasted phone calls.  But this procedure is still expensive because many of the people on the list will not match the exact target you’re interested in.  (Using a purchased list is probably the most often-used method in all types of behavioral research.  The problem, though, is that caller ID, answering machines, and the unwillingness of many people to participate have caused the cost of this tried-and-true method to skyrocket.)

  3. Use an established list of participants.  In your case, the research company uses your database of Card Club members.  However, as you mention, even these lists aren’t perfect because of data entry errors, people changing their radio station preference, and so on.  But it’s the best approach to use to reduce expenses, and it’s the approach that will probably be commonplace in the next few years.  If not, radio stations will not be able to afford any type of research.

OK, that’s the background.  You want to know if the two approaches used by your research companies are valid.  The answer is “yes.”  Cold calls will eventually find qualified respondents, but the method is costly.  Using a radio station’s database substantially reduces the number of phone calls needed because the list, although not perfect, eliminates many wasted phone calls.  The same goal—finding qualified respondents—is possible with both sampling methods.


Now, one thing you need to keep in mind is that a research company that uses your database should give you a break in price because recruiting costs will be lower than cold calls.  I’m assuming here that your database is good.  I have seen many radio station databases where 60% or more of the names are bad.


You mention that your database sample produced slightly lower song scores or more burned titles.  I don’t know what “slightly lower song scores” means.  Did you conduct a statistical analysis to verify that?  If they only look lower, they may not be.  In order to check this out, you need to conduct t-tests or convert the scores to z-scores.
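To make that check concrete, here is a minimal Python sketch of both ideas: converting a set of scores to z-scores, and computing a t statistic for two independent groups of ratings.  The ratings, the function names, and the choice of Welch's unequal-variance form of the t-test are assumptions for illustration; the answer above does not say which version of the t-test to use.

```python
import math
import statistics


def z_scores(values):
    """Convert raw scores to z-scores (mean 0, sample SD 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption
    return [(v - mean) / sd for v in values]


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    qa = statistics.variance(a) / na
    qb = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(qa + qb)
    df = (qa + qb) ** 2 / (qa ** 2 / (na - 1) + qb ** 2 / (nb - 1))
    return t, df


# Hypothetical ratings of ONE song (1-7 scale) from the two recruiting methods.
cold_call_sample = [6, 5, 7, 6, 4, 6, 5, 7]
database_sample = [5, 5, 6, 4, 4, 6, 5, 5]

t, df = welch_t(cold_call_sample, database_sample)
print(f"t = {t:.2f}, df = {df:.1f}")
print([round(z, 2) for z in z_scores(database_sample)])
```

The resulting t value is then compared against a t table (or a statistics package) at the computed degrees of freedom to decide whether the scores are really lower or only look lower.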


In addition, I don’t know how many more burned titles you have, but I don’t see that as a problem unless you say that 90% of the songs are burned.  I also don’t know what you classify as a “burned song.”


But let’s assume that there are more burned titles in the database group.  This sounds logical because your database sample, by virtue of respondents giving you their names and joining your club, has less variance (difference) among people and their answers than you would find in a cold call sample.

Auditorium Tests and Small Samples

Hi Roger: To deal with the lack of budgets and the high cost of music testing, some companies are now doing auditorium music testing using a sample as small as 40 participants.   Typically, the sample would be all one gender spread over a 10 to 12 year age range.  Is this a credible sample size? What is the smallest sample size you would say is statistically reliable?


Thank you for the work you do. - Anonymous


Anon:  Hi to you too.  You're welcome for the work I do.  Thanks, and on to your question . . .


While your question may seem simple to some people, you have actually opened a "can of worms" about research and sampling, and I'm afraid my answer isn't going to be short because I need to address several things.  So . . . bear with me, and you may want to get a 6-pack of your favorite beverage to drink while you read this.


Sample Size

The reason sample size is important in any research study is because sample size determines the amount of sampling error present in the study.  With one exception that I'll explain in a moment, the general rule is:  The smaller the sample, the larger the sampling error.  Let me explain that . . .


Research projects conducted with human beings are dramatically different than research studies conducted with static elements, such as a study conducted by a chemist who is testing the tensile strength of different types of metals.  Steel is steel.  Aluminum is aluminum.  But in behavioral science research with human beings, one respondent (subject, radio listener, TV viewer, etc.) is not exactly the same as another respondent in the sample.  Humans are very different, and this is clearly evident when a random sample is selected without any qualifications (screener questions) such as age, sex, number of hours spent listening to the radio in a day, favorite radio station, and so on.  In other words, behavioral science research uses humans who are very different (they vary from one another, called "variance" in research), and this variance introduces error when the results of such studies are generalized to the population from which the sample was drawn.


Sampling error was introduced by statisticians in the 1920s and 1930s to compensate for the differences among people in a sample.  So, for example, when a behavioral research study is conducted, the researcher will say something like, "A sample of 400 respondents was used for this study, which produces a sampling error of ±4.9%."  This means that if the study shows that 50% of the sample likes a specific radio station (or the answer to any other question), the "real" answer is somewhere between 45.1% and 54.9%.  The sampling error percentage is actually a statistical "fudge factor" because of the variance present in the respondents in the study.  There is NO research study in the behavioral sciences that produces results with 100% certainty.  None.  Nada.  Zilch.  The reason?  The studies involve human beings.


This obviously means that NO radio research study should be interpreted without considering sampling error.  If a researcher, or anyone else, says something like, "The study shows that 12% of the Women 18-34 name WAAA as their favorite," that person is not interpreting the data correctly.  The researcher, or whoever, should say something like, "The study shows that about 12% of the Women 18-34 name WAAA as their favorite."  Or, if using the actual sampling error, the statement would be, "The study shows that between 9% and 15% of the Women 18-34 name WAAA as their favorite."


In summary, sample size is important in research because it determines sampling error.  If a sample is used that is too small, sampling error may approach 90% or more.  That's crazy.
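As a rough illustration of how sample size drives sampling error, here is a minimal Python sketch using the standard margin-of-error formula for a proportion at the 95% confidence level.  It assumes a simple random sample and the worst-case proportion of 50%; the function name is hypothetical.  Under those assumptions it reproduces the ±4.9% figure quoted above for 400 respondents.

```python
import math

Z_95 = 1.96  # z value for the 95% confidence level


def sampling_error(n, p=0.5, z=Z_95):
    """Margin of error, as a proportion, for a simple random sample of size n."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1 - p) / n)


for n in (400, 100, 40, 20):
    print(f"n = {n:>3}: +/- {sampling_error(n) * 100:.1f}%")
```

The formula only accounts for sample size.  It knows nothing about how the respondents were screened, so it says nothing about how similar or different they are.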


(Note:  I have a sampling error calculator on the home page of my business website – click here and then click on the 95% option.)


Sample Size – An Exception to the "General Rule"

I said earlier that there is an exception to the general rule, "The smaller the sample, the larger the sampling error."  This exception is going to be the basis/root/heart/foundation for the answer to your question.


Let's look at an example to demonstrate this exception.  Let's say I'm interested in finding out which radio station is listened to most often by people in my city.  In Research Study A, I go out to a busy street during lunchtime and interview 100 people, regardless of age or sex, or any other qualifier, and ask them, "Which radio station do you listen to most often during a typical day?"  I record all the data and in my results, I say something like, "WBBB is the number one radio station in the city according to a recently conducted research study."


Now, in Research Study B, I go out to a busy street during lunchtime and stop 100 people, but only women who are 18-24 years old.  I ask them if they live in the area and listen to the radio at least one hour per day.  If "yes" to both questions, I then ask, "When you have the choice, which radio station do you listen to most often during a typical day?"


Which study, A or B, will have the least amount of sampling error (variance)?  You better say "B" or I'll come out there and smack you.  Obviously the women in Study B will have less variance because they are all in the same age group, live in the area, and listen to the radio at least one hour per day.  I also eliminate the error of the influence of others in a person's listening choice by asking,  "When you have the choice . . ."


In Study A, the calculated sampling error is ±10%.  In reality, however, the error would be much higher because I could have interviewed small children, a mixture of males and females, or a bunch of other weird options.  The results would give me only a very limited indication of which radio station is listened to most often.  The results could not be generalized with any degree of certainty to the total population of the city.  Moral?  A decent sample may be large enough to produce good results, but without any qualifications (screeners), the results are virtually meaningless beyond something like, "Hey, this is interesting!"


Variance – So What?

Now, I hope it's apparent that including screener questions in a research study dramatically reduces sampling error.  The reason is that the qualifiers force the sample to be homogeneous (similar).  Heterogeneous samples are loaded with sampling error; homogeneous samples are not.  That's a good thing for research projects.  All radio stations have specific age and sex targets, and those people should be used in all research studies a radio station conducts, unless there is a need to look beyond the target.  Listen to me now and believe me later.


Smaller Sample Music Tests?

With all that as an introduction, it's time to address your specific question about some research companies using smaller samples for music tests.


A typical auditorium music test usually includes 80-100 respondents.  However, these respondents are usually broken into smaller age and/or sex cells.  So . . . although the results from the Total Sample are looked at, researchers, consultants, and radio station people also look at the smaller cells.  There is no problem with that as long as the respondents are good; that is, as long as the respondents are properly screened.


In essence, then, the smaller sample studies you ask about have been used since auditorium music tests started around 1982.  I don't see anything wrong with using a sample of 40 participants as long as the sample is properly selected.  If a radio station is interested only in Women 25-34 (or whatever), a strictly qualified sample of 40 is fine.


Another Sampling Thing to Consider

Now, this may shock you, but if a research study involves a very tightly controlled homogeneous sample, there is no problem with looking at cell sizes of 20.  So, although I said a sample of 40 is fine with Women 25-34, if the screener to recruit the respondents is designed by an expert, then it should be possible to split that sample into 25-29 and 30-34 to get some indications of differences between the two cells.  The homogeneity of the sample is the key and this can't be done in all studies with a broadly selected sample.


Music Test Design and Error Reduction

One more thing about reducing error.  In addition to testing only a homogeneous sample, the research design used for music tests further reduces error.  The type of design is called a "Repeated Measures Design."  This means that the sample of people repeatedly rate only songs about the same number of seconds in length, and they use only one rating scale.  In and of itself, the design reduces error.  Neat, eh?  The combination of a homogeneous sample and Repeated Measures Design makes music testing an extremely valid and reliable research procedure.


Don't believe me?  Consider this . . .  Let's say that a researcher is going to test 400 songs, and tells the respondents, "The 400 hooks will vary in length from 4 to 15 seconds.  Please use a 'Hate/Love' scale of 1-10 for the first 100 songs, a 1-7 scale for songs 101 to 200, a 1-3 scale for songs 201-300, and a 1 or 2 scale for songs 301-400."  The respondents would walk out.


What to remember?  (1) Homogeneous sample; (2) Repeated Measures Design.


Are you tired of reading?  Only one more thing to go.


A Final Check on the Sample

Regardless of how perfectly a screener or questionnaire is designed, or how perfectly the recruiting was done, almost every research study will have one respondent, or maybe even a few, who don't belong in the sample.  These people may have lied when they answered the screener questions or maybe even guessed correctly even though they knew their answers were bogus.  That's a fact of research and it happens in almost every study.  That's cool.  Some people like to sneak into things so they can get paid, but there is a way to weed these people out of the sample.


What I'm about to describe is what I call the "Wimmer Sample Verification Procedure" (WSVP).  I developed this process several years ago, but I'm willing to pass it along here since I finally included it in the latest edition of our textbook, Mass Media Research: An Introduction.


I mentioned research "variance" earlier.  The term refers to how people differ from each other.  Variance is the key to finding research respondents who don't belong in a study and it is the key to the WSVP.  If a respondent varies too much from the rest of the sample, the person is eliminated.  You can compute the WSVP yourself on a spreadsheet.  If you can't do it, have your research company do it for you.  If your research company can't do it, find another research company that can provide the information.


Here are the steps:

  1. Calculate the standard deviation for each respondent's song ratings.  Let me be very clear here:  The standard deviations must be computed for respondents, not songs.

  2. Calculate z-scores for all the respondents' standard deviation scores.

  3. Eliminate any respondent whose standard deviation z-score is beyond ±1.5 (that is, above +1.5 or below -1.5).  A person with a z-score that extreme does not belong in the sample.  If you like you can choose to eliminate anyone with a z-score beyond ±2.0, but I like 1.5 better.
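The three steps above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the respondent IDs and ratings are made up, the function name is hypothetical, and the sample (n - 1) standard deviation is used in both steps because the description above does not specify which form to use.

```python
import statistics


def wsvp(ratings_by_respondent, cutoff=1.5):
    """Return (z-score per respondent, list of respondents beyond the cutoff)."""
    ids = list(ratings_by_respondent)

    # Step 1: one standard deviation per RESPONDENT (not per song).
    sds = [statistics.stdev(ratings_by_respondent[rid]) for rid in ids]

    # Step 2: z-scores of those standard deviations.
    mean_sd = statistics.mean(sds)
    spread = statistics.stdev(sds)
    if spread == 0:
        z = {rid: 0.0 for rid in ids}  # everyone varies identically
    else:
        z = {rid: (sd - mean_sd) / spread for rid, sd in zip(ids, sds)}

    # Step 3: flag anyone whose z-score is beyond +/- cutoff.
    flagged = [rid for rid in ids if abs(z[rid]) > cutoff]
    return z, flagged


# Hypothetical ratings on a 1-7 scale, six songs per respondent.
ratings = {
    "r1": [1, 3, 5, 7, 4, 6],
    "r2": [2, 3, 5, 6, 4, 7],
    "r3": [1, 4, 5, 7, 3, 6],
    "r4": [2, 3, 6, 7, 4, 5],
    "r5": [1, 3, 5, 6, 4, 7],
    "r6": [4, 4, 4, 4, 4, 4],  # rates every song the same
}

z, flagged = wsvp(ratings)
for rid, score in z.items():
    print(f"{rid}: z = {score:+.2f}")
print("Eliminate:", flagged)
```

In this made-up data, the respondent who gives every song the same rating has a standard deviation far from the rest of the group and is the only one flagged.  Passing `cutoff=2.0` applies the looser rule mentioned in step 3.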

The reason the WSVP works so well for music tests is that while someone may lie or guess his/her way into the study, it is virtually impossible for that person to mimic the responses of the other people in the sample because music test ratings are recorded individually – each person does not know how the rest of the group rated each song.  A confederate respondent will virtually always have a z-score that is too high (positive or negative).  The WSVP adds an additional element to why it's OK to conduct music tests with small samples.


What to remember?  (1) Homogeneous sample; (2) Repeated Measures Design; (3) Wimmer Sample Verification Procedure.  If these three things are followed/incorporated, there should be no problem conducting a music test with 40 respondents.

Best Start Date

Doc:  Several months ago, I asked if it mattered what day of the week to start an online music test to get the maximum number of participants.  You had no evidence one way or another and encouraged me to track results myself.  I did that with 10 tests, starting two different tests on each respective day of the week over the course of several months (no weekends).


Conclusion?  Monday through Thursday have consistent results, but Friday seems to be a bad day.  About 24% fewer participants (on average) participated with a test starting on a Friday.  The Monday through Thursday tests were all within about 7% of one another.


In addition, I came to a surprising conclusion about something else: "poor" weather on the start date, or 1-2 days immediately following the start date of a test, resulted in a higher participation rate.  However, this was an observation on my part since I didn't keep a log on the weather.


Finally, I came to the conclusion that there are so many variables involved in music testing that any results should not be relied upon, but I will continue to track this and see where it takes me. - Anonymous


Anon:  First, you'll notice that I edited your note just a bit.  I don't think I changed your meaning, but please let me know if I did.


Next . . . congratulations on conducting the study.  This is the first time I have seen anything done on this topic.  But don't berate your work.  Sure, more studies need to be conducted to determine the reliability of your results, but what you have is great.


Your results provide very good indications about the effect of the start date on online music testing.  You have demonstrated how the scientific method works: a small step each time to get to valid and reliable results.  You made the first step and I'm enormously proud of you.


One more thing, you said that music tests involve many variables.  That's true, but from what I can see, you only varied one variable (start date), which means that the difference in participation rates is probably due to start date and not some other (extraneous) variable.


If I had a research award to give, you would receive it.

Best Time for Music Test

Doc:  When is the best time to conduct my music tests?  I have always heard that the tests should be conducted just before the book so I can get the music right. - Anonymous


Anon:  I have heard this question many times and my answer is always the same . . .


There is no best time to conduct a music test.  You should do the test(s) whenever you need information about your playlist.  If you need information now, then now is the best time to conduct your tests.


Regarding the "getting the music right" before the book . . . That makes no sense at all.  Why would you or anyone else want to get the music right before the book starts?  The music should be right all year long.  Listeners don't make decisions to listen to a radio station based on the start of an Arbitron survey.  The importance of the start of the survey is only in the minds of radio people who believe garbage that has been told to them for several years.


No research (music tests, perceptual studies, focus groups, etc.) should be conducted to coincide with Arbitron.  As I said, listeners don't make up their minds about listening to radio stations based on when an Arbitron survey starts or ends.  Did I repeat that enough?

Blowup Potential

I’m trying to sort all the information I see from research companies that do music tests and I’m getting confused.  Each company claims that their approach is the best.  I see things like, “Increase your cume,” “Increase TSL,” “Find core songs for your P1s,” and more.  One company says its method is “Patent pending.”


After reading all that stuff, I think I’d rather do the tests myself.  Can you help me understand a few things?  What is the best way to conduct a music test?  What type of sample is best to use?  What is the best way to analyze the data?  Are there some ways that are better than others?  In other words, can you give me some guidelines?  - Anonymous


Anon:  You’ll notice I edited your question to delete the names of research companies.  I don’t think that information is important here.  So, on to your questions.


First, you better get something to drink, because this is not going to be a short answer.  Secondly, I give you credit for wanting to do music tests yourself, but since you’re a novice in the area, I will be honest and say that you better prepare for a lot of work.  Here we go…


There is no “right” way to conduct a music test.  The procedures depend on the information you need.  Just because a company uses a specific approach does not mean it is appropriate for you.  But here is a warning: Music tests look deceptively simple to conduct, but require a lot of experience and planning.  While music tests are not open-heart surgery, there are several steps you need to consider.  Each step is important, and if everything isn’t correct, the test can blow up.


What I have written below is a brief outline of some things you need to consider.  I have also included a rating for “Potential Blow-up,” where “1” = “Not likely to blow-up the test,” “10” = “Highly likely to blow-up the test,” and 2 through 9 are in-between.


Some things to consider include, but are not limited to:


Sampling/Screener Design:  Blow-up Rating = 10.  Your goal for the music test dictates the type of respondents you recruit.  Some approaches are:

  1. If you want to increase your cume, recruit respondents who listen to your format.  Your screener would include music format descriptions and respondents qualify if they listen “often” (or whatever requirement you use) to a description of your format (not a description of your radio station’s music, but rather the format in general).  You might have two or three descriptions of your format, and a respondent would qualify by listening “often” to one or more of the descriptions.  This approach will give you the widest selection of respondents.

  2. If you want to increase your TSL, recruit respondents who cume your radio station.  In this case, your screener would not use music descriptions, but rather a question about the respondents’ listening, such as: “During a typical week, which area radio stations do you usually listen to for music?”  Respondents who name your radio station qualify for the test.

  3. If you want to “super serve” your P1s, recruit only your P1s.  I don’t recommend this procedure because the sample is too narrow.  Radio stations that constantly test only their P1s will eventually disappear because the focus is only on P1s, not the cume.  Radio stations that use only P1s in music tests will “P1” themselves to oblivion.

  4. If you want a broader perspective, then you can recruit a combination of format cume, your radio station’s cume, and your P1s.

These aren’t the only sampling approaches.  There are numerous variations.


You must pay very close attention to your screener.  You must ask questions correctly and in the correct order.  If your screener is designed incorrectly in any way, you’ll have the wrong respondents at the test.


Recruiting:  Blow-up Rating = 10.  You can recruit the respondents or you can hire a field service.  If you recruit the respondents yourself, be sure to have many bottles of aspirin available because it’s not an easy task—count on several hundred hours to recruit your sample.


Hook Preparation:  Blow-up Rating = 8.  You can produce the hook tape or you can hire a hook production company.  As with recruiting, if you decide to produce the tape yourself, be sure to have many bottles of aspirin available.


Measurement Instrument:  Blow-up Rating = 10.  You will need to develop a rating scale (measurement system) and prepare scoring sheets for the respondents.  If you don’t want to input the data by hand (you don’t), then you’ll need to develop OCR (optical character reader) sheets and purchase a machine that can read the sheets.  (You may choose to use hand-held devices for data collection.)


Test Location:  Blow-up Rating = 4.  You’ll need to find a place to conduct your test.  A hotel ballroom is usually OK, but you’ll have to be flexible if nothing is available.  You’ll need to check for access to the room, parking, and make arrangements with the hotel to set up the room.


Sound System:  Blow-up Rating = 2.  This shouldn’t be a problem for a radio person.  By the way, you don’t need a $10,000 sound system to play hooks.


Co-op/Incentives:  Blow-up Rating = 1.  You need to pay the respondents for their participation, usually $50-$100.


Data Verification:  Blow-up Rating = 10.  You’ll need to verify your data.  Most research companies have software to check for respondents who don’t belong in the test.  If you don’t know how to do this, you can find someone who understands statistics to help you.  Your verification will also help identify if your recruiting is satisfactory.


Data Analysis:  Blow-up Rating = 10.  You can analyze the data yourself or sub-contract the process to someone who has appropriate software.  If you want to do this yourself, you’ll have to buy software or have it written by someone.  You might be able to get by with analyzing the data on a spreadsheet, but you’ll probably go crazy doing that and I don’t recommend it.  However, you can do many post hoc (after the fact) tests using spreadsheets.


Report Presentation:  Blow-up Rating = 9.  You will need to figure out a way to display your results.  A data analysis package will probably provide a good display, but it may not be suitable for your needs.


Those are some of the things you need to consider.  Music tests aren’t difficult to conduct, but they do require a lot of attention and planning.  That’s the advantage of using a research company—most of these people have a lot of experience with music tests and know where problems can occur.


If you are determined to conduct your own tests, then go for it.  If you do proceed, then you should plan for 6-12 months preparation time—mostly for developing your measurement instrument and data analysis procedures.


One final point:  Hiring a research company doesn’t mean you have to accept the way the company analyzes and presents your data.  The good companies give you a lot of flexibility in data presentation and you can ask for unique rankers, correlations, and other ways to look at your data.  For example, you can get almost any answer to song likes/dislikes/relationships by asking for a variety of banner points (column headings).


Don’t interpret my comments as crushing your idea to do your own tests.  You can do them if you have the time and money to develop everything.  You also may want to consider hiring someone to help you do the things you can’t.  Have fun.

Blow-up Potential - 2

In one of your questions posted about music tests, you had a “blow-up” rating (“1” = “Not likely to blow-up the test,” “10” = “Highly likely to blow-up the test.”)  You listed that hook preparation would rate an 8 on your scale. I was wondering why this is?  I’m not a production guru, but I don’t find hook preparation to be at all a headache.  Just curious. - Anonymous


Anon:  I rated hook preparation an 8 because of what I have experienced in the past.  While you may find hook production easy, here are some of the things I have experienced:

  1. Inappropriate hook.  Although most people who produce hooks can determine which portion of the song to select, some people can’t do this.  I have heard hooks where none of the respondents can identify the song.

  2. Hook length.  The biggest problem is hooks that are too short.  I have heard hooks that are 1-2 seconds long.  On the other hand, hooks that are too long can make the test drag on and that creates respondent fatigue.

  3. Numbering.  Although it doesn’t happen too often, some people misnumber the hooks, skip numbers, and in one case, the person forgot to put numbers on the tape and had to stand in front of the group and read the numbers to the respondents.

  4. Variation in sound.  Sometimes the sound varies from very quiet to extremely loud and the respondents become frustrated.

  5. Stereo.  Only one channel is recorded.  This is particularly bad for Beatles’ songs.

  6. Repetition.  I have heard tapes where the same song is repeated up to 5 times.

  7. Dead air between hooks.  The dead air between hooks should be short and constant.  I have heard tapes where the time between songs varies from one second (typical) to several seconds.  This also creates respondent fatigue.

Those are some things related to the production of the hook tape.  There are also other problems, such as:

  1. The moderator forgets to bring the tape.

  2. The moderator brings the wrong tape.

  3. The tape is partially or completely blank.

  4. Both the master and the backup tape break.

  5. The tape is damaged in transit and can’t be used.

That’s a partial list.  A person who pays attention to details won’t have any problems, but not all people pay attention to details.

CHR Music Tests

Hi Doc: I was wondering whether a CHR station would ever need to do an Auditorium Music Test (AMT), or should it rely only on callout research?  If a CHR radio station would do an AMT, would it be for its older songs? - Anonymous


Anon: I think there are several reasons why a CHR radio station would do an AMT.  Here are a few:

  1. The callout methodology limits the number of songs you can test—usually about 20.  In an AMT, it's possible to test several hundred songs in one session.

  2. I don't know anything about your radio station's playlist, but I assume you play more than just currents.  An AMT will allow you to test hundreds of recurrents and older songs (gold, or whatever term you use).  This will give you information about familiarity and burn (tired of hearing the song) for all the music you play.

  3. Because of the number of songs you can test in an AMT, you (or the company that conducts your music test) can compute correlations and other statistical procedures on the data so you can see how each song relates to every other song.  You can get a lot of information about how the music "fits" together, which is a great help in developing your playlist.

  4. You can also test other things during an AMT, such as jingles, promotional materials, TV commercials, and personalities.

  5. An AMT will allow you to do a "tack-on" questionnaire to ask the respondents several questions about your radio station or any other topic of interest.

The auditorium approach is useful for all radio formats.  While a CHR radio station probably does more callout because of the type of music involved, the AMT approach is still very useful and should be conducted regularly.
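
To make item 3 above more concrete, here is a minimal Python sketch of how a correlation between two songs could be computed from respondents' ratings. The song names and ratings are invented for illustration, and the sketch assumes a plain Pearson correlation on raw ratings; the company that conducts your test may use other procedures.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(x) != len(y):
        raise ValueError("both songs must be rated by the same respondents")
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("a song with identical ratings has no correlation")
    return cross / sqrt(ss_x * ss_y)

# Hypothetical 1-5 ratings; position i in every list is the same respondent.
ratings = {
    "Song A": [5, 4, 2, 5, 1, 3],
    "Song B": [4, 5, 1, 4, 2, 3],
    "Song C": [1, 2, 5, 1, 4, 3],
}

songs = list(ratings)
pairs = {}
for i, first in enumerate(songs):
    for second in songs[i + 1:]:
        pairs[(first, second)] = pearson_r(ratings[first], ratings[second])

for (first, second), r in pairs.items():
    print(f"{first} vs. {second}: r = {r:+.2f}")
```

A positive value means respondents who like one song tend to like the other; a negative value means the two songs appeal to different people in the sample.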

Comparing Scores Across Markets

We have several AC radio stations in our group and I was wondering if it would be OK for the PDs to compare their scores from music tests.  It may be a good way to get some new songs to play.  - Ray


Ray:  There are many references to your question in this archive.  Check the other questions.


Your question is a little confusing to me because you ask if it’s “…OK for the PDs to compare their scores…” and then you say, “It may be a good way to get some new songs to play.”  Comparing data is one thing, but sharing data is a completely different matter.  I’ll address both things.


Comparisons:  There is no problem with comparing data between radio stations, but you must convert the scores to z-scores in order to make the comparisons (see “The Research Doctor Archive” for many discussions).  Why?  Because the scores come from different samples and you can’t compare raw scores between two different samples of listeners.  By the way, the two samples don’t have to use the same rating scale.  For example, one PD may use a 1-5 scale and another may use a 1-7 scale.  The z-scores will standardize the data so you can compare this “apples to oranges” situation.
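
As a rough illustration of the conversion, the following Python sketch standardizes song scores within each station's own test before they are compared. The station data and song names are invented, and the sketch assumes each song's mean score is standardized against the mean and standard deviation of all song means in that same test; a research company may compute its z-scores differently.

```python
from statistics import mean, pstdev

def z_scores(song_means):
    """Convert {song: mean score} to {song: z-score} within one test."""
    values = list(song_means.values())
    mu = mean(values)
    sigma = pstdev(values)  # SD of the song means in this test (assumption)
    if sigma == 0:
        raise ValueError("all songs have the same score; z-scores are undefined")
    return {song: (score - mu) / sigma for song, score in song_means.items()}

# Hypothetical data: Station A uses a 1-5 scale, Station B uses a 1-7 scale.
station_a = {"Song X": 4.2, "Song Y": 3.1, "Song Z": 2.6}
station_b = {"Song X": 5.9, "Song Y": 4.0, "Song Z": 4.4}

z_a = z_scores(station_a)
z_b = z_scores(station_b)

for song in station_a:
    print(f"{song}: Station A z = {z_a[song]:+.2f}, Station B z = {z_b[song]:+.2f}")
```

After the conversion, a z-score of +1.0 means the same thing in both tests: the song scored one standard deviation above that test's average, regardless of the rating scale used.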


Sharing:  Many years of experience with music tests have shown that it’s not a good idea to share music test data (or any research information) with other radio stations.  Listeners aren’t the same in all markets and you need information from your own listeners, not listeners from another market.

Consultant Tips

Doc:  In the most current “Consultant Tips” on All Access, Garry Mitchell has an article titled, “Five Tips & Tricks for Your Next Auditorium Music Test.”  The first two tips are:


1. Include the same song hook twice, spaced evenly apart from each other within the test.  Comparing the scores of each hook helps you benchmark the margin of error for the test (and also measure fatigue onset, if you're not rotating the song order for each respondent).


2. Include a hook from a totally obscure, unfamiliar song somewhere in the test. This “control song” helps benchmark the reliability of your music test results.


I don’t really understand these things.  Are these necessary? - Andy


Andy:  When we first started doing music tests in the early 1980s, we used both of these items to check for validity and reliability.  After a short time, we found that these two elements weren’t necessary since there are better ways to test for validity and reliability.  With all due respect to Garry Mitchell, neither of the two points mentioned provides relevant information.


The information from comparing the results from a repeated song is relevant only to that song.  The results (the difference) cannot be generalized to the other songs in the test.  In addition, the difference is not a “margin of error.”


Including an “obscure” song does not provide useable information.  A more useful approach is to analyze the standard deviations for the songs.
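
The answer above does not spell out how the standard deviations are analyzed, so the following Python sketch shows only the basic computation: a mean and standard deviation for each song. The song names and 1-7 ratings are invented, and reading a large standard deviation as a sign that respondents disagree about a song is an assumption of this example.

```python
from statistics import mean, stdev

# Hypothetical 1-7 ratings from the same six respondents.
ratings = {
    "Song A": [4, 4, 5, 4, 4, 5],
    "Song B": [1, 7, 2, 7, 1, 6],
}

# Mean and sample standard deviation for every song in the test.
summary = {song: (mean(scores), stdev(scores)) for song, scores in ratings.items()}

for song, (average, spread) in summary.items():
    print(f"{song}: mean = {average:.2f}, SD = {spread:.2f}")
```

The two invented songs have similar means, but the ratings for the second are far more spread out, which the mean alone would hide.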

Control Songs

In an auditorium test I saw, the company used a couple of "control" songs that were slightly out of format range. What's the purpose of such "control" songs and what do they tell you? - Anonymous

Anon: I’ll answer from my own experience. I don’t know what all research companies do with their music tests.

Not a lot was known about music testing when we started in the early 1980s. For example, we didn’t know how many songs to test, the best length for hooks, when the tests should be conducted and where, the order of songs to test, how many respondents to invite, how much to pay them, and more.

Because we were also experimenting with screeners, one major concern was to make sure that the respondents in the room were the correct people. One way to do this (we thought) was to include a few "out-of-format" control songs. Although the procedure isn’t an exact science, we figured that if the control songs rated highly, we would need to check the respondents.

While the control song approach is one way to get an indication of the respondents’ qualifications, we now use more accurate and reliable statistical methods to determine if the respondents are qualified. Hey, we all learn.

Cycles in Music

Hi Doc, what do you think of Guy Zapoleon's “Music Cycle” theory in that music goes through three phases every 10 years or so? - Anonymous


Anon:  I finally found the article you’re referring to, and here are my thoughts…


Music descriptions and discussions about music are very vague and ambiguous.  To describe music, people use words and phrases like “soft,” “hard,” “mainstream,” “extreme,” “uptempo,” and so on.  Without operational definitions for these terms (that is, universally accepted definitions for words like “soft,” “hard,” and “mainstream,” etc.), it’s hard to argue for or against Guy Zapoleon’s discussion.


I think Zapoleon’s discussion is fine, but I can’t say whether he is right or wrong because there is no way to quantify his arguments without operational definitions for all the terms he uses.  It’s his opinion and everyone is entitled to his or her opinion.

Data and t-tests

Although I'm not an expert, I do know some statistics and I was wondering if I can conduct t-tests on our music test data?  We're always arguing about the differences between the male and female scores and I thought this might be a way to get some information to help us answer our questions.  Is that the right way to go? - Anonymous


Anon:  Sure.  That's a good idea.  For those who don't know, the t-test is a univariate (single dependent variable) statistic that tests for differences between two groups.  You could conduct tests on individual songs or test all songs in the test at one time.  The results will show you whether there is a statistically significant difference between the male and female scores, which is better than saying, "It seems like the women like this song more than the men do."  (You could also conduct the same tests on age groups, or P1s vs. P2s, or any other groups you have in your sample.)


If you don't have any statistical software to conduct the test, you can use a spreadsheet like Excel.
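
For readers who would rather use a short script than a spreadsheet, here is a minimal Python sketch of a two-group t-test on one song. The ratings are invented, and the sketch assumes Welch's version of the test (which does not require equal variances in the two groups); that choice belongs to this example and is not something specified above. The script reports only the t statistic and degrees of freedom, so the significance level still has to come from a t table or statistical software.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_1, group_2):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    n1, n2 = len(group_1), len(group_2)
    # Squared standard error of each group's mean (sample variance / n).
    se1 = variance(group_1) / n1
    se2 = variance(group_2) / n2
    t = (mean(group_1) - mean(group_2)) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical 1-7 ratings of one song from eight women and eight men.
women = [6, 7, 5, 6, 7, 6, 5, 7]
men = [4, 5, 3, 5, 4, 6, 3, 4]

t, df = welch_t(women, men)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A t value far from zero, relative to the critical value for the computed degrees of freedom, suggests the difference between the two groups is unlikely to be due to chance alone.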


Thanks for doing this column. I read it every day and learn a lot. I have a question (this is the first time I ever submitted one). My question relates to music tests, but you always ask for background information, so let me do that first.


I am the PD for a mainstream AC in a medium size market. Our target is females 32-48, but our Arbitron shows that we have about 70% females and 30% males. When we do auditorium music tests, we recruit the same percentages—70% females and 30% males.


My question relates to music test decisions. Some of the song scores show big differences between men and women. In some cases, men hate a song and the women love it and in some cases, the men love it and the women hate it. My question is: How do I determine how many of these differently rated songs I should play? Is there a rule of some kind about how many of the 'men love it/women hate it' songs I should play? Thanks. - Anonymous


Anon: Thanks for the comments about the column. You'll notice that I edited your question because, in my opinion, you included too much proprietary information about your radio station. Your competitors don't need to know those things. I also changed your name to "Anonymous."


On to your question . . . I believe you're putting the cart before the horse (I have said that a few times in this column). Instead of finding a rule for song acceptance, I think you need to consider your target and your music test recruiting.


You say your target is females 32-48, but you recruit 70% females/30% males for your music test because that's what Arbitron shows as your audience composition. My question to you is: Why do you invite men to your music test if your target is women? From the information you provided, your music test should include 100% women—forget the men.


The problem with figuring out what to do with "differently rated" songs is not new. In fact, it goes back to the early 80s when radio stations began to program to a "target" and began conducting music tests. Nearly every PD found songs that women loved but men hated, but would play the songs because of the women's scores. The decision always went in favor of women. (Obviously, the reverse is true if the radio station's target is men.)


This led to a seemingly logical conclusion—If the decision to play a differently rated song always went in favor of the target (women), then why include men in the first place? From then on, most female-targeted radio stations included only women in research studies and music tests. I suggest you consider doing the same thing.


Always remember Occam's Razor (the simplest approach is usually the best) . . . Don't make your music tests unnecessarily complicated.



Roger D. Wimmer, Ph.D. - All Content ©2018 - Wimmer Research   All Rights Reserved