Auditorium Music Test Questions - Page 1

 

Auditorium or Callout?

Dr: Enjoy your column.  In your mind, what's a more effective way of music testing, auditorium or callout? - The Great One

 

TGO: I'm glad you enjoy the column.  Thanks, and on to your question . . .

 

My comment about which music test approach is more effective has nothing to do with what's in my mind.  My comments are based on scientific research procedures.

 

So, with that in mind, there is no question that an auditorium music test is more valid (Does it test what it's supposed to test?) and reliable (Does the test provide consistent results over time and with different samples?) than a callout test.

 

The main reason why auditorium tests are better than callout is control over the testing situation.  In an auditorium test, the researcher knows: (1) who takes the test; (2) that all respondents were exposed to the hooks; (3) that all respondents were exposed to the same hooks; (4) that each person rated the hooks alone; (5) that all respondents were exposed to the hooks under the same environmental conditions; (6) that all respondents rated the hooks without extraneous intervening variables, or at least were exposed to the same extraneous variables; and (7) that respondents' questions about the procedure can be answered immediately and uniformly.

 

That's just part of the list, but should provide enough evidence to show that an auditorium music test allows the hooks to be rated under a controlled situation.  This can't be done in the callout methodology.  There is no debate here.  Those are the facts.

 

Auditorium Sample

I’ve always conducted auditorium music tests with a room of 100% cumers, 50% my P1s and 50% P1s of my top competitor.  One research company I worked with sought participants by making hundreds of blind phone calls to my market.  In recent years, I’ve worked with another company that requests a copy of my massive listener database to assist in contacting participants.  (The database contains “card club” members, who obviously may or may not be my P1s).

 

Both companies claim their methods produce a valid sample.  I’m concerned that the second method may invite too many “insiders”—frequent contest participants or a wedge of listeners that give us an unusually high TSL. The only differences I’ve noticed are slightly lower song scores or more burned titles when the second company conducts the tests.  Your thoughts? - Anonymous

 

Anon:  Excellent question.  I need to discuss a little research background stuff before I get to your question.

 

The ideal sample in behavioral research (that is, research with humans) is a random sample, which is a sample where everyone has an equal chance of being selected.  In theory, we would randomly select a number of respondents and all would agree to participate in the project, whether it’s an auditorium test, callout, focus groups, or perceptual study.  But we know that not all people selected will agree to participate for one reason or another.

 

The number of people who agree to participate in a research project is known as the acceptance rate, which is the percentage of people called or contacted who agree to participate.  In media research studies, the acceptance rate is usually about 60%.  In other words, about 40% of the qualified people contacted do not agree to participate.
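As a rough illustration of what the acceptance rate means for recruiting, the sketch below estimates how many qualified people would have to be contacted to reach a target number of acceptances.  The function name and the target of 100 are hypothetical, and the 60% figure is simply the approximate rate quoted above.

```python
from math import ceil

def contacts_needed(target_acceptances, acceptance_rate):
    """Estimate the qualified contacts needed to reach a target number of acceptances."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be between 0 and 1")
    # Round up, because a fraction of a person cannot be contacted.
    return ceil(target_acceptances / acceptance_rate)

# Hypothetical target: 100 people who agree to participate.
print(contacts_needed(100, 0.60))
```

This counts only qualified people who agree.  Wasted calls and no-shows on the night of the test would push the real number higher.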

 

What does that mean?  It means that research studies never use a truly random sample.  In effect, the sample is a group of volunteers who agree to participate.  If anyone tells you that their sampling procedures will give you a true random sample, they are blowing smoke.  The only way to have a true random sample is to force the people who are selected to participate in the study, and we can’t do that.

 

OK, so we know that all samples in radio research (and all behavioral research) are volunteers.  There are several ways to try to recruit these people for a study.  You could:

  1. Make “blind phone calls” (as you call them) and hope to find qualified respondents (by the way, the research term is “cold calls”).  While this process is fine from all statistical perspectives, it is very time consuming and costly.  A cold call recruit may cost as much as three times what other procedures cost because so many calls are wasted (non-working numbers, businesses, and so on).

  2. Buy a list of, say, 25-34 year olds in your market from one of several sampling companies.  These lists aren’t perfect, but they eliminate many of the wasted phone calls.  But this procedure is still expensive because many of the people on the list will not match the exact target you’re interested in.  (Using a purchased list is probably the most often-used method in all types of behavioral research.  The problem, though, is that caller ID, answering machines, and the unwillingness of many people to participate have caused this tried-and-true method to skyrocket in terms of costs.)

  3. Use an established list of participants.  In your case, the research company uses your database of Card Club members.  However, as you mention, even these lists aren’t perfect because of data entry errors, people changing their radio station preference, and so on.  But it’s the best approach to use to reduce expenses, and it’s the approach that will probably be commonplace in the next few years.  If not, radio stations will not be able to afford any type of research.

OK, that’s the background.  You want to know if the two approaches used by your research companies are valid.  The answer is “yes.”  Cold calls will eventually find qualified respondents, but the method is costly.  Using a radio station’s database substantially reduces the number of phone calls needed because the list, although not perfect, eliminates many wasted phone calls.  The same goal—finding qualified respondents—is possible with both sampling methods.

 

Now, one thing you need to keep in mind is that a research company that uses your database should give you a break in price because recruiting costs will be lower than cold calls.  I’m assuming here that your database is good.  I have seen many radio station databases where 60% or more of the names are bad.

 

You mention that your database sample produced slightly lower song scores or more burned titles.  I don’t know what “slightly lower song scores” means.  Did you conduct a statistical analysis to verify that?  If they only look lower, they may not be.  In order to check this out, you need to conduct t-tests or convert the scores to Z-scores.
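One way to check whether scores that "look lower" really are lower is a paired t-test on the songs that both companies tested.  The sketch below is only one possible setup: the song scores are made up, the function name is hypothetical, and it assumes each song has a mean score from both tests on the same rating scale.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired mean song scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both tests must cover the same songs in the same order")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("the differences have zero variance")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1

# Hypothetical mean scores (1-5 scale) for the same six songs in two tests.
cold_call = [4.1, 3.8, 3.5, 4.4, 2.9, 3.7]
database = [4.0, 3.6, 3.5, 4.1, 2.8, 3.4]

t, df = paired_t(cold_call, database)
print(f"t = {t:.2f}, df = {df}")
```

The t value is then compared with the critical value in a t table for the given degrees of freedom.  Six songs is far too few for a real analysis; the short lists are only there to keep the example readable.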

 

In addition, I don’t know how many more burned titles you have, but I don’t see that as a problem unless you say that 90% of the songs are burned.  I also don’t know what you classify as a “burned song.”

 

But let’s assume that there are more burned titles in the database group.  This sounds logical because your database sample, by virtue of respondents giving you their names and joining your club, has less variance (difference) among people and their answers than you would find in a cold call sample.

 

Best Start Date

Doc:  Several months ago, I asked if it mattered what day of the week to start an online music test to get the maximum number of participants.  You had no evidence one way or another and encouraged me to track results myself.  I did that with 10 tests, starting two different tests on each respective day of the week over the course of several months (no weekends).

 

Conclusion?  Monday through Thursday have consistent results, but Friday seems to be a bad day.  About 24% fewer participants (on average) participated with a test starting on a Friday.  The Monday through Thursday tests were all within about 7% of one another.

 

In addition, I came to a surprising conclusion about something else—"poor" weather on the start date or 1-2 days immediately following the start date of a test, resulted in a higher participation rate.  However, this was an observation on my part since I didn't keep a log on the weather.

 

Finally, I came to the conclusion that there are so many variables involved in music testing that any results should not be relied upon, but I will continue to track this and see where it takes me. - Anonymous

 

Anon:  First, you'll notice that I edited your note just a bit.  I don't think I changed your meaning, but please let me know if I did.

 

Next . . . congratulations on conducting the study.  This is the first time I have seen anything done on this topic.  But don't berate your work.  Sure, more studies need to be conducted to determine the reliability of your results, but what you have is great.

 

Your results provide very good indications about the effect of the start date on online music testing.  You have demonstrated how the scientific method works—a small step each time to get to valid and reliable results.  You made the first step and I'm enormously proud of you.

 

One more thing, you said that music tests involve many variables.  That's true, but from what I can see, you only varied one variable (start date), which means that the difference in participation rates is probably due to start date and not some other (extraneous) variable.

 

If I had a research award to give, you would receive it.


Best Time for Music Test

Doc:  When is the best time to conduct my music tests?  I have always heard that the tests should be conducted just before the book so I can get the music right. - Anonymous

 

Anon:  I have heard this question many times and my answer is always the same . . .

 

There is no best time to conduct a music test.  You should do the test(s) whenever you need information about your playlist.  If you need information now, then now is the best time to conduct your tests.

 

Regarding the "getting the music right" before the book . . . That makes no sense at all.  Why would you or anyone else want to get the music right before the book starts?  The music should be right all year long.  Listeners don't make decisions to listen to a radio station based on the start of an Arbitron survey.  The importance of the start of the survey is only in the minds of radio people who believe garbage that has been told to them for several years.

 

No research (music tests, perceptual studies, focus groups, etc.) should be conducted to coincide with Arbitron.  As I said, listeners don't make up their minds about listening to radio stations based on when an Arbitron survey starts or ends.  Did I repeat that enough?

 

Blowup Potential

I’m trying to sort all the information I see from research companies that do music tests and I’m getting confused.  Each company claims that their approach is the best.  I see things like, “Increase your cume,” “Increase TSL,” “Find core songs for your P1s,” and more.  One company says its method is “Patent pending.”

 

After reading all that stuff, I think I’d rather do the tests myself.  Can you help me understand a few things?  What is the best way to conduct a music test?  What type of sample is best to use?  What is the best way to analyze the data?  Are there some ways that are better than others?  In other words, can you give me some guidelines?  - Anonymous

 

Anon:  You’ll notice I edited your question to delete the names of research companies.  I don’t think that information is important here.  So, on to your questions.

 

First, you better get something to drink, because this is not going to be a short answer.  Secondly, I give you credit for wanting to do music tests yourself, but since you’re a novice in the area, I will be honest and say that you better prepare for a lot of work.  Here we go…

 

There is no “right” way to conduct a music test.  The procedures depend on the information you need.  Just because a company uses a specific approach does not mean it is appropriate for you.  But here is a warning: Music tests look deceptively simple to conduct, but require a lot of experience and planning.  While music tests are not open-heart surgery, there are several steps you need to consider.  Each step is important, and if everything isn’t correct, the test can blow up.

 

What I have written below is a brief outline of some things you need to consider.  I have also included a rating for “Potential Blow-up,” where “1” = “Not likely to blow-up the test,” “10” = “Highly likely to blow-up the test,” and 2 through 9 are in-between.

 

Some things to consider include, but are not limited to:

 

Sampling/Screener Design:  Blow-up Rating = 10.  Your goal for the music test dictates the type of respondents you recruit.  Some approaches are:

  1. If you want to increase your cume, recruit respondents who listen to your format.  Your screener would include music format descriptions and respondents qualify if they listen “often” (or whatever requirement you use) to a description of your format (not a description of your radio station’s music, but rather the format in general).  You might have two or three descriptions of your format, and a respondent would qualify by listening “often” to one or more of the descriptions.  This approach will give you the widest selection of respondents.

  2. If you want to increase your TSL, recruit respondents who cume your radio station.  In this case, your screener would not use music descriptions, but rather a question about the respondents’ listening, such as: “During a typical week, which area radio stations do you usually listen to for music?”  Respondents who name your radio station qualify for the test.

  3. If you want to “super serve” your P1s, recruit only your P1s.  I don’t recommend this procedure because the sample is too narrow.  Radio stations that constantly test only their P1s will eventually disappear because the focus is only on P1s, not the cume.  Radio stations that use only P1s in music tests will “P1” themselves to oblivion.

  4. If you want a broader perspective, then you can recruit a combination of format cume, your radio station’s cume, and your P1s.

These aren’t the only sampling approaches.  There are numerous variations.
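The first three screener rules above can be sketched as a simple qualification check.  Everything in this example is hypothetical: the field names, the call letters, and the format descriptions are invented, and it assumes a P1 is a respondent who listens to your station more than any other.

```python
def qualifies(respondent, goal, station):
    """Return True if the respondent passes the screener for the given goal."""
    if goal == "cume":
        # Format listeners: "often" for at least one format description.
        return any(freq == "often"
                   for freq in respondent["format_listening"].values())
    if goal == "tsl":
        # Station cume: the respondent names the station unaided.
        return station in respondent["stations_named"]
    if goal == "p1":
        # Assumption: a P1 listens to this station more than any other.
        return respondent["most_listened_station"] == station
    raise ValueError(f"unknown goal: {goal}")

# Hypothetical screener answers for one respondent.
respondent = {
    "format_listening": {"soft rock hits": "often", "classic country": "never"},
    "stations_named": ["WAAA", "WBBB"],
    "most_listened_station": "WBBB",
}

for goal in ("cume", "tsl", "p1"):
    print(goal, qualifies(respondent, goal, station="WAAA"))
```

A real screener also has to ask the questions in the correct order and handle age, sex, and security questions, none of which this sketch attempts.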

 

You must pay very close attention to your screener.  You must ask questions correctly and in the correct order.  If your screener is designed incorrectly in any way, you’ll have the wrong respondents at the test.

 

Recruiting:  Blow-up Rating = 10.  You can recruit the respondents or you can hire a field service.  If you recruit the respondents yourself, be sure to have many bottles of aspirin available because it’s not an easy task—count on several hundred hours to recruit your sample.

 

Hook Preparation:  Blow-up Rating = 8.  You can produce the hook tape or you can hire a hook production company.  As with recruiting, if you decide to produce the tape yourself, be sure to have many bottles of aspirin available.

 

Measurement Instrument:  Blow-up Rating = 10.  You will need to develop a rating scale (measurement system) and prepare scoring sheets for the respondents.  If you don’t want to input the data by hand (you don’t), then you’ll need to develop OCR (optical character reader) sheets and purchase a machine that can read the sheets.  (You may choose to use hand-held devices for data collection.)

 

Test Location:  Blow-up Rating = 4.  You’ll need to find a place to conduct your test.  A hotel ballroom is usually OK, but you’ll have to be flexible if nothing is available.  You’ll need to check for access to the room, parking, and make arrangements with the hotel to set up the room.

 

Sound System:  Blow-up Rating = 2.  This shouldn’t be a problem for a radio person.  By the way, you don’t need a $10,000 sound system to play hooks.

 

Co-op/Incentives:  Blow-up Rating = 1.  You need to pay the respondents for their participation, usually $50-$100.

 

Data Verification:  Blow-up Rating = 10.  You’ll need to verify your data.  Most research companies have software to check for respondents who don’t belong in the test.  If you don’t know how to do this, you can find someone who understands statistics to help you.  Your verification will also help identify if your recruiting is satisfactory.

 

Data Analysis:  Blow-up Rating = 10.  You can analyze the data yourself or sub-contract the process to someone who has appropriate software.  If you want to do this yourself, you’ll have to buy software or have it written by someone.  You might be able to get by with analyzing the data on a spreadsheet, but you’ll probably go crazy doing that and I don’t recommend it.  However, you can do many post hoc (after the fact) tests using spreadsheets.

 

Report Presentation:  Blow-up Rating = 9.  You will need to figure out a way to display your results.  A data analysis package will probably provide a good display, but it may not be suitable for your needs.

 

Those are some of the things you need to consider.  Music tests aren’t difficult to conduct, but they do require a lot of attention and planning.  That’s the advantage of using a research company—most of these people have a lot of experience with music tests and know where problems can occur.

 

If you are determined to conduct your own tests, then go for it.  If you do proceed, then you should plan for 6-12 months preparation time—mostly for developing your measurement instrument and data analysis procedures.

 

One final point:  Hiring a research company doesn’t mean you have to accept the way the company analyzes and presents your data.  The good companies give you a lot of flexibility in data presentation and you can ask for unique rankers, correlations, and other ways to look at your data.  For example, you can get almost any answer to song likes/dislikes/relationships by asking for a variety of banner points (column headings).

 

Don’t interpret my comments as crushing your idea to do your own tests.  You can do them if you have the time and money to develop everything.  You also may want to consider hiring someone to help you do the things you can’t.  Have fun.

 

Blow-up Potential - 2

In one of your questions posted about music tests, you had a “blow-up” rating (“1” = “Not likely to blow-up the test,” “10” = “Highly likely to blow-up the test.”)  You listed that hook preparation would rate an 8 on your scale. I was wondering why this is?  I’m not a production guru, but I don’t find hook preparation to be at all a headache.  Just curious. - Anonymous

 

Anon:  I rated hook preparation an 8 because of what I have experienced in the past.  While you may find hook production easy, here are some of the things I have experienced:

  1. Inappropriate hook.  Although most people who produce hooks can determine which portion of the song to select, some people can’t do this.  I have heard hooks where none of the respondents can identify the song.

  2. Hook length.  The biggest problem is hooks that are too short.  I have heard hooks that are 1-2 seconds long.  On the other hand, hooks that are too long can make the test drag on and that creates respondent fatigue.

  3. Numbering.  Although it doesn’t happen too often, some people misnumber the hooks, skip numbers, and in one case, the person forgot to put numbers on the tape and had to stand in front of the group and read the numbers to the respondents.

  4. Variation in sound.  Sometimes the sound varies from very quiet to extremely loud and the respondents become frustrated.

  5. Stereo.  Only one channel is recorded.  This is particularly bad for Beatles’ songs.

  6. Repetition.  I have heard tapes where the same song is repeated up to 5 times.

  7. Dead air between hooks.  The dead air between hooks should be short and constant.  I have heard tapes where the time between songs varies from one second (typical) to several seconds.  This also creates respondent fatigue.

Those are some things related to the production of the hook tape.  There are also other problems, such as:

  1. The moderator forgets to bring the tape.

  2. The moderator brings the wrong tape.

  3. The tape is partially or completely blank.

  4. Both the master and the backup tape break.

  5. The tape is damaged in transit and can’t be used.

That’s a partial list.  A person who pays attention to details won’t have any problems, but not all people pay attention to details.

 

CHR Music Tests

Hi Doc: I was wondering whether a CHR station would ever need to do an Auditorium Music Test (AMT), or should it rely only on callout research?  If a CHR radio station would do an AMT, would it be for its older songs? - Anonymous

 

Anon: I think there are several reasons why a CHR radio station would do an AMT.  Here are a few:

  1. The callout methodology limits the number of songs you can test, usually to about 20.  In an AMT, it's possible to test several hundred songs in one session.

  2. I don't know anything about your radio station's playlist, but I assume you play more than just currents.  An AMT will allow you to test hundreds of recurrents and older songs (gold, or whatever term you use).  This will give you information about familiarity and burn (tired of hearing the song) for all the music you play.

  3. Because of the number of songs you can test in an AMT, you (or the company that conducts your music test) can compute correlations and other statistical procedures on the data so you can see how each song relates to every other song.  You can get a lot of information about how the music "fits" together, which is a great help in developing your playlist.

  4. You can also test other things during an AMT, such as jingles, promotional materials, TV commercials, and personalities.

  5. An AMT will allow you to do a "tack-on" questionnaire to ask the respondents several questions about your radio station or any other topic of interest.

The auditorium approach is useful for all radio formats.  While a CHR radio station probably does more callout because of the type of music involved, the AMT approach is still very useful and should be conducted regularly.
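The correlations mentioned in point 3 can be illustrated with a Pearson correlation between the ratings two songs receive from the same respondents.  This is only a sketch: the ratings are invented, the function is written out by hand for clarity, and a research company may well use other statistical procedures.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two songs' ratings from the same respondents."""
    if len(x) != len(y):
        raise ValueError("both songs need ratings from the same respondents")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("a song with identical ratings has no correlation")
    return cov / (sx * sy)

# Hypothetical 1-5 ratings; position i in each list is the same respondent.
song_a = [5, 4, 2, 1, 4]
song_b = [4, 5, 1, 2, 4]
song_c = [1, 2, 5, 4, 2]

print(f"A vs B: {pearson(song_a, song_b):+.2f}")
print(f"A vs C: {pearson(song_a, song_c):+.2f}")
```

A strongly positive value suggests the same respondents tend to like both songs, and a strongly negative value suggests the songs appeal to different people.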

 

Comparing Scores Across Markets

We have several AC radio stations in our group and I was wondering if it would be OK for the PDs to compare their scores from music tests.  It may be a good way to get some new songs to play.  - Ray

 

Ray:  There are many references to your question in this archive.  Check the other questions.

 

Your question is a little confusing to me because you ask if it’s “…OK for the PDs to compare their scores…” and then you say, “It may be a good way to get some new songs to play.”  Comparing data is one thing, but sharing data is a completely different item.  I’ll address both things.

 

Comparisons:  There is no problem with comparing data between radio stations, but you must convert the scores to Z-scores in order to make the comparisons (see “The Research Doctor Archive” for many discussions).  Why?  Because the scores come from different samples and you can’t compare raw scores between two different samples of listeners.  By the way, the two samples don’t have to use the same rating scale.  For example, one PD may use a 1-5 scale and another may use a 1-7 scale.  The Z-scores will standardize the data so you can compare this “apples to oranges” situation.
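A minimal sketch of the Z-score conversion follows, assuming each station has one mean score per song and that each song is standardized against the mean and standard deviation of all songs in that station's own test.  The station data and names are made up.

```python
from statistics import mean, pstdev

def z_scores(song_scores):
    """Convert {song: mean score} to {song: Z-score} within one music test."""
    values = list(song_scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population SD; stdev() is the other common choice
    if sigma == 0:
        raise ValueError("all songs have the same score")
    return {song: (score - mu) / sigma for song, score in song_scores.items()}

# Hypothetical results: station A used a 1-5 scale, station B a 1-7 scale.
station_a = {"Song 1": 4.2, "Song 2": 3.1, "Song 3": 2.6}
station_b = {"Song 1": 5.9, "Song 2": 4.0, "Song 3": 3.6}

za, zb = z_scores(station_a), z_scores(station_b)
for song in station_a:
    print(f"{song}: A z = {za[song]:+.2f}, B z = {zb[song]:+.2f}")
```

After the conversion, each score says how far a song sits above or below the average song in its own test, so the two stations' results are on the same footing.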

 

Sharing:  Many years of experience with music tests have shown that it’s not a good idea to share music test data (or any research information) with other radio stations.  Listeners aren’t the same in all markets and you need information from your own listeners, not listeners from another market.

 

Consultant Tips

Doc:  In the most current “Consultant Tips” on All Access, Garry Mitchell has an article titled, “Five Tips & Tricks for Your Next Auditorium Music Test.”  The first two tips are:

 

1. Include the same song hook twice, spaced evenly apart from each other within the test.  Comparing the scores of each hook helps you benchmark the margin of error for the test (and also measure fatigue onset, if you're not rotating the song order for each respondent).

 

2. Include a hook from a totally obscure, unfamiliar song somewhere in the test. This “control song” helps benchmark the reliability of your music test results.

 

I don’t really understand these things.  Are these necessary? - Andy

 

Andy:  When we first started doing music tests in the early 1980s, we used both of these items to check for validity and reliability.  After a short time, we found that these two elements weren’t necessary since there are better ways to test for validity and reliability.  With all due respect to Garry Mitchell, neither of the two points mentioned provides relevant information.

 

The information from comparing the results from a repeated song is relevant only to that song.  The results (the difference) cannot be generalized to the other songs in the tests.  In addition, the difference is not a “margin of error.”

 

Including an “obscure” song does not provide useable information.  A more useful approach is to analyze the standard deviations for the songs.
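As a sketch of what analyzing the standard deviations might look like, the example below computes the mean and standard deviation for each song and lists the songs with the widest spread first.  The ratings are invented and the function name is hypothetical.

```python
from statistics import mean, stdev

def song_spread(ratings_by_song):
    """Return (song, mean, standard deviation) rows, widest spread first."""
    rows = [(song, mean(r), stdev(r)) for song, r in ratings_by_song.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical 1-5 ratings from six respondents.
ratings = {
    "Song A": [5, 5, 4, 5, 4, 5],
    "Song B": [1, 5, 1, 5, 2, 5],
    "Song C": [3, 3, 4, 3, 3, 2],
}

for song, avg, sd in song_spread(ratings):
    print(f"{song}: mean = {avg:.2f}, sd = {sd:.2f}")
```

A small standard deviation means the respondents largely agree about a song.  A large one means they are split, which a mean score alone would hide.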

 

Control Songs

In an auditorium test I saw, the company used a couple of "control" songs that were slightly out of format range. What's the purpose of such "control" songs and what do they tell you? - Anonymous


Anon: I’ll answer from my own experience. I don’t know what all research companies do with their music tests.


Not a lot was known about music testing when we started in the early 1980s. For example, we didn’t know how many songs to test, the best length for hooks, when the tests should be conducted and where, the order of songs to test, how many respondents to invite, how much to pay them, and more.


Because we were also experimenting with screeners, one major concern was to make sure that the respondents in the room were the correct people. One way to do this (we thought) was to include a few "out-of-format" control songs. Although the procedure isn’t an exact science, we figured that if the control songs rated highly, we would need to check the respondents.


While the control song approach is one way to get an indication of the respondents’ qualifications, we now use more accurate and reliable statistical methods to determine if the respondents are qualified. Hey, we all learn.

 

Cycles in Music

Hi Doc, what do you think of Guy Zapoleon's “Music Cycle” theory in that music goes through three phases every 10 years or so? - Anonymous

 

Anon:  I finally found the article you’re referring to, and here are my thoughts…

 

Music descriptions and discussions about music are very vague and ambiguous.  To describe music, people use words and phrases like “soft,” “hard,” “mainstream,” “extreme,” “uptempo,” and so on.  Without operational definitions for these terms (that is, universally accepted definitions for words like “soft,” “hard,” and “mainstream,” etc.), it’s hard to argue for or against Guy Zapoleon’s discussion.

 

I think Zapoleon’s discussion is fine, but I can’t say whether he is  right or wrong because there is no way to quantify his arguments without operational definitions for all the terms he uses.  It’s his opinion and everyone is entitled to his or her opinion.

 

Data and t-tests

Although I'm not an expert, I do know some statistics and I was wondering if I can conduct t-tests on our music test data?  We're always arguing about the differences between the male and female scores and I thought this might be a way to get some information to help us answer our questions.  Is that the right way to go? - Anonymous

 

Anon:  Sure.  That's a good idea.  For those who don't know, the t-test is a univariate (single dependent variable) statistic that tests for differences between two groups.  You could conduct tests on individual songs or test all songs in the test at one time.  The results will show you if there is a statistically significant difference between the male and female scores, which is better than saying, "It seems like the women like this song more than the men do."  (You could also conduct the same tests on age groups, or P1s vs. P2s, or any other groups you have in your sample.)

 

If you don't have any statistical software to conduct the test, you can use a spreadsheet like Excel.
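For anyone who prefers a short script to a spreadsheet, here is a sketch of a two-group t-test on one song.  It uses Welch's version of the test, which does not assume the two groups have equal variances; that choice, the function name, and the ratings are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error contributed by each group.
    sea = variance(group_a) / na
    seb = variance(group_b) / nb
    if sea + seb == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(group_a) - mean(group_b)) / sqrt(sea + seb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (sea + seb) ** 2 / (sea ** 2 / (na - 1) + seb ** 2 / (nb - 1))
    return t, df

# Hypothetical 1-5 ratings of one song.
women = [5, 4, 5, 4, 5, 3, 4, 5]
men = [3, 2, 4, 3, 2, 3, 4, 3]

t, df = welch_t(women, men)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The t value is compared with the critical value from a t table at the chosen significance level.  A statistics package will report the exact p-value.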

 

Decisions

Thanks for doing this column. I read it every day and learn a lot. I have a question (this is the first time I ever submitted one). My question relates to music tests, but you always ask for background information, so let me do that first.

 

I am the PD for a mainstream AC in a medium size market. Our target is females 32-48, but our Arbitron shows that we have about 70% females and 30% males. When we do auditorium music tests, we recruit the same percentages—70% females and 30% males.

 

My question relates to music test decisions. Some of the song scores show big differences between men and women. In some cases, men hate a song and the women love it and in some cases, the men love it and the women hate it. My question is: How do I determine how many of these differently rated songs should I play? Is there a rule of some kind about how many of the 'men love it/women hate it' songs I should play? Thanks. - Anonymous

 

Anon: Thanks for the comments about the column. You'll notice that I edited your question because, in my opinion, you included too much proprietary information about your radio station. Your competitors don't need to know those things. I also changed your name to "Anonymous."

 

On to your question . . . I believe you're putting the cart before the horse (I have said that a few times in this column). Instead of finding a rule for song acceptance, I think you need to consider your target and your music test recruiting.

 

You say your target is females 32-48, but you recruit 70% females/30% males for your music test because that's what Arbitron shows as your audience composition. My question to you is: Why do you invite men to your music test if your target is women? From the information you provided, your music test should include 100% women—forget the men.

 

The problem with figuring out what to do with "differently rated" songs is not new. In fact, it goes back to the early 80s when radio stations began to program to a "target" and began conducting music tests. Nearly every PD found songs that women loved but men hated, but would play the songs because of the women's scores. The decision always went in favor of women. (Obviously, the reverse is true if the radio station's target is men.)

 

This led to a seemingly logical conclusion—If the decision to play a differently rated song always went in favor of the target (women), then why include men in the first place? From then on, most female-targeted radio stations included only women in research studies and music tests. I suggest you consider doing the same thing.

 

Always remember Occam's Razor (the simplest approach is usually the best) . . . Don't make your music tests unnecessarily complicated.

 


All Content © 2009 - Wimmer Research   All Rights Reserved