
Sighted vs blind listening tests


One of the audio blogs which makes good reading is Sean Olive's, a researcher for Harman. Yesterday (April 9) he posted some comments and research on blind vs sighted listening tests. In this case it relates to loudspeakers but in view of some of the debates here, it may be of more general interest.




The comment from Gordon Holt was particularly striking and there's a link to his interview.



A few things I would like to comment on. In a scientific experiment, a blinded, and more specifically a double-blinded, test is very important; there is no doubt about that. However, in listening tests, a few things are not addressed simply by blinding. Our hearing, while it can be very sensitive, is not exactly the most precise instrument. How often is it that our first impression is wrong? Some equipment can sound very good, very exciting initially, but after prolonged listening turns out not to be so good, while some may slowly grow on you.

Like many other hobbies, some people may be very sensitive to slight changes in sound while others couldn't care less. I don't expect that if you picked a random person from the street, they would be able to pick which sound is better or more accurate. I doubt I could tell the difference between a $20 bottle of wine and a $500 bottle; they may taste different, but I don't really care about wine and they all taste bad to me.

To an extent, high fidelity is really an acquired taste. In college a long time ago, when digital was still relatively new, I had a Magnavox and a Sony CD player hooked up to my stereo system. A few friends who totally believed in ones and zeros said that all digital sounds the same: one is one, zero is zero. After playing both CD players for them unblinded, they all said that they could hear a difference, but they could not tell which they preferred or which was better.

When I audition a piece of equipment, I generally allow at least 30-60 minutes to form initial impressions, and if I think it is worth pursuing, then further auditioning, ideally at home under familiar conditions with familiar associated equipment, is the way to go. To think that I could do a blind test for a few minutes and pick which speakers are absolutely better, most likely with unfamiliar equipment, under probably less than ideal conditions, and probably under some pressure in an uncomfortable environment: I don't think my hearing is that good.


However, I don't deny that visual cues, prejudice, bias, etc. can have an impact on judgement; but blind audio testing, to me, has its own weaknesses as well.


Just as a side comment regarding German sound and German voicing: that's really too bad. I play the piano and I believe that a German sound does exist. A Hamburg Steinway and a New York Steinway are not identical. While Hamburg Steinway, Bluthner, Grotrian, Sauter, and Steingraeber all have their own unique voices, they share some common traits in comparison to Fazioli, Yamaha, or a New York Steinway. I once heard Daniel Barenboim say that the Berlin Philharmonic would not consider admitting anybody to the orchestra who was not using German instruments, not out of national pride but to preserve a certain unique German sound. He went further, using the principal clarinetist of the Chicago Symphony as an example: a good musician who owns both German and French clarinets can adapt his way of playing with either instrument, depending on the kind of music, such that it is impossible to tell whether he is using a German or a French instrument, keeping the sound authentic to the music.







The main point of blind testing is that it is designed to test for differences rather than preferences. It tries to reduce the effects of personal bias, prejudice, and preference, and to test whether a difference can be perceived, rating the results statistically to separate them from chance. So a loudspeaker blind test would ideally involve a reference of some kind.
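The "separate them from chance" step is usually just a binomial significance test on the trial outcomes. A minimal sketch, assuming a forced-choice (ABX-style) protocol (the 12-of-16 score below is an invented example, not a result from any test discussed in this thread):

```python
# One-sided binomial test for a forced-choice (e.g. ABX) listening run:
# how likely is a score at least this good if the listener were purely guessing?
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """P(at least `correct` right out of `trials` when guessing at 50%)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

# An invented run: 12 correct out of 16 trials.
p = abx_p_value(12, 16)
print(f"p = {p:.3f}")  # prints p = 0.038 -- below 0.05, so "pure guessing" is rejected
```

The same arithmetic also shows why very short runs prove little: 5 correct out of 6 gives p of roughly 0.11, which is not significant at the usual 0.05 level.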



IMO it is hogwash to say that sighted audio product evaluations cannot produce honest and reliable judgments of how a product truly sounds. But to each his own; I am not going to agonize over making sure my listening tests are sighted or blind. As far as I know, real musicians and conductors listen with their eyes wide open, and so do I. I also enjoy listening to my music at night, lights off, in a dark room. I have no objections to blind listening tests; they can be useful. I just disagree that unless listening tests are blind, the tests are flawed.


Blind tests involve not only closing your eyes but also not being aware of what changes, if any, have been made. That is very impractical in many cases, such as speaker placement, room acoustic treatment, tube changes, etc. So now I need to get some other person to do things to my audio system without my knowledge so that I can discern whether I like or dislike whatever they did? Not me, buddy, not with my system.


Many blind-testing advocates also insist that ABX comparisons be almost instantaneous, since they claim that auditory memory is nonexistent. Again, suit yourself, but I don't believe this, and neither do most musicians and conductors that I know.


I accept the fact that my hearing and listening preferences are different from others just as much as I know that my preferences for wine are likely to be different from yours. So for me, I have to conduct the tests and they are normally done to please ME and my friends.


I am usually more concerned that too many people do not listen to component changes in their own system, in their own home. Just because something impresses you in a store or somewhere else doesn't mean it will sound good, or as good, in your system. And first impressions are not always lasting impressions. IMO long-term listening is a must. Many audio components take many hours to break in (and hopefully they don't break down). I always prefer try-before-you-buy policies, where you actually listen to the component in your system to see whether any audio improvement is worth the cost.


As far as Gordon Holt's comment to the below question, I couldn't disagree more.


"Judging by online forums and by the e-mail I receive, there are currently three areas of passion for audiophiles: vinyl playback, headphone listening, and music servers. Are you surprised by this?"


"I find them all boring, but nothing surprises me any more."



This is always a swamp. It seems to me that double blind testing is often useful - but its conclusions are often overstated, and possibly even outright misunderstood, by its proponents.


For example, the situation Sean talks about is an obviously correct application of double blind testing. Sighted bias is very real - that's how people come to conclude that putting a device the size of a hockey puck on their wall has a major impact on the acoustics of the room. DBT eliminates this sighted bias in tests of preference amongst dissimilar loudspeakers - very good. But what do the results of that test prove? That A is better than B? Well, only for the conditions under test. A might sound better than B in this room, over this span of time, but it doesn't mean you'll necessarily like A after a few months in your own room, just as you might not like that TV with 9K color temperature so much as you did in the store. It's important not to overstate the significance of any test.


Where things get more controversial, for me, is when DBT is used to assert the audibility, or lack of same, of small physical differences. The benefit of DBT is that it's supposed to maximize the effect of short-term auditory memory, on the assumption that short-term memory is best for measuring the capacity of the low-level auditory mechanisms. And there's plenty of science to back up that assumption, using tests like, for instance, comparing two pitches, or the relative levels of two signals.


But listening to music has another, important dimension - long-term memory, and long-term learning. Music involves not just detecting sound, but objectifying sound: this kind of instrument, playing in that kind of space, in that kind of relation to other instruments, etc. The more experience we have in listening to particular kinds of music, particular sounds, in different kinds of contexts, the better we get at objectifying similarities and differences. If you're listening to your audio system in your room, which you've heard many times, with a recording you've heard many times, you're more likely to detect small variations from the "usual sound" than a person who's entirely unfamiliar with your system and recording. And if you're familiar with critical listening, you're more likely to detect differences in unfamiliar situations. It seems to me that the design of most DBTs minimizes the role of cognition, experience, learning, whatever you want to call it.


To Sean's example, let's say speaker A and speaker B are the same basic design, with minor modifications to the crossover. A group of consumers brought in for a DBT finds no statistically significant difference. But an experienced speaker designer insists there is a difference. If you give that designer a week to do sighted comparisons between A and B, with the system and room under test, using all the recordings he normally uses for evaluation, would you not expect at the end of that time for the designer to have a better chance of differentiating A from B in a DBT than the consumer group? And yet, the results of DBT tests are rarely characterized as "under the time frame of this test, with listeners of the experience level chosen, the test did not produce statistically significant results". The role of experience and learning is usually discounted, and the results of the test used to "prove" what credulous morons audiophiles all are.


Bob Stuart of Meridian, who has degrees in both audio engineering and psychoacoustics and has been as influential as any single person in the field of digital audio, has some interesting observations about this in a recent interview/lecture he gave to the UK Audio Engineering Society. The MP3 audio recording of that interview can be found on this page:




Starting at about 1:24:00 in, he talks about A/B testing and its limitations. There's also a fascinating section about digital audio, digital formats, and the design of Meridian's version of minimum-phase digital filtering, starting about 44 minutes in. It's not quite the same without seeing his slides, but still very interesting.



This is not true. I conduct blind tests for measuring listener preferences between different amplifiers, loudspeakers, cables, etc. By making the tests blind you are removing the psychological biases associated with brand, price, etc. If you don't do the test blind, these biases will affect the outcome of the test, and you don't get a true measure of the sound quality. I just posted an article on my blog, "The Dishonesty of Sighted Listening Tests," @ http://seanolive.blogspot.com.




Cheers | Sean Olive | Director Acoustic Research | Harman International | http://seanolive.blogspot.com


The point of my blog posting is that the German voiced loudspeaker was perceptually indistinct from the European version of the same speaker. The so-called German "target" of the loudspeaker was essentially the same as every other highly-rated loudspeaker we have measured whether the loudspeaker was designed in Canada, US, Japan, England or Germany: an accurate, flat frequency response.


If the loudspeaker is anything other than accurate, then the "true" sound of the German instruments and orchestra captured in recordings will never be faithfully reproduced: it will be a distortion of what the artist intended. Our philosophy is that audio is a science in the service of art (the music and the recording). If science allows different interpretations of how a loudspeaker should sound, then the art will never be faithfully and reliably reproduced.


Your first point is well taken. Critical listening is a tough job, and requires familiarity with the task, program material, room, etc. We train listeners so that performance in listening tests becomes second nature to them. Still, we find that when untrained listeners are given the opportunity to participate in loudspeaker tests under well-controlled test conditions, their loudspeaker preferences are essentially the same as the trained listeners. The responses of the trained listeners are simply more discriminating and consistent.


Cheers | Sean Olive | Director Acoustic Research | Harman International | http://seanolive.blogspot.com


Hi MahlerFreak,


I agree with most of your points. As you say, it is very dangerous to generalize or extrapolate the results of a listening test to conditions outside the ones tested. There are some things you can do to at least ensure the test conditions are sensitive to, and revealing of, potential sound quality problems in the audio component. That way, you are testing the components under a worst-case scenario to minimize problems in the real world. These include:

1) Encouraging early reflections in the listening room so that the off-axis problems in the loudspeaker are more apparent

2) Carefully selecting and training listeners who can detect and accurately describe linear and nonlinear distortions in the components. Trained listeners develop familiarity with the music, listening room and task.

3) Carefully choosing program material known to be revealing of these distortions

4) Always verifying the subjective results against the objective measurements and known psychoacoustic theory. If the listening test results are a surprise, then some further explanation is required.


These are some of the precautions we take to avoid some of the pitfalls of double-blind testing you have listed.






Cheers | Sean Olive | Director Acoustic Research | Harman International | http://seanolive.blogspot.com


Thanks Sean, your second post eliminated many of the questions I had about your first one.


I'd be very interested to hear your point of view on Bob Stuart's observations - they're only a few minutes long if you start at the location I gave (1:24:00 in). Particularly interesting to me is Stuart's observation that our cognition gets in the way of effective A/B testing; that the problem is not remembering what we just heard, rather the problem is forgetting it at the cognitive level.






Replying to "There is no such thing as a German-Voiced Loudspeaker"


I think that every speaker design is a compromise. While the ultimate goal of speaker design is to achieve the most accurate, most neutral speaker, that vision of the ideal speaker is going to be limited by the designer's perception of what is accurate, or more likely what is acceptable. I am not aware of any set of measurements that will describe how warm a sound is, how smooth or mellow, how big an image, how wide the soundstage, how much inner detail, etc. Good conventional measurements do not guarantee good sound, so ultimately the final voicing will rely on subjective listening tests as much as anything else.


I found the idea of German voicing quite interesting. Maybe it is overkill to spend time and effort making a special regional edition of an audio product. However, it is too bad that the idea was killed by what I think is, at best, a flawed test. Since the speakers were designed with a specific group of listeners in mind, was the test carried out on that specific audience before deciding it was worthless? This really gets into a rather specialized area, and a person who grew up with the so-called German sound and taste might be able to appreciate the difference more. If you design speakers specifically for German or northern European customers, why would you let Japanese or American listeners tell you that it does not matter? Maybe it does, maybe it doesn't. (I assume this to be the case since you stated that it was Harman employees who participated.)


No doubt there is plenty of voodoo science, rubbish product, and fake audio trickery out there, but I still think that familiarity and prolonged exposure are probably more important than blind vs. non-blind testing. However, they are much more difficult to set up and arrange, especially for a large group.





Actually, there are objective measurements for quantifying many of the sound quality attributes you mention: brightness/sharpness, envelopment, roughness, etc. You can find these metrics discussed in the JAES, the JASA, and the relevant psychoacoustic literature. If you read Wolfgang Klippel's PhD work (~1990), he measured listeners' loudspeaker preferences and various perceptual attributes, and developed a model that would predict their intensity based on anechoic and in-room measurements. Floyd Toole discusses Klippel's work in his new book "Sound Reproduction."


The notion that different countries or regions prefer different loudspeaker sound qualities has never been scientifically proven. If audio is to be treated as science in the service of art, I would prefer that these tastes, if they exist, be encoded in the recording (do record companies release different mixes for different regional tastes? I don't think so), or dealt with via the tone controls or DSP settings in the audio receiver or iTunes menu. Can you imagine an audio receiver or car head unit with a sound quality knob that says "USA, Canada, Germany, Denmark, Japan, China" instead of the current useless labels designed by brain-dead marketing departments: "Jazz, Country, Pop, Classical"?


Over the past 25 years we have included many different nationalities in our tests -- including Germans -- and could not find any significant differences in the speakers they prefer. We have also measured many speakers designed in different countries for their local markets, and have found them to generally converge on the same flat frequency response target -- another sign that accuracy seems to be the universal goal. Highly respected German, British, Canadian and Japanese speaker manufacturers make only one SKU that is sold everywhere in the world. So the notion that loudspeaker preference is universal is not just my opinion, but one shared by many loudspeaker companies throughout the world. I think the same is also true throughout the world for amplifiers (flat), speaker cables (flat) and CD/DVD/Blu-ray players (flat), etc. Why would it be different for loudspeakers?


At Harman, we have just developed a standard listening room at our locations in the USA, England, Germany and Japan, and I will have the opportunity to run some controlled scientific tests in these different countries to see whether your hypothesis is true or false. Stay tuned.


Cheers | Sean Olive | Director Acoustic Research | Harman International | http://seanolive.blogspot.com




Thanks for the reference regarding subjective sound quality and its correlated measurement. I will check that out as it should make for some interesting reading.


I am not advocating different versions of the same product for different regions; as I said before, it may be overkill. After all, the designer should offer, in his or her vision, the best product they can produce given the usual constraints. If buyers agree with that vision, they will buy it. It was just interesting to see what would happen if a product were fine-tuned specifically for a particular market (assuming that the person doing it really knows the business). The audio market is not large enough to really demand that kind of specialty, but it is nice to see someone actually think of it and do something about it.


In a way, regional sound is becoming obsolete as musicians and audiences travel and are exposed to far more varied sounds. Now pretty much every orchestra has musicians from all over, so the unique sounds of the Vienna Philharmonic, the Berlin Philharmonic, etc. are slowly being diluted (luckily their unique concert hall sound is not). It was a nice thought that maybe someone is still trying to preserve those differences, even through audio design.





I'm glad you agree with me that designing products for regional tastes in audio sound quality is probably overkill, and, as you say, less necessary given today's world economy, where cultural influences have no borders.


If I seem a little argumentative about this topic, it's because I am currently dealing with this exact issue with one of our customers, who believes we should be designing products to suit a particular regional taste in sound quality. Defining what that taste is, or finding scientific evidence to support it, has so far not been very fruitful. That's why I said "stay tuned" :)




Cheers | Sean Olive | Director Acoustic Research | Harman International | http://seanolive.blogspot.com


First I see you on HA, then Stereophile, now I see you here. It's like the gods are mingling with the mortals or something. I took the liberty of some mindless link propagation onto HA, but I fully expect all the interesting action (and/or trolling) to occur here and on Stereophile. Heh.


Thank you very much for this blog post. Of course this is "kinda" old news (from 1994) but the results and implications are obviously not well known to the public.


I recall some speculation once about different characteristics of hearing for different ethnic groups - that, e.g., some of the brouhaha about differences between measured equal-loudness curves might be attributed to different groups from around the world providing varying amounts of data for each curve. Was there ever any conclusion to this speculation? Does the demonstrated nonexistence of regional speaker preferences imply that such regional/ethnic listening differences do not exist?


I don't have the original paper (or AES E-library access anymore) to look into this more closely, so I'll just ask you. Statistically, was this conclusion of the equivalence of the German voicing to other voicings stated in terms of type II error, or beta < 0.05? Or is there an alternative means of proving this? Statistical equivalence has always been a big sticking point; there has been talk on HA of noninferiority testing being required for that, with a potential requirement for many thousands of trials in some circumstances.
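To give a sense of where those "thousands of trials" come from, here is a rough sketch using a plain normal-approximation power calculation. The 55/60/70% "true" detection rates are hypothetical, and this is not the statistics actually used in the 1994 paper:

```python
# How many forced-choice trials are needed so that a one-sided test at
# alpha = 0.05 also keeps type II error (beta) at 0.05, for a given true
# success rate just above chance? Normal approximation to the binomial.
from math import sqrt

Z_ALPHA = 1.645  # one-sided alpha = 0.05
Z_BETA = 1.645   # beta = 0.05, i.e. 95% power

def trials_needed(p_true: float, p0: float = 0.5) -> int:
    """Trials for 95% power to detect a true rate p_true against chance p0."""
    n = ((Z_ALPHA * sqrt(p0 * (1 - p0)) + Z_BETA * sqrt(p_true * (1 - p_true)))
         / (p_true - p0)) ** 2
    return int(n) + 1

for p_true in (0.70, 0.60, 0.55):
    print(f"true rate {p_true:.2f}: ~{trials_needed(p_true)} trials")
```

An easily heard difference (70% correct) needs only a few dozen trials, but a marginal one (55% correct) already pushes past a thousand, which is why equivalence claims from small listening panels are so hard to make rigorous.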




Yes, I'm everywhere. Too much idle time on my hands as I am on vacation this week. Our study is old (1994) but, as you say, the results and implications still hold true today.


The ISO 226 equal-loudness curves were updated in 2003 with data from Japan, Germany, Denmark, the UK, and the USA. I didn't realize that claims were being made that some of the discrepancies in loudness contour results were possibly related to differences in the racial perception of loudness. I would like to find more information on that debate. If true, the current ISO curves are biased towards Japanese hearing, since Japan contributed 40% of the ISO 226:2003 data; perhaps A-weighting should be renamed J-weighting.


We didn't conclude in the 1994 study that the German and Danish crossovers sounded identical (they didn't), just that the differences were sufficiently small that listeners could not reliably formulate a preference between them, based on the mean preference ratings and the overlapping 95% confidence intervals. I did a Scheffe post-hoc test between the pairs of speakers to show the differences were not statistically significant, but didn't show that in the paper. Also, the German speaker had a much more complicated crossover that didn't warrant the extra cost, based on the subjective results and the anechoic measurements.
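For readers unfamiliar with the "overlapping 95% confidence intervals" criterion, a minimal sketch of the computation (the ratings below are invented for illustration; the study's actual data is not reproduced here):

```python
# Mean preference rating with a ~95% confidence interval
# (normal approximation: mean +/- 1.96 standard errors).
from math import sqrt
from statistics import mean, stdev

def ci95(ratings: list[float]) -> tuple[float, float]:
    """Return (mean, 95% CI half-width) for a list of preference ratings."""
    se = stdev(ratings) / sqrt(len(ratings))
    return mean(ratings), 1.96 * se

german = [6.1, 5.8, 6.4, 5.9, 6.2, 6.0, 5.7, 6.3]  # invented ratings
danish = [5.9, 6.2, 5.8, 6.1, 6.0, 6.3, 5.6, 6.2]  # invented ratings

for name, r in (("German", german), ("Danish", danish)):
    m, hw = ci95(r)
    print(f"{name}: {m:.2f} +/- {hw:.2f}")
# When the two intervals overlap this heavily, the difference in mean
# preference can't be distinguished from noise at this sample size.
```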


Cheers | Sean Olive | Director Acoustic Research | Harman International | http://seanolive.blogspot.com


Heh. So, how much more effort are you going to put into the Stereophile ruckus? Let me know anytime if I don't need to try to defend you. ;)


I wish I remembered where I heard that rumor about racial perception of loudness. Anecdotally, the only particularly specific thing I can recall is that some Head-Fi people have commented there are a lot of Asian fans of the Etymotic ER-4B, which is a diffuse field equalized IEM, and few people had any earthly idea why so many people would consistently like a treble boost that big. And of course, there is the whole debate about different mastering styles for different international markets (which at least in the classical vinyl world has been documented in old issues of Gramophone), to match the debate over different speaker styles. I suppose I should ask on HA if anybody else remembers anything about this?


Re crossovers: ah, of course - it does largely boil down to cost in the end. Which, of course, is not a bad thing.




It's a shame you came onboard here a few days after the ALAC vs AIFF thing, because I opined at great length on the interpretation of blind test results and on larger epistemological issues in audio, and I would have appreciated some professional input on what I was saying. Without referring you to a 200-post thread, which I'm sure is something you're *dying* to read on your vacation, I invite you to pop a few of my balloons :) :


Would you agree that, even though the statistical results of a blind test cannot be used to prove a null hypothesis (in the negative result), nor even perhaps reject it (in the positive) with exactly 100% certainty, a rational observer could still come to either conclusion once type I or type II error has been made arbitrarily low?


Would you also agree that, given a blind test designed to detect engineering/mathematical impossibilities - lossless != lossless, or Nyquist's theorem is wrong, etc. - once all numerical differences between the devices under test have been confirmed not to exist or not to be pertinent, such a test is internally inconsistent and can never be meaningfully constructed?


Would you also agree that the theory (and I daresay, paradigm) of sighted, subjective listening is internally consistent, and for its adherents, has meaningful answers for all observed results - and therefore, efforts to disprove it cannot exclusively rely on scientific or statistical evidence?




Also, there was a question going between AV-OCD and me which I think you're the best person to ask. Suppose that, contrary to what the objective measurements say, an end user, in their own living room or a dealer's room, etc., gives a sighted preference to a speaker which measures more poorly than some other tested speaker. How should the end user go about resolving this contradiction?


