The Computer Audiophile Posted June 19, 2020 17 minutes ago, pkane2001 said: Enough, Alex. @The Computer Audiophile can you please give me moderation rights here? Done. Founder of Audiophile Style | My Audio Systems
pkane2001 Posted June 19, 2020 7 minutes ago, Audiophile Neuroscience said: Paul, I am merely pointing out you may be suffering from the very condition you are exploring. I applaud your interest in bias. However, the conclusions you shared in the opinion piece and Head-Fi post are just opinions that would appear to support your confirmation bias, IMO. If you care to discuss individual studies, okay. But the whole "clinical studies show you will have brighter teeth in just two weeks" doesn't cut it; you must examine each study. You were the one stating it was easy. That's not my experience. You obviously want me out of this thread. Bedtime here, anyway. 🙂 OK, you keep diagnosing my condition and I'll keep asking for actual information. I'm not sure where I stated anything is easy. The whole reason I started this thread was to find out what, if anything, can be done about proper testing methodology to eliminate obvious biases. Insofar as you think any of the shared studies are mistaken, I'd love to hear why and how, and to see any evidence that demonstrates this. Accusing me of bias doesn't help, since I'm not the one doing these studies; I'm just citing them. -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer
The Computer Audiophile Posted June 19, 2020 Hi Guys, I believe that people in this thread are doing the exact same thing that objectivists used to do in subjective threads, but in reverse. If you remember, we had to create a special forum for these discussions because so many other discussions were ruined with "prove it," "you're crazy," "audiophool," and "blind test or it didn't happen" type comments. Well, some of the subjective-leaning folks in this thread are doing the same thing to an objective thread. Please stop. This is an objective thread. If you can't work within the rules of this sub-forum, then stay out of it.
Audiophile Neuroscience Posted June 19, 2020 41 minutes ago, The Computer Audiophile said: Hi Guys, I believe that people in this thread are doing the exact same thing that objectivists used to do in subjective threads, but in reverse. If you remember, we had to create a special forum for these discussions because so many other discussions were ruined with "prove it," "you're crazy," "audiophool," and "blind test or it didn't happen" type comments. Well, some of the subjective-leaning folks in this thread are doing the same thing to an objective thread. Please stop. This is an objective thread. If you can't work within the rules of this sub-forum, then stay out of it. I didn't know that inviting discussion on individual studies is not appropriate in an objective thread. These studies appear to support "blind test or it didn't happen" (your words). I also feel that questioning confirmation bias in a thread about bias is germane to the topic. Sound Minds Mind Sound
The Computer Audiophile Posted June 19, 2020 Just now, Audiophile Neuroscience said: I didn't know that inviting discussion on individual studies is not appropriate in an objective thread. These studies appear to support "blind test or it didn't happen." I also feel that questioning confirmation bias in a thread about bias is germane to the topic. If you can provide objective information, it's welcome in this thread. We have a complete forum for all other information and I encourage you to use it. You can even create your own thread and post a link to it in this one.
Audiophile Neuroscience Posted June 19, 2020 1 minute ago, The Computer Audiophile said: If you can provide objective information, it's welcome in this thread. We have a complete forum for all other information and I encourage you to use it. You can even create your own thread and post a link to it in this one. I invited Paul to pick a study, any study that he cited. He did not. As there were over 15 cited, I was suggesting that he choose rather than have me cherry-pick. Again, I do not see how this offer to discuss bias and cited studies fits into *your* description of a "complete forum for all other information." The OP wants me out, so as said, I will respect his wishes.
pkane2001 Posted June 19, 2020 1 minute ago, Audiophile Neuroscience said: I invited Paul to pick a study, any study that he cited. He did not. As there were over 15 cited, I was suggesting that he choose rather than have me cherry-pick. Again, I do not see how this offer to discuss bias and cited studies fits into *your* description of a "complete forum for all other information." The OP wants me out, so as said, I will respect his wishes. I don't have a favorite; you pick one. I shared them as I came across them. Start with the first one, if you'd like. -Paul
fas42 Posted June 19, 2020 On 5/23/2020 at 12:25 AM, pkane2001 said: This thread is specifically about bias in audio testing, how it may affect the results, and any mitigation strategies that can help deal with it. Specifically about the last point, Paul, I'll mention the general methods I use when assessing: First of all, the word "preference" is meaningless to me - that concept is alien to how I think ... either a system is acceptably accurate to the recording, or it's not. If the former, then when comparing two genuinely very highly performing systems I might favour one over the other - but this situation has never arisen for me. What I find I'm always doing is determining whether a setup is acceptably accurate. And how I go about that is as follows: I have a range of recordings that have very distinct characteristics; they have 'signatures' which may prove difficult for a particular rig to reproduce acceptably - or, they may have little trouble doing so. When faced with a new replay setup, I'll throw an almost random set of recordings at it, and see what first impressions tell me. Then, based on that feedback, I'll narrow down to just a few recordings which most clearly highlight where I feel the rig is not working at its best ... I will play these at various volumes, and explore what further information that gives me. The better a particular system functions, the more 'difficult' the recordings I will use, the louder I will listen to them, and the 'deeper' into the sound image I will focus, to see if the finest details still come across correctly. IOW, I'm always attentive to any signs of misbehaviour; any failures of the playback to retrieve what's on the recording to an acceptable standard ... by this approach I feel that bias plays no part - the concept is merely to identify what is incorrect in the playback.
pkane2001 Posted June 21, 2020 On 6/19/2020 at 7:47 PM, fas42 said: Specifically about the last point, Paul, I'll mention the general methods I use when assessing: First of all, the word "preference" is meaningless to me - that concept is alien to how I think ... either a system is acceptably accurate to the recording, or it's not. If the former, then when comparing two genuinely very highly performing systems I might favour one over the other - but this situation has never arisen for me. What I find I'm always doing is determining whether a setup is acceptably accurate. And how I go about that is as follows: I have a range of recordings that have very distinct characteristics; they have 'signatures' which may prove difficult for a particular rig to reproduce acceptably - or, they may have little trouble doing so. When faced with a new replay setup, I'll throw an almost random set of recordings at it, and see what first impressions tell me. Then, based on that feedback, I'll narrow down to just a few recordings which most clearly highlight where I feel the rig is not working at its best ... I will play these at various volumes, and explore what further information that gives me. The better a particular system functions, the more 'difficult' the recordings I will use, the louder I will listen to them, and the 'deeper' into the sound image I will focus, to see if the finest details still come across correctly. IOW, I'm always attentive to any signs of misbehaviour; any failures of the playback to retrieve what's on the recording to an acceptable standard ... by this approach I feel that bias plays no part - the concept is merely to identify what is incorrect in the playback.
While maybe that's true for you, Frank, it may not be true for most people. When you say there is no preference except for what is accurate, this is certainly not true in all cases, and for one, it would be great if you could demonstrate that it is true with some objective evidence. Do we even know what "accurate reproduction" means for the whole audio system, including speakers, the room, and the listener? How can you be sure that what you think is "accurate reproduction" is not, in fact, seriously distorted? Just because you think it is, doesn't make it so. As you can see in many of the studies cited here, even highly trained professionals fall for simple sighted bias and prefer something based on appearance and expectation rather than on the sound, despite being aware of bias and being taught to avoid it. You may also find interesting @Archimago's last blind test results. There's a result with a p-value better than 0.05 that points to the test subjects preferring a little distortion (-75dB THD) with their music over the completely undistorted version. This is not proof of anything, but it does point to the possibility that what sounds better to many isn't necessarily what is most accurate. -Paul
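For readers unsure what "a p-value better than 0.05" means here: the usual way to check whether a preference tally could just be guessing is an exact binomial test. The sketch below uses only the Python standard library, and the vote counts are purely hypothetical (not Archimago's actual numbers); it just shows the arithmetic behind that kind of claim.

```python
from math import comb

def binomial_p_one_sided(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p): the chance of seeing
    at least this many 'prefer A' votes if listeners were really guessing."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# Hypothetical tally: 66 of 100 respondents prefer sample A over sample B.
p_value = binomial_p_one_sided(66, 100)
print(f"p = {p_value:.4f}")   # well below the conventional 0.05 threshold
```

A 55/100 split, by contrast, gives a p-value far above 0.05, which is why small preference margins in these polls are usually reported as inconclusive.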
fas42 Posted June 22, 2020 2 hours ago, pkane2001 said: While maybe that's true for you, Frank, it may not be true for most people. I agree it's not true for most audio enthusiasts - generally, the intent is for certain, prescribed recordings to sound impressive, and most else much less so - cutting myself off from so much good music is not particularly appealing, 😉. 2 hours ago, pkane2001 said: When you say there is no preference except for what is accurate, this is certainly not true in all cases, and for one, it would be great if you could demonstrate that it is true with some objective evidence. That it's true that most people would prefer, or not prefer, accurate playback? 2 hours ago, pkane2001 said: Do we even know what "accurate reproduction" means for the whole audio system, including speakers, the room, and the listener? How can you be sure that what you think is "accurate reproduction" is not, in fact, seriously distorted? Just because you think it is, doesn't make it so. As you can see in many of the studies cited here, even highly trained professionals fall for simple sighted bias and prefer something based on appearance and expectation rather than on the sound, despite being aware of bias and being taught to avoid it. Accurate reproduction can only be assessed at the interface between the speakers and the listening environment - headphones are of course the classic means for doing this, but this is not the ideal listening situation for many people. How I get around this is by what I've mentioned many times - listening to an individual speaker as if it were one half of a headphone ... strangely enough, the physical principles in place are actually rather similar ... 😜. How it works in practice is that if one can use a single speaker like a headphone driver, then the SQ is of a sufficient standard.
Seriously distorted replay is completely obnoxious to listen to in this manner; it makes it trivially obvious that the replay chain is faulty. 2 hours ago, pkane2001 said: You may also find interesting @Archimago's last blind test results. There's a result with a p-value better than 0.05 that points to the test subjects preferring a little distortion (-75dB THD) with their music over the completely undistorted version. This is not proof of anything, but it does point to the possibility that what sounds better to many isn't necessarily what is most accurate. I mentioned in his thread on this forum about that test why something like this might happen - purely speculation on my part; I haven't listened to his samples on a well enough performing rig to check further.
pkane2001 Posted June 22, 2020 I've previously posted the ITU-R recommendation on proper testing procedures for small audio impairments, in other words, for testing for small audible differences. But what if the differences are, say, "night-and-day" or "not even subtle," as is often stated by audiophiles and the press? It turns out there's a detailed recommendation for proper objective testing in these cases as well: ITU-R BS.1534-3, 10/2015 https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1534-3-201510-I!!PDF-E.pdf This recommends the MUSHRA testing methodology: a double-blind, multi-stimulus test method with a hidden reference and hidden anchor(s). This method is designed to help weed out test subjects who are either not sensitive to real differences or who might be biased in one way or another. More about this testing methodology in another post. Here is what I found interesting in this recommendation (quoting the salient parts): 4 Selection of assessors: The higher the quality reached by the systems to be tested, the more important it is to have experienced listeners. 4.1 Criteria for selecting assessors: Only assessors categorized as experienced assessors for any given test should be included in final data analysis. 5.1 Description of test signals: It is recommended that the maximum length of the sequences be approximately 10 s, preferably not exceeding 12 s. This is to avoid listener fatigue, to increase the robustness and stability of listener responses, and to reduce the total duration of the listening test. 5.2 Training phase: In order to achieve reliable results, it is mandatory to train the assessors in special training sessions in advance of the test.
5.5 Recording of test sessions: In the event that something anomalous is observed when processing assigned scores, it is very useful to have a record of the events that produced the scores. The recommendation also goes into the details of the recommended statistical analysis of the results, and even into how to present those results in a report. Very thorough! -Paul
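To make the "weed out" idea concrete: BS.1534-3 describes post-screening assessors by how they score the hidden reference. The sketch below is my own simplification with made-up scores; the 90-point threshold and 15%-of-items fraction follow the commonly cited BS.1534-3 criterion, so treat the exact numbers as an assumption and check the spec before relying on them.

```python
def passes_post_screening(hidden_ref_scores, threshold=90, max_fraction=0.15):
    """BS.1534-3-style check: an assessor who scores the *hidden reference*
    below `threshold` on more than `max_fraction` of test items is excluded,
    since they failed to recognize the unimpaired signal too often."""
    misses = sum(1 for s in hidden_ref_scores if s < threshold)
    return misses / len(hidden_ref_scores) <= max_fraction

# One assessor's hidden-reference scores across 10 trials (0-100 scale):
reliable   = [100, 98, 95, 100, 92, 100, 97, 100, 99, 100]
unreliable = [100, 80, 60, 100, 85, 100, 70, 100, 99, 100]
print(passes_post_screening(reliable))    # True
print(passes_post_screening(unreliable))  # False
```

The point of screening on the hidden reference rather than on the impaired items is that it measures reliability without assuming anything about which impairments are audible.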
pkane2001 Posted June 26, 2020 There are often objections raised that blind audio test results are invalid because it's too difficult to do them right. That argument is often used by those who haven't tried blind tests or who want to disprove the results of such tests, and usually without evidence. So, let's see what professionals do to reduce possible errors in DBTs. As most research and studies already cited here demonstrate, there's a very large, measurable error caused by sighted testing. In some cases the result is wrong by as much as 40%, as reported in one of the studies. So, if blind tests need to be used to reduce the effect of this error, then what, if anything, can we do to make blind sensory tests more accurate? Here are a few slides from a presentation already shared earlier. This speaks to what a test administrator should do to improve the quality of the test results. Since that presentation is on sensory preference testing not specific to audio, I edited out some of their examples that didn't relate to audio. (I highlighted a few that I find are often overlooked in audio testing, both blind and sighted.)

Training a panel
• Expectation error – any information a panelist receives influences the outcome
• Panels find what they are expected to find
• Trick – provide only enough information for the panelist to be able to do the test
• Try not to include people already involved in the experiment (single blind)
• Avoid codes that create inherent bias (1, A, etc.)
• Motivated panelists
• Leniency error – rating products based upon feelings about the researcher
• Suggestion effect – responses of other panelists to the product (need to isolate panelists and keep them quiet)

Testing times
• Panelists must not be too tired
• Late morning or mid afternoon are good
• Early AM is bad for testing
• Late day – lack of panelist motivation

Stimulus error
• Influence of irrelevant cues
• Try to mask unwanted differences (e.g. colored lights)
• Logical error – associated with stimulus error – the tendency to rate characteristics that appear to be logically associated. Control by masking differences

Halo and proximity effects
• Halo effect – caused by evaluating too many factors at one time. Panelists already have an impression about the product when asked about a second trait and will form a logical association (e.g. dry -> tough)
• Best to structure testing so that only one factor is tested at a time (difficult to do)
• Proximity error – characteristics are rated more similarly when they follow in close proximity

Convergence effect
• A large difference between two samples will mask small differences between others
• This causes results to converge, so use random order to reduce it

Positional and contrast effects
• Positional effect – the tendency to rate the second product higher or lower
• If two products are very different, panelists will exaggerate the differences and rate the 'weaker' sample lower than they otherwise would
• Use random order; use all possible presentation orders

Controls
• Include a reference sample in the test as part of the mix
• Use random numbers as codes
• Balance the order of presentation to reduce physiological and psychological effects

-Paul
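Several of the controls in those slides (neutral random codes, balanced order, using all possible presentation orders) are easy to automate. The sketch below is illustrative only, with invented device names, and shows one way to generate a balanced test plan:

```python
import itertools
import random

def balanced_orders(samples, n_panelists, seed=0):
    """Build a presentation plan in which every permutation of the samples
    is used equally often (balanced order), and each sample gets a neutral
    random 3-digit code instead of bias-prone labels like 'A' or '1'."""
    rng = random.Random(seed)
    perms = list(itertools.permutations(samples))
    rng.shuffle(perms)
    # Unique random 3-digit codes, one per sample
    code_vals = rng.sample(range(100, 1000), len(samples))
    codes = dict(zip(samples, (str(c) for c in code_vals)))
    plan = []
    for i in range(n_panelists):
        order = perms[i % len(perms)]   # cycle so all orders are used equally
        plan.append([(codes[s], s) for s in order])
    return plan

# Two hypothetical devices, four panelists: each of the two possible
# presentation orders is used exactly twice.
for row in balanced_orders(["device_A", "device_B"], 4):
    print(row)
```

With more than two samples the permutation count grows factorially, which is why real panels often use Latin-square subsets rather than every order; the cycling scheme above is the simplest fully balanced case.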
Summit Posted June 27, 2020 23 hours ago, pkane2001 said: There are often objections raised that blind audio test results are invalid because it's too difficult to do them right. That's often used by those who haven't tried blind tests or want to disprove the results of such tests, and usually without evidence. You keep saying the same thing over and over again. I know of no genuine objections to blind audio tests being good for testing bias. The question is whether those bias tests, as they are conducted, are good for testing the subtle SQ differences of audio gear, gear that many people feel they need to evaluate for a long time to get to know how it sounds. I have said it before and will tell you again that quick changes between gear are not how I and many other audiophiles listen to music and how we normally evaluate which gear to buy. Auditory memory is very short, so what those 5-30 second tests tell us is how one piece of audio gear sounds compared to another, which is completely different from when you listen for a long time, your mind has time to calibrate, and you can evaluate how the sound of a piece of gear compares to a reference, which is real, non-recorded sound. None of the problems associated with blind tests arise because they are done blind; they arise from how they are done. I have made a few blind tests with friends in the past, and the fast swapping of gear was no good. This is not limited to blind tests per se; the problem is the same whether it is a sighted or a blind A/B test. A blind test where we could first get to know the sound of the gear for a couple of hours, and then listen to a whole song before changing gear, was much more telling.
pkane2001 Posted June 27, 2020 1 hour ago, Summit said: You keep saying the same thing over and over again. I know of no genuine objections to blind audio tests being good for testing bias. The question is whether those bias tests, as they are conducted, are good for testing the subtle SQ differences of audio gear, gear that many people feel they need to evaluate for a long time to get to know how it sounds. Interesting. As I believe I said (and you quoted me): "There are often objections raised that blind audio test results are invalid because it's too difficult to do them right." And the post you were responding to was all about the steps that professionals in sensory testing take to ensure that the test results are more statistically valid, i.e., done right. Your main objection seems to be that longer-term audio evaluation is more sensitive than short-term to small differences. That's an often-raised hypothesis, but I've yet to see any sort of objective evidence to support it. Can you cite some studies that demonstrate this increased sensitivity? By the way, your hypothesis in no way invalidates blind testing; it simply proposes a method of doing it. -Paul
Summit Posted June 27, 2020 5 hours ago, pkane2001 said: Your main objection seems to be that longer-term audio evaluation is more sensitive than short-term to small differences. That's an often-raised hypothesis, but I've yet to see any sort of objective evidence to support it. Can you cite some studies that demonstrate this increased sensitivity? By the way, your hypothesis in no way invalidates blind testing; it simply proposes a method of doing it. That is correct; my problem has never been with testing SQ blind. I have done it myself a few times. The problem is the methodology commonly used in all types of tests to determine the sound quality of different gear. The biggest problem is that these tests are not conducted in the way that I and many audiophiles listen to and evaluate audio gear sighted. To be able to hear subtle differences between audio equipment I need to be familiar with the room and audio system as well as the recordings. A blind test or a sighted A/B test is only difficult to do if the point is to achieve statistically significant proof through many fast, repeated switches of gear. Our auditory memory is very short, so a test where they change gear every 5-30 seconds won't let us hear the SQ, just the change between the gear and, to some degree, their overall sound signatures. To properly evaluate any audio gear we need to be able to first "calibrate" to the sound and then compare it to our long-term memory of real, non-recorded sound. Echoic memory and other short-term memories are not useful for this task. When I test two pieces of audio gear in my stereo, I compare how the bass, drums, guitar, piano, voices, etc. sound to how they normally sound in real life, and to do that I need to use my long-term memory. Even when I compare two pieces of audio gear, I use my memory of how those instruments sound compared to both the other audio gear and the references, the non-recorded memories I have.
I believe that to understand why most audio tests are flawed, we need to understand how we hear and compare sound. Here is a start: "Each type of memory is tied to a particular type of brain function. Long-term memory, the class that we are most familiar with, is used to store facts, observations, and the stories of our lives. Working memory is used to hold the same kind of information for a much shorter amount of time, often just long enough for the information to be useful; for instance, working memory might hold the page number of a magazine article just long enough for you to turn to that page. Immediate memory is typically so short-lived that we don't even think of it as memory; the brain uses immediate memory as a collecting bin, so that, for instance, when your eyes jump from point to point across a scene the individual snapshots are collected together into what seems like a smooth panorama." https://brainconnection.brainhq.com/2013/03/12/how-we-remember-and-why-we-forget/
pkane2001 Posted June 27, 2020 1 hour ago, Summit said: That is correct; my problem has never been with testing SQ blind. I have done it myself a few times. The problem is the methodology commonly used in all types of tests to determine the sound quality of different gear. The biggest problem is that these tests are not conducted in the way that I and many audiophiles listen to and evaluate audio gear sighted. To be able to hear subtle differences between audio equipment I need to be familiar with the room and audio system as well as the recordings. A blind test or a sighted A/B test is only difficult to do if the point is to achieve statistically significant proof through many fast, repeated switches of gear. Our auditory memory is very short, so a test where they change gear every 5-30 seconds won't let us hear the SQ, just the change between the gear. To properly evaluate any audio gear we need to be able to first "calibrate" to the sound and then compare it to our long-term memory of real, non-recorded sound. Echoic memory and other short-term memories are not useful for this task. When I test two pieces of audio gear in my stereo, I compare how the bass, drums, guitar, piano, voices, etc. sound to how they normally sound in real life, and to do that I need to use my long-term memory. Even when I compare two pieces of audio gear, I use my memory of how those instruments sound compared to both the other audio gear and the references, the non-recorded memories I have. I believe that to understand why most audio tests are flawed, we need to understand how we hear and compare sound. Here is a start: "Each type of memory is tied to a particular type of brain function. Long-term memory, the class that we are most familiar with, is used to store facts, observations, and the stories of our lives.
Working memory is used to hold the same kind of information for a much shorter amount of time, often just long enough for the information to be useful; for instance, working memory might hold the page number of a magazine article just long enough for you to turn to that page. Immediate memory is typically so short-lived that we don't even think of it as memory; the brain uses immediate memory as a collecting bin, so that, for instance, when your eyes jump from point to point across a scene the individual snapshots are collected together into what seems like a smooth panorama." https://brainconnection.brainhq.com/2013/03/12/how-we-remember-and-why-we-forget/ Well, these are all conjectures, or possible explanations for why it may work this way. And I don't necessarily disagree with anything you said here (I'm open to hearing evidence one way or the other). But I would still like to see some studies or properly conducted blind tests demonstrating that longer-term listening can be more sensitive to minor differences than shorter-term, 8-10 second switching. That's certainly not the way the industry conducts subjective listening tests, although I've not found a clear indication of whether that's because short-term switching is just easier to conduct and more convenient, or because echoic memory limits our ability to evaluate minor audio differences beyond a few seconds. In conversations with some audio testing professionals, they did indicate that shorter-term, quick switching was the way to detect minor SQ differences. But, again, that's just someone saying it, and what I'm looking for is objective evidence of whether it's true or not. -Paul
Summit Posted June 28, 2020 15 hours ago, pkane2001 said: Well, these are all conjectures, or possible explanations for why it may work this way. And I don't necessarily disagree with anything you said here (I'm open to hearing evidence one way or the other). But I would still like to see some studies or properly conducted blind tests demonstrating that longer-term listening can be more sensitive to minor differences than shorter-term, 8-10 second switching. That's certainly not the way the industry conducts subjective listening tests, although I've not found a clear indication of whether that's because short-term switching is just easier to conduct and more convenient, or because echoic memory limits our ability to evaluate minor audio differences beyond a few seconds. In conversations with some audio testing professionals, they did indicate that shorter-term, quick switching was the way to detect minor SQ differences. But, again, that's just someone saying it, and what I'm looking for is objective evidence of whether it's true or not. Okay, before setting up and conducting a test you have to decide what to measure and how it can be measured. Sound quality is subjective in nature and is therefore very difficult to measure, because people have preferences as well as biases. Sound quality also depends on many external factors, like the quality of the room, audio system, and recordings and how they are set up. So if the test method (time, place, audio system, etc.) is significantly different from how evaluation is normally done*, mustn't we question whether the test was conducted correctly, so that it actually measures the participants' ability to hear sound quality differences between the tested audio equipment, and the participants' bias? I told you in my first post in this thread that bias can be tested for with "discrimination testing" but not with "preference testing."
Preference testing has to be set up and conducted very differently from discrimination testing. The method used in all published blind tests I have seen may be suited to measuring discrimination, but not preference in sound quality or people's ability to hear it. "Discrimination testing is a technique employed in sensory analysis to determine whether there is a detectable difference among two or more products." Let me give examples of the type of hypotheses that can be tested by the method made for discrimination testing. Example 1: There are no audible differences between USB cables, no matter the price, design, or model. Example 2: People cannot distinguish a genuine native 96 kHz, 24-bit audio recording from an audio recording that has been downsampled to 44.1 kHz, 16-bit. Detecting a difference and forming an opinion about which audio gear sounds best and most true are very different tasks. One is simple; the other is more complex. To detect a difference like in the examples above, or for aspects like loudness, bass quantity, etc., we use our short-term memories, but when we evaluate SQ we use our long-term memory. You ask for objective evidence. I have presented logical "evidence" that the test method commonly used is significantly different from how audiophiles normally evaluate sound quality between pieces of audio gear. This alone should be enough to question all these sound quality tests, IMO. They are tests whose results go against the tests normally conducted at home by millions of audiophiles around the globe. To me it's clear that the testers believe they can measure people's subjective preference, and that this can be done by conducting the test as if it were a discrimination test. And the reason for this is probably (conjecture) that they want their studies to be scientifically made, with statistically significant results. I have also pointed you to how our brain works and how it can affect audio tests.
Even if I am not an expert in how the brain and our hearing work, I know that our understanding of them has changed considerably in the last 10-15 years. It's true that I cannot quote a study that explicitly states that these tests are fundamentally flawed. My aim, however, is not to present proof that these studies are scientifically flawed. All things that can affect the result of a study/test should be presented and explained, and if an effect will influence the result, the method has to be changed so that the test itself has no influence on the result. *The best way to set up and conduct a test is to do it the way people normally listen to and evaluate sound quality. I know of no one who goes to a hi-fi shop and listens to a song by changing audio gear every 5-8 seconds. When I and many other audiophiles evaluate audio gear, we listen for hours, days or sometimes even weeks before buying it. Many audiophiles only buy gear that they can borrow to evaluate in their own audio system first. The same goes for reviewers: they often listen to the reviewed gear and compare it to their reference gear for weeks, and sometimes months, before publishing their verdicts. Audiophile Neuroscience, pkane2001, Teresa and 1 other 3 1 Link to comment
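Summit's discrimination/preference distinction can be made concrete on the discrimination side: an ABX-style discrimination test yields a count of correct identifications, and the null hypothesis "the listener is only guessing" can be checked with a one-sided binomial test. A minimal sketch in Python (the trial counts below are hypothetical):

```python
from math import comb

def binomial_p_value(correct, trials, p_guess=0.5):
    """One-sided p-value: probability of scoring at least `correct`
    out of `trials` by pure guessing (null hypothesis: no audible
    difference, so each trial is a fair coin flip)."""
    return sum(comb(trials, k) * p_guess**k * (1 - p_guess)**(trials - k)
               for k in range(correct, trials + 1))

# Hypothetical ABX session: 12 correct answers in 16 trials
p = binomial_p_value(12, 16)
print(f"p = {p:.4f}")  # p = 0.0384, below the conventional 5% level
```

Note that this only tests whether a difference was detected at all; it says nothing about which device the listener would prefer, which is exactly the distinction Summit draws.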
pkane2001 Posted June 28, 2020 Author Share Posted June 28, 2020 6 hours ago, Summit said: Okay, before setting up and conducting a test you have to decide what to measure and how it can be measured. Again, some interesting hypotheses, and I appreciate you taking the time to describe them in detail. Before doing any tests myself, I would still want to see whether there's already existing objective evidence to support or contradict these. But let's discuss your logical evidence in the absence of objective evidence. Before we dive in, let me address what may be a misunderstanding: the goal of this thread, and of the proper design of the tests listed here, is to eliminate UNRELATED biases from determining preference or discrimination ability. These biases affect both types of tests, though certainly some affect only subjective testing or only discrimination testing. So, yes, it's important to avoid biases in sensory testing that have nothing to do with the sense being tested. 6 hours ago, Summit said: Sound quality is subjective in nature and is therefore very difficult to measure because people have preference as well as bias. Sound quality is subjective. Absolutely. And yet, sound quality and subjective measures of it are what nearly every study presented here so far is trying to address. In fact, these studies demonstrate that our preferences and subjective likes are not that different from each other. When biases related to other senses are removed or controlled in the experiment, there is often a high statistical correlation between, for example, different people's liking of the sound quality of the same speakers, or the same headphones, or even the same type of distortion. It may be more difficult to do a proper study of subjective preferences, but it is not impossible, and there are industry-standard procedures published for doing so with audio, also cited in this thread.
In fact, @Archimago just completed an internet blind test on harmonic distortion that included both discrimination measures and subjective preferences. Interestingly, there was a statistically significant preference for low-level harmonic distortion over no distortion. While I can't say it was a perfectly conducted test, it does address your objection about using different equipment, rooms, etc.: in fact, nobody in this experiment used the same equipment or the same room. 7 hours ago, Summit said: The method used in all published blind tests I have seen is perhaps suited to measuring discrimination, but not preference for sound quality or people's ability to hear it. Please review all the cited studies in this thread. Nearly every one is a subjective preference study. But as part of the test, proper sensitivity and consistency have to be shown, otherwise the results can't be statistically valid. If someone can't consistently identify the same device as having certain SQ qualities, then they effectively fail the discrimination component, even though the test is designed to measure preferences. That's where proper statistical analysis has to be performed, and a test protocol such as MUSHRA is designed to detect test subjects with poor discrimination ability, or those who just guess randomly or harbor a hidden bias. 7 hours ago, Summit said: I have presented logical "evidence" that the test method commonly used is significantly different from how audiophiles normally evaluate sound quality between audio gear. This alone should be enough to question all these sound quality tests, IMO. Mmm, no. In scientific research, it's not enough to just bring up possible issues with a process. You need to demonstrate that they are indeed issues, and this requires objective evidence. For example, the claim that longer-term evaluation is more sensitive to small differences than short switching is not at all obvious, and begs to be tested.
Without evidence, it's just a conjecture, and there are thousands of those floating around just on this forum. 7 hours ago, Summit said: All things that can affect the result of a study/test should be presented and explained Presented, explained, and substantiated by evidence. There was a software engineer working for me a long time ago who claimed the bugs in his software were caused by glitches from cosmic rays hitting the computer. He was serious (!) and wouldn't accept that he could possibly have made a mistake or missed something in his programming. So, yes, he had an explanation, but without evidence it was not something I'd ever accept or even start to consider. 7 hours ago, Summit said: I know of no one who goes to a hi-fi shop and listens to a song by changing audio gear every 5-8 seconds. I'm not arguing for quick switching. I'd like to see objective evidence on whether long-term evaluation is more sensitive. Regardless of how audiophiles do their evaluations at the shop or at home, that is not objective evidence, it's just anecdote. I in no way want to make it sound like proper subjective (or even discrimination) testing is easy. In fact, this whole thread is about the complications that must be dealt with in such experiments. But for those of us who actually want to understand what is really going on, this is not an impossible task and, in my opinion, it is worth pursuing in an objective way, rather than accepting as truth what someone else says they think. -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
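On the MUSHRA point above: the protocol includes a hidden reference precisely so that inconsistent listeners can be post-screened. One published criterion (ITU-R BS.1534 suggests excluding assessors who score the hidden reference below 90 on more than 15% of items) can be sketched as follows; the score list is hypothetical:

```python
def screen_listener(hidden_ref_scores, threshold=90, max_fail_rate=0.15):
    """Post-screening in the spirit of ITU-R BS.1534 (MUSHRA):
    a listener who rates the hidden reference below `threshold` on
    more than `max_fail_rate` of the test items is excluded, since
    they evidently cannot identify the unimpaired signal."""
    fails = sum(1 for score in hidden_ref_scores if score < threshold)
    return fails / len(hidden_ref_scores) <= max_fail_rate

# Hypothetical scores one listener gave the hidden reference on 10 items:
# only 1 of 10 falls below 90, so the listener is retained
print(screen_listener([100, 95, 100, 88, 100, 100, 92, 100, 100, 100]))  # True
```

The design choice is the one Paul describes: the screening is orthogonal to the preference question itself, so a random guesser or a biased rater is removed before their preference votes can distort the statistics.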
botrytis Posted June 28, 2020 Share Posted June 28, 2020 Over on the BAS site (Boston Audio Society), they spend time talking about non-biased listening testing and have done quite a bit of it. Another group that does this is the SMWTMS (Southeast Michigan Woofer and Tweeter Marching Society, not making this up). Both show that sighted listening is inherently biased. pkane2001 1 Current: Daphile on an AMD A10-9500 with 16 GB RAM DAC - TEAC UD-501 DAC Pre-amp - Rotel RC-1590 Amplification - Benchmark AHB2 amplifier Speakers - Revel M126Be with 2 REL 7/ti subwoofers Cables - Tara Labs RSC Reference and Blue Jean Cable Balanced Interconnects Link to comment
fas42 Posted June 28, 2020 Share Posted June 28, 2020 12 hours ago, Summit said: Okay, before setting up and conducting a test you have to decide what to measure and how it can be measured. Sound quality is subjective in nature and is therefore very difficult to measure, because people have preferences as well as biases. Sound quality also depends on many external factors, like the quality of the room, audio system, and recordings, and how they are set up. Yes, decide what you want to measure ... IME many ambitious rigs are about developing various types of tonal seasoning, and then of course subjectivity is everything in the assessment of what one hears. But "sound quality" is not subjective, in my book: it is the degree to which there is no significant audible adulteration of what's on the recording by the playback chain. One assesses it by listening for faults in the sound, for clearly evident misbehaviour of the replay. To use the dreaded car analogy, 😜, most audiophiles compare by saying things like "I prefer the MB ambience over the BMW variety". I say "I note that car 1 develops a slightly annoying vibration while accelerating at a certain engine speed, whereas car 2 doesn't; therefore, car 2 is the better quality car." Link to comment
Ajax Posted June 29, 2020 Share Posted June 29, 2020 On 6/28/2020 at 5:25 AM, pkane2001 said: Well, these are all conjectures or possible explanations for why it may work this way. And I don't necessarily disagree with anything you said here (I'm open to hearing evidence for one way or the other). But I still would like to see some studies or properly conducted blind tests that demonstrate that longer term listening can be more sensitive than shorter-term, 8-10 seconds switching when evaluating minor differences. That's certainly not the way the industry conducts subjective listening tests, although I've not found a clear indication of whether it's because short-term switching is just easier to conduct and more convenient, or because echoic memory limits our ability to evaluate minor audio differences beyond a few seconds. In conversations with some audio testing professionals, they did indicate that shorter-term, quick-switching was the way to detect minor SQ differences. But, again, that's just someone saying it, and what I'm looking for is objective evidence for whether it's true or not. Hi Paul, Your intent in establishing this blog is honourable, and your patience while attempting to enlighten the naysayers such as Alex has been exemplary. Unfortunately, the majority of people (not all) put their emotions and beliefs first, and will often cherry-pick or alter the facts to match them. Very few take a rational approach to a subject that has feelings associated with it. Just look at the rebound in the sharemarket, which assumes that the pandemic will not affect the economy, despite the evidence that things are going to get a lot worse before they get better. There is a book by author Mark Manson, "Everything Is F*cked: A Book About Hope", that explores why people adopt unhealthy beliefs and behave as they do despite the facts.
https://www.amazon.com.au/Unti-Manson-2/dp/0062888439 I would also commend those looking for a more rational approach to audio reproduction to the Archimago's Musings blog and the Audio Science Review website. I find they give a good balance to the more subjective views you find here. All the best, Ajax sandyk and Audiophile Neuroscience 2 LOUNGE: Mac Mini - Audirvana - Devialet 200 - ATOHM GT1 Speakers OFFICE : Mac Mini - Audirvana - Benchmark DAC1HDR - ADAM A7 Active Monitors TRAVEL : MacBook Air - Dragonfly V1.2 DAC - Sennheiser HD 650 BEACH : iPhone 6 - HRT iStreamer DAC - Akimate Micro + powered speakers Link to comment
Popular Post The Computer Audiophile Posted June 29, 2020 Popular Post Share Posted June 29, 2020 26 minutes ago, Ajax said: Audio Science Review website I’ll disagree with you on this one all day long. If that site were interested in science and objective info, it would publish measurements entirely differently. Objectivists claim to know the audibility thresholds for measurements. If this is true, and nothing below those thresholds matters, then it would be prudent to say a component has no audible issues and not publish the numbers that go lower. Publishing that info is terribly misleading, since according to them it could never be heard in a blind test. Bringing it back to this topic: why publish any measurements that can’t be heard in a blind test? They can only serve to mislead those objectivists who need to be saved from themselves, so they don’t spend all their money on numbers that don’t matter. However, something else entirely is up. Edit: I’ll start my own thread. Sorry for the derailing post. Here's a link to my thread. Audiophile Neuroscience and sandyk 2 Founder of Audiophile Style | My Audio Systems Link to comment
Audiophile Neuroscience Posted June 29, 2020 Share Posted June 29, 2020 Following on from what @Summit and many others have said (myself included), it's all about the methodology. So, for any test or test procedure, a pivotal question is how good the methodology is at telling you what you want to know. For good tests there are objective ways to determine this, one example being knowing the false negative rate the test produces. If a test has a false negative rate of 90%, that would render it useless. If, OTOH, the false negative rate is 5%, then you can say the test is telling me something, but there is a 1 in 20 chance it is getting it wrong. In either case you know objectively where you stand. In all the studies cited and, AFAIK, other blind listening studies, including Toole's, there are no definitive false negative rates declared (I genuinely stand to be corrected here). It seems to be assumed that the false negative rates are negligible... but how do you/they know that? Definitive statements are made, such as "blind tests show... bla bla", but what if the false negative rate is 90%? Now, as pointed out, there are recommendations, official recommendations, made to mitigate these false negatives: take your time, avoid fatigue, drink more water, whatever. However, these are just (mostly) sensible guidelines that have not themselves been validated with experimental evidence revealing the actual false negative rates achieved by adhering to them. Guidelines are guidelines; if you compare European vs American guidelines on various medical conditions you will find differences of opinion. Why? Because the experimental evidence is not conclusive and the jury remains out. Bottom line: tell me the false negative rate of any test, otherwise I have no confidence the test is not giving me a wrong result. Sound Minds Mind Sound Link to comment
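The false negative rate asked for here can at least be computed for the discrimination half of the problem: given an assumed per-trial detection probability, the Type II error rate of an N-trial forced-choice test follows directly from the binomial distribution. A sketch (the 70% detection probability and the 12-of-16 pass criterion are illustrative assumptions, not figures from any cited study):

```python
from math import comb

def false_negative_rate(trials, min_correct, p_true):
    """Type II error: probability of scoring fewer than `min_correct`
    successes even though the listener genuinely detects the
    difference with per-trial probability `p_true`."""
    return sum(comb(trials, k) * p_true**k * (1 - p_true)**(trials - k)
               for k in range(min_correct))

# A listener who truly hears the difference 70% of the time, taking a
# 16-trial test that requires 12 correct (the usual ~5% criterion):
fnr = false_negative_rate(16, 12, 0.7)
print(f"false negative rate = {fnr:.2f}")  # 0.55
```

Under these assumptions the test fails a genuine detector more than half the time, which illustrates exactly why an undeclared false negative rate matters when interpreting a null result.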
pkane2001 Posted June 29, 2020 Author Share Posted June 29, 2020 14 hours ago, botrytis said: Over on the BAS site (Boston Audio Society), they spend time talking about non-biased listening testing and have done quite a bit of it. Another group that does this is the SMWTMS (Southeast Michigan Woofer and Tweeter Marching Society, not making this up). Both show that sighted listening is inherently biased. Any pointers to those tests? It's not hard to demonstrate that sighted testing introduces many variables that have nothing to do with actual audio or hearing. Some of the findings already mentioned in this thread illustrate it quite nicely: The hearing aid test, where subjects who were told it contained new technology heard major improvements, while those who thought it was old technology did not (the same exact hearing aid was used). The subwoofer effect: while the subwoofer was visible to the test subjects, they heard an obvious low-end extension to the sound even though it wasn't hooked up; when the sub was hidden from view, the effect disappeared. Speaker comparisons that showed a very high preference for a high-end speaker when sighted, even among trained audio testers, which became a very different (and consistent among test takers!) selection when the identity of the speaker was hidden. -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment
Popular Post pkane2001 Posted June 29, 2020 Author Popular Post Share Posted June 29, 2020 11 hours ago, Ajax said: Hi Paul, Your intent in establishing this blog is honourable, and your patience while attempting to enlighten the naysayers such as Alex has been exemplary. Unfortunately, the majority of people (not all) put their emotions and beliefs first, and will often cherry-pick or alter the facts to match them. Very few take a rational approach to a subject that has feelings associated with it. Just look at the rebound in the sharemarket, which assumes that the pandemic will not affect the economy, despite the evidence that things are going to get a lot worse before they get better. There is a book by author Mark Manson, "Everything Is F*cked: A Book About Hope", that explores why people adopt unhealthy beliefs and behave as they do despite the facts. https://www.amazon.com.au/Unti-Manson-2/dp/0062888439 I would also commend those looking for a more rational approach to audio reproduction to the Archimago's Musings blog and the Audio Science Review website. I find they give a good balance to the more subjective views you find here. All the best, Ajax Hi Ajax, My view is not so negative. I see this thread as a way to keep notes on my own findings and to share them with others who might be interested. Even if those who are reading it are trying to prove me wrong, it's still a valuable exercise to hear others' opinions and to discuss ways to improve the testing process. Of course, I have my own opinions on this topic, which might be obvious, but this thread is not about my opinion. I like Archimago and his blog and, of course, participate on ASR, but I find ASR is often as much a knee-jerk reaction to subjective views as this site is a knee-jerk reaction to objective ones. I'm happy to exist somewhere in between, as I find dogma hard to accept.
From either side. Regards, Paul Ajax and Teresa 1 1 -Paul DeltaWave, DISTORT, Earful, PKHarmonic, new: Multitone Analyzer Link to comment