Does BIAS affect audio test results?


11 hours ago, Audiophile Neuroscience said:

Following on from what @Summit and many others have said (myself included), it's all about the methodology. So, for any methodology of any test or test procedure, a pivotal question is how good it is at telling you what you want to know. For good tests there are objective ways to determine this, one example being awareness of the false negative rate the test produces.

 

If a test has a false negative rate of 90%, that would render it useless. If, OTOH, the false negative rate is 5%, then you can say: well, the test is telling me something, but there is a 1 in 20 chance it is getting it wrong. In either case you know objectively where you stand.

 

In all the studies cited and, AFAIK, other blind listening studies, including Toole's, there are no definitive false negative rates declared (I genuinely stand to be corrected here). It seems to be assumed that the false negative rates are negligible... but how do you/they know that? Definitive statements are made such as "blind tests show... bla bla"... but what if the false negative rate is 90%?

 

Now, as pointed out, there are recommendations, official recommendations, made to mitigate these false negatives: take your time, avoid fatigue, drink more water, whatever. However, these are just (mostly) sensible guidelines that in and of themselves have not been validated with experimental evidence revealing the actual false negative rates achieved by adhering to them. Guidelines are guidelines; if you compare European vs American guidelines on various medical conditions you will find differences of opinion. Why? Because the experimental evidence is not conclusive and the jury remains out.

 

Bottom line: tell me the false negative rate of any test, otherwise I have no confidence the test is not giving me a wrong result.

 

This seems to be your main objection to blind testing, from what I gather. But if you read the studies and proposed testing methodologies, you'll see how hidden controls and pre-testing are used to determine false negatives. The testing recommendations include eliminating scores from those testers who show no ability to discriminate, or who discriminate only randomly. Testing on a larger sample averages out the variables related to an individual's performance on that day or at that time, or whatever other circumstances are specific to a person.
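To make the idea concrete, here's a toy sketch of that kind of pre-screening (my own illustration, not the exact procedure from any of the standards; the 16-trial count and the 0.05 cutoff are made up): each listener does a run of forced-choice trials against a hidden control, and anyone whose score is statistically indistinguishable from coin-flipping is dropped before the main test.

```python
from math import comb

def p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """One-sided binomial tail P(X >= k) for n trials at chance rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def passes_screening(correct: int, trials: int, alpha: float = 0.05) -> bool:
    """Retain a listener only if their hit rate beats chance at level alpha."""
    return p_at_least(correct, trials) < alpha

# 12/16 correct on hidden-control trials -> p ~ 0.038: retained.
# 9/16 correct -> p ~ 0.40: indistinguishable from guessing, excluded.
for hits in (12, 9):
    print(hits, round(p_at_least(hits, 16), 3), passes_screening(hits, 16))
```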

 

For example, the MUSHRA testing recommendation from ITU-R includes the following (there's more to each section; I cut out the statistical analysis that's part of the recommendation):

 

[images: excerpts from the ITU-R MUSHRA recommendation]

 

Toole's findings on errors caused by sighted evaluation compared to blind evaluation, even with trained listeners:

 

[image: excerpt from Toole's findings on sighted vs blind evaluation]

 

 

 

1 hour ago, pkane2001 said:

 

Any pointers to those tests?

 

It's not hard to demonstrate that sighted testing introduces many variables that have nothing to do with actual audio or hearing. Some of the findings already mentioned in this thread illustrate this quite nicely:

 

  • The hearing aid test, where subjects told the device used some new technology heard major improvements, while those who thought it was old technology did not (the same exact hearing aid was used in both cases)
  • The subwoofer effect: while the subwoofer was visible to the test subjects, they heard an obvious low-end extension to the sound even though it wasn't hooked up; when the sub was hidden from view, the effect disappeared
  • Speaker comparisons that showed a very high preference for a high-end speaker when sighted, even among trained audio testers, a preference that became very different (and consistent among test takers!) when the identity of the speakers was hidden

 

 

 

https://www.bostonaudiosociety.org/bas_speaker/abx_testing.htm

 

http://djcarlst.provide.net/smwtms.htm

 

 

https://www.nytimes.com/1986/03/16/arts/sound-expert-ears-take-a-test-to-resolve-the-great-cd-debate.html

 

BAS in particular had an article on amplifier testing, but I can't find it now.

 



3 hours ago, pkane2001 said:

 

This seems to be your main objection to blind testing, from what I gather. But if you read the studies and proposed testing methodologies, you'll see how hidden controls and pre-testing are used to determine false negatives. […]

 

 

 

 Hi Paul,

Any false result, whether a false positive or a false negative, is a concern, but false negatives are certainly relevant when looking at failures to identify a condition/situation. It is a complex area of statistics (for me), but you need to know sensitivity figures and specificity figures, calculate the true positives, true negatives, false positives and false negatives, and look at positive and negative predictive values, sometimes expressed in the form of 2x2 tables. I mentioned this in my first ever post on AS (CA).
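For illustration only, with made-up counts (my own sketch, not figures from any of the papers), the 2x2 table arithmetic looks like this:

```python
def two_by_two(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Derive the standard figures from a 2x2 table of test outcomes."""
    return {
        "sensitivity":         tp / (tp + fn),  # true positive rate
        "specificity":         tn / (tn + fp),  # true negative rate
        "false_negative_rate": fn / (tp + fn),  # 1 - sensitivity
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
        "ppv":                 tp / (tp + fp),  # positive predictive value
        "npv":                 tn / (tn + fn),  # negative predictive value
    }

# Hypothetical counts: of 100 trials where a real difference existed, the
# test caught 60 and missed 40 (false negative rate 40%); of 100 trials
# with no real difference, it wrongly flagged 10.
print(two_by_two(tp=60, fn=40, fp=10, tn=90))
```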

 

Neither of the examples cited actually specifies false negative rates in the form of actual, objective measurements.

 

I do have the whole ITU paper, and it is at least more rigorous in identifying good-quality listeners, but this is not the same as stating what the actual false negative rate is for the test procedure, for experienced or any other listeners.

 

There are obviously just assertions in the Toole description cited, not any actual figures for false negatives. I have this paper and many/most other Toole publications collected over the years, and nowhere have I seen published false negative rates. He just seems to assume they are negligible.

 

FWIW, the Toole publications tend to hold that experienced and inexperienced listeners are both good 'discriminators' for discerning difference, experienced listeners just being more efficient, needing fewer trials to reach a statistically significant result. I also believe this conclusion is flawed (an understatement), but (if true) it would tend to invalidate the ITU screening procedure to some extent.

 

What is required are actual, objective measured figures for the test at hand, not ways that someone has tried to mitigate false results; you need actual measurements of false results expressed as a percentage (the "anchor" test does not give you the false negative rate).



 

 


Hi David,

 

1 hour ago, Audiophile Neuroscience said:

I do have the whole ITU paper, and it is at least more rigorous in identifying good-quality listeners, but this is not the same as stating what the actual false negative rate is for the test procedure, for experienced or any other listeners.

 

Well, they do more than that. They identify a process for rejecting false negative results by including anchors in the test. Since the test is blind, there's no way for a test subject to apply a 'false negative' selection to just one of the DUTs while not doing so to the anchors, unless, of course, they do so consciously and deliberately. That deliberate-subversion case can be eliminated by having a large enough and varied enough set of test subjects.
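As a rough sketch of that kind of anchor/hidden-reference post-screening (the 90-point and 15% thresholds are how I remember the BS.1534-3 post-screening rule, so treat them as assumptions rather than quotations):

```python
def exclude_assessor(hidden_ref_scores: list[float],
                     threshold: float = 90.0,
                     max_fail_fraction: float = 0.15) -> bool:
    """Exclude a listener who rates the hidden reference (identical to the
    open reference) below `threshold` on more than `max_fail_fraction` of
    the test items: they are failing to recognize 'no difference'."""
    fails = sum(score < threshold for score in hidden_ref_scores)
    return fails / len(hidden_ref_scores) > max_fail_fraction

# A listener who scores the hidden reference 95, 100, 88, 92, 97, 60 across
# six items failed on 2/6 (~33%) of them and would be excluded.
print(exclude_assessor([95, 100, 88, 92, 97, 60]))  # True
```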

 

1 hour ago, Audiophile Neuroscience said:

There are obviously just assertions in the Toole description cited, not any actual figures for false negatives

 

The actual paper with the analysis and detailed statistics is behind a paywall, but I don't think they analyzed false negatives. Since the result of their blind test was a clear, consistent preference that was statistically significant, I don't think a false negative was a possible outcome there:

 

Toole, F.E. and S. E. Olive, “Hearing is Believing vs. Believing is Hearing: Blind vs. Sighted Listening Tests and Other Interesting Things”, 97th Convention, Audio Eng. Soc., Preprint 3894 (1994 Nov.)

 

1 hour ago, Audiophile Neuroscience said:

What is required are actual, objective measured figures for the test at hand, not ways that someone has tried to mitigate false results; you need actual measurements of false results expressed as a percentage (the "anchor" test does not give you the false negative rate).

 

Those would be nice. And I assume specific objections to the test procedure can themselves be tested for false negatives. So, what is your hypothesis about what might cause a false negative in a properly conducted blind test (say, one using the ITU-R recommendation)?

 

@Summit proposed one (long-term vs short-term evaluation). This can be turned into a proper blind test, although it's not going to be easy to conduct if it demands a week or a month to be spent with each DUT at a time. It could take years :)

 

18 hours ago, fas42 said:

 

Yes, decide what you want to measure ... IME many ambitious rigs are about developing various types of tonality seasoning, and then of course subjectivity is everything in the assessment of what one hears. But "sound quality" is not subjective, in my book - it is the degree to which there is no significant audible adulteration of what's on the recording by the playback chain - and the way one assesses it is by listening for faults in the sound, clearly evident misbehaviour of the replay.

 

To use the dreaded car analogy, 😜, most audiophiles compare by saying things like, I prefer the MB ambience over the BMW variety ... I say, I note that car 1 develops a slightly annoying vibration while accelerating, at a certain engine speed; whereas car 2 doesn't. Therefore, car 2 is the better quality car.

 

Sound quality is not subjective per se, but the listening evaluation normally is.

 

I thought tonality seasoning was your thing 9_9.  

5 hours ago, Summit said:

 

Sound quality is not subjective per se, but the listening evaluation normally is.

 

I thought tonality seasoning was your thing 9_9.  

 

Yes, it's all about how you listen ...

 

My thing? What I'm about is hearing what's on the recording; and audiophiles as a group are deeply locked into a belief system which insists that there is a hierarchy of "bad" to "good" recordings, where the worst of the former are completely unlistenable... and so they instinctively gravitate to rigs that have quality signatures which reinforce that thinking - the seasoning has a huge range of excuses for why it's done; an obvious example held in high esteem is "tube rolling" ... don't like how the recording sounds? OK, change some active devices until the distortion added balances the nature of the recording ... the consumer becomes part of the mastering chain.

 

I won't be happy until the latest rig produces the same subjective presentation that I got 35 years ago, when the system back then was working at peak level - I know that I'm close to excluding all playback-chain personality when the experiences match - which I would have thought was a reasonable take on the situation, 😉.



 

6 hours ago, pkane2001 said:

The actual paper with the analysis and detailed statistics is behind a paywall, but I don't think they analyzed false negatives. Since the result of their blind test was a clear, consistent preference that was statistically significant, I don't think a false negative was a possible outcome there:

 

Toole, F.E. and S. E. Olive, “Hearing is Believing vs. Believing is Hearing: Blind vs. Sighted Listening Tests and Other Interesting Things”, 97th Convention, Audio Eng. Soc., Preprint 3894 (1994 Nov.)

 

I found it to be available here: https://www.academia.edu/27512201/Hearing_is_Believing_vs._Believing_is_Hearing_Blind_vs._Sighted_Listening_Tests_and_Other_Interesting_Things?auto=download

 



 


Hi Paul. I would agree with you that the ITU recommendations are a step in the right direction in attempting to reduce errors. Arguably the best we have got so far.

 

ITU-R BS.1534 (cited in post #102), which uses the “Multi Stimulus test with Hidden Reference and Anchor (MUSHRA)”, is a "Method for the subjective assessment of intermediate quality level of audio systems". It is "not intended for assessment of small [audio] impairments". The paper does not mention "false negatives", and grading an anchor signal on an ordinal scale from 1 to 5 relative to its peers does not substitute for, or remove the need for, stating a known false negative rate.

 

 

ITU-R BS.1116-3 (2015) covers "Methods for the subjective assessment of small impairments in audio systems" and is therefore putatively more appropriate to audiophile listening tests. Kudos, because it does state some limitations, such as "perceptual objective assessment methods are being developed for testing the sound quality of sound systems" and "further studies of the characteristics of listening rooms and reproduction devices for the advanced sound system are needed".

 

In addition to those caveats, it is also worth mentioning that nowhere in that paper is the term "false negatives" mentioned or offered as a measure. In relation to "screening tests" they talk about uncovering "tendencies", but no quantitative measure is offered or suggested in that section. They sensibly advise "caution" given that "subjects have different sensitivities to different artefacts". Anchors are not used unless all systems are found transparent, and then only as a means of seeing how (allegedly) good the subjects are at discerning known differences. Again, that should not be confused with yielding a false negative rate for the test at hand. My oversimplified interpretation, FWIW, is: 'hey, these guys do not have golden ears, let's get a bunch who do'.



 

 

7 hours ago, pkane2001 said:

@Summit proposed one (long-term vs short-term evaluation). This can be turned into a proper blind test, although it's not going to be easy to conduct if it demands a week or a month to be spent with each DUT at a time. It could take years 

 

Definitely something along the lines that Summit and others have suggested, with listening conditions more representative of real-world experience.

 

I am also sure you are aware of Thomas Lund from Genelec and his AES papers on slow listening, including "On Human Perceptual Bandwidth and Slow Listening" (I have them if needed). I am not pushing it as a solution, just as something in that direction that needs to be explored further.

 

I absolutely agree that it could take years; it may also be costly and would probably require considerable research expertise.

 



 

 

4 hours ago, Audiophile Neuroscience said:

 

Definitely something along the lines that Summit and others have suggested, with listening conditions more representative of real-world experience. […]

 

 

Thomas Lund's AES paper has no experimental data, so no evidence, just conjectures. In the paper, slow listening is defined as "hours", not days or weeks or months, so it should be testable in much less than a year :)

 

Here's an AES presentation by James Johnston, an audio research scientist and author of many papers and presentations:

 

http://www.aes-media.org/sections/pnw/ppt/jj/highlevelnobg.ppt

 

A few slides that apply to this (thread) discussion:

[images: slides from Johnston's presentation on listening tests and perception]

36 minutes ago, pkane2001 said:

 

Thomas Lund's AES paper has no experimental data, so no evidence, just conjectures. […]

 

Lund's stuff basically suggests much longer listening time intervals, as others have also suggested... he talks about "dedicated listening for 8 hours per day" and "In case what is tested for is unfamiliar, slow listening could take as long as it would for the subject to learn a new language, maybe more." Importantly, I am not saying he is right any more than Toole or anyone else is right, just that it is a very different approach.

 

AES paper also attached

 

----

JJ Johnston first came under my radar when Wes Phillips interviewed him at AT&T quite some years ago regarding PSR (Perceptual Soundfield Reconstruction), which he was apparently working on. The article said: "Forget everything you've heard about multichannel. Compared to PSR, it's all a joke -- from Quad all the way up to five-channel SACD. Oh, Ambisonics works well enough, if you're willing to put up with a listening room that looks like a tornado hit a music store and a single-person-head-in-a-vice sweet spot. But PSR does all that with a minimal amount of equipment -- and does it better, to boot."

AES Time for Slow Listening.pdf


Sound Minds Mind Sound

 

 

6 hours ago, Audiophile Neuroscience said:

 

Lund's stuff basically suggests much longer listening time intervals, as others have also suggested. […]

 

Thanks for the paper. Perhaps Lund has done some research in this area, but unlike Toole's papers (and his book), Lund's paper contains no experimental data, no analysis, no evidence to substantiate his claims. I'm also not aware of anywhere Toole has studied (or even just recommended) the length of the evaluation, so I don't know if they'd disagree.

 

6 hours ago, Audiophile Neuroscience said:

JJ Johnston first came under my radar when Wes Phillips interviewed him at AT&T quite some years ago regarding PSR (Perceptual Soundfield Reconstruction), which he was apparently working on. […]

 

J_J is an interesting character who's been doing audio research for quite a long time. His findings often contradict the 'common-sense' beliefs of audiophiles. I've had a few discussions with him on ASR regarding the audibility of distortions, among other topics. Here's a collection of presentations and papers that I found very educational in my research:

 

http://www.aes.org/sections/pnw/jj.htm

 

Also, some interesting recordings of talks (including by JJ) worth watching and hearing:

http://www.aes-media.org/sections/pnw/pnwrecaps/index.htm

 

A good presentation on perception of audio:  

 

12 hours ago, pkane2001 said:

 

Thanks for the paper. Perhaps Lund has done some research in this area, but unlike Toole's papers (and his book), Lund's paper contains no experimental data, no analysis, no evidence to substantiate his claims. […]

 

Hi Paul,

Thanks for the links on JJ.

 

Re Toole vs Lund.

I have not seen experimental evidence provided by Lund, but I may have missed it. My point was not that he is right or wrong, but that he provides certain arguably factual information and, as you call it, conjectures, that may need to be accounted for in any hypothesis you or I or anyone might form on our topic of interest. It may be that we decide to dismiss his (non-experimental) evidence and/or conjectures, and provide good reasons to do so. Good hypotheses are well reasoned and do not bias the inclusion or exclusion of information, but seek all potentially relevant information. As Lund has two AES papers related to the topic, I simply offered them as potentially relevant.

 

Toole has many publications, and yes, there is experimental evidence presented. The question is therefore: do we accept this as wholly correct, shut up shop, and be done with it? A great many people would, I believe, say YES: the guy has unequivocally "proven" this or that. Trouble is, a great many others say NO, I'm not buying that.

 

I won't go back into the same loop here, but essentially Toole, and many others, predicate their experimental evidence on blind listening tests as used in certain methodologies. I have explained the issues and controversies around this. Suffice it to say that every conclusion rests on the validity of these tests and the outcome measures they feed. Logically, if you are 100% on board with those outcomes, it is game over... and, also logically, the great debate will continue unresolved.

 

I hope that makes some sense.

 

 



 

 

11 hours ago, Audiophile Neuroscience said:

Thanks for the links on JJ. Re Toole vs Lund: I have not seen experimental evidence provided by Lund, but I may have missed it. […]

 

Hi David,

 

Probably not the place to keep arguing about blind testing, as this thread is really about biases. I just want to make sure it's clear that the argument you're making here and on other threads is about a methodology of blind testing, not blind testing itself. I'm a big advocate for removing known biases in testing, especially those demonstrably leading to wrong results. Blind testing helps do this, according to pretty much all the studies and literature cited in this thread.

 

Even if the specific test methodology often used in randomized blind tests is in some other way flawed, as Lund might be suggesting (with no evidence), it is still a useful tool for controlling a number of well-known and well-measured effects that confound sighted testing. And if there really are flaws in the process, I'm the first one to say: let's demonstrate that they are real, and then let's figure out a way to fix them.

 

That said, I use blind tests in most of my equipment evaluations. I also use them to determine my own audibility thresholds for various impairments. And I find that I can pass a blind test with shorter comparisons much more easily than with much longer ones (yes, I've attempted long-running blind tests with hour-or-longer evaluations). My hypothesis for why that is: the minor differences between DUTs are mostly overwhelmed by the much larger differences between my state of mind, physical condition, the tracks I use for evaluation, the sequence in which I listen, what's in the news, the weather, etc., etc. The minor audible differences get mostly averaged out over a long period of time, and what remains is the average perception of all the other variables that have nothing to do with the audibility of, or a preference for, a certain DUT. But that's just me and my hypothesis.
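A toy Monte Carlo version of that hypothesis (my own made-up numbers, chosen only to illustrate the mechanism, not measurements of anything): give two DUTs a small true difference, add large per-session "mood/track/fatigue" noise, and compare how often the better DUT wins when both are heard within one session (shared session noise cancels) versus weeks apart (independent session noise).

```python
import random

def correct_pick_rate(true_diff=0.5, session_sd=2.0, residual_sd=0.2,
                      same_session=True, trials=20_000) -> float:
    """Fraction of comparisons in which the truly better DUT is preferred."""
    correct = 0
    for _ in range(trials):
        if same_session:
            noise_a = noise_b = random.gauss(0, session_sd)  # shared noise
        else:
            noise_a = random.gauss(0, session_sd)            # independent
            noise_b = random.gauss(0, session_sd)
        a = true_diff + noise_a + random.gauss(0, residual_sd)  # better DUT
        b = noise_b + random.gauss(0, residual_sd)
        correct += a > b
    return correct / trials

print("back-to-back:", correct_pick_rate(same_session=True))   # ~0.96
print("weeks apart: ", correct_pick_rate(same_session=False))  # ~0.57
```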

 


This article was just shared on that other forum that we shouldn't mention or link to 😎 I found it interesting and thought I'd share as a follow-up to our latest discussion here:

 

https://linearaudio.net/sites/linearaudio.net/files/LA Vol 2 Yaniger(1).pdf

 

This is not a study, but an article by Stuart Yaniger covering some pitfalls and biases related to testing and judging audio equipment, as well as the different types of testing and the related equipment.

 

Stuart also goes into a test he did on himself to determine whether longer-term evaluation was revealing something a short-term test couldn't. The interesting part to me was how he went about performing this test. It should be repeatable by others, possibly with some minor adjustments:

 

[image: excerpt from Yaniger's article describing his long-term listening self-test]
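For anyone wanting to try something similar, here's a hypothetical outline (my sketch, not Yaniger's actual protocol, which is in the excerpt above): have a helper randomize which of two DUTs is in the chain each week, log a weekly guess, and check the guesses against chance at the end.

```python
import random
from math import comb

def run_schedule(weeks: int) -> list[str]:
    """Helper (not the listener) secretly assigns DUT 'A' or 'B' per week."""
    return [random.choice("AB") for _ in range(weeks)]

def p_value(assignments: list[str], guesses: list[str]) -> float:
    """One-sided P(X >= hits) under pure guessing (p = 0.5)."""
    hits = sum(a == g for a, g in zip(assignments, guesses))
    n = len(assignments)
    return sum(comb(n, i) for i in range(hits, n + 1)) / 2**n

# Example: 10 of 12 weekly guesses correct -> p ~ 0.019, better than chance.
sched = run_schedule(12)
guesses = sched[:10] + ["A" if s == "B" else "B" for s in sched[10:]]
print(p_value(sched, guesses))
```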

