Using DSD is not really related to this; it is more related to other aspects. The only notable time domain aspect of DSD is when the source content is also DSD, because then you don't really have much frequency domain band limiting, which means that you avoid most of the time domain effects of that band limiting.
A longer impulse response is the result of more taps, which gives you a sharper cut-off in the frequency domain. So yes, the time domain response suffers from the length of "ringing" or "settling time" (which MQA calls "blur") perspective. It is the same with analog filters too; there you just talk about the order or Q of the filter instead of length/taps.
However, you cannot keep improving that aspect of time domain performance by making the filter shorter (fewer taps) forever, because at some point it begins to roll off the frequency response early, which in turn slows down transients and thus makes the time domain response worse again (MQA being one prime example of this). Overdoing this "settling time" aspect also leads to more images/aliasing, meaning less accurate reconstruction.
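To put rough numbers on the taps versus cut-off trade-off, here is a minimal sketch. It assumes a plain Hann-windowed sinc low-pass, which is only a textbook stand-in and not any of the filters discussed here; the function names, the tap counts and the -1 dB / -40 dB edge definitions are all my own choices for illustration.

```python
import math

def windowed_sinc(num_taps, cutoff):
    """Hann-windowed sinc low-pass. cutoff is in cycles/sample (0..0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)  # normalise to unity gain at DC
    return [t / gain for t in taps]

def magnitude(taps, freq):
    """Magnitude response at freq (cycles/sample), evaluated directly."""
    re = sum(t * math.cos(2 * math.pi * freq * n) for n, t in enumerate(taps))
    im = sum(t * math.sin(2 * math.pi * freq * n) for n, t in enumerate(taps))
    return math.hypot(re, im)

def transition_width(taps, steps=1000):
    """Distance between the first -1 dB and first -40 dB points, in cycles/sample."""
    freqs = [0.5 * k / steps for k in range(steps + 1)]
    mags = [magnitude(taps, f) for f in freqs]
    pass_edge = next(f for f, m in zip(freqs, mags) if m < 10 ** (-1 / 20))
    stop_edge = next(f for f, m in zip(freqs, mags) if m < 10 ** (-40 / 20))
    return stop_edge - pass_edge

# The impulse response length equals the tap count, so each row trades
# a longer settling time for a narrower transition band.
for num_taps in (31, 63, 127):
    width = transition_width(windowed_sinc(num_taps, 0.25))
    print(f"{num_taps:4d} taps -> transition width {width:.4f} cycles/sample")
```

Going the other way, toward very few taps, the transition band widens until the roll-off starts well inside the audio band and the stopband no longer rejects images properly.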
If we consider time domain performance as a whole and talk about "transient response" or "step response", we note that it has three aspects:
1. Settling time (amount of ringing, related to Gibbs phenomenon)
2. Steepness/slope of the rise
3. Location/timing and shape of transient
If you try to improve (1) too much, then (2,3) suffer, or vice versa. The bound of (2) is ultimately defined by the sampling rate, that is, the fs/2 (Nyquist) frequency. We need to remember from the Gibbs/Fourier series that the steepness of the step is defined by the number of harmonics included in the series and the accuracy of their relative levels. The lowest sampling rate involved defines how many harmonics can be included, but the filter responses (both digital and analog) define the correctness of the relative levels and possible limitations to the steepness/timing.
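The Fourier series point can be checked numerically. This is a small sketch using the textbook series of a square wave with ideal harmonic levels; the function names and the 10% to 90% rise time definition are my own choices, and no real filter is modelled here.

```python
import math

def square_partial_sum(t, num_harmonics):
    """Fourier series of a +/-1 square wave (period 1), truncated to the
    first `num_harmonics` odd harmonics."""
    total = sum(math.sin(2 * math.pi * (2 * k + 1) * t) / (2 * k + 1)
                for k in range(num_harmonics))
    return 4 / math.pi * total

def edge_metrics(num_harmonics, points=8000):
    """Return (rise_time, peak) for the rising edge at t = 0.
    rise_time is the time from -0.8 to +0.8, i.e. 10% to 90% of the step."""
    ts = [-0.25 + 0.5 * i / points for i in range(points + 1)]
    ys = [square_partial_sum(t, num_harmonics) for t in ts]
    t10 = next(t for t, y in zip(ts, ys) if y >= -0.8)
    t90 = next(t for t, y in zip(ts, ys) if y >= 0.8)
    return t90 - t10, max(ys)

for n in (5, 15, 50):
    rise, peak = edge_metrics(n)
    print(f"{n:3d} harmonics: rise time {rise:.4f} periods, peak {peak:.3f}")
```

More harmonics make the edge steeper, while the overshoot does not go away; it only gets squeezed closer to the edge. That is the Gibbs phenomenon behind aspect (1).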
When MQA talks about time domain performance they talk about (1), while when Chord talks about time domain performance they talk about (2,3). And then they take things to the extremes.
Also, if you overdo (1), the reconstruction accuracy suffers due to severe images/aliasing and you instead just get a lot of distortion (MQA being an example). If you overdo (2,3), you have transients that never settle.
"xtr" and "ext2" are still quite far from the "mega" filters. I've been putting a lot of effort in making filters that optimize all aspects, from all three time domain perspectives as well as frequency domain perspective, by using some nice math tricks. Then you can choose different weightings on these aspects with filter selection. But also all extreme approaches are supported (well, for DSD upsampling some "filter" choices like polynomial are not available at the moment).
I also want to strongly emphasize that, in any case, for recorded content (as opposed to generated test signals), you always have the effect of the ADC filters included in the source data. Or, if it was recorded in hires and converted to RedBook for distribution in software, you have that software converter's digital filter fingerprint. Using a non-apodizing (like closed-form) or all-pass digital "filter" just means that the source's digital filter fingerprint is passed through. So even if your playback filter wouldn't "ring", the source data already contains "ringing" from the production phase. To deal with this aspect of source data, apodizing filters were created. In recorded material, you won't find test-signal-like pulses or transients, because they have already passed through some filters.
Also note that the filter "rings" only when it is limiting the spectrum of the signal. If you have a hires recording where the sampling rate is high enough that no signal harmonics reach the beginning of the filter's transition band (start of roll-off), the filter doesn't "ring". For example with MQA, its filters begin to roll off already around 30 kHz for 96 kHz sampling rate (48 kHz Nyquist), so they are much more likely to begin "ringing" than a steeper filter with cut-off starting somewhere around 47 kHz, because content is much more likely to have a stronger harmonic at 30 kHz than at 47 kHz. In addition, it also means that the slope/steepness of the rise is severely affected because of the lower corner frequency (fewer harmonics included and their relative levels affected more).
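Here is a rough sketch of the "rings only when it limits the spectrum" point. The assumptions are mine: the filter is an abruptly truncated sinc (about the most ringing-prone low-pass there is) with its cut-off at a quarter of the sampling rate, the hard step stands in for content with energy far past the cut-off, and the Gaussian-smoothed step stands in for content whose spectrum ends well below the transition band. All names and constants are hypothetical.

```python
import math

def truncated_sinc(half_len, cutoff):
    """Low-pass taps: ideal sinc cut off abruptly at +/- half_len samples.
    cutoff is in cycles/sample (0..0.5). Normalised to unity gain at DC."""
    taps = [2 * cutoff if k == 0
            else math.sin(2 * math.pi * cutoff * k) / (math.pi * k)
            for k in range(-half_len, half_len + 1)]
    gain = sum(taps)
    return [t / gain for t in taps]

def fir_filter(signal, taps):
    """Convolution restricted to outputs where the taps fully overlap."""
    m = len(taps)
    return [sum(taps[k] * signal[n + m - 1 - k] for k in range(m))
            for n in range(len(signal) - m + 1)]

def overshoot(signal, taps):
    """How far the filtered 0 -> 1 transition exceeds its final value of 1."""
    return max(fir_filter(signal, taps)) - 1.0

LENGTH, EDGE, SIGMA = 1000, 500, 8.0  # samples; SIGMA sets the smooth edge width
hard_step = [1.0 if n >= EDGE else 0.0 for n in range(LENGTH)]
smooth_step = [0.5 * (1 + math.erf((n - EDGE) / (SIGMA * math.sqrt(2))))
               for n in range(LENGTH)]

taps = truncated_sinc(200, 0.25)
step_os = overshoot(hard_step, taps)
smooth_os = overshoot(smooth_step, taps)
print(f"hard step overshoot:   {step_os:.4f}")
print(f"smooth step overshoot: {smooth_os:.4f}")
```

Same filter, same length, yet only the input that actually reaches the cut-off produces visible overshoot. The small residue left on the smooth input comes from truncating the sinc, not from band limiting the signal.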