
Building a high performance compute server for HQPlayer



8 hours ago, The Computer Audiophile said:

This is interesting. I'm trying ASDM7EC and my old school Xeon E3-1241 v3 CPU doesn't get above 40-50%, yet the audio constantly stutters. 


Yeah, that’s what I am seeing above: the stuttering throws off the measurement. The CPU usage spikes to 100%, then a stutter starts and usage drops during the stutter “recovery”. What you are seeing is probably an average rather than instantaneous CPU usage.
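A minimal sketch of why an averaged reading can look harmless while playback stutters. The per-interval samples are invented for illustration (not measured in this thread): the core is pegged at 100% until the buffer runs dry, then sits mostly idle while playback recovers.

```python
# Hypothetical utilisation samples (percent) for one core over ten equal
# intervals. Illustrative values only, not data from this thread.
samples = [100, 100, 100, 100, 15, 10, 10, 10, 15, 40]

def window_average(values):
    """Mean utilisation over the whole window, which is what most monitors show."""
    return sum(values) / len(values)

def peak(values):
    """Highest single-interval utilisation in the window."""
    return max(values)

if __name__ == "__main__":
    print(f"average: {window_average(samples):.0f}%")  # 50%
    print(f"peak:    {peak(samples)}%")                # 100%
```

With these made-up numbers a monitor would report 50% even though the core was saturated during exactly the intervals that mattered.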

 

That modulator needs a lot of horsepower.

 

I can’t get my machine to boost its clock rate above 4 GHz, and I suspect the AVX512 instructions are holding it back. That might mean ASDM7EC is limited to DSD256 on the current generation of machines, unless perhaps you fool around with overclocking or even disable AVX512 (counterproductive?). I have one more optimization trick to try before giving up on getting DSD512 working...

Custom room treatments for headphone users.

10 hours ago, asdf1000 said:

 

With EC modulator? Forget about it at DSD512 (for real-time).

 

With current CPU clock rates and processing capabilities this seems to be true.

 

At DSD256 I am hitting 90-95% at 3.9 GHz (again, the AMSDM7 modulator "allows" the CPU to run at 4.7 GHz, more evidence of AVX512's clock-limiting behavior).

 

Increasing the number of RAM channels didn't make a huge difference. I don't think optimizing the network (i.e. RDMA + NIC offload) will make a huge difference either. So here we have a Xeon W-2245 + RTX 2080 Ti workstation which I suspect will be able to do 6 channels of DSD256 with EC, enough to handle room correction plus a digital crossover.

 

But look at what we can do with this approach! Moreover, you can buy this prepackaged from Dell. The workstation is surprisingly quiet; I installed my own GPU, RAM and NIC, and the base price of ~$1600 is very reasonable for what you get.

Custom room treatments for headphone users.

32 minutes ago, The Computer Audiophile said:

@Miska Have you been able to get the EC modulator working at DSD512 without hiccups?

 

No, I'm not aware of any CPU that would be able to do it in realtime. Only offline conversion with HQPlayer 4 Pro.

 

I hope CPU development will eventually make it possible, in two years or so. Intel's 10th gen mobile CPUs can already boost to 5.3 GHz, so we'll see what the desktop CPUs will look like.

 

Even though the i9-9900KS can run at an all-core boost of 5 GHz, it is still not enough. You get the picture if you look at the highest per-core loads at DSD256 and multiply them by two to go to DSD512.
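That rule of thumb can be written as a quick check. This sketch assumes per-core work scales linearly with the output rate (so DSD256 to DSD512 doubles it); the function names are illustrative and the DSD256 load figures are placeholders, not numbers reported in the thread.

```python
def projected_load(load_dsd256_percent, rate_factor=2.0):
    """Scale the busiest core's DSD256 load to a higher DSD rate.

    rate_factor=2.0 models DSD256 -> DSD512 (twice the output sample rate),
    assuming per-core work grows roughly linearly with the output rate.
    """
    return load_dsd256_percent * rate_factor

def fits_realtime(load_percent):
    """A core needing 100% or more of its time cannot keep up in real time."""
    return load_percent < 100.0

if __name__ == "__main__":
    for busiest in (40.0, 55.0, 90.0):  # hypothetical DSD256 per-core loads
        est = projected_load(busiest)
        print(f"DSD256 {busiest:.0f}% -> DSD512 ~{est:.0f}%, realtime: {fits_realtime(est)}")
```

Note that the busiest core, not the machine-wide average, is what decides the outcome.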

 

Signalyst - Developer of HQPlayer

Pulse & Fidelity - Software Defined Amplifiers

35 minutes ago, jabbr said:

But look at what we can do with this approach! Moreover you can buy this prepackaged from Dell ... the workstation is surprisingly quiet, I installed my own GPU, RAM and NIC the base price ~$1600 is very reasonable for what you get.

 

Workstations designed for high CPU loads are pretty quiet. My old HP Z440 Xeon workstation is also very quiet, and the noise doesn't change with load; it is just as quiet even at constant full load.

 

So I would again like to go with a new HP Z4 or Z6 series machine. If that's not possible, I'd build a new workstation myself. For me, for example, the three-year next-business-day on-site warranty was an important factor in the choice.

 

Signalyst - Developer of HQPlayer

Pulse & Fidelity - Software Defined Amplifiers

11 hours ago, The Computer Audiophile said:

This is interesting. I'm trying ASDM7EC and my old school Xeon E3-1241 v3 CPU doesn't get above 40-50%, yet the audio constantly stutters. 

 

My Xeon E5-1620 v3 cannot do ASDM7EC even at DSD256; DSD128 is the maximum. That's why I need a new one with a W-2245.

 

Signalyst - Developer of HQPlayer

Pulse & Fidelity - Software Defined Amplifiers


My i9-9900K has been doing very nicely with ASDM7-EC at DSD256. I don't have an additional GPU...

 

We may be some years away from doing DSD512 with this modulator.

 

The big Noctua fan is very quiet (and the machine is in a room on the opposite side of the house from the listening room).

 

There's already a big thread where people shared builds:

 

https://audiophilestyle.com/forums/topic/56966-hqplayer4-ec-modulator-tips-and-techniques/

 


closed-form-16m, ASDM7EC, DSD256, from either a 16/44.1 or a DSD64 source => 98%

Processor:  0  Mhz:  3905.090
Processor:  1  Mhz:  1200.102
Processor:  2  Mhz:  1200.071
Processor:  3  Mhz:  1951.016
Processor:  4  Mhz:  3905.621
Processor:  5  Mhz:  1200.017
Processor:  6  Mhz:  2586.072
Processor:  7  Mhz:  1200.296
Processor:  8  Mhz:  4439.282
Processor:  9  Mhz:  1200.089
Processor:  10  Mhz:  1200.032
Processor:  11  Mhz:  2353.270
Processor:  12  Mhz:  3897.773
Processor:  13  Mhz:  1201.184
Processor:  14  Mhz:  3163.338
Processor:  15  Mhz:  1201.437
CPU model:  Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz
1 CPU,  8 physical cores per CPU, total 16 logical CPU units
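The contents of cpuspeeds.sh are not shown in the thread. A rough Python equivalent might look like the sketch below, assuming Linux, where /proc/cpuinfo exposes per-logical-CPU `processor`, `model name`, `cpu MHz`, `physical id` and `core id` fields on x86. `parse_cpuinfo` is a hypothetical helper, and the output only mimics the listing above; it is not the original script.

```python
import os

def parse_cpuinfo(text):
    """Parse Linux /proc/cpuinfo text into per-CPU clocks and topology counts."""
    speeds = []         # (logical CPU id, current MHz)
    model = None        # first "model name" seen
    sockets = set()     # distinct "physical id" values
    cores = set()       # distinct (physical id, core id) pairs
    cpu_id = phys_id = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line separating CPU entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpu_id = int(value)
        elif key == "model name" and model is None:
            model = value
        elif key == "cpu MHz":
            speeds.append((cpu_id, float(value)))
        elif key == "physical id":
            phys_id = value
            sockets.add(value)
        elif key == "core id":
            cores.add((phys_id, value))
    return speeds, model, len(sockets), len(cores)

def main(path="/proc/cpuinfo"):
    if not os.path.exists(path):
        print(f"{path} not found (this sketch only works on Linux)")
        return
    with open(path) as f:
        speeds, model, n_sockets, n_cores = parse_cpuinfo(f.read())
    for cpu, mhz in speeds:
        print(f"Processor:  {cpu}  Mhz:  {mhz:.3f}")
    print(f"CPU model:  {model}")
    per_socket = n_cores // max(n_sockets, 1)
    print(f"{n_sockets} CPU,  {per_socket} physical cores per CPU, "
          f"total {len(speeds)} logical CPU units")

if __name__ == "__main__":
    main()
```

The `cpu MHz` values are a snapshot, so running it a few times while HQPlayer is playing gives a better picture than a single reading.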

 

Custom room treatments for headphone users.

13 hours ago, asdf1000 said:

There's already a big thread where people shared builds:

 

https://audiophilestyle.com/forums/topic/56966-hqplayer4-ec-modulator-tips-and-techniques/

 

 

Yes, I see. There are several threads in both the Software and Network subfora that delve into hardware... I posted this in the Music Server subforum for that reason. I also want to get into topics other than EC in time, but obviously there are overlaps.

Custom room treatments for headphone users.

2 hours ago, Solstice380 said:

 

These are the results for my new system that I posted in the EC Modulators thread.

 

Nice! Yeah, the KS wasn't easily available for me, and if I weren't using my crazy NIC it would've been a great option (I needed 2 x PCIe x16 slots, one for the GPU and one for the NIC). But for normal purposes the KS is sweet ;) 

Custom room treatments for headphone users.


Yesterday I tested on Windows, using the free-running mode, how long it takes an i9-9900KS to process a track with poly-sinc-ext2 and ASDM7EC to DSD512. The result for a 3:32 track was 6:23, so almost 2x the track's playback time, i.e. a processing speed of 0.55x.
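The arithmetic behind "0.55x" and "almost 2x", using only the two durations from the post above. The helper names are illustrative; a processing speed of 1.0x or more is what real-time playback would need.

```python
def to_seconds(mmss):
    """Convert an 'M:SS' duration string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def realtime_factor(track_len, processing_time):
    """Processing speed relative to playback: >= 1.0 means real time is possible."""
    return to_seconds(track_len) / to_seconds(processing_time)

if __name__ == "__main__":
    speed = realtime_factor("3:32", "6:23")   # 212 s of audio in 383 s
    print(f"processing speed: {speed:.2f}x")  # 0.55x
    print(f"speed-up needed for real time: {1 / speed:.2f}x")  # 1.81x
```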

 

Signalyst - Developer of HQPlayer

Pulse & Fidelity - Software Defined Amplifiers

3 hours ago, Miska said:

Windows using the free running mode

What does that mean?

 

3 hours ago, Miska said:

Result for 3:32 long track was 6:23.

So you have to wait six minutes for a three-minute song? Is the song loaded into a buffer, or how does it work?

 

3 hours ago, Miska said:

processing speed of 0.55x.

What is the CPU usage in %, how many cores (Hyperthreading?) and what clock rate?


Dell Inc. Precision 5820 Tower

Single-Core Score 1234
Multi-Core Score 8648
Geekbench 5.1.0 Tryout for Linux x86 (64-bit)

Result Information

User jabbr
Upload Date April 15 2020 02:25 PM

System Information
Operating System Ubuntu 18.04.4 LTS 5.3.0-46-generic x86_64
Model Dell Inc. Precision 5820 Tower
Motherboard Dell Inc. 06JWJY
Processor Information
Name Intel Xeon W-2245
Topology 1 Processor, 8 Cores, 16 Threads
Identifier GenuineIntel Family 6 Model 85 Stepping 7
Base Frequency 4.70 GHz
L1 Instruction Cache 32.0 KB x 8
L1 Data Cache 32.0 KB x 8
L2 Cache 1.00 MB x 8
L3 Cache 16.5 MB x 1

Single-Core Performance

Single-Core Score 1234
Crypto Score 1655
Integer Score 1186
Floating Point Score 1268
AES-XTS 1655 (2.82 GB/sec)
Text Compression 1110 (5.61 MB/sec)
Image Compression 1242 (58.7 Mpixels/sec)
Navigation 1017 (2.87 MTE/sec)
HTML5 1258 (1.48 MElements/sec)
SQLite 1229 (385.0 Krows/sec)
PDF Rendering 1183 (64.2 Mpixels/sec)
Text Rendering 1173 (373.8 KB/sec)
Clang 1287 (10.0 Klines/sec)
Camera 1199 (13.9 images/sec)
N-Body Physics 1120 (1.40 Mpairs/sec)
Rigid Body Physics 1333 (8259.3 FPS)
Gaussian Blur 789 (43.4 Mpixels/sec)
Face Detection 1128 (8.69 images/sec)
Horizon Detection 1090 (26.9 Mpixels/sec)
Image Inpainting 1979 (97.1 Mpixels/sec)
HDR 2414 (32.9 Mpixels/sec)
Ray Tracing 1531 (1.23 Mpixels/sec)
Structure from Motion 1162 (10.4 Kpixels/sec)
Speech Recognition 1058 (33.8 Words/sec)
Machine Learning 1047 (40.5 images/sec)

Multi-Core Performance

Multi-Core Score 8648
Crypto Score 6766
Integer Score 8396
Floating Point Score 9509
AES-XTS 6766 (11.5 GB/sec)
Text Compression 9803 (49.6 MB/sec)
Image Compression 10519 (497.6 Mpixels/sec)
Navigation 4366 (12.3 MTE/sec)
HTML5 8734 (10.3 MElements/sec)
SQLite 9618 (3.01 Mrows/sec)
PDF Rendering 8649 (469.4 Mpixels/sec)
Text Rendering 6824 (2.12 MB/sec)
Clang 10724 (83.6 Klines/sec)
Camera 8659 (100.4 images/sec)
N-Body Physics 8491 (10.6 Mpairs/sec)
Rigid Body Physics 13668 (84679.6 FPS)
Gaussian Blur 8072 (443.7 Mpixels/sec)
Face Detection 9988 (76.9 images/sec)
Horizon Detection 9103 (224.4 Mpixels/sec)
Image Inpainting 13236 (649.3 Mpixels/sec)
HDR 18669 (254.4 Mpixels/sec)
Ray Tracing 14997 (12.0 Mpixels/sec)
Structure from Motion 9591 (85.9 Kpixels/sec)
Speech Recognition 5820 (186.1 Words/sec)
Machine Learning 3263 (126.1 images/sec)

 

Custom room treatments for headphone users.


Dell Inc. Precision 5820 Tower

CUDA Score 170911
Geekbench 5.1.0 Tryout for Linux x86 (64-bit)

Result Information

User jabbr
Upload Date April 15 2020 02:39 PM

System Information
Operating System Ubuntu 18.04.4 LTS 5.3.0-46-generic x86_64
Model Dell Inc. Precision 5820 Tower
Motherboard Dell Inc. 06JWJY
Processor Information
Name Intel Xeon W-2245
Topology 1 Processor, 8 Cores, 16 Threads
Identifier GenuineIntel Family 6 Model 85 Stepping 7
Base Frequency 4.70 GHz
L1 Instruction Cache 32.0 KB x 8
L1 Data Cache 32.0 KB x 8
L2 Cache 1.00 MB x 8
L3 Cache 16.5 MB x 1
CUDA Information  
Device Name GeForce RTX 2080 Ti
Compute Capability 7.5
Maximum Frequency 1.64 GHz
Multiprocessor Count 68
Maximum Threads Per Multiprocessor 1024
Device Memory 10.7 GB 7.00 GHz

CUDA Performance

CUDA Score 170911
Sobel 197778 (51.2 Gpixels/sec)
Canny 99288 (6.21 Gpixels/sec)
Stereo Matching 560206 (792.3 Gpixels/sec)
Histogram Equalization 125229 (22.1 Gpixels/sec)
Gaussian Blur 179393 (9.86 Gpixels/sec)
Depth of Field 623547 (7.23 Gpixels/sec)
Face Detection 36686 (282.5 images/sec)
Horizon Detection 157285 (3.88 Gpixels/sec)
Feature Matching 58696 (1.21 Gpixels/sec)
Particle Physics 685846 (18268.1 FPS)
SFFT 101541 (1.40 Tflops)

Custom room treatments for headphone users.

44 minutes ago, StreamFidelity said:

What does that mean?

 

That the processing runs at full speed and all the output is just thrown away. It is a sort of inherited feature from HQPlayer 4 Pro, where it lets the conversion run to an output file at full speed instead of waiting for the audible output.

 

44 minutes ago, StreamFidelity said:

So you have to wait six minutes for a three-minute song? If the song is loaded into the buffer or how does it work?

 

Yes... I'm not sure I understand the buffer part of the question.

 

44 minutes ago, StreamFidelity said:

What is the CPU usage in %, how many cores (Hyperthreading?) and what clock rate?

 

About 20%; mostly two cores are fully loaded. The CPU runs at a constant 5 GHz.
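Why a system-wide figure like 20% can coexist with a CPU-bound workload: the overall percentage is an average over all logical CPUs, so two saturated threads on a 16-thread CPU such as the i9-9900KS (8 cores, 16 threads) contribute only 12.5% on their own. A minimal sketch; the background loads are made-up values chosen for illustration, and the function names are hypothetical.

```python
def overall_percent(per_cpu):
    """System-wide utilisation as most monitors report it: the mean over logical CPUs."""
    return sum(per_cpu) / len(per_cpu)

def saturated(per_cpu, threshold=95.0):
    """Indices of logical CPUs at or above the threshold (the likely bottleneck)."""
    return [i for i, load in enumerate(per_cpu) if load >= threshold]

if __name__ == "__main__":
    # Hypothetical 16-thread snapshot: two threads pegged, the rest lightly loaded.
    loads = [100.0, 100.0] + [9.0] * 14
    print(f"overall: {overall_percent(loads):.1f}%")
    print(f"saturated CPUs: {saturated(loads)}")
```

The same effect explains how a CPU can sit at 40-50% overall while audio still stutters.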

 

So to run this in realtime, you'd need something that is about 2x faster per core than the i9-9900KS.

 

Signalyst - Developer of HQPlayer

Pulse & Fidelity - Software Defined Amplifiers


To give an idea of how the processing needs compare:

 

When running Roon upsampling to DSD512 from 24/96 material:

top - 12:18:30 up  2:19,  1 user,  load average: 2.42, 2.31, 2.33
Tasks: 440 total,   2 running, 271 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.0 us,  0.8 sy,  0.0 ni, 88.2 id,  4.7 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 32563588 total,   295900 free,  3584900 used, 28682788 buff/cache
KiB Swap:  2097148 total,  2093564 free,     3584 used. 28452876 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                     
11225 root      20   0 7548264 1.776g 150132 S 107.0  5.7  46:53.18 RoonAppliance                                                               
 1672 root      20   0       0      0      0 R   3.6  0.0   2:24.89 cifsd                                                                       
13248 jon       20   0   51476   4168   3324 R   1.0  0.0   0:00.17 top                                                                       
   47 root      rt   0       0      0      0 S   0.3  0.0   0:00.66 migration/6                                                                 
  223 root      20   0       0      0      0 S   0.3  0.0   0:09.82 kswapd0    
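In `top`'s default Irix mode a process's %CPU is relative to a single logical CPU, which is why RoonAppliance can show 107.0%. Dividing by the number of logical CPUs (16 on the W-2245) gives its share of the whole machine, roughly what the `%Cpu(s)` summary line reports. A small sketch with hypothetical helper names:

```python
def machine_share(process_pct, logical_cpus):
    """Convert top's Irix-mode per-process %CPU into a share of the whole machine."""
    return process_pct / logical_cpus

def cpus_busy(process_pct):
    """Irix-mode %CPU / 100 = average number of logical CPUs the process keeps busy."""
    return process_pct / 100.0

if __name__ == "__main__":
    pct = 107.0  # RoonAppliance, 24/96 -> DSD512, from the top output above
    print(f"~{cpus_busy(pct):.2f} logical CPUs busy, "
          f"{machine_share(pct, 16):.1f}% of all 16 logical CPUs")
```

Pressing `I` in top toggles to Solaris mode, which performs this division for you.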

 

Custom room treatments for headphone users.


With 16/44.1 -> DSD512:


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                    
11225 root      20   0 7944056 2.320g 239944 S  47.5  7.5  75:05.66 RoonAppliance                                 
 1672 root      20   0       0      0      0 S   7.6  0.0   4:08.97 cifsd 

and two cores at 4.5 GHz:

$ ./cpuspeeds.sh
Processor:  0  Mhz:  1200.020
Processor:  1  Mhz:  1200.009
Processor:  2  Mhz:  2206.086
Processor:  3  Mhz:  4499.161
Processor:  4  Mhz:  1200.003
Processor:  5  Mhz:  1200.002
Processor:  6  Mhz:  1754.388
Processor:  7  Mhz:  1200.003
Processor:  8  Mhz:  1200.021
Processor:  9  Mhz:  1200.012
Processor:  10  Mhz:  2253.244
Processor:  11  Mhz:  4483.826
Processor:  12  Mhz:  1200.046
Processor:  13  Mhz:  1200.005
Processor:  14  Mhz:  2022.015
Processor:  15  Mhz:  1200.012
CPU model:  Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz
1 CPU,  8 physical cores per CPU, total 16 logical CPU units

 

Custom room treatments for headphone users.


I believe @Miska has already commented on this, but when selecting a board and memory (X570) for HQP, are the memory speed and the board's supported memory speed important? I think I remember that it was very important. I am building a 3800X (I know, not the best), and when selecting the board I am split between Asus and Gigabyte, but I found that MSI allegedly has the fastest memory support:

 

"For DDR4 OC support? MSi all the way, they have on they QVL sheet 4000mhz on 4 slots even for low mobo! They achieved 4666mhz on 2 slots too and in one test they has 5200mhz DDR4 with micron e live stream :D So if DDR4 speed matters to you MSI all the way."

 

So if it is that important, maybe I would go with MSI; I just need confirmation.

 

8 hours ago, Miska said:

Yesterday I tested on Windows using the free running mode how long it takes to process a track on i9-9900KS with poly-sinc-ext2 and ASDM7EC to DSD512. Result for 3:32 long track was 6:23. So almost 2x the track's playback time at processing speed of 0.55x.

 

 

I guess this means we are better off looking for DACs that can do DSD256 without clock dividers rather than waiting for CPU technology to catch up with the DSD512 processing needs of HQPlayer's EC modulators for real-time playback. Even if we are lucky enough to get a 20% increase in processing power every 2 years, it is still going to be a long wait (that's hoping for a new architecture every two years that delivers at least a 15% improvement, plus some minor improvements in between). Maybe if we get together we can commission T+A or Lampizator to build such a DAC 🙂 
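The "long wait" can be estimated with a little compounding arithmetic. This sketch assumes the roughly 1.81x speed-up implied by Miska's 0.55x DSD512 test and the 20%-per-two-years pace guessed above; both inputs are assumptions taken from this thread, not forecasts, and `years_to_reach` is a hypothetical helper.

```python
import math

def years_to_reach(required_speedup, gain_per_step=0.20, years_per_step=2.0):
    """Years of compounding per-core gains needed to reach a target speed-up."""
    steps = math.log(required_speedup) / math.log(1.0 + gain_per_step)
    return steps * years_per_step

if __name__ == "__main__":
    needed = 383 / 212  # 6:23 processing time / 3:32 track length, ~1.81x
    print(f"~{years_to_reach(needed):.1f} years at 20% per 2 years")
    print(f"~{years_to_reach(needed, gain_per_step=0.15):.1f} years at 15% per 2 years")
```

Under these assumptions the answer comes out at well over half a decade, which is the point being made.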

 

I am curious, though, why the processing speed was only 0.55x. Is the cache not able to feed the rest of the processor fast enough? Other thread dependencies? Thermal throttling? Is it something that can be addressed in future versions of HQPlayer?

 

 

31 minutes ago, Sagittarius said:

I am curious though why the processing speed was only 0.55x. Cache not being able to feed the rest of the processor fast enough? ..other thread dependencies? ..thermal throttling? Something that can be addressed in future versions of HQ player?

 

This was a test build on Windows, so a production build may differ a little. I've already put a lot of effort into making the EC modulators run as well as they now do, so I'm not expecting any big gains on this front. There was no thermal throttling. Maybe @jabbr can test at some point how it runs with AVX512 on his new Xeon. A larger and faster cache may help too. But a large part of the limitation is simply how many instructions a CPU core can execute within the available number of clock cycles.
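To put "instructions within the available number of clock cycles" into numbers: DSDn runs at n × 44.1 kHz, so DSD512 is 22.5792 MHz per channel. At 5 GHz that leaves only about 221 clock cycles per output sample per channel if one core handled one channel. The one-core-per-channel mapping is a simplifying assumption for illustration, not a description of how HQPlayer schedules its work.

```python
BASE_RATE_HZ = 44_100  # DSD64 = 64 x 44.1 kHz; DSDn = n x 44.1 kHz

def dsd_rate(multiple):
    """Sample rate in Hz of DSD<multiple>, e.g. dsd_rate(512) -> 22_579_200."""
    return multiple * BASE_RATE_HZ

def cycles_per_sample(clock_hz, multiple):
    """Clock cycles one core has per output sample of one channel."""
    return clock_hz / dsd_rate(multiple)

if __name__ == "__main__":
    for clock_ghz, multiple in ((5.0, 256), (5.0, 512), (3.9, 256)):
        budget = cycles_per_sample(clock_ghz * 1e9, multiple)
        print(f"DSD{multiple} @ {clock_ghz} GHz: {budget:.0f} cycles/sample/channel")
```

Doubling the rate halves the cycle budget, which is why DSD512 needs roughly twice the per-core speed of DSD256.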

 

Signalyst - Developer of HQPlayer

Pulse & Fidelity - Software Defined Amplifiers


@jabbr I ran the Geekbench benchmark so I could compare.

i9-9900KS at 5GHz

[Geekbench screenshot]

 

Asus RTX2080 Advanced

[Geekbench screenshot]

 

I'm not sure how to compare OpenCL to CUDA, though. I understand that OpenCL code runs faster. For my use case I'm extremely happy. And @Jussi keeps killing us... DSD512 with EC modulators in realtime probably won't happen in my lifetime! 😂 2.5X the 9900KS... ouch.

2 hours ago, Solstice380 said:

I'm not sure how to compare OpenCL to CUDA, though.  I understand that OpenCL code runs faster.  For my use case I'm extremely happy.  And @Jussi keeps killing us... DSD512 with EC modulators in realtime probably won't happen in my lifetime! 😂  2.5X the 9900KS.... ouch.


Geekbench is just a number. For compute you want CUDA, because that’s what HQP uses (not OpenCL). I think @Miska’s idea of comparing the time Pro takes to convert a song with the song's playing time is a great measure of the system’s performance. Yeah, EC at DSD512 is not happening anytime soon. 


The question is whether AVX512 @ 3.9 GHz is better than AVX2 @ 5 - 5.3 GHz. 
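One back-of-the-envelope way to frame that question is peak double-precision lanes per second: AVX2 holds 4 doubles per 256-bit vector, AVX512 holds 8 per 512-bit vector. This idealised model assumes the same number of vector instructions retire per cycle in both cases and ignores memory bandwidth, execution-unit counts and any part of the work that does not vectorize (such parts gain nothing from wider vectors). Treat the result as an upper bound, not a prediction of ASDM7EC behaviour, which only measurement can settle; `peak_lanes_per_sec` is a hypothetical helper.

```python
def peak_lanes_per_sec(clock_ghz, vector_bits, element_bits=64, vec_ops_per_cycle=1.0):
    """Idealised vector throughput: element lanes processed per second per core.

    Assumes vec_ops_per_cycle vector instructions retire every cycle.
    """
    lanes = vector_bits // element_bits
    return clock_ghz * 1e9 * lanes * vec_ops_per_cycle

if __name__ == "__main__":
    avx512 = peak_lanes_per_sec(3.9, 512)
    for clock in (5.0, 5.3):
        avx2 = peak_lanes_per_sec(clock, 256)
        print(f"AVX512 @ 3.9 GHz vs AVX2 @ {clock} GHz: {avx512 / avx2:.2f}x (idealised)")
```

If a large share of the modulator's instructions are scalar, the comparison collapses toward plain clock speed, where 5 GHz wins.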

Custom room treatments for headphone users.

7 hours ago, Solstice380 said:

@jabbr  I ran the GeekBench benchmark so I could compare.

i9-9900KS at 5GHz

[Geekbench screenshot]

 

Asus RTX2080 Advanced

[Geekbench screenshot]

 

I'm not sure how to compare OpenCL to CUDA, though.  I understand that OpenCL code runs faster.  For my use case I'm extremely happy.  And @Jussi keeps killing us... DSD512 with EC modulators in realtime probably won't happen in my lifetime! 😂  2.5X the 9900KS.... ouch.

 

I'm pretty sure those figures are not for the same algorithm, and are not using double-precision 64-bit floating point. OpenCL is not very nice unless you want to do graphics.

 

For DSD512 with EC, look for a CPU that has double the single-core score; that could be a potential candidate...

 

Signalyst - Developer of HQPlayer

Pulse & Fidelity - Software Defined Amplifiers
