
Building a high performance compute server for HQPlayer



32 minutes ago, jabbr said:

Yes that's the eventual idea. Any consideration of providing for >1 DAC with multichannel, I am sure you are considering physically separate NAA/DACs.

 

No, NAA is clocked from the DAC side, so only one DAC for multichannel. There is no synchronization for multiple DACs, on purpose, because you would have multiple clocks in such a case. Very few audiophile DACs support external clock synchronization anyway.

 

With exaSound you can get 8 channels of 384/32 PCM and DSD256. With Merging Hapi/Horus you can get 384 PCM and DSD256 multichannel as well.

 

2 hours ago, jabbr said:

I am waiting for my replacement NIC but I suspect that even with RDMA offload, the CPU load will not change appreciably - I think I'm very close to ASDM7EC / 512 but as they say, close only counts for ...

 

Interestingly, the CPU bursts ( $ lscpu ) to 4.5 GHz with AMSDM7 but only to 3.9 GHz with ASDM7EC, so I think there's hope of tweaking something.

 

You could run null output test and see what kind of processing time you get for a track vs track length.

 

So far the highest published boost is 5.3 GHz on the new flagship laptop CPU, the i9-10980HK, but that is single-core boost only. For this case you would need at least two cores boosted, and I don't know how much boost that CPU can reach with two cores.

 

i9-9900KS can do all-core boost at 5 GHz, but it is not enough for ASDM7EC at DSD512.

 

It will be interesting to see how high the clocks go on the 10th gen desktop CPUs.

 

Signalyst - Developer of HQPlayer

Pulse & Fidelity - Software Defined Amplifiers

27 minutes ago, jabbr said:

True, but the W2245 has AVX512. It's curious that ASDM7EC is not able to boost compared with AMSDM7 (3.9 vs 4.5 GHz). The W2245 should be able to multicore boost. I may need to look at each core separately; htop shows % but not the per-core clock rate.

 

Cores are boosted individually; you can check /proc/cpuinfo to see what clock each core is running at.

 

But only testing will show how each CPU model performs at such tasks.

 

2 hours ago, jabbr said:

I wrote an AWK script to capture the /proc/cpuinfo output:

In both cases the output is 44.1x512.

First from the AMSDM7: Note that 4 threads are boosting to 4.5 GHz


Processor:  0  Mhz:  2943.395
Processor:  1  Mhz:  1200.102
Processor:  2  Mhz:  2802.146
Processor:  3  Mhz:  1200.199
Processor:  4  Mhz:  4499.999
Processor:  5  Mhz:  1200.046
Processor:  6  Mhz:  1201.438
Processor:  7  Mhz:  4561.842
Processor:  8  Mhz:  2436.413
Processor:  9  Mhz:  1201.864
Processor:  10  Mhz:  2671.769
Processor:  11  Mhz:  1200.799
Processor:  12  Mhz:  4492.542
Processor:  13  Mhz:  1200.343
Processor:  14  Mhz:  1200.020
Processor:  15  Mhz:  4514.522
CPU model:  Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz
1 CPU,  8 physical cores per CPU, total 16 logical CPU units

Now with ASDM7EC: Note that these 4 threads stay at 3.9 GHz


Processor:  0  Mhz:  3965.971
Processor:  1  Mhz:  1200.031
Processor:  2  Mhz:  1200.164
Processor:  3  Mhz:  1265.660
Processor:  4  Mhz:  3965.246
Processor:  5  Mhz:  1201.363
Processor:  6  Mhz:  1201.073
Processor:  7  Mhz:  3076.107
Processor:  8  Mhz:  3900.650
Processor:  9  Mhz:  1200.346
Processor:  10  Mhz:  1201.619
Processor:  11  Mhz:  1270.791
Processor:  12  Mhz:  3899.409
Processor:  13  Mhz:  1200.285
Processor:  14  Mhz:  1200.009
Processor:  15  Mhz:  3145.273
CPU model:  Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz
1 CPU,  8 physical cores per CPU, total 16 logical CPU units
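
The AWK script itself was not posted. A rough Python equivalent, assuming the usual x86 Linux /proc/cpuinfo layout with "processor" and "cpu MHz" fields, might look like the sketch below. The function names are made up for illustration.

```python
from pathlib import Path


def parse_core_clocks(cpuinfo_text):
    """Return a list of (logical_cpu, mhz) pairs from /proc/cpuinfo-style text."""
    clocks = []
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between logical CPUs
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "cpu MHz" and current is not None:
            clocks.append((current, float(value)))
    return clocks


def print_core_clocks(path="/proc/cpuinfo"):
    """Print one line per logical CPU, in the style of the listings above."""
    for cpu, mhz in parse_core_clocks(Path(path).read_text()):
        print(f"Processor:  {cpu}  Mhz:  {mhz:.3f}")
```

Calling print_core_clocks() a few times while playback is running gives a snapshot of which logical CPUs are boosting. The values are instantaneous, so they will jump around between calls.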

 

 

Hard to say, but it could be AVX512 capping the clocks, because ASDM7EC puts more load on it.

 

1 minute ago, jabbr said:

Yes as we've discussed. Is there a way to send more cycles to CUDA which isn't getting used too much? Perhaps a "prefer-cuda" flag? That might just enable 44.1x512?

 

Modulators cannot be run on a GPU: because of their mathematical structure they would be badly sub-optimal there. Only filters and the convolution engine can run there.

 

8 hours ago, jabbr said:

 

Tricky part is that things depend on CPU model and workload. Newer CPUs likely throttle less than previous generations.

 

But I would conclude that the higher AVX-512 usage is likely what limits the clocks to base frequency in this case.

 

On 4/4/2020 at 5:04 PM, jabbr said:

8GB 1x8GB DDR4 2933MHz RDIMM ECC Memory

 

Btw, note that for full memory speed on a quad-channel CPU you need four DIMMs...
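
To put rough numbers on that: theoretical peak DDR4 bandwidth scales with the number of populated channels, since each channel is 64 bits (8 bytes) wide. A back-of-the-envelope sketch follows. It assumes the quoted "2933MHz" means 2933 MT/s, the function name is only for illustration, and real-world throughput is lower than these peak figures.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak DDR bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


one_dimm = peak_bandwidth_gb_s(2933, channels=1)    # a single DIMM fills one channel
four_dimms = peak_bandwidth_gb_s(2933, channels=4)  # one DIMM per channel
print(f"1 channel:  {one_dimm:.1f} GB/s")
print(f"4 channels: {four_dimms:.1f} GB/s")
```

With a single DIMM the peak is roughly a quarter of what the memory controller could deliver with all four channels populated.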

 

6 hours ago, jabbr said:

When I try to run at DSD512, the sound is on for a second and then off for a second and repeat. The CPU utilization doesn't get >70%

 

When deadlines are systematically missed, the whole process goes into a spring-like motion, which you may know from rush-hour traffic jams, where a queue of cars ends up alternately accelerating and decelerating.
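
A toy model of that stop-and-go pattern, not HQPlayer's actual buffering logic: assume processing produces audio at a fraction `speed` of realtime, playback starts once a hypothetical `threshold_s` seconds of audio are buffered, and playback stops when the buffer runs dry. The names and the start/stop rule are assumptions made purely for illustration.

```python
def on_off_cycle(speed, threshold_s):
    """Return (seconds_playing, seconds_silent) per cycle for a too-slow producer.

    speed       -- processing speed as a fraction of realtime (0 < speed < 1)
    threshold_s -- seconds of audio buffered before playback (re)starts
    """
    if not 0 < speed < 1:
        raise ValueError("model only applies when processing is slower than realtime")
    # While playing, the buffer drains at (1 - speed) seconds of audio per second.
    playing = threshold_s / (1 - speed)
    # While silent, the buffer refills at `speed` seconds of audio per second.
    silent = threshold_s / speed
    return playing, silent


playing, silent = on_off_cycle(speed=0.5, threshold_s=0.5)
print(f"plays {playing:.1f} s, then silent {silent:.1f} s")
```

In this model, processing at half of realtime speed gives equal playing and silent phases, which resembles the on-for-a-second, off-for-a-second behaviour described in the quote.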

 

32 minutes ago, The Computer Audiophile said:

@Miska Have you been able to get the EC modulator working at DSD512 without hiccups?

 

No, I'm not aware of any CPU that would be able to do it in realtime. Only offline conversion with HQPlayer 4 Pro.

 

I hope CPU development will make it possible in two years or so. Intel's 10th gen mobile CPUs can already boost to 5.3 GHz, so we'll see what the desktop CPUs will look like.

 

Even though the i9-9900KS can run at an all-core boost of 5 GHz, it is still not enough. But you get the picture if you look at the highest per-core loads at DSD256 and then multiply that by two to go to DSD512.
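
A small sketch of that rule of thumb, using made-up per-core load figures (not measurements from this thread). It also shows why an overall CPU figure well below 100% says little: the average hides the one or two cores that set the pace.

```python
def estimate_dsd512(per_core_load_dsd256):
    """Apply the 'multiply the busiest core by two' rule of thumb.

    per_core_load_dsd256 -- per-core utilisation at DSD256, in percent
    Returns (average_load, busiest_core, projected_busiest_at_dsd512, feasible).
    """
    average = sum(per_core_load_dsd256) / len(per_core_load_dsd256)
    busiest = max(per_core_load_dsd256)
    projected = 2 * busiest
    return average, busiest, projected, projected < 100


# Hypothetical 8-core machine: two cores carry the modulators, the rest idle along.
loads = [70, 65, 10, 5, 5, 5, 5, 5]
average, busiest, projected, feasible = estimate_dsd512(loads)
print(f"average {average}%, busiest core {busiest}%, "
      f"projected at DSD512 {projected}% -> {'ok' if feasible else 'too slow'}")
```

Here the overall load looks modest, yet doubling the busiest core's load pushes it past 100%, so that hypothetical machine would miss its deadlines at DSD512.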

 

35 minutes ago, jabbr said:

But look at what we can do with this approach! Moreover, you can buy this prepackaged from Dell ... the workstation is surprisingly quiet. I installed my own GPU, RAM and NIC; the base price of ~$1600 is very reasonable for what you get.

 

Workstations designed for high CPU loads are pretty quiet. My old HP Z440 Xeon workstation is also very quiet, and the noise doesn't change with load: it is just as quiet at constant full load.

 

So I would again like to go with a new HP Z4 or Z6 series. If that's not possible, I'd build a new workstation myself. But for me, the three-year next-business-day on-site warranty, for example, was an important factor in the choice.

 

11 hours ago, The Computer Audiophile said:

This is interesting. I'm trying ASDM7EC and my old school Xeon E3-1241 v3 CPU doesn't get above 40-50%, yet the audio constantly stutters. 

 

My Xeon E5-1620v3 cannot do ASDM7EC even at DSD256; DSD128 is the maximum. That's why I need a new one with a W2245.

 

44 minutes ago, StreamFidelity said:

What does that mean?

 

It means that processing runs at full speed and all the output is just thrown away. It is a feature more or less inherited from HQPlayer 4 Pro, which lets the conversion to an output file run at full speed instead of waiting for the audible output.

 

44 minutes ago, StreamFidelity said:

So you have to wait six minutes for a three-minute song? Is the song loaded into a buffer, or how does it work?

 

Yes... Not sure I understand the question about the buffer part.

 

44 minutes ago, StreamFidelity said:

What is the CPU usage in %, how many cores (Hyperthreading?) and what clock rate?

 

About 20%, mostly two cores are fully loaded. CPU runs at constant 5 GHz.

 

So to run this in realtime, you'd need to have something that is about 2x faster per-core than i9-9900KS.
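
The arithmetic behind that estimate, as a minimal sketch (function names are illustrative): processing speed is track length divided by processing time, and the per-core speedup needed to just reach realtime is its inverse.

```python
def processing_speed(track_seconds, processing_seconds):
    """Speed relative to realtime: 1.0 means processing takes exactly as long as playback."""
    return track_seconds / processing_seconds


def required_speedup(speed):
    """How many times faster processing must become to just reach realtime."""
    return 1.0 / speed


# The example from the question: a three-minute song that takes six minutes.
speed = processing_speed(track_seconds=3 * 60, processing_seconds=6 * 60)
print(f"speed {speed:.2f}x, needs {required_speedup(speed):.1f}x faster")
```

The same two functions work for any null output test run: time the conversion, divide the track length by it, and compare the result to 1.0.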

 

31 minutes ago, Sagittarius said:

I am curious though why the processing speed was only 0.55x. Cache not being able to feed the rest of the processor fast enough? ..other thread dependencies? ..thermal throttling? Something that can be addressed in future versions of HQ player?

 

This was a test build on Windows, so a production build may differ a little. I've already put a lot of effort into making the EC modulators run the way they do now, so I'm not expecting any big gains on this front. No thermal throttling. Maybe @jabbr can test at some point how it runs with AVX512 on his new Xeon. A larger and faster cache may help too. But a large part of the limitation is simply how many instructions a CPU core can execute within the available number of clock cycles.

 

7 hours ago, Solstice380 said:

@jabbr  I ran the GeekBench benchmark so I could compare.

i9-9900KS at 5GHz

[Attached image: GeekBench result]

 

Asus RTX2080 Advanced

[Attached image: GeekBench result]

 

I'm not sure how to compare OpenCL to CUDA, though.  I understand that OpenCL code runs faster.  For my use case I'm extremely happy.  And @Jussi keeps killing us... DSD512 with EC modulators in realtime probably won't happen in my lifetime! 😂  2.5X the 9900KS.... ouch.

 

I'm pretty sure those figures are not for the same algorithm and using double-precision 64-bit floating point. OpenCL is not very nice unless you want to do graphics.

 

For DSD512 with EC, look for a CPU that has double the single-core score; that could have potential...

 

5 hours ago, jabbr said:

Geekbench is just a number. For compute you want CUDA because that’s what HQP  uses (not  OpenCL). I think @Miska’s idea of looking at the time for Pro to encode a song / the song real time is a great measurement of the system’s performance. Yeah EC/DSD512 not anytime soon. 


The question is whether AVX512 @ 3.9 GHz is better than AVX2 @ 5 - 5.3 GHz. 

 

HQPlayer 4 Desktop has the benchmark feature, although I accidentally broke it with the latest start/stop change (already fixed for the next release). With a release from before that change it should work.

 

8 minutes ago, Solstice380 said:

I just ran their GPU Test and it returned the OpenCL result.  I was curious why @jabbr got a CUDA result and I got the OpenCL.   Difference between my GPU and the Ti?  Or?

 

Are you on Nvidia graphics? Since CUDA is Nvidia-only...

 

Also, CUDA needs a new enough graphics driver version: the driver cannot be older than the SDK used to build the software.


Killer NICs, which Gigabyte for example puts on some motherboards, are also good for offloads. They have pretty much the entire TCP/IP stack on the NIC. Linux support is not great for all models, but the models I have work on recent Linux kernels. These are just more gaming-oriented NICs that aim to minimize latency.

 

https://www.killernetworking.com

 

5 minutes ago, The Computer Audiophile said:

Wow, I haven't had a Killer NIC for many years. They seem cool, but many people are suspect. Hey, that's like audio :~)

 

I have a couple of Gigabyte motherboards that have those, and the DAC UP USB ports.

 

Now the latest Gigabyte Z390 Designare motherboard I have for the 9900KS has just two Intel NICs (different models actually), but still comes with their newest DAC-UP2 USB ports.

 

24 minutes ago, jabbr said:

I just tried a few settings. Both poly-sinc-ext2 and xtr-mp with ASDM7EC took almost 2x (a few seconds shy of 2x), with or without CUDA. CUDA with -mp does significantly decrease CPU usage. In one case Pro used 640% CPU (CUDA off), with all cores active.

 

Yes, filters are usually not the limiting factor; the EC modulators are the ones that set the pace. This can be different with normal non-EC modulators and heavier filters.

 

But if the speed is about the same, it means that AVX512 offsets the lower clock speed enough to get roughly on par with the 9900KS running at 5 GHz in per-core speed in this case. So at least there is no loss in that sense, and complex filters may run faster than on the 9900KS.
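
As a rough illustration of that trade-off: for a core at 3.9 GHz to match one at 5 GHz, it has to get about 28% more work done per clock cycle. A sketch of the arithmetic follows. The function name is illustrative, and the calculation ignores memory and cache effects.

```python
def per_clock_gain_for_parity(slower_ghz, faster_ghz):
    """Per-cycle throughput gain the slower-clocked core needs to match the faster one."""
    return faster_ghz / slower_ghz


gain = per_clock_gain_for_parity(slower_ghz=3.9, faster_ghz=5.0)
print(f"needs {gain:.2f}x the work per clock")
```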

 

 

P.S. I just checked: the new HP workstation (Z4) equivalent to my current one, with a W-2245, costs about 3200€ here (32 GB RAM, 1 TB SSD).

 

2 hours ago, TubeLover said:

Even considering the current Dell sale, I specced out a workstation matching the one jabbr just purchased, added in the cost of his additional RAM, the pricey NIC, and the most reasonably priced 2080 that I could find, and it was over $3k. I was considering this as a possible path to allow me to run Roon and HQPlayer in conjunction, at DSD256 and with filters, in hopes of hearing the very significant sonic upgrade Chris mentioned in his RAAL SR1a discussion. This is, however, a pretty costly endeavor. And how long would a server PC build like that remain able to run HQPlayer optimally as described above? Much less at DSD512.

 

You can get quite a bit cheaper with an i9-9900K and RTX2080. You can also get started with just the i9-9900K and add the RTX2080 later: it has built-in graphics, so you get display output without a separate card, although no offloads for HQPlayer.

 

25 minutes ago, jabbr said:

@luisma it would be great to run @Miska's test of looking at how long HQ Pro takes to encode ASDM7EC on a Ryzen because that looks to me to be the best overall test of the system performance.

 

You can do the same test with Desktop v4 too; it is just broken in the current release due to a regression. I have fixed that and improved the time display to show m:ss.sss instead of just plain seconds.

 

You can also do a similar test with Embedded, but there's currently no display for the timing data, although the HQPlayer engine has it. Maybe I need to add this same functionality to the web interface.

 

4 hours ago, TubeLover said:

Thanks for the input, Miska. What strategy would you recommend to keep things "quite a bit cheaper with i9-9900K and RTX2080"? Thanks.

 

I would first get the type of server I like with a 9900K, one that has the physical space and the possibility to add something like an RTX2080 later. And then see whether I would eventually want to upgrade the machine with such a GPU.

 

Essentially that means that the case is large enough and has suitable PSU for the purpose.

 

You would likely want a quiet machine too. The approach to take depends on whether you'd like to build one yourself, have someone build one for you (a small company or such), or get one from the big vendors.

 
