
Intel CPUs Don’t Support ECC Memory: How Bad For A/V Quality?


lisa


11 hours ago, Miska said:

 

I think AMD EPYC CPUs have always supported ECC, and they also have things like memory encryption. The Threadripper versions also gained support for the memory encryption part.

 

Intel Xeon also always supports ECC; that is part of the market differentiation.   I'd rather have ECC, but it is all in the mix of tradeoffs.   If I had more than 64 GB, I'd almost definitely go for ECC, because the error statistics get worse -- the expected number of bit errors scales with the amount of memory.   At 256 GB, there is no question.   I only have 32 GB -- so raw memory is reliable enough for me.   I have a 10900X processor, and if I had to trade it off against a 6-core processor with ECC, I'd take the 10900X.   One item of note -- I have an application that easily takes all 10 cores; not all people are pushing things so hard.   Frankly, I'd even prefer a 16 or 18 core machine.

 

John

 

Link to comment
2 hours ago, Miska said:

 

In addition, Xeons support much larger RAM installations and have double the number of memory channels compared to regular desktop CPU models (except some X-series and such CPUs).

 

My previous development workstation was a Xeon E5 with 32 GB of ECC RAM, and it supported up to 768 GB of RAM.

 

My current development workstation has a Xeon W-2245 with 64 GB of ECC RAM and supports up to 1 TB of RAM. In addition, it has AVX-512 support, which has been lacking from the regular desktop CPUs.

 

My X processor has AVX512 also -- but when I run the DHNRDS on it, a well-ported AVX512 build versus a well-ported AVX2 build makes little difference in performance.  In fact, the current generations of CPUs slow down when using AVX512.    I have been thinking about doing tests using the AVX512 instruction set ONLY for the long FIR filters and Hilbert transforms, but I suspect that the clock slowdown triggered by AVX512 would persist long enough to also slow down the AVX2 instructions.  (The reason for using AVX512 for some of the FIR filters is that 512 bits at a time might help a little.)
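
For illustration only, here is a minimal sketch (not the actual DHNRDS code -- the function names and layout are made up) of the same FIR dot-product written with AVX2 and with AVX-512 intrinsics. On the current X and Xeon parts the 512-bit version often runs at a lower sustained clock, which is exactly why doubling the vector width may not pay off:

```cpp
#include <immintrin.h>
#include <cstddef>

// AVX2 kernel: 8 floats per FMA (tail handling omitted for brevity)
float fir_avx2(const float* x, const float* h, size_t taps) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i + 8 <= taps; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(h + i), acc);
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

// AVX-512 kernel: 16 floats per FMA, but possibly at a lower clock
float fir_avx512(const float* x, const float* h, size_t taps) {
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i + 16 <= taps; i += 16)
        acc = _mm512_fmadd_ps(_mm512_loadu_ps(x + i), _mm512_loadu_ps(h + i), acc);
    return _mm512_reduce_add_ps(acc);   // horizontal sum of the 16 partial sums
}
```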

 

Other than a few improvements in cycles per instruction, the big advantage of the newer processors is more flexibility in the math units, so more operations can be done in parallel.   The effect isn't major (perhaps 20%), but it does help.

The other advantage of the bigger processors (including X processors) is more I/O lanes for more sustained I/O bandwidth.   There is NO WAY that I am pushing the PCI I/O bandwidth, but it is nice to be able to take better advantage of it if I ever need to.

 

When I upgraded from 2 channels to 4 channels of memory, I detected nil performance improvement, even with my fully saturating, multi-threaded program.   I believe that the bigger cache on the X and Xeon processors might just be more helpful than extra memory channels -- for normal programs.
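
A rough way to see why that happens (hypothetical toy benchmark, nothing from DHNRDS): run the same summing loop over a buffer that fits in cache and over one that does not. Only the second case is limited by the memory channels; the first is limited by the core itself, so adding channels changes nothing.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Sum the buffer 'passes' times and return elapsed seconds.
static double sweep(const std::vector<float>& buf, int passes) {
    volatile float sink = 0.f;               // keeps the loop from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        float s = 0.f;
        for (float v : buf) s += v;
        sink = sink + s;
    }
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::vector<float> small(1 << 16, 1.f);  // 256 KB: stays in L2/L3 cache
    std::vector<float> big(1 << 26, 1.f);    // 256 MB: forced out to DRAM
    // Both calls touch about 1 GB of data in total.
    std::printf("cache-resident: %.3f s\n", sweep(small, 4096));
    std::printf("DRAM-bound:     %.3f s\n", sweep(big, 4));
}
```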

 

For big machines, you REALLY DO want ECC.

Also, if you really do not need AVX512 and don't need more than 10 cores, I'd suggest the 10900K for the 10th generation, or wait for the 11th generation. (Or, try AMD.)

 

John

 

Link to comment
14 hours ago, jabbr said:

 

Agreed. There are details for different CPUs but nothing can do more than sinc-M/ASDM7EC at DSD256. When I run this along with DHNRDS I can only do DSD128 without dropouts.

 

For me, another advantage of the Xeon-W is that it has enough PCIe 3 lanes that I can run both a CUDA GPU and 100 GbE Ethernet. The non-Xeon Intel CPUs can neither do ECC nor provide enough PCIe lanes for me.

Good news (slightly off topic) -- the new version of DHNRDS, coming in a few days, is faster: perhaps about 10-20% in the higher quality modes, but substantially more in --normal mode.   (Got rid of cruft and stupid code.)

 

Bottom line -- on smaller machines, especially with moderately tested software like DHNRDS, software quality trumps hardware ECC.   However, as mentioned elsewhere, for servers, especially ones that do scrubbing and the like, ECC is critical.   That is -- of course -- make sure that the memory is good and the system speed isn't pushed too hard.

 

John

 

Link to comment
14 hours ago, jabbr said:

 

Agreed. There are details for different CPUs but nothing can do more than sinc-M/ASDM7EC at DSD256. When I run this along with DHNRDS I can only do DSD128 without dropouts.

 

For me, another advantage of the Xeon-W is that it has enough PCIe 3 lanes that I can run both a CUDA GPU and 100 GbE Ethernet. The non-Xeon Intel CPUs can neither do ECC nor provide enough PCIe lanes for me.

You are right about ECC, but X CPUs do have lots of lanes.  (AFAIR around 40 or more.)

 

John

Link to comment
13 hours ago, Miska said:

 

Yes, I think it does it because of the TDP limit. But yes, ASDM7EC works fine with it, at about 90% core load on the modulator cores. I offload the filters to the RTX 2080 GPU, but running the filters on the CPU also works fine. Offloading the filters leaves plenty of CPU time for running compilers and other development stuff.

 

I have carefully watched the CPU temperatures, and AVX512 seems to blindly slow down the CPU, TDP limit or not.   It also seems to be partially related to the 'C' power states.   I could eke out a bit more performance (AVX512) by tuning the CPU controls, but not by much.   Of course, I am not a speed-tweaker, so that kind of thing is out of the question for me.   (Stability and standardization, with or without ECC, are important to me.)

 

I could consider offloading the expensive DHNRDS filtering to a parallel CPU, but many people who complain about the application speed being terrible aren't even using AVX2 machines.*     Since I have perfected the algorithms, fully understanding what is needed, and since the program has a pipelined nature (pipelined through threads), perhaps 4 long Hilbert transforms (FIR filters) could be done in parallel across 8 or 16 threads.

 

*Everyone DOES complain about the speed, but those who are using 2- or 4-core SSE2 machines are most likely not going to be very happy in the higher quality modes.

 

So, in the DHNRDS program, it might be possible to do 32 or 64 Hilbert transforms of 1000-2000 taps in parallel, pipelined in batches of 4 transforms.   (I need to publish the algorithm before someone else figures it out -- it is only just becoming practical to use with today's fast CPUs.)
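
Purely as a structural sketch (the real algorithm and pipeline are not published, so every name and detail below is a made-up placeholder), running a batch of independent long Hilbert-transform FIR filters in parallel could look roughly like this -- one filter per task, a handful of filters per batch:

```cpp
#include <future>
#include <vector>

struct Filter { std::vector<float> taps; };   // placeholder for a 1000-2000 tap filter

// Naive placeholder convolution; a real implementation would use SIMD kernels
// and carry filter history between blocks.
std::vector<float> apply(const Filter& f, const std::vector<float>& block) {
    std::vector<float> out(block.size(), 0.f);
    for (size_t n = 0; n < block.size(); ++n)
        for (size_t k = 0; k < f.taps.size() && k <= n; ++k)
            out[n] += f.taps[k] * block[n - k];
    return out;
}

// Run one batch of independent filters (e.g. 4 at a time) in parallel.
std::vector<std::vector<float>> run_batch(const std::vector<Filter>& filters,
                                          const std::vector<float>& block) {
    std::vector<std::future<std::vector<float>>> jobs;
    for (const auto& f : filters)
        jobs.push_back(std::async(std::launch::async, apply, std::cref(f), std::cref(block)));
    std::vector<std::vector<float>> results;
    for (auto& j : jobs) results.push_back(j.get());
    return results;
}
```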

 

I do wish I could have afforded 18 cores...

 

 

 

Link to comment
On 1/23/2021 at 9:42 AM, jabbr said:

 

I meant the i9-10900K ... oh god these numbers and letters have me lost...

X looks like K, almost.

The tradeoff of X vs K: the X has more cache per core and AVX512, while the K has a slightly faster clock rate on average and a slightly higher temperature limit.

Link to comment
On 1/27/2021 at 12:00 PM, Miska said:

 

The X has twice as many memory channels, giving twice the memory bandwidth (and maximum capacity). The X also has many more PCIe lanes.

 

Yes, but very interesting (to me): with the super memory/CPU-intensive DHNRDS software, using two vs four channels on my machine made almost no difference.  The PCIe lanes might make a big difference for those with lots of I/O devices and a need for lots of I/O speed.    More memory channels might be more important in cases where raw copies of DIFFERENT memory are done at high rates, but most code isn't that simple.   Most code uses the same memory over and over again (therefore a bigger cache at similar speed is likely more important).


I got the X rather than the K mostly out of curiosity about AVX512, and to be able to support it in the future, when AVX512 might actually help a lot on newer processors.   I use a very efficient SIMD compiler, and given the way the software is written, AVX512 SHOULD produce a big speedup.  But because of the limitations of the 10th and earlier generation X CPUs (and Xeons), AVX512 is mostly just 'more instructions', not really gaining much from the 512-bit operations.   I even checked the generated code for AVX512, and it is about as efficient as one can reasonably expect.  There is a LOT of code in DHNRDS where 512-bit operations SHOULD speed things up, but they don't.  (BTW, the compiler is LLVM/CLANG -- for my purposes, it leaves GCC and the Intel compilers in the dust.)
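
For anyone who wants to do the same kind of check, one way (not necessarily the exact procedure used here) is to build the hot loop for an AVX2 target and for an AVX-512 target with clang's vectorizer remarks turned on, then compare the generated assembly:

```cpp
// Example build commands (assumed file name fir.cpp):
//
//   clang++ -O3 -ffast-math -mavx2 -Rpass=loop-vectorize -S fir.cpp
//   clang++ -O3 -ffast-math -march=skylake-avx512 -mprefer-vector-width=512 \
//           -Rpass=loop-vectorize -S fir.cpp
//
// Without -mprefer-vector-width=512, recent clang versions often keep 256-bit
// vectors even on AVX-512 targets, precisely because of the downclocking issue.
float dot(const float* x, const float* h, int n) {
    float acc = 0.f;
    for (int i = 0; i < n; ++i)       // the loop to look for in the .s output
        acc += x[i] * h[i];
    return acc;
}
```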

 

* I hard-coded a bunch of SIMD operations -- but the very efficient hand-coded ASM operations tend to slow down the code, because the compiler optimization is weaker when using inline asm.   I was really surprised about that, because my inline asm sequences were maximally efficient.
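
A minimal illustration (hypothetical, not taken from DHNRDS) of why that happens: the compiler treats an inline asm block as a black box and cannot reorder, fuse, or unroll across it, whereas the intrinsic version is ordinary code the optimizer can schedule freely.

```cpp
#include <immintrin.h>

// Compile with -mavx2 -mfma (or a newer -march) so both versions are legal.

__m256 fma_intrinsic(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);            // visible to the optimizer
}

__m256 fma_inline_asm(__m256 a, __m256 b, __m256 c) {
    __asm__("vfmadd231ps %1, %2, %0"            // opaque to the optimizer: c = a*b + c
            : "+x"(c)
            : "x"(a), "x"(b));
    return c;
}
```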

 

There are so many 'features' which are of secondary advantage in most situations.   48 lanes of PCIe is probably the most impactful advantage for big machines.   I like the big cache also -- DHNRDS does do a lot of memory transfers, but it does so with the same memory over and over again.   I would expect that most useful software (especially for audio) is more of the 'use the same memory over and over again' kind rather than doing massive transfers of different memory (e.g. FIR filters, most math).   Therefore 'CPU-local cache size' probably trumps 'memory channels' for audio-type applications.   More channels DOES make a difference in benchmarks.

 

 

 

 

Link to comment
