May 7 “Let Loose” Event - new iPads

leman

Site Champ
Posts
681
Reaction score
1,261
I've updated the plot to make it more readable, thanks for your helpful suggestions!


I'm afraid this still doesn't make sense to me. Even using the single highest score for the M3 Max I could find among three pages of individual OpenCL results (so this should be a result for the 40-core model), the RTX 4060 Ti still has higher performance on this benchmark, despite having only 72% of the bandwidth.

If the GB6 GPU tests are so bandwidth-intensive that they limit the performance of the M3 Max, how is a GPU with so much less bandwidth able to outperform it on this benchmark?

Oh, but the 4060 Ti scores lower on GB6 compute! The highest 4060 Ti compute scores are around 140k, while the full M3 Max manages around 155k. This again supports the idea that the GB6 compute tests are largely bandwidth-limited. The 4060 Ti nominally has 50% more compute than the M3 Max, but roughly 20% less bandwidth (the Apple GPU does not have access to the full RAM bandwidth; for the full M3 Max it should be around 350 GB/s). So the results are consistent with bandwidth-limited behavior.

Similarly, the base M3 outperforms the MX450 by a large margin (GB6 ~45k vs. ~30k), but the MX450 also has half the compute of the M3.
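As a rough sanity check, this reading can be sketched with a toy roofline-style model. The peak-compute and bandwidth figures below are my approximations of public specs, and the assumption that the score tracks min(peak compute, arithmetic intensity × bandwidth) is mine, not anything Geekbench documents:

```python
# Toy roofline model: a workload with low arithmetic intensity I
# (FLOPs per byte moved) is capped by bandwidth * I, not peak compute.

def attainable_tflops(peak_tflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable throughput under the classic roofline model."""
    bandwidth_bound = bandwidth_gbs * intensity_flops_per_byte / 1000  # TFLOPs
    return min(peak_tflops, bandwidth_bound)

# Approximate figures (assumptions, not measured values):
#   RTX 4060 Ti:       ~22 TFLOPs FP32, ~288 GB/s
#   M3 Max (40-core):  ~14 TFLOPs FP32, ~350 GB/s usable by the GPU
for name, peak, bw in [("RTX 4060 Ti", 22.0, 288.0),
                       ("M3 Max 40c", 14.0, 350.0)]:
    # At a low arithmetic intensity (here 10 FLOPs/byte) both GPUs are
    # bandwidth-bound, and the M3 Max comes out ahead despite having
    # less raw compute -- matching the GB6 compute-score ordering.
    print(name, attainable_tflops(peak, bw, 10.0))
```

With a high intensity plugged in instead, the ordering flips in favor of the 4060 Ti, which is what you'd expect from a compute-bound workload.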

Most of your argumentation appears to be based on comparing OpenCL scores. The OpenCL GB6 backend is known to have poor performance on Apple Silicon, for whatever reason. We should be looking at the optimized backends, i.e. the ones with the highest scores: for Apple that's Metal, for Nvidia it's OpenCL. They are all using the same algorithms anyway; the difference is the overhead incurred by the implementation.
 

dada_dave

Elite Member
Posts
2,231
Reaction score
2,242
I've updated the plot to make it more readable, thanks for your helpful suggestions!
Very nice.
Oh, but the 4060 Ti scores lower on GB6 compute! The highest 4060 Ti compute scores are around 140k, while the full M3 Max manages around 155k. This again supports the idea that the GB6 compute tests are largely bandwidth-limited. The 4060 Ti nominally has 50% more compute than the M3 Max, but roughly 20% less bandwidth (the Apple GPU does not have access to the full RAM bandwidth; for the full M3 Max it should be around 350 GB/s). So the results are consistent with bandwidth-limited behavior.

Similarly, the base M3 outperforms the MX450 by a large margin (GB6 ~45k vs. ~30k), but the MX450 also has half the compute of the M3.

Most of your argumentation appears to be based on comparing OpenCL scores. The OpenCL GB6 backend is known to have poor performance on Apple Silicon, for whatever reason. We should be looking at the optimized backends, i.e. the ones with the highest scores: for Apple that's Metal, for Nvidia it's OpenCL.

Aye, although in that case, for Nvidia, it should be CUDA … but sadly that’s not available anymore. Supposedly too few people used it, which is … odd. Anyway, I know Nvidia historically had very poor OpenCL support, though within the last few years they did improve it. But even so, I’m not sure how optimized it would be relative to a CUDA implementation.
They are all using the same algorithms anyway, the difference is the overhead incurred by the implementation.
 

leman

Site Champ
Posts
681
Reaction score
1,261
Aye, although in that case, for Nvidia, it should be CUDA … but sadly that’s not available anymore. Supposedly too few people used it, which is … odd. Anyway, I know Nvidia historically had very poor OpenCL support, though within the last few years they did improve it. But even so, I’m not sure how optimized it would be relative to a CUDA implementation.


Yes, a CUDA backend would be the best way to gauge performance. That said, the OpenCL scores on Nvidia hardware make sense to me (unlike the Vulkan ones, which are way too low).

P.S. I just looked up some historical GB5 compute scores; the CUDA and OpenCL backends seem to perform very similarly on Nvidia. That’s maybe why they dropped CUDA in GB6. E.g.: https://browser.geekbench.com/v5/compute/4197762, https://browser.geekbench.com/v5/compute/4917806
 
Last edited:

dada_dave

Elite Member
Posts
2,231
Reaction score
2,242
Yes, a CUDA backend would be the best way to gauge performance. That said, the OpenCL scores on Nvidia hardware make sense to me (unlike the Vulkan ones, which are way too low).

P.S. I just looked up some historical GB5 compute scores; the CUDA and OpenCL backends seem to perform very similarly on Nvidia. That’s maybe why they dropped CUDA in GB6. E.g.: https://browser.geekbench.com/v5/compute/4197762, https://browser.geekbench.com/v5/compute/4917806
I think it depends; there was a lot of variation, but you can see that on average CUDA was higher by maybe 7–10%? Hard to tell. Either way, that’s not enough to impact your analysis. Not like using Vulkan would be, or OpenCL for Apple.

I still maintain, though, that if the algorithms were primarily memory-bound on Apple GPUs, we probably should’ve seen a greater reduction in scores from the M2 Pro to the M3 Pro, and we don’t. The per-core score even increases, just not by as much as for the base and Max.
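The per-core comparison can be made explicit with a trivial helper. The scores and core counts below are made-up placeholders to illustrate the shape of the argument, not real GB6 results:

```python
# If GB6 compute were strictly bandwidth-bound, a chip whose bandwidth
# per GPU core dropped between generations should also see its score
# per core drop. A per-core increase argues against a pure bandwidth cap.

def per_core(score, gpu_cores):
    """GB6 compute score normalized by GPU core count."""
    return score / gpu_cores

# Placeholder numbers for illustration only (not measured GB6 scores):
old_gen = per_core(70_000, 19)  # hypothetical previous-gen Pro chip
new_gen = per_core(76_000, 18)  # hypothetical newer Pro chip, fewer cores

# Despite fewer cores (and, for the M3 Pro, less bandwidth than its
# predecessor), the per-core figure goes up -- the pattern described above.
print(new_gen > old_gen)
```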
 
Last edited:

leman

Site Champ
Posts
681
Reaction score
1,261
I still maintain, though, that if the algorithms were primarily memory-bound on Apple GPUs, we probably should’ve seen a greater reduction in scores from the M2 Pro to the M3 Pro, and we don’t. The per-core score even increases, just not by as much as for the base and Max.

That is a very good point. Yeah, one would need to look at these things in more detail.
 