I think you have a misunderstanding about utilization: in fact, high utilization can simply mean the internal units are clogged up while the core keeps feeding them tasks.
Except for the high-end CPUs, Intel's L3 cache is smaller than AMD's (with the exception of the APUs).
In fact, back-end scheduling keeps adjusting its strategy based on the degree of memory blocking (e.g. bandwidth usage), so the architectural issues really need to be understood through concrete measurements.
Optimizing cache and memory access has to be done with the back-end's out-of-order resources in mind.
It is also difficult to make full use of a large L3 like the X3D parts have.
Even on CPUs with smaller caches full utilization is impossible, and normally the larger the capacity, the lower the effective utilization.
If you have a clear understanding of the principle of locality, you already know this.
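To make the locality point concrete, here is a minimal, self-contained sketch (my own generic illustration, not a measurement from this thread): the same summation over the same array takes far longer when the access pattern is random, because a working set far larger than L3 turns most accesses into RAM round trips.

```csharp
// Minimal locality demo: identical work, different access patterns.
// The working set (~256 MB of ints plus the index table) is far larger than any L3,
// so the random pass misses cache constantly while the sequential pass streams.
using System;
using System.Diagnostics;

class LocalityDemo
{
    static void Main()
    {
        const int count = 64 * 1024 * 1024;
        var data = new int[count];
        var indices = new int[count];
        var random = new Random(1234);

        // Shuffled index table forces effectively random access over the same array.
        for (int i = 0; i < count; i++) indices[i] = i;
        for (int i = count - 1; i > 0; i--)
        {
            int j = random.Next(i + 1);
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) sum += data[i];           // sequential: prefetch-friendly
        Console.WriteLine($"sequential: {sw.ElapsedMilliseconds} ms (sum={sum})");

        sw.Restart();
        for (int i = 0; i < count; i++) sum += data[indices[i]];  // random: mostly cache misses
        Console.WriteLine($"random:     {sw.ElapsedMilliseconds} ms (sum={sum})");
    }
}
```

On a typical desktop CPU the sequential pass is several times faster, and that gap is the same effect that shows up later in this post as frametime lost to data cache misses.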
Also, when both Intel and Ryzen have SMT, Ryzen will indeed show lower utilization than Intel; however, the performance picture can look quite different once utilization goes past 50%.
Unity3D's internal implementation already makes use of some of this; it's just not as effective as it could be.
Also, I don't want to say too much about DOTS: it looks good, but in reality there are a lot of problems.
Also, if you run VRChat's C# through a decompiler, you can see it already uses Jobs, Burst, and some of ECS (which together add up to DOTS).
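For reference, this is roughly what Jobs + Burst usage looks like in Unity; it's a generic minimal example (ScaleJob and ScaleJobRunner are names I made up), not VRChat's actual decompiled code:

```csharp
// Generic example of a Burst-compiled parallel job over NativeArray data.
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

[BurstCompile]
struct ScaleJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float> Input;
    public NativeArray<float> Output;
    public float Factor;

    public void Execute(int index)
    {
        Output[index] = Input[index] * Factor;
    }
}

public class ScaleJobRunner : MonoBehaviour
{
    void Update()
    {
        var input  = new NativeArray<float>(4096, Allocator.TempJob);
        var output = new NativeArray<float>(4096, Allocator.TempJob);

        var job = new ScaleJob { Input = input, Output = output, Factor = 2f };
        JobHandle handle = job.Schedule(input.Length, 64); // 64 items per worker batch
        handle.Complete();                                  // blocking here only to keep the example short

        input.Dispose();
        output.Dispose();
    }
}
```

The point is that the data lives in tightly packed NativeArrays and the work is spread across worker threads, which is the data-dense style DOTS advertises.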
However, Unity3D's core is problematic. Although the external interface looks parallelized, in some cases it actually blocks on the core containers.
Unity3D doesn't really open up its game object management logic to the outside in a way that would allow better parallelization (to be fair, doing so involves a huge amount of complexity, and some of the parallelism issues would remain unsolved anyway).
While DOTS may sound good and is claimed to be a good data-dense, parallel solution, it's important not to just look at the claims, but to actually look at measurements and real usage.
That's why I don't think parallelization can go on forever. Instead, I hope people will adopt solutions like not throwing a huge workload at the engine core in the first place, and instead implementing their own container structures and management mechanisms to filter work, reduce unnecessary load, and improve locality (see the sketch below).
However, such a design is also quite complex, and it would be difficult for R&D engineers who aren't well versed in this area to do it properly.
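To illustrate what I mean by your own container structure with filtering, here is a rough hypothetical sketch (DenseTransformCache is a made-up name, not an existing library): game logic writes into densely packed arrays, a cheap scan finds what actually changed, and only those entries touch the engine-side Transforms.

```csharp
// Hypothetical sketch: keep dense, contiguous data on our side, filter out unchanged
// entries, and only push the changed ones through the engine's Transform API.
using System.Collections.Generic;
using UnityEngine;

public class DenseTransformCache : MonoBehaviour
{
    // Struct-of-arrays style layout keeps the per-frame scan cache-friendly
    // instead of chasing scattered managed references.
    private Vector3[] _positions;
    private Vector3[] _lastPushed;
    private Transform[] _targets;
    private readonly List<int> _dirty = new List<int>();

    public void Initialize(Transform[] targets)
    {
        _targets = targets;
        _positions = new Vector3[targets.Length];
        _lastPushed = new Vector3[targets.Length];
        for (int i = 0; i < targets.Length; i++)
            _positions[i] = _lastPushed[i] = targets[i].position;
    }

    // Game logic writes into the dense array instead of touching Transforms directly.
    public void SetPosition(int index, Vector3 position) => _positions[index] = position;

    void LateUpdate()
    {
        if (_targets == null) return;

        // Filtering pass over contiguous memory: cheap, mostly cache hits.
        _dirty.Clear();
        for (int i = 0; i < _positions.Length; i++)
            if (_positions[i] != _lastPushed[i]) _dirty.Add(i);

        // Only the changed entries pay the cost of crossing into the engine core.
        foreach (int i in _dirty)
        {
            _targets[i].position = _positions[i];
            _lastPushed[i] = _positions[i];
        }
    }
}
```

The real thing would of course need rotations, hierarchy handling, and thread safety, which is exactly where the complexity I mentioned comes from.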
You can use a large number of densely packed visible skinned meshes, regular meshes, particle systems and even cloth, dense materials, and high draw call counts (although the CPU overhead of the same number of draw calls can still vary several times over, since what gets submitted is itself a data-intensive transfer problem); a rough spawner example follows below.
Oh, and there's lighting too, but its GPU overhead rises so quickly that I'm ignoring it here.
This takes full advantage of the capacity gap between different L3 caches: pile enough of this into one scene and fps drops, fully exposing the latency gap between RAM and L3. At that point the difference can be tens of percent or even a multiple.
However, such scenarios are relatively rare and rather extreme, and they also hurt GPU efficiency to some extent.
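If someone wants to build that kind of stress scene, even a trivial (hypothetical) spawner like the one below is enough to push draw calls, skinning, and the per-frame working set well past L3 capacity; skinnedPrefab is assumed to be any prefab with a SkinnedMeshRenderer and an Animator.

```csharp
// Hypothetical stress-scene helper: spawn many skinned-mesh instances in a grid so
// animation, skinning, and draw call submission dominate the CPU frametime.
using UnityEngine;

public class SkinnedMeshStressSpawner : MonoBehaviour
{
    public GameObject skinnedPrefab;   // assumed: prefab with SkinnedMeshRenderer + Animator
    public int countX = 30;
    public int countZ = 30;
    public float spacing = 1.5f;

    void Start()
    {
        if (skinnedPrefab == null) return;

        for (int x = 0; x < countX; x++)
        {
            for (int z = 0; z < countZ; z++)
            {
                var pos = new Vector3(x * spacing, 0f, z * spacing);
                // Each instance adds animation evaluation, skinning work, and draw calls.
                Instantiate(skinnedPrefab, pos, Quaternion.identity, transform);
            }
        }
    }
}
```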
One thing that remains unfinished is a detailed analysis of Udon and worlds; part of how worlds are implemented has strongly data-intensive characteristics and is easy to miss.
I think this issue has to be separated from avatars themselves; they are different classes of problem.
However, even a simple preliminary analysis would cost a lot to set up just to try to reproduce the problem, since there are no existing heavily loaded world assets to use.
This part of the analysis must be kept separate from the problem described by @dark: avatars, whether visible or invisible, don't have that kind of data density even if you try to reproduce the problem with them.
However, some people's benchmarks and some worlds show very strong characteristics in this area, and those need to be reproduced.
If you can, I hope you'll try to reproduce the problem as much as possible: some worlds on Zen 3 at 4.8 GHz can only hold 60~80 fps, or maybe 100 fps, with just one avatar present, and the frame rate drop becomes even more dramatic in multiplayer scenarios.
These worlds carry roughly 7~15 ms of CPU frametime overhead, which is quite high, and it is mostly or almost entirely caused by data cache misses.
I deliberately tested the impact of toggling constraints using avatars with high constraint counts, and found that the frametime impact differs across worlds of different types and different (all of them very bad) levels of optimization.
In a world with no Udon at all, toggling an avatar with 5000 bones and 120 constraints on and off measured at only about 1 ms.
The difference is 340 fps vs 526 fps.
In a world with a runaway camera I get a difference of 2 ms and 1.5 ms when measuring toward the sky and in front of the camera, respectively.
In the railroad/train world, which looks like it has almost nothing in it, I got a gap of 1.2~1.3 ms.
This is reproducible, and I have records of it.
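The measurement itself doesn't need anything fancy; a simple averaging probe along these lines (a generic sketch, not any specific tool) is enough to compare the before/after frametime when an avatar or its constraints are toggled.

```csharp
// Generic sketch: average the frametime over a fixed window and log it, so two
// windows (avatar on vs avatar off) can be compared directly in milliseconds.
using UnityEngine;

public class FrametimeProbe : MonoBehaviour
{
    public int sampleWindow = 600;   // frames per measurement window

    private float _accumulatedMs;
    private int _frames;

    void Update()
    {
        _accumulatedMs += Time.unscaledDeltaTime * 1000f;
        _frames++;

        if (_frames >= sampleWindow)
        {
            float averageMs = _accumulatedMs / _frames;
            Debug.Log($"avg frametime over {_frames} frames: {averageMs:F2} ms");
            _accumulatedMs = 0f;
            _frames = 0;
        }
    }
}
```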
So I'm a little confused as to why this would be the case: isn't it all just access to and manipulation of transforms? And the constraints themselves don't show anything in measurements that would cause data cache misses.
They don’t seem to be relevant at all.