I think you have a misunderstanding about utilization: in fact, high utilization can simply mean the internal units are clogged up while the core keeps feeding them tasks.
Except for the high-end CPUs, Intel's L3 cache is smaller than AMD's (with the exception of the APUs).
In fact, back-end scheduling keeps adjusting its strategy based on the degree of memory blocking (e.g. bandwidth usage), so the architectural issues really need to be understood through concrete measurements.
Optimizing cache and memory access has to be done with the back-end's out-of-order resources in mind.
It is also difficult to make full use of a large L3 like the X3D parts have.
Even on CPUs with smaller caches full utilization is impossible, and normally the larger the capacity, the lower the effective utilization.
If you have a clear understanding of the principle of locality, you already know this.
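To make the locality point concrete, here is a minimal, self-contained sketch (my own generic illustration, not a measurement from this thread): the same summation over the same array takes far longer when the access pattern is random, because a working set far larger than L3 turns most accesses into RAM round trips.

```csharp
// Minimal locality demo: identical work, different access patterns.
// The working set (~256 MB of ints plus the index table) is far larger than any L3,
// so the random pass misses cache constantly while the sequential pass streams.
using System;
using System.Diagnostics;

class LocalityDemo
{
    static void Main()
    {
        const int count = 64 * 1024 * 1024;
        var data = new int[count];
        var indices = new int[count];
        var random = new Random(1234);

        // Shuffled index table forces effectively random access over the same array.
        for (int i = 0; i < count; i++) indices[i] = i;
        for (int i = count - 1; i > 0; i--)
        {
            int j = random.Next(i + 1);
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) sum += data[i];           // sequential: prefetch-friendly
        Console.WriteLine($"sequential: {sw.ElapsedMilliseconds} ms (sum={sum})");

        sw.Restart();
        for (int i = 0; i < count; i++) sum += data[indices[i]];  // random: mostly cache misses
        Console.WriteLine($"random:     {sw.ElapsedMilliseconds} ms (sum={sum})");
    }
}
```

On a typical desktop CPU the sequential pass is several times faster, and that gap is the same effect that shows up later in this post as frametime lost to data cache misses.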
Also, when both Intel and Ryzen have SMT, Ryzen will indeed show lower utilization than Intel; however, the performance picture can look quite different once utilization goes past 50%.
Unity3D's internal implementation already makes use of some of this; it's just not as effective as it could be.
Also, I don't want to say too much about DOTS: it looks good, but in reality there are a lot of problems.
Also, if you run VRChat's C# through a decompiler, you can see it already uses Jobs, Burst, and some of ECS (which together add up to DOTS).
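For reference, this is roughly what Jobs + Burst usage looks like in Unity; it's a generic minimal example (ScaleJob and ScaleJobRunner are names I made up), not VRChat's actual decompiled code:

```csharp
// Generic example of a Burst-compiled parallel job over NativeArray data.
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

[BurstCompile]
struct ScaleJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float> Input;
    public NativeArray<float> Output;
    public float Factor;

    public void Execute(int index)
    {
        Output[index] = Input[index] * Factor;
    }
}

public class ScaleJobRunner : MonoBehaviour
{
    void Update()
    {
        var input  = new NativeArray<float>(4096, Allocator.TempJob);
        var output = new NativeArray<float>(4096, Allocator.TempJob);

        var job = new ScaleJob { Input = input, Output = output, Factor = 2f };
        JobHandle handle = job.Schedule(input.Length, 64); // 64 items per worker batch
        handle.Complete();                                  // blocking here only to keep the example short

        input.Dispose();
        output.Dispose();
    }
}
```

The point is that the data lives in tightly packed NativeArrays and the work is spread across worker threads, which is the data-dense style DOTS advertises.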
However, Unity3D's core is problematic. Although the external interface looks parallelized, in some cases it actually blocks on the core containers.
Unity3D doesn't really open up its game object management logic to the outside in a way that would allow better parallelization (to be fair, doing so involves a huge amount of complexity, and some of the parallelism issues would remain unsolved anyway).
While DOTS may sound good and is claimed to be a good data-dense, parallel solution, it's important not to just look at the claims, but to actually look at measurements and real usage.
That's why I don't think parallelization can go on forever. Instead, I hope people will adopt solutions like not throwing a huge workload at the engine core in the first place, and instead implementing their own container structures and management mechanisms to filter work, reduce unnecessary load, and improve locality (see the sketch below).
However, such a design is also quite complex, and it would be difficult for R&D engineers who aren't well versed in this area to do it properly.
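To illustrate what I mean by your own container structure with filtering, here is a rough hypothetical sketch (DenseTransformCache is a made-up name, not an existing library): game logic writes into densely packed arrays, a cheap scan finds what actually changed, and only those entries touch the engine-side Transforms.

```csharp
// Hypothetical sketch: keep dense, contiguous data on our side, filter out unchanged
// entries, and only push the changed ones through the engine's Transform API.
using System.Collections.Generic;
using UnityEngine;

public class DenseTransformCache : MonoBehaviour
{
    // Struct-of-arrays style layout keeps the per-frame scan cache-friendly
    // instead of chasing scattered managed references.
    private Vector3[] _positions;
    private Vector3[] _lastPushed;
    private Transform[] _targets;
    private readonly List<int> _dirty = new List<int>();

    public void Initialize(Transform[] targets)
    {
        _targets = targets;
        _positions = new Vector3[targets.Length];
        _lastPushed = new Vector3[targets.Length];
        for (int i = 0; i < targets.Length; i++)
            _positions[i] = _lastPushed[i] = targets[i].position;
    }

    // Game logic writes into the dense array instead of touching Transforms directly.
    public void SetPosition(int index, Vector3 position) => _positions[index] = position;

    void LateUpdate()
    {
        if (_targets == null) return;

        // Filtering pass over contiguous memory: cheap, mostly cache hits.
        _dirty.Clear();
        for (int i = 0; i < _positions.Length; i++)
            if (_positions[i] != _lastPushed[i]) _dirty.Add(i);

        // Only the changed entries pay the cost of crossing into the engine core.
        foreach (int i in _dirty)
        {
            _targets[i].position = _positions[i];
            _lastPushed[i] = _positions[i];
        }
    }
}
```

The real thing would of course need rotations, hierarchy handling, and thread safety, which is exactly where the complexity I mentioned comes from.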
You can use a large number of densely packed visible skinned meshes, regular meshes, particle systems and even cloth, dense materials, and high draw call counts (although the CPU overhead of the same number of draw calls can still vary several times over, since what gets submitted is itself a data-intensive transfer problem); a rough spawner example follows below.
Oh, and there's lighting too, but its GPU overhead rises so quickly that I'm ignoring it here.
This takes full advantage of the capacity gap between different L3 caches: pile enough of this into one scene and fps drops, fully exposing the latency gap between RAM and L3. At that point the difference can be tens of percent or even a multiple.
However, such scenarios are relatively rare and rather extreme, and they also hurt GPU efficiency to some extent.
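If someone wants to build that kind of stress scene, even a trivial (hypothetical) spawner like the one below is enough to push draw calls, skinning, and the per-frame working set well past L3 capacity; skinnedPrefab is assumed to be any prefab with a SkinnedMeshRenderer and an Animator.

```csharp
// Hypothetical stress-scene helper: spawn many skinned-mesh instances in a grid so
// animation, skinning, and draw call submission dominate the CPU frametime.
using UnityEngine;

public class SkinnedMeshStressSpawner : MonoBehaviour
{
    public GameObject skinnedPrefab;   // assumed: prefab with SkinnedMeshRenderer + Animator
    public int countX = 30;
    public int countZ = 30;
    public float spacing = 1.5f;

    void Start()
    {
        if (skinnedPrefab == null) return;

        for (int x = 0; x < countX; x++)
        {
            for (int z = 0; z < countZ; z++)
            {
                var pos = new Vector3(x * spacing, 0f, z * spacing);
                // Each instance adds animation evaluation, skinning work, and draw calls.
                Instantiate(skinnedPrefab, pos, Quaternion.identity, transform);
            }
        }
    }
}
```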
One thing that remains unfinished is a detailed analysis of Udon and worlds; part of how worlds are implemented has strongly data-intensive characteristics and is easy to miss.
I think this issue has to be separated from avatars themselves; they are different classes of problem.
However, even a simple preliminary analysis would cost a lot to set up just to try to reproduce the problem, since there are no existing heavily loaded world assets to use.
This part of the analysis must be kept separate from the problem described by @dark: avatars, whether visible or invisible, don't have that kind of data density even if you try to reproduce the problem with them.
However, some people's benchmarks and some worlds show very strong characteristics in this area, and those need to be reproduced.
If you can, I hope you'll try to reproduce the problem as much as possible: some worlds on Zen 3 at 4.8 GHz can only hold 60~80 fps, or maybe 100 fps, with just one avatar present, and the frame rate drop becomes even more dramatic in multiplayer scenarios.
These worlds carry roughly 7~15 ms of CPU frametime overhead, which is quite high, and it is mostly or almost entirely caused by data cache misses.
I deliberately tested the impact of toggling constraints using avatars with high constraint counts, and found that the frametime impact differs across worlds of different types and different (all of them very bad) levels of optimization.
In a world with no Udon at all, toggling an avatar with 5000 bones and 120 constraints on and off measured at only about 1 ms.
The difference is 340 fps vs 526 fps.
In a world with a runaway camera I get a difference of 2 ms and 1.5 ms when measuring toward the sky and in front of the camera, respectively.
In the railroad/train world, which looks like it has almost nothing in it, I got a gap of 1.2~1.3 ms.
This is reproducible, and I have records of it.
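The measurement itself doesn't need anything fancy; a simple averaging probe along these lines (a generic sketch, not any specific tool) is enough to compare the before/after frametime when an avatar or its constraints are toggled.

```csharp
// Generic sketch: average the frametime over a fixed window and log it, so two
// windows (avatar on vs avatar off) can be compared directly in milliseconds.
using UnityEngine;

public class FrametimeProbe : MonoBehaviour
{
    public int sampleWindow = 600;   // frames per measurement window

    private float _accumulatedMs;
    private int _frames;

    void Update()
    {
        _accumulatedMs += Time.unscaledDeltaTime * 1000f;
        _frames++;

        if (_frames >= sampleWindow)
        {
            float averageMs = _accumulatedMs / _frames;
            Debug.Log($"avg frametime over {_frames} frames: {averageMs:F2} ms");
            _accumulatedMs = 0f;
            _frames = 0;
        }
    }
}
```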
So I'm a little confused as to why this would be the case: isn't it all just access to and manipulation of transforms? And the constraints themselves don't show anything in measurements that would cause data cache misses.
They don’t seem to be relevant at all.