Furthermore, I must point out that polygons do not scale at O(n^2); rather, they are an O(n) problem.
Ray tracing, on the other hand, is an O(log n) problem per ray once an acceleration structure is in place.
Due to latency and parallelization effects, within a certain range, multiplying the polygon count by k^2 increases the overhead only by a factor of about k.
In other words, the overhead within that range scales roughly with the square root of the polygon count.
For example, processing 1 million triangles takes only about 3 times as long as processing 100,000 triangles rather than 10 times as long; beyond that range, scaling gradually trends toward linear.
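As a rough check of those figures, here is a minimal sketch that assumes a simple power-law model, time proportional to n^exponent. The model and the function name `time_ratio` are illustrative assumptions, not a measured GPU characteristic.

```python
def time_ratio(n_small: int, n_large: int, exponent: float) -> float:
    """Predicted ratio of processing times under the assumed model time ~ n**exponent."""
    return (n_large / n_small) ** exponent

# 100,000 triangles versus 1 million triangles (a 10x increase in count).
sqrt_ratio = time_ratio(100_000, 1_000_000, 0.5)    # square-root regime
linear_ratio = time_ratio(100_000, 1_000_000, 1.0)  # linear regime

print(f"square-root scaling: {sqrt_ratio:.2f}x")
print(f"linear scaling:      {linear_ratio:.2f}x")
```

Under this assumed model the square-root regime gives a ratio of about 3.16, which matches the "about 3 times" figure, while the linear regime gives the full 10 times.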
However, there is an underlying issue: polygon processing cannot be effectively parallelized, a problem I have mentioned multiple times before.
This is because rasterization always requires some amount of sorting, visibility determination, and index processing.
These steps make part of the work serial and difficult to parallelize. The bottleneck sits at the front-end primitive distributor.
Consequently, large-scale GPUs encounter bottlenecks when facing massive triangle counts. The only solutions are to reduce the latency of front-end data fetching or to increase the GPU front-end frequency.
This is reflected in specific GPU architectures where clock frequencies are adjusted asynchronously, as in the later designs of the RDNA series. It can be verified with specialized profiling tools or understood from technical slides.
Modern solutions use Compute Shaders for more flexible culling in place of traditional fixed-function units, such as the VPC units in NVIDIA GPUs.
Multi-step, flexible batch culling is more efficient than relying on fixed-function hardware to check and cull primitives one by one. The fixed-function approach incurs high latency and may even fail to reduce total frame time.
Although these fixed units were originally intended to make later stages such as rasterization (RAS) and pixel shading more efficient, they often ended up dragging down overall performance.
At the same time, vertex data fetching can be optimized through cluster-based designs. (Together with the improved culling efficiency, this results in an overall 3 to 5 times increase in geometry-stage performance.)
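To make the batch-culling idea concrete, here is a minimal CPU-side sketch in Python. It is only an illustration of the two-step principle, not GPU compute shader code and not any vendor's or engine's actual implementation. The names `Cluster`, `cluster_outside`, and `two_stage_cull` are hypothetical, and the single clipping plane stands in for a full view frustum.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    center: tuple    # bounding-sphere center (x, y, z)
    radius: float    # bounding-sphere radius
    triangles: list  # each triangle is three (x, y, z) vertices

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def cluster_outside(cluster, planes):
    """Step 1: reject a whole cluster if its sphere lies fully behind any plane.
    Each plane is (nx, ny, nz, d) with the normal pointing into the visible side."""
    for nx, ny, nz, d in planes:
        if dot((nx, ny, nz), cluster.center) + d < -cluster.radius:
            return True
    return False

def is_back_facing(tri, eye):
    """Step 2: per-triangle test, assuming counter-clockwise front faces."""
    v0, v1, v2 = tri
    normal = cross(sub(v1, v0), sub(v2, v0))
    return dot(normal, sub(eye, v0)) <= 0.0

def two_stage_cull(clusters, planes, eye):
    visible, triangle_tests = [], 0
    for cluster in clusters:
        if cluster_outside(cluster, planes):
            continue  # every triangle in this cluster is skipped at once
        for tri in cluster.triangles:
            triangle_tests += 1
            if not is_back_facing(tri, eye):
                visible.append(tri)
    return visible, triangle_tests

eye = (0.0, 0.0, 0.0)
planes = [(0.0, 0.0, -1.0, -1.0)]  # keeps only geometry with z <= -1
front = ((0, 0, -5), (1, 0, -5), (0, 1, -5))
back = ((0, 0, -5), (0, 1, -5), (1, 0, -5))   # same triangle, reversed winding
behind = ((0, 0, 5), (1, 0, 5), (0, 1, 5))    # behind the viewer
clusters = [
    Cluster((0.5, 0.5, -5.0), 1.0, [front, back]),
    Cluster((0.5, 0.5, 5.0), 1.0, [behind, behind]),
]
visible, triangle_tests = two_stage_cull(clusters, planes, eye)
print(len(visible), "visible triangle(s) after", triangle_tests, "per-triangle tests")
```

The point of the sketch is that the second cluster is discarded with one sphere test, so only two of the four triangles ever reach the per-triangle step. On a GPU each step would run as a parallel batch rather than a loop.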
Similar issues exist in depth sorting. For instance, the strict tile-based deferred designs of mobile GPUs achieve higher depth-culling efficiency but may introduce greater latency. When facing a large number of fragmented objects such as particles, this can actually lower efficiency and leave the pipeline underutilized.
In some cases, depth testing can even perform worse than transparency blending or the “discard” operation. Most designs are therefore tailored to common scenarios. As requirements become more extreme and demand broader coverage, it becomes necessary to overhaul the rendering workflow and introduce new technologies.
This does not mean traditional methods are inferior, as new methods introduce their own overhead. In practice, more sophisticated hybrid designs are considered, which are usually heavily encapsulated and hidden.
Furthermore, the stage that causes the bottleneck shifts as the data scale changes, which makes it difficult to analyze. One therefore has to rely on comprehensive testing.
This also makes precise tuning difficult, which favors adaptive designs over manual parameter tuning. Developers also avoid maintaining different rendering paths for high-end and low-end GPUs, since the added complexity leads to unmaintainable code and unpredictable results that are harder to optimize.
This is the general approach in game development.
To be precise, there are many points in this discussion thread that are difficult to even begin critiquing. However, strategy remains the priority. Often, a good strategy is far better than spending a vast amount of time trying to fix every single problem.
If we are to talk about actual issues, improving the current graphics thread efficiency in VRChat and adopting Unity 6’s Compute Shader-based skinned mesh batching, which is currently missing, along with several other technical details, could effectively alleviate existing problems. Of course, Unity’s URP is a complicated subject. The BIRP features that users have long requested are difficult to replicate in URP, which makes pipeline migration extremely arduous. While most requirements have replacements, some needs remain permanently unsolvable.
Ultimately, reality is an extremely complex and coupled system. The only viable path is to observe and combine all variables into graphs and reports for collective trade-offs. Beyond that, there is not much else that can be done.
_____________
Furthermore, I would like to point out some additional issues. Even with the latest technology, the parallelization problem is only mitigated, not fundamentally solved.
Even when scaling up at the same clock frequency, doubling the number of GPCs yields only a 30% to 50% improvement in performance, and in some cases the degree of parallelization can hardly be increased at all (whether based on Mesh Shaders or Compute Shaders).
This scales significantly worse than pixel-stage rendering. It is therefore normal for high-end GPUs to be not much faster even when the workload is GPU-bound. If Vertex Shaders are used, the situation is even worse.
When testing pure Vertex Shader throughput, a graphics card with 11 to 12 GPCs performs almost identically to one with 4 to 6 GPCs.
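One way to read the 30% to 50% figure is through Amdahl's law. The sketch below is my own interpretation under that assumption, not a measurement: it treats the geometry stage as a fixed serial part plus a part that scales perfectly with GPC count, and solves for the parallel fraction. The function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, unit_ratio: float) -> float:
    """Amdahl's law: speedup when the parallel part gets unit_ratio times more units."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / unit_ratio)

def parallel_fraction_from(speedup: float, unit_ratio: float = 2.0) -> float:
    """Invert Amdahl's law: which parallel fraction explains the observed speedup?"""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / unit_ratio)

# Observed range from the text: doubling GPCs gives 1.3x to 1.5x.
p_low = parallel_fraction_from(1.3)
p_high = parallel_fraction_from(1.5)

for label, p in (("1.3x", p_low), ("1.5x", p_high)):
    ceiling = 1.0 / (1.0 - p)  # limit with an unlimited number of GPCs
    print(f"{label} on doubling: parallel fraction {p:.2f}, "
          f"serial fraction {1.0 - p:.2f}, speedup ceiling {ceiling:.2f}x")
```

Under this assumed model, roughly one third to one half of the geometry work would be serial, which caps the achievable speedup at about 1.9 to 3 times no matter how many GPCs are added. Real hardware is not this clean, so treat the numbers as an intuition aid only.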
________
Additionally, on AMD GPUs the latency might be even higher due to more complex culling and front-end designs.
Their utilization might also be relatively worse, requiring a massive number of triangles or vertices to reach comparable performance.
Otherwise, they are usually another 10% to 20% slower, or more, in typical scenarios. In summary, all aspects must be considered together, as there are simply too many factors to take into account.