Programmer Timothy Lottes, who created Fast Approximate Anti-Aliasing (FXAA) at NVIDIA, which was first used in The Elder Scrolls V: Skyrim, weighs in on the PS4 and what it could mean for gaming from a technical standpoint.
Over on his blog, Lottes postulates on a very technical level, but since the info we have now has not been confirmed by Sony, his formulations could end up being wrong.
With that said, keep in mind that what he discusses below is based on the assumption that the specs for the PS4 are those out in the open right now.
According to Lottes, the “real” reason to get excited for the PS4 is what Sony is doing with the console’s Operating System (OS) and system libraries as a platform. If the PS4 has a real-time OS with “libGCM”-style low-level access to the GPU, then PS4 first-party titles will be “years ahead” of the PC, because that kind of access opens up what is possible with the GPU. He notes that this won’t happen right away at launch, but once developers tool up for the platform, this will be the case.
Assuming a 7970M in Orbis (PS4), for which AMD has already released the hardware ISA docs publicly, Lottes comes up with a logical hypothesis about what programmers might have access to on the PS4.
Below, Lottes takes a look at what isn’t provided on PC but can be found in AMD’s GCN ISA documents.
Dual Asynchronous Compute Engines (ACE) :: Specifically "parallel operation with graphics and fast switching between task submissions" and "support of OCL 1.2 device partitioning". Sounds like at a minimum a developer can statically partition the device such that graphics and compute can run in parallel. For a PC, a static partition would be horrible because of the different GPU configurations to support, but for a dedicated console, this is all you need. This opens up a much easier way to hide small compute jobs in a sea of GPU-filling graphics work like post processing or shading (a rough partitioning sketch follows this list).
Dual High Performance DMA Engines :: Developers would get access to do async CPU->GPU or GPU->CPU memory transfers without stalling the graphics pipeline, and specifically the ability to control semaphores in the push buffer(s) to ensure no stalls and low-latency scheduling. This is something the PC APIs get horribly wrong, as all memory copies are implicit without really giving control to the developer. This translates to much better resource streaming on a console (an async-copy sketch follows this list).
Support for up to 6 Audio Streams :: HDMI supports audio, so the GPU actually outputs audio, but no PC driver gives you access. The GPU shader is in fact the ideal tool for audio processing, but on the PC you need to deal with the GPU->CPU latency wall (which can be worked around with pinned memory), and to add insult to injury the PC driver then simply copies that data back to the GPU for output, adding more latency. In theory, on something like a PS4, one could just mix audio on the GPU directly into the buffer being sent out over HDMI (a mixing-kernel sketch follows this list).
Global Data Store :: AMD has no way of exposing this in DX, and in OpenGL it is only exposed in the ultra-limited form of counters which can only increment or decrement by one. The chip has 64KB of this memory, effectively with the same access as shared memory (atomics and everything) and lower latency than global atomics. This GDS unit can be used for all sorts of things, like workgroup-to-workgroup communication, global locks, or doing an append or consume to an array of arrays where each thread can choose a different array, and so on.
Re-used GPU State :: On a console with low level hardware access (like the PS3) one can pre-build and re-use command buffer chunks. On a modern GPU, one could even write or modify pre-built command buffer chunks from a shader. This removes the cost associated with drawing, pushing up the number of unique objects which can be drawn with different materials.
FP_DENORM Control Bit :: On the console one can turn off both DX's and GL's forced denormal flush-to-zero mode for 32-bit floating point in graphics. This enables easier ways to optimize shaders, because integer-limited shaders can use the floating-point pipes via denormals (a small denormal example follows this list).
128-bit to 256-bit Resource Descriptors :: With GCN, all that is needed to define a buffer's GPU state is to set 4 scalar registers to a resource descriptor, and similarly for textures (up to 8 scalar registers, plus another 4 for the sampler). The scalar ALU on GCN supports block fetch of up to 16 scalars with a single instruction from either memory or from a buffer. It looks to be trivially easy on GCN to do bindless buffers or textures for shader load/stores. Note this scalar unit also has its own data cache. Changing textures or surfaces from inside the pixel shader looks to be easily possible. Note shaders still index resources using an instruction immediate, but the descriptor referenced by this immediate can be changed. This could help remove the traditional draw-call-based material limit.
Full Cache Flush Control :: DX has only implicit, driver-controlled cache flushes: the driver needs to be conservative, track all dependencies (high overhead), then assume a conflict and always flush caches. On a console, the developer can easily skip cache flushes when they are not needed, leading to more parallel jobs and higher performance (overlapping execution of things which under DX would be separated by a wait for the machine to go idle).
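To make the "OCL 1.2 device partitioning" point concrete, here is a minimal sketch of what a static split looks like through OpenCL 1.2's device-fission API on the PC. It is illustrative only: the compute-unit counts are made-up numbers, and most desktop GPU drivers simply refuse the request, which is part of Lottes' argument for why a fixed console makes this practical.

```c
/* Sketch: statically partitioning an OpenCL 1.2 device into two sub-devices,
 * one for "graphics-like" work and one for compute. Illustrative only --
 * most desktop GPU drivers do not support device fission; the point is what
 * a fixed split looks like when the hardware/driver allows it. */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Ask for two sub-devices with a fixed number of compute units each.
     * The counts (12 and 6) are made up for illustration. */
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS,
        12, 6,
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
        0
    };
    cl_device_id sub[2];
    cl_uint num_sub = 0;
    cl_int err = clCreateSubDevices(device, props, 2, sub, &num_sub);
    if (err != CL_SUCCESS) {
        printf("Device fission not supported here (err %d)\n", err);
        return 1;
    }
    printf("Created %u sub-devices: one for graphics-style work, one for compute\n",
           num_sub);
    return 0;
}
```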
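For the DMA point, the closest a PC developer gets today is explicit non-blocking copies plus events. The sketch below assumes an already-created OpenCL context, queue and kernel; the name stream_and_dispatch, the host_data pointer and the buffer size are placeholders. What it shows is the dependency being expressed by the developer rather than hidden inside the driver.

```c
/* Sketch: developer-visible async upload using OpenCL's non-blocking copies
 * and events, standing in for the push-buffer semaphores Lottes describes. */
#include <CL/cl.h>

void stream_and_dispatch(cl_context ctx, cl_command_queue queue,
                         cl_kernel kernel, const void *host_data) {
    const size_t bytes = 4 * 1024 * 1024;  /* arbitrary example size */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);

    /* Non-blocking write: the call returns immediately and completion is
     * tracked by the 'uploaded' event instead of stalling the queue. */
    cl_event uploaded;
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_data,
                         0, NULL, &uploaded);

    /* The kernel waits only on the one event it actually depends on. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global = 65536;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           1, &uploaded, NULL);

    clFinish(queue);           /* wait here only to keep the sketch simple */
    clReleaseEvent(uploaded);
    clReleaseMemObject(buf);
}
```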
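For the audio item, the kernel below is a guess at the kind of GPU-side mixing Lottes has in mind, written in OpenCL C; mix_streams, the stream layout and the hard clip are illustrative choices, not anything Sony or AMD have described. On a PC the mixed buffer would still have to round-trip through the CPU before reaching the audio output, which is exactly the latency he points at.

```c
/* Sketch (OpenCL C): mix a handful of source streams into one output block. */
__kernel void mix_streams(__global const float *streams, /* [num_streams][frames] */
                          __global float *out,           /* [frames] */
                          const int num_streams,
                          const int frames)
{
    int i = get_global_id(0);          /* one work-item per output sample */
    if (i >= frames) return;

    float acc = 0.0f;
    for (int s = 0; s < num_streams; ++s)
        acc += streams[s * frames + i];

    out[i] = clamp(acc, -1.0f, 1.0f);  /* simple hard clip to stay in range */
}
```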
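And for the FP_DENORM bit: the GPU-side control isn't reachable from a PC graphics API, but the same idea exists in the CPU's SSE control register, so here is a small analogy (assuming x86-64 with SSE math) showing what flushing denormals to zero does to a result.

```c
/* Sketch: what "flush denormals to zero" means, shown with the analogous CPU
 * control bits (SSE MXCSR). With FTZ/DAZ on, results below FLT_MIN become 0. */
#include <stdio.h>
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

int main(void) {
    volatile float tiny  = 1.0e-38f;  /* near the bottom of the normal range */
    volatile float scale = 1.0e-3f;

    /* Default mode: the product (~1e-41) is kept as a denormal. */
    printf("default:       %g\n", (double)(tiny * scale));

    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    /* Flush-to-zero mode: the same product is flushed to 0. */
    printf("flush-to-zero: %g\n", (double)(tiny * scale));
    return 0;
}
```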
Regarding GPU assembly, Lottes isn’t sure whether GCN has hidden, very complex rules for code generation or compiler scheduling. He adds that if Sony opens up the GPU assembly, unlike on the PS3, developers might “easily crank” out an extra 30% from hand-tuning shaders.
Lottes ends by guessing that launch titles will be DX11 ports, roughly on par with what we have on PCs now. But if Sony provides a real-time OS with a libGCM-style API for GCN, then in one to two years’ time Sony’s first-party studios will have had enough time to build up tech that really shows what the PS4 can do.
Again, Lottes is not privy to any inside info and is just basing this on what’s out there. If these assumptions hold true – and I wouldn’t bet against it – I imagine the PS4 will be one heck of a console.