http://dpad.gotfrag.com/portal/story/35372/?spage=9

Now the 360’s GPU is one impressive piece of work, and I’ll say from the get-go that it’s much more advanced than the PS3’s GPU, so I’m not sure where to begin, but I’ll start with what Microsoft said about it. Microsoft said Xenos was clocked at 500MHz and that it had 48-way parallel floating-point dynamically-scheduled shader pipelines (48 unified shader units, or pipelines), along with a polygon performance of 500 million triangles a second.
Before going any further I’ll clarify this 500 million triangles a second claim. Can the 360’s GPU actually achieve this? Yes it can, BUT there would be no pixels or color at all. It’s the triangle setup rate for the GPU, and it isn’t surprising it has such a high setup rate given it has 48 shader units capable of performing vertex operations, whereas all other released GPUs can only dedicate 8 shader units to vertex operations. The PS3 GPU’s triangle setup rate at 550MHz is 275 million a second, and at 500MHz it would be 250 million a second. This is just the setup rate, so do NOT expect to see games with such an excessive number of polygons, because it won’t happen.
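Just to sanity-check those setup figures, here’s the arithmetic. The triangles-per-clock rates are inferred from the numbers quoted above, not official specs:

# Rough check of the triangle setup rates quoted above (inferred, not official specs).
xenos_clock_hz = 500e6        # 500MHz
rsx_clock_hz = 550e6          # 550MHz
xenos_tris_per_clock = 1.0    # one triangle set up per clock
rsx_tris_per_clock = 0.5      # half a triangle per clock
print(xenos_clock_hz * xenos_tris_per_clock)   # 500 million triangles/s
print(rsx_clock_hz * rsx_tris_per_clock)       # 275 million triangles/s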
Microsoft also says it can achieve a pixel fillrate of 16 gigasamples per second. The GPU inside the Xbox 360 is essentially an early ATI R600, which when released by ATI for the PC will be a DirectX 10 GPU. Xenos manages to meet many of the requirements that would qualify it as a DirectX 10 GPU, but falls short in others. What I found interesting was that Microsoft said the 360’s GPU could perform 48 billion shader operations per second back in 2005. However, Bob Feldstein, VP of engineering for ATI, made it very clear that the 360’s GPU can perform 2 of those shader operations per cycle, so the 360’s GPU is actually capable of 96 billion shader operations per second.
# 48 shader units * 4 ops per cycle = 192 shader ops per clock
# Xenos clocked at 500MHz * 192 shader ops per clock = 96 billion shader ops per second.
(Did anyone notice that each shader unit on the 360’s GPU doesn’t perform as many ops per pipe as the RSX? The 360 GPU makes up for it with a superior architecture: many more pipes that operate more efficiently, along with more bandwidth.)
Did Microsoft just make a mistake, or did they purposely misrepresent their GPU to lead Sony on? The 360’s GPU is revolutionary in the sense that it’s the first GPU to use a unified shader architecture. According to developers this is as big a change as when the vertex shader was first introduced, and even then the inclusion of the vertex shader was merely an add-on, not a major change like this. The 360’s GPU also has a daughter die right there on the chip containing 10MB of EDRAM. This EDRAM has a framebuffer bandwidth of 256GB/s, which is more than 5 times what the RSX or any PC GPU has for its framebuffer (even higher than the G80’s framebuffer).
Thanks to the efficiency of the 360 GPU’s unified shader architecture and this 10MB of EDRAM, the GPU is able to achieve 4XFSAA at no performance cost. ATI and Microsoft’s goal was to eliminate memory bandwidth as a bottleneck, and they seem to have succeeded. PC gamers will have noticed that when they turn on features such as AA or HDR, performance drops; that’s because those features eat bandwidth, so the efficiency of the GPU’s operation decreases as they are turned on. On the 360, HDR plus 4XAA simultaneously is like nothing to the GPU with proper use of the EDRAM. The EDRAM contains a 3D logic unit with 192 floating point unit processors inside. The logic unit can exchange data with the 10MB of RAM at 2 terabits a second. Things such as antialiasing, computing Z depths, or occlusion culling can happen on the EDRAM without impacting the GPU’s workload.
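For what it’s worth, that 2 terabit/s figure is just the 256GB/s framebuffer bandwidth expressed in different units:

# 256 GB/s expressed in terabits per second
edram_bandwidth_gb_s = 256
print(edram_bandwidth_gb_s * 8 / 1000.0)   # ~2.05 Tbit/s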
Xenos writes to this EDRAM for its framebuffer and is connected to it via a 32GB/s link (this number is extremely close to the theoretical peak because the EDRAM sits right there on the 360 GPU’s daughter die). Don’t forget the EDRAM itself has a bandwidth of 256GB/s, and dividing that 256GB/s by the 32GB/s link from Xenos to the EDRAM shows that Xenos effectively multiplies its framebuffer bandwidth by a factor of 8 when processing pixels that make use of the EDRAM, which includes HDR, AA and other things. This leads to a maximum of 32*8=256GB/s which, to say the least, is a very effective way of dealing with bandwidth-intensive tasks.
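Putting rough numbers on that, as a sketch using only the figures from this article:

# Effective framebuffer bandwidth when pixel work is resolved on the eDRAM die.
gpu_to_edram_gb_s = 32        # Xenos -> eDRAM daughter-die link
passes_on_edram = 8           # times the data can be worked on by the eDRAM logic per transfer
effective_gb_s = gpu_to_edram_gb_s * passes_on_edram
print(effective_gb_s)         # 256 GB/s effective for AA, HDR, Z and other pixel work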
In order for this to be possible, developers need to set up their rendering engine to take advantage of both the EDRAM and the available onboard 3D logic. If anyone is confused about why the 32GB/s is being multiplied by 8, it’s because once data travels over the 32GB/s bus it can be processed 8 times by the EDRAM logic against the EDRAM memory at a rate of 256GB/s, so for every 32GB/s you send over, 256GB/s gets processed. This leaves the RSX at a bandwidth disadvantage compared to Xenos. Needless to say, the 360 not only has an overabundance of video memory bandwidth, it also has impressive memory-saving features. For example, 720p with 4XFSAA on a traditional architecture would require about 28MB of memory; on the 360 only 16MB is required. There are also features in the 360's Direct3D API that let developers fit two 128x128 textures into the space normally required for one, for example. So even with all the memory and all the memory bandwidth, they are still very mindful of how it’s used.
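The 28MB figure for a traditional architecture checks out if you assume a 32-bit colour sample plus a 32-bit Z/stencil sample per AA sample (my assumption for the sake of the arithmetic; the 16MB figure for the 360 is the article’s, not something this math produces):

# 720p with 4x multisampling on a traditional GPU,
# assuming 4 bytes colour + 4 bytes Z/stencil per sample.
width, height, aa_samples = 1280, 720, 4
bytes_per_sample = 4 + 4
framebuffer_bytes = width * height * aa_samples * bytes_per_sample
print(framebuffer_bytes / (1024.0 * 1024.0))   # ~28.1 MB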
I wasn’t too clear earlier on the difference between the RSX’s dedicated pixel and vertex shader pipelines and the 360’s unified shader architecture. The 360 GPU has 48 unified pipelines capable of accepting either pixel or vertex shader operations, whereas with the older dedicated-pipeline architecture that RSX uses, when you are in a vertex-heavy situation most of the 24 pixel pipes go idle instead of helping out with vertex work.
Or on the flip side, in a pixel-heavy situation those 8 vertex shader pipelines just sit idle and don’t help out the pixel pipes (because they aren’t able to). With the 360’s unified architecture, in a vertex-heavy situation for example, none of the pipes go idle. All 48 unified pipelines are capable of helping with either pixel or vertex shader operations when needed, so efficiency is greatly improved and so is overall performance. When pipelines are forced to go idle because they lack the capability to help another set of pipelines accomplish their task, it’s detrimental to performance. This inefficient manner is how all current GPUs operate, including the PS3's RSX: the pipelines go idle because the pixel pipes aren't able to help the vertex pipes accomplish a task, or vice versa.

What’s even more impressive about this GPU is that it determines the balance of how many pipelines to dedicate to vertex or pixel shader operations at any given time all by itself; a programmer is NOT needed to handle any of this, the GPU takes care of it in the quickest, most efficient way possible. 1080p is not a smart resolution to target in any form this generation, but if 360 developers wanted to get serious about 1080p, then thanks to Xenos they could actually outperform the PS3 at 1080p. (The less efficient GPU always shows its weaknesses against the competition at higher resolutions, so the best way for the RSX to be competitive is to stick to 720p.) In vertex-shader-limited situations the 360’s GPU will literally be 6 times faster than RSX.

With a unified shader architecture things are much more efficient than previous architectures allowed (which is extremely important). The 360’s GPU, for example, is 95-99% efficient with 4XAA enabled. With traditional architectures there are design-related roadblocks that prevent such efficiency. To avoid such roadblocks, which held back previous hardware, the 360 GPU design team created a complex system of hardware threading inside the chip itself. In this case, each thread is a program associated with the shader arrays. The Xbox 360 GPU can manage and maintain state information on 64 separate threads in hardware. There's a thread buffer inside the chip, and the GPU can switch between threads instantaneously in order to keep the shader arrays busy at all times.
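Here’s a toy model of why idle pipes hurt so much. The workload numbers are made up purely for illustration; the point is that with a fixed 8 vertex / 24 pixel split, a vertex-heavy stretch leaves most of the chip waiting, while a unified pool keeps every pipe busy:

# Toy comparison: time to chew through a vertex-heavy stretch of a frame.
# Work amounts are arbitrary "pipe-cycles", not measured data.
vertex_work, pixel_work = 900.0, 100.0

# Dedicated pipes (RSX-style 8 vertex + 24 pixel): each pool only does its own job,
# so the stretch takes as long as the slower pool.
dedicated_time = max(vertex_work / 8, pixel_work / 24)    # 112.5 cycles, pixel pipes mostly idle

# Unified pipes (Xenos-style pool of 48): any pipe takes whichever job is waiting.
unified_time = (vertex_work + pixel_work) / 48            # ~20.8 cycles

print(dedicated_time, unified_time)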
Want to know why Xenos doesn’t need as much raw horsepower to outperform something like the X1900XTX or the 7900GTX? It makes up for having less raw horsepower by actually being efficient enough to fully achieve its advertised performance numbers, which is an impressive feat. The X1900XTX has a peak pixel fillrate of 10.4 gigasamples a second, while the 7900GTX has a peak pixel fillrate of 15.6 gigasamples a second. Neither of them is actually able to achieve and sustain those peak fillrate numbers, though, because they aren’t efficient enough; they get away with it in this case since they can also bank on all that raw power. The performance winner between the 7900GTX and the X1900XTX is actually the X1900XTX despite its lower pixel fillrate (especially at higher resolutions), because it has twice as many pixel pipes and is the more efficient of the two. It’s a testament to how important efficiency is.

So how exactly can the 360’s GPU stand up to both of those with only a 128-bit memory interface and a 500MHz clock? With 4XFSAA enabled, the 360 GPU achieves AND sustains its peak fillrate of 16 gigasamples per second, thanks to the combination of the unified shader architecture and the excessive amount of bandwidth, which gives it the kind of efficiency that allows it to outperform GPUs with far more raw horsepower. I guess it also helps that it’s the single most advanced GPU currently available for purchase anyway.

Things get even better when you factor in Xenos’ MEMEXPORT ability, which enables “streamout” and opens the door for Xenos to achieve DX10-class functionality. It’s a shame Microsoft chose to disable Xenos’ other 16 pipelines to improve yields and keep costs down. Not many are even aware that the 360’s GPU has the exact same number of pipelines as ATI’s unreleased R600, but to keep costs down and to make the GPU easier to manufacture, Microsoft chose to disable one of the shader arrays containing 16 pipelines. What MEMEXPORT does is expand the graphics pipeline in a more general-purpose and programmable manner.
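For reference, those peak fillrate figures are just core clock multiplied by the number of samples written per clock; the per-clock values below are inferred from the peak numbers above and the commonly published clocks for these parts, so treat them as approximations:

# Peak fillrate = core clock (MHz) * samples written per clock, in gigasamples/s.
def peak_fillrate_gsamples(clock_mhz, samples_per_clock):
    return clock_mhz * samples_per_clock / 1000.0

print(peak_fillrate_gsamples(650, 16))      # X1900XTX: 10.4
print(peak_fillrate_gsamples(650, 24))      # 7900GTX: 15.6
print(peak_fillrate_gsamples(500, 8 * 4))   # Xenos: 8 ROPs x 4 AA samples = 16.0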
On MEMEXPORT, I’ll borrow a quote from Dave Baumann since he explains it rather well.
“With the capability to fetch from anywhere in memory, perform arbitrary ALU operations and write the results back to memory, in conjunction with the raw floating point performance of the large shader ALU array, the MEMEXPORT facility does have the capability to achieve a wide range of fairly complex and general purpose operations; basically any operation that can be mapped to a wide SIMD array can be fairly efficiently achieved and in comparison to previous graphics pipelines it is achieved in fewer cycles and with lower latencies. For instance, this is probably the first time that general purpose physics calculation would be achievable, with a reasonable degree of success, on a graphics processor and is a big step towards the graphics processor becoming much more like a vector co-processor to the CPU.”
Even with all of this information, there is still a lot more about this GPU that ATI simply isn't revealing, and considering they'll be borrowing technology used to design this GPU in their future PC products, can you really blame them?