PS3 Hardware:
The Playstation 3 is a gaming console (or computer system) that utilizes a Cell processor with 7 operational SPEs and access to 256MB of XDR RAM, an RSX graphics chip with 256MB of GDDR3 RAM plus access to the Cell’s main memory, a Blu-ray drive for gaming and movie playback, and a 2.5” hard disk drive. Other components of the system are Bluetooth support used for the wireless motion-sensing controllers, 4 USB ports, and a gigabit Ethernet port. The more expensive version of the Playstation 3 also has a Sony Memory Stick reader, a Compact Flash reader, an SD card reader, WiFi support (essentially a second, wireless network interface), and an HDMI output.

The Playstation 3 is capable of outputting 1080p signals through all of its outputs, though with Blu-ray movie playback it is possible for a flag (the Image Constraint Token, ICT) to be present on a disc, which forces down-sampling of 1080p to 540p if the signal goes through a non-certified interface (anything other than HDMI).

The Playstation 3’s audio is handled by the Cell processor. Many codecs representing high quality digital entertainment formats are supported, but since decoding is done on the Cell processor, game developers are free to output any format they wish. This means 6.1, 7.1, 8.1, or even higher channel audio is possible, unless a later change actually restricts what can be output.

The Cell Processor:
The Cell inside the Playstation 3 is an 8 core asymmetrical CPU. It consists of one Power Processing Element (PPE) and 7 Synergistic Processing Elements (SPEs). Each of these elements is clocked at 3.2GHz, and they are connected by a 4-ring Element Interconnect Bus (EIB) capable of a peak bandwidth of ~204.8GB/s. Every processing element on the bus has its own memory flow controller and direct memory access (DMA) controller. The other elements on the bus are the memory controller for the 256MB of XDR RAM, and the Rambus FlexIO input/output controller (FlexIO).

The FlexIO bus is capable of ~60GB/s of bandwidth. A massive chunk of this bandwidth is allocated to communication with the RSX graphics chip, and the remainder is where the southbridge elements lie, such as sound, optical media (Blu-ray/DVD/CD), the network interface card, the hard drive, USB, the memory card readers, Bluetooth devices (controllers), and WiFi. This may sound like a lot to share with the RSX, but consider that aside from the RSX, the other components use bandwidth on the MB/s scale, not GB/s, so even if you add all of them up there is still plenty of bandwidth left.

I actually recommend you skip down to the Xbox360 hardware comparison and look at the Cell and Playstation 3 hardware diagrams before you continue reading so you get a better idea of how things come together on the system as I explain it.

Power Processing Element:
The PPE is based on IBM’s POWER architecture. It is a general purpose RISC (reduced instruction set computer) core clocked at 3.2GHz, with a 32KB L1 instruction cache, a 32KB L1 data cache, and a 512KB L2 cache. It is a 64-bit processor with the ability to fetch four instructions and issue two in one clock cycle. It is also able to handle two hardware threads. It comes with a VMX (AltiVec) vector unit with 32 registers. The PPE is an in-order processor with delayed execution and limited out-of-order support for load instructions.

PPE Design Goals:
The PPE is designed to handle the general purpose workload for the Cell processor. While the SPEs are capable of executing general purpose code, they are not the best suited to do so. Compared to Intel/AMD chips, the PPE isn’t as fast for general purpose computing, considering its in-order architecture and comparably less complex branch prediction hardware. This will likely prevent the Cell from replacing or competing with Intel/AMD chips on desktops, but in the console and multimedia world, the PPE is more than capable of keeping up with the general purpose code used in games and household devices. The Playstation 3 will not be running MS Word.

The PPE is also simplified to save die space and improve power efficiency with less heat dissipation. This also allows the processor to be clocked at higher rates. To compensate for some of the hardware shortcomings of the PPE, IBM is making an effort to improve compiler-generated code to better exploit instruction level parallelism. This would reduce the penalties of in-order execution.

The VMX unit on the PPE is a SIMD unit. This gives the PPE some vector processing ability, but as you’ll read in the next section, the SPEs are better equipped for vector processing tasks. The vector unit on the PPE is probably there for the case where a task that is better run on the PPE needs some vector computations, but wouldn’t perform better overall on an SPE, or where handing that specific chunk of work off to an SPE would bring in more communication overhead than the move is worth.

Synergistic Processing Element and the SIMD paradigm:
The SPEs are the computing powerhouses of the Cell processor. They are independent vector processors running at 3.2GHz. A vector processor is also known as a single instruction, multiple data (SIMD) processor. This means that a single instruction, let’s say addition, can be performed in one cycle on more than one set of operands, effectively adding pairs, triples, or quadruples of numbers in one cycle instead of taking 4 cycles in sequence. Here is an example of the different approaches to a sample problem: adding the numbers 1 and 2 together, 3 and 4, 5 and 6, and 7 and 8 to produce 4 different sums.
On a traditional desktop CPU (scalar), the instructions are handled sequentially.
Code:

1. Do 1 + 2 -> Store result somewhere
2. Do 3 + 4 -> Store result somewhere
3. Do 5 + 6 -> Store result somewhere
4. Do 7 + 8 -> Store result somewhere

On a vector/SIMD CPU, the instruction is issued once and executed simultaneously for all operands.
Code:

1. Do [1, 3, 5, 7] + [2, 4, 6, 8] -> Store result vector [3, 7, 11, 15] somewhere

You can see how SIMD processors can outdo scalar processors by an order of magnitude when computations are parallel. The situation changes when the task isn’t parallel, as in the case of adding a chain of numbers like 1 + 2 + 3. Quite simply, a processor has to get the result of 1 + 2 before adding 3 to it, and nothing can avoid the fact that these are 2 operations that cannot occur simultaneously. Just to get your mind a bit deeper into this paradigm, consider 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8. On the surface, you might count 7 operations to solve this problem, assuming each sum has to be calculated before moving forward. However, if you try to SIMD-ize it, you realize it actually takes only 3 operations. Allow me to walk you through it:
Code:

1. Do [1, 3, 5, 7] + [2, 4, 6, 8] -> Store results in two vectors [SUM1, SUM2, 0, 0] and [SUM3, SUM4, 0, 0]
2. Do [SUM1, SUM2, 0, 0] + [SUM3, SUM4, 0, 0] -> Store results in two vectors [SUM5, 0, 0, 0] and [SUM6, 0, 0, 0]
3. Do [SUM5, 0, 0, 0] + [SUM6, 0, 0, 0] -> Store final result vector [SUM7, 0, 0, 0]

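To make this concrete, here is a minimal C sketch of the same tree reduction. It uses GCC’s portable vector extensions as a stand-in for the SPE’s actual intrinsics, so take the syntax as an assumption about the toolchain rather than as Cell-specific code:
Code:

#include <stdio.h>

/* 4 packed 32-bit floats, mirroring one 128-bit SIMD register. */
typedef float v4sf __attribute__((vector_size(16)));

int main(void)
{
    v4sf a = {1.0f, 3.0f, 5.0f, 7.0f};
    v4sf b = {2.0f, 4.0f, 6.0f, 8.0f};

    v4sf sums = a + b;            /* step 1: [3, 7, 11, 15] in one add */

    v4sf hi = {0, 0, 0, 0};       /* shuffle the upper half downward   */
    hi[0] = sums[2];
    hi[1] = sums[3];
    v4sf pair = sums + hi;        /* step 2: [14, 22, ?, ?]            */

    v4sf last = {0, 0, 0, 0};
    last[0] = pair[1];
    v4sf total = pair + last;     /* step 3: [36, ?, ?, ?]             */

    printf("%.1f\n", total[0]);   /* prints 36.0 */
    return 0;
}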
Careful inspection of the previous solution shows two flaws. One is the optimization issue of parts of the vector not being used for the operation. Those unused parts of the vector could have been performing operations useful to other parts of the program. It would be a huge investment of time if developers tried to solve this problem manually by filling vectors where their code isn’t already plainly vector based. That type of work is what IBM is placing on compilers: looking into the code for parallelism, specifically instruction level parallelism (ILP).

The other big problem (which I know is there but know less about) is that vector processors naturally store results in a single vector. It would require some interesting misaligned calculations, shifts, and/or copies of data to place the results in a position where they are ready for the next step. I am not too well versed in how this is accomplished or whether the SPEs can do something like this, so I’ll leave it up to further discussion. Upon further research, “vector permute/alignment” seems to be the topic that addresses this problem: the SPE instruction set does appear to allow rearranging elements within and between vectors, which is what cross-element operations like dot products rely on.
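The Cell SDK exposes exactly this through a byte-level shuffle intrinsic. Here is a hedged sketch (assuming the SDK’s spu_intrinsics.h and the spu-gcc compiler) that rotates the four floats of a vector left by one slot – the kind of realignment step the reduction above needs between adds:
Code:

#include <spu_intrinsics.h>

/* Rotate [x0, x1, x2, x3] into [x1, x2, x3, x0]. The pattern vector
   selects result bytes: values 0x00-0x0F pick bytes from the first
   operand, 0x10-0x1F from the second. */
vector float rotate_left_one(vector float v)
{
    vector unsigned char pattern = {
        0x04, 0x05, 0x06, 0x07,   /* element 1 -> slot 0 */
        0x08, 0x09, 0x0A, 0x0B,   /* element 2 -> slot 1 */
        0x0C, 0x0D, 0x0E, 0x0F,   /* element 3 -> slot 2 */
        0x00, 0x01, 0x02, 0x03    /* element 0 -> slot 3 */
    };
    return spu_shuffle(v, v, pattern);
}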

The SPE inside the Playstation 3 sports a 128x128-bit register file (128 registers at 128 bits each), which also leaves a lot of room to unroll loops to avoid branching. At 128 bits per register, an SPE is able to perform operations on 4 operands of 32 bits each. Single precision floating point numbers are 32 bits, which explains why the Playstation 3 sports such high single precision floating point performance. Double precision floating point numbers are 64 bits long and slow processing down by an order of magnitude, because only 2 operands fit inside a vector, and I’m pretty sure it also breaks the SIMD processing ability, since no execution unit can work on 2 double precision floating point numbers at the same time, meaning the SPE performs double precision computations in a scalar fashion.

Quote:
“An SPE can operate on 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4 single precision floating-point numbers in a single clock cycle.”
– Cell microprocessor wiki. That matches up with my prediction pretty much, but I haven’t been able to find any other sources that suggest or state this. It is a very logical explanation.

The important thing to note is that vector processing and vector processors are synonymous with SIMD architectures. Vectorized code is best run on a SIMD architecture, and general purpose CPUs will perform much worse on these types of tasks.

SIMD Applications:
Digital signal processing (DSP) is one of the areas where vector processors are used. I only bring that up because *you know who* would like to claim that it is the only practical application for SIMD architectures.

3D graphics are also a huge application for SIMD processing. A vertex/vector (the terms are used interchangeably in 3D graphics) is a 3D position, usually stored with 4 elements: X, Y, Z, and W. I won’t explain the W because I don’t remember exactly how it’s used myself, but it is there in 3D graphics. Processing many vertices would be very slow on a traditional CPU, which would have to process each element of the vector individually instead of processing the whole thing simultaneously. Needless to say, GPUs most definitely have many SIMD units (possibly even MIMD), which is why they vastly outperform CPUs in this respect. Operations done on the individual components of a vertex are independent of each other, which makes the SIMD paradigm an optimal way to operate on them.
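As a tiny illustration (again in portable C with GCC’s vector extensions, not any particular GPU’s instruction set), translating a vertex touches all four components with a single vector add instead of four scalar adds:
Code:

/* One 128-bit register holds a whole vertex: X, Y, Z, W. */
typedef float v4sf __attribute__((vector_size(16)));

/* Each component is independent, so one SIMD add moves the vertex. */
v4sf translate(v4sf vertex, v4sf offset)
{
    return vertex + offset;
}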

To put this in context, I don’t know if any of you remember 3D computer gaming on low end versus high end computers between 1995 and 2000. Although graphics accelerators were out, some of them didn’t have “Hardware T&L” (transform and lighting). If you recall games that had the option to turn this on or off (assuming you had it in hardware), you could see the huge speed difference between doing it in hardware and not. The software version also looked worse, since developers generally tried to hide the speed difference by using less accurate algorithms/models. It is this type of workload that the Cell is actually equipped to do relatively well, and that traditional scalar CPUs would still perform vastly worse at.

It is worthwhile to note that “hardware” in the case of 3D graphics generally refers to things done on the GPU, and “software” just means it is running on the CPU – even though both are pieces of hardware executing the commands in the end. “Software” just refers to the part that is controlled by the code the programmer writes.

There are image filter algorithms in applications like Adobe Photoshop that are better executed by vector processors too. Many simulations that run on supercomputers are also better suited to the SPEs (toned down in accuracy as appropriate for gaming). Some of these simulations include cloth simulation, terrain generation, physics, and particle effects.

SPE Design Goals – no cache, such a small memory, branch prediction?
The SPEs don’t have a cache in the traditional sense of it being under hardware control. Instead, each SPE has 256KB of on-chip, software controlled SRAM known as its local store. It reeks of the acronym “RAM”, but it offers latency similar to that of a cache, and in fact some caches are implemented using the exact same hardware – for all practical purposes, this memory is a software controlled cache.

Having this memory under software control places the work of managing the flow of data in and out of the local store on the compiler tools or the programmer. For games programming, this is actually generally the better approach if performance is a high priority. Traditional caches have the downside of non-deterministic access times. If a program accesses memory that is found in cache (a cache hit), the latency is only around 5-20 cycles and not much time is lost. If the memory is not found in cache (a cache miss), the latency is in the hundreds of cycles. This variance in performance is very undesirable in games, as steady frame rates are much more visually pleasing than variable ones.

IBM is placing importance on compiler technology to manage the local store well, unless an application wishes to take explicit control of this memory itself (which higher end games will probably end up doing). If it is accomplished by compilers, then to a programmer the local store is effectively a cache either way, since they don’t have to do anything to manage it.

The local store holds both the code and the data for an SPE. That makes its size seem extremely limited, but rest assured that code size is generally small, especially on SIMD architectures where the data is going to be much larger than the code. Additionally, the SPEs are all connected to the other elements at extremely high speed through the EIB, so the idea is that even though the memory is small, data will flow in and out of it very quickly. To better handle that, the SPE is also a dual-issue processor that can simultaneously issue instructions to an execution pipe and to a load/store pipe. Basically, this means the SPE can perform computations on data while loading in new data and moving out processed data, as in the double buffering sketch below.
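Here is a hedged sketch of what that flow looks like in practice – classic double buffering written against the Cell SDK’s MFC calls from spu_mfcio.h. The chunk size, the process() work function, and the effective-address layout are all assumptions for illustration:
Code:

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 4096  /* bytes per DMA transfer (illustrative size) */

/* Two local store buffers: compute on one while DMA fills the other. */
static float buf[2][CHUNK / sizeof(float)] __attribute__((aligned(128)));

/* Hypothetical work function standing in for real processing. */
static void process(float *data, int n)
{
    for (int i = 0; i < n; i++)
        data[i] *= 2.0f;
}

void stream(uint64_t ea, int chunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);    /* start first transfer */

    for (int i = 0; i < chunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < chunks)                     /* prefetch next chunk  */
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);           /* wait only for the    */
        mfc_read_tag_status_all();              /* buffer we need now   */

        process(buf[cur], CHUNK / sizeof(float));
        cur = next;
    }
}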

The SPEs have no branch prediction hardware except for a branch-target buffer, coupled with numerous branch hint instructions that avoid the penalties of branching through software controlled mechanisms. Just to be clear right here – this information comes from the Cell BE Programming Handbook itself and thus overrides the numerous sources that have generally said “SPEs have no branch prediction hardware.” It’s there, but it is very limited and is controlled by software rather than hardware, similar to how the local store is controlled by software and is thus not called a “cache” in the traditional sense.
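In source code a developer rarely writes those hint instructions by hand; the compiler takes cues instead. A minimal sketch, assuming spu-gcc, which can turn GCC’s standard __builtin_expect() hint into hinted branches:
Code:

#include <stddef.h>

/* Sum only the positive values. The hint marks that as the common
   case so the cheap, fall-through path is the one usually taken. */
float sum_positive(const float *v, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (__builtin_expect(v[i] > 0.0f, 1))
            sum += v[i];
    }
    return sum;
}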

How the Cell “Works”:
This could get very detailed if I really wanted to explain every little thing about the inner workings of the Cell. In the interest of time, I will only mention some of the key aspects so you may get a better understanding of what is and isn’t possible on the Cell.

There are 11 major elements connected to the EIB in the Cell: 1 PPE, 8 SPEs, 1 FlexIO controller, and 1 memory controller. In the Playstation 3’s setup, one SPE is disabled, so there are only 10 operational elements. When any of these elements needs to send data or commands to another element on the bus, it sends a request to an arbiter that manages the EIB. The arbiter decides what ring to put the data on, and when, to efficiently distribute resources and avoid contention. With the exception of the memory controller (connected to RAM), any of the elements on the EIB can make requests to read or write data from other elements on the EIB. IBM has actually filed quite a number of patents just on how the EIB makes the most efficient use of its bandwidth. The bandwidth allocation scheme is broken down there in detail; in general, I/O requests are handled with the highest priority.

Each processing element on the Cell has its own memory flow controller. For the PPE, this is transparent, since it is the general purpose processor. A load/store instruction executed on the PPE will go through the L2 cache and ultimately make changes to main system memory without further intervention. Under the hood, though, the PPE’s memory flow controller sets up a request to the EIB’s arbiter to send its data to the memory controller of the system memory. This event is transparent to the load/store instruction on the PPE, so RAM is effectively its main memory. The SPEs operate differently. To an SPE, a load/store instruction works on its local store. The SPE has its own memory flow controller to access system RAM just like the PPE, but it is under software control. This means that programs written for an SPE have to set up requests on their own to read or write the system memory that the PPE primarily uses. These messages can also be used to send data or commands to another element on the EIB.
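One of the simpler messaging mechanisms is the mailbox: small 32-bit messages passed between an SPE and the PPE. A hedged sketch of the SPE side, using the mailbox calls from spu_mfcio.h; the command encoding here is a made-up convention for illustration:
Code:

#include <spu_mfcio.h>
#include <stdint.h>

#define CMD_EXIT 0   /* hypothetical command word from the PPE */

int main(void)
{
    for (;;) {
        uint32_t cmd = spu_read_in_mbox();   /* blocks until PPE writes */
        if (cmd == CMD_EXIT)
            break;

        /* ...do the work the command asks for, e.g. kick off DMA... */

        spu_write_out_mbox(cmd);             /* acknowledge completion */
    }
    return 0;
}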

This is important to remember because it means that all of the elements on the EIB have equal access to any of the hardware connected to the Cell in the Playstation 3. Rendering commands could come from the PPE or an SPE, seeing as both ultimately send commands and/or data to the I/O controller, which is where the RSX is connected. By the same idea, if any I/O device connected through FlexIO needs to read or write system memory, it can send messages directly to the XDR memory controller, or signal the PPE or an SPE instead.

The communication system between elements on the Cell processor is highly advanced and carefully planned out, and it probably constitutes a huge portion, if not most, of the research budget for the Cell processor. It allows extreme performance and flexibility for whoever develops any kind of software for the Cell. There are several new patents IBM has submitted that relate just to how transfers over the EIB are set up. After all, as execution gets faster and faster, the general problem is keeping memory up to speed.

Note: this section is extremely scaled down and simplified. It is to the point where, if you read the Cell BE Handbook, you could say I’m wrong in many places if I implied or suggested that only one method of communication is possible, or if you use my literal word choice against theirs. If you are wondering how something would or should be accomplished on the Cell, you’d have to dive deeper into the problem to figure out which method is best. The messaging system between elements on the EIB is extremely complex and detailed in nature and just can’t be explained in a compact form.

Multithreading?
Threading is simply a word used to describe a sequence of execution. Technically, a single core CPU can handle any number of threads; the issue is that performance drops past a certain point, depending on what the individual tasks are doing. The PPE runs two hardware threads on the same core. This makes communication between those two threads easy, since they use the exact same memory resources. Sharing data between them is only a matter of using the same variables in code and keeping the threads synchronized – a problem that has been thoroughly studied, as the sketch below shows.
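Because both hardware threads live in one address space, the ordinary shared-memory threading model applies directly. A minimal sketch in plain C with POSIX threads – nothing Cell-specific about it, which is exactly the point:
Code:

#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* synchronization is the only cost */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%d\n", shared_counter);   /* 200000: both threads saw the same memory */
    return 0;
}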

On the other hand, the SPEs are more isolated execution cores whose primary memory is their local store. Sharing data between the SPEs and the PPE means putting data on the EIB, which means one of the messaging methods has to be used to get it there. There are various options for this, depending on what needs to be transferred and how both ends use the data. Needless to say, synchronization between code running on the SPEs and the PPE is a harder problem. It is better to think of the code running on separate SPEs as separate programs rather than threads, to scale the synchronization and communication issues appropriately.

That being said, it isn’t an unseen problem, as it is pretty much the same as inter-process communication between programs running on an operating system. Each application thinks it has exclusive access to the hardware; if it becomes aware of other programs running, it has to consider how to send and receive data from those applications too. The only added considerations on the Cell are the hardware implementation details of the various transfer methods, to maximize performance even when more than one method would work.

Programming Paradigms/Approaches for the Cell:
Honestly, the most important thing to mention here is that the Cell is not bound to any paradigm. A developer should assess what the Cell hardware offers and find a paradigm that will either execute fastest, or sacrifice speed for ease of development with a solution that’s just easy to implement. That said, here are some common paradigms that come up in various sources:

PPE task management, SPEs task handling:
This seems the most logical to many, due to the SPEs being the computational powerhouses inside the Cell while the PPE is the general purpose core. The keyword is computational, which should indicate that the SPEs are good for computing tasks, but not all tasks. Tasks general purpose in nature will perform better on the PPE, since it has a cache and branch prediction hardware, making coding for it much easier without having to manage those issues by hand. Limiting the PPE to dictating tasks is stupid if the entire task is general purpose in nature. If the PPE can handle it alone, it should do so and not spend time handing work off to other elements. However, if the PPE is overloaded with general purpose tasks, or needs certain computations that the SPEs are better suited for, it should hand them off to an SPE, as the gain in doing so is worthwhile compared to being bogged down running multiple jobs that could be divided up more efficiently.

Having the PPE fill a task manager role may also mean that all the SPEs report or send their data back to the PPE. This has a negative impact on achievable bandwidth, as the EIB doesn’t perform as well when massive amounts of data all go to a single destination element inside the Cell. This might not happen if the tasks the elements are running talk to other elements, including external hardware devices, main memory, or other SPEs.

SPE Chaining:
This solution basically uses the SPEs in sequence to accomplish the steps of a task, such as decoding audio/video: an SPE sucks in data continuously, processes it continuously, and spits it out to the next SPE continuously, as in the sketch below. The chain can use however many SPEs are available and necessary to complete the task. This setup is considered viable largely because the EIB on the Cell can support massive bandwidth, and because the SPEs can be treated as an array of processors.
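A conceptual sketch of that data flow in plain C – each stage function below would really be a separate program resident on its own SPE, streaming chunks to the next over the EIB, and the stage names are invented for illustration:
Code:

#include <stdio.h>

#define N 8

/* Hypothetical pipeline stages; on the Cell each runs on its own SPE. */
static void stage_decode(float *x, int n) { for (int i = 0; i < n; i++) x[i] += 1.0f; }
static void stage_filter(float *x, int n) { for (int i = 0; i < n; i++) x[i] *= 0.5f; }
static void stage_output(const float *x, int n) { for (int i = 0; i < n; i++) printf("%.2f ", x[i]); }

int main(void)
{
    float chunk[N] = {0};

    /* In one process the chain is just function composition; on the Cell,
       each hand-off would be a DMA transfer into the next SPE's local store. */
    stage_decode(chunk, N);
    stage_filter(chunk, N);
    stage_output(chunk, N);
    putchar('\n');
    return 0;
}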

This setup doesn’t make sense for everything, as dependencies may require that data revisit certain stages more than once rather than simply passing through once and being done. Sometimes, due to dependencies, a certain amount of data has to be received before processing can be completed at all. Lastly, a given element may not produce output that strictly one “next” element needs; some of it may be needed by one element, and more by another.

CPU cooking up a storm before throwing it over the wall:
This honestly was an approach I thought about independently early in my research on the details of the Cell processor. It’s not really a paradigm so much as a thought process, and even the Warhawk designer/producer mentioned an approach like this. The Cell is a really powerful chip and can do a lot of computational work very fast inside the processor. The problem is that bandwidth to components outside the chip brings in communication overheads and bottlenecks. It seems like a less optimal use of computing resources for the PPE to write output to memory and have all of the SPEs pick up work from there, when the PPE can send data directly to the SPEs, removing the bottleneck of them all sharing the 25.6GB/s of bandwidth to system memory. It appears to make the most sense to let the Cell load and process game objects as much as possible before handing them off to the RSX or writing back to memory.

This approach does make sense, but it is by no means a restriction if a game has serious uses and demands for a tight relationship between the Cell and the RSX or other off chip elements throughout the game loop.

Where does the operating system go?
Some sources propose that an operational SPE will be reserved by Sony for the operating system while games are running. As far as I have researched, I have found nothing official to support this being the case on the PS3, other than Ken Kutaragi saying an OS could run on an SPE, and IBM’s papers suggesting various Cell operating system configurations.

The specific configuration of running an OS (kernel only) on an SPE makes sense from a security perspective. I will not explain it in this post, but the Cell does have a security architecture that can let an SPE be secured through hardware mechanisms. Given this ability, if Sony wanted an easy way to protect its operating system from games and homebrew, it would probably resort to running a kernel with light OS features on an SPE.

Otherwise, the short answer is that the OS could run as a tiny thread on the PPE, or on an SPE. Sony will do what has the least impact on gaming and still delivers on the functional requirements of the OS.

The RSX Graphics Chip:
The RSX specs are largely undefined and unknown at this point, and I will refrain from analyzing the clearly unknown aspects too deeply. The only information available has been around since E3 2005 and is likely to have changed since then. Various statements made since that point compare the RSX to other graphics chips nVidia has made. Some press sources have used these statements to analyze the RSX as if they actually knew what it was, or in a speculative manner, but readers should not forget that they simply do not know for sure. I have read a lot of those sources, and I am throwing out the specific execution speed numbers and focusing on the more likely final aspects of the RSX specs.

The only things that can be said with a pretty high degree of certainty are that the RSX will have 256MB of GDDR3 video memory, access to the Cell’s 256MB of XDR memory, and a fixed function shader pipeline – meaning dedicated vertex shader pipelines and dedicated pixel shader pipelines, as opposed to the unified shader architecture of the Xenos in the Xbox360. The RSX will also be connected to the Cell through the FlexIO interface.

Due to the nature of the SPEs on the Cell, there is quite an overlap in function with the vertex processors on the RSX. It is up to the programmer to decide where to accomplish those tasks, depending on the flexibility they need and the resources they have available. The Cell could also handle some post-processing (pixel) effects if the bandwidth is there and each pass through the RSX is relatively quick, but this will most likely not happen, because pixel shading occurs late in the rendering pipeline, and the data would have to be pulled out of the pipeline only to be put back in again.