kohlrak: That's why you need to re-read. I proposed a separate memory for GPU.
dtgreene: Problems with separate memory for GPU:
* Less flexible: if either the CPU or the GPU needs a large amount of memory and the other does not, it can't (easily) use the other's memory.
The memory controller worries about the RAM. You'd have a separate memory controller and even a separate RAM slot. I mean, sure, you can't just go into your BIOS or OS settings and change the value, but it would solve a lot of bottleneck issues. If they did this, you could still have your external slots and gain some of the benefits of the APU, too. The problem is we're still using in and out instructions to mess with the GPU, and the GPU has to wait for other DMA'd devices like the sound card, the network card, etc., to quit playing around at their lower clock rates.
* Slower CPU<->GPU transfers. If memory is shared, the transfer can be fast, and in some cases even a no-op. (For example, one could map GPU memory as CPU-addressable with no overhead at all; the graphics library would just return a pointer and not have to do any copying of memory.) With separate GPU memory, the data has to be transferred, which is slow. (This is the main reason integrated graphics can run faster than dedicated graphics for (real) workloads. Note that software renderers like LLVMpipe have this advantage, but are considerably slower than integrated graphics.)
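For what it's worth, here's a minimal sketch of that mapped-pointer path, assuming an OpenGL 3+ context and a function loader (GLAD here) are already set up; the buffer name and sizes are just placeholders:

```c
#include <glad/glad.h>   /* assumed loader; any GL 3.0+ loader works */
#include <stddef.h>

/* Fill a vertex buffer by writing through a pointer the driver hands back.
 * On a shared-memory (integrated) GPU this can be genuinely zero-copy:
 * the CPU's writes land directly in memory the GPU will read. */
void fill_vbo(GLuint vbo, size_t count)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, count * sizeof(float), NULL, GL_DYNAMIC_DRAW);

    float *p = glMapBufferRange(GL_ARRAY_BUFFER, 0, count * sizeof(float),
                                GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    if (p) {
        for (size_t i = 0; i < count; ++i)
            p[i] = (float)i;   /* generate data in place, no staging copy */
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```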
I don't think you understand what I'm proposing. You see, STM32 chips (ARM microcontrollers) have slow RAM and fast RAM.
Use this image for reference and look for the grey pins labeled A# (there should be 16 of them). It's been a while since I touched mine, but I recall that when you start writing to address 0x60000000, the other pins (PB) go active. So if you write 1 to address 0x60000000, PB0 goes hot, the others stay cold, and the PA pins are cold. If you write 7 to 0x60000006, PA1, PA2, PB0, PB1, and PB2 all go hot, and the others stay cold. The implications of this are rather fascinating, actually. It means that by writing to an address, you set the PB pins with the data, and that address (minus 0x60000000) will then set the PA pins accordingly, allowing you to set a multitude of pins with fewer instructions.
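To make that concrete, here's a rough C sketch of what it looks like from the software side, assuming the external-memory bank at 0x60000000 has already been clocked and configured for an 8-bit bus; the exact GPIO pins the A#/D# lines land on depend on the chip and board, and the init code is omitted:

```c
#include <stdint.h>

#define EXT_BUS_BASE 0x60000000UL   /* external-memory bank base on these parts */

/* One plain store drives the data pins with 'value' and the address pins
 * with the byte offset, so a single instruction sets a whole group of pins.
 * (On a 16-bit bus the external address is the offset shifted right by one.) */
static inline void ext_bus_write(uint32_t offset, uint8_t value)
{
    *(volatile uint8_t *)(EXT_BUS_BASE + offset) = value;
}

int main(void)
{
    /* External-bus and GPIO initialization assumed done elsewhere. */
    ext_bus_write(0x0, 1);  /* data line 0 hot, address lines all cold      */
    ext_bus_write(0x6, 7);  /* data lines 0-2 hot, address lines 1 and 2 hot */
    for (;;) { }
}
```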
This means that these boards, without any modification, can actually share memory with a GPU that's external. Moreover, if, say, you had an external redirector of some sort, you could use pin PA0 to specify whether it goes to the video memory or the current RAM. So, without any modifications to the CPU, this can be redirected to separate RAM.

The practical outlook of this is that, since x86 currently has a 40-bit address bus (it doesn't use the full 64 bits), you could reserve addresses 0x8000000000 and above for VRAM and limit the machine to that amount of RAM for non-video memory. That still affords you 512 gigabytes of RAM. Obviously, this standard wouldn't last forever, but we could always move it as appropriate and leave it to the OS to understand that. Then you can thread the memory controller, with the slow one going into a wait any time 0x8000000000 or higher is selected (which is a single bit for it to look for). Under the hood, that's what these memory controllers are for anyway.

Now with some even smarter finagling, you can set the memory controllers to talk to one another, so that if the disk is DMA'd, you can load stuff directly to VRAM with the CPU temporarily locked out (and this wouldn't happen very often, of course). Once the transfer is complete, the memory controllers quit talking to one another, go back to their multi-threaded mode, and let the CPU control them independently again. As far as the CPU and almost all code are concerned, nothing changed.
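As a toy illustration of that routing rule (not real hardware code), the decode really is just one bit test; the function and constant names here are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VRAM_SPLIT 0x8000000000ULL  /* 512 GiB boundary: bit 39 of the 40-bit address */

/* Returns true if a physical address should be steered to the VRAM controller. */
static bool routes_to_vram(uint64_t phys_addr)
{
    return (phys_addr & VRAM_SPLIT) != 0;  /* a single-bit check, as described above */
}

int main(void)
{
    printf("%d\n", routes_to_vram(0x0000DEADBEEFULL));  /* 0: system RAM */
    printf("%d\n", routes_to_vram(0x80DEADBEEFULL));    /* 1: VRAM       */
    return 0;
}
```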