29 Oct 2013 08:24 AM
Currently we have 3 (or 4) levels of cache, L1-L3 (or L4), which are memory caches (data and instruction).
I've been thinking for some time: how about we include an execution cache, a cache in which you do this:
Address = last 2 bytes of the data (bits 16-31 in 32-bit [bits 48-63 in 64-bit])
Tag = the remaining bytes of the data (bits 0-15 in 32-bit [bits 0-47 in 64-bit])
Data (in cache) = result of the operation
That would greatly decrease the heat, since the result would be stored in memory and wouldn't have to be recalculated over and over again. What do you think about this idea?
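A rough software sketch of the idea (all names and sizes here are illustrative, not from the post): a direct-mapped result cache for a unary operation, where one half of the operand selects the entry and the other half is stored as the tag.

```python
# Toy model of the proposed "execution cache" for a unary 32-bit
# operation. One half of the operand selects the entry (the "address"),
# the other half is stored as the tag. Illustrative sketch only.

INDEX_BITS = 16                      # 16 operand bits used as the index
INDEX_MASK = (1 << INDEX_BITS) - 1

class ResultCache:
    def __init__(self):
        self.entries = {}            # index -> (tag, result)

    def lookup(self, operand):
        index = operand & INDEX_MASK          # index half of the operand
        tag = operand >> INDEX_BITS           # remaining half = tag
        entry = self.entries.get(index)
        if entry is not None and entry[0] == tag:
            return entry[1]                   # hit: reuse the stored result
        return None                           # miss: the ALU must compute it

    def fill(self, operand, result):
        index = operand & INDEX_MASK
        tag = operand >> INDEX_BITS
        self.entries[index] = (tag, result)

def cached_square(cache, x):
    """Compute x*x (a stand-in "expensive" op), reusing cached results."""
    result = cache.lookup(x)
    if result is None:
        result = (x * x) & 0xFFFFFFFF
        cache.fill(x, result)
    return result
```

The second call with the same operand skips the computation entirely, which is the heat/work saving the post is after.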
RaidenJPN
Joined: 22 May 2013
Total Posts: 6920

29 Oct 2013 08:59 AM
After 30 minutes, someone finally replied! \o/
I'm wondering why those engineers didn't implement this, though; I'm pretty sure they've already thought of it...
29 Oct 2013 09:02 AM
-sigh- I made a mistake; this is the fixed version:
Address = last 4 bits of A concatenated with the last 4 bits of B
Tag = remaining bits of A concatenated with the remaining bits of B
Data = A [operation] B
However, this doesn't work in a commutative manner. :/
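The corrected two-operand scheme can be sketched like this, for 32-bit operands. Sorting the operands before indexing is one possible fix (my assumption, not from the post) so that a commutative operation hits the same entry regardless of operand order.

```python
INDEX_BITS = 4               # "last 4 bits" of each operand, as in the post
TAG_BITS = 32 - INDEX_BITS   # remaining 28 bits of each 32-bit operand

def cache_key(a, b, commutative=False):
    """Index and tag for a two-operand result cache.

    For a commutative operation, canonicalising the operand order first
    makes (a, b) and (b, a) map to the same index/tag pair - one way to
    address the commutativity problem.
    """
    if commutative and a > b:
        a, b = b, a
    index = ((a & 0xF) << INDEX_BITS) | (b & 0xF)              # low bits of A ++ low bits of B
    tag = ((a >> INDEX_BITS) << TAG_BITS) | (b >> INDEX_BITS)  # remaining bits of A ++ B
    return index, tag
```

Without the canonicalisation, `a + b` and `b + a` generally land on different entries and each must be computed and cached separately.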
29 Oct 2013 11:16 AM
That sounds like a good idea!
29 Oct 2013 12:19 PM
The CPU is so fast because it can predict how execution will happen - how long single instructions will take, how long until the result is available, which way a branch will go, where memory will be accessed next...
This means that anything unpredictable (random memory accesses, hard-to-predict branching...) is slow.
What you are proposing basically gives instructions a random execution time (sometimes it's fast and goes through the result cache, sometimes it's slow and has to be calculated). I don't think this would work very well, since the CPU needs to plan the order in which instructions are executed (which is done in parallel inside a single core, so while waiting for the result of one instruction another is being executed, if there is no dependency).
Which is probably why you usually either do it fully with a lookup table or don't use one at all.
Then there are other reasons, like whether it really is faster even for slow instructions, whether the space it takes is worth it, whether it reduces heat (which I doubt, since the CPU would execute and look up the instruction at the same time - it can't do the lookup first, since that would cause a big delay when you actually need to execute it), whether it's actually common to calculate the same things again and again, etc.
Memory latency is a much bigger problem than the execution speed of single instructions (or the heat produced?), so giving that space to more cache is probably a better idea.
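The "is it common to calculate the same things again and again" question can be probed with a quick simulation (the parameters are my own guesses, not from the thread): a result cache only pays off when operands repeat within a small working set.

```python
import random

def hit_rate(operands, index_bits=8):
    """Fraction of accesses that hit a direct-mapped result cache."""
    cache = {}                        # index -> tag
    hits = 0
    mask = (1 << index_bits) - 1
    for v in operands:
        index, tag = v & mask, v >> index_bits
        if cache.get(index) == tag:
            hits += 1
        else:
            cache[index] = tag        # miss: execute, then fill the entry
    return hits / len(operands)

random.seed(0)
uniform = [random.getrandbits(32) for _ in range(10_000)]   # "random" operands
repeated = [random.randrange(16) for _ in range(10_000)]    # tiny working set
# Uniformly random 32-bit operands essentially never hit; a small
# repeated working set hits almost every time.
```

So the scheme's benefit depends entirely on how repetitive real operand streams are, which is exactly the open question raised above.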
29 Oct 2013 12:28 PM
Good point, I really needed someone to tell me why they decided not to implement this. But isn't memory latency (at least in SRAM) almost instantaneous? Think about it: a cell only consists of a DFF (about 4 gates?), and the I/O consists of 1 AND gate each, for a total of 6 gates. In a normal execution unit, data would have to go through, let's say, 50 gates? More in FPUs?
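A back-of-envelope version of this gate-count argument (the per-stage numbers are rough assumptions, and wire and sense-amplifier delay, which dominate real SRAM latency, are ignored):

```python
import math

def ripple_adder_depth(n_bits):
    # Carry chain of a ripple-carry adder: roughly 2 gate levels per bit
    # (a rough textbook figure; real ALUs use faster carry-lookahead).
    return 2 * n_bits

def sram_read_depth(entries):
    # Address decoder plus output mux, each roughly log2(entries) levels.
    # Real SRAM latency is dominated by wordline/bitline wire delay and
    # sense amplifiers, which this toy model ignores entirely.
    return 2 * math.ceil(math.log2(entries))
```

By gate levels alone a 256-entry lookup (depth 16) does look shallower than a 32-bit ripple adder (depth 64), but the part this model leaves out - wire delay - is precisely why real memory is not "almost instantaneous".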
29 Oct 2013 03:58 PM
nerds
(aka im just jelly QQ) |
30 Oct 2013 03:28 AM
Sn0x, I'm bored so I decided to create a computer from 74xx gates. :c
30 Oct 2013 07:28 AM
Dude DOnt be Jelly of This idiot He Doent Play FOotBall ,Or Rap So :D
cntkillme
Joined: 07 Apr 2008
Total Posts: 44956

31 Oct 2013 01:13 PM
L0l he Doent's Rap or even Footbal!!!!! How Idot!!!
31 Oct 2013 01:15 PM
h0w aboot we setle dis wid a rap batol?