This post is in English mainly because writing in English is a muscle I haven't exercised in a long while. When I started this blog, almost a thousand posts ago, I decided it would be in Hebrew, even though that greatly narrows the audience, because back then, after twenty-two years of working at an American company (in Israel), writing in Hebrew was a muscle that had atrophied and needed exercise. Writing in English this time is also because this particular topic is much more natural to write about in English. It is the language of the field, and there's nothing to be done about it. So apologies to those who find the foreign tongue hard going.
Lately, I've been introduced to the term "dark silicon" – the area of an integrated device ("chip" from now on) that has to stay "dark", that is, inactive, while other parts of the chip are active, mainly due to power-dissipation constraints within the modest thermal envelope of modern computers.
“Dark silicon” is one way of describing the loss of scaling from one process generation to the next: the transistor count is growing faster than the ability to run all of those transistors concurrently at full speed (or at all, sometimes, because of leakage) without them overheating.
Back in the day, when I was doing computer-architecture research, I had an idea for something I termed “data ports”, which, in retrospect, I think might have provided extra scaling for processors.
I adapted the idea from the domain of special-purpose accelerators, and it goes something like this:
Build, on the side of the processor, a configurable data pipeline that delivers data to the processor’s execution units, takes the data computed by those execution units, and writes it back out.
The idea is that for code in compute loops, a few instructions in a preamble to the loop set up a few control registers – for example, the starting memory address of the data that is input to the loop, and the address “stride” between consecutive data elements. Following that setup stage, the data pipeline runs for a preset number of iterations, without any instructions being repeatedly fetched, decoded, and executed by the processor. The processor is mostly idle while the loop is running.
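To make the paradigm concrete, here is a toy software model of the idea – a minimal sketch, not any real ISA or hardware design; the class name, register names, and the squaring operation are all hypothetical illustrations. The "configure" step plays the role of the preamble writing control registers, and the "run" step plays the role of the pipeline iterating on its own while the core sits idle.

```python
# Hypothetical sketch of the "data port" paradigm described above.
# Nothing here corresponds to a real instruction set; it only models
# the setup-then-stream structure of the idea.

class DataPort:
    """Toy model of a configurable data pipeline beside the processor."""

    def configure(self, memory, src_addr, dst_addr, stride, count, op):
        # Preamble: a few "control register" writes set up the transfer.
        self.memory = memory      # flat "memory" (a Python list)
        self.src_addr = src_addr  # starting address of the input data
        self.dst_addr = dst_addr  # starting address for the results
        self.stride = stride      # distance between consecutive inputs
        self.count = count        # preset number of iterations
        self.op = op              # the execution unit's operation

    def run(self):
        # The pipeline streams data by itself: no per-element
        # instruction fetch/decode; the processor core stays idle.
        for i in range(self.count):
            value = self.memory[self.src_addr + i * self.stride]
            self.memory[self.dst_addr + i] = self.op(value)


memory = list(range(16))          # addresses 0..15 hold values 0..15
port = DataPort()
port.configure(memory, src_addr=0, dst_addr=8, stride=2, count=4,
               op=lambda x: x * x)  # square every other element
port.run()
print(memory[8:12])               # [0, 4, 16, 36]
```

In real hardware the per-iteration loop body would be wired datapath logic rather than interpreted code, which is where the power savings come from.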
This kind of computation paradigm required new code to be generated by a compiler, or by hand. New code, especially code for a new compute paradigm, is something that Intel, my employer, was not fond of – especially after the phenomenal flop of the Itanium series of processors, which were based on a new architecture handed down from HP Labs, in which the compiler, rather than the processor, was supposed to expose all the parallelism in the code. This was perhaps the main reason the idea did not take hold back then. Also, 20 years ago, Dennard scaling was still believed valid – the fact that operating voltage no longer goes down at the same rate as the device size had not yet been internalized to the degree it is today.
Those data ports were supposed to be very power-efficient for code in loops and could have addressed a wide range of applications, especially those in which the data is organized in vectors or matrices. However, with the current understanding that not all the silicon can be used concurrently anyway, a setup in which these data pipelines "waste" transistors on their own separate set of execution units, instead of using the main processor's units, makes more sense.
So – while executing code in loops, the rest of the processor can cool down. Thus the surplus transistors in that “dark silicon” are put to good use, with power being dissipated in different areas of the chip for different types of code.
As I'm writing this post, I realize that I still have hope that someone will re-invent this idea and implement it. Perhaps someone already did – I wouldn't know – for the last 11 years I've been mostly reading philosophy and have lost almost all contact with the state of the art in the field of computer architecture. Oh well…