In a paper [paywall] published Friday in the journal Science, MIT professors Charles Leiserson and Daniel Sanchez, adjunct professor Butler Lampson, and research scientists Joel Emer, Bradley Kuszmaul, Tao Schardl, and Neil Thompson argue that the tech industry needs software performance engineering, better algorithmic approaches to problem solving, and streamlined hardware interaction.
These three areas at the top of the stack will yield less reliable gains than semiconductor density improvements of the past because they're interrelated.
"Unlike Moore’s law, which has driven up performance predictably by 'lifting all boats,' working at the Top to obtain performance will yield opportunistic, uneven, and sporadic gains, typically improving just one aspect of a particular computation at a time," the authors state.
Nonetheless, that is where the opportunity lies now that further miniaturization no longer looks practical.
This future demands better programming techniques to write faster code. To illustrate that point, the MIT researchers wrote a simple Python 2 program that multiplies two 4,096-by-4,096 matrices. They used an Intel Xeon processor with a 2.9-GHz 18-core CPU and a shared 25-mebibyte L3 cache, running Fedora 22 and version 4.0.4 of the Linux kernel.
    for i in xrange(4096):
        for j in xrange(4096):
            for k in xrange(4096):
                C[i][j] += A[i][k] * B[k][j]
The code, they say, takes seven hours to compute the matrix product, or nine hours if you use Python 3. Better performance can be achieved by using a more efficient programming language, with Java resulting in a 10.8x speedup and C (v3) producing an additional 4.4x increase for a 47x improvement in execution time.
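To get a feel for that baseline without waiting hours, the same triple loop can be run at a much smaller size. The sketch below is written for Python 3 (so `range` replaces `xrange`); the function names `naive_matmul` and `flops_per_second`, the test size of 64, and the input values are illustrative assumptions, not taken from the paper. It also works out the throughput implied by the reported figure: an n-by-n product needs about 2n³ floating-point operations, so roughly seven hours for n = 4,096 comes to about 5.5 million operations per second.

```python
import time


def naive_matmul(A, B):
    """Multiply square matrices A and B using the same i, j, k loop order."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def flops_per_second(n, seconds):
    """Floating-point operations per second for an n-by-n product.

    Each of the n**3 inner iterations does one multiply and one add.
    """
    return 2 * n**3 / seconds


if __name__ == "__main__":
    n = 64  # hypothetical small size; 4,096 would take hours in pure Python
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]

    start = time.perf_counter()
    naive_matmul(A, B)
    elapsed = time.perf_counter() - start
    print("n=%d: %.3f s, %.2e flop/s" % (n, elapsed, flops_per_second(n, elapsed)))

    # Throughput implied by the reported run time of roughly seven hours.
    print("reported run: %.2e flop/s" % flops_per_second(4096, 7 * 3600))
```

Timings from a modern machine and interpreter will not match the researchers' setup, so the small-size measurement is only a rough point of comparison.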
Beyond programming language gains, exploiting specific hardware features can make the code run 1,300x faster still. By parallelizing the code to run on all 18 of the available processing cores, optimizing for the processor's memory hierarchy, vectorizing the code, and using Intel's Advanced Vector Extensions, the seven-hour number-crunching task can be reduced to 0.41 seconds, or 60,000x faster than the original Python code.
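As an illustration of what "optimizing for the memory hierarchy" can mean, one common technique is loop tiling (also called cache blocking): the matrices are processed in small square blocks so that the data being reused stays in cache. The article does not say which technique the researchers used, so the sketch below is only an assumed example of the general idea. The function name `tiled_matmul` and the default tile size of 32 are hypothetical. In pure Python the interpreter overhead hides most of the cache benefit; the loop structure is what carries over to a compiled language.

```python
def tiled_matmul(A, B, tile=32):
    """Multiply square matrices A and B one tile-by-tile block at a time."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    # The outer three loops walk over blocks; min() handles ragged edges
    # when n is not a multiple of the tile size.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # The inner three loops multiply one block of A by one
                # block of B and accumulate into one block of C.
                for i in range(ii, min(ii + tile, n)):
                    row_c = C[i]
                    row_a = A[i]
                    for k in range(kk, min(kk + tile, n)):
                        a = row_a[k]
                        row_b = B[k]
                        # Innermost loop runs along rows of B and C, so
                        # consecutive elements are touched in order.
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a * row_b[j]
    return C
```

The result is the same as the plain triple loop; only the order in which elements are visited changes. A suitable tile size depends on the cache sizes of the target machine and would normally be chosen by measurement.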
I see the time for assembler has come :P