Abstract
The performance and energy consumption of modern multicores are increasingly dominated by the costs of communication. To improve efficiency and support ever larger working sets, vendors are using the additional transistors provided by Moore's law to build larger caches and deeper memory hierarchies. Today's multicore hardware is commonly programmed by mapping tasks onto individual hardware threads. To exploit future hierarchical systems, however, the basic computational building block must move beyond this thread-centric view: cache-centric and memory-centric programming methods should become first-order abstractions.
In this talk I will describe a parallel computing scheme and runtime system called Task Assembly Objects (TAO), which is designed to handle parallel computations with strong caching and co-scheduling requirements. The central component of the scheme is the task assembly object, a parallel object consisting of a collection of tasks. A task assembly aggregates (i) fine-grained tasks, (ii) a set of threads and caches, and (iii) a private scheduler into an atomic scheduling unit that is mapped onto a set of hardware resources. By constructing and scheduling assemblies based on the system's topology, TAO enables scalable parallelism and reduced communication on deep hierarchical multicore hardware.
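As a rough illustration of the task-assembly idea, the following self-contained C++ sketch bundles fine-grained tasks, a thread width, and a private scheduler into one schedulable object. All names here (TaskAssembly, Scheduler, add_task, and so on) are hypothetical and do not reflect the actual go:tao API; a real runtime would additionally pin the worker threads to the cores sharing the target cache.

// Minimal sketch of a task assembly; names are illustrative, not go:tao's API.
#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Task = std::function<void()>;

// Private scheduler: each assembly owns one; here a simple locked queue.
class Scheduler {
public:
    void push(Task t) {
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_back(std::move(t));
    }
    bool pop(Task& t) {
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return false;
        t = std::move(tasks_.back());
        tasks_.pop_back();
        return true;
    }
private:
    std::mutex m_;
    std::vector<Task> tasks_;
};

// A task assembly: fine-grained tasks + a width (the number of hardware
// threads, e.g. those sharing one cache) + a private scheduler. The whole
// object is treated as an atomic scheduling unit.
class TaskAssembly {
public:
    explicit TaskAssembly(int width) : width_(width) {}
    void add_task(Task t) { sched_.push(std::move(t)); }

    // Execute the assembly on width_ threads; a real runtime would pin
    // these threads to the resource partition assigned to the assembly.
    void run() {
        std::vector<std::thread> workers;
        for (int i = 0; i < width_; ++i)
            workers.emplace_back([this] {
                Task t;
                while (sched_.pop(t)) t();  // drain the private queue
            });
        for (auto& w : workers) w.join();
    }
private:
    int width_;        // threads/caches assigned to this assembly
    Scheduler sched_;  // scheduler private to this assembly
};

int main() {
    TaskAssembly ta(/*width=*/4);  // e.g. four threads sharing an L2 cache
    for (int i = 0; i < 16; ++i)
        ta.add_task([i] { std::printf("fine-grained task %d\n", i); });
    ta.run();
    return 0;
}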
This talk will first cover the rationale and design of TAO. Next, I will describe the implementation of the TAO prototype runtime system called go:tao. Applications in go:tao are commonly programmed as flow-graphs of parallel patterns, as the sketch below illustrates. I will describe this by means of three applications that have been ported to the go:tao runtime: the Unbalanced Tree Search, a parallel integer sort, and a 2D Jacobi iteration. I will conclude the talk by characterizing the performance of these codes and summarizing open challenges.
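To make the flow-graph idea concrete, here is a small C++ sketch of a graph of parallel patterns, loosely modeled on a Jacobi-style computation: two independent sweeps feed a reduction, and nodes run once their dependencies are satisfied. The Node structure and dispatch loop are assumptions for illustration only, not the go:tao interface.

// Hypothetical sketch of a flow-graph of parallel patterns.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Node {
    std::function<void()> pattern;  // e.g. a parallel sweep over a block
    std::vector<int> successors;    // flow-graph edges
    int pending = 0;                // unmet dependencies
};

int main() {
    std::vector<Node> graph(3);
    // Two independent sweeps feeding a reduction, Jacobi-style.
    graph[0].pattern = [] { std::puts("sweep over top half"); };
    graph[1].pattern = [] { std::puts("sweep over bottom half"); };
    graph[2].pattern = [] { std::puts("residual reduction"); };
    graph[0].successors = {2};
    graph[1].successors = {2};
    graph[2].pending = 2;

    // Kahn-style dispatch: run nodes whose dependencies are met. A real
    // runtime would instead map each ready node (an assembly) onto a
    // cache partition and execute its tasks in parallel.
    std::queue<int> ready;
    for (int i = 0; i < (int)graph.size(); ++i)
        if (graph[i].pending == 0) ready.push(i);
    while (!ready.empty()) {
        int n = ready.front();
        ready.pop();
        graph[n].pattern();
        for (int s : graph[n].successors)
            if (--graph[s].pending == 0) ready.push(s);
    }
    return 0;
}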
Bio
Miquel Pericàs received his Ph.D. from the Universitat Politècnica de Catalunya (UPC) in 2008. He was a senior researcher at the Barcelona Supercomputing Center between 2009 and 2012, and a JSPS postdoctoral fellow at the Tokyo Institute of Technology from 2012 until 2014. His research has covered processor microarchitecture, reconfigurable supercomputing, energy-aware MPI communication, task-based programming models, and performance analysis tools for task-based models. Since 2014 he has been a researcher at Chalmers University of Technology. His current research interests include runtime systems, parallel computer organization, and energy-efficient computing.