Latest Milestone
Oct 30, 2015

Achieved: Performance model for ARM and other emerging hardware

Exa2Green Milestone: Performance model for ARM and other emerging hardware


In this milestone, we examined the potential of running the numerical simulation of the atmospheric chemical kinetics within COSMO-ART on an ARM64 compute node. To that end, we simulated a zero-dimensional box model on an ARM64 node and compared time- and energy-to-solution to an x86-64 system.


The computation on the ARM node is approx. 5.1 times slower than on an x86-64 node. Even though the ARM node requires almost 39% less power under full load than the x86-64 system, the total energy consumed on the ARM node is more than 3 times as high as on the x86-64 system for the test scenario considered here.
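The two reported figures are consistent with the rule of thumb that energy is average power times runtime. A back-of-the-envelope check (the absolute power value below is hypothetical; only the ratios come from the measurements):

```python
# Illustrative check: energy = average power x runtime.
# The x86 power value is hypothetical; only the two ratios are from the text.
x86_power = 100.0                    # W, hypothetical baseline
x86_time = 1.0                       # normalized runtime

arm_power = x86_power * (1 - 0.39)   # ~39% less power under full load
arm_time = x86_time * 5.1            # ~5.1 times slower

energy_ratio = (arm_power * arm_time) / (x86_power * x86_time)
print(f"ARM/x86 energy ratio: {energy_ratio:.2f}")  # ~3.1, i.e. more than 3x
```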


Within the scope of this work, only limited access to ARM technologies was possible. The ARM64 node utilized in this investigation turned out not to be competitive with the newest technology in terms of power consumption. For a more complete investigation, other ARMv8-A processors would have to be taken into account. For the technology utilized here, calculating atmospheric chemical kinetics on an ARM64 node has not been shown to lead to savings in energy.


Responsible Partner: Heidelberg University

Older Milestones
Oct 29, 2015

Achieved: Implementation of an energy-efficiency feedback system

Exa2Green Milestone: Implementation of an energy-efficiency feedback system


The goal of this milestone is to provide energy accounting and power profiling via a resource manager, making users aware of their energy consumption. The idea is to integrate the HDEEM power measurement framework into the resource manager as a new plugin operating on BULL FPGA-based blade servers.


The main technical solution is an energy accounting plugin for the SLURM resource manager, based on the HDEEM power measurement framework.


Users can use SLURM commands to obtain high-resolution power profiling feedback for their applications running on the MISTRAL supercomputer or similar BULL HPC machines.


Responsible Partner: Universität Hamburg

Sept 30, 2015

Achieved: Final Exa2Green Partner Meeting took place

Exa2Green Milestone: Final Meeting


On 29th September 2015 the Exa2Green partners met for the last time to look back on three years of successful collaboration, discuss the project’s achievements and plan the activities for the last project month.


During the intensive one-day meeting in Stuttgart, the partners came to the conclusion that the Exa2Green results achieved are numerous and impressive.


Responsible Partners: Steinbeis-Europa-Zentrum and University of Heidelberg

Sept 30, 2015

Achieved: Improved parallel performance and reduced energy consumption

Exa2Green Milestone: Framework for Reconfiguring Hardware System during Runtime

The multigrid method is, from a mathematical point of view, a very efficient solution method for many engineering, environmental or medical simulations. We investigated the parallelization of this method for use on high-performance computing clusters. We developed a technique that exploits the nature of multigrid, which demands different computational effort at different stages of the method. Our technique deactivates parts of the computer hardware temporarily during the computation. This yields improved parallel performance, i.e. a shorter runtime until the solution is computed, and reduces the energy consumption on top of the performance gain.


Multigrid solvers are among the most efficient numerical methods for solving symmetric positive definite linear systems. The computational complexity is O(n) for sparse systems with n unknowns. Such systems can result from the discretization of elliptic partial differential equations. The geometric multigrid variant, as opposed to the algebraic multigrid, is based on the discretization of the underlying equations on several grid refinement levels. The operations on fine grid levels, involving smoothers and grid transfer operators, usually employ a limited number of basic numerical building blocks such as vector operations or sparse matrix-vector multiplications. Only on the coarsest level is a direct or iterative method used to solve the error correction equation approximately. For common cluster-level parallelizations based on a domain decomposition which uses the same number of processes throughout the whole grid hierarchy, the overall parallel performance is limited by the parallel performance on the coarsest level with the smallest problem size.


We address the issues of both parallel performance and energy consumption by using different numbers of processes on different grid levels, thus adapting the parallelization to the problem sizes in the hierarchy. We introduce a dynamic adaption of the hardware activity during the solver execution which adjusts the parallel configuration according to the solver needs.


We impose restrictions on the general parallelization scheme from which we derive an adaption strategy for the hardware activity. The restrictions reduce the communication overhead for transferring a vector between different grid levels to a minimum, while the performance of the domain decomposition parallelization within each grid level is maintained. The actual adjustment of the hardware activity is done by pausing and reactivating MPI processes during the multigrid solver execution, causing the CPU cores to enter sleep C-states during the pause phases.
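The level-dependent process counts can be sketched as follows. This is a hypothetical illustration, not the actual Exa2Green strategy: it assumes the active set is halved on coarser levels whenever the per-rank workload would drop below a threshold, with paused ranks blocking (and their cores sleeping) until the solver returns to finer levels.

```python
# Hypothetical sketch: decide how many MPI ranks remain active on each
# multigrid level. The active set shrinks on coarser levels so that every
# active rank keeps a minimum amount of local work; deactivated ranks would
# block in a collective, letting their cores enter sleep C-states.
def active_ranks_per_level(n_unknowns_finest, n_ranks, min_local_size=1024):
    levels = []
    n = n_unknowns_finest
    ranks = n_ranks
    while n >= 1:
        # Halve the active set until each active rank has enough work.
        while ranks > 1 and n // ranks < min_local_size:
            ranks //= 2
        levels.append((n, ranks))
        if n == 1:
            break
        n //= 8          # 3D coarsening: ~1/8 of the unknowns per level
    return levels

levels = active_ranks_per_level(n_unknowns_finest=2**24, n_ranks=64)
for n, r in levels:
    print(f"level with {n:>10} unknowns -> {r:>2} active ranks")
```

The key property is that the fine levels, where most work happens, keep the full domain decomposition, while coarse levels run on a small active set instead of forcing all ranks to communicate over tiny subproblems.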


Responsible Partner: Heidelberg University

Sept 14, 2015

Achieved: Enhanced dynamic energy-aware run-time for dense linear algebra operations developed

Exa2Green Milestone: Enhanced dynamic energy-aware run-time for dense linear algebra operations

Dense linear algebra (DLA) problems are ubiquitous in scientific computing, being at the bottom of the “food chain” for many complex applications. Therefore, any effort towards improving the performance and energy efficiency of these operations will surely have a strong positive impact on the countless numerical simulations built upon them.

In recent years, a number of run-times have been proposed to alleviate the burden of programming multi- and many-threaded platforms. When applied to a DLA problem, these software efforts exploit the task parallelism implicit in the operation by (semi-)automatically decomposing it into tasks while simultaneously performing a task-dependency analysis. This process is complemented with a dependency-aware out-of-order scheduling of the tasks onto the computational resources at execution time.

The team of project partner Universitat Jaume I (UJI) has developed optimized solvers for dense and sparse linear systems and tuned run-times for their execution on multi-core and many-core general-purpose processors and hardware accelerators. Specifically:
(i)    The UJI team has explored different solutions to improve the performance and energy efficiency of the SuperMatrix runtime task scheduler, adapted to the many-core Intel Xeon Phi architecture. In combination with an appropriate thread-to-core affinity mapping, experiments with the Cholesky factorization delivered performance results that are competitive with, and even improve on, those of the proprietary MKL implementation.
(ii)    The UJI team has also addressed the problem of refactoring existing runtime task schedulers to exploit task-level parallelism in novel ARM big.LITTLE systems-on-chips. For the specific domain of DLA, an approach that delegates the burden of dealing with asymmetry to the library (using an asymmetry-aware BLIS implementation) does not require any revamp of existing task schedulers, and can deliver high performance.
(iii)    For the solution of sparse linear systems via ILUPACK, the developments of the UJI team on an Intel Xeon Phi accelerator and a 64-core NUMA AMD server reveal that ample task and data concurrency exists in the preconditioned solver, showing notable speed-ups on these two architectures.

In current multi-threaded architectures, optimizing for energy is in general equivalent to optimizing for performance. This is particularly true in the case of DLA operations as well as the solution of sparse linear systems via ILUPACK. However, for run-time assisted parallelization of these operations, it is possible to improve the energy efficiency, without hurting performance, by modifying the run-time mechanism to leverage the C-states and P-states defined by the ACPI standard.

The practical trade-off underlying the polling vs blocking policies for idle threads is in many ways equivalent to that of performance vs energy saving. In the original implementation of many run-times, upon encountering no task ready to be executed, an “idle” thread simply performed a “busy-wait” (polling) on a condition variable until a new task became available. This strategy prevents the operating system from promoting the associated core into a power-saving C-state because the thread is not actually idle (it is doing useless work). As an alternative, it is possible to rely on a power-aware version of the run-time that applies an “idle-wait” (blocking) whenever a thread does not encounter a task ready for execution and, thus, becomes inactive.
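A minimal sketch of the blocking ("idle-wait") policy, using a plain Python task queue with hypothetical names rather than the internals of any particular run-time:

```python
import threading
from collections import deque

# Sketch of an "idle-wait" (blocking) worker loop: instead of busy-polling,
# an idle thread blocks on a condition variable, letting the OS put its core
# into a power-saving C-state until a task arrives.
class TaskQueue:
    def __init__(self):
        self._tasks = deque()
        self._cond = threading.Condition()
        self._done = False

    def put(self, task):
        with self._cond:
            self._tasks.append(task)
            self._cond.notify()          # wake one blocked worker

    def shutdown(self):
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def get(self):
        with self._cond:
            # Blocking wait: no CPU cycles burned while the queue is empty.
            while not self._tasks and not self._done:
                self._cond.wait()
            return self._tasks.popleft() if self._tasks else None

results = []
q = TaskQueue()

def worker():
    while (task := q.get()) is not None:
        results.append(task * task)

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    q.put(i)
q.shutdown()
t.join()
print(sorted(results))  # [0, 1, 4, 9, 16]
```

The polling variant would replace `self._cond.wait()` with a spin loop that repeatedly re-checks the queue, keeping the core at full activity even when there is nothing to do.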

By carefully tuning the software in charge of orchestrating the execution of numerical linear algebra operations, it is possible to obtain significant energy savings, at no performance cost, during the execution of key numerical methods for the solution of scientific and engineering applications. This outcome is especially timely in a decade where the power wall has arisen as a major obstacle to build faster processors.
Responsible Partner: Universitat Jaume I

Aug 21, 2015

Achieved: Final International Conference carried out

Exa2Green Milestone: Final FET Conference


Nearing the end of the Exa2Green project, various remarkable research results have been achieved. The workshop "Power & Energy-Aware High Performance Computing on Emerging Technology", which took place on 16th July as part of ISC High Performance 2015 in Frankfurt (Germany), was an opportunity for the project partners to present these results, embedded in a broader programme with keynote speakers on the topic. The partners University of Heidelberg and Steinbeis-Europa-Zentrum collaborated to organise the workshop.


The workshop, carried out by the project partners, focused on new energy-aware computing paradigms and programming methodologies to address the problem of prohibitive power consumption by current hardware when extrapolating to exascale machines. The goal of the workshop was to bring together scientists and engineers from academia and industry with interests in energy-efficient computing.


Workshop topics included:

  • Tools for advanced power consumption monitoring and profiling.
  • New multi-objective metrics for quantitative assessment and analysis of the energy profile of algorithms.
  • Smart algorithms using energy-efficient software models.
  • Power-aware implementations of numerical methods for high performance clusters.
  • Performance/energy optimisation in applications, showcases, proof of concepts.


Further information can be found on the event’s website.


Responsible Partners: University of Heidelberg and Steinbeis-Europa-Zentrum

Apr 24, 2015

Achieved: Refactored COSMO-ART and/or COSMO-HAM code prototype

Exa2Green Milestone: Refactored COSMO-ART and/or COSMO-HAM code prototype for CPUs and multi-core architectures


ETH Zurich/CSCS and KIT have collaborated closely and developed a standalone test harness for atmospheric chemistry simulations which aims at reproducing, as closely as possible, the chemical scenario of the reference COSMO-ART baseline. It is based on the Kinetic PreProcessor (KPP) for the solution of the atmospheric chemistry ordinary differential equations.

The goal of this testbed is to test out different KPP solvers (namely KPP-2.2.1, KPP-2.2.3 and KPPA_0.2.1) and to compare the performance and energy consumption between multi-threaded CPUs and CUDA implementations at the node level.

There are two versions of the KPP test framework that allow for an exclusive benchmarking of the gas-phase chemistry:

  • a 0-dim box model in which the calculations in all cells of the 3D domain are identical. Several integrators that come with the standard KPP release, as well as a new integrator proposed by PRACE 2IP WP8, were also tested.
  • an extended box model which reads in temperature and chemical concentrations in NetCDF format. These tests mimic a typical chemical setup, as it is solved in the COSMO-ART baseline, but can be run autonomously. This way, algorithmic changes can be realized much more easily and with a higher level of flexibility in the implementation. The respective energy-to-solution measurements can be carried out at a computational effort negligible compared to that of a full COSMO-ART run.
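The structure of the 0-dim box model can be illustrated with a toy stand-in. This sketch is not the KPP-generated code (which solves a stiff mechanism with many species using dedicated integrators); it only mirrors the shape of the computation, the same small ODE integrated independently in every cell:

```python
import math

# Toy stand-in for the 0-D box model: a single first-order decay reaction
# A -> B with rate k, dc/dt = -k c, integrated independently in every grid
# cell with identical initial conditions (as in the 0-dim test case above).
def integrate_box(c0, k, dt, n_steps):
    c = c0
    for _ in range(n_steps):
        c = c - k * c * dt            # forward Euler step
    return c

k, dt, n_steps = 0.5, 1e-3, 2000      # integrate over t = 2 s
cells = [1.0] * 8                     # identical initial concentration per cell
final = [integrate_box(c0, k, dt, n_steps) for c0 in cells]

exact = math.exp(-k * 2.0)            # analytic solution e^{-kt}
print(final[0], exact)
```

Because every cell performs exactly the same arithmetic, the kernel is an ideal target for vectorized CPU and CUDA implementations, which is what the testbed benchmarks.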

Responsible Partner: ETH Zurich/CSCS

Dec 18, 2014

Achieved: Auto-tuned version of a numeric Dwarf

Exa2Green Milestone: Auto-tuned version of at least 1 numeric and non-numeric Dwarf completed


In many contexts, such as neuronal network simulations, Fourier transforms, and wave propagation, the repeated evaluation of the natural exponential function represents one of the main computational bottlenecks, limiting the overall time- and energy-to-solution for the problem under investigation.


The research team at IBM Research - Zurich has developed a novel formulation to compute the natural exponential employing only floating-point operations. The new algorithm can compute the natural exponential of large vectors, in any application setting, maximizing the performance gains from the vectorisation units available in modern processors. In addition, the formulation allows an additional gain through arbitrary accuracy, ranging from a few digits up to full machine precision. In this regard, the algorithm can auto-tune the accuracy of the computed exponential with respect to the problem requirements or, e.g., the current residual of the iterative scheme (for non-linear problems). The improved vectorisation, coupled with the use of a low-precision natural exponential, roughly doubles the speed with respect to classical approaches based on memory-bound look-up tables.
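The general idea of a tunable-accuracy exponential can be sketched as follows. This is a textbook-style formulation, not the IBM algorithm: reduce the argument as x = k·ln 2 + r, evaluate a truncated polynomial for e^r, and scale by 2^k; the number of polynomial terms trades accuracy for speed, and in a vectorized version the same instruction sequence runs on whole vectors at once.

```python
import math

# Hypothetical tunable-accuracy exponential (not the IBM formulation):
# range reduction x = k*ln2 + r with |r| <= ln2/2, truncated Taylor series
# for e^r in Horner form, then scaling by 2**k via exponent manipulation.
def fast_exp(x, n_terms=6):
    k = round(x / math.log(2.0))
    r = x - k * math.log(2.0)
    p = 1.0
    for i in range(n_terms, 0, -1):   # Horner evaluation of sum r^i / i!
        p = 1.0 + p * r / i
    return math.ldexp(p, k)           # p * 2**k, floating-point ops only

for n in (3, 6, 9):
    err = max(abs(fast_exp(x, n) - math.exp(x)) / math.exp(x)
              for x in [i * 0.1 for i in range(-50, 51)])
    print(f"{n} terms: max relative error ~ {err:.1e}")
```

Lowering `n_terms` when only a few digits are needed (e.g. early iterations of a non-linear solver) is the auto-tuning knob the text describes.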


Responsible Partner: IBM Research - Zurich

Nov 12, 2014

Achieved: Three optimised numeric Dwarfs implementations completed

Dwarfs analysis and optimisation represents one of the main goals of the Exa2Green project. The research team at IBM Research - Zurich has now completed the implementation of the numeric Dwarfs.


The research team at IBM Research - Zurich performed many measurements and experiments towards the objective of reaching a deep understanding of the interactions occurring in the time-power-energy triangle, which characterises any scientific computing kernel. Based on this work, it has been possible to come up with new power- and energy-efficient implementations for a selected number of kernels, namely the Conjugate Gradient method (for covariance problems), the sparse matrix-vector multiplication (with optimised frequency scaling), and the stochastic trace estimator (for the matrix inverse). The work performed on these kernels has been disclosed in scientific papers, proceedings, and international talks, and is available to the entire scientific community. Moreover, the findings of this research, as well as the techniques implemented to reduce power and energy consumption in these three kernels, can be re-used to obtain similar results in many other implementations and scientific computing Dwarfs.
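The first of the three kernels, the Conjugate Gradient method, can be sketched in its textbook form. This is a minimal pure-Python illustration on a small dense SPD system; the optimized implementations described above target sparse matrices and add power- and frequency-aware techniques on top of this same algorithmic structure:

```python
# Minimal textbook Conjugate Gradient sketch (pure Python, dense SPD system).
def cg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # residual r = b - A x, x = 0
    p = r[:]                                  # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:                # converged
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]                  # small SPD test matrix
b = [1.0, 2.0]
x = cg(A, b)
print(x)                                      # approx [1/11, 7/11]
```

The dominant cost per iteration is the matrix-vector product, which is why the sparse matrix-vector multiplication appears as a Dwarf in its own right.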


Responsible Partner: IBM Research - Zurich

June 17, 2014

Achieved: Numeric and Non-Numeric Dwarfs and Energy Aware Metrics

Exa2Green Milestone: Numeric and Non-Numeric Dwarfs and Energy Aware Metrics


Power, which traditionally guided the road map of embedded and mobile appliances, has become a key principle for the design of commodity general-purpose CPUs, marking the end of the frequency race. The power wall makes fine-grained estimation of application power usage timely and crucial for the design of policies and mechanisms that enable an energy-aware execution of software. Seminal papers in the literature demonstrate that a huge number of technical applications can be decomposed into a small set of 7 to 13 "Dwarfs", i.e., common computational kernels. The power and energy optimization of these Dwarfs represents a challenge with a tremendous impact on a huge number of applications and libraries, whose performance is indeed driven by a small set of classical code blocks.


The research team at IBM Research - Zurich has investigated the power and energy consumption of several elementary kernels belonging to the Dwarfs list. The list of analyzed kernels ranges from dense and sparse linear algebra (e.g., sparse matrix-vector multiplication and the conjugate gradient method) to spectral methods (such as the fast Fourier transform). All the kernels have been analyzed on modern HPC architectures, namely the IBM Blue Gene/Q and the IBM POWER7 system. This analysis has led the research team to a deep understanding of the underlying processes that drive power and energy consumption. Thanks to the newly acquired knowledge, IBM Research - Zurich has also subsequently developed very accurate models for the characterization and time-power-energy prediction of scientific computing kernels, tested on multicore architectures. These models represent not only a fundamental utility tool for cluster and cloud systems, but also constitute a solid basis for the subsequent optimization of the Dwarfs kernels.


As regards standard metrics, the team at IBM Research - Zurich has further confirmed the intrinsic limitations of the Top500 and Green500 lists. They rank methods and machines on the basis of GFlops and GFlops/W rather than real quantities of practical interest, such as time-to-solution and energy. In other words, these metrics essentially measure how many Flops can be squeezed out of the machine per unit of time or energy, while a real performance- and energy-aware metric should promote methods and machines that can effectively solve the same problem faster while consuming less energy. In this direction, the IBM research team has developed a new "Exa2Green metric" that overcomes these limitations by capturing the real performance of the algorithms. This metric, together with the introduction of new benchmarks based on, e.g., the CG method and some of the Dwarfs, will help to drive the development of future hardware in the correct direction.
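The distinction can be made concrete with entirely made-up numbers. A brute-force solver may sustain a higher GFlops/W rate yet need far more Flops to reach the same solution, losing on the quantities of practical interest:

```python
# Illustrative (made-up) numbers showing why GFlops/W can mislead: solver B
# is faster per Flop and better per watt-Flop, but needs 8x the Flops to
# solve the same problem, so it loses on time- and energy-to-solution.
solvers = {
    #             Flops needed  GFlops/s  avg. power (W)
    "A (smart)": (1.0e12,        50.0,    200.0),
    "B (brute)": (8.0e12,       100.0,    250.0),
}
metrics = {}
for name, (flops, gflops, watts) in solvers.items():
    time_s = flops / (gflops * 1e9)           # time-to-solution
    energy_j = watts * time_s                 # energy-to-solution
    metrics[name] = (gflops / watts, time_s, energy_j)
    print(f"{name}: {gflops / watts:.2f} GFlops/W, "
          f"time {time_s:.0f} s, energy {energy_j:.0f} J")
```

Ranked by GFlops/W, B wins; ranked by time- or energy-to-solution, as a metric of the Exa2Green kind would do, A wins.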


Responsible Partner: IBM Research - Zurich

Apr 14, 2014

Achieved: Energy-efficient asynchronous relaxation for GPU-accelerated systems

Exa2Green Milestone: Energy-efficient asynchronous relaxation for GPU-accelerated systems


Synchronization is increasingly becoming a bottleneck for high performance computing, especially with the ever-increasing intrinsic parallelism of today’s computer architectures. In effect, new algorithms and optimization techniques that reduce synchronization are becoming essential for obtaining high efficiency and scalability. We demonstrated and quantified this for relaxation methods on current GPUs. In particular, we developed asynchronous relaxation methods and showed that the absence of synchronization points enables not only excellent scalability, but also a high tolerance to hardware failure. Further, the efficient hardware usage enabled by the asynchronous techniques overcompensated for the inferior convergence properties of these algorithms. When compared to the Gauss–Seidel relaxation, they still provide solution approximations of prescribed accuracy in considerably shorter time. Targeting multi-GPU platforms, we have shown that the derived algorithms allow for the efficient use of asynchronous data transfers, while for memory-bound problems the performance suffers significantly from the low throughput rates of the inter-device connection.
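The core idea of asynchronous relaxation can be emulated on a single thread. In the sketch below (a toy illustration, not the GPU implementation), components of a diagonally dominant system are relaxed in random order, always reading whatever neighbour values are currently in memory, mimicking threads that never wait for a global sweep to complete:

```python
import random

# Single-threaded emulation of asynchronous ("chaotic") relaxation for a
# strictly diagonally dominant system A x = b: components are updated in
# random order using the latest available values, with no sweep barriers.
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x = [0.0, 0.0, 0.0]

random.seed(0)
for _ in range(200):                     # individual updates, no sweeps
    i = random.randrange(3)              # relax an arbitrary component
    sigma = sum(A[i][j] * x[j] for j in range(3) if j != i)
    x[i] = (b[i] - sigma) / A[i][i]

print([round(v, 6) for v in x])          # converges toward [1, 1, 1]
```

For strictly diagonally dominant matrices this converges as long as every component keeps being updated, which is the property that makes the method tolerant to stragglers and failed updates.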


Responsible Partner: Heidelberg University

Apr 07, 2014

Achieved: New energy-friendly solvers, combined with mixed precision techniques

Exa2Green Milestone: New energy-friendly solvers, combined with mixed precision techniques

By carefully leveraging commodity hardware such as graphics processors, combined with specifically designed algorithms, it is possible to obtain significant energy savings, at no performance cost, during the execution of key numerical methods for the solution of scientific and engineering applications. This outcome is especially timely in a decade where the power wall has arisen as a major obstacle to build faster processors.


The team of the project partner Universitat Jaume I has analyzed the interactions arising in the performance-power-energy triangle for the execution of a pivotal numerical algorithm, the iterative Conjugate Gradient (CG) method, on a diverse collection of parallel multithreaded architectures. The range of target architectures extended from general-purpose and digital signal multicore processors to many-core graphics processing units (GPUs), as representatives of current multithreaded systems. Moreover, the CG method has recently been proposed as a complement to the LINPACK benchmark, as this iterative method is argued to be more archetypical of the performance of today’s scientific and engineering applications.


Based on this analysis, we introduced a redesign of the CG method for GPUs, reshaping the GPU kernels induced by the classical formulation of this method into algorithm-specific routines. This resulted in a slight increase in performance and, more importantly, enabled the efficient exploitation of power-saving techniques implicit in the hardware, producing remarkable energy savings.


Responsible Partner: Universitat Jaume I

Apr 02, 2014

Achieved: Tool environment to measure information sources related to power consumption

Exa2Green Milestone: Tool environment to measure information sources related to power consumption


The goal of this milestone is to provide a tool to analyse the performance and power dissipation of parallel scientific applications. The framework enables scientists and technicians to identify sources of power inefficiency and optimize the application's code. The use of green practices when developing scientific applications will help to reduce the total energy consumption and carbon footprint of current HPC data centers.

This milestone pursued the goal of providing a built-in framework for performance and energy profiling and tracing of parallel scientific applications. The framework environment integrates power dissipation and hardware counter metrics into existing profiling and tracing tools in order to analyse the power-performance behavior of applications by correlating performance traces with power profiles. This tool allows a detailed and accurate characterization of the applications from the performance and energy-efficiency points of view.

The tool framework provided by UHAM is flexible and leverages the Extrae/Paraver and VampirTrace/Vampir environments to analyze the power dissipation and energy consumption of parallel scientific applications. The framework includes the PMLib library, offering a simple interface to interact with self-designed and commercial wattmeters. Extensions of the framework are always possible, and new metrics can be included, e.g., processor sleep states, hardware power counters, information on running processes, system utilisation, etc. This tool enables project partners to identify sources of power inefficiency directly in the analysed library code and applications.
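Conceptually, correlating a trace with a power profile means aligning timestamped wattmeter samples with the application's code regions and integrating power over each region. A hypothetical sketch of that step (the names and numbers are illustrative, not the PMLib API):

```python
# Hypothetical sketch of trace/power correlation: align (timestamp, watts)
# samples from a wattmeter with code regions from an application trace, and
# integrate power over each region to attribute energy to it.
power_samples = [(0.0, 80.0), (0.5, 120.0), (1.0, 150.0),
                 (1.5, 150.0), (2.0, 90.0)]           # (seconds, watts)
regions = [("init", 0.0, 0.5), ("solver", 0.5, 1.5), ("output", 1.5, 2.0)]

def energy_in(t0, t1, samples):
    # Trapezoidal integration of power over [t0, t1]; samples are assumed
    # sorted by time and covering the interval.
    total = 0.0
    for (ta, pa), (tb, pb) in zip(samples, samples[1:]):
        lo, hi = max(ta, t0), min(tb, t1)
        if lo >= hi:
            continue
        # Linearly interpolate power at the clipped endpoints.
        pl = pa + (pb - pa) * (lo - ta) / (tb - ta)
        ph = pa + (pb - pa) * (hi - ta) / (tb - ta)
        total += 0.5 * (pl + ph) * (hi - lo)
    return total

for name, t0, t1 in regions:
    print(f"{name:>6}: {energy_in(t0, t1, power_samples):6.1f} J")
```

The real framework performs this alignment automatically inside the Paraver/Vampir tool chains, so developers see per-region energy alongside the usual performance counters.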


Responsible Partner: Universität Hamburg