UNIVERSITÉ DE STRASBOURG
ÉCOLE DOCTORALE MATHÉMATIQUES, SCIENCES DE L'INFORMATION ET DE L'INGÉNIEUR
UdS - INSA - ENGEES

THESIS presented by Alexandra JIMBOREAN, defended publicly on 14 September 2012, to obtain the degree of Doctor of the University of Strasbourg. Discipline: Computer Science.

Adapting the Polytope Model for Dynamic and Speculative Parallelization

Thesis advisor:
    Philippe Clauss, Professor, Université de Strasbourg
Reviewers:
    Albert Cohen, Research Director, INRIA - ENS, Paris
    André Seznec, Research Director, INRIA - IRISA, Rennes
Other members of the jury:
    Vincent Loechner, Associate Professor, Université de Strasbourg
    John Cavazos, Professor, University of Delaware, USA
    François Bodin, Technical Director, CAPS Enterprise, Rennes

Laboratoire des Sciences de l'Image, de l'Informatique et de la Télédétection, UMR 7005

Per aspera ad astra

Contents

1 Introduction
    Context of the work
    Static-dynamic frameworks for analysis and optimization
Polyhedral transformations and speculative parallelism: an overview
    The polytope model
    Notations and definitions
    The polyhedral representation of a loop nest
    Data access functions
    Scheduling the statements
    Static dependence analysis
    Dependence vectors
    Polyhedral transformations
    Concluding remarks
    General overview of traditional TLS systems
    State of the art
    Limits of current TLS systems
From polyhedral transformations to speculative parallelism
    Can general-purpose codes be modeled in the polyhedral abstraction?
    Indirect references
    Sparsely allocated linked list
    Transpose matrix with parameters
    Banded matrix with indirect references
    Dynamic control structures
    Adapted polyhedral model as an analysis model: dynamic dependence analysis
    Dependence analysis
    Adapted polyhedral model as a transformation model: code patterns
    Parallelizing transformations with code patterns
    Complex code patterns
    Dynamic code generation using code patterns
    Genericness of code patterns
    Static versions vs Patterns
    Patterns vs JIT
    3.4 Verification and rollback systems in the polyhedral model
    Verification
    Rollback
    Conclusions
Strategies for full exploitation of the available parallelism
    Program phases
    Multiversioning
    Related work
    Open questions and limits of modern compilers
    Adapting to the current behavior using multiple versions
    The chunking system
    Chunking one sequential loop
    Chunking with transformed loops
    Rollbacking with chunks
    Profiling with chunks
    Conclusions
The VMAD framework
    Static component
    Short introduction of the LLVM IR
    Identify the loop nests marked for speculative parallelization
    Generating multiple versions
    Inserting static information: headers and parameters
    Dynamic component
    Code manipulation
    Runtime code orchestration for speculative parallelization
    Conclusions
Evaluation of the VMAD framework
    Code instrumentation
    Related work
    Instrumentation by sampling in VMAD
    Instrumenting memory accesses
    Results
    Speculative parallelization
    VMAD's runtime overhead
Other applications
    Simple dynamic dependence analysis
    Runtime version selection
    Conclusions
Conclusions
    Perspectives
Bibliography

List of Figures

2.1 Sample loop nest (left) and the iteration domain of S for N = 6 (right) (courtesy: B. Pradelle)
Domain of statement S from Figure 2.1
Overview of a TLS system
Illustration of the usual TLS parallelization by chunks and conflict detection (courtesy: Ph. Clauss)
Examples of loop nests with an indirect memory reference
Memory allocation for the linked list
Table for computing the distance vectors dynamically (2-depth loop nest)
Simplified code pattern
General structure of a loop nest
Graphic representation of the iterations of a loop nest (courtesy: Ph. Clauss)
General structure of a parallel loop nest
Code patterns to balance genericity and performance (courtesy: Ph. Clauss)
Illustration of basic scalar values verification in the parallel loop nest (courtesy: Ph. Clauss)
Alternating the execution of different versions, during one run of the loop nest (courtesy: Ph. Clauss)
Loop chunking
Chunked versions example
Chunking time overhead
Speed-up obtained with various chunking strategies
Framework overview: static-dynamic collaboration
Framework overview (courtesy: Ph. Clauss)
Annotated source code
Delimiting code regions using barriers compared to metadata
Loop in LLVM IR with metadata
Cloning
Rebuild control flow graph in clones
Multi-versioning
Each version is transformed into a suitable representation
5.10 Callback in x86_64 assembly code
Code structure
Shadowed control flow graph
Control flow graph rewritten in inline code
Control flow graph with φ nodes
Control flow graph with φ nodes and labels
Loop structure with virtual iterators
Loop structure in the parallel code pattern
Verify write accesses
Original iterators & upper bounds wrt new iterators
Computation of the next sequential iterations
Verification code
Generating OMP code in LLVM IR
List of headers and parameters
Static-dynamic collaboration
Code orchestration
Chunk size increase for the original sequential version
memcpy time overhead
memcpy calls
memcpy calls with varying chunk sizes (courtesy: J-F. Dollinger)
Static component of the framework
Runtime component of the profiling framework
Loop nest instrumentation
Code structure
Stack structure of each loop level
Runtime overhead with O0 and O3 optimization levels
Code extract from ks and its corresponding interpolation functions
Instrumenting memory accesses in SPEC CPU
Speculative parallelism results I
Speculative parallelism results II
Percentage of time spent by VMAD in different execution phases
Code speculatively parallelized with VMAD, compared to OpenMP (I)
Code speculatively parallelized with VMAD, compared to OpenMP (II)
Code speculatively parallelized with VMAD, compared to OpenMP (III)
Dynamic dependence analysis with value range analysis and gcd tests
Dynamic code selection with VMAD, logarithmic scale (courtesy: Ph. Clauss)

Chapter 1
Introduction

1.1 Context of the work

The advent of multicore processors imposes new strategies for reaching good software performance and exploiting the available hardware to full advantage. The key to success is now tightly tied to parallelism: applications have to be run in phases where many computations are performed simultaneously on all the available processor cores. There are several options to accomplish this: programs can be written by explicitly describing what can be run in parallel and what cannot; the compiler can extract parallel computations from a serial code by performing advanced code analyses and then generate parallel code; or the software can be run on top of a runtime system, or virtual machine, performing on-the-fly analyses and parallelizations. Nevertheless, although each option has been intensely studied, they all have inherent limitations.

Even if many parallel programming languages are available, such as OpenMP, MPI, Cilk, TBB, HMPP, OpenCL or CUDA, parallel programming is in general difficult, because the programmer is required to handle complex issues, such as selecting an algorithm suitable for parallelization, analyzing the dependences between parts of the code, ensuring correct semantics, being aware of the underlying hardware characteristics, and using a suitable programming model. Moreover, performance portability is difficult to ensure due to hardware heterogeneity. Such issues significantly increase the software time-to-market.

To aid the programmer in delivering parallel code, building compilers that perform automatic parallelization has become an active research area. Compilers dedicated to automatic parallelization have a rich history, particularly focusing on scientific computing applications.
Examples of such compilers are SUIF [132], Polaris [13] and PIPS [35], which are able to automatically parallelize sequential programs without the programmer's intervention. They mostly focus on for-loops accessing multi-dimensional arrays and referencing array elements through linear functions of the loop indices. Thanks to precise data dependence analysis, such loops can take advantage of advanced parallelizing transformations such as tiling, loop splitting or fusion, loop interchange, loop skewing and, more generally, linear loop transformations. The theory concerning these loop analyses and transformations is unified in a well-known mathematical framework called the polytope model [44, 45]. However, its applicability is limited to array-based scientific applications exhibiting explicit linearity of their loop bounds and array accesses. Loops exhibiting complex exit conditions and memory accesses through pointers or indirect arrays cannot be handled at compile time using these techniques, since crucial information can only be known at execution time (the code sketch below contrasts a loop that fits this model with one that does not). Moreover, it is generally difficult for a compiler to select the parallelizing and optimizing transformations that would perform well under various circumstances (in different execution contexts or on distinct processors).

The third solution is to run the targeted program in the frame of a runtime system whose role is to take advantage of the available dynamic information and to automatically parallelize some code parts on the fly. One main advantage is that the effectiveness of a code transformation can immediately be evaluated, and the course of execution can be adjusted accordingly by the runtime system in real time. In particular, speculative parallelizing techniques become possible, since an online verification can trigger recovery actions whenever previously speculated information is invalidated, such as canceling wrong computations and restarting them from the last correct state. This approach does not have, a priori, any limitation on the type of targeted code; however, it is strongly constrained by the time overhead it inevitably introduces. Hence, it is impossible to perform at runtime analyses and optimizations as complete as those done by a compiler. On the other hand, generating efficient code is mandatory.

In this context, speculative parallelization is an essential strategy for handling the parallelization of general-purpose codes. A well-researched direction in speculative parallelization is thread-level speculation (TLS) [19, 66, 70, 76, 103, 105, 106, 114, 119, 137]. A TLS framework allows optimistic execution of parallel code regions before all dependences between instructions are known. Hardware or software mechanisms track register and memory accesses to determine whether any dependence violation occurs. In such cases, the register and memory state is rolled back to a previous correct state and sequential re-execution is initiated.

Our proposal focuses on enhancing previous work on speculative parallelism by improving the quality of the generated parallel code.
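To make this contrast concrete, here is a small hypothetical C++ sketch (the functions and data structures are invented for illustration and do not come from the thesis or its benchmarks): the first loop nest has affine bounds and subscripts and can be handled by the static polytope model, while the second traverses a linked list and can only be analyzed, and possibly parallelized speculatively, at run time.

```cpp
#include <cstddef>

// Affine loop nest: bounds and array subscripts are linear functions of the
// loop indices (i, j), so a polyhedral compiler can analyze its dependences
// and apply transformations such as interchange, skewing or tiling statically.
void affine_kernel(double *A, std::size_t N) {
    for (std::size_t i = 1; i < N; ++i)
        for (std::size_t j = 1; j < N; ++j)
            A[i * N + j] = A[(i - 1) * N + j] + A[i * N + (j - 1)];
}

struct Node { double value; Node *next; };

// Pointer-chasing loop: the exit condition and the addresses touched depend on
// the runtime layout of the list, so neither the trip count nor the dependences
// are visible at compile time; such loops are out of reach of the static
// polytope model and motivate runtime and speculative parallelization.
void list_kernel(Node *head, double factor) {
    for (Node *p = head; p != nullptr; p = p->next)
        p->value *= factor;
}
```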
The highlights of this thesis are:

1. Optimize and parallelize the code at runtime, by applying a polyhedral transformation prior to parallelization. This has two consequences: first, it boosts the performance of the generated parallel code; second, it exhibits parallelism in codes which could not be parallelized in their original form. Chapter 2 introduces the mathematical background of the polyhedral model and a general overview of TLS systems, followed by our proposal on how the polyhedral model can be applied speculatively at runtime, in Chapter 3.

2. Fully exploit parallelism in codes that exhibit different phases during one execution, by adapting to the current behavior of the code. Our strategy relies on slicing the execution into intervals, where each interval represents a program phase. For each phase, we prepare a custom code version, in accordance with the properties exhibited by the code during that phase. Chapter 4 details the multi-versioning and chunking techniques we employ for this purpose.

From the point of view of implementation techniques, this thesis provides insight into:

1. optimizing and parallelizing transformations performed at runtime, which consist of (1) a linear re-scheduling of the target loop nest's iterations in order to reveal parallelism and (2) the parallelization of the outermost parallel loop in the transformed code;

2. the adaptation of the polytope model to dynamic and speculative parallelization: speculation based on representing the memory accesses as predicting linear functions of the loop indices, obtained by interpolating addresses accessed during execution samples; dynamic dependence analysis by computing dependence distance vectors from memory addresses accessed during execution samples; dynamic generation of speculatively parallel code, using binary patterns patched at runtime; and rollback in case of misprediction, by canceling the last executed chunk (a simplified sketch of this chunked speculation scheme is given at the end of this section);

3. the global orchestration of the loop nest execution, by running successive slices, or chunks, of the outermost loop: the chunks are executed serially, but intra-chunk iterations run in parallel;

4. a software implementation as a collaborative static-dynamic framework consisting of extensions of the LLVM compiler and an x86-64 runtime system.

We implemented the entire system in a platform called VMAD, which stands for Virtual Machine for Advanced, Dynamic Code Instrumentation and Optimization. The technical details of the framework are presented in Chapter 5, which describes the code preparation at compile time and the actions taken by the runtime system during execution. The experimental evaluations are included in Chapter 6. Moreover, we illustrate VMAD as a generic framework which, in addition to performing speculative parallelization, is suitable for accomplishing various types of complex code instrumentation, analysis and optimization. Such applications are described in Chapter 7. Last but not least, the conclusions, introspections and perspectives of our work are presented at the end of the thesis.

The next section is dedicated to setting the context of our research, by reviewing outstanding previous proposals concerning each important aspect of our work.
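Before turning to that related work, the following rough illustration shows the chunked speculative execution scheme listed above. This is a deliberately simplified hypothetical sketch rather than VMAD's actual code generation: the helper and parameter names are invented, and the real system predicts every memory access of the loop nest, generates binary code from patterns, and restores memory on rollback.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical helper: 'base' and 'stride' are the coefficients of the linear
// prediction addr(i) = base + stride * i, interpolated from addresses observed
// during a few instrumented sample iterations.
static bool write_follows_prediction(long i, char *written,
                                     char *base, std::ptrdiff_t stride) {
    return written == base + stride * i;
}

// Skeleton of chunked speculative execution of a loop of N iterations.
// body(i) performs one original iteration and returns the address it wrote;
// run_sequentially(lo, hi) is the safe fallback used after a rollback.
template <class Body, class Fallback>
void run_speculatively(long N, long chunk, Body body, Fallback run_sequentially,
                       char *base, std::ptrdiff_t stride) {
    for (long lo = 0; lo < N; lo += chunk) {           // chunks run serially
        long hi = lo + chunk < N ? lo + chunk : N;
        std::atomic<bool> ok{true};

        // Iterations inside the current chunk are launched in parallel,
        // optimistically, while each write is checked against the prediction.
        #pragma omp parallel for
        for (long i = lo; i < hi; ++i) {
            char *written = body(i);
            if (!write_follows_prediction(i, written, base, stride))
                ok.store(false, std::memory_order_relaxed);   // misprediction
        }

        if (!ok.load()) {
            // Rollback: cancel the chunk's effects (memory restoration is
            // omitted here) and re-execute the same iteration range serially.
            run_sequentially(lo, hi);
        }
    }
}
```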
1.2 Static-dynamic frameworks for analysis and optimization

Our framework, VMAD, is a platform for dynamic code instrumentation and optimization, targeting automatic parallelization on the fly. In what follows we present some of the outstanding related works concerning the main characteristics and contributions of VMAD, followed by a more detailed survey of the state of the art in the subsequent chapters. Chapter 2, Section 2.3 sets the frame of our work as a TLS system, compared to previous techniques. Once all related notions are introduced, together with the background of our work, Chapter 4, Section 4.2 presents previous approaches to multiversioning and its applications, while Section 4.3 reviews works that performed chunking on loops. A section of Chapter 5 gives an overview of various works targeting code instrumentation and profiling, and finally Chapter 7, Section 7.2 describes several works on runtime selection among several optimized versions.

To outline the context of our work, we review in what follows the closest and most representative tools and methods designed to perform:

- code instrumentation: PIN [79] and DynamoRIO [18];
- polyhedral code optimization: Pluto [15];
- speculative code parallelization: the LRPD [106] and R-LRPD [33] tests.

PIN: Pin [79] is a dynamic binary instrumentation engine for the Intel architectures (x86 and x86-64 instruction sets) which allows the programmer to build customized instrumentation tools, called Pintools. It has already been included in many Intel commercial tools, such as Intel Parallel Inspector, Intel Parallel Amplifier and Intel Parallel Advisor. PIN inserts snippets of code into a running program to collect runtime information. Thus, PIN is suitable both for program analysis, useful in performance profiling, error detection, and capture and replay, and for architectural studies, offering support for processor and cache simulation or trace collection. Thanks to its flexibility, various tools can be built for emulation, security and parallel program analysis. Since PIN is a dynamic engine, it uses just-in-time compilation to compile (and optimize) the instrumentation code just before it is executed. Pintools can be attached to the binary code or even to the running process, without the need to recompile or re-link the code. Using this mechanism, PIN can also handle dynamically generated code. A Pintool consists of three types of routines: instrumentation, analysis and callback routines.

1. Instrumentation routines: once the program is loaded in memory, PIN's strategy is to take control of the program and to JIT-compile small pieces of code, thereby inserting into the original code calls to the analysis routines.

2. Analysis routines: they are executed when the code to which they are associated begins its execution, and they collect the required dynamic information.

3. Callback routines: these are special callbacks invoked when certain conditions are met, or when specific events occur.

PIN provides an explicit API that is very flexible and easy to use for building tools dedicated to a large palette of instrumentation types. However, due to its overhead, PIN is not yet ready to be used as an online profiler, as part of a dynamic optimization phase. According to Bach et al. [6], the inherent overhead of PIN is around 30% when no Pintools are executed. Depending on the purpose of the instrumentation, extra overhead is added: for instance, for basic block counting, the overhead of PIN is up to 2000%, according to Hazelwood et al. [79]. Unlike VMAD, one cannot instrument by sampling, since the instrumentation inserted with PIN cannot be disabled during the execution of the code, leading to a high overhead. Thus it cannot be used in a dynamic optimizer.
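As a point of comparison for the instrumentation style discussed above, the following is a minimal Pintool sketch in the spirit of PIN's classic basic-block counting example. The exact headers, build flags and tool scaffolding depend on the PIN kit version; this is an illustrative approximation rather than a verbatim sample from the PIN distribution.

```cpp
#include "pin.H"
#include <iostream>

static UINT64 bblCount = 0;

// Analysis routine: called at runtime before each basic block executes.
VOID CountBbl() { bblCount++; }

// Instrumentation routine: called when PIN encounters a new trace; it inserts
// a call to the analysis routine in front of every basic block of the trace.
VOID Trace(TRACE trace, VOID *v) {
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)CountBbl, IARG_END);
}

// Callback invoked when the instrumented application exits.
VOID Fini(INT32 code, VOID *v) {
    std::cerr << "Basic blocks executed: " << bblCount << std::endl;
}

int main(int argc, char *argv[]) {
    PIN_Init(argc, argv);                    // initialize PIN
    TRACE_AddInstrumentFunction(Trace, 0);   // register the instrumentation routine
    PIN_AddFiniFunction(Fini, 0);            // register the exit callback
    PIN_StartProgram();                      // never returns
    return 0;
}
```

Note that, as discussed above, this instrumentation stays active for the whole run: it cannot be switched off after a few sample iterations, which is precisely the limitation VMAD's sampling-based instrumentation avoids.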
DynamoRIO: Very similar to PIN, DynamoRIO [18] is a framework designed for dynamic instrumentation of codes, allowing the programmer to write their own tool for instrumentation, analysis, profiling, optimization or any other code transformation. Thus, unlike other similar tools, DynamoRIO does not only monitor the code, but can also perform any kind of code modification while the application is running, as specified by the programmer. From a performance perspective, the base overhead of DynamoRIO is 45% on average; however, an optimized version outperforms PIN for basic block counting [18]. Still, its high overhead makes this tool unsuitable for being embedded in a dynamic optimizer.

Pluto: Pluto [15] is a framework for automatic parallelization and optimization of affine loop nests. It is a source-to-source compiler that builds an abstract representation of the loop nest, using the polyhedral model, for further loop transformations. All loop optimizations, such as tiling, fusion and unrolling, are manipulated in the polyhedral representation, from which the loop nest is regenerated in its transformed form. The goal is to optimize for data locality and to expose parallelism.
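To give a flavor of the kind of transformation such a polyhedral source-to-source compiler performs, here is a hand-written illustration (not actual Pluto output; the kernel, tile size and loop structure are chosen purely for exposition) of a simple affine loop nest before and after tiling and parallelization of the outermost tile loop:

```cpp
// Original affine loop nest: accumulates A*B into C (C assumed zero-initialized).
void matmul(const double *A, const double *B, double *C, int N) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// Tiled and parallelized variant of the same computation, in the style of the
// code a polyhedral optimizer may generate: iterations are grouped into
// T x T x T tiles to improve data locality, and the outermost tile loop,
// which carries no dependence, is run in parallel.
void matmul_tiled(const double *A, const double *B, double *C, int N) {
    const int T = 32;  // illustrative tile size
    #pragma omp parallel for
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T && i < N; ++i)
                    for (int k = kk; k < kk + T && k < N; ++k)
                        for (int j = jj; j < jj + T && j < N; ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```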