Table of Contents
Authors: Steven Watanabe Gary Funck Nenad Vukicevic
Intrepid Technology, Inc.
The current implementation of Clang UPC (a UPC language compiler and runtime built using the Clang/LLVM infrastructure - https://github.com/Intrepid/clang-upc/) implements UPC language semantics with only minimal changes of the LLVM core infrastructure.
All accesses to UPC shared data are lowered into calls to access function which are passed to the LLVM (via LLVM IR - LLVM assembly language - http://llvm.org/docs/LangRef.html#abstract). As LLVM optimizations are done on the LLVM IR level, some opportunities for optimization will be missed because some optimizations work directly on memory reference; a call to a runtime procedure will mask the underlying memory reference.
The goal of this document is to describe approaches that would allow UPC pointers-to-shared (or some other remote pointer) to be expressed in the LLVM IR and to in turn gain the benefits of LLVM optimizations that operate on memory references.
Creation of a new (wide - e.g. 128 bit) remote IR pointer is not considered by this document as it is outside of the scope for this project. In addition, 64 bit processor architectures are the focus of this design.
Two approaches for expressing UPC pointers-to-shared in the LLVM IR are described in this document. Both of them use the LLVM IR native pointer as a container for all the necessary UPC pointer-to-shared data and attributes. Given that the LLVM IR load/store operations will be used to load/store remote data, all of the LLVM optimizations should work as is. After the LLVM optimizations and transformations, the LLVM IR loads and stores using the IR UPC container must be converted back into the appropriate run-time calls.
The main difference between these two approaches is in how the UPC pointer-to-shared arithmetic is expressed.
In this approach, the LLVM assembly language (IR) has no knowledge of the UPC language and its semantics (shared variables, shared array layout, ,,,). Instead, a unique LLVM address space is used to label LLVM pointers that contain a remote pointer: (1) thread (rank, node) number, (b) offset in the thread’s shared address space. They are referred to as IR remote access pointers in the rest of the document.
64 bits provides enough bits to encode a large shared space on the system with large number of threads. For example, a configuration that allocates space for 20 bits of the threads number provides the following:
+----------+-----------------------+ | Thread | Offset | +----------+-----------------------+ 6 4 4 0 3 4 3 Thread number - 20 bits - 1,048,576 threads Offset - 44 bits - 16,384 Tb
The main implementation points of this approach:
C source: int __attribute__((address_space(16))) *ptr Generated LLVM IR: @ptr = common global i32 addrspace(16)* null, align 8
This approach is similar to the approach taken by the Chapel compiler experimental LLVM optimization project.
The following changes to Clang UPC and LLVM are required:
Clang
LLVM
This approach has the following advantages:
This approach extends the LLVM implementation of pointer-related operations so that they understand UPC pointer-to-shared arithmetic and comparison operations. These extensions require that UPC pointer-to-shared arithmetic is expressed directly in the LLVM assembly (IR). The following UPC related pointer information needs to be passed in the LLVM IR:
In order to use the already existing LLVM infrastructure this information can be passed using one of the following methods:
There is a certain risk to this approach as it is not clear whether there are situations where an optimization pass might convert LLVM pointers to "void *" pointers; this would remove the needed element size of the UPC pointer. For correct operation, any LLVM instruction that converts a UPC pointer-to-shared to another pointer type must preserve the blocking factor and element size information.
Expressing operations on UPC pointers-to-shared directly in thee LLVM intermediate representation (IR) will require the following changes to the Clang UPC compiler:
Alternatively, a UPC blocking factor attribute and an element size attribute could be added to the LLVM IR pointer definition. For example:
@ptr = common global i32 blocksize(1024) elemsize(16) addrspace(16)* null, align 8
This is probably a cleaner solution but it may have a wide impact within the LLVM IR infrastructure. This impact may be too high to justify the addition of new attributes.
Adding the ability to describe UPC pointer-to-shared types and operations on those types directly in the LLVM IF has several advantages:
There are some disadvantages and risks associated with this approach:
Both approaches outlined above rely on the fact that UPC pointer-to-shared values (remote access pointers) are placed inside an LLVM IR native pointer container and marked with the special address space attribute. For example, loading a value from a variable in a specified address space looks this:
@ptr = common global i32 addrspace(16)* null, align 8 @var_local = common global i32 0, align 4 [...] %0 = load i32 addrspace(16)** @ptr, align 8 %1 = load i32 addrspace(16)* %0, align 4 store i32 %1, i32* @var_local, align 4 [...]
The UPC memory consistency model defines two memory modes, 'strict' and 'relaxed'. As the names suggest, program behavior under strict consistency is more constrained than that under relaxed consistency, as memory accesses on each thread must preserve the program order. The default UPC consistency model is 'relaxed'.
The UPC memory consistency model is summarized below:
Strict accesses always appear (to all threads) to have executed in program order with respect to other strict accesses:
Given that Clang UPC generates LLVM IR load/store instructions (which in turn refer to a reserved address space) to describe shared memory accesses, some mechanism for ensuring that the accesses are re-ordered by the LLVM optimizer and code generator according to the restrictions imposed by the UPC memory consistency model. There are a few implementation choices:
LLVM atomics (based on the C++11 standard) provides six levels of atomicity that are used to achieve a balance between performance and necessary access guarantees.
Given that the SequentiallyConsistent level guarantees the total ordering between all SequentiallyConsistent operations, it seems a suitable candidate for implementation of UPC strict memory operations. The Monotonic memory order can be used for relaxed UPC accesses as consistent ordering affecting the same address is enforced. Atomic load/store instructions must be used to to enforce the necessary ordering of UPC shared accesses, because ordering attributes are only available on LLVM IR "load/store atomic" operations.
The following UPC memory access modes are mapped into LLVM atomic operation memory orderings:
UPC Operation | LLVM atomics |
---|---|
Relaxes Read | Atomic Monotonic Load |
Relaxed Write | Atomic Monotonic Store |
Strict Read | Atomic SequentiallyConsistent Load |
Strict Write | Atomic SequentiallyConsistent Store |
The following example shows an atomic store which asserts the SequentiallyConsistent ordering:
store atomic i32 %17, i32* %9 seq_cst, align 4
In this example an ordering keyword, "seq_cst", is used to label the desired ordering constraint (SequentiallyConsistent) on the LLVM "store atomic" IR instruction.
Clang UPC defines all UPC shared variables as LLVM globals located in a special data section (upc_shared). For example:
shared int var; shared int var_a[4*THREADS];
generates the following LLVM code:
@var = global i32 0, section "upc_shared", align 4 @var_a = global [4 x i32] zeroinitializer, section "upc_shared", align 4
In the case where a UPC pointer-to-shared value is encoded as an LLVM native pointer, the address space attribute can also be used to encode the shared variable blocking factor (and element type if necessary). In this case the following IR code can be generated (assuming that bit 31 in the address space is used differentiate shared and local space):
@var = addrspace(0x80000000) global i32 0, section "upc_shared", align 4 @var_a = addrspace(0x80000004) global [4 x i32] zeroinitializer, section "upc_shared", align 4
For the design alternative where a UPC pointer-to-shared value is described using attributes to provide the object’s UPC defined blocking factor and element size, the generated LLVM IR definitions might appear as follows:
@var = addrspace(0x80000000) global i32 0, section "upc_shared", align 4 @var_a = addrspace(0x80000000) blocksize(0) elemsize(4) global [4 x i32] zeroinitializer, section "upc_shared", align 4
This section lists all of the available LLVM optimizations together with their potential impact on a UPC compiler implementation which uses LLVM load ans store IR instructions to express UPC shared memory accesses.
Symbol | Description |
---|---|
Y | Optimization works |
N | Optimization does not work |
N/A | Optimization is not applicable |
Rmt Ptr | UPC Ptr | Optimization | Description |
---|---|---|---|
Y | Y | -aa-eval | Exhaustive Alias Analysis Precision Evaluator |
Y | Y | -adce | Aggressive Dead Code Elimination |
-alloca-hoisting | Hoisting alloca instructions in non-entry blocks to the entry block | ||
Y | Y | -always-inline | Inliner for always_inline functions |
Y | Y | -argpromotion | Promote by reference arguments to scalars |
? Y | Y | -asan | AddressSanitizer: detects use-after-free and out-of-bounds bugs. |
? Y | Y | -asan-module | AddressSanitizer: detects use-after-free and out-of-bounds bugs.ModulePass |
Y | Y | -basicaa | Basic Alias Analysis (stateless AA impl) |
Y | Y | -basiccg | CallGraph Construction |
Y | Y | -bb-vectorize | Basic-Block Vectorization |
Y | Y | -block-freq | Block Frequency Analysis |
? Y | Y | -bounds-checking | Run-time bounds checking |
Y | Y | -branch-prob | Branch Probability Analysis |
Y | Y | -break-crit-edges | Break critical edges in CFG |
Y | Y | -codegenprepare | Optimize for code generation |
Y | Y | -constmerge | Merge Duplicate Global Constants |
Y | Y | -constprop | Simple constant propagation |
? | -correlated-propagation | Value Propagation | |
Y | Y | -cost-model | Cost Model Analysis |
Y | Y | -count-aa | Count Alias Analysis Query Responses |
Y | Y | -da | Dependence Analysis |
? | -datalayout | Data Layout | |
Y | Y | -dce | Dead Code Elimination |
Y | Y | -deadargelim | Dead Argument Elimination |
Y | Y | -deadarghaX0r | Dead Argument Hacking (BUGPOINT USE ONLY; DO NOT USE) |
Y | Y | -debug-aa | AA use debugger |
Y | Y | -debug-ir | Enable debugging IR |
Y | Y | -delinearize | Delinearization |
Y | Y | -dfsan | DataFlowSanitizer: dynamic data flow analysis. |
Y | Y | -die | Dead Instruction Elimination |
Y | Y | -domfrontier | Dominance Frontier Construction |
Y | Y | -domtree | Dominator Tree Construction |
N/A | N/A | -dot-callgraph | Print call graph to dot file |
N/A | N/A | -dot-cfg | Print CFG of function to dot file |
N/A | N/A | -dot-cfg-only | Print CFG of function to dot file (with no function bodies) |
N/A | N/A | -dot-dom | Print dominance tree of function to dot file |
N/A | N/A | -dot-dom-only | Print dominance tree of function to dot file (with no function bodies) |
N/A | N/A | -dot-postdom | Print postdominance tree of function to dot file |
N/A | N/A | -dot-postdom-only | Print postdominance tree of function to dot file (with no function bodies) |
N/A | N/A | -dot-regions | Print regions of function to dot file |
N/A | N/A | -dot-regions-only | Print regions of function to dot file (with no function bodies) |
? Y | Y | -dse | Dead Store Elimination |
? | -early-cse | Early CSE | |
Y | Y | -extract-blocks | Extract Basic Blocks From Module (for bugpoint use) |
Y | Y | -functionattrs | Deduce function attributes |
? | -generic-to-nvvm | Ensure that the global variables are in the global address space | |
Y | Y | -globaldce | Dead Global Elimination |
Y | Y | -globalopt | Global Variable Optimizer |
Y | Y | -globalsmodref-aa | Simple mod/ref analysis for globals |
Y | Y | -gvn | Global Value Numbering |
Y | Y | -indvars | Induction Variable Simplification |
Y | Y | -inline | Function Integration/Inlining |
Y | Y | -inline-cost | Inline Cost Analysis |
Y | Y | -insert-gcov-profiling | Insert instrumentation for GCOV profiling |
Y | Y | -instcombine | Combine redundant instructions |
Y | Y | -instcount | Counts the various types of Instructions |
Y | Y | -instnamer | Assign names to anonymous instructions |
Y | Y | -instsimplify | Remove redundant instructions |
Y | Y | -internalize | Internalize Global Symbols |
Y | Y | -intervals | Interval Partition Construction |
Y | Y | -ipconstprop | Interprocedural constant propagation |
Y | Y | -ipsccp | Interprocedural Sparse Conditional Constant Propagation |
Y | Y | -iv-users | Induction Variable Users |
Y | Y | -jump-threading | Jump Threading |
Y | Y | -lazy-value-info | Lazy Value Information Analysis |
Y | Y | -lcssa | Loop-Closed SSA Form Pass |
Y | Y | -libcall-aa | LibCall Alias Analysis |
Y | Y | -licm | Loop Invariant Code Motion |
Y | Y | -lint | Statically lint-checks LLVM IR |
Y | Y | -loop-deletion | Delete dead loops |
Y | Y | -loop-extract | Extract loops into new functions |
Y | Y | -loop-extract-single | Extract at most one loop into a new function |
Y | Y | -loop-idiom | Recognize loop idioms |
Y | Y | -loop-instsimplify | Simplify instructions in loops |
Y | Y | -loop-reduce | Loop Strength Reduction |
Y | Y | -loop-reroll | Reroll loops |
Y | Y | -loop-rotate | Rotate Loops |
Y | Y | -loop-simplify | Canonicalize natural loops |
Y | Y | -loop-unroll | Unroll loops |
Y | Y | -loop-unswitch | Unswitch loops |
Y | Y | -loop-vectorize | Loop Vectorization |
Y | Y | -loops | Natural Loop Information |
Y | Y | -lower-expect | Lower expect Intrinsics |
Y | Y | -loweratomic | Lower atomic intrinsics to non-atomic form |
Y | Y | -lowerinvoke | Lower invoke and unwind, for unwindless code generators |
Y | Y | -lowerswitch | Lower SwitchInst’s to branches |
Y | Y | -mem2reg | Promote Memory to Register |
Y | Y | -memcpyopt | MemCpy Optimization |
Y | Y | -memdep | Memory Dependence Analysis |
Y | Y | -mergefunc | Merge Functions |
Y | Y | -mergereturn | Unify function exit nodes |
Y | Y | -metarenamer | Assign new names to everything |
Y | Y | -module-debuginfo | Decodes module-level debug info |
Y | Y | -msan | MemorySanitizer: detects uninitialized reads. |
Y | Y | -no-aa | No Alias Analysis (always returns may alias) |
Y | Y | -notti | No target information |
Y | Y | -nvvm-reflect | Replace occurences of __nvvm_reflect() calls with 0/1 |
N/A | N/A | -objc-arc | ObjC ARC optimization |
N/A | N/A | -objc-arc-aa | ObjC-ARC-Based Alias Analysis |
N/A | N/A | -objc-arc-apelim | ObjC ARC autorelease pool elimination |
N/A | N/A | -objc-arc-contract | ObjC ARC contraction |
N/A | N/A | -objc-arc-expand | ObjC ARC expansion |
Y | Y | -partial-inliner | Partial Inliner |
Y | Y | -partially-inline-libcalls | Partially inline calls to library functions |
Y | Y | -postdomtree | Post-Dominator Tree Construction |
Y | Y | -preverify | Preliminary module verification |
N/A | N/A | -print-alias-sets | Alias Set Printer |
N/A | N/A | -print-bb | Print BB to stderr |
N/A | N/A | -print-callgraph | Print a call graph |
N/A | N/A | -print-callgraph-sccs | Print SCCs of the Call Graph |
N/A | N/A | -print-cfg-sccs | Print SCCs of each function CFG |
N/A | N/A | -print-dom-info | Dominator Info Printer |
N/A | N/A | -print-externalfnconstants | Print external fn callsites passed constants |
N/A | N/A | -print-function | Print function to stderr |
N/A | N/A | -print-memdeps | Print MemDeps of function |
N/A | N/A | -print-module | Print module to stderr |
N/A | N/A | -print-used-types | Find Used Types |
Y | Y | -prune-eh | Remove unused exception handling info |
Y | Y | -reassociate | Reassociate expressions |
Y | Y | -reg2mem | Demote all values to stack slots |
Y | Y | -regions | Detect single entry single exit regions |
Y | Y | -sample-profile | Sample Profile loader |
Y | Y | -scalar-evolution | Scalar Evolution Analysis |
Y | Y | -scalarrepl | Scalar Replacement of Aggregates (DT) |
Y | Y | -scalarrepl-ssa | Scalar Replacement of Aggregates (SSAUp) |
Y | Y | -sccp | Sparse Conditional Constant Propagation |
Y | Y | -scev-aa | ScalarEvolution-based Alias Analysis |
Y | Y | -simplifycfg | Simplify the CFG |
Y | Y | -sink | Code sinking |
Y | Y | -slp-vectorizer | SLP Vectorizer |
Y | Y | -sroa | Scalar Replacement Of Aggregates |
N/A | N/A | -strip | Strip all symbols from a module |
N/A | N/A | -strip-dead-debug-info | Strip debug info for unused symbols |
N/A | N/A | -strip-dead-prototypes | Strip Unused Function Prototypes |
N/A | N/A | -strip-debug-declare | Strip all llvm.dbg.declare intrinsics |
N/A | N/A | -strip-nondebug | Strip all symbols, except dbg symbols, from a module |
Y | Y | -structurizecfg | Structurize the CFG |
Y | Y | -tailcallelim | Tail Call Elimination |
Y | Y | -targetlibinfo | Target Library Information |
? Y | Y | -tbaa | Type-Based Alias Analysis |
Y | Y | -tsan | ThreadSanitizer: detects data races. |
N/A | N/A | -verify | Module Verifier |
N/A | N/A | -view-callgraph | View call graph |
N/A | N/A | -view-cfg | View CFG of function |
N/A | N/A | -view-cfg-only | View CFG of function (with no function bodies) |
N/A | N/A | -view-dom | View dominance tree of function |
N/A | N/A | -view-dom-only | View dominance tree of function (with no function bodies) |
N/A | N/A | -view-postdom | View postdominance tree of function |
N/A | N/A | -view-postdom-only | View postdominance tree of function (with no function bodies) |
N/A | N/A | -view-regions | View regions of function |
N/A | N/A | -view-regions-only | View regions of function (with no function bodies) |
The above list of LLVM 3.4 optimization passes was gathered by invoking the LLVM optimizer with the -help switch.
--- opt -help ---