Clang UPC - UPC pointer-to-shared in LLVM IR

Table of Contents

1. Authors and Revision Information

2. Introduction

3. The UPC Pointer-to-shared LLVM IR Representation

3.1. Approach 1 - IR Remote Access Pointer

3.1.1. Required Changes
3.1.2. Advantages
3.1.3. Disadvantages

3.2. Approach 2 - IR UPC pointer-to-shared

3.2.1. Required Changes
3.2.2. Advantages
3.2.3. Disadvantages

4. Load/Store Operations

5. Storage Allocation

6. LLVM Optimizations

7. More Info

Chapter 1. Authors and Revision Information

Authors: Steven Watanabe, Gary Funck, Nenad Vukicevic

Intrepid Technology, Inc.

http://www.intrepid.com (Internet Archive)

Chapter 2. Introduction

The current implementation of Clang UPC (a UPC language compiler and runtime built using the Clang/LLVM infrastructure - https://github.com/Intrepid/clang-upc/) implements UPC language semantics with only minimal changes to the LLVM core infrastructure.

All accesses to UPC shared data are lowered into calls to runtime access functions, which are passed to LLVM via the LLVM IR (the LLVM assembly language - http://llvm.org/docs/LangRef.html#abstract). Because LLVM optimizations are performed at the LLVM IR level, and some of them work directly on memory references, some opportunities for optimization are missed: a call to a runtime procedure masks the underlying memory reference.

The goal of this document is to describe approaches that would allow UPC pointers-to-shared (or some other remote pointer) to be expressed in the LLVM IR and, in turn, to gain the benefits of LLVM optimizations that operate on memory references.

Note

Creation of a new (wide - e.g. 128 bit) remote IR pointer is not considered by this document as it is outside the scope of this project. In addition, 64 bit processor architectures are the focus of this design.

Chapter 3. The UPC Pointer-to-shared LLVM IR Representation

Two approaches for expressing UPC pointers-to-shared in the LLVM IR are described in this document. Both of them use the LLVM IR native pointer as a container for all the necessary UPC pointer-to-shared data and attributes. Given that the LLVM IR load/store operations will be used to load/store remote data, all of the LLVM optimizations should work as is. After the LLVM optimizations and transformations, the LLVM IR loads and stores using the IR UPC container must be converted back into the appropriate run-time calls.

The main difference between these two approaches is in how the UPC pointer-to-shared arithmetic is expressed.

LLVM IR Generic Remote Access Pointer: A remote thread (node, rank) and an offset into the remote shared space are combined into a single LLVM IR pointer. UPC pointer-to-shared arithmetic is performed inside Clang (the C front end), the same way as in the current Clang UPC.
LLVM IR UPC-Aware Remote Access Pointer: The UPC pointer-to-shared representation, which includes thread, phase, and an offset into the remote shared space, is placed in a single LLVM IR pointer. UPC pointer-to-shared arithmetic is performed inside the LLVM back end.

3.1. Approach 1 - IR Remote Access Pointer

In this approach, the LLVM assembly language (IR) has no knowledge of the UPC language and its semantics (shared variables, shared array layout, ...). Instead, a unique LLVM address space is used to label LLVM pointers that contain a remote pointer: (a) thread (rank, node) number, (b) offset in the thread’s shared address space. They are referred to as IR remote access pointers in the rest of the document.

64 bits provide enough room to encode a large shared space on a system with a large number of threads. For example, a configuration that allocates 20 bits for the thread number provides the following:

+----------+-----------------------+
|  Thread  |      Offset           |
+----------+-----------------------+
 6        4 4                     0
 3        4 3

Thread number - 20 bits - 1,048,576 threads
Offset - 44 bits - 16,384 GB (16 TB)
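
As a minimal sketch of the layout above, the following C helpers pack a thread number and an offset into one 64-bit value and unpack them again. The names (rp_pack, rp_thread, rp_offset) and the fixed 20/44 bit split are assumptions made for illustration; they are not part of Clang UPC.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout from the example: thread in bits 63..44, offset in bits 43..0. */
#define RP_THREAD_BITS 20
#define RP_OFFSET_BITS 44
#define RP_OFFSET_MASK ((UINT64_C(1) << RP_OFFSET_BITS) - 1)
#define RP_THREAD_MAX  ((UINT64_C(1) << RP_THREAD_BITS) - 1)

/* Hypothetical helper: combine a thread number and an offset into one value. */
static uint64_t rp_pack(uint64_t thread, uint64_t offset)
{
    assert(thread <= RP_THREAD_MAX);   /* must fit in 20 bits */
    assert(offset <= RP_OFFSET_MASK);  /* must fit in 44 bits */
    return (thread << RP_OFFSET_BITS) | offset;
}

/* Hypothetical helper: extract the thread number (upper 20 bits). */
static uint64_t rp_thread(uint64_t rp)
{
    return rp >> RP_OFFSET_BITS;
}

/* Hypothetical helper: extract the offset (lower 44 bits). */
static uint64_t rp_offset(uint64_t rp)
{
    return rp & RP_OFFSET_MASK;
}
```

For instance, rp_pack(3, 0x1000) places 3 in bits 63..44 and 0x1000 in bits 43..0, and the two accessors recover those values.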

The main implementation points of this approach:

All the UPC shared memory accesses are translated into IR remote pointer loads and stores (instead of run-time calls) in the Clang front end. On the IR level, remote pointers look the same as local pointers, except that they refer to a separate (UPC specific) address space. Clang and LLVM allow data to be located in different address spaces, as the following C example illustrates:

C source:
int __attribute__((address_space(16))) *ptr;

Generated LLVM IR:
@ptr = common global i32 addrspace(16)* null, align 8

An extra pass will be added to LLVM which transforms remote IR pointers into calls to UPC runtime access routines.
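
The effect of such a lowering pass can be pictured with a small C simulation: a load or store through a remote pointer ends up as a call that decodes the thread and offset and then accesses that thread's shared segment. Everything here is hypothetical (the sim_ names, the in-process array standing in for remote memory, and the 20/44 bit split); the actual runtime access routines are not shown in this document.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SIM_THREADS = 4, SIM_SEGMENT_BYTES = 64 };

/* Simulated shared segments, one per thread; a stand-in for remote memory. */
static unsigned char sim_segment[SIM_THREADS][SIM_SEGMENT_BYTES];

/* Hypothetical access routine that a lowered remote load would call. */
static int32_t sim_get_i32(uint64_t rp)
{
    uint64_t thread = rp >> 44;                        /* upper 20 bits */
    uint64_t offset = rp & ((UINT64_C(1) << 44) - 1);  /* lower 44 bits */
    int32_t value;

    assert(thread < (uint64_t)SIM_THREADS);
    assert(offset + sizeof value <= (uint64_t)SIM_SEGMENT_BYTES);
    memcpy(&value, &sim_segment[thread][offset], sizeof value);
    return value;
}

/* Hypothetical access routine that a lowered remote store would call. */
static void sim_put_i32(uint64_t rp, int32_t value)
{
    uint64_t thread = rp >> 44;
    uint64_t offset = rp & ((UINT64_C(1) << 44) - 1);

    assert(thread < (uint64_t)SIM_THREADS);
    assert(offset + sizeof value <= (uint64_t)SIM_SEGMENT_BYTES);
    memcpy(&sim_segment[thread][offset], &value, sizeof value);
}
```

In this picture, an IR store of an i32 through an addrspace(16) pointer would become a call like sim_put_i32(rp, value), and the matching load would become sim_get_i32(rp).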

This approach is similar to the approach taken by the Chapel compiler experimental LLVM optimization project.

3.1.1. Required Changes

The following changes to Clang UPC and LLVM are required:

Clang
1. Generate IR remote pointer loads and stores instead of calls to the runtime access routines
LLVM
1. An IR transforming pass (after all optimizations) is required to transform IR remote pointer accesses into calls to the appropriate UPC access routines.

3.1.2. Advantages

This approach has the following advantages:

Both packed and struct pointer-to-shared representations are equally capable of creating an IR remote access pointer as there are plenty of bits in the 64 bit container to support large systems.
No major changes in the existing LLVM infrastructure are required to support this approach.
All LLVM optimizations should work as is.

3.1.3. Disadvantages

The Clang and LLVM configuration process must be changed to properly configure the IR remote pointer layout.
Clang UPC can produce overly complicated LLVM IR code because of the pointer-to-shared arithmetic.

3.2. Approach 2 - IR UPC pointer-to-shared

This approach extends the LLVM implementation of pointer-related operations so that they understand UPC pointer-to-shared arithmetic and comparison operations. These extensions require that UPC pointer-to-shared arithmetic is expressed directly in the LLVM assembly (IR). The following UPC related pointer information needs to be passed in the LLVM IR:

UPC pointer-to-shared value (inclusive of thread, offset, and phase)
UPC pointer-to-shared block size
UPC pointer-to-shared element size

In order to use the already existing LLVM infrastructure this information can be passed using one of the following methods:

The UPC pointer-to-shared value (packed UPC pointer-to-shared representation only) is passed inside an LLVM IR native pointer container.
The address space attribute is used to label UPC pointers-to-shared (only one bit is required). Address space 0 (default) is the local memory space.
The address space attribute is used to pass the UPC pointer-to-shared block size (31 bits allow for block sizes of up to 2G).
The LLVM element type is used to gather the size of the element that the UPC pointer-to-shared refers to.

There is a certain risk to this approach as it is not clear whether there are situations where an optimization pass might convert LLVM pointers to "void *" pointers; this would remove the needed element size of the UPC pointer. For correct operation, any LLVM instruction that converts a UPC pointer-to-shared to another pointer type must preserve the blocking factor and element size information.
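
To illustrate why the block size and element size must travel with the pointer value, here is a minimal C sketch of UPC pointer-to-shared addition on an unpacked (thread, phase, offset) triple. It assumes a block size of at least 1, a non-negative increment, and offsets measured in bytes from the start of each thread's shared space; the struct and function names are hypothetical and do not reflect the Clang UPC sources.

```c
#include <assert.h>
#include <stdint.h>

/* Unpacked view of a UPC pointer-to-shared; field names are illustrative. */
struct pts {
    uint64_t thread;  /* thread that owns the element */
    uint64_t phase;   /* element index within the current block */
    uint64_t offset;  /* byte offset within that thread's shared space */
};

/* Advance p by n elements (n >= 0) for a layout with the given block size
   (in elements, >= 1), element size (in bytes), and number of threads. */
static struct pts pts_add(struct pts p, uint64_t n, uint64_t block_size,
                          uint64_t elem_size, uint64_t threads)
{
    assert(block_size >= 1 && threads >= 1);
    assert(p.phase < block_size && p.thread < threads);

    uint64_t total_phase  = p.phase + n;
    uint64_t new_phase    = total_phase % block_size;
    uint64_t block_steps  = total_phase / block_size;  /* blocks crossed */
    uint64_t total_thread = p.thread + block_steps;
    uint64_t row_steps    = total_thread / threads;    /* wraps past the last thread */

    /* The local offset grows by one block per wrap, adjusted by the phase change
       (which may be negative when the pointer moves to the start of a block). */
    int64_t delta_elems = (int64_t)(row_steps * block_size)
                        + (int64_t)new_phase - (int64_t)p.phase;

    struct pts r;
    r.thread = total_thread % threads;
    r.phase  = new_phase;
    r.offset = (uint64_t)((int64_t)p.offset + delta_elems * (int64_t)elem_size);
    return r;
}
```

With a block size of 4, 4-byte elements, and 2 threads, adding 9 to a pointer at element 0 yields thread 0, phase 1, and byte offset 20. None of these results can be computed from the pointer value alone, which is why the block size and element size have to survive every pointer conversion.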

3.2.1. Required Changes

Expressing operations on UPC pointers-to-shared directly in the LLVM intermediate representation (IR) will require the following changes to the Clang UPC compiler:

The Clang UPC front-end will no longer generate the low-level instructions required to implement UPC pointer-to-shared arithmetic. Instead, the Clang UPC front-end will generate higher level IR instructions which express pointer-to-shared arithmetic operations in a single IR instruction.
UPC pointers-to-shared types will be expressed as an LLVM pointer type that has a special address space designation and an encoding of the UPC blocking factor.
The LLVM configuration script must be extended to accept arguments which describe the UPC pointer-to-shared representation.
LLVM pointer arithmetic must be extended to provide for UPC pointer-to-shared arithmetic. For this purpose, adding new intrinsic functions (versus adding new LLVM instructions) is likely the best approach because it limits the impact on LLVM. (See: http://llvm.org/docs/ExtendingLLVM.html).
A new LLVM IR pass will be developed that transforms UPC pointers-to-shared into appropriate UPC runtime calls.

Alternatively, a UPC blocking factor attribute and an element size attribute could be added to the LLVM IR pointer definition. For example:

@ptr = common global i32 blocksize(1024) elemsize(16)
       addrspace(16)* null, align 8

This is probably a cleaner solution but it may have a wide impact within the LLVM IR infrastructure. This impact may be too high to justify the addition of new attributes.

3.2.2. Advantages

Adding the ability to describe UPC pointer-to-shared types and operations on those types directly in the LLVM IR has several advantages:

UPC pointer-to-shared values are expressed as pointers in LLVM, thus allowing the possibility of applying LLVM pointer-related optimizations.
The generated IR is easier to read and understand.

3.2.3. Disadvantages

There are some disadvantages and risks associated with this approach:

The address space qualifier has only 20 bits of range and must be extended if it is used to encode the UPC blocking factor. At this time it is possible to extend it to 50 bits, but this might conflict with future Clang/LLVM changes which add more bits into the record field which records the address space qualifier.
Encoding the block size in the address space qualifier and relying on the LLVM element type size has some risks. The alternative of encoding both of them in the address space attribute (32 bits) is not practical as it would severely affect the UPC usability.

Chapter 4. Load/Store Operations

Both approaches outlined above rely on the fact that UPC pointer-to-shared values (remote access pointers) are placed inside an LLVM IR native pointer container and marked with the special address space attribute. For example, loading a value from a variable in a specified address space looks like this:

@ptr = common global i32 addrspace(16)* null, align 8
@var_local = common global i32 0, align 4
[...]
  %0 = load i32 addrspace(16)** @ptr, align 8
  %1 = load i32 addrspace(16)* %0, align 4
  store i32 %1, i32* @var_local, align 4
[...]

The UPC memory consistency model defines two memory modes, 'strict' and 'relaxed'. As the names suggest, program behavior under strict consistency is more constrained than that under relaxed consistency, as memory accesses on each thread must preserve the program order. The default UPC consistency model is 'relaxed'.

The UPC memory consistency model is summarized below:

Strict accesses always appear (to all threads) to have executed in program order with respect to other strict accesses:
1. all relaxed accesses must complete before any strict access
2. all strict accesses must complete before any other strict or relaxed access
Any sequence of relaxed accesses issued by a given thread in an execution may appear to be arbitrarily reordered relative to program order by the implementation. The only exception to this rule is that two relaxed accesses issued by a given thread to the same memory location where at least one is a write will always appear to all threads to have executed in program order.

Given that Clang UPC generates LLVM IR load/store instructions (which in turn refer to a reserved address space) to describe shared memory accesses, some mechanism is needed to ensure that the LLVM optimizer and code generator re-order those accesses only within the restrictions imposed by the UPC memory consistency model. There are a few implementation choices:

Add LLVM strict/relaxed attributes to the IR load/store operations. This option requires changes throughout the LLVM optimization passes wherever there is a possibility of IR load/store operations reordering.
Use different address spaces for relaxed and strict UPC operations. This is similar to the previous option except that no new LLVM IR attributes are added. The impact is also similar: checks have to be added in LLVM anywhere that re-ordering might occur.
Use LLVM atomics to provide the necessary ordering for UPC strict/relaxed IR load/store operations. This option is the most appealing as it does not require changes to the LLVM optimization passes. Specifying UPC remote accesses as atomic will ensure that LLVM does not re-order any UPC shared access with respect to other shared accesses or regular accesses. This may be overly restrictive, but would not preclude a UPC-specific optimization pass which implements re-orderings that conform to the UPC consistency model.

LLVM atomics (based on the C++11 standard) provide six levels of atomicity, in addition to regular non-atomic accesses, that are used to achieve a balance between performance and necessary access guarantees.

NotAtomic: Regular load and store.
Unordered: The lowest level of atomicity. Lock-free operation that guarantees somewhat sane results (instead of having undefined behavior).
Monotonic: Consistent ordering exists on all operations affecting a specific address. This corresponds to the C++0x/C1x memory_order_relaxed memory order.
Acquire: A barrier to acquire a lock to access other memory with normal loads and stores.
Release: Release lock barrier.
AcquireRelease: Provides both an Acquire and a Release barrier.
SequentiallyConsistent: Similar to AcquireRelease, but also guarantees the ordering among all operations with this attribute. This corresponds to the C++0x/C1x memory_order_seq_cst.

Given that the SequentiallyConsistent level guarantees the total ordering between all SequentiallyConsistent operations, it seems a suitable candidate for implementation of UPC strict memory operations. The Monotonic memory order can be used for relaxed UPC accesses as consistent ordering affecting the same address is enforced. Atomic load/store instructions must be used to enforce the necessary ordering of UPC shared accesses, because ordering attributes are only available on LLVM IR "load/store atomic" operations.

The following UPC memory access modes are mapped into LLVM atomic operation memory orderings:

UPC Operation	LLVM atomics
Relaxed Read	Atomic Monotonic Load
Relaxed Write	Atomic Monotonic Store
Strict Read	Atomic SequentiallyConsistent Load
Strict Write	Atomic SequentiallyConsistent Store

The following example shows an atomic store which asserts the SequentiallyConsistent ordering:

store atomic i32 %17, i32* %9 seq_cst, align 4

In this example an ordering keyword, "seq_cst", is used to label the desired ordering constraint (SequentiallyConsistent) on the LLVM "store atomic" IR instruction.
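
The same mapping can be sketched at the C level with C11 atomics, where memory_order_relaxed corresponds to the Monotonic ordering and memory_order_seq_cst to SequentiallyConsistent. This is only an illustration on ordinary local memory; the function names are hypothetical, and the code does not perform any remote access.

```c
#include <assert.h>
#include <stdatomic.h>

/* Relaxed UPC read: modeled as a Monotonic (memory_order_relaxed) atomic load. */
static int upc_model_relaxed_read(_Atomic int *p)
{
    return atomic_load_explicit(p, memory_order_relaxed);
}

/* Relaxed UPC write: modeled as a Monotonic atomic store. */
static void upc_model_relaxed_write(_Atomic int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_relaxed);
}

/* Strict UPC read: modeled as a SequentiallyConsistent atomic load. */
static int upc_model_strict_read(_Atomic int *p)
{
    return atomic_load_explicit(p, memory_order_seq_cst);
}

/* Strict UPC write: modeled as a SequentiallyConsistent atomic store. */
static void upc_model_strict_write(_Atomic int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_seq_cst);
}
```

When compiled with Clang, accesses like these are expected to appear in the IR as "load atomic" and "store atomic" instructions carrying the monotonic or seq_cst ordering keyword.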

Chapter 5. Storage Allocation

Clang UPC defines all UPC shared variables as LLVM globals located in a special data section (upc_shared). For example:

shared int var;
shared int var_a[4*THREADS];

generates the following LLVM code:

@var = global i32 0, section "upc_shared", align 4
@var_a = global [4 x i32] zeroinitializer, section "upc_shared",
         align 4

In the case where a UPC pointer-to-shared value is encoded as an LLVM native pointer, the address space attribute can also be used to encode the shared variable blocking factor (and element type if necessary). In this case the following IR code can be generated (assuming that bit 31 in the address space is used to differentiate shared and local space):

@var = addrspace(0x80000000) global i32 0, section "upc_shared",
       align 4
@var_a = addrspace(0x80000004) global [4 x i32] zeroinitializer,
       section "upc_shared", align 4
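
The address space values in the listing above follow from a simple bit layout, sketched here in C under the stated assumption that bit 31 marks the shared space and the low 31 bits hold the encoded blocking factor. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define AS_SHARED_FLAG UINT32_C(0x80000000)  /* bit 31: shared (vs. local) space */
#define AS_BLOCK_MASK  UINT32_C(0x7FFFFFFF)  /* bits 30..0: encoded blocking factor */

/* Hypothetical helper: build a shared address space number from a blocking factor. */
static uint32_t as_encode(uint32_t block_size)
{
    assert(block_size <= AS_BLOCK_MASK);  /* must fit in 31 bits */
    return AS_SHARED_FLAG | block_size;
}

/* Hypothetical helper: nonzero if the address space denotes UPC shared memory. */
static int as_is_shared(uint32_t address_space)
{
    return (address_space & AS_SHARED_FLAG) != 0;
}

/* Hypothetical helper: recover the blocking factor from the address space. */
static uint32_t as_block_size(uint32_t address_space)
{
    return address_space & AS_BLOCK_MASK;
}
```

Under this layout, as_encode(0) gives 0x80000000 and as_encode(4) gives 0x80000004, while address space 0 remains the ordinary local space.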

For the design alternative where a UPC pointer-to-shared value is described using attributes to provide the object’s UPC defined blocking factor and element size, the generated LLVM IR definitions might appear as follows:

@var = addrspace(0x80000000) global i32 0, section "upc_shared",
       align 4
@var_a = addrspace(0x80000000) blocksize(0) elemsize(4)
       global [4 x i32] zeroinitializer, section "upc_shared",
       align 4

Chapter 6. LLVM Optimizations

This section lists all of the available LLVM optimizations together with their potential impact on a UPC compiler implementation which uses LLVM load and store IR instructions to express UPC shared memory accesses.

Symbol	Description
Y	Optimization works
N	Optimization does not work
N/A	Optimization is not applicable
?	Impact is uncertain (needs further investigation)

Rmt Ptr	UPC Ptr	Optimization	Description
Y	Y	-aa-eval	Exhaustive Alias Analysis Precision Evaluator
Y	Y	-adce	Aggressive Dead Code Elimination
		-alloca-hoisting	Hoisting alloca instructions in non-entry blocks to the entry block
Y	Y	-always-inline	Inliner for always_inline functions
Y	Y	-argpromotion	Promote by reference arguments to scalars
? Y	Y	-asan	AddressSanitizer: detects use-after-free and out-of-bounds bugs.
? Y	Y	-asan-module	AddressSanitizer: detects use-after-free and out-of-bounds bugs.ModulePass
Y	Y	-basicaa	Basic Alias Analysis (stateless AA impl)
Y	Y	-basiccg	CallGraph Construction
Y	Y	-bb-vectorize	Basic-Block Vectorization
Y	Y	-block-freq	Block Frequency Analysis
? Y	Y	-bounds-checking	Run-time bounds checking
Y	Y	-branch-prob	Branch Probability Analysis
Y	Y	-break-crit-edges	Break critical edges in CFG
Y	Y	-codegenprepare	Optimize for code generation
Y	Y	-constmerge	Merge Duplicate Global Constants
Y	Y	-constprop	Simple constant propagation
?		-correlated-propagation	Value Propagation
Y	Y	-cost-model	Cost Model Analysis
Y	Y	-count-aa	Count Alias Analysis Query Responses
Y	Y	-da	Dependence Analysis
?		-datalayout	Data Layout
Y	Y	-dce	Dead Code Elimination
Y	Y	-deadargelim	Dead Argument Elimination
Y	Y	-deadarghaX0r	Dead Argument Hacking (BUGPOINT USE ONLY; DO NOT USE)
Y	Y	-debug-aa	AA use debugger
Y	Y	-debug-ir	Enable debugging IR
Y	Y	-delinearize	Delinearization
Y	Y	-dfsan	DataFlowSanitizer: dynamic data flow analysis.
Y	Y	-die	Dead Instruction Elimination
Y	Y	-domfrontier	Dominance Frontier Construction
Y	Y	-domtree	Dominator Tree Construction
N/A	N/A	-dot-callgraph	Print call graph to dot file
N/A	N/A	-dot-cfg	Print CFG of function to dot file
N/A	N/A	-dot-cfg-only	Print CFG of function to dot file (with no function bodies)
N/A	N/A	-dot-dom	Print dominance tree of function to dot file
N/A	N/A	-dot-dom-only	Print dominance tree of function to dot file (with no function bodies)
N/A	N/A	-dot-postdom	Print postdominance tree of function to dot file
N/A	N/A	-dot-postdom-only	Print postdominance tree of function to dot file (with no function bodies)
N/A	N/A	-dot-regions	Print regions of function to dot file
N/A	N/A	-dot-regions-only	Print regions of function to dot file (with no function bodies)
? Y	Y	-dse	Dead Store Elimination
?		-early-cse	Early CSE
Y	Y	-extract-blocks	Extract Basic Blocks From Module (for bugpoint use)
Y	Y	-functionattrs	Deduce function attributes
?		-generic-to-nvvm	Ensure that the global variables are in the global address space
Y	Y	-globaldce	Dead Global Elimination
Y	Y	-globalopt	Global Variable Optimizer
Y	Y	-globalsmodref-aa	Simple mod/ref analysis for globals
Y	Y	-gvn	Global Value Numbering
Y	Y	-indvars	Induction Variable Simplification
Y	Y	-inline	Function Integration/Inlining
Y	Y	-inline-cost	Inline Cost Analysis
Y	Y	-insert-gcov-profiling	Insert instrumentation for GCOV profiling
Y	Y	-instcombine	Combine redundant instructions
Y	Y	-instcount	Counts the various types of Instructions
Y	Y	-instnamer	Assign names to anonymous instructions
Y	Y	-instsimplify	Remove redundant instructions
Y	Y	-internalize	Internalize Global Symbols
Y	Y	-intervals	Interval Partition Construction
Y	Y	-ipconstprop	Interprocedural constant propagation
Y	Y	-ipsccp	Interprocedural Sparse Conditional Constant Propagation
Y	Y	-iv-users	Induction Variable Users
Y	Y	-jump-threading	Jump Threading
Y	Y	-lazy-value-info	Lazy Value Information Analysis
Y	Y	-lcssa	Loop-Closed SSA Form Pass
Y	Y	-libcall-aa	LibCall Alias Analysis
Y	Y	-licm	Loop Invariant Code Motion
Y	Y	-lint	Statically lint-checks LLVM IR
Y	Y	-loop-deletion	Delete dead loops
Y	Y	-loop-extract	Extract loops into new functions
Y	Y	-loop-extract-single	Extract at most one loop into a new function
Y	Y	-loop-idiom	Recognize loop idioms
Y	Y	-loop-instsimplify	Simplify instructions in loops
Y	Y	-loop-reduce	Loop Strength Reduction
Y	Y	-loop-reroll	Reroll loops
Y	Y	-loop-rotate	Rotate Loops
Y	Y	-loop-simplify	Canonicalize natural loops
Y	Y	-loop-unroll	Unroll loops
Y	Y	-loop-unswitch	Unswitch loops
Y	Y	-loop-vectorize	Loop Vectorization
Y	Y	-loops	Natural Loop Information
Y	Y	-lower-expect	Lower expect Intrinsics
Y	Y	-loweratomic	Lower atomic intrinsics to non-atomic form
Y	Y	-lowerinvoke	Lower invoke and unwind, for unwindless code generators
Y	Y	-lowerswitch	Lower SwitchInst’s to branches
Y	Y	-mem2reg	Promote Memory to Register
Y	Y	-memcpyopt	MemCpy Optimization
Y	Y	-memdep	Memory Dependence Analysis
Y	Y	-mergefunc	Merge Functions
Y	Y	-mergereturn	Unify function exit nodes
Y	Y	-metarenamer	Assign new names to everything
Y	Y	-module-debuginfo	Decodes module-level debug info
Y	Y	-msan	MemorySanitizer: detects uninitialized reads.
Y	Y	-no-aa	No Alias Analysis (always returns may alias)
Y	Y	-notti	No target information
Y	Y	-nvvm-reflect	Replace occurrences of __nvvm_reflect() calls with 0/1
N/A	N/A	-objc-arc	ObjC ARC optimization
N/A	N/A	-objc-arc-aa	ObjC-ARC-Based Alias Analysis
N/A	N/A	-objc-arc-apelim	ObjC ARC autorelease pool elimination
N/A	N/A	-objc-arc-contract	ObjC ARC contraction
N/A	N/A	-objc-arc-expand	ObjC ARC expansion
Y	Y	-partial-inliner	Partial Inliner
Y	Y	-partially-inline-libcalls	Partially inline calls to library functions
Y	Y	-postdomtree	Post-Dominator Tree Construction
Y	Y	-preverify	Preliminary module verification
N/A	N/A	-print-alias-sets	Alias Set Printer
N/A	N/A	-print-bb	Print BB to stderr
N/A	N/A	-print-callgraph	Print a call graph
N/A	N/A	-print-callgraph-sccs	Print SCCs of the Call Graph
N/A	N/A	-print-cfg-sccs	Print SCCs of each function CFG
N/A	N/A	-print-dom-info	Dominator Info Printer
N/A	N/A	-print-externalfnconstants	Print external fn callsites passed constants
N/A	N/A	-print-function	Print function to stderr
N/A	N/A	-print-memdeps	Print MemDeps of function
N/A	N/A	-print-module	Print module to stderr
N/A	N/A	-print-used-types	Find Used Types
Y	Y	-prune-eh	Remove unused exception handling info
Y	Y	-reassociate	Reassociate expressions
Y	Y	-reg2mem	Demote all values to stack slots
Y	Y	-regions	Detect single entry single exit regions
Y	Y	-sample-profile	Sample Profile loader
Y	Y	-scalar-evolution	Scalar Evolution Analysis
Y	Y	-scalarrepl	Scalar Replacement of Aggregates (DT)
Y	Y	-scalarrepl-ssa	Scalar Replacement of Aggregates (SSAUp)
Y	Y	-sccp	Sparse Conditional Constant Propagation
Y	Y	-scev-aa	ScalarEvolution-based Alias Analysis
Y	Y	-simplifycfg	Simplify the CFG
Y	Y	-sink	Code sinking
Y	Y	-slp-vectorizer	SLP Vectorizer
Y	Y	-sroa	Scalar Replacement Of Aggregates
N/A	N/A	-strip	Strip all symbols from a module
N/A	N/A	-strip-dead-debug-info	Strip debug info for unused symbols
N/A	N/A	-strip-dead-prototypes	Strip Unused Function Prototypes
N/A	N/A	-strip-debug-declare	Strip all llvm.dbg.declare intrinsics
N/A	N/A	-strip-nondebug	Strip all symbols, except dbg symbols, from a module
Y	Y	-structurizecfg	Structurize the CFG
Y	Y	-tailcallelim	Tail Call Elimination
Y	Y	-targetlibinfo	Target Library Information
? Y	Y	-tbaa	Type-Based Alias Analysis
Y	Y	-tsan	ThreadSanitizer: detects data races.
N/A	N/A	-verify	Module Verifier
N/A	N/A	-view-callgraph	View call graph
N/A	N/A	-view-cfg	View CFG of function
N/A	N/A	-view-cfg-only	View CFG of function (with no function bodies)
N/A	N/A	-view-dom	View dominance tree of function
N/A	N/A	-view-dom-only	View dominance tree of function (with no function bodies)
N/A	N/A	-view-postdom	View postdominance tree of function
N/A	N/A	-view-postdom-only	View postdominance tree of function (with no function bodies)
N/A	N/A	-view-regions	View regions of function
N/A	N/A	-view-regions-only	View regions of function (with no function bodies)

The above list of LLVM 3.4 optimization passes was gathered by invoking the LLVM optimizer with the -help switch.

--- opt -help ---

Chapter 7. More Info

LLVM Language Reference Manual
LLVM Atomic Instructions and Concurrency Guide