Clang UPC - UPC pointer-to-shared in LLVM IR


Table of Contents

1. Authors and Revision Information
2. Introduction
3. The UPC Pointer-to-shared LLVM IR Representation
3.1. Approach 1 - IR Remote Access Pointer
3.1.1. Required Changes
3.1.2. Advantages
3.1.3. Disadvantages
3.2. Approach 2 - IR UPC pointer-to-shared
3.2.1. Required Changes
3.2.2. Advantages
3.2.3. Disadvantages
4. Load/Store Operations
5. Storage Allocation
6. LLVM Optimizations
7. More Info

Chapter 1. Authors and Revision Information

Authors: Steven Watanabe Gary Funck Nenad Vukicevic

Intrepid Technology, Inc.

http://www.intrepid.com

Chapter 2. Introduction

The current implementation of Clang UPC (a UPC language compiler and runtime built using the Clang/LLVM infrastructure - https://github.com/Intrepid/clang-upc/) implements UPC language semantics with only minimal changes of the LLVM core infrastructure.

All accesses to UPC shared data are lowered into calls to access function which are passed to the LLVM (via LLVM IR - LLVM assembly language - http://llvm.org/docs/LangRef.html#abstract). As LLVM optimizations are done on the LLVM IR level, some opportunities for optimization will be missed because some optimizations work directly on memory reference; a call to a runtime procedure will mask the underlying memory reference.

The goal of this document is to describe approaches that would allow UPC pointers-to-shared (or some other remote pointer) to be expressed in the LLVM IR and to in turn gain the benefits of LLVM optimizations that operate on memory references.

Note

Creation of a new (wide - e.g. 128 bit) remote IR pointer is not considered by this document as it is outside of the scope for this project. In addition, 64 bit processor architectures are the focus of this design.

Chapter 3. The UPC Pointer-to-shared LLVM IR Representation

Two approaches for expressing UPC pointers-to-shared in the LLVM IR are described in this document. Both of them use the LLVM IR native pointer as a container for all the necessary UPC pointer-to-shared data and attributes. Given that the LLVM IR load/store operations will be used to load/store remote data, all of the LLVM optimizations should work as is. After the LLVM optimizations and transformations, the LLVM IR loads and stores using the IR UPC container must be converted back into the appropriate run-time calls.

The main difference between these two approaches is in how the UPC pointer-to-shared arithmetic is expressed.

LLVM IR Generic Remote Access Pointer
A remote thread (node, rank) and an offset into the remote shared space is combined into a single LLVM IR pointer. UPC pointer-to-shared arithmetic is performed inside the Clang (C front end), the same way as in the current Clang UPC.
LLVM IR UPC-Aware Remote Access Pointer
The UPC pointer-to-shared representation, which includes thread, phase, and an offset into the remote shared space, is placed in a single LLVM IR pointer. UPC pointer-to-shared arithmetic is performed inside the LLVM back end.

3.1. Approach 1 - IR Remote Access Pointer

In this approach, the LLVM assembly language (IR) has no knowledge of the UPC language and its semantics (shared variables, shared array layout, ,,,). Instead, a unique LLVM address space is used to label LLVM pointers that contain a remote pointer: (1) thread (rank, node) number, (b) offset in the thread’s shared address space. They are referred to as IR remote access pointers in the rest of the document.

64 bits provides enough bits to encode a large shared space on the system with large number of threads. For example, a configuration that allocates space for 20 bits of the threads number provides the following:

+----------+-----------------------+
|  Thread  |      Offset           |
+----------+-----------------------+
 6        4 4                     0
 3        4 3

Thread number - 20 bits - 1,048,576 threads
Offset - 44 bits - 16,384 Tb

The main implementation points of this approach:

  • All the UPC shared memory accesses are translated into IR remote pointer loads and stores (instead of run-time calls) in the Clang front end. On the IR level, remote pointers look the same as local pointers, except that they refer to a separate (UPC specific) address space. Clang and LLVM allow for data to be located in different address spaces and the following example in C language illustrates this:
C source:
int __attribute__((address_space(16))) *ptr

Generated LLVM IR:
@ptr = common global i32 addrspace(16)* null, align 8
  • An extra pass will be added to LLVM which transforms remote IR pointers into calls to UPC runtime access routines.

This approach is similar to the approach taken by the Chapel compiler experimental LLVM optimization project.

3.1.1. Required Changes

The following changes to Clang UPC and LLVM are required:

  • Clang

    1. Generate IR remote pointer loads and stores instead of calls to the runtime access routines
  • LLVM

    1. An IR transforming pass (after all optimizations) is required to transform IR remote pointers into appropriate UPC access routines.

3.1.2. Advantages

This approach has the following advantages:

  • Both packed and struct pointer-to-shared representations are equally capable of creating an IR remote access pointer as there are plenty of bits in the 64 bit container to support large systems.
  • No major changes in the existing LLVM infrastructure are required to support this approach.
  • All LLVM optimizations should work as is.

3.1.3. Disadvantages

  • Clang and LLVM configuration process must be changed for proper configuration of the IR remote pointer layout.
  • Clang UPC can produce an overly complicated LLVM IR code caused by the pointer-to-shared arithmetic.

3.2. Approach 2 - IR UPC pointer-to-shared

This approach extends the LLVM implementation of pointer-related operations so that they understand UPC pointer-to-shared arithmetic and comparison operations. These extensions require that UPC pointer-to-shared arithmetic is expressed directly in the LLVM assembly (IR). The following UPC related pointer information needs to be passed in the LLVM IR:

  • UPC pointer-to-shared value (inclusive of thread, offset, and phase)
  • UPC pointer-to-shared block size
  • UPC pointer-to-shared element size

In order to use the already existing LLVM infrastructure this information can be passed using one of the following methods:

  • The UPC pointer-to-shared value (packed UPC pointer-to-shared representation only) is passed inside an LLVM IR native pointer container.
  • The address space attribute is used to label UPC pointers-to-shared (only one bit is required). Address space 0 (default) is the local memory space.
  • The address space attribute is used to pass the UPC pointer-to-shared block size (31 bits provide 2GB of the block size).
  • The LLVM element type is used to gather the size of the element that the UPC pointer-to-shared refers to.

There is a certain risk to this approach as it is not clear whether there are situations where an optimization pass might convert LLVM pointers to "void *" pointers; this would remove the needed element size of the UPC pointer. For correct operation, any LLVM instruction that converts a UPC pointer-to-shared to another pointer type must preserve the blocking factor and element size information.

3.2.1. Required Changes

Expressing operations on UPC pointers-to-shared directly in thee LLVM intermediate representation (IR) will require the following changes to the Clang UPC compiler:

  • The Clang UPC front-end will no longer generate the low-level instructions required to implement UPC pointer-to-shared arithmetic. Instead, the Clang UPC front-end will generate higher level IR instructions which express pointer-to-shared arithmetic operations in a single IR instruction.
  • UPC pointers-to-shared types will be expressed as an LLVM pointer type that has a special address space designation and an encoding of the UPC blocking factor.
  • The LLVM configuration script must be extended to accept arguments which describe the UPC pointer-to-shared representation.
  • LLVM pointer arithmetic must be extended to provide for UPC pointer-to-shared arithmetic. For this purpose, adding new intrinsics functions (versus adding new LLVM instructions) is likely the best approach because it limits the impact LLVM. (See: http://llvm.org/docs/ExtendingLLVM.html).
  • A new LLVM IR pass will be developed that transforms UPC pointers-to-shared into appropriate UPC runtime calls.

Alternatively, a UPC blocking factor attribute and an element size attribute could be added to the LLVM IR pointer definition. For example:

@ptr = common global i32 blocksize(1024) elemsize(16)
       addrspace(16)* null, align 8

This is probably a cleaner solution but it may have a wide impact within the LLVM IR infrastructure. This impact may be too high to justify the addition of new attributes.

3.2.2. Advantages

Adding the ability to describe UPC pointer-to-shared types and operations on those types directly in the LLVM IF has several advantages:

  • UPC pointers-to-shared values are expressed pointers in LLVM, thus allowing the possibility of applying LLVM pointer-related optimizations
  • The generated IR is easier to read and understand.

3.2.3. Disadvantages

There are some disadvantages and risks associated with this approach:

  • The address space qualifier has only 20 bits of ragne and must be extended if it is used to encode the UPC blocking factor. At this time it is possible to extend it to 50 bits, but this might conflict with future Clang/LLVM changes which add more bits into the record field which records the address space qualifier.
  • Encoding the block size in the address space qualifier and relying on the LLVM element type size has some risks. The alternative of encoding both of them in the address space attribute (32 bits) is not practical as it would severally affect the UPC usability.

Chapter 4. Load/Store Operations

Both approaches outlined above rely on the fact that UPC pointer-to-shared values (remote access pointers) are placed inside an LLVM IR native pointer container and marked with the special address space attribute. For example, loading a value from a variable in a specified address space looks this:

@ptr = common global i32 addrspace(16)* null, align 8
@var_local = common global i32 0, align 4
[...]
  %0 = load i32 addrspace(16)** @ptr, align 8
  %1 = load i32 addrspace(16)* %0, align 4
  store i32 %1, i32* @var_local, align 4
[...]

The UPC memory consistency model defines two memory modes, 'strict' and 'relaxed'. As the names suggest, program behavior under strict consistency is more constrained than that under relaxed consistency, as memory accesses on each thread must preserve the program order. The default UPC consistency model is 'relaxed'.

The UPC memory consistency model is summarized below:

  1. Strict accesses always appear (to all threads) to have executed in program order with respect to other strict accesses:

    1. all relaxed accesses must complete before any strict access
    2. all strict accesses must complete before any other strict or relaxed access
  2. Any sequence of relaxed accesses issued by a given thread in an execution may appear to be arbitrarily reordered relative to program order by the implementation. The only exception to this rule is that two relaxed accesses issued by a given thread to the same memory location where at least one is a write will always appear to all threads to have executed in program order.

Given that Clang UPC generates LLVM IR load/store instructions (which in turn refer to a reserved address space) to describe shared memory accesses, some mechanism for ensuring that the accesses are re-ordered by the LLVM optimizer and code generator according to the restrictions imposed by the UPC memory consistency model. There are a few implementation choices:

  1. Add LLVM strict/relaxed attributes to the IR load/store operations. This option requires changes throughout the LLVM optimization passes wherever there is a possibility of IR load/store operations reordering.
  2. Use different address spaces for relaxed and strict UPC operations. This is similar to the previous option except for adding new LLVM IR attributes. The impact is similar also: checks have to be added in LLVM anywere that re-ordering might occur.
  3. Use LLVM atomics to provide the necessary ordering for UPC strict/relaxed IR load/store operations. This option is the most appealing as it does not require changes to the LLVM optimization passes. Specifying UPC remote accesses as atomic will ensure that LLVM does not re-order any UPC shared accesses with both other shared access and regular accesses. This may be overly restrictive, but would not preclude a UPC-specific optimization pass which implements re-orderings that conform to the UPC consistency model.

LLVM atomics (based on the C++11 standard) provides six levels of atomicity that are used to achieve a balance between performance and necessary access guarantees.

NonAtomics
Regular load and store.
Unordered
The lowest level of atomicity. Lock-free operation that guarantees somewhat sane results (instead of having undefined behavior).
Monotonic
Consistent ordering exists on all operations affecting a specific address. This corresponds to the C++0x/C1x memory_order_relaxed memory order.
Acquire
A barrier to acquire a lock to access other memory with normal loads and stores.
Release
Release lock barrier.
AcquireRelease
Provides both an Acquire and a Release barrier.
SequentiallyConsistent
Similar to AcquireRelease, but also guarantees the ordering among all operations with this attribute. This corresponds to the C++0x/C1x memory_order_seq_cst.

Given that the SequentiallyConsistent level guarantees the total ordering between all SequentiallyConsistent operations, it seems a suitable candidate for implementation of UPC strict memory operations. The Monotonic memory order can be used for relaxed UPC accesses as consistent ordering affecting the same address is enforced. Atomic load/store instructions must be used to to enforce the necessary ordering of UPC shared accesses, because ordering attributes are only available on LLVM IR "load/store atomic" operations.

The following UPC memory access modes are mapped into LLVM atomic operation memory orderings:

UPC Operation LLVM atomics

Relaxes Read

Atomic Monotonic Load

Relaxed Write

Atomic Monotonic Store

Strict Read

Atomic SequentiallyConsistent Load

Strict Write

Atomic SequentiallyConsistent Store

The following example shows an atomic store which asserts the SequentiallyConsistent ordering:

store atomic i32 %17, i32* %9 seq_cst, align 4

In this example an ordering keyword, "seq_cst", is used to label the desired ordering constraint (SequentiallyConsistent) on the LLVM "store atomic" IR instruction.

Chapter 5. Storage Allocation

Clang UPC defines all UPC shared variables as LLVM globals located in a special data section (upc_shared). For example:

shared int var;
shared int var_a[4*THREADS];

generates the following LLVM code:

@var = global i32 0, section "upc_shared", align 4
@var_a = global [4 x i32] zeroinitializer, section "upc_shared",
         align 4

In the case where a UPC pointer-to-shared value is encoded as an LLVM native pointer, the address space attribute can also be used to encode the shared variable blocking factor (and element type if necessary). In this case the following IR code can be generated (assuming that bit 31 in the address space is used differentiate shared and local space):

@var = addrspace(0x80000000) global i32 0, section "upc_shared",
       align 4
@var_a = addrspace(0x80000004) global [4 x i32] zeroinitializer,
       section "upc_shared", align 4

For the design alternative where a UPC pointer-to-shared value is described using attributes to provide the object’s UPC defined blocking factor and element size, the generated LLVM IR definitions might appear as follows:

@var = addrspace(0x80000000) global i32 0, section "upc_shared",
       align 4
@var_a = addrspace(0x80000000) blocksize(0) elemsize(4)
       global [4 x i32] zeroinitializer, section "upc_shared",
       align 4

Chapter 6. LLVM Optimizations

This section lists all of the available LLVM optimizations together with their potential impact on a UPC compiler implementation which uses LLVM load ans store IR instructions to express UPC shared memory accesses.

Symbol Description

Y

Optimization works

N

Optimization does not work

N/A

Optimization is not applicable

Rmt Ptr UPC Ptr Optimization Description

Y

Y

-aa-eval

Exhaustive Alias Analysis Precision Evaluator

Y

Y

-adce

Aggressive Dead Code Elimination

-alloca-hoisting

Hoisting alloca instructions in non-entry blocks to the entry block

Y

Y

-always-inline

Inliner for always_inline functions

Y

Y

-argpromotion

Promote by reference arguments to scalars

? Y

Y

-asan

AddressSanitizer: detects use-after-free and out-of-bounds bugs.

? Y

Y

-asan-module

AddressSanitizer: detects use-after-free and out-of-bounds bugs.ModulePass

Y

Y

-basicaa

Basic Alias Analysis (stateless AA impl)

Y

Y

-basiccg

CallGraph Construction

Y

Y

-bb-vectorize

Basic-Block Vectorization

Y

Y

-block-freq

Block Frequency Analysis

? Y

Y

-bounds-checking

Run-time bounds checking

Y

Y

-branch-prob

Branch Probability Analysis

Y

Y

-break-crit-edges

Break critical edges in CFG

Y

Y

-codegenprepare

Optimize for code generation

Y

Y

-constmerge

Merge Duplicate Global Constants

Y

Y

-constprop

Simple constant propagation

?

-correlated-propagation

Value Propagation

Y

Y

-cost-model

Cost Model Analysis

Y

Y

-count-aa

Count Alias Analysis Query Responses

Y

Y

-da

Dependence Analysis

?

-datalayout

Data Layout

Y

Y

-dce

Dead Code Elimination

Y

Y

-deadargelim

Dead Argument Elimination

Y

Y

-deadarghaX0r

Dead Argument Hacking (BUGPOINT USE ONLY; DO NOT USE)

Y

Y

-debug-aa

AA use debugger

Y

Y

-debug-ir

Enable debugging IR

Y

Y

-delinearize

Delinearization

Y

Y

-dfsan

DataFlowSanitizer: dynamic data flow analysis.

Y

Y

-die

Dead Instruction Elimination

Y

Y

-domfrontier

Dominance Frontier Construction

Y

Y

-domtree

Dominator Tree Construction

N/A

N/A

-dot-callgraph

Print call graph to dot file

N/A

N/A

-dot-cfg

Print CFG of function to dot file

N/A

N/A

-dot-cfg-only

Print CFG of function to dot file (with no function bodies)

N/A

N/A

-dot-dom

Print dominance tree of function to dot file

N/A

N/A

-dot-dom-only

Print dominance tree of function to dot file (with no function bodies)

N/A

N/A

-dot-postdom

Print postdominance tree of function to dot file

N/A

N/A

-dot-postdom-only

Print postdominance tree of function to dot file (with no function bodies)

N/A

N/A

-dot-regions

Print regions of function to dot file

N/A

N/A

-dot-regions-only

Print regions of function to dot file (with no function bodies)

? Y

Y

-dse

Dead Store Elimination

?

-early-cse

Early CSE

Y

Y

-extract-blocks

Extract Basic Blocks From Module (for bugpoint use)

Y

Y

-functionattrs

Deduce function attributes

?

-generic-to-nvvm

Ensure that the global variables are in the global address space

Y

Y

-globaldce

Dead Global Elimination

Y

Y

-globalopt

Global Variable Optimizer

Y

Y

-globalsmodref-aa

Simple mod/ref analysis for globals

Y

Y

-gvn

Global Value Numbering

Y

Y

-indvars

Induction Variable Simplification

Y

Y

-inline

Function Integration/Inlining

Y

Y

-inline-cost

Inline Cost Analysis

Y

Y

-insert-gcov-profiling

Insert instrumentation for GCOV profiling

Y

Y

-instcombine

Combine redundant instructions

Y

Y

-instcount

Counts the various types of Instructions

Y

Y

-instnamer

Assign names to anonymous instructions

Y

Y

-instsimplify

Remove redundant instructions

Y

Y

-internalize

Internalize Global Symbols

Y

Y

-intervals

Interval Partition Construction

Y

Y

-ipconstprop

Interprocedural constant propagation

Y

Y

-ipsccp

Interprocedural Sparse Conditional Constant Propagation

Y

Y

-iv-users

Induction Variable Users

Y

Y

-jump-threading

Jump Threading

Y

Y

-lazy-value-info

Lazy Value Information Analysis

Y

Y

-lcssa

Loop-Closed SSA Form Pass

Y

Y

-libcall-aa

LibCall Alias Analysis

Y

Y

-licm

Loop Invariant Code Motion

Y

Y

-lint

Statically lint-checks LLVM IR

Y

Y

-loop-deletion

Delete dead loops

Y

Y

-loop-extract

Extract loops into new functions

Y

Y

-loop-extract-single

Extract at most one loop into a new function

Y

Y

-loop-idiom

Recognize loop idioms

Y

Y

-loop-instsimplify

Simplify instructions in loops

Y

Y

-loop-reduce

Loop Strength Reduction

Y

Y

-loop-reroll

Reroll loops

Y

Y

-loop-rotate

Rotate Loops

Y

Y

-loop-simplify

Canonicalize natural loops

Y

Y

-loop-unroll

Unroll loops

Y

Y

-loop-unswitch

Unswitch loops

Y

Y

-loop-vectorize

Loop Vectorization

Y

Y

-loops

Natural Loop Information

Y

Y

-lower-expect

Lower expect Intrinsics

Y

Y

-loweratomic

Lower atomic intrinsics to non-atomic form

Y

Y

-lowerinvoke

Lower invoke and unwind, for unwindless code generators

Y

Y

-lowerswitch

Lower SwitchInst’s to branches

Y

Y

-mem2reg

Promote Memory to Register

Y

Y

-memcpyopt

MemCpy Optimization

Y

Y

-memdep

Memory Dependence Analysis

Y

Y

-mergefunc

Merge Functions

Y

Y

-mergereturn

Unify function exit nodes

Y

Y

-metarenamer

Assign new names to everything

Y

Y

-module-debuginfo

Decodes module-level debug info

Y

Y

-msan

MemorySanitizer: detects uninitialized reads.

Y

Y

-no-aa

No Alias Analysis (always returns may alias)

Y

Y

-notti

No target information

Y

Y

-nvvm-reflect

Replace occurences of __nvvm_reflect() calls with 0/1

N/A

N/A

-objc-arc

ObjC ARC optimization

N/A

N/A

-objc-arc-aa

ObjC-ARC-Based Alias Analysis

N/A

N/A

-objc-arc-apelim

ObjC ARC autorelease pool elimination

N/A

N/A

-objc-arc-contract

ObjC ARC contraction

N/A

N/A

-objc-arc-expand

ObjC ARC expansion

Y

Y

-partial-inliner

Partial Inliner

Y

Y

-partially-inline-libcalls

Partially inline calls to library functions

Y

Y

-postdomtree

Post-Dominator Tree Construction

Y

Y

-preverify

Preliminary module verification

N/A

N/A

-print-alias-sets

Alias Set Printer

N/A

N/A

-print-bb

Print BB to stderr

N/A

N/A

-print-callgraph

Print a call graph

N/A

N/A

-print-callgraph-sccs

Print SCCs of the Call Graph

N/A

N/A

-print-cfg-sccs

Print SCCs of each function CFG

N/A

N/A

-print-dom-info

Dominator Info Printer

N/A

N/A

-print-externalfnconstants

Print external fn callsites passed constants

N/A

N/A

-print-function

Print function to stderr

N/A

N/A

-print-memdeps

Print MemDeps of function

N/A

N/A

-print-module

Print module to stderr

N/A

N/A

-print-used-types

Find Used Types

Y

Y

-prune-eh

Remove unused exception handling info

Y

Y

-reassociate

Reassociate expressions

Y

Y

-reg2mem

Demote all values to stack slots

Y

Y

-regions

Detect single entry single exit regions

Y

Y

-sample-profile

Sample Profile loader

Y

Y

-scalar-evolution

Scalar Evolution Analysis

Y

Y

-scalarrepl

Scalar Replacement of Aggregates (DT)

Y

Y

-scalarrepl-ssa

Scalar Replacement of Aggregates (SSAUp)

Y

Y

-sccp

Sparse Conditional Constant Propagation

Y

Y

-scev-aa

ScalarEvolution-based Alias Analysis

Y

Y

-simplifycfg

Simplify the CFG

Y

Y

-sink

Code sinking

Y

Y

-slp-vectorizer

SLP Vectorizer

Y

Y

-sroa

Scalar Replacement Of Aggregates

N/A

N/A

-strip

Strip all symbols from a module

N/A

N/A

-strip-dead-debug-info

Strip debug info for unused symbols

N/A

N/A

-strip-dead-prototypes

Strip Unused Function Prototypes

N/A

N/A

-strip-debug-declare

Strip all llvm.dbg.declare intrinsics

N/A

N/A

-strip-nondebug

Strip all symbols, except dbg symbols, from a module

Y

Y

-structurizecfg

Structurize the CFG

Y

Y

-tailcallelim

Tail Call Elimination

Y

Y

-targetlibinfo

Target Library Information

? Y

Y

-tbaa

Type-Based Alias Analysis

Y

Y

-tsan

ThreadSanitizer: detects data races.

N/A

N/A

-verify

Module Verifier

N/A

N/A

-view-callgraph

View call graph

N/A

N/A

-view-cfg

View CFG of function

N/A

N/A

-view-cfg-only

View CFG of function (with no function bodies)

N/A

N/A

-view-dom

View dominance tree of function

N/A

N/A

-view-dom-only

View dominance tree of function (with no function bodies)

N/A

N/A

-view-postdom

View postdominance tree of function

N/A

N/A

-view-postdom-only

View postdominance tree of function (with no function bodies)

N/A

N/A

-view-regions

View regions of function

N/A

N/A

-view-regions-only

View regions of function (with no function bodies)

The above list of LLVM 3.4 optimization passes was gathered by invoking the LLVM optimizer with the -help switch.

--- opt -help ---

Chapter 7. More Info

LLVM Language Reference Manual LLVM Atomic Instructions and Concurrency Guide