VectorVisor: A Binary Translation Scheme for Throughput-Oriented GPU Acceleration

Samuel Ginzburg  Mohammad Shahrad†  Michael J. Freedman

Princeton University  †University of British Columbia

Abstract

Beyond conventional graphics applications, general-purpose GPU acceleration has had significant impact on machine learning and scientific computing workloads. Yet, it has failed to see widespread use for server-side applications, which we argue is because GPU programming models offer a level of abstraction that is either too low-level (e.g., OpenCL, CUDA) or too high-level (e.g., TensorFlow, Halide), depending on the language. Not all applications fit into either category, resulting in lost opportunities for GPU acceleration.

We introduce VectorVisor, a vectorized binary translator that enables new opportunities for GPU acceleration by introducing a novel programming model for GPUs. With VectorVisor, many copies of the same server-side application are run concurrently on the GPU, where VectorVisor mimics the abstractions provided by CPU threads. To achieve this goal, we demonstrate how to (i) provide cross-platform support for system calls and recursion using continuations and (ii) make full use of the excess register file capacity and high memory bandwidth of GPUs. We then demonstrate that our binary translator is able to transparently accelerate certain classes of compute-bound workloads, gaining significant improvements in throughput-per-dollar of up to $2.9 \times$ compared to Intel x86-64 VMs in the cloud, and in some cases match the throughput-per-dollar of native CUDA baselines.

1 Introduction

Server-side GPU acceleration has become ubiquitous, with all major cloud providers offering virtual machine instances with attached GPUs. GPU workloads such as graphics and machine learning have found widespread adoption due to the superior throughput-per-dollar that GPUs offer.

Typical approaches to accelerating these workloads on GPUs use domain-specific programming languages (DSLs). DSLs for GPUs heavily restrict which abstractions can be used by developers to write applications, and in particular force them to use parallel abstractions. For example, machine learning programming systems such as TensorFlow [24] require users to specify programs as a series of operations performed on n-dimensional arrays. Approaches to extracting parallelism on GPUs for graphical workloads such as Halide [73] enforce more extreme restrictions such as requiring developers to express image operations as pure mathematical functions, defining the value of each function at each point. Other DSLs targeting batch dataflow workloads require developers to express their program using built-in parallel functions, which impose additional restrictions on application logic [75].

Developers who cannot express their application logic using these restricted abstractions are stuck manually rewriting applications in OpenCL or CUDA, which expose a low-level programming interface. Complex programs that use large pre-existing libraries, or where extracting parallelism is difficult, can be time-consuming to write and require drastic modifications to run using GPUs.

In this paper, we explore the feasibility of an alternative programming model for GPUs—where we take existing single-threaded programs and execute many copies of them using GPU threads. Each GPU thread corresponds to an emulated CPU thread, running a single instance of the program. Unlike prior approaches [37] which utilize interpretation, we translate the input program to native GPU code, substantially boosting performance while enabling a wider variety of target languages and runtimes. Unlike OpenCL or CUDA programs, we provide support for system calls and a CPU-like flat memory model. While less efficient than manual translation, this approach substantially reduces the barrier to accelerating throughput-oriented workloads using GPUs, ultimately improving the throughput and cost efficiency of applications that would otherwise run on CPUs.

Many applications written to run on CPUs are single-threaded programs, often implemented using high-level programming languages with large imported libraries. Without modification, these applications do not map cleanly to existing GPU programming models (e.g., those using language-level parallel functions such as in TensorFlow or Halide). Instead, these workloads process requests independently, with no inter-request synchronization or communication. Examples of these workloads include cryptographic operations, image manipulation, and compression. These workloads are generally amenable to GPU acceleration [60, 67, 73], but are frequently run on CPUs instead.

Effectively enabling this programming model requires one to overcome several technical challenges in dealing with the substantial differences between GPUs and CPUs. These differences include how programs are executed (e.g., GPU kernels run to completion without preemption) as well as a lack of support in GPUs for system calls. Further, failing to take differences in GPU memory hierarchies into account can result in an order of magnitude decrease in read and write performance. Prior approaches to running unmodified programs on GPUs suffer from poor performance due to the overheads of interpretation [37] as well as compatibility issues such as the lack of support for system calls [45].

To explore this unique programming model for GPUs, we built VectorVisor, a system which utilizes a vectorizing binary translator for GPUs. VectorVisor is designed to accelerate unmodified programs that were written to run on CPUs but can benefit from GPU acceleration. Target programs are automatically translated to run on GPUs efficiently, eliminating the need for complex manual translation. In particular, VectorVisor uses WebAssembly [54] as the intermediate binary format, which enables secure, fast, and efficient compilation for a wide range of applications.

We overcome the differences in program execution and memory hierarchy by translating WebAssembly programs to run directly on the GPU as opposed to using interpretation. We show that the remaining differences between CPUs and GPUs can be bridged with a combination of three techniques:

**Continuations:** CUDA and OpenCL do not provide support for preempting running applications in addition to lacking support for system calls. Without preemption, we cannot dispatch system calls, making it impossible to run complex and unmodified programs. To bypass this issue, we implement continuations for OpenCL C. Continuations are language-level primitives that allow us to save the program state at arbitrary locations, and then resume execution at a later time. Doing so allows us to pause and resume running GPU kernels, and to provide support for system calls. We also benefit from the portability of our approach—enabling VectorVisor to be run with multiple GPU vendors (e.g., NVIDIA, AMD).

**WebAssembly:** WebAssembly (WASM) binaries are designed with performance, portability, and security in mind. Many popular languages can compile to WASM (e.g., Rust, Go, C, C++, AssemblyScript, and more), making it an ideal intermediate format. WASM binaries are designed with runtime JIT compilation in mind, persisting vital information not present in x86 binaries. WASM semantics provide VectorVisor with memory alignment information, register allocation hints, type-checks on operations, and language-enforced structured control-flow. We heavily utilize this information to deal with challenges such as efficiently making use of the substantially larger per-thread [14] register space on GPUs—which is crucial for maximizing performance. Other important performance optimizations are also enabled through this compile-time information.

**Memory Interleaving:** GPUs organize threads in warps, or groups of threads. Each thread in a warp has a numerical index, and threads with adjacent indices must access adjacent bytes for optimal performance—so that memory accesses can be coalesced together. Coalesced memory accesses enable GPUs to maximize memory bandwidth usage at the cost of a more complex programming model. To bridge the differences between the GPU and CPU memory hierarchies, we automatically interleave the memory of each virtual machine running on the GPU to transparently coalesce all memory accesses.

We demonstrate VectorVisor’s capabilities to accelerate several unmodified, third-party applications which use popular open-source libraries. We then evaluate VectorVisor’s efficacy using nine benchmarks with throughput-per-dollar as our primary metric. Selected benchmarks include multiple classes of workloads, some of which reflect ideal applications of VectorVisor, with others reflecting the limitations of our programming model. Comparisons against native x86-64 and WebAssembly versions of each benchmark are provided, showing that VectorVisor can achieve superior throughput-per-dollar. We also provide native CUDA versions of two benchmarks to evaluate the efficacy of our translation. Our paper makes the following contributions:

1. We introduce a novel cross-platform approach to running lightweight virtual machines using GPUs, where VMs securely execute native code to maximize performance and support multiple high-level languages.
2. We show that support for system calls can be efficiently provided using continuations in addition to supporting recursion and indirect calls in OpenCL.
3. We demonstrate that we can emulate a flat memory model using an efficient memory interleave, enabling existing programs to leverage the high memory bandwidth of GPUs.
4. We explore the implications of batch size, latency, and throughput on VectorVisor’s programming model and discuss which categories of workloads are optimal for it.
5. We discuss the limitations of our system and optimal GPU configurations for it.

### 2 Motivation and Challenges

The past several years have shown a large increase in the availability of cloud accessible GPUs. GPUs that cost thousands of dollars are now available at affordable prices per hour. Developers can quickly test if accelerating their program using a GPU is cost-effective without large up-front investments in GPU hardware. However, despite having strong parallel processing power and cloud availability, GPUs are not often used for running high concurrency server-side applications.

Translating programs originally intended to execute on CPUs to run on GPUs is difficult due to the substantial differences between the execution models and memory hierarchies. Today’s approaches to tackling these issues either require strong language-level restrictions with unintuitive stumbling blocks for developers, or slower automated approaches such as interpretation [37].

2.1 Execution Model Differences

Taking advantage of the throughput that GPUs offer requires using a different execution model than CPUs offer. GPUs feature restrictions on both the application runtimes and control flow that limit the set of possible workloads.

Runtime Limitations: CUDA and OpenCL are the two most popular compute APIs available for GPUs, and they share a near identical programming model. Programs that run on GPUs (GPU kernels) are submitted and execute until completion without preemption. High-level languages targeting general-purpose GPU programming such as CUDA C++ and OpenCL C feature restrictions on the usage of standard libraries, recursion, indirect function calls, variable length arrays, virtual functions, and templates [12, 16]. Support for other common features such as system calls and preemption is absent, further restricting the set of programs that can run.

Divergence: Unlike CPUs, which allow for different hardware threads to execute different instructions, GPUs organize threads into groups of threads (warps). Each warp shares a program counter, so all threads execute the same instruction on each clock cycle. Support for conditional branching is provided by executing no-ops for threads that have diverged, while the remaining threads block on threads executing the conditional branch. This results in the serialized execution of branches. Programs with substantial divergence are not able to efficiently use GPU resources as a result [33, 43, 51].
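
The serialization described above can be illustrated with a toy cost model. This is only a sketch: the 32-thread warp size and the per-branch costs are assumptions chosen for illustration, and `warp_steps` is a hypothetical helper, not part of any GPU API.

```python
WARP_SIZE = 32  # assumption: 32 threads per warp


def warp_steps(branch_taken, then_cost, else_cost):
    """Count instruction slots a warp spends on one if/else.

    branch_taken[i] is True if thread i takes the 'then' side.
    The warp shares one program counter, so every side that at least one
    thread takes is executed by the whole warp while the others idle.
    """
    steps = 0
    if any(branch_taken):
        steps += then_cost
    if not all(branch_taken):
        steps += else_cost
    return steps


uniform = [True] * WARP_SIZE                       # no divergence
diverged = [i % 2 == 0 for i in range(WARP_SIZE)]  # half the threads per side

print(warp_steps(uniform, then_cost=10, else_cost=10))   # 10
print(warp_steps(diverged, then_cost=10, else_cost=10))  # 20
```

In this model a single diverging thread is enough to make the warp pay for both sides of the branch.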

2.2 Memory Hierarchy Differences

GPUs and CPUs handle memory accesses differently due to the different design constraints imposed upon the hardware. CPUs optimize for reduced memory latency for all threads of execution, so they feature large cache sizes to minimize accesses to main memory. In contrast, GPUs seek to maximize memory bandwidth. GPUs can achieve $3 \times$ the memory bandwidth of a comparable CPU [5, 8]. While GPUs can achieve higher memory bandwidth, memory latency on a given GPU can be up to $2.75 \times$ worse than a comparable CPU [61, 66].

These differences in the memory hierarchy between GPUs and CPUs have two key implications for developers:

Register Space: At a high level, the memory hierarchies of CPUs and GPUs are similar, with both devices featuring register files, data caches, and byte-addressable memory. However, GPUs feature substantially larger register files. Each thread on recent NVIDIA GPUs can have a maximum of 255 32-bit register values [12] (just under 1 KiB of storage). In contrast, the x86-64 instruction set architecture has 16 64-bit general purpose registers (128 bytes of storage). Additionally, GPUs typically feature far more threads of execution than CPUs, further magnifying the difference. Making use of this extra space is critical to maximizing performance on GPUs [36, 70].
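
As a quick check of these figures, using only the register counts quoted above:

\[
255 \times 4\,\text{B} = 1020\,\text{B}, \qquad 16 \times 8\,\text{B} = 128\,\text{B}, \qquad \frac{1020}{128} \approx 7.97
\]

so a single GPU thread can hold roughly $8 \times$ the state of the x86-64 general purpose registers, before accounting for the larger number of GPU threads.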

Memory Accesses: Rules regarding efficient memory access patterns are different for GPUs. GPUs require that programs perform coalesced memory accesses. Similar to CPUs, locality of memory accesses allows reads and writes to be cached, and is required to get optimal performance. However, GPU kernels must also ensure the locality of memory accesses across threads. Threads in a GPU warp are numerically indexed, and adjacent threads must access adjacent bytes of memory as a general rule. Otherwise, memory accesses within a warp can be serialized. Without proper memory coalescing, GPU memory bandwidth can be cut by up to $32 \times$ [12].
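
A simplified way to see where a factor of 32 can come from is to count how many memory segments one warp-wide access touches. The sketch below assumes a 32-thread warp and 128-byte memory transactions; both values, and the `transactions` helper itself, are illustrative assumptions rather than a description of any specific GPU.

```python
WARP_SIZE = 32       # assumption: threads per warp
SEGMENT_BYTES = 128  # assumption: bytes served by one memory transaction


def transactions(addresses, access_bytes=4):
    """Number of distinct segments touched by one warp-wide access."""
    segments = set()
    for addr in addresses:
        first = addr // SEGMENT_BYTES
        last = (addr + access_bytes - 1) // SEGMENT_BYTES
        segments.update(range(first, last + 1))
    return len(segments)


# Adjacent threads read adjacent 4-byte words: one segment serves the warp.
coalesced = [4 * tid for tid in range(WARP_SIZE)]
# Each thread reads from its own distant region: one segment per thread.
strided = [4096 * tid for tid in range(WARP_SIZE)]

print(transactions(coalesced))  # 1
print(transactions(strided))    # 32
```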

In addition to performance drops, GPUs can have stricter memory access policies than CPUs. Ideally, memory accesses are naturally aligned, meaning that N-byte accesses must be N-byte aligned. Unaligned accesses cause running GPU kernels to fault on NVIDIA GPUs [12], causing programs that would have run correctly on a CPU to crash on a GPU.

3 System Design

In this section we introduce VectorVisor, a vectorizing binary translator for GPUs designed to leverage the implicit parallelism provided by our programming model. Our approach enables us to run many instances of unmodified programs on the GPU concurrently, making it far easier to utilize the substantial parallelism that GPUs offer.

We first describe our programming model in depth, where we explain how developers can leverage VectorVisor to accelerate programs. We characterize a set of ideal workloads for VectorVisor and explore the limitations of our programming model. Following this, we provide a high-level overview of VectorVisor’s primary system components. To enable our simple and easy to use programming model, VectorVisor automates data transfer to and from the GPU. We illustrate this by showing the life cycle of a request processed by VectorVisor. Lastly, we provide an example of a short program, and how we transform it to run on the GPU.

3.1 Programming Model, Target Workloads

In contrast to OpenCL or CUDA, VectorVisor mimics the abstractions provided by CPU threads, treating each individual GPU thread as a small virtual machine. Each VM operates on a statically allocated chunk of memory, fully isolated from other VMs. This memory model does not support inter-VM communication, and thus prevents deadlocking. Many copies of the same program are mapped to GPU threads, which then operate on distinct inputs.

[https://github.com/SamGinzburg/VectorVisor]

This approach to parallelism for GPUs enables developers to run complex and unmodified single-threaded programs originally written for the CPU. Developers can leverage GPU acceleration without learning complex programming models or rethinking the logical structure of their programs.

VectorVisor is designed to function with unmodified workloads, but not all programs are equally amenable to acceleration using our programming model. Data parallel, latency-insensitive, and compute-bound ‘serverless-like’ workloads are ideal targets for GPU acceleration using VectorVisor. For a subset of these workloads, correct manual translation can be difficult without domain-specific knowledge—and those workloads represent an ideal use-case for VectorVisor. Suitable workloads share a number of characteristics:

**High Execution Volume:** VectorVisor relies on running many instances of the same program concurrently, instead of accelerating a single execution. Therefore, the more instances packed on the GPU, the higher the cost efficiency. Naturally, the latency QoS of the application should be able to afford the added batching latency prior to execution.

**Application Limitations:** VectorVisor runs unmodified programs where possible, but some abstractions are expensive to emulate on GPUs. Recursion and indirect calls reduce application performance due to how we implement them (explained further in Section 3.3).

Navigating the tradeoff between application concurrency and heap size is key to maximizing performance when using VectorVisor. In our evaluation, we found experimentally that running 4096-6144 VMs with a heap size of 3-4 MiB was optimal for our selected workloads. However, it is possible to run VectorVisor with varying degrees of concurrency, adjusting for different heap sizes.

Lastly, floating point differences between CPUs and GPUs can result in different outputs for applications [11, 37, 65], depending on the specific application and compiler.
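
Returning to the concurrency and heap size tradeoff above, a rough calculation shows the memory footprint involved. It counts only the statically allocated VM heaps and ignores stacks, buffers, and any other allocations, which the reported figures do not cover:

\[
4096 \times 3\,\text{MiB} = 12\,\text{GiB}, \qquad 6144 \times 4\,\text{MiB} = 24\,\text{GiB}
\]

At a fixed memory budget, doubling the heap size therefore halves the number of VMs that fit.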

**Low Divergence:** Ideal workloads should minimize program divergence in order to fully utilize the superior throughput and memory bandwidth that GPUs can offer.

**Data Transfer Overheads:** VectorVisor automates data transfer to and from the GPU; however, the overhead involved can be a substantial fraction of end-to-end request time. Maintaining a high ratio of GPU compute to input and output size is ideal for maximizing VM throughput.

### 3.2 Design Overview

VectorVisor consists of two key components, the binary translator (compiler) and the vectorized virtual machine monitor (VMM). We show an overview of VectorVisor in Figure 1, showing the role of each component as well as the life cycle of an incoming request to the system. Requests are first queued externally to VectorVisor, before being batched by the VMM, and submitted to the VMs running on the GPU—which are blocked on a system call awaiting input. We provide a pre-configured web server that automatically handles all data transfer to the VMM. After executing a batch of requests, responses are returned via another system call, and then back to the web server. This approach enables VectorVisor to be used as a drop-in replacement for existing systems, without the need for developers to manually batch incoming requests.
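
The request life cycle described above can be sketched as a simple batching loop. This is a minimal illustration under stated assumptions: `run_batches` and `handler` are hypothetical names, the "GPU" is simulated by an ordinary function call per VM, and none of this reflects VectorVisor's actual interfaces.

```python
from collections import deque


def run_batches(requests, num_vms, handler):
    """Drain a request queue in batches of at most num_vms requests.

    Each request in a batch stands for the input handed to one VM that is
    blocked waiting for input; handler plays the role of the program.
    Returns the responses in arrival order and the number of batches run.
    """
    queue = deque(requests)
    responses = []
    batches = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(num_vms, len(queue)))]
        responses.extend(handler(item) for item in batch)
        batches += 1
    return responses, batches


responses, batches = run_batches(range(10), num_vms=4, handler=lambda x: x * x)
print(batches)    # 3
print(responses)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

With ten requests and four VMs, the last batch is only half full, which hints at why a high execution volume matters for efficiency.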

Binary translation is a separate process, occurring before applications run. Programs are compiled from any language targeting LLVM [64] (e.g., Rust, C, C++) into WebAssembly, our intermediate binary format. We then compile WebAssembly to OpenCL C. Targeting OpenCL C enables VectorVisor to support multiple GPU vendors. This approach allows us to run existing programs without the need to worry about complex language semantics—we only need to concern ourselves with WebAssembly semantics which are far simpler than alternatives such as LLVM IR and directly compiling high-level languages. LLVM IR places minimal restrictions on control flow structures, and can represent programs that are impossible for any GPU to run. WebAssembly only provides structured control flow by design, ensuring that programs can always be translated to run on the GPU [54]. Alternative approaches that directly compile high-level languages to run using GPUs require substantial engineering effort, and can run into compatibility issues [37].

Our system design features a number of novel contributions that we employ to bridge the substantial differences between GPUs and CPUs described in Section 2. Our contributions succeed in bridging most of the gaps in capabilities between CPU and GPU runtimes. Recursive and indirect calls limit performance for some workloads but do not limit our functionality or correctness.

[Figure panels: 8-Byte Interleave, 4-Byte Interleave, 1-Byte Interleave]


### 3.3 Compiler

VectorVisor uses a binary translator (compiler) to translate input programs such that they can run on the GPU. The role of the compiler is to automate away the difficulties involved in writing programs for the GPU that are outlined in Section 2. We explore a set of techniques for enabling the execution of unmodified programs, which we demonstrate using a simple example of an input program.

#### 3.3.1 Compiling WebAssembly

WebAssembly (WASM) [54] is a low-level language designed for performance, size, portability, and security. WASM binaries differ significantly from x86-64 binaries, as they are designed to be recompiled before runtime and therefore retain significant compilation information. Using WASM as an intermediate format simultaneously allows us to avoid dealing with the complex semantics of higher-level languages (e.g., Rust, C, C++) while also improving the performance of VectorVisor. We make use of this information in three places within our compiler:

1. **Register Allocation:** Recent NVIDIA GPUs have up to 255 32-bit registers per thread [12], providing roughly 8× the amount of storage per CPU thread ignoring vector registers. Traditional x86-64 binaries target CPUs with only 16 64-bit general purpose registers. Static analysis could conceivably be used to place stack allocations in x86-64 binaries into GPU registers, but WebAssembly provides a more convenient solution. In contrast to x86-64 binaries, WASM is a stack-based virtual machine and does not explicitly allocate registers [54]. Instead, values are placed either onto the stack or into local variables. Figure 2 shows an example WebAssembly program, which places four integers onto the stack. During compilation, we are able to store these values directly into variables which the backend GPU compiler (OpenCL C compiler) can then place into GPU registers. This approach allows the OpenCL C compiler to place values into registers that an x86-64 compiler would have placed on the stack.

2. **Runtime support:** Most programs require some degree of modification to run on GPUs. Memory allocation, locking primitives, and threading primitives make assumptions about the underlying system that are false on GPUs. However, many such modifications are already performed by WebAssembly compilers. WASM binaries not only provide substantial compilation information, but also a “batteries-included” set of runtime modifications. Compilers targeting WebAssembly typically compile programs with a modified standard library with the necessary modifications already made.

3. **Memory Alignment:** Misaligned accesses cause running programs to crash on NVIDIA GPUs [12]. Handling misaligned accesses can be done at runtime by performing multiple aligned reads, but doing so introduces runtime overhead. Emitting optimized code for aligned accesses substantially boosts application performance. WASM binaries contain alignment information (e.g., the align attribute) that we can use to optimize reads and writes. However, the align attribute is only a hint, and as per the WASM specification, programs are expected to run correctly even with incorrectly specified alignments [54]. In practice, WASM binaries compiled by LLVM always contain the correct alignment information. By restricting the set of programs that we run to those compiled by LLVM, we can leverage these compilation hints safely to improve the performance of VectorVisor.

VectorVisor supports running programs with and without this optimization using compiler flags.
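
To make the cost of the unoptimized path concrete, the sketch below performs a 4-byte little-endian load at an arbitrary address using only naturally aligned 4-byte loads. The helper names `read_u32_aligned` and `read_u32` are hypothetical, and the code is an illustration of the general technique, not VectorVisor's emitted code. It assumes the buffer is padded so that the second aligned word always exists.

```python
import struct


def read_u32_aligned(memory, addr):
    """4-byte little-endian load that only accepts naturally aligned addresses."""
    assert addr % 4 == 0, "misaligned access"
    return struct.unpack_from("<I", memory, addr)[0]


def read_u32(memory, addr):
    """4-byte little-endian load at any address, built from aligned loads."""
    base = addr - addr % 4
    shift = (addr % 4) * 8
    low = read_u32_aligned(memory, base)
    if shift == 0:
        return low  # aligned fast path: a single load
    # Misaligned slow path: a second aligned load plus shifting and masking
    high = read_u32_aligned(memory, base + 4)
    return ((low >> shift) | (high << (32 - shift))) & 0xFFFFFFFF


memory = bytes(range(16))
print(hex(read_u32(memory, 4)))  # 0x7060504
print(hex(read_u32(memory, 1)))  # 0x4030201
```

When the alignment hint can be trusted, only the fast path needs to be emitted, which removes the extra load and the bit manipulation.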

#### 3.3.2 Memory Interleaving

GPUs have strict memory access rules to obtain optimal performance. As described in Section 2.2, GPU kernels must coalesce memory accesses to maximize memory bandwidth. Doing so requires developers to interleave objects in memory, such that adjacent threads access adjacent bytes, breaking the abstraction of a flat memory model. Other aspects of the flat memory model, such as process (or VM) memory isolation are also absent on GPUs by design.

Figure 4: The trampoline function serves as the entry point to each GPU kernel.

VectorVisor provides the abstraction of a flat memory model to developers, automatically interleaving the address space of underlying virtual machines (threads) on the GPU to provide both performance and security. This approach allows existing programs to run while also extracting the full performance benefits of a GPU, assuming that running VMs exhibit similar memory access patterns. Randomized memory access patterns or significant program divergence can reduce memory bandwidth. Figure 3 shows how memory is interleaved across VMs in VectorVisor. Memory is organized into cells of contiguous bytes. Cell addresses are computed using the following pointer arithmetic (C operator precedence):

\[
\text{cell\_addr} = (\text{offset}) \times (\text{num\_vms} \times \text{ileave}) + (\text{vm\_id} \times \text{ileave}) + \text{mem\_base}
\]

Where the interleave (ileave) represents the byte-width of the interleaving (e.g., 1, 4, or 8), the offset is the zero-indexed WebAssembly address, and mem_base is the base address of the allocated chunk of memory. Memory accesses are rewritten to operate on cells, with misaligned and larger (e.g., 8, 16-byte value) accesses requiring multiple operations. Our approach enables us to support 1, 4, and 8-byte interleavings, with larger interleavings typically achieving superior memory bandwidth.
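
The address computation can be sketched as follows. One point is an assumption made here and not confirmed by the text: the sketch first reduces the WebAssembly byte address to a cell index by integer division by `ileave`, and locates a byte inside its cell with the remainder. The function names `cell_addr` and `byte_addr` are hypothetical.

```python
def cell_addr(offset, vm_id, num_vms, ileave, mem_base=0):
    """Start address of the cell holding WebAssembly byte `offset` of one VM."""
    cell_index = offset // ileave  # assumption: byte address reduced to a cell index
    return cell_index * (num_vms * ileave) + vm_id * ileave + mem_base


def byte_addr(offset, vm_id, num_vms, ileave, mem_base=0):
    """Address of a single byte: cell start plus position inside the cell."""
    return cell_addr(offset, vm_id, num_vms, ileave, mem_base) + offset % ileave


# Four VMs reading WebAssembly address 8 with a 4-byte interleave touch
# four adjacent 4-byte cells, which is the coalesced pattern a warp needs.
print([cell_addr(8, vm, num_vms=4, ileave=4) for vm in range(4)])  # [32, 36, 40, 44]
```

Under this reading, the `vm_id` term always places a VM's cells in its own lane of each row, so two VMs can never compute the same address for any offset.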

WASM memory is represented as a zero-indexed linear array of bytes with pointers in the range of $0$ to $2^{32}-1$ and does not expose virtual addresses to running VMs. The relative addressing model WASM uses enables the compiler to control the virtual addresses of all memory reads and writes. Our cell address computation prevents VMs from computing cell addresses which belong to other VMs. This prevents out-of-bounds accesses from corrupting or leaking data and provides memory isolation by construction.

#### 3.3.3 GPU Preemption

Section 2.1 described the limitations of GPU programming models such as OpenCL and CUDA. Common features of programs such as system calls, recursion, and indirect calls may vary in support—with system calls being absent from both OpenCL and CUDA. To fully mimic the execution environment provided by a CPU in VectorVisor, we support all three features. Implementing these features within OpenCL C requires us to provide support for preempting running programs. We provide support for preemption in VectorVisor by extending OpenCL C with support for continuations. Continuations provide the abstraction of being able to pause and resume programs at arbitrary points. To maximize the performance of VectorVisor, we leverage several compiler optimizations to reduce the overhead they introduce.

Continuations. Continuation-Passing Style [81] (CPS) is a relatively uncommon programming style where functions take in an additional parameter (the continuation) and, instead of returning a value, call the provided continuation with the return value. CPS with trampolining [27] is similar to standard CPS, with the difference being that function calls return continuations instead of just calling the provided continuation. A control operator (trampoline function) is used to repeatedly call the returned continuations. Figure 4 shows the trampoline function used in VectorVisor, which is the main entry point to each running GPU kernel. Implementing CPS with trampolines in this manner enables VectorVisor to preempt running GPU kernels at arbitrary locations, although we only return control to the CPU when either every VM is finished executing or when every VM is blocked on a system call. In Figure 5 we see that the only difference between recursive, indirect, and standard calls is the returned continuation (which encapsulates the program control state). This approach makes it easy to bypass OpenCL C language-level restrictions and provide support for recursive and indirect calls.
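
The general idea of CPS with trampolining can be sketched in a few lines. This is a generic illustration of the technique in Python, not the OpenCL C trampoline shown in Figure 4; the `max_steps` budget is an assumption added here to show how a trampoline can pause a computation between continuations and resume it later.

```python
def trampoline(step, max_steps=None):
    """Run continuations until a non-callable result appears or the budget ends.

    Final results must not be callable, since callables are treated as
    pending continuations.
    """
    taken = 0
    while callable(step):
        if max_steps is not None and taken >= max_steps:
            return "paused", step  # hand back the pending continuation
        step = step()
        taken += 1
    return "done", step


def factorial(n, k):
    """CPS factorial: pass n! to continuation k instead of returning it."""
    if n <= 1:
        return lambda: k(1)
    # Return a continuation instead of recursing, so the call stack stays flat
    return lambda: factorial(n - 1, lambda result: lambda: k(n * result))


state, pending = trampoline(factorial(5, lambda result: result), max_steps=3)
print(state)                # paused
print(trampoline(pending))  # ('done', 120)
```

Because every call returns to the trampoline, recursion depth no longer depends on the native call stack, which is the property that makes recursion expressible in a language that forbids it.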

Compiler Optimizations. Naively implementing CPS with trampolines enables support for system calls and recursion, but with large runtime overheads. To obtain better performance, VectorVisor performs static analysis to minimize the size of saved program contexts. We apply liveness analysis in addition to leveraging WASM type and control flow information to enable (1) incremental context saving, (2) loop-invariant code motion, and (3) WebAssembly-specific optimizations.

Liveness is associated with local usage inside WASM stack frames, and we insert all context save and restore operations around control flow instructions (e.g., block, loop, br, br_if, and end) and function (or system) calls. Runtime taint tracking is used to further enhance our liveness estimates.

Stack frame contexts are saved incrementally—only saving values written to since the previous context save operation. Liveness estimates are used to minimize context sizes in addition to only restoring live values when resuming continuations or unwinding stack frames. Loops without recursive or indirect calls can be further optimized—with context saving and restoring operations hoisted out of the loop. WASM function type signatures are used to translate amenable indirect calls into direct calls by filtering possible indirect call targets.
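
Incremental context saving can be illustrated with a small model of a stack frame that tracks which values have been written since the last save. The `Frame` class and its method names are hypothetical and exist only for this sketch; in a compiler the dirty and live sets would be determined statically rather than tracked at runtime.

```python
class Frame:
    """Toy stack frame whose locals are saved incrementally."""

    def __init__(self, initial_locals):
        self.locals = dict(initial_locals)
        self.saved = {}
        self.dirty = set(self.locals)  # nothing has been saved yet

    def write(self, name, value):
        self.locals[name] = value
        self.dirty.add(name)

    def save(self, live):
        """Save only live values written since the previous save."""
        to_save = self.dirty & set(live)
        for name in to_save:
            self.saved[name] = self.locals[name]
        self.dirty -= to_save
        return len(to_save)  # number of values actually written out

    def restore(self, live):
        """Restore only the values that are live at the resume point."""
        for name in live:
            self.locals[name] = self.saved[name]


frame = Frame({"a": 1, "b": 2, "c": 3})
print(frame.save(live={"a", "b"}))  # 2
frame.write("a", 10)
print(frame.save(live={"a", "b"}))  # 1
```

The second save writes one value instead of three, since `b` is unchanged and `c` is not live.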

#### 3.3.4 Profile-Guided Optimization

Minimizing the overhead of translating recursive and indirect calls is key to running complex applications. Compiler optimizations eliminate much of the overhead in the common case. Edge cases, such as heavy usage of indirect and recursive calls in a tight loop, remain a challenge. While recursion often cannot be eliminated without restructuring programs, indirect calls are easier to remove [25, 34]. Most indirect calls in high-level languages have only one target: on average, 73.5% of indirect call sites in Java programs are monomorphic [59]. Despite aggressive monomorphization in the Rust compiler [26], up to 37% of the most popular Rust libraries reduce code size by not optimizing out indirect calls where possible [85]. Up to 98% of indirect calls in Java programs can be optimized out entirely [59].

We package a separate tool for instrumenting binaries to implement profile-guided optimization for VectorVisor. Each program is instrumented and run using sample inputs representative of the overall workload. Using profiler data, we replace all indirect calls with fewer than 15 observed call targets with direct calls. To avoid emitting indirect calls to handle unseen targets, we instead emit panic handlers which check for valid targets.
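
The transformation can be sketched as building a specialized dispatcher per call site from the profiled targets. Only the threshold of 15 comes from the text; the `specialize` function, the representation of a profile as a set of table indices, and the use of an exception as the abort path are assumptions for illustration.

```python
MAX_TARGETS = 15  # from the text: specialize sites with fewer than 15 observed targets


def specialize(observed_targets, table):
    """Build a dispatcher for one indirect call site from profiled targets."""
    observed = sorted(observed_targets)
    if len(observed) >= MAX_TARGETS:
        # Too many targets: keep the generic indirect call
        return lambda index, *args: table[index](*args)

    def dispatch(index, *args):
        # Stands in for a chain of comparisons followed by direct calls
        for target in observed:
            if index == target:
                return table[target](*args)
        # Unseen target: abort instead of falling back to an indirect call
        raise RuntimeError(f"unseen indirect call target {index}")

    return dispatch


table = [lambda x: x + 1, lambda x: x * 2, lambda x: -x]
call_site = specialize({0, 1}, table)
print(call_site(0, 5))  # 6
print(call_site(1, 5))  # 10
```

Calling `call_site(2, 5)` raises an error in this sketch, which mirrors the tradeoff discussed in the soundness section: targets never seen during profiling abort the program.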

#### 3.3.5 Soundness

VectorVisor performs a 1-to-1 translation for all operations in input WebAssembly programs (e.g., stack operations, memory access, arithmetic, control flow). Limitations on the soundness of our approach come from (1) Compilation to WebAssembly and (2) Optimizations.

Most common workloads can be recompiled to WebAssembly without problems, but programs which rely on specific x86 instructions (e.g., 80-bit floats), language implementation details (e.g., undefined and implementation-defined behavior), and complex language runtimes with unimplemented features (e.g., Go) can experience correctness issues.

Compiler flags and tools (e.g., wasm-snip) are used to replace panic-related functions with unreachable statements. Unrecoverable errors can be expensive to handle, and in most cases replacing them with program aborts has no impact on correctness. Profile-guided optimization (PGO) can reduce indirect call counts, significantly improving performance in some cases. Our implementation of PGO only includes function calls we observe as potential targets at indirect call sites, aborting on unseen call targets. In practice, the indirect call targets we observed did not vary significantly with user input beyond what we observed during profiling.

### 3.4 VMM

VectorVisor’s VMM handles all data transfer between the running VMs on the GPU and CPU, as well as executing all system calls. The VMM greatly simplifies the use of VectorVisor by developers, avoiding the need to manage data transfer manually or to batch incoming requests.

Support for dispatching system calls is provided through the WebAssembly System Interface (WASI). We implement two custom WASI system calls, which are used to create a serverless-like event handler API for running VMs. Other implemented calls are primarily used to initialize language runtimes (e.g., reading environment variables), support random number generation, serve as synchronization barriers (e.g., block on a subset of VMs), and perform simple IO (e.g., error logging).

Incoming requests are buffered using the request buffer, while system calls use an alternate buffer, as shown in Figure 1. Double buffering adds some overhead, but enables VectorVisor to overlap expensive network IO with on-GPU execution time. Sufficiently compute-intensive workloads avoid bottlenecking on the VMM, which can process thousands of VMs per GPU.
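The overlap that double buffering enables can be sketched as two host-side buffers that trade roles at batch boundaries. This is a minimal sketch, assuming a single-threaded model; `DoubleBuffer`, `enqueue`, and `start_next_batch` are hypothetical names, and the actual VMM layout in Figure 1 may differ.

```rust
/// Hypothetical host-side double buffer: `filling` collects requests arriving
/// over the network while `executing` holds the batch the GPU is working on.
struct DoubleBuffer {
    filling: Vec<Vec<u8>>,
    executing: Vec<Vec<u8>>,
}

impl DoubleBuffer {
    fn new() -> Self {
        DoubleBuffer { filling: Vec::new(), executing: Vec::new() }
    }

    /// Called by the network side; never touches the running batch.
    fn enqueue(&mut self, request: Vec<u8>) {
        self.filling.push(request);
    }

    /// Called when the GPU finishes a batch: the filled buffer becomes the
    /// new executing batch and the old storage is recycled for new requests.
    fn start_next_batch(&mut self) -> &[Vec<u8>] {
        std::mem::swap(&mut self.filling, &mut self.executing);
        self.filling.clear();
        &self.executing
    }
}
```

Requests that arrive while a batch is running land in `filling` and become the next batch after the swap, which is the extra copy, and the overlap, that the paragraph above refers to.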

VectorVisor supports using pinned memory transfers with multiple GPU vendors (e.g., NVIDIA, AMD) to further optimize data transfer speeds—with vendor-specific optimizations [1, 4].

## 4 Evaluation

In this section, we present an evaluation of VectorVisor. First, we discuss the efficiency of (1) our memory interleaving and (2) system call implementation. Second, we explore a variety of modified and unmodified workloads to better understand the tradeoff space of our novel approach to accelerating programs. In several cases we show that we obtain superior throughput-per-dollar against x86 CPUs. Breakdowns of the end-to-end latencies of each benchmark are provided as well to explain our results. Finally, we evaluate the efficiency of our translation against handwritten CUDA baselines.

Table 1: Hardware Configurations. Prices as of 1/5/2023.

<table>
<thead>
<tr>
<th>Instance Name</th>
<th>CPU</th>
<th>GPU</th>
<th>Cost/Hr</th>
</tr>
</thead>
<tbody>
<tr>
<td>g4ad.xlarge</td>
<td>AMD EPYC 7002</td>
<td>AMD Radeon Pro V520</td>
<td>$0.3785</td>
</tr>
<tr>
<td>g4dn.xlarge</td>
<td>Intel Cascade Lake</td>
<td>NVIDIA T4</td>
<td>$0.526</td>
</tr>
<tr>
<td>g4dn.2xlarge</td>
<td>Intel Cascade Lake</td>
<td>NVIDIA T4</td>
<td>$0.752</td>
</tr>
<tr>
<td>g5.xlarge</td>
<td>AMD EPYC 7002</td>
<td>NVIDIA A10G</td>
<td>$1.006</td>
</tr>
<tr>
<td>g5.2xlarge</td>
<td>AMD EPYC 7002</td>
<td>NVIDIA A10G</td>
<td>$1.212</td>
</tr>
<tr>
<td>c5.xlarge</td>
<td>Intel Cascade Lake</td>
<td>N/A</td>
<td>$0.17</td>
</tr>
<tr>
<td>c5a.xlarge</td>
<td>AMD EPYC 7002</td>
<td>N/A</td>
<td>$0.154</td>
</tr>
</tbody>
</table>

### 4.1 Methodology

**Testbed.** We evaluated VectorVisor using Amazon Web Services (AWS). Five different VM types were used to compare against x86-64 baselines, and two larger instance types were used to compare VectorVisor against CUDA baselines. We provisioned three VMs with attached GPUs (g4ad.xlarge, g4dn.xlarge, g5.xlarge), each with 4 vCPUs and 16 GiB of memory. Two additional compute-optimized VMs were used for evaluating CPU performance (c5.xlarge, c5a.xlarge), each with 4 vCPUs and 8 GiB of memory. Lastly, we used a single invoker VM (c5.8xlarge) for sending requests. These instances were used to obtain the results in Figures 8 and 9. Double extra large (2xlarge) instances have 2× the memory and vCPUs of the corresponding xlarge instances and were used, in addition to the xlarge instances, to evaluate handwritten CUDA programs. CUDA results which use 2xlarge instances can be found in Table 4. All VMs are allocated in us-east-1, in the same availability zone. Benchmarks are evaluated end-to-end over the network with IO and system overheads included in all measurements.

Some hardware configurations could not be evaluated due to AMD-specific bugs. AMD v520 GPUs, which are the only AMD GPUs available on AWS, are unsupported in ROCm [21, 23], resulting in runtime crashes. Two benchmarks (Strings-Go and Strings-AScript) are built with WebAssembly-focused runtimes and do not have evaluated x86-64 configurations. Detailed results for all system configurations can be found in Tables 5 and 6 (Appendix).

Table 1 shows the hardware each VM has attached. NVIDIA configurations use the latest CUDA 12 backend, while AMD GPUs use ROCm 5.4.0 with the latest AMDGPU-Pro driver. For our GPU instances, we do not run any fraction of our workload on the available CPU cores. We leave the exploration of heterogeneous deployments to future work.

**Example Functions.** Perceptual hashing is widely used in industry, such as by Facebook [6, 7], to cross-reference a given image against a database of images. We evaluate an open-source implementation of Blockhash, a variation on existing perceptual hashing algorithms [9, 90]. To further evaluate the efficiency of our translation, we also evaluate a modified Blockhash library that we optimized to run more efficiently using VectorVisor. Additionally, we evaluate a bill generator which generates PDFs containing a set of purchased items formatted with a default template. Both benchmarks use mock data to simulate realistic workloads, with the hashing benchmark using 200×200 randomly generated images and the bill generation benchmark using 25 randomly generated item names and prices with an attached image.

**Microbenchmarks.** We evaluate a set of common microbenchmarks, including image processing workloads (e.g., Gaussian image blur), cryptography (e.g., password-based key derivation functions such as Scrypt and Pbkdf2), string compression (LZ4), histogram computation (Histogram), and string processing (e.g., stop word filtering and hashtag extraction).

**Baseline Comparison.** For our evaluation, we use two different baselines as points of comparison. First, for each of our benchmarks we compile them to WebAssembly (WASM), optimize them using wasm-snip and wasm-opt [17, 91], and execute them using Wasmtime [18] (a popular WASM JIT compiler). VectorVisor takes in the same WASM binary as an input. Second, for each of our benchmarks we compile and run them natively on an x86-64 CPU. This is the default choice for many developers who choose to run applications in the cloud, as most programs target x86-64. Each CPU benchmark is evaluated with multiple threads executing in parallel, proportional to the number of cores available.

Table 2: Details of evaluated benchmarks. We count benchmarks as containing recursive or indirect calls only if they execute those calls in the critical path of the application. All-time crates.io download counts are as of 1/5/2023. ‘M’ and ‘A’ represent memory-intensive and arithmetically intensive benchmarks respectively. ‘D’ represents benchmarks with substantial divergence. ‘Alg’ represents benchmarks with significant algorithmic differences. \*Bill-PDF uses a no-op system call as a barrier to mitigate heavy program divergence, but does not modify imported libraries.

### 4.2 System Performance

#### 4.2.1 Copy Efficiency

**Memory Bandwidth.** To demonstrate that our memory interleaving can efficiently utilize the high memory bandwidth of GPUs, we evaluate five different memcpy implementations which vary copy size (bytes copied per loop iteration) and loop unroll count. For each configuration we copy 1 MiB of data (using volatile memory accesses to bypass caching effects) from one array in memory to another non-aliased array. Each benchmark is run 50 times, with a heap size of 3 MiB and with 4096 (on T4) or 6144 (on A10G) VMs running concurrently. Figure 6 shows that VectorVisor can achieve close to 100% of the experimentally derived maximum memory bandwidth of the T4 [61] and 74% of the theoretical memory bandwidth of the A10G [5]. We can see that larger interleaves, loop unrolling, and instruction-level parallelism (ILP) [88] all have substantial impacts on memory bandwidth. VectorVisor leverages the `memory.copy` and `memory.fill` WASM intrinsics to insert optimized copy and fill functions into programs.

**Syscall Performance.** System calls provide a simple, familiar abstraction for developers to transfer inputs to and from a GPU. However, performing per-VM system calls incurs high data transfer overheads for smaller inputs. To evaluate our system call implementation, we copy inputs to and from the GPU, using batch sizes of 2048 (v520), 4096 (T4), and 6144 (A10G). Figure 7 shows the bandwidth for our VMM excluding network IO. Native CUDA transfer speeds for a single large transfer peak at 6.3 GB/s for the T4 and 12.9 GB/s for the A10G. Despite high batching overheads, VectorVisor obtains ~25% of the maximum possible bandwidth for fine-grained transfers of 256 KiB per request using the T4. VectorVisor additionally supports overlapping data transfers with running GPU programs to avoid bottlenecks on VMM overhead.
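To make the notion of an "interleave" in the memcpy benchmark concrete, the sketch below shows one generic way to interleave the linear memories of many VMs in fixed-width chunks, so that VMs executing the same access in lockstep touch adjacent addresses. The exact layout VectorVisor uses is not given in this section; the function name, its parameters, and the formula are assumptions chosen to illustrate the idea.

```rust
/// Map a byte `offset` inside VM number `vm`'s linear memory to an offset in
/// one shared heap, where `num_vms` memories are interleaved in chunks of
/// `width` bytes (e.g., 4 or 8, as in the Interleave column of Table 5).
fn interleaved_offset(vm: usize, offset: usize, width: usize, num_vms: usize) -> usize {
    let chunk = offset / width;   // which width-byte chunk of this VM's memory
    let within = offset % width;  // byte position inside that chunk
    // Chunk k of every VM is stored contiguously: VM 0, VM 1, ..., VM n-1.
    chunk * num_vms * width + vm * width + within
}
```

With a 4-byte interleave and 4096 VMs, offset 0 of VM 0 and offset 0 of VM 1 are 4 bytes apart, while consecutive chunks of the same VM are 16384 bytes apart. A wider interleave lets each VM move more contiguous bytes per access, which is one plausible reading of why larger interleaves improve bandwidth in Figure 6.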

Table 3: Profile-Guided Optimization Results. Cumulative indirect and unoptimized call counts for 200 invocations of each instrumented WASM function. These benchmarks were run locally using a 16-core, 64 GB RAM machine running Ubuntu 18.04.

<table>
<thead>
<tr>
<th>Function</th>
<th># Total Slowcalls</th>
<th># Total Slowcalls w/PGO</th>
<th># Indirect Calls</th>
<th># Indirect Calls w/PGO</th>
</tr>
</thead>
<tbody>
<tr>
<td>Scrypt</td>
<td>52062</td>
<td>206</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>Pbkdf2</td>
<td>4923</td>
<td>1211</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Blur-Jpeg</td>
<td>4023</td>
<td>206</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Blur-Bmp</td>
<td>1416</td>
<td>213</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>PHash</td>
<td>11440</td>
<td>7844</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>PHash-Modified</td>
<td>1410</td>
<td>206</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Bill-PDF</td>
<td>285632</td>
<td>2023</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>Histogram</td>
<td>4117086</td>
<td>206</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>LZ4</td>
<td>804807</td>
<td>206</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Strings</td>
<td>43944</td>
<td>43744</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Strings-Go</td>
<td>137212574</td>
<td>2990690</td>
<td>143397289</td>
<td>0</td>
</tr>
<tr>
<td>Strings-AssemblyScript</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Complex runtimes such as Go and AssemblyScript have significantly higher overhead than Rust on x86-64 WASM baselines (on average 0.41× the throughput of the Rust baseline for Strings-Go using the T4). Runtime support for garbage collection, reflection, and compiler design choices in TinyGo and AssemblyScript all contribute to the observed overheads.

#### 4.2.3 Throughput-per-dollar

Throughput as a metric is insufficient to evaluate VectorVisor. Improving throughput for data parallel workloads by allocating more resources (VMs) represents the status quo. Instead, we show that VectorVisor can achieve greater efficiency—improving throughput using fewer resources. Measuring efficiency requires normalizing performance across both CPUs and GPUs, which we accomplish using throughput-per-dollar. It is computed by dividing the requests-per-second (RPS) by the cost of each respective instance per-hour using on-demand pricing in us-east-1. On-demand prices are used as a conservative measure of the cost benefits of VectorVisor. Spot instance pricing can be cheaper, further improving the throughput-per-dollar of GPU (T4) instances vs CPU (Intel) instances by 1.49× (reported as of 1/5/2023).

Figure 9 shows the best throughput-per-dollar results for each configuration. Detailed throughput-per-dollar results for all system configurations can be found in Table 6 (Appendix). VectorVisor outperforms x86 instances for four benchmarks (Scrypt, Blur-Bmp, PHash-Modified, Bill-PDF), and on all but two benchmarks versus WebAssembly. Throughput-per-dollar results are overall lower than our throughput results in Section 4.2.2. Leveraging GPU acceleration requires substantial throughput improvements to offset the high cost of GPU hardware (e.g., 3.42× for the T4 vs. AMD x86-64 CPUs). Bottlenecks on application-level divergence (e.g., Strings) and data transfer overheads (e.g., LZ4 and Histogram) result in lower throughput and throughput-per-dollar results.
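The metric and the break-even point mentioned above reduce to two divisions. The sketch below is a worked example using prices from Table 1 and the T4 Blur-Bmp throughput listed in Table 4; the function and constant names are introduced here for illustration only.

```rust
/// On-demand hourly prices (USD) taken from Table 1.
const G4DN_XLARGE_COST: f64 = 0.526; // NVIDIA T4 instance
const C5A_XLARGE_COST: f64 = 0.154;  // AMD x86-64 CPU instance

/// Throughput-per-dollar as defined in the text: requests per second divided
/// by the instance's hourly price.
fn throughput_per_dollar(rps: f64, cost_per_hour: f64) -> f64 {
    rps / cost_per_hour
}

/// Raw throughput advantage a GPU instance needs over a CPU instance before
/// it comes out ahead on throughput-per-dollar: simply the price ratio.
fn break_even_speedup(gpu_cost_per_hour: f64, cpu_cost_per_hour: f64) -> f64 {
    gpu_cost_per_hour / cpu_cost_per_hour
}
```

Here `throughput_per_dollar(804.83, G4DN_XLARGE_COST)` evaluates to roughly 1530.1, matching the Throughput/$ entry for Blur-Bmp on the T4 in Table 4, and `break_even_speedup(G4DN_XLARGE_COST, C5A_XLARGE_COST)` is roughly 3.42, the factor quoted for the T4 versus AMD x86-64 CPUs.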

In three out of the four benchmarks where VectorVisor surpasses our x86-64 baselines, the T4 outperforms the A10G, even though it belongs to an earlier generation of GPUs (e.g., Turing vs. Ampere). Despite differences in GPU hardware, the best predictor of superior throughput-per-dollar with VectorVisor is the ratio of the global memory size (e.g., the number of VMs that can fit) to cost. Compared to the A10G and v520, the T4 packs 27.5% and 43% more VMs-per-dollar respectively. Workloads such as Scrypt, which leverage hardware differences like the larger memory bandwidth of the A10G, can break this trend.
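The VMs-per-dollar comparison can be approximated from numbers that appear elsewhere in this section. The sketch below assumes that the number of VMs each GPU holds equals the batch sizes used in the system call experiment (2048 for the v520, 4096 for the T4, 6144 for the A10G) and uses the xlarge prices from Table 1; both the helper names and that assumption are ours, not the paper's stated method.

```rust
/// VMs that fit on the GPU per dollar of hourly instance cost.
fn vms_per_dollar(vms: u32, cost_per_hour: f64) -> f64 {
    f64::from(vms) / cost_per_hour
}

/// How much larger `a` is than `b`, in percent.
fn percent_more(a: f64, b: f64) -> f64 {
    (a / b - 1.0) * 100.0
}

/// Returns (T4 vs. A10G, T4 vs. v520) advantages in VMs-per-dollar, in percent.
/// VM counts are assumed equal to the batch sizes quoted in the text.
fn t4_advantage() -> (f64, f64) {
    let t4 = vms_per_dollar(4096, 0.526);    // g4dn.xlarge
    let a10g = vms_per_dollar(6144, 1.006);  // g5.xlarge
    let v520 = vms_per_dollar(2048, 0.3785); // g4ad.xlarge
    (percent_more(t4, a10g), percent_more(t4, v520))
}
```

Under these assumptions the T4 advantage over the A10G comes out at about 27.5%, and the advantage over the v520 at just under 44%, close to the 43% in the text; the small gap may stem from rounding or from the exact VM counts used.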

#### 4.2.4 Latency

VectorVisor runs many instances of a program in parallel, improving total throughput, but not latency. Batches of requests have higher on-device execution times than x86-64, ranging from 84-1040× longer using the T4—limiting usage to non-latency sensitive applications.

Figure 9: *Benchmark Throughput-per-Dollar.* Results are normalized to the x86-64 baseline for each benchmark except for Strings-Go and Strings-AScript, which are normalized to the x86-64 baseline of Strings-Rust instead. \*Benchmarks without an AMD v520 result.

Figure 10: Per-benchmark latency breakdown of execution time, VMM overhead (e.g., syscall overhead), continuations overhead (e.g., context saving/restoring), and network IO. Breakdowns correspond to the best performing configurations with PGO disabled from Table 5.

### 4.3 Latency breakdown

Figure 10 shows the end-to-end (E2E) latency breakdown for each benchmark. Batch sizes, which impact request latency, can be found in Table 2. On-device execution time dominates the E2E latency for most benchmarks, with the histogram benchmark being the exception. We see that supporting preemption using continuations has low overhead, varying between <1% (PHash-Modified) and 19% (Blur-Bmp) of the on-device execution time. Similarly, by overlapping compute with VMM and network IO, VectorVisor significantly reduces related overheads. Benchmarks with a low operational intensity (Ops/Byte) (e.g., Histogram, LZ4) which cannot overlap on-device execution time with batch formation as efficiently are more likely to bottleneck on VMM or network IO.

<table>
<thead>
<tr>
<th>GPU</th>
<th>Platform</th>
<th>Instance Name</th>
<th>Benchmark</th>
<th>Throughput (RPS)</th>
<th>Throughput/$</th>
</tr>
</thead>
<tbody>
<tr>
<td>NVIDIA T4</td>
<td>VectorVisor</td>
<td>g4dn.xlarge</td>
<td>Blur-Bmp</td>
<td>804.83</td>
<td>1530.10</td>
</tr>
<tr>
<td>NVIDIA A10G</td>
<td>VectorVisor</td>
<td>g5.xlarge</td>
<td>Blur-Bmp</td>
<td>1365.84</td>
<td>1357.69</td>
</tr>
<tr>
<td>NVIDIA T4</td>
<td>VectorVisor</td>
<td>g4dn.xlarge</td>
<td>PHash-Modified</td>
<td>384.32</td>
<td>730.65</td>
</tr>
<tr>
<td>NVIDIA A10G</td>
<td>VectorVisor</td>
<td>g5.xlarge</td>
<td>PHash-Modified</td>
<td>608.02</td>
<td>604.40</td>
</tr>
<tr>
<td>NVIDIA T4</td>
<td>CUDA</td>
<td>g4dn.xlarge</td>
<td>Blur-Bmp</td>
<td>576.28</td>
<td>1095.59</td>
</tr>
<tr>
<td>NVIDIA T4</td>
<td>CUDA</td>
<td>g4dn.2xlarge</td>
<td>Blur-Bmp</td>
<td>1118.95</td>
<td>1487.96</td>
</tr>
<tr>
<td>NVIDIA A10G</td>
<td>CUDA</td>
<td>g5.xlarge</td>
<td>Blur-Bmp</td>
<td>652.96</td>
<td>649.06</td>
</tr>
<tr>
<td>NVIDIA T4</td>
<td>CUDA</td>
<td>g4dn.2xlarge</td>
<td>PHash-Modified</td>
<td>821.15</td>
<td>1091.95</td>
</tr>
<tr>
<td>NVIDIA A10G</td>
<td>CUDA</td>
<td>g5.xlarge</td>
<td>PHash-Modified</td>
<td>462.27</td>
<td>459.52</td>
</tr>
<tr>
<td>NVIDIA A10G</td>
<td>CUDA</td>
<td>g5.2xlarge</td>
<td>PHash-Modified</td>
<td>896.35</td>
<td>738.56</td>
</tr>
</tbody>
</table>

Table 4: Performance of handwritten CUDA benchmarks.

#### 4.3.1 CUDA Comparison

Leveraging GPU acceleration typically involves manually breaking down a program into fine-grained tasks which can be parallelized—speeding up individual invocations of a function. In contrast, VectorVisor runs many instances of the same program in parallel, improving throughput but not latency. To evaluate the efficiency of our translation, we manually rewrote two benchmarks (Blur-Bmp and PHash-Modified) using CUDA. CUDA baselines incur additional CPU overhead from increased kernel launch overheads and running a fraction of the workload on the CPU. To fairly evaluate these baselines, we benchmark them using both xlarge and 2xlarge instances with additional CPUs (Table 1).

We see in Table 4 that VectorVisor slightly outperforms a handwritten CUDA Gaussian blur function, and obtains 67% of the throughput-per-dollar of our CUDA PHash function. PHash-Modified has higher VMM overhead than Blur-Bmp (35% vs 12% of the E2E latency) which affects overall efficiency.

## 5 Discussion

**Workload Characterization.** Identifying ideal workloads for VectorVisor is key to improving the cost efficiency of real applications. Ideal workloads minimize divergent execution, recursion, indirect calls, and are compute-bound. Future work can incorporate model-based approaches [79] to identifying acceleration opportunities for VectorVisor.

**Evaluation Limitations.** We use both throughput and throughput-per-dollar as evaluation metrics. Throughput-per-dollar is a powerful metric that enables us to compare the end-to-end efficiency of VectorVisor, which considers system complexities as well as the capital and operational cost implications of running throughput-oriented workloads. Cloud providers allow customers to insure themselves against high variation in hardware pricing [62, 63], providing a steady baseline cost (at a premium). On-premises hardware configurations can be less expensive over long periods of time, for those willing to pay higher up-front costs. Despite shortcomings, cost-based efficiency metrics provide tangible baselines.

**System Call Implementations.** Providing support for system calls using continuations was key to running realistic workloads using VectorVisor. Systems such as GPUs, GPUfs, and Berkeley Borph [76–78, 82, 86] instead provide support using a more performant RPC-like interface built on vendor-specific APIs or custom drivers. RPC-style interfaces rely on the ability to perform concurrent and consistent CPU-GPU memory accesses. OpenCL 2.0 in theory enables this with fine-grained buffer SVM [3, 53]. In practice, support for fine-grained SVM is mixed, with NVIDIA OpenCL 3.0 not supporting the API and AMD providing only partial support. Continuations provide a cross-platform and reasonably performant approach to supporting system calls on GPUs.

## 6 Related Work

**Continuations.** Continuations are often used by compilers to support complex control flow operations such as exceptions and preemption [27, 28, 30, 71, 84]. VectorVisor uses continuations to efficiently provide support for preemption and complex control flow on GPUs.

**GPU Preemption.** GPU kernel preemption can be supported through compiler-based approaches that partition (or slice) programs into chunks [29, 35, 39, 89, 93, 94], or with hardware/driver support [13, 68, 83, 86].

**High-Level GPU Languages.** CUDA and OpenCL require developers to write programs using low-level abstractions. High-level language approaches [2, 31, 44, 48, 52, 55, 57, 72, 92] make it easier to accelerate existing programs by reusing existing codebases. However, common language features such as dynamic memory allocation, garbage collection, reflection, and recursion are often absent. Unlike with VectorVisor, code often must be rewritten to explicitly leverage parallel APIs.

**Domain-Specific GPU Systems and Languages.** Programming languages designed for domain-specific workloads (DSLs) [24, 38, 40, 46, 58, 73–75, 80] can offer substantially improved performance over general-purpose programming languages. DSLs obtain superior performance through language restrictions, forcing developers to express programs using specific syntax or function calls. While DSLs can efficiently accelerate specific workloads, they trade programmability for performance: many workloads cannot be expressed using restrictive DSLs. Similar to DSLs, domain-specific systems can significantly improve performance for throughput-oriented workloads [32, 42, 56, 87]. Domain-specific systems vectorize common workloads (e.g., image processing, machine learning, database operations) using handwritten GPU kernels. Other systems manually vectorize functions from (non GPU-specific) DSLs (e.g., SQL) [50, 51, 69].

**Vectorized Program Translation.** Systems that abstract a SIMT or SIMD lane as a VM often target restricted use-cases (e.g., fuzz testing) [37, 45, 47, 49]. VectorVisor’s design and implementation notably differ from prior work, offering superior GPU language, runtime, and hardware support.

## 7 Conclusion

VectorVisor is a research prototype which demonstrates that applications originally written for CPUs can be directly run on GPUs without significant modifications. Not only is such GPU execution possible, but it can in fact yield superior throughput-per-dollar versus compute-optimized x86-64 CPUs in the cloud.

Binary translation for GPUs is an exciting and predominantly unexplored area of research, with many potential applications. VectorVisor shows the viability of our new approach to parallelism, opening up the area to future research.

## 8 Acknowledgements

We would like to thank our shepherd, Redha Gouicem, and the anonymous reviewers for helping us improve this paper. We also thank Rachit Nigam, Mieszko Lis, and Devon Loehr for their valuable comments on earlier versions of it.

## References


[54] Andreas Haas, Andreas Rossberg, Derek L Schuff, Ben L Titzer, Michael Holman, Dan Gohman, Luke Wagner, Alon Zakai, and JF Bastien. Bringing the web up


## A Appendix

### A.1 Tables

<table>
<thead>
<tr>
<th>System</th>
<th>Platform</th>
<th>PGO</th>
<th>Interleave</th>
<th>Scrypt</th>
<th>Pbkdf2</th>
<th>Blur Jpeg</th>
<th>Blur Bmp</th>
<th>PHash</th>
<th>PHash Mod.</th>
<th>Bill PDF</th>
<th>Histogram</th>
<th>LZ4</th>
<th>Strings (Rust / Go / AScript)</th>
</tr>
</thead>
<tbody>
<tr>
<td>VectorVisor</td>
<td>AMD v520</td>
<td>Y</td>
<td>4</td>
<td>N/A</td>
<td>32.27</td>
<td>202.24</td>
<td>366.64</td>
<td>70.28</td>
<td>77.85</td>
<td>N/A</td>
<td>832.43</td>
<td>449.89</td>
<td>N/A / N/A / N/A</td>
</tr>
<tr>
<td>VectorVisor</td>
<td>AMD v520</td>
<td>Y</td>
<td>8</td>
<td>N/A</td>
<td>29.29</td>
<td>245.76</td>
<td>N/A</td>
<td>79.67</td>
<td>89.86</td>
<td>N/A</td>
<td>848.31</td>
<td>N/A</td>
<td>N/A / N/A / N/A</td>
</tr>
<tr>
<td>VectorVisor</td>
<td>NVIDIA T4</td>
<td>N</td>
<td>4</td>
<td>Y</td>
<td>109.32</td>
<td>209.94</td>
<td>726.59</td>
<td>129.18</td>
<td>341.73</td>
<td>339.06</td>
<td>1570.90</td>
<td>818.90</td>
<td>9535.47 / 4112.16 / 861.24</td>
</tr>
<tr>
<td>VectorVisor</td>
<td>NVIDIA T4</td>
<td>N</td>
<td>8</td>
<td>Y</td>
<td>160.25</td>
<td>1355.05</td>
<td>804.83</td>
<td>140.44</td>
<td>367.31</td>
<td>380.55</td>
<td>1730.05</td>
<td>1114.30</td>
<td>10242.08 / 3735.93 / 826.29</td>
</tr>
<tr>
<td>VectorVisor</td>
<td>NVIDIA T4</td>
<td>Y</td>
<td>4</td>
<td>N/A</td>
<td>108.51</td>
<td>63.68</td>
<td>50.66</td>
<td>720.79</td>
<td>10.44</td>
<td>349.98</td>
<td>398.26</td>
<td>1827.91</td>
<td>371.75  / 9077.32 / 4204.24 / 845.07</td>
</tr>
<tr>
<td>VectorVisor</td>
<td>NVIDIA T4</td>
<td>Y</td>
<td>8</td>
<td>N/A</td>
<td>170.54</td>
<td>59.49</td>
<td>60.59</td>
<td>726.86</td>
<td>46.94</td>
<td>584.32</td>
<td>497.42</td>
<td>1676.06</td>
<td>486.94 / 10332.82 / 3842.04 / 841.04</td>
</tr>
<tr>
<td>VectorVisor</td>
<td>NVIDIA A10G</td>
<td>N</td>
<td>4</td>
<td>Y</td>
<td>2596.52</td>
<td>93.30</td>
<td>1237.87</td>
<td>153.99</td>
<td>546.35</td>
<td>480.29</td>
<td>868.25</td>
<td>1527.21</td>
<td>24534.24 / 8438.01 / 1781.76</td>
</tr>
<tr>
<td>VectorVisor</td>
<td>NVIDIA A10G</td>
<td>N</td>
<td>8</td>
<td>Y</td>
<td>1365.84</td>
<td>158.99</td>
<td>592.87</td>
<td>509.71</td>
<td>484.22</td>
<td>1490.67</td>
<td>25981.84 / 8030.55 / 1965.93</td>
<td></td>
<td></td>
</tr>
<tr>
<td>VectorVisor</td>
<td>NVIDIA A10G</td>
<td>Y</td>
<td>4</td>
<td>N/A</td>
<td>297.08</td>
<td>145.66</td>
<td>71.73</td>
<td>1166.44</td>
<td>14.81</td>
<td>533.85</td>
<td>308.62</td>
<td>2356.72</td>
<td>722.01  / 24109.98 / 8377.06 / 2398.01</td>
</tr>
<tr>
<td>VectorVisor</td>
<td>NVIDIA A10G</td>
<td>Y</td>
<td>8</td>
<td>N/A</td>
<td>397.10</td>
<td>143.55</td>
<td>127.47</td>
<td>1163.57</td>
<td>20.87</td>
<td>608.02</td>
<td>619.04</td>
<td>517.97</td>
<td>945.06  / 26598.05 / 8002.31 / 1977.61</td>
</tr>
</tbody>
</table>

Table 5: Average requests per second (RPS) of each benchmark. Bold values correspond to the best throughput.

<table>
<thead>
<tr>
<th>System</th>
<th>Platform</th>
<th>PGO</th>
<th>Interleave</th>
<th>Scrypt</th>
<th>Pbkdf2</th>
<th>Blur Jpeg</th>
<th>Blur Bmp</th>
<th>PHash</th>
<th>PHash Mod.</th>
<th>Bill PDF</th>
<th>Histogram</th>
<th>LZ4</th>
<th>Strings (Rust / Go / AScript)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU (x86-64)</td>
<td>AMD</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>33.89</td>
<td>1233.44</td>
<td>176.63</td>
<td>147.14</td>
<td>68.02</td>
<td>111.18</td>
<td>1140.20</td>
<td>2235.93</td>
<td>11002.53 / N/A / N/A</td>
</tr>
<tr>
<td>CPU (x86-64)</td>
<td>Intel</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>34.27</td>
<td>149.33</td>
<td>148.53</td>
<td>153.43</td>
<td>55.83</td>
<td>85.95</td>
<td>1144.96</td>
<td>1987.63</td>
<td>10182.98 / N/A / N/A</td>
</tr>
<tr>
<td>CPU (WASM)</td>
<td>AMD</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>5.33</td>
<td>52.67</td>
<td>33.81</td>
<td>36.13</td>
<td>19.86</td>
<td>24.97</td>
<td>697.94</td>
<td>700.77</td>
<td>1536.22 / 1450.18 / 485.62</td>
</tr>
<tr>
<td>CPU (WASM)</td>
<td>Intel</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>4.35</td>
<td>46.49</td>
<td>25.83</td>
<td>17.33</td>
<td>12.13</td>
<td>21.44</td>
<td>659.40</td>
<td>596.05</td>
<td>1431.18 / 1375.47 / 514.84</td>
</tr>
</tbody>
</table>

Table 6: Benchmark Throughput-per-Dollar. Values correspond to the average RPS of each benchmark normalized to instance cost. Bold values correspond to the best throughput-per-dollar.

### A.2 Rust Example

```rust
#[macro_use]
extern crate lazy_static;
// Import existing open-source libraries!
use pdf_writer::*;
use pdf_writer::types::{ActionType, AnnotationType, BorderType};
use std::fs::File;
use std::io::Write;
use std::time::Instant;
// Import our custom 'serverless' runtime. We use this in our x86 and WASM benchmarks as well.
use wasm_serverless_invoke::wasm_handler::*;
use wasm_serverless_invoke::wasm_handler::WasmHandler;
use wasm_serverless_invoke::wasm_handler::SerializationFormat::MsgPack;
use serde::Deserialize;
use serde::Serialize;
// Image and compression libraries
use image::{ColorType, GenericImageView, ImageFormat};
use miniz_oxide::deflate::{compress_to_vec_zlib, CompressionLevel};
// Include a sample template image for our PDF footer
lazy_static! {
    static ref EMBED_IMAGE: &[u8] = include_bytes!("test.png");
}
// Syntactic sugar for (de)serializing JSON/MsgPack inputs
#[derive(Debug, Deserialize)]
struct FuncInput {
    name: String,
    purchases: Vec<String>,
    price: Vec<f64>, // Typically prices should not be encoded as floats, we do this for simplicity.
}
#[derive(Debug, Deserialize)] // BatchInput is deserialized from the incoming request
struct BatchInput {
    inputs: Vec<FuncInput>
}
#[derive(Debug, Serialize)]
struct FuncResponse {
    resp: Vec<u8>
}
#[derive(Debug, Serialize)]
struct BatchFuncResponse {
    resp: Vec<FuncResponse>
}
#[inline(never)]
fn makePdf(event: FuncInput) -> Vec<u8> {
    // Perform PDF formatting, image manipulation, and compression to generate a valid PDF
}
fn batch_genpdf(inputs: BatchInput) -> BatchFuncResponse {
    let mut results = vec![];
    for input in inputs.inputs {
        results.push(FuncResponse { resp: makePdf(input) });
        unsafe { vectorvisor_barrier() }; // We can wait on arbitrary subsets of VMs (unlike OpenCL barrier(...))
    }
    return BatchFuncResponse { resp: results };
}
fn main() {
    // Specify input format type and buffer sizes
    let handler = WasmHandler::new(&batch_genpdf);
    // Starts the event-loop and encapsulates serverless_invoke/serverless_response
    handler.run_with_format(1024*512, MsgPack);
}
```

Figure 11: Bill-PDF. This benchmark performs PDF processing, image manipulation, and compression.

### A.3 Golang Example

```go
package main

// Define our system call interface (the cgo preamble follows).

// #include "serverless.c"
import "C"
import (
   // Import JSON + string manipulation libraries
   "github.com/json-iterator/tinygo"
   "strings"
   "unsafe"
)

//go:generate go run github.com/json-iterator/tinygo/gen

type Payload struct {
   Tweets []string `json:"tweets"`
}

//go:generate go run github.com/json-iterator/tinygo/gen

type Response struct {
   Tokenized [][]string
   Hashtags [][]string
}

// Go doesn't provide Map/Filter for us, so we use our own implementation
func Map[T, U any](ts []T, f func(T) U) []U {
   ...
}

func Filter(vs []string, f func(string) bool) []string {
   ...
}

func main() {
   json := jsoniter.CreateJsonAdapter(Payload_json{}, Response_json{})
   // Use this as a set, track all stopwords
   stopwordsSet := make(map[string]bool)
   for _, word := range stopWords {
      stopwordsSet[word] = true
   }
   input_buffer := make([]byte, 1024*450) // buffer for raw inputs from VectorVisor
   for { // serverless_invoke is the system call used for transferring inputs from the host (CPU) to the GPU
      in_size := C.serverless_invoke((*C.char)(unsafe.Pointer(&input_buffer[0])), 1024*450)
      if in_size == 0 { // if in_size == 0, then this VM is blocked off and has no input for this batch
         fakeaddr := uintptr(0x0) // serverless_response copies outputs from the GPU back to the CPU.
         C.serverless_response((*C.char)(unsafe.Pointer(fakeaddr)), 0)
         continue
      }
      var input Payload;
      json.Unmarshal(input_buffer[0:in_size], &input);
      // First tokenize each tweet [][]string --> [][]string ...
      // Now process each tweet, filtering out stop words ...
      // Get the hashtags, we will add them as we see them
      var tags = make([][]string, 0)
      ...
      var response Response; // create a JSON response and return it!
      response.Tokenized = tokenized;
      response.Hashtags = tags;
      bytes, _ := json.Marshal(response);
      C.serverless_response((*C.char)(unsafe.Pointer(&bytes[0])), (C.uint)(len(bytes)))
   }
}

```

Figure 12: Strings-Go. Tokenizes input tweets and returns the hashtags. TinyGo (https://tinygo.org/docs/reference/lang-support/) provides a conservative mark-and-sweep garbage collector, limited runtime reflection, and goroutine support.

A.4 AssemblyScript Example

```typescript
import { Console, FileSystem, Descriptor } from "as-wasi/assembly"; // Import needed syscalls
import { JSON, JSONEncoder } from "assemblyscript-json/assembly"; // Import a JSON encoder/decoder
import { listen } from "./env"; // Import our event-driven runtime
import { stopWords, initSet, getSet } from "./stop"; // Import a dataset of stopwords

function abort(message: usize, fileName: usize, line: u32, column: u32): void {
  unreachable(); // needed for the AssemblyScript runtime
}

initSet(); // init our set of "stop words"
let set: Set<string> = getSet();

// TypeScript-like syntax for GPU programming!
function process_tweets(input: JSON.Obj): Uint8Array | null {
  let tweets: JSON.Arr | null = input.getArr("tweets");
  if (tweets != null) {
    let strTweets: string[] = tweets._arr.map<string>((val: JSON.Value): string => val.toString());
    // Split each tweet (tokenize)
    let tokenize: string[][] = strTweets.map<string[]>((tweet: string): string[] => tweet.split(" "));
    // Remove empty values and stop words
    let filtered: string[][] = tokenize.map<string[]>((arr: string[]): string[] =>
      arr.filter((word: string): bool => {
        if (set.has(word)) {
          return false;
        } else {
          return true;
        }
      }));
    // Get the array of hashtags for each tweet
    let hashtags: string[][] = filtered.map<string[]>((tweet: string[]): string[] =>
      tweet.filter((word: string): bool => {
        if (word.charAt(0) == '#' && word.charAt(1) != '') {
          return true;
        } else {
          return false;
        }
      }));
    let encoder = new JSONEncoder(); // encode a JSON response
    encoder.pushArray("tokenized");
    for (let tweet_idx = 0; tweet_idx < filtered.length; tweet_idx++) {
      ...}
    encoder.popArray();
    encoder.pushArray("hashtags");
    for (let tweet_idx = 0; tweet_idx < hashtags.length; tweet_idx++) {
      ...}
    encoder.popArray();
    let json: Uint8Array = encoder.serialize();
    return json;
  }
  // else we failed somehow...
  return null;
}
listen(1024*512, process_tweets); // Starts the event-loop and encapsulates serverless_invoke/serverless_response
```

Figure 13: Strings-AssemblyScript. Same as Strings and Strings-Go, but with different syntax. AssemblyScript provides support for incremental garbage collection.