Research Completed or in Progress with Duane Bailey

Williams College provides students an opportunity to work on research with a faculty member, often as part of an honors thesis project. These theses were completed with Duane Bailey, and are available through interlibrary loan from Williams College.

Jump to author: Becker, [Bernheim] Brush, Burbage, de Carvalho, Chaffin, [Minnick] Cornett, Cox, Cyll, Effinger-Dean, Grassia, Healy, Hirshman, Kang, Leonard, Lindsey, Liu, McLaughry, Munson, Rubin, Sandys, Templeton '16, Templeton '20, Timilsina, Trepte, Zhu

The concealed inner workings of today's memory hierarchies lead to surprising side effects, in security and performance, for otherwise simple programs on general-purpose machines. For situations where this is untenable, developers may prefer access to an alternative memory system where software has control over the movement of its data. To explore this concept, we designed and implemented the Scratchpad Accelerator, a configurable, associative scratchpad memory for the RISC-V Rocket Chip. We evaluated the security and performance of the device after adding it to lowRISC, a version of the Rocket Chip that has been programmed to an FPGA board and runs full-stack Linux. We show that the Scratchpad Accelerator provides data isolation and uniform memory access, providing a protected, predictable memory system for developers. Additionally, its associativity and structure inspire effective memory models for common data structures in software development. The general-purpose memory acceleration provided by the Scratchpad Accelerator endows software with the important ability to gain freedom and control over its data.

PDF

With the power utilization wall imposed on modern microprocessors, architectural research has turned to the use of specialized hardware as a source of continued performance scaling. As a result, System-on-a-Chip (SoC) architectures containing hardware accelerators have gained signficant traction over the past years. Unfortunately, each accelerator has a specific set of tasks it can execute, and accelerator-based systems are often incapable of out-performaning general-purpose processors outside their target application domains. In this work, we propose SOAR, a reconfiguratble architecture that extends the range of hardware acceleration by adaptively tailoring its configuration of acceleartators to the running workload. The SOAR hardware is built as a 64-bit RISC-V SoC with a collection of RocCC accelerators attached to a scalar, in-order Rocket core. Evaluation of the design implemented on a Xiling Artix-7 FPGA under RSIC-V Linux demonstrates that our RoCC accelerators can return speedups up to 10x and energy-efficiency improvements up to 170x over equivalent software implementations. Along with SOAR hardware, we also introduce a software infrastructure that infers dynamic power demands and acceleartor utilization frequencies to determine the optimal configuration of on-chip accelerators. We propose the use of dynamic dispatch for switching function execution between using hardware and using software library calls. Finally, we explain how our proposed framework can be customized to enhance either energy-efficiency or performance, or any arbitrary optimization goal.

PDF

Word embeddings are a powerful natural language processing technique, but they are extremely difficult to interpret. In order to create more interpretable word embeddings, we transform pretrained dense word embeddings into sparse embeddings. We construct these embeddings through sparse coding. By inherently interpretable, we mean a system where each dimension is mechanically associated with some hint that can capture the meaning of that dimension. We show that models trained using our sparse embeddings can achieve good performance and are interpretable.

PDF

Field Programmable Gate Arrays (FPGAs) are hardware circuits that can be programmed to perform any digital circuit functionality. This flexibility makes them a useful tool in improving system and application throughput and energy efficiency. In this way, FPGAs can be part of the solution in rethinking our current model of computation to better support high performance computing and avoid Dark Silicon. In this work, we attempt to improve FPGA accessibility and offer motivation for a more widespread adoption of thise circuits by building on previous efforts in this area, with special emphasis to the Xilinx PYNQ FPGA+CPU development board. We explain in detail how to generate a working FPGA circuit for the PYNQ board from a software-specified function using the Xilinx Vivado Design Suite. We offer advice on efficient FPGA design, and demonstrate the acceleration potential of FPGAs: we find that an FPGA-implemented function runs up to 120x faster than its Python counterpart. We also demonstrate the potential of FPGAs to implement conservation cores by explaining the design of an FPGA circuit that supports five simultaneous cores that can be switched at runtime.

PDF

We define rules for cellular automata played on quasiperiodic tilings of the plane arising from the multigrid method in such a way that these cellular automata are isomorphic to Conway's Game of Life. Although these tilings are nonperiodic, determining the next state of each tile is a local computation, requiring only knowledge of the local structure of the tiling and the states of finitely many nearby tiles. As an example, we show a version of a "glider" moving through a region of a Penrose tiling. This constitutes a potential theoretical framework for a method of executing computations in non-periodically structured substrates such as quasicrystals.

PDF@arXiv:1708.09301

We propose the creation of cells within a modified version of Hutton's enzyme artificial chemistry. First, we define some types of atoms as active, and some as inactive. This allows us to create genetic material distinct from enzymes. Different types of atoms, which were originally used to encode genetic information, can now be used to encode the target of the enzymes: inside the cell, outside the cell, or in the membrane.

We also increase the number of reactions stored per atom. This sharply cuts the number of genes needed and, therefore, the computation required. This change also opens up possibilities for horizontal gene transfer between organisms. A single atom, analogous to a plasmid of bacteria, which contains all enzymes needed for a behavior, can be passed as a unit between organisms. Similarly, we can construct malicious viruses and other organisms of this sort.

Mutation is fundamental to evolution, and the mechanism of mutation needs to be both powerful and directed. The new genetic mechanisms allow us to tightly control mutation when reaction data is copied. The exact mechanism of mutation is well suited to experimentation, but there are some factors to consider. Each field of the reaction can have a specific chance of mutation. Some prediction of useful reactions may also be helpful: for example, the value for post-reaction states can frequently mutate to the pre-reaction state. Another potentially useful factor is the ability to inactivate a reaction: this allows for increased genetic variability through silent mutations, similar to unexpressed biological genes.

With these modifications, we hope to create a vibrant virtual ecosystem. This system allows many sorts of competition, information transfer, predation and parasitism. Combining these elements together will create a system that models with some complexity the interactions of real organisms.

PDF

In this work, we exploit these advantages of FPGAs to design an architecture that uses general purpose processor tightly coupled with a vector processor implemented on an FPGA. Previous work by Leonard demonstrated possible energy efficiency with FPGA coprocessor design. We extend his work to design a power sensitive vector processor. We believe that an FPGA-based vector processor will achieve energy savings comparable to conservation cores. While FPGA-based design seems promising, the engineering work remains a prohibitive obstacle to success. In previous works, one of the main challenges has been the usability of the software supporting the FPGAs. We aim to create an architecture that can run and synthesize circuits on FPGA without relying on traditional workflow.

PDF

This work would pursue a deeper exploration of irregular computing CAs. We plan on conducting a series of experiments on a spectrum of automata with irregular grids and connectivities, quantifying their behavior and comparing them to traditional CAs. As computation appears to emerge at a point of criticality somewhere in between the ordered and chaotic automata and Langton's $\Lambda$ parameter, representing the relative order of a CA, is used to parametrize the possible automata space. $\Lambda$ is important because it is an indicator of the class of CAs that have the ability to compute. However, this measure is tuned specifically for static neighborhood definitions and uniform grids. With the experimentation on irregular automata, the goal is to explore and develop a more general notion of $\Lambda$ and other metrics so that we can better understand and perhaps even quantify the conditions in which computation can emerge in noisy systems. Another important question we hope to address with this work is whether there is a qualitative difference between regular and irregular grid patterns and connectivities: are there cases where uniform CA models are sufficient for representing biological systems?

We believe that the study of CA behavior in irregular environments is critical to achieve a greater understanding of how biological systems combat imperfections. Ultimately, the contribution of work on natural computational systems is twofold: not only can we achieve a better understanding of how some biological processes operate, but knowledge of how these systems work can inspire alternative computing methods. The hope is to illuminate how nature is able to perform complex computation in noisy environments and apply these lessons to advance future computing models.

PDF

After four decades of exponential improvement, computing performance has hit a wall. Power constraints and physical limitations have led to the rise of dark silicon: most modern processors can run only 3\% of their transistors at maximum frequency. Pairing reconfig- urable hardware with a general purpose processor (GPP) to support partnered computa- tion can yield performance and efficiency improvements. Because software processors waste roughly 90\% of their energy on instruction interpretation, customized hardware can signif- icantly decrease the energy utilization of targeted computations. In this work, we explore the possibility of using a GPP tightly coupled to a dynamically reconfigurable co-processor to provide efficient, general purpose, partnered computation.

PDF

HYDRA: A CUDA-Based Tool Supporting Error-Resilient GPU Computation and General Purpose Heterogeneous Programming

The large-scale supercomputers popular in high performance computing require support for error-resiliency to protect against soft errors. We present Hydra, a scalable, CUDA-based tool supporting reslience, as well as programming tools that leverage that framework for other heterogeneous programming tasks. Hydra executes CUDA kernels redundantly on both CPU and GPU to leverage the parallelism of the heterogeneous architecture. To support redundancy with minimal additional programmer effort, Hydra provides wrappers for the CUDA functions that transfer data between GPU and CPU. Our results indicate that Hydra redundancy can be an efficient model for error-resilience in heterogeneous architectures, as well as a useful tool in heterogeneous programming.

PDF

Communication graphs induced by parallel programs can be overwhelming in size, but are highly structured. For example, trees, grids, and n-cubes are common. In order to understand program behavior, we seek informative graph visualizations that reduce visual complexity. We leverage the concept of ellipses (...) , an intuitive symbol that abstracts redundant structure.

Using graph grammars coupled with specific procedural methods, we create large, structured graphs of parallel program communication. These graphs that have prac- tical realizations in two dimensions can be visually simplified using an algorithm that finds contextual balls around important nodes. Graphs that are harder to visualize in two-space are more difficult to simplify. This work presents metrics and layout tech- niques that may aid in the visualization of such graphs. We evaluate these metrics by developing an ideal for want to see in abstracted visualizations of large, structured graphs. The metrics and layout techniques are implemented in Gephi, an open-source large graph visualization tool.

Our large graph visualizations are important tools for helping programmers understand interprocess communications of their parallel programs. This work establishes some prototypes for visualization that in the future may harnessed in the design of effective parallel programming environments.

PDF

Nonperiodic tilings of the plane exhibit no translational symmetry. Penrose tilings are a remarkable class of nonperiodic tilings for which the set of prototiles consists of just two shapes. The pentagrid method, introduced by N.G. de Bruijn, allows us to generate Penrose tilings by taking a slice of the integer lattice in five-dimensional space. The empire problem asks: Given a subset of a Penrose tiling, what tiles appear in all tilings that include that sub- set? We present a new approach to the empire problem that uses the pentagrid method to identify elements of the empire.

PDF

As yet, there is no formal science of designing virtual machines. Instead, virtual machine design has been ad hoc, a reaction to the current mismatch between hardware and software. This thesis develops a specification for a thin, general-purpose virtual machine that is sensitive to the needs of programmers as well as hardware designers. Some readers may consider this thesis to be a blueprint for a novel target virtual machine. Others may see this specification as a model for future hardware. By studying the ++VM virtual machine, introduced in later chapters, this thesis examines the strengths and establishes goals of virtual machine design.

PDF

Today, almost all software uses dynamic memory allocation. In some programming languages, it is impossible to even write "Hello World!" without implicitly calling the allocator. However, the costs of dynamic memory allocation on cache performance are relatively unstudied. This thesis explores the e ects of dynamic memory allocation on cache performance, as well as methods to make memory allocators cache-conscious.

We use profling techniques to build custom memory allocators. These allocators contain statically generated memory layouts that arrange memory in a cache-conscious manner for previously analyzed workloads. Using a variety of techniques to build memory layouts, we demonstrate that such layouts can dramatically a ect program speed and cache miss rate. We nd that some layout strategies speed up certain programs by as much as 28% while others cause certain programs to run up to 2.6 times slower. Additionally, we provide an open source foundation which is fully compatible with all standard C allocation functions in order to aid future research.

PDF

In 1996, Gummelt identified a single decagon that covering the plane aperiodically. The covering allowed tiels to overlap in five different ways. In this work we develop a partitioning of the regions of the decagon such that points in different partitions never collide in an overlap. Such a partioning is achieved by a special coloring process based on infinite decomposition of the Robinson tiles in a Canto-like construction. We show that the tile has positive measure everywhere and that the final tiling is aperiodic and dense in the plane.

PDF

A related paper, by Bailey and Zhu, PDF.

Most of the research done to improve the performance of Java Virtual Machines (JVM's) has focused on software implementations of the JVM specification. Very little consideration has been given to how Java programs interact with hardware resources and how hardware components can be used to improve Java performance. We generated and analyzed opcode and memory access trace for eight different Java benchmarks, finding that each Java opcode typically causes multiple memory accesses. We next investigated the effectiveness of unified and split caches at caching Java programs, using the performance of traditional compiled programs as a basis of comparison.

Motivated by the subdivision of memmory by the JVM specification into heap memory, constant pool data, and operand stacks, we examined the possible benefits of adding small specialized caches to hold constant pool data and stack data. We found that constant pool accesses have very low locaality of reference and that the addition of a constant cache is detrimental to cache performance for large main caches. The 24% of memory access made to operand stacks were found to be efficiently cached in a 64 byte cache. We simulated the cache performance of Java programs in the presence of registers by adding a register cache to unified and split caches. We found that given fair hardware resources, Java programs cached at least as well as traditional programs.

PDF

Penrose tiles are infinite tilings that cannot tile the plane in a periodic manner. They are of interest to researchers because they may model a new type of matter called \emph{quasicrystals}. Being able to easily generate and manipulate Penrose tiles makes them easier to study, and that is the primary goal of this research. Although Penrose tiles are infinite, we would like to be able to store them in a finite device, such as a computer, so that they may be more easily studied. The primary aim of this research is to attempt to discover some of the underlying structure of Penrose tiles, so that the tilings can be reduced and stored in an efficient, finite manner.

In addition to being able to store tilings, this research also aims to learn more about constructing the tilings themselves. Because we are reducing an infinite amount of data to a finite amount of information, there will necessarily be some computation required to reconstruct the information that is not explicitly stored. This research will investigate methods of constructing Penrose tiles from a small amount of initial information.

The algorithms developed and discussed in this work allow us to represent Penrose tiles as a small set of finite variables and then compute any arbitrary information about the tiling on demand. In this way, we can reconstruct as much of any tiling that we desire, without needing to store the entire tiling. It is hoped that the methods used will allow more detailed study of Penrose tiles to take place in future research.

PDF

(Jason's more recent work on generating empires is found here.)

In 1989, Danzer presented a set of four tetrahedra which, along with their mirror images, form aperiodic tilings of space when assembled according to certain matching rules. Though not the first known aperiodic set in three dimensions, the Danzer tetrahedra are attractive for a number of reasons, including several similarities to the two-dimensional Penrose tiles and Robinson triangles.

We present an expanded version of Danzer's paper[Danzer89], giving explanations and proofs and removing some of the notation to provide a more readable and comprehensible exposition of these tiles and their basic properties.

The tetrahedra in a Danzer tiling always meet face-to-face and vertex-to-vertex. Thus the next fundamental unit of the tiling beyond the tiles themselves is the vertex configuration, a ball of tiles clustered about and sharing a central vertex. Danzer's paper states that there are exactly 27 vertex configurations. We give a practical algorithm to compute a complete ``atlas'' of vertex configurations, and show that there are, in fact, 174.

Every patch in a Danzer tiling is duplicated an infinite number of times in every tiling, and the distance to the nearest identical copy is within a bounded multiple $\gamma$ of the radius of the patch. Using a computational approach, we show how to place a bound on $\gamma$.

PDF

An aperiodic tiling is one which uses a finite set of prototiles to tile space such that there is no translational symmetry within any one tiling. These tilings are of interest to researchers in the natural sciences because they can be used to model and understand quasicrystals, a recently-discovered type of matter which bridges the gap between glass, which has no regular structure, and crystals, which demonstrate translational symmetry and certain rotational symmetries. The connection between these two areas of research helps to create a practical motivation for the study of aperiodic tilings.

We plan to study both theoretical and applied aspects of several types of aperiodic structures, in one, two, and three dimensions. Penrose tilings are tilings of the plane which use sets of two prototiles (either kites and darts, or thin rhombs and thick rhombs) whose edges are marked in order to indicate allowed arrangements of the tiles. There is a small number of legal vertex configurations, and most of these force the placement of other tiles which may or may not be contiguous with the original ones. It is somewhat intuitive that in most cases the placement of a group of tiles forces the placement of some set of adjacent tiles. It is much more difficult to believe that many groups of tiles also force the placement of infinitely many non-adjacent tiles. This phenomenon has implications which may be important in designing and implementing an efficient data structure to explore and store information about aperiodic tilings.

For the most part, we plan to study forced tiles in two-dimensions, although we believe that this theory may be generalized to higher dimensions. We will use musical sequences and Ammann bars, concepts established by Conway and Ammann, to develop an algorithm for predicting the placement of fixed bars given an initial sequence of parallel bars. This, in turn, may be used to fix the positions of tiles and clusters of tiles within any legal tiling. Our goal is to determine a method for predicting the position of infinitely many forced tiles given an initial cluster of tiles. These descriptions will be partly grammar-based inflations of musical sequences, and partly algebraic solutions, and will themselves be aperiodic.

PDF

This work focuses on realities of implementing an abstract datatype that traverses infinite aperiodic tilings of the plane and three-space. The approach is to develop a generalized mechanism for specifying finite sets of prototiles that tile space nonperiodically. When the tiling exhibits an invertible inflation property, tiles may be addressed in a way that supports arbitrary, possibly oriented, wanderings through space. Techniques are presented that can be used to discover lines of symmetry in a tiling and to construct an atlas of its local configurations. Also developed are a number of theoretical results about this class of tilings, including a concise statement of a local isomorphism theorem. This pretty result suggests that any finite portion of a specific tiling by prototiles appears infinitely often in all other tilings by the same set of tiles.

Our analysis is inspired by the index sequences that are commonly used to uniquely specify tilings. An index sequence identifies a tiling about a point by locating the point within an infinite hierarchy of self-similar tilings known as inflations. Our work builds an addressing system with index sequences of finite rather than infinite length and uses addresses to represent the dual graph of a patch of tiles. An $n$-digit address identifies a tile within n inflations of some prototile. We prove that any tiling represented in our form can tile the plane or space and must do so nonperiodically. An algorithm is constructed to find the addresses neighboring any given tile. Appending digits to an address extends the currently represented region, placing it within a larger patch to simulate the exploration of an infinite expanse of tiles. The information maintained grows as needed and the number of digits stored is logarithmic in the number of tiles in the patch represented. Our system could be used to investigate very large quasiperiodic tilings and the principles behind the data structures illuminate many beautiful properties of the tilings themselves.

PDF

Many of the performance features of parallel programs, including massive concurrency and nondeterminism, make debugging difficult. To study this issue we have continued to extend Linder, a public domain programming environment supporting Gelernter's tuple space programming model.

The first section of this thesis develops protocols to support effective distribution of tuples among disparate processes. This distribution has potential performance advantages, but also complicates the coordination of processes.

In the second section, we present protocols that support centralized debugging of tuple space programs. The Linder Tuple Space Debugger (LDB) is a textually-based interactive debugging environment. LDP allows users to set breakpoints on tuples, query the tuple space, and examine the communication history of the system. Users can graphically animate their program's history at breakpoints and after execution using ORNL's ParaGraph, a graphical visualization system. With its unique mixture of history information and interactive control we believe novice users will find LDB a conductive debugging environment.

TS++: Communication Specification using Path Expressions in the Tuple Space Programming Model

The tuple space model provides a simple, flexible means of specifying interprocess communication for parallel programming. However, this flexibility can result in incorrect programs, that are then extremely difficult to debug. We discuss a few extensions to the tuple space model, that will permit the programmer to specify information about the indented communication structure of a program. This will reduce the chance of error and simplify debugging tasks, without greatly changing the paradigm.

Current technologies for programming massively parallel machines do not provide the programmer with the appropriate abstractions for communication. The Canister System was developed to allow programmers to compose point-to-point messages into communication patterns, called itineraries. While programmers may find this abstraction more appropriate for many algorithms, itineraries suffer from the fact that they must be statically determined at compile time.

This work presents an extension to itinerary-based specification in the Canister System that allows the programmer to specify dynamic process instantiation and itinerary extension in a controlled manner. This mechanism, while not universal, allows programmers to specify more algorithms safely using a more appropriate level of communication abstraction.

Program Composition Notation (PCN), a language for parallel programming, provides a precise, compact, and efficient paradigm for parallel programming. However, several aspects of programming in PCN are nonintuitive, even to experienced programmers. We believe that a visual programming language that draws on PCN as its paradigm can address PCN's non-intuitiveness and exploit graphical methods to make realization of communication channels and algorithms much easier.

This research focuses on the specification and implementation of a new graphical parallel programming language (GraPPLe). Key considerations in the design include: the exploitation of graphics to evince the fact that parallel programming is actually a necessary first step in sequential programming, the goal of providing a framework that maximizes implicit parallelism, and the interchangeability of visual and textual forms of expression. Our hope is that, due to the strong correspondence between GraPPLe and its underlying textual language, the system will be of use as a teaching aid, as well as a viable development tool for parallel programmers.

This thesis is concerned with the animation of algorithms for tightly coupled multiple-instruction, multiple-data computers for the purpose of visualizing the communication patterns which occur during execution of the algorithm. In order to understand how a tightly coupled algorithm operates it is essential to understand the interprocess communication which occurs during execution. An animation of the algorithm can provide a global view of these patterns in a form easily comprehended by the user. There have been recent attempts to organize the record of communication into a hierarchy of abstract user-defined communication events to increase the viewer's ability to comprehend the display. However these systems contain a great deal of complexity due to their attempt to describe the algorithm at a global level. The nondeterminism inherent in parallel programs makes the specification of a global event hierarchy very difficult. This thesis proposes an alternative paradigm for specifying the event hierarchy, attempting to reduce the complexity of the specification by working on a per-process basis, creating a global view of the algorithm only after the high level events have been recognized.

Research and Theses Completed with Duane Bailey