Uzi Vishkin

Explicit multi-threading (XMT) bridging models for instruction parallelism (extended abstract)

This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms through architecture to implementation, is introduced; new elements are added where needed.

Recursive *-tree parallel data-structure

RECURSIVE *-TREE PARALLEL DATA-STRUCTURE (Extended abstract) Omer Berkman and Uzi Vishkin Tel A...

Methods in Parallel Algorithmics (Abstract)

Mathematical Foundations of Computer Science, Aug 24, 1992

Can Cooling Technology Save Many-Core Parallel Programming from Its Programming Woes?

This paper is advancing the following premise (henceforth, "vision"): that it is feasible to greatly enhance data movement in the short term, and do it in ways that would be both power efficient and pragmatic in the long term. The paper spells this premise out in greater detail: 1. it is feasible to build first generations of a variety of (power-inefficient) designs for which data movement will not be a restriction and begin application software development for them; 2. growing reliance on silicon compatible photonic technologies, and feasible advances in them with proper investment, will allow reduction of power consumption in these designs by several orders of magnitude; 3. successful high performance application software, the ease of programming demonstrated and growing adoption by customers, software vendors and programmers will incentivize (hardware vendor) investment in new application-software-compatible generations of these designs (a new "software spiral" a la former Intel CEO, Andy Grove) with further reduction of power consumption in each generation; 4. microfluidic cooling is instrumental for enabling item 1, as well as for midwifing this overall vision. The opening paragraph of the paper provides a preamble to that vision, the body of the paper supports it and the paragraph "Moore's-Law-type vision" summarizes it. The scope of the paper is a bit forward looking and it may not exactly fit any particular community. However, its new directions for interaction among architecture and programming may suggest new horizons for representing and exposing a greater variety of data and task parallelism.

Trade-Offs between Depth and Width in Parallel Computation

SIAM Journal on Computing, May 1, 1985

A new technique for proving lower bounds for parallel computation is introduced. This technique e...

Parallel Simulation of Many-core Processors: Integration of Research and Education

Facilitating a transition into ubiquitous parallel computing has been a strategic objective for computer science and engineering since its inception in the 1940s. A theory enthusiast, he has taken as the overriding theme of his work the use of theory to guide the rest of the field in addressing this strategic objective. Key components in his comprehensive plan include the very rich PRAM parallel algorithmic theory and a PRAM-on-Chip vision comprising the explicit multi-threaded (XMT) computer system framework he invented. The latter provides a powerful approach to multi-core architectures and, in particular, to the exponential increase in the number of cores in the roadmap of most vendors into the late 2010s. Recently, he has focused on the feedback loop between algorithms, their programming and implementation, and architecture, as well as their performance modeling, with an eye towards softer aspects such as ease of programming, teachability, and learnability.

A PRAM-on-Chip Vision (invited abstract)

String Processing and Information Retrieval, Sep 27, 2000

Granularity of Parallel Memories

Consider algorithms which are designed for shared memory models of parallel computation in which processors are allowed to have fairly unrestricted access patterns to the shared memory. General fast simulations of such algorithms by parallel machines in which the shared memory is organized in modules where only one cell of each module can be accessed at a time are proposed. The paper provides a comprehensive study of the problem. The solution involves three stages: (a) Before a simulation, randomly distribute the memory addresses among the memory modules. (b) Keep several copies of each address and assign memory requests of processors to the "right" copies at any time. (c) Satisfy these assigned memory requests according to specifications of the parallel machine.
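
The three stages can be illustrated with a toy sketch. This is not the paper's construction: the function names, the number of copies, and the greedy "least-loaded copy" rule standing in for choosing the "right" copy are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def place_copies(addresses, num_modules, copies, seed=0):
    """Stage (a), with the replication of stage (b): map every address to
    `copies` distinct, randomly chosen memory modules."""
    rng = random.Random(seed)
    return {a: rng.sample(range(num_modules), copies) for a in addresses}


def assign_requests(requests, placement):
    """Stage (b): route each (processor, address) request to one copy.

    Assumption: a greedy rule that picks the copy whose module currently
    has the fewest assigned requests. The paper's rule may differ.
    """
    load = defaultdict(int)
    assignment = {}
    for proc, addr in requests:
        module = min(placement[addr], key=lambda m: load[m])
        load[module] += 1
        assignment[proc] = module
    return assignment, load


def rounds_needed(load):
    """Stage (c): a module serves one cell per step, so the busiest
    module determines how many steps the batch of requests takes."""
    return max(load.values(), default=0)


if __name__ == "__main__":
    placement = place_copies(range(64), num_modules=8, copies=3, seed=1)
    requests = [(p, p) for p in range(16)]  # processor p reads address p
    assignment, load = assign_requests(requests, placement)
    print("steps to satisfy all requests:", rounds_needed(load))
```

Keeping several copies gives the router a choice, which is what lets it spread requests that would otherwise collide on one module.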

Randomized parallel speedups for list ranking

Journal of Parallel and Distributed Computing, Jun 1, 1987

Dynamic parallel memories

Information and control, Mar 1, 1983

increasing the running time by more than a constant factor is considered. A solution for a family...

Efficient parallel triconnectivity in logarithmic time

Springer eBooks, 1988

We present two new techniques for trimming a logarithmic factor from the running time of efficien...

Implementation of simultaneous memory address access in models that forbid it

Journal of Algorithms, Mar 1, 1983

Approximate and exact parallel scheduling with applications to list, tree and graph problems

We study two parallel scheduling problems and their use in designing parallel algorithms...

Linking parallel algorithmic thinking to many-core memory systems and speedups for boosted decision trees

Proceedings of the International Symposium on Memory Systems, Oct 1, 2018

The current focus of research on parallel computing takes current commercial hardware for granted. Here, we consider an alternative approach: start with a time-tested algorithmic theory and develop a supporting computer architecture and toolchain. This paper focuses on the hybrid memory architecture of this computer platform, which is designed to efficiently support execution of both serial and parallel code and switching between the two. A key part of this architecture is a flexible all-to-all interconnection network that connects processors to shared memory modules. To understand some recent advances in GPU memory architecture and how they relate to this hybrid memory architecture, we use microbenchmarks including list ranking. A second part of this work contrasts the scalability of applications with that of routines. In particular, regardless of the scalability needs of full applications, some routines may involve smaller problem sizes, and in particular smaller levels of parallelism, perhaps even serial. To see how a hybrid memory architecture can benefit such applications, we simulate a computer with such an architecture and demonstrate the potential for a speedup of 3.3X over NVIDIA's most powerful GPU to date on boosted decision trees, a timely machine learning application.

What to Do with All this Hardware? (Invited Lecture)

Lecture Notes in Computer Science, 2001

The upcoming so-called “on-chip Billion transistor” era raises the question: What to do with all the on-chip hardware once the returns on adding more on-chip memory start to diminish? Parallel computing has been a strategic area of growth for computer science since the 1940s. So far, parallel computing has affected mainstream computer science only in a limited way. The key problem with parallel computers has been their programmability. The parallel algorithms research community has developed a theory of parallel algorithms for a very simple parallel computation model, the so-called PRAM (for parallel random-access machine, or model). That theory appears to be second in magnitude only to serial algorithmics. However, the evolution of parallel computers never reached a situation where the PRAM algorithmic computation model offered an effective abstraction for them. So, this elegant algorithmic theory remained in the ivory towers of theorists. Not only has it not been matched with a real computer system; there has also hardly been an experimental study of what works better, more refined performance measurements, or a broad study of applications. For example, the general question “how good parallel algorithms can really be” has remained generally open. Explicit Multi-Threading (XMT) is a new fine-grained computation framework which tries to address the hardware opportunity using the PRAM parallel algorithmic knowledge base. XMT aims at faster single-task completion time by way of executing in parallel many instructions, all within a single chip. Building on some key ideas of parallel computing, XMT covers the spectrum from algorithms through architecture to implementation; the main implementation-related innovation in XMT was through the incorporation of low-overhead hardware mechanisms (for more effective fine-grained parallelism).
The two key research questions facing our “PRAM-on-chip vision” are: (i) “how to build?” an XMT computer, and (ii) “who cares?”; that is, what will be the key applications?

Recursive star-tree parallel data structure. Technical report

The model of parallel computation that is used in this paper is the concurrent-read concurrent-write (CRCW) parallel random access machine (PRAM). We assume that several processors may attempt to write at the same memory location only if they are seeking to write the same value (the so-called Common CRCW PRAM). We use the weakest Common CRCW PRAM model, in which only concurrent writes of the value one are allowed. Given two parallel algorithms for the same problem, one is more efficient than the other if: (1) primarily, its time-processor product is smaller, and (2) secondarily (but important), its parallel time is smaller. Optimal parallel algorithms are those with a linear time-processor product. A fully-parallel algorithm is a parallel algorithm that runs in constant time using an optimal number of processors. An almost fully-parallel algorithm is a parallel algorithm that runs in $\alpha(n)$ (the inverse of Ackermann function) time using an optimal number of processors. The notion of fully-parallel algorithm represents an ultimate theoretical goal for designers of parallel algorithms. Research on lower bounds for parallel computation (see references later) indicates that for nearly any interesting problem this goal is unachievable. These same results also preclude almost fully-parallel algorithms for the same problems. Therefore, any result that approaches this goal is somewhat surprising.
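
The two-level efficiency criterion stated in this abstract can be read as a lexicographic comparison: time-processor product first, parallel time second. The sketch below illustrates that reading only; the `Cost` class, the `more_efficient` function, and the sample figures are hypothetical and do not come from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    """Resource usage of one parallel algorithm on one input size."""
    time: int        # parallel running time (steps)
    processors: int  # number of processors used

    @property
    def work(self) -> int:
        """The time-processor product."""
        return self.time * self.processors


def more_efficient(a: Cost, b: Cost) -> bool:
    """True if `a` is more efficient than `b`: smaller time-processor
    product first, and on a tie, smaller parallel time."""
    return (a.work, a.time) < (b.work, b.time)


if __name__ == "__main__":
    # Hypothetical figures for an input of size n = 1024.
    slow_but_lean = Cost(time=8, processors=128)    # work 1024
    wasteful = Cost(time=8, processors=1024)        # work 8192
    fast_and_lean = Cost(time=1, processors=1024)   # work 1024
    print(more_efficient(slow_but_lean, wasteful))
    print(more_efficient(fast_and_lean, slow_but_lean))
```

Under this reading, an algorithm with the same time-processor product but a shorter parallel time wins the comparison, which is why a constant-time algorithm with linear work is the ultimate goal.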

On Choice of a Model of Parallel Computation

Recursive Star-Tree Parallel Data-Structure

This paper introduces a novel parallel data-structure, called recursive STAR-tree (denoted '*-tree'). For its definition, we use a generalization of the * functional [1]. Using recursion in the spirit of the inverse-Ackermann function, we derive recursive *-trees. The recursive *-tree data-structure leads to a new design paradigm for parallel algorithms. This paradigm allows for:
- Extremely fast parallel computations. Specifically, $O(\alpha(n))$ time (where $\alpha(n)$ is the inverse of Ackermann function) using an optimal number of processors on the (weakest) CRCW PRAM.
- These computations need only constant time, using an optimal number of processors, if the following non-standard assumption about the model of parallel computation is added to the CRCW PRAM: an extremely small number of processors can write simultaneously each into different bits of the same word.
Applications include: (1) A new algorithm for finding lowest common ancestors in trees which is considerably simpler than the known algorithms for the problem. (2) Restricted domain merging. (3) Parentheses matching. (4) A new parallel reducibility.
† The research of this author was supported by NSF grants CCR-8615337 and CCR-8906949 and ONR grant N00014-85-K-0046.
[1] Given a real function $f$, denote $f^{(1)}(n) = f(n)$ and $f^{(i)}(n) = f(f^{(i-1)}(n))$ for $i > 1$. The * functional maps $f$ into another function $*f$: $*f(n) = \min\{\, i \mid f^{(i)}(n) < 1 \,\}$. If this minimum does not exist then $*f(n) = \infty$.
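
The footnote's definition of the * functional can be tried out numerically. The sketch below is illustrative only and assumes the strict threshold as the footnote gives it (the first i whose i-fold application of f drops below 1). The function name `star` and the `limit` cutoff are hypothetical; the cutoff exists because a program cannot decide in general that the minimum does not exist.

```python
import math


def star(f, n, limit=10_000):
    """Return the least i >= 1 such that applying f to n i times gives a
    value below 1.

    Assumption: if no such i is found within `limit` applications, the
    minimum is treated as non-existent and math.inf is returned.
    """
    x = n
    for i in range(1, limit + 1):
        x = f(x)
        if x < 1:
            return i
    return math.inf


if __name__ == "__main__":
    # With f = log2, this counts how many logarithms bring n below 1.
    print(star(math.log2, 65536))
    # Halving reaches a value below 1 after about log2(n) + 1 steps.
    print(star(lambda x: x / 2, 8))
    # The identity never shrinks its argument, so no minimum is found.
    print(star(lambda x: x, 2, limit=100))
```

Applying the functional to a slowly growing function yields an even more slowly growing one, which is the mechanism behind the recursion "in the spirit of the inverse-Ackermann function" that the abstract describes.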

Is multicore hardware for general-purpose parallel processing broken?

Communications of The ACM, Apr 1, 2014

The current generation of general-purpose multicore hardware must be fixed to support more applic...

On Parallel Integer Merging

Information & Computation, Oct 1, 1993

Papers by Uzi Vishkin
