- AboutThis should describe the systems research collaboration, and present the overall research goals of the new group.
- PeopleHere are the different labs in the SRC…
- PublicationsA page where you will find categorized publications!
- ProjectsA page where you will find our projects
- Software
- News
- ResourcesVarious resources for prospective students, current students, alumni. Maybe put something here about life in NYC and at Columbia…
Publications from 2010
Comet: An active distributed key/value store
Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI),
October 2010
Abstract
Distributed key-value storage systems are widely used in corporations and across the Internet. Our research seeks to greatly expand the application space for key-value storage systems through application-specific customization. We designed and implemented Comet, an extensible, distributed key-value store. Each Comet node stores a collection of active storage objects (ASOs) that consist of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behavior. Handlers are written in a simple sandboxed extension language, providing properties of safety and isolation.
We implemented a Comet prototype for the Vuze DHT, deployed Comet nodes on Vuze from PlanetLab, and built and evaluated over a dozen Comet applications. Our experience demonstrates that simple, safe, and restricted extensibility can significantly increase the power and range of applications that can run on distributed active storage systems. This approach facilitates the sharing of a single storage system by applications with diverse needs, allowing them to reap the consolidation benefits inherent in today's massive clouds.
Stable Deterministic Multithreading through Schedule Memoization
Proceedings of the Ninth Symposium on Operating Systems Design and Implementation (OSDI '10),
October 2010
Abstract
A deterministic multithreading (DMT) system eliminates nondeterminism in thread scheduling, simplifying the development of multithreaded programs. However, existing DMT systems are unstable; they may force a program to (ad)venture into vastly different schedules even for slightly different inputs or execution environments, defeating many benefits of determinism. Moreover, few existing DMT systems work with server programs whose inputs arrive continuously and nondeterministically.
TERN is a stable DMT system. The key novelty in TERN is the idea of schedule memoization that memoizes past working schedules and reuses them on future inputs, making program behaviors stable across different inputs. A second novelty in TERN is the idea of windowing that extends schedule memoization to server programs by splitting continuous request streams into windows of requests. Our TERN implementation runs on Linux. It operates as user-space schedulers, requiring no changes to the OS and only a few lines of changes to the application programs. We evaluated TERN on a diverse set of 14 programs (e.g., Apache and MySQL) with real and synthetic workloads. Our results show that TERN is easy to use, makes programs more deterministic and stable, and has reasonable overhead.
Scalable and Systematic Detection of Buggy Inconsistencies in Source Code
Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '10),
October 2010
Abstract
Software developers often duplicate source code to replicate functionality. This practice can hinder the maintenance of a software project: bugs may arise when two identical code segments are edited inconsistently. This paper presents DejaVu, a highly scalable system for detecting these general syntactic inconsistency bugs. DejaVu operates in two phases. Given a target code base, a parallel inconsistent clone analysis first enumerates all groups of source code fragments that are similar but not identical. Next, an extensible buggy change analysis framework refines these results, separating each group of inconsistent fragments into a fine-grained set of inconsistent changes and classifying each as benign or buggy. On a 75+ million line pre-production commercial code base, DejaVu executed in under five hours and produced a report of over 8,000 potential bugs. Our analysis of a sizable random sample suggests with high likelihood that at this report contains at least 2,000 true bugs and 1,000 code smells. These bugs draw from a diverse class of software defects and are often simple to correct: syntactic inconsistencies both indicate problems and suggest solutions.
Bypassing Races in Live Applications with Execution Filters
Proceedings of the Ninth Symposium on Operating Systems Design and Implementation (OSDI '10),
October 2010
Abstract
Deployed multithreaded applications contain many races because these applications are difficult to write, test, and debug. Worse, the number of races in deployed applications may drastically increase due to the rise of multicore hardware and the immaturity of current race detectors.
LOOM is a "live-workaround" system designed to quickly and safely bypass application races at runtime. LOOM provides a flexible and safe language for developers to write execution filters that explicitly synchronize code. It then uses an evacuation algorithm to safely install the filters to live applications to avoid races. It reduces its performance overhead using hybrid instrumentation that combines static and dynamic instrumentation.
We evaluated LOOM on nine real races from a diverse set of six applications, including MySQL and Apache. Our results show that (1) LOOM can safely fix all evaluated races in a timely manner, thereby increasing application availability; (2) LOOM incurs little performance overhead; (3) LOOM scales well with the number of application threads; and (4) LOOM is easy to use.
CPU Scheduling with Automatic Interactivity and Dependency Detection
Haoqiang Zheng
Ph.D. Thesis, Department of Computer Science, Columbia University,
August 2010
Abstract
Recent trends in virtualization and server consolidation have expanded the number of appli- cations with different resource requirements and quality-of-service demands being run on the same system. Users expect computers not only to run these different applications, but to be able to run them all at the same time. A key challenge is how to ensure that the sys- tem provides acceptable interactive responsiveness to users while multiplexing resources among a diverse collection of applications. However, identifying interactive processes and scheduling them promptly are not easy because the latency sensitivity of a process in mod- ern computer systems may vary dynamically based on the user request it is processing and the user requests that depend on this process directly and indirectly. Most commod- ity operating systems either allow users to specify the latency sensitivity of a process or try to detect interactive latency-sensitive processes based on processor resource usage and sleeping behavior. These are generally ineffective. This dissertation introduces RSIO. It detects latency-sensitive processes by monitor- ing accesses to I/O channels and inferring when user interactions occur. RSIO provides a general mechanism for all user interactions, including direct interactions via local HCI devices such as mouse and keyboard, indirect interactions through middleware, and remote interactions through networks. It automatically and dynamically identifies processes in- volved in a user interaction and boosts their priorities at the time the interaction occurs to improve system response time. RSIO detects processes that directly handle a user in- teraction as well as those indirectly involved in processing the interaction, automatically accounting for dependencies and boosting their priorities accordingly. RSIO works with existing schedulers, processes that may mix interactive and batch activities, and requires no application modifications to identify periods of latency-sensitive application activity. Even when a process is detected as latency-sensitive and its priority is boosted, the process may still not be scheduled promptly because of a problem known as priority in- version, which happens when a high priority process blocks waiting for the response from a low priority process. Without knowing the dependency among the processes, the CPU scheduler may schedule a medium priority process to run, and thus effectively delay the execution of the high priority process. We have developed SWAP to address the prior- ity inversion problems caused by inter-process dependencies. SWAP can automatically determine possible resource dependencies among processes based on process system call history. Because some dependencies cannot be precisely determined, SWAP associates confidence levels with dependency information that are dynamically adjusted using feed- back from process blocking behavior. SWAP can schedule processes using this imprecise dependency information in a manner that is compatible with existing scheduling mecha- nisms and ensures that actual scheduling behavior corresponds to the desired scheduling policy in the presence of process dependencies. Our results show that SWAP can provide substantial improvements in system performance in scheduling processes with dependen- cies. As CPU schedulers are complicated to develop and increasingly important with the introduction of multi-core systems, we also introduce WARP, which is a new scheduler de- velopment and evaluation platform which facilitated our solutions. WARP is a trace-driven virtualized CPU scheduler execution environment that can dramatically simplify and speed the development and evaluation of CPU schedulers, including SWAP and RSIO. It is easy to use as it can run unmodified kernel scheduling code in user-space and can be used with standard user-space debugging and performance monitoring tools. We have implemented a WARP Linux prototype. Our results show that it can use application traces captured from its toolkit to accurately reflect the scheduling behavior of the real Linux operating system. Executing an application trace using WARP can be two orders of magnitude faster than running real applications.
Linux-CR: Transparent Application Checkpoint-Restart in Linux
Oren Laadan, Serge E. Hallyn
Proceedings of the 12th Annual Linux Symposium,
July 2010
Abstract
Application checkpoint-restart is the ability to save the state of a running application so that it can later resume its execution from the time of the checkpoint. Applica- tion checkpoint-restart provides many useful benefits in- cluding fault recovery, advanced resources sharing, dy- namic load balancing and improved service availability. For several years the Linux kernel has been gaining the necessary groundwork for such functionality, and now support for kernel based transparent checkpoint-restart is also maturing. In this paper we present the imple- mentation of Linux checkpoint-restart, which aims for inclusion in Linux mainline. We explain the usage model and describe the user interfaces and some key kernel in- terfaces. Finally, we present preliminary performance results of the implementation.
KVM for ARM
Proceedings of the 12th Annual Linux Symposium,
July 2010
Abstract
As ARM CPUs grow in performance and ubiquity across phones, netbooks, and embedded computers, providing virtualization support for ARM-based devices is increasingly important. We present KVM/ARM, a KVM-based virtualization solution for ARM-based devices that can run virtual machines with nearly unmodified operating systems. Because ARM is not virtualizable, KVM/ARM uses lightweight paravirtualization, a script-based method to automatically modify the source code of an operating system kernel to allow it to run in a virtual machine. Lightweight paravirtualization is architecture specific, but operating system independent. It is minimally intrusive, completely automated, and requires no knowledge or understanding of the guest operating system kernel code. By leveraging KVM, which is an intrinsic part of the Linux kernel, KVM/ARM's code base can be always kept in line with new kernel releases without additional maintenance costs, and can be easily included in most Linux distributions. We have implemented a KVM/ARM prototype based on the Linux kernel used in Google Android, and demonstrated its ability to successfully run nearly unmodified Linux guest operating systems.
Guaranteeing Performance through Fairness in Peer-to-Peer File-Sharing and Streaming Systems
Ph.D. Thesis, Department of Computer Science, Columbia University,
July 2010
Abstract
Over the past decade, Peer-to-Peer (P2P) file-sharing and streaming systems have evolved as a cheap and effective technology in distributing content to users. Guar- anteeing a level of performance in P2P systems is, therefore, of utmost importance. However, P2P file-sharing and streaming applications suffer from a fundamental prob- lem of unfairness, where many users have a tendency to free-ride by contributing little or no upload bandwidth while consuming much download bandwidth. By taking away an unfair share of resources, free-riders deteriorate the quality of service experienced by other users, by causing slower download times in P2P file-sharing networks and higher stream updates’ miss rates in P2P streaming networks. Previous attempts at addressing fair bandwidth allocation in P2P, such as BitTorrent-like systems, suf- fer from slow peer discovery, inaccurate predictions of neighboring peers’ bandwidth allocations, under-utilization of bandwidth, and complex parameter tuning. We present FairTorrent, a new deficit-based distributed algorithm that accurately rewards peers in accordance with their contribution in a file-sharing P2P system. In a nutshell, a FairTorrent peer uploads the next data block to a peer to whom it owes the most data. FairTorrent is resilient to exploitation by free-riders and strategic peers, is simple to implement, requires no bandwidth over-allocation, no prediction of peers’ rates, no centralized control, and no parameter tuning. We implemented FairTorrent in a BitTorrent client without modifications to the BitTorrent protocol, and evaluated its performance against other widely-used BitTorrent clients using various scenarios including live BitTorrent swarms. Our results show that FairTorrent provides up to two orders of magnitude better fairness, up to five times better download performance for contributing peers, and 60-100% better performance on average in live BitTorrent swarms. We show analytically that for a number of upload capacity distributions, in an n-node FairTorrent network, no peer is ever owed more than O(log n) data blocks with high probability. Achieving fair bandwidth allocation in a P2P streaming scenario is even more difficult, as it comes with an additional constraint: each stream update must be received before its playback deadline. P2P live streaming systems require global re- source over-provisioning to deliver adequate streaming performance. When there is not enough bandwidth to accommodate all users for a particular stream, such as due to free-riders or low-contributing peers, all users, including high-contributing peers, observe poor performance. We present FairStream, a new P2P streaming system that delivers a good quality stream to peers that upload data at a rate above the stream rate, even in the presence of free-riders or malicious users. FairStream achieves this with three mechanisms. First, it provides a new peer reply policy framework that enables file sharing incentive mechanisms to be adapted for streaming. Second, it uses this framework to incorporate a deficit-based peer reply policy that enables each peer to reply first to the neighbor to whom it owes the most data as measured by a deficit counter. Third, it introduces a collusion-resistant mechanism to ensure ef- fective data distribution of a stream despite a large fraction of free-riders who do not forward received data. We prove that FairStream is resilient to free-riders and rewards peers with streaming performance correlated with their contributions. We have also implemented FairStream as a BitTorrent client and evaluated its perfor- mance against other popular streaming systems. Our results on PlanetLab show that FairStream, similar to other systems, provides good quality streaming performance when resources are over-provisioned, but it also provides orders of magnitude better streaming performance for peers uploading above the stream rate when resources are constrained, in the presence of free-riders and low-contributing peers.
RSIO: Automatic User Interaction Detection and Scheduling
Haoqiang Zheng, Jason Nieh
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2010),
June 2010
Abstract
We present RSIO, a processor scheduling framework for im- proving the response time of latency-sensitive applications by monitoring accesses to I/O channels and inferring when user interactions occur. RSIO automatically identifies pro- cesses involved in a user interaction and boosts their prior- ities at the time the interaction occurs to improve system response time. RSIO also detects processes indirectly in- volved in processing an interaction, automatically account- ing for dependencies and boosting their priorities accord- ingly. RSIO works with existing schedulers and requires no application modifications to identify periods of latency- sensitive application activity. We have implemented RSIO in Linux and measured its effectiveness on microbenchmarks and real applications. Our results show that RSIO is easy to use and can provide substantial improvements in system performance for latency-sensitive applications.
Transparent, Lightweight Application Execution Replay on Commodity Multiprocessor Operating Systems
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2010),
June 2010
Abstract
We present SCRIBE, the first system to provide transparent, low- overhead application record-replay and the ability to go live from replayed execution. SCRIBE introduces new lightweight operat- ing system mechanisms, rendezvous and sync points, to efficiently record nondeterministic interactions such as related system calls, signals, and shared memory accesses. Rendezvous points make a partial ordering of execution based on system call dependen- cies sufficient for replay, avoiding the recording overhead of main- taining an exact execution ordering. Sync points convert asyn- chronous interactions that can occur at arbitrary times into syn- chronous events that are much easier to record and replay. We have implemented SCRIBE without changing, relinking, or re- compiling applications, libraries, or operating system kernels, and without any specialized hardware support such as hardware perfor- mance counters. It works on commodity Linux operating systems, and commodity multi-core and multiprocessor hardware. Our re- sults show for the first time that an operating system mechanism can correctly and transparently record and replay multi-process and multi-threaded applications on commodity multiprocessors. SCRIBE recording overhead is less than 2.5% for server applications includ- ing Apache and MySQL, and less than 15% for desktop applica- tions including Firefox, Acrobat, OpenOffice, parallel kernel com- pilation, and movie playback.