Grapevine - An Internet Distributed Computing Framework
Developers: Kunal Agrawal, Huey Ting Ang, Guoliang Li, Kevin Chu
Overview
Tapping into the unused resources of volunteer computers connected to the internet is not a new idea. Indeed, there are currently many active efforts in this area (e.g. the SETI project, distributed.net, Parabon). While many early distributed computing efforts targeted specific applications, more recent work has focused on the development of general distributed computing infrastructure. Leaders in this field include Parabon Computation, United Devices, and the Globus Project. Each of these efforts has developed a software infrastructure that supports distributed computing.
For our class project, we propose to explore some fundamental and practical issues in distributed computing on volunteered nodes connected to the internet by designing and implementing a simple distributed computing infrastructure. To gain an appreciation for the types of computing applications that are suited for a distributed computing environment, we also intend to develop one application to be deployed on the infrastructure.
Thus the project will consist of essentially two parts.
- Development of the distributed computing infrastructure.
- Development of an application for this environment.
Part I: Design of the Infrastructure
To facilitate the design process, we plan to study the designs of the Parabon Computation, United Devices, and Globus Project infrastructures. A few key features that make this project interesting are:
- In contrast to a traditional parallel computer, the application cannot be designed to maximize utilization of the computing resources, because the machines that the application runs on are volunteered (as opposed to dedicated). Exploring what types of applications fit this model of computing is important given the trend towards distributed computing as a viable alternative to dedicated massively-parallel computing.
- Issues of communication vs. computation, node reliability, load balancing, etc. involved in traditional parallel computing will be amplified in a distributed computing environment deployed over the internet. A good understanding of the implications of these issues in the context of "internet supercomputing" is necessary if supercomputing is to progress beyond the economic bounds of dedicated machinery.
- It's free! If we design it right (security, etc.), it could serve as a free distributed computing infrastructure for personal use. For example, this would allow poor graduate students (who can't afford the costs of commercial systems) to gain experience in distributed, internet supercomputing.
Preliminary Infrastructure Design Ideas
There will be two primary types of nodes in the infrastructure:
- Volunteer/Worker node - a daemon program running on these nodes awaits "program fragments" to run, sent from a dispatch node.
- Dispatch node - a daemon program running on these nodes receives jobs that are submitted by users. Based on some load balancing algorithm, the dispatch node sends the "program fragments" out to the volunteer nodes to be run.
A simple model for the "program fragments" might be that they are serialized objects that implement some Java interface that we define. As a possibility, we could require that users implement a class that extends the Thread class and implements the Serializable interface plus a few of our own methods. Then, when the job is submitted, the dispatcher could serialize the user class and send it over the network to the worker daemon, which would deserialize the object and run the thread.
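As a rough illustration of this model, the sketch below shows what a user-supplied "program fragment" might look like. The GrapevineTask interface, its getResult() method, and the SumTask example are placeholders we have invented for illustration; the real interface definition is one of the issues listed below.

import java.io.Serializable;

// Hypothetical interface for a Grapevine "program fragment"; the name and
// method are placeholders invented for illustration, not an existing API.
interface GrapevineTask {
    Serializable getResult();   // result shipped back to the dispatcher
}

// Example user class: sums a range of integers as a stand-in for real work.
public class SumTask extends Thread implements Serializable, GrapevineTask {
    private final long lo, hi;  // range to sum, supplied at job-submission time
    private long sum;

    public SumTask(long lo, long hi) {
        this.lo = lo;
        this.hi = hi;
    }

    // The worker daemon calls start() after deserializing the object.
    public void run() {
        for (long i = lo; i <= hi; i++) {
            sum += i;
        }
    }

    // Called by the worker daemon once the thread has finished.
    public Serializable getResult() {
        return Long.valueOf(sum);
    }
}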
Some immediate issues to work out include:
- Definition of the user class interfaces.
- Work out the details of how the dispatcher and volunteer nodes operate (a rough sketch of the volunteer side follows this list).
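A minimal sketch of the volunteer-side daemon, assuming the hypothetical GrapevineTask interface above and a made-up port number; the real protocol is still to be worked out.

import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Skeleton of the volunteer-node daemon: accept a connection from the
// dispatcher, deserialize the "program fragment", run it as a thread, and
// send the result back. The port number and wire format are assumptions.
public class WorkerDaemon {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(9099);   // placeholder port
        while (true) {
            try (Socket conn = server.accept();
                 ObjectOutputStream out = new ObjectOutputStream(conn.getOutputStream());
                 ObjectInputStream in = new ObjectInputStream(conn.getInputStream())) {

                Object fragment = in.readObject();

                // The fragment extends Thread, so we can simply run it.
                Thread task = (Thread) fragment;
                task.start();
                task.join();

                // Ship the result back to the dispatcher.
                out.writeObject(((GrapevineTask) fragment).getResult());
                out.flush();
            } catch (Exception e) {
                e.printStackTrace();   // in a real daemon: log and keep serving
            }
        }
    }
}

Note that plain Java serialization ships only object state, not bytecode, so a real worker would also need a way to load the user's class (e.g. a custom class loader); this is one of the details to be worked out.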
Part II: Distributed Application
Machine Learning Classifiers
Machine learning techniques have a wide range of practical applications in fields like data mining, information retrieval, image processing, bioinformatics, stock prediction, etc. A machine learning application typically involves finding inherent patterns in massive data sets that may not be interpretable by humans. We are referring here to supervised learning algorithms, which involve a training phase and a testing phase. In the training phase, we are concerned with the selection of training examples and the model used to represent the data. Some of the widely used models are listed below (a sketch of a common interface for them follows the list):
- Decision Trees (e.g. C5)
- Neural Networks
- Bayesian Learning
- Support Vector Machine
- Maximum Entropy
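Whichever model is chosen, the framework described next only needs a common training/prediction interface. A minimal sketch of such an interface follows; the Learner name, its methods, and the training-file convention are placeholders of our own, not an existing API.

import java.io.File;
import java.io.Serializable;

// Hypothetical common interface that any of the above learners (decision
// tree, neural network, Bayesian learner, SVM, maximum entropy, ...) would
// implement so the framework can use them interchangeably.
public interface Learner extends Serializable {
    // Train on examples stored in the agreed-upon training-file format.
    void train(File trainingFile);

    // Predict a class label for a single unseen example.
    String classify(String example);
}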
Task Definition: Mechanism for Bagging and Boosting
We would like to build a general framework to support Bagging and Boosting in a parallel distributed computing environment. Bagging is inherently parallelizable, and boosting can be adapted to work in a parallel way. Our mechanism should support any model that adheres to our specified input/output format for training files.
Problems
- In a distributed environment across a network, there is a need to cater for packets "lost" in the dispatching process. That is, we should send out more jobs than are required and compute the final results from whatever results are received within a specified timeframe (see the sketch after this list).
- Depending on the size of each subset and on the application, we may need to send voluminous amounts of data.
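The sketch below illustrates how the bagging side might be driven from the dispatch node, including the over-dispatching and deadline idea from the first problem above. It is only a local simulation: a thread pool stands in for the volunteer nodes, and the training jobs, committee size, and deadline values are made up.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch of the bagging driver: build bootstrap samples, dispatch more
// training jobs than strictly needed, and keep whatever models come back
// before a deadline.
public class BaggingDriver {
    public static void main(String[] args) throws Exception {
        List<String> data = Collections.nCopies(1000, "example");  // placeholder training data
        int committeeSize = 10;   // number of trained models we actually need
        int dispatched = 13;      // send extra jobs to tolerate lost/slow ones
        long deadlineMs = 5000;   // collect results for at most 5 seconds

        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<String> results = new ExecutorCompletionService<>(pool);
        Random rng = new Random();

        for (int j = 0; j < dispatched; j++) {
            // Bootstrap sample: draw |data| examples with replacement.
            List<String> sample = new ArrayList<>();
            for (int i = 0; i < data.size(); i++) {
                sample.add(data.get(rng.nextInt(data.size())));
            }
            final int jobId = j;
            results.submit(() -> {
                // In the real system this would be a serialized training task
                // sent to a volunteer node; here we just pretend to train.
                Thread.sleep(100);
                return "model-" + jobId + " (" + sample.size() + " examples)";
            });
        }

        // Keep whatever finishes before the deadline, up to the committee size.
        List<String> committee = new ArrayList<>();
        long stop = System.currentTimeMillis() + deadlineMs;
        while (committee.size() < committeeSize && System.currentTimeMillis() < stop) {
            Future<String> done = results.poll(stop - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
            if (done == null) {
                break;   // deadline reached with fewer results than hoped for
            }
            committee.add(done.get());
        }
        pool.shutdownNow();
        System.out.println("Trained a committee of " + committee.size() + " models");
    }
}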
Demonstration of Result - Text Categorization
Imagine building a text categorization system (as in Yahoo, but covering not only websites but all documents residing on the machines in the environment). On each machine there is an individual classifier that trains on the documents residing on that machine. In boosting mode, it can also send the documents that it classified incorrectly to other machines for training. The classifier on each machine needs to be retrained after the number of documents increases by a certain amount. A committee-based classifier will be built from these classifiers using a voting mechanism. Different machine learning algorithms involve different costs for training and testing. We will parallelize the training phase, the testing phase, or both, subject to time constraints, feasibility, and usefulness.
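A minimal sketch of the committee-based classifier, assuming the hypothetical Learner interface from Part II; each member votes on a document and the majority label wins.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the committee classifier: every per-machine classifier votes on
// a document and the most frequent label wins (ties broken arbitrarily).
public class CommitteeClassifier {
    private final List<Learner> members;

    public CommitteeClassifier(List<Learner> members) {
        this.members = members;
    }

    public String classify(String document) {
        Map<String, Integer> votes = new HashMap<>();
        for (Learner member : members) {
            votes.merge(member.classify(document), 1, Integer::sum);
        }
        // Pick the label with the most votes.
        String best = null;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (best == null || entry.getValue() > votes.get(best)) {
                best = entry.getKey();
            }
        }
        return best;
    }
}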
Basic assumptions
- The trained model is small, and can be easily duplicated on different machines.
- The bulk of the training data resides on each individual machine, with some training examples coming from other machines in boosting mode.