Open Source Distributed Filesystem

admin — Sun, 21 Sep 2008 15:43:22 +0000

MogileFS is an open source distributed filesystem.

As quoted from its website, its properties and features include:

Application level — no special kernel modules required.
No single point of failure — all three components of a MogileFS setup (storage nodes, trackers, and the tracker’s database(s)) can be run on multiple machines, so there’s no single point of failure. (you can run trackers on the same machines as storage nodes, too, so you don’t need 4 machines…) A minimum of 2 machines is recommended.
Automatic file replication — files, based on their “class”, are automatically replicated between enough different storage nodes as to satisfy the minimum replica count as requested by their class. For instance, for a photo hosting site you can make original JPEGs have a minimum replica count of 3, but thumbnails and scaled versions only have a replica count of 1 or 2. If you lose the only copy of a thumbnail, the application can just rebuild it. In this way, MogileFS (without RAID) can save money on disks that would otherwise be storing multiple copies of data unnecessarily.
“Better than RAID” — in a non-SAN RAID setup, the disks are redundant, but the host isn’t. If you lose the entire machine, the files are inaccessible. MogileFS replicates the files between devices which are on different hosts, so files are always available.
Flat Namespace — Files are identified by named keys in a flat, global namespace. You can create as many namespaces as you’d like, so multiple applications with potentially conflicting keys can run on the same MogileFS installation.
Shared-Nothing — MogileFS doesn’t depend on a pricey SAN with shared disks. Every machine maintains its own local disks.
No RAID required — Local disks on MogileFS storage nodes can be in a RAID, or not. It’s cheaper not to, as RAID doesn’t buy you any safety that MogileFS doesn’t already provide.
Local filesystem agnostic — Local disks on MogileFS storage nodes can be formatted with your filesystem of choice (ext3, XFS, etc..). MogileFS does its own internal directory hashing so it doesn’t hit filesystem limits such as “max files per directory” or “max directories per directory”. Use what you’re comfortable with.
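To make the "class" and "different hosts" ideas above concrete, here is a minimal sketch of class-based replica placement. It is not MogileFS's actual algorithm or API: the class names, the device inventory and the `pick_devices` helper are all hypothetical, chosen only to show how a minimum replica count per class can be satisfied with at most one copy per host.

```python
# Hypothetical replication policy: file class -> minimum replica count.
MIN_REPLICAS = {"original": 3, "thumbnail": 1}

# Hypothetical device inventory: device id -> host the device lives on.
DEVICES = {
    "dev1": "hostA",
    "dev2": "hostA",
    "dev3": "hostB",
    "dev4": "hostC",
}


def pick_devices(file_class, devices=DEVICES, policy=MIN_REPLICAS):
    """Choose one device per host until the class's replica count is met."""
    wanted = policy[file_class]
    chosen, used_hosts = [], set()
    for dev in sorted(devices):
        host = devices[dev]
        if host in used_hosts:
            continue  # never place two replicas on the same host
        chosen.append(dev)
        used_hosts.add(host)
        if len(chosen) == wanted:
            return chosen
    raise RuntimeError(
        f"class {file_class!r} needs {wanted} hosts, found {len(used_hosts)}"
    )
```

With this inventory, an "original" lands on three devices on three different hosts, while a "thumbnail" gets a single copy. Losing hostA therefore still leaves two copies of every original.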

Digg is using MogileFS. For an interesting article on this, see "How Digg Works".

Apache UIMA: Unstructured Information Management Architecture

admin — Wed, 06 Aug 2008 07:01:07 +0000

UIMA is a framework and SDK for developing software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.

As quoted from the website, an example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example “language identification” -> “language specific segmentation” -> “sentence boundary detection” -> “entity detection (person/place names etc.)”.
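The decomposition described above can be illustrated with a toy pipeline. This is not the UIMA API: the function names, the dictionary used as a shared analysis record, and the tiny `KNOWN_ENTITIES` lookup table are assumptions made for the example, and the heuristics are deliberately naive. The point is only that each component reads what earlier components produced and adds its own annotations.

```python
import re


def identify_language(doc):
    # Toy heuristic: call the text English if it contains a common English word.
    words = set(re.findall(r"[a-z]+", doc["text"].lower()))
    doc["language"] = "en" if words & {"the", "in", "for", "works"} else "unknown"
    return doc


def split_sentences(doc):
    # Naive boundary detection: split after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    doc["sentences"] = [p for p in parts if p]
    return doc


# Hypothetical lookup table standing in for a real entity detector.
KNOWN_ENTITIES = {"Alice": "person", "Paris": "place", "Acme": "organization"}


def detect_entities(doc):
    # Relies on the sentences produced by the previous component.
    doc["entities"] = [
        (token, KNOWN_ENTITIES[token])
        for sentence in doc["sentences"]
        for token in re.findall(r"[A-Za-z]+", sentence)
        if token in KNOWN_ENTITIES
    ]
    return doc


PIPELINE = [identify_language, split_sentences, detect_entities]


def analyze(text, pipeline=PIPELINE):
    """Run each component in order over one shared analysis record."""
    doc = {"text": text}
    for component in pipeline:
        doc = component(doc)
    return doc
```

Calling `analyze("Alice works for Acme. She is located in Paris.")` yields a record with a language tag, two sentences and three typed entities. Swapping in a better sentence splitter would not require touching the other components.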

UIMA is a component framework for analysing unstructured content such as text, audio and video. It comprises an SDK and tooling for composing and running analytic components written in Java and C++, with some support for Perl, Python and TCL.

SEDA: An Architecture for Highly Concurrent Server Applications

admin — Sat, 19 Jul 2008 15:53:42 +0000

As quoted from the website, SEDA is an acronym for staged event-driven architecture, and decomposes a complex, event-driven application into a set of stages connected by queues. This design avoids the high overhead associated with thread-based concurrency models, and decouples event and thread scheduling from application logic. By performing admission control on each event queue, the service can be well-conditioned to load, preventing resources from being overcommitted when demand exceeds service capacity.

SEDA employs dynamic control to automatically tune runtime parameters (such as the scheduling parameters of each stage), as well as to manage load, for example, by performing adaptive load shedding. Decomposing services into a set of stages also enables modularity and code reuse, as well as the development of debugging tools for complex event-driven applications.
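The core idea of stages connected by queues, with admission control at each queue, can be sketched in a few lines. This is not the SEDA implementation: the `Stage` class and its method names are hypothetical, each stage gets a single worker thread rather than a dynamically sized pool, and load shedding is reduced to rejecting events when a bounded queue is full.

```python
import queue
import threading

_STOP = object()  # sentinel telling a stage's worker thread to exit


class Stage:
    """One stage: a bounded event queue plus a worker thread running a handler."""

    def __init__(self, name, handler, capacity, next_stage=None):
        self.name = name
        self.handler = handler
        self.events = queue.Queue(maxsize=capacity)
        self.next_stage = next_stage
        self.rejected = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def enqueue(self, event):
        """Admission control: refuse the event when the queue is full."""
        try:
            self.events.put_nowait(event)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def start(self):
        self._thread.start()

    def stop(self):
        """Ask the worker to exit after it drains earlier events (call after start)."""
        self.events.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            event = self.events.get()
            if event is _STOP:
                break
            result = self.handler(event)
            if self.next_stage is not None:
                # The downstream stage applies its own admission control.
                self.next_stage.enqueue(result)
```

A two-stage service could be wired as `Stage("parse", str.upper, capacity=2, next_stage=respond)`. Once the parse queue holds two events, further calls to `enqueue` return `False` instead of piling up work, which is the behaviour the paragraph above describes as keeping the service well-conditioned under load.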

SEDA is used in a number of open source and commercial projects.

Read the research papers if you are interested in developing highly concurrent systems.

Open Source Grid Computing

admin — Mon, 19 May 2008 06:32:58 +0000

Here are some interesting open source grid computing packages.

GridGain is open source grid computing software for Java. It is dual-licensed under the LGPL and Apache 2.0 licenses and is built on an open source software foundation.

BOINC is a software platform for volunteer computing and desktop grid computing. BOINC is designed to support applications that have large computation requirements, storage requirements, or both. The main requirement of the application is that it be divisible into a large number (thousands or millions) of jobs that can be done independently.

BOINC is used by SETI@home.
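The requirement that an application be divisible into independent jobs can be shown with a small sketch. This does not use the BOINC API: the function names are hypothetical, a local thread pool stands in for volunteer machines, and the sum-of-squares workload is only a placeholder for a real computation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_jobs(start, stop, job_size):
    """Cut the range [start, stop) into independent (lo, hi) work units."""
    return [(lo, min(lo + job_size, stop)) for lo in range(start, stop, job_size)]


def run_job(job):
    """A self-contained work unit: sum of squares over its own sub-range."""
    lo, hi = job
    return sum(n * n for n in range(lo, hi))


def run_all(jobs, workers=4):
    # Jobs share no state, so any worker can take any job in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_job, jobs))
```

Because each job carries everything it needs in its `(lo, hi)` pair, the partial results can be computed anywhere and combined afterwards. That independence is what makes a workload suitable for this style of grid.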

Globus Toolkit is an open source software toolkit used for building Grid systems and applications. It is being developed by the Globus Alliance and many others all over the world.

NGrid is an open source (LGPL) grid computing framework written in C#. NGrid aims to be platform independent via the Mono project.

Other useful references

Open Grid Forum
Open Grid Services Architecture (OGSA) represents an evolution towards a Grid system architecture based on Web services concepts and technologies.
Software components for grid systems and applications.