
Hadoop Batch Job Scheduler

Azkaban is a simple batch scheduler for constructing and running Hadoop jobs or other offline processes.

A batch job scheduler can be seen as the cron and make Unix utilities combined with a friendly UI. Batch jobs need to be scheduled to run periodically. They also typically have intricate dependency chains: for example, dependencies on various data extraction processes or on previous steps. Larger processes might have 50 or 60 steps, some of which can run in parallel while others must wait for the output of earlier steps. Combining all these steps into a single program lets you control the dependency management, but can lead to sprawling monolithic programs that are difficult to test or maintain. Simply scheduling the individual pieces to run at different times avoids the monolith, but introduces timing assumptions (for example, that an upstream step always finishes before a downstream step starts) that are inevitably broken. Azkaban is a workflow scheduler that allows the independent pieces to be declaratively assembled into a single workflow, and allows that workflow to be scheduled to run periodically.
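To make the dependency idea concrete, here is a minimal sketch in Python of how a declared set of dependencies determines run order. It is not Azkaban's implementation or its job format; the job names and the `execution_waves` helper are hypothetical, and the ordering uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+). Each "wave" holds jobs whose dependencies are all satisfied, so the jobs in one wave could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
WORKFLOW = {
    "extract_users": set(),
    "extract_clicks": set(),
    "join_clicks_to_users": {"extract_users", "extract_clicks"},
    "build_report": {"join_clicks_to_users"},
}


def execution_waves(workflow):
    """Group jobs into waves; jobs within a wave have no unmet dependencies."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unblocking dependents
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(WORKFLOW), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Nothing in the sketch depends on when a job starts or how long it takes; a job becomes eligible only once everything it depends on has finished, which is exactly the timing assumption that separate cron entries cannot guarantee.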

A good batch workflow system allows a program to be built out of small reusable pieces that need not know about one another. By declaring dependencies, you can control sequencing. Other functionality available from Azkaban can then be declaratively layered on top of a job without adding any code to the job itself. This includes email notifications of success or failure, resource locking, retry on failure, log collection, historical job runtime information, and so on.
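The layering described above can be illustrated with a small Python sketch in which retry and notification behavior is declared as data and applied by a generic runner, while the job itself stays unaware of both. This is an illustration of the idea only, not Azkaban's API: `JobPolicy`, `run_with_policy`, and the `notify` callback (standing in for email) are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobPolicy:
    """Hypothetical declarative settings layered on top of a job."""
    retries: int = 0                         # extra attempts after the first failure
    retry_backoff_seconds: float = 0.0       # pause between attempts
    notify: Callable[[str], None] = print    # stand-in for an email notification


def run_with_policy(name: str, job: Callable[[], None], policy: JobPolicy) -> bool:
    """Run a job under a policy; return True on success, False if all attempts fail."""
    attempts = policy.retries + 1
    for attempt in range(1, attempts + 1):
        try:
            job()
        except Exception as exc:
            if attempt == attempts:
                policy.notify(f"{name} failed after {attempt} attempt(s): {exc}")
                return False
            time.sleep(policy.retry_backoff_seconds)
        else:
            policy.notify(f"{name} succeeded on attempt {attempt}")
            return True
    return False


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_extract():
        # Hypothetical job that fails twice before succeeding.
        calls["count"] += 1
        if calls["count"] < 3:
            raise RuntimeError("upstream not ready")

    run_with_policy("flaky_extract", flaky_extract, JobPolicy(retries=3))
```

Because the policy lives outside the job, the same job can run with different retry or notification settings in different workflows without being modified.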

