Wednesday, August 19, 2009

Simplified Map Reduce

Requirement

Need to process 10 Million XML files.

Problems
1. Time constraint: other jobs/tasks must run after these records are processed, so processing cannot take arbitrarily long.
2. The processing logic must be efficient enough that the records can be processed in parallel.
3. Various other concerns must be handled, such as OOM (out-of-memory) errors, DB locks, etc.

Solution: -

Being unfamiliar with a term like Map-Reduce should not stop anybody from reading this solution. By the end of this article you will have been introduced to scalable and highly efficient logic for processing a very large number of records in a short time.

What this solution offers is a simplified Map-Reduce implementation in which the framework itself is a self-managed service that is invoked by the various small tasks/processes.

Features of the framework service: -

1. Manages its own lifecycle and is not dependent upon the tasks.
2. Has the capability to create its own instances or re-use the existing ones for the new processes
3. Manages all sorts of coordination between the various tasks running in parallel
4. Keeps track of all processed and unprocessed threads; the output of each processed thread is then sent on to other threads (merging)
5. Exposes all its runtime configuration through JMX APIs
6. Number of parallel threads can be increased or decreased at runtime
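
Features 5 and 6 can be illustrated with a minimal sketch. This is not the framework's actual code: the names ResizablePoolDemo, WorkerPool, WorkerPoolMBean and the object name "example.mapreduce:type=WorkerPool" are made up for illustration. It only assumes the standard java.util.concurrent and javax.management APIs, and shows a thread pool whose size can be read and changed at runtime through a JMX attribute.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class ResizablePoolDemo {

    /** Management interface (hypothetical); JMX derives attributes from getter/setter names. */
    public interface WorkerPoolMBean {
        int getPoolSize();
        void setPoolSize(int size);
        long getCompletedTaskCount();
    }

    /** A thread pool whose number of worker threads can be changed while it runs. */
    public static class WorkerPool implements WorkerPoolMBean {
        private final ThreadPoolExecutor pool;

        public WorkerPool(int initialSize) {
            pool = new ThreadPoolExecutor(initialSize, initialSize,
                    60L, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
        }

        public int getPoolSize() {
            return pool.getCorePoolSize();
        }

        public synchronized void setPoolSize(int size) {
            if (size < 1) {
                throw new IllegalArgumentException("pool size must be at least 1");
            }
            // Order matters: the core size must never exceed the maximum size.
            if (size > pool.getMaximumPoolSize()) {
                pool.setMaximumPoolSize(size);
                pool.setCorePoolSize(size);
            } else {
                pool.setCorePoolSize(size);
                pool.setMaximumPoolSize(size);
            }
        }

        public long getCompletedTaskCount() {
            return pool.getCompletedTaskCount();
        }

        public void submit(Runnable task) {
            pool.execute(task);
        }

        public void shutdown() {
            pool.shutdown();
        }
    }

    /** Registers the pool with the platform MBean server under an illustrative name. */
    static ObjectName register(WorkerPool pool) throws Exception {
        ObjectName name = new ObjectName("example.mapreduce:type=WorkerPool");
        ManagementFactory.getPlatformMBeanServer().registerMBean(pool, name);
        return name;
    }

    public static void main(String[] args) throws Exception {
        WorkerPool pool = new WorkerPool(4);
        ObjectName name = register(pool);
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // The same change could be made from a JMX console such as JConsole.
        server.setAttribute(name, new Attribute("PoolSize", 8));
        System.out.println("pool size is now " + server.getAttribute(name, "PoolSize"));
        pool.shutdown();
    }
}
```

Once registered, the PoolSize attribute shows up in any JMX console, so an operator can raise or lower the number of parallel threads without restarting the process.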

High level Diagram





Major Components: -

1. FeedsExecutor – A wrapper around the Executors defined in JDK 1.5. It is entrusted with holding, processing, and managing all tasks submitted by the client applications.
2. Data-Merger – One of the major components; it tracks and consolidates all the results produced by the various tasks executed by the FeedsExecutor.
3. DataWriter – A client-implemented API that is invoked by the merger and contains the logic for persisting the data to any persistent storage.
4. Map-Reduce Executor Service – The heart of the whole framework, responsible for all coordination between Executors, Mergers, and Writers. It also keeps track of the various Executors dedicated to the processes and exposes all runtime information and statistics to users through the JMX console.
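
How these components could fit together is sketched below. This is an assumption-laden illustration, not the framework's real API: SimpleMapReduce, process, and the shape of the DataWriter interface are invented here. A fixed thread pool plays the FeedsExecutor role, the loop that collects the futures plays the merger role, and the client supplies the writer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class SimpleMapReduce {

    /** Client-implemented persistence hook (hypothetical shape of the DataWriter role). */
    interface DataWriter<R> {
        void write(List<R> mergedResults);
    }

    /**
     * Runs one task per input in parallel, merges the results in input order,
     * hands them to the writer, and returns how many results were merged.
     */
    static <I, R> int process(List<I> inputs, Function<I, R> mapper,
                              DataWriter<R> writer, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            // "Map" step: submit every input as an independent task.
            List<Future<R>> pending = new ArrayList<>();
            for (I input : inputs) {
                Callable<R> task = () -> mapper.apply(input);
                pending.add(executor.submit(task));
            }
            // "Merge" step: wait for each task and consolidate its result.
            List<R> merged = new ArrayList<>();
            for (Future<R> future : pending) {
                merged.add(future.get());
            }
            // "Write" step: the client decides how the merged data is persisted.
            writer.write(merged);
            return merged.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in inputs; a real run would parse XML files instead of measuring strings.
        List<String> docs = Arrays.asList("<a/>", "<bb/>", "<ccc/>");
        SimpleMapReduce.<String, Integer>process(docs, String::length,
                merged -> System.out.println("persisting " + merged), 2);
    }
}
```

Collecting all results in memory before writing keeps the sketch short; with millions of files a real merger would more likely hand results to the writer in batches.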

Some of the performance numbers: -

Hardware Specs: - Standard Intel Core processor with 2 GB of RAM
Target: - Process 10 million XML files (ranging from 10 KB to 100 KB) and persist the data in a database.
Total No. of Threads running in parallel: - 36000 K
Total -Xms provided to the process: - 1536 MB
Total Time Taken: - 10.5 minutes for processing the records and 3 minutes (approx.) to persist them in the database.
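
To put these timings in perspective, the implied throughput can be worked out with simple arithmetic. The small sketch below (ThroughputEstimate is a name made up here) assumes the persistence step runs after processing finishes, so the two times add up to 13.5 minutes; if the two phases overlap, the combined figure would be higher.

```java
class ThroughputEstimate {

    /** Average number of files handled per second over the given duration. */
    static double filesPerSecond(long files, double minutes) {
        return files / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long files = 10_000_000L;
        // Processing only: 10.5 minutes.
        System.out.printf("processing only: %.0f files/s%n", filesPerSecond(files, 10.5));
        // Processing plus roughly 3 minutes of persistence, assumed sequential.
        System.out.printf("processing + persistence: %.0f files/s%n", filesPerSecond(files, 13.5));
    }
}
```

That works out to roughly 15,900 files per second for processing alone, and roughly 12,300 files per second end to end under the sequential assumption.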
