Tuesday, March 31, 2015

Neo4j - New Tool in my Technology Arsenal

Brief About Neo4j


Understanding data is the key to every system, and building analytical models on that data, modelled on real-world entities and their relationships, is what makes large-scale enterprise systems successful.

Architects, developers and data scientists have long struggled to derive real-world models and relationships from discrete, disparate systems holding structured and unstructured data.
We would all agree that data without relationships is of little use: if we cannot derive the relationships between entities, the entities themselves tell us very little. After all, it is all about the connections in the data.

Let us introduce a different kind of database, one that focuses on the relationships between entities rather than on the entities themselves: Neo4j.

Neo4j is a NoSQL database built on graph theory. It is an open-source graph database implemented in Java.
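
To make that concrete, here is a minimal sketch using Neo4j's embedded Java API (2.x): two nodes are created and connected by a relationship that carries its own property. The store path, label, relationship type and property names below are only examples, not anything prescribed by Neo4j.

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class HelloNeo4j {

 public static void main(String[] args) {
  // Example store directory; all writes must happen inside a transaction
  GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase("target/hello-graph.db");
  try (Transaction tx = db.beginTx()) {
   Node alice = db.createNode(DynamicLabel.label("Person"));
   alice.setProperty("name", "Alice");
   Node bob = db.createNode(DynamicLabel.label("Person"));
   bob.setProperty("name", "Bob");
   // The relationship is a first-class citizen and can carry properties too
   Relationship knows = alice.createRelationshipTo(bob, DynamicRelationshipType.withName("KNOWS"));
   knows.setProperty("since", 2015);
   tx.success();
  }
  db.shutdown();
 }
}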

Its first version (1.0) was released in February 2010, and development has never stopped since. It is amazing to see the pace at which Neo4j has evolved over the years. At the time of writing, the current stable version is 2.2, released in March 2015.

Where to go Now?


To learn more about Neo4j and dive into its nitty-gritty, covering data modelling, performance tuning, security and extensions, see Neo4j Essentials.


Friday, October 11, 2013

Parquet - Columnar Storage format for Hadoop

Based on "record shredding and assembly algorithm" defined in Google's Dremel Paper , "parquet" seems to be good choice for Efficient Data Storage. - http://parquet.io/

The complete project is divided into 2 parts: -

1. Parquet Format - contains the Thrift-based definitions of the storage format.
2. Parquet-MR - contains the Java (MapReduce) implementation of the Parquet format, with bindings for Hive, Avro, Hadoop, Pig and Cascading.

The best part is that all the definitions are written in Thrift, so implementations can be built in any language.
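
As an illustration, here is a minimal sketch of writing a Parquet file from Java, assuming the parquet-avro binding from Parquet-MR is on the classpath. The Avro schema, field names and output path are made up for the example; real code would use its own data model.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class SimpleParquetWrite {

 public static void main(String[] args) throws Exception {
  // Hypothetical two-column Avro schema describing the records to be stored
  Schema schema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
    + "{\"name\":\"id\",\"type\":\"long\"},"
    + "{\"name\":\"message\",\"type\":\"string\"}]}");

  // Writer from the parquet-avro module; records are shredded into columns on disk
  AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(new Path("sumit/events.parquet"), schema);

  for (long i = 0; i < 10; i++) {
   GenericRecord record = new GenericData.Record(schema);
   record.put("id", i);
   record.put("message", "row-" + i);
   writer.write(record);
  }
  writer.close();
 }
}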


Saturday, January 7, 2012

RCFile: A Fast and Space-efficient Data Placement

RCFile (Record Columnar File) is an efficient mechanism for storing and quickly retrieving huge data sets.
It is a concept developed at Facebook to overcome the challenges posed by large data sets.

It goes without saying that every BigData use case involves solving 3 critical problems (you could also call them the NFRs of a BigData solution): -

1) Fast data loading - Typical use cases involve loading terabytes of data for analytics, so an efficient loading mechanism is highly desirable in order to reduce overheads and load the data in the minimum possible time.

2) Fast query processing - The time taken to process a query is vital for any kind of analytics, and to a large extent it depends on the way the data is stored and partitioned.

3) Highly efficient storage space utilization - Data keeps growing, or as we usually say in BigData, there is a "data explosion". The various types of compression definitely need to be considered so that storage space is used efficiently.

RCFile addresses most of the above problems.

A typical RDBMS partitions a table row-wise, while a column-oriented database partitions it column-wise.

RCFile combines the benefits of both by first partitioning the data row-wise into row groups and then column-wise within each row group.

The diagram below shows how a typical RDBMS data file is laid out within an RCFile.




The RCFile API is provided in Hive and supports 2 different styles of implementation: -

1. It can be used in M/R jobs through RCFileOutputFormat, RCFileInputFormat and RCFileRecordReader.
2. It provides Reader and Writer classes which applications can use to read and write RC files in their own way.

Here is a small example that uses the RCFile Reader and Writer to write data and read it back in 2 different ways (column-wise and row-wise).

Prerequisites for running this example: -

1. Hadoop 0.20.x should be installed and running
2. Hive 0.6+ should be on the classpath

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.RCFile;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.Metadata;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;

public class TestRCSimpleReadWrite /* extends RCFileCat */ /* extend RCFileCat (which extends Tool) only if this program will be invoking an M/R job */ {

 Configuration conf;
 FileSystem fs;
 private static final int maxHiveColumns = 5;
 

 /**
  * Constructor
  */
 TestRCSimpleReadWrite() {
  try {
   System.setProperty("HADOOP_HOME", "D:\\myWork\\hadoop\\hadoop-0.20.2");
   conf = new Configuration();
   conf.addResource(new Path("D:\\myWork\\hadoop\\hadoop-0.20.2\\conf\\core-site.xml"));
   conf.addResource(new Path("D:\\myWork\\hadoop\\hadoop-0.20.2\\conf\\hdfs-site.xml"));
   conf.set("hive.io.rcfile.column.number.conf", String.valueOf(maxHiveColumns));
   fs = FileSystem.get(conf);
  } catch (Exception e) {
   e.printStackTrace();
  }
 }

 /**
  * Writing Data to a file
  */
 public void writeRCData() {
  try {

   RCFile.Writer rcFileWriter = new RCFile.Writer(fs, conf, new Path("sumit/rctext"), null, new Metadata(), new BZip2Codec());
   // No of Rows........
   for (int j = 0; j < 10; j++) {
    BytesRefArrayWritable dataWrite = new BytesRefArrayWritable(10);
    // Number of Column in Each Row......
     for (int i = 0; i < maxHiveColumns; i++) {
      Text column = new Text("ROW-NUM - " + j + ", COLUMN-NUM = " + i + "\n");
      BytesRefWritable bytesRefWritable = new BytesRefWritable();
      bytesRefWritable.set(column.getBytes(), 0, column.getLength());
      // set() grows the backing array if required and marks the slot as valid
      dataWrite.set(i, bytesRefWritable);
     }
    rcFileWriter.append(dataWrite);
   }

   rcFileWriter.close();

  } catch (Exception e) {
   e.printStackTrace();
  }
 }

 /*
  * Reading Column Wise Data
  */
 public void readColumnWiseRCFileData() {
  try {
   RCFile.Reader rcFileReader = new RCFile.Reader(fs, new Path("sumit/rctext"), conf);
   int counter = 1;
   // Getting the Chunk of Row Groups...
   while (rcFileReader.nextColumnsBatch()) {
    System.out.println("READ COLUMN WISE - we are getting some data");

    // Iterate over all Rows fetched and iterate over each column
    for (int i = 0; i < maxHiveColumns; i++) {

      BytesRefArrayWritable dataRead = rcFileReader.getColumn(i, null);
      // Iterate over every value of this column within the current row group
      for (int j = 0; j < dataRead.size(); j++) {
       BytesRefWritable bytesRefread = dataRead.get(j);
       // Each value is a slice of a shared buffer, so use its start offset and length
       Text returnData = new Text();
       returnData.set(bytesRefread.getData(), bytesRefread.getStart(), bytesRefread.getLength());
       System.out.println("READ-DATA = " + returnData.toString());
      }
    }

    System.out.println("Checking for next Iteration outer");

    counter++;
    }

    rcFileReader.close();

  } catch (Exception e) {
   e.printStackTrace();
  }
 }

 /*
  * Reading Row Wise Data
  */
 
 public void readRowWiseRCFileData() {
  try {
   RCFile.Reader rcFileReader = new RCFile.Reader(fs, new Path("sumit/rctext"), conf);
    int counter = 1;
    LongWritable rowID = new LongWritable();
    // Iterate row by row; next() fills rowID with the current row number
    while (rcFileReader.next(rowID)) {
     System.out.println("READ ROW WISE - we are getting some data for ROW = " + counter);
     BytesRefArrayWritable dataRead = new BytesRefArrayWritable();
     rcFileReader.getCurrentRow(dataRead);

    // Iterate over all Rows fetched and iterate over each column
    System.out.println("Size of Data Read - " + dataRead.size());
     for (int i = 0; i < dataRead.size(); i++) {
      BytesRefWritable bytesRefread = dataRead.get(i);
      // Each column value is a slice of a shared buffer, so use its start offset and length
      Text returnData = new Text();
      returnData.set(bytesRefread.getData(), bytesRefread.getStart(), bytesRefread.getLength());
      System.out.println("READ-DATA = " + returnData.toString());
     }
    System.out.println("Checking for next Iteration");

    counter++;
    }

    rcFileReader.close();

  } catch (Exception e) {
   e.printStackTrace();
  }
 }
 
 /**
  * Main Method
  * @param args
  */

 public static void main(String[] args) {
  try {

   TestRCSimpleReadWrite obj = new TestRCSimpleReadWrite();
   System.out.println("Start writing the Data");
   obj.writeRCData();
   System.out.println("Start reading Column Wise Data");
   obj.readColumnWiseRCFileData();
   System.out.println("Start reading Row Wise Data");
   obj.readRowWiseRCFileData();

  } catch (Exception e) {
   e.printStackTrace();
  }

 }

}

References: -
1. http://en.wikipedia.org/wiki/RCFile
2. http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf


Friday, December 30, 2011

Network Monitoring Tool - Nagios

Lately I have been evaluating various open-source network monitoring tools for monitoring large networks.

Yes, networks that comprise not only servers but also other devices such as printers, routers and switches.

Moreover, the network I am talking about is not 5-10 machines; I am talking about monitoring a cluster of 100+ boxes.

Here is a brief and good article describing a few of the popular open-source network monitoring tools.

Apart from the tools mentioned in the above link, I have been working extensively with Nagios and found it pretty good: efficient, extensible, and very easy to set up and use.

Written in C and backed by a very active community, Nagios comes with an extremely robust backend and a fast, intuitive UI, and it provides almost everything a system or network administrator could ask for when monitoring a network.

Nagios ships with a wide range of plugins that can monitor almost every bit of hardware or software installed anywhere in your network, or networks of networks.

Starting from the basic ping plugin, it has gone far beyond that and now provides both active (see the NRPE plugin) and passive (see the NSCA plugin) monitoring.

Of course, we can also write our own plugins and use them like any other plugin, as the sketch below shows.
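
A custom check only needs to print a single status line and return one of the standard Nagios exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). Below is a minimal sketch of such a check written in Java; the host, port and class name are only illustrative, and in practice the check would still have to be registered as a command in the Nagios configuration.

import java.net.InetSocketAddress;
import java.net.Socket;

public class CheckTcpPort {

 public static void main(String[] args) {
  // Hypothetical defaults; real checks would take these from the command definition
  String host = args.length > 0 ? args[0] : "localhost";
  int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080;
  try (Socket socket = new Socket()) {
   // Fail the check if the connection cannot be established within 5 seconds
   socket.connect(new InetSocketAddress(host, port), 5000);
   System.out.println("OK - " + host + ":" + port + " is reachable");
   System.exit(0);
  } catch (Exception e) {
   System.out.println("CRITICAL - " + host + ":" + port + " is not reachable: " + e.getMessage());
   System.exit(2);
  }
 }
}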

Notifications, as usual, remain a critical part of any monitoring tool, and Nagios provides a good, robust SMS and email notification framework.

The web UI is pretty intuitive and provides different ways to logically group boxes and networks, so that monitoring is much easier and the health of the network can be presented in a logical manner.

After all of the above, there is no second thought that Nagios is rightly termed an enterprise-class monitoring tool, one that is widely used to monitor the IT infrastructure of companies.

The only thing I found missing is that it does not support installation on Windows, although it does provide plugins to monitor Windows boxes (see here).

Few important Links: -

1. Download Nagios
2. Installing Nagios on Ubuntu
3. Short Intro about Nagios Plugins
4. Nagios Directory
5. Nagios Plugins and Add-ons
6. Plugins for Monitoring Hadoop - HDFS , Datanodes and JMX Plugin
7. Monitoring Windows Boxes

Tips and Tricks: -

1. Where to find the Nagios configuration files - /usr/local/nagios

2. Where is the plugin directory - /usr/local/nagios/libexec

3. How to start/ stop/ reload Nagios - sudo /etc/init.d/nagios start|stop|reload

4. Debugging issues during Nagios startup -

In Nagios, the file that matters most is /usr/local/nagios/etc/nagios.cfg.

This is the file you need to validate before starting or restarting your Nagios process.

To validate the nagios.cfg file, run the command "sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg".

In case of any issues, the errors are displayed on the console, which helps in debugging and fixing them.

Error handling and reporting in Nagios is still in a fairly raw state, and even the type of plugins used matters a lot.
(I had a hard time configuring NSCA because there was one extra blank line in one of the shell scripts.)

So it is always advisable to check the configuration frequently, so that errors can be caught at an early stage.



Friday, December 23, 2011

Converting Java to Windows Executables

I recently came across a wonderful tool for converting executable jar files into Windows executables.

Launch4j - It works pretty well and provides the various options that are very much needed to deploy any Java-based solution.

The good part is that it is open source :)

Moreover, the best part is that it provides a framework to generate the EXE and, at the same time, all the configurations can be saved in an XML file for future use.

A list of all the available features and other details can be found here, and the installer can be downloaded from here.
