Friday, October 11, 2013

Parquet - Columnar Storage format for Hadoop

Parquet is based on the "record shredding and assembly algorithm" described in Google's Dremel paper, and it looks like a good choice for efficient data storage. Project page: http://parquet.io/
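To give a rough feel for what "columnar" means, here is a minimal sketch in Python. It only handles flat records; the Dremel algorithm additionally tracks repetition and definition levels so that nested and repeated fields can be rebuilt, and none of that is shown here. The function names `shred` and `assemble` and the sample data are my own, purely for illustration, and are not part of any Parquet API.

```python
from typing import Any


def shred(records: list[dict[str, Any]], fields: list[str]) -> dict[str, list[Any]]:
    """Turn row-oriented records into one list of values per column.

    Hypothetical helper: flat records only, missing fields become None.
    """
    return {name: [rec.get(name) for rec in records] for name in fields}


def assemble(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Rebuild row-oriented records from the per-column value lists."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*(columns[n] for n in names))]


# Made-up sample rows, stored row by row.
rows = [
    {"id": 1, "lang": "java"},
    {"id": 2, "lang": "pig"},
    {"id": 3, "lang": "hive"},
]

# After shredding, all values of one field sit next to each other.
columns = shred(rows, ["id", "lang"])
print(columns)
print(assemble(columns) == rows)
```

With the values of one field stored together, a query that needs only `id` can skip the `lang` values entirely, and similar values sitting side by side tend to compress better.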

The project is divided into two parts:

1. Parquet Format - contains the Thrift-based definitions for the storage format.
2. Parquet-MR - contains the MapReduce (Java) implementation of the Parquet format, along with integrations for Hive, Avro, Hadoop, Pig, and Cascading.

The best part is that all definitions are written in Thrift, so implementations can be written in other languages as well.
