
Posts

Showing posts from February, 2014

SparkR: Where R meets Spark

R is one of the most widely used statistical tools. In recent years it has gained a lot of popularity, and it is forecast to overtake SPSS and SAS in the near future.

Pros: R is open source and is used and supported by millions of people. It has thousands of packages, most of which are used and developed by people working in statistics. (I feel most products are developed by people who do not truly understand the end use case.)

Cons: R runs on a single processor core by default, requires all data to be held in RAM, and does not handle big data processing.

This is where Spark comes in. Spark is designed for high-speed and real-time big data processing. It offers high-speed processing, advertised as up to 100x faster than Hadoop, and ease of use; Shark, MLlib, and Spark Streaming make Spark really powerful.

SparkR is a lightweight interface to Spark from R. This aids big data processing in R, and results can be used in fu
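To make the contrast concrete: a single-process tool needs the whole dataset in one place, while a Spark-style engine splits the data into partitions, does a small piece of work on each, and then combines the partial results. The block below is only a rough, plain-Python sketch of that idea (computing a mean from per-partition sums and counts). It does not use Spark or SparkR, and the function names (`split_into_partitions`, `partial_stats`, `combine`, `distributed_mean`) are hypothetical, made up for this illustration.

```python
from functools import reduce


def split_into_partitions(values, num_partitions):
    """Hypothetical helper: cut a list into contiguous chunks."""
    size = max(1, -(-len(values) // num_partitions))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(partition):
    """Work done per partition: only this chunk is looked at."""
    return (sum(partition), len(partition))


def combine(a, b):
    """Merge two partial (sum, count) results into one."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(values, num_partitions=4):
    """Mean computed from per-partition results, not from the full list at once."""
    partitions = split_into_partitions(values, num_partitions)
    total, count = reduce(combine, map(partial_stats, partitions), (0, 0))
    return total / count if count else float("nan")


if __name__ == "__main__":
    data = list(range(1, 101))
    print(distributed_mean(data))
```

Here everything still runs in one process. In a real cluster the per-partition step would run on different machines, and only the small (sum, count) pairs would travel over the network.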

Upgrading to Cloudera 5 beta 2

How to upgrade from Cloudera 5 beta 1 to beta 2

There are two methods to upgrade Cloudera 5 beta 1 to beta 2. I'm using method B. If you can't uninstall, you can give method A a shot.

A. Upgrade manually (this is from the Cloudera forum)

1. Stop the CDH5b1 cluster in CM5b1.
2. Stop Cloudera Monitoring Services.
3. Upgrade CM to 5b2.
4. Start CM.
5. Upgrade CDH to 5b2 (for packages, manually; for parcels, using the CM UI). If upgrading using parcels, DO NOT start the cluster.
6. (Only if YARN is present in a secure cluster) Go to “Settings” > “Administration”, select ALL “yarn” principals and hit “Regenerate”. Wait for the “Generate Credentials” command to finish. This is needed because CDH5b2 added the HTTP principal for YARN.
7. Start ZooKeeper.
8. Run the HDFS Metadata Upgrade command from the HDFS Actions menu.
9. Run the Oozie Database Upgrade command from the Oozie Actions menu.
10. Run the Oozie Install Sharelib command from the Oozie Actions menu.
11. Run the Upgrade Sqoop command from the Sqoop Actions menu.
12. Run Upgrade H

Spark: Next big thing in Big Data?

Spark has indeed gained a lot of popularity. Its mailing list is one of the most active of all the big data projects (which shows that a lot of people are banging on it). As a big data enthusiast, I'm going to dig in ;-) Watch this space for benchmarking and optimization factors in Spark. Let's play with Spark.