
SparkR: Where R meets Spark

R is one of the most widely used statistical tools. In recent years it has gained a lot of popularity, and it is forecast to overtake SPSS and SAS in the near future.
                                       
R is

  • Open source, used and supported by millions of people
  • Backed by thousands of packages, most of which are developed and used by statisticians. (I feel that most products are developed by people who do not truly understand the end use case.)

Cons
  • It runs on a single processor core by default
  • It requires all data to be held in RAM
  • It does not handle big data processing on its own

This is where Spark comes in. Spark is designed for high-speed and near-real-time big data processing.

  • High-speed processing, claimed to be up to 100x faster than Hadoop MapReduce for in-memory workloads
  • Ease of use
  • Shark, MLlib, and Spark Streaming make Spark really powerful

SparkR is a lightweight interface to Spark from R. It brings big data processing to R, and the results can be used for further statistical analysis.

How to install SparkR

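   1. SparkR requires the devtools and rJava packages. Load them first:
      library(devtools)   # provides install_github()
      library(rJava)      # R-to-Java bridge that SparkR depends on

   2. Install SparkR from GitHub. This is a fairly easy step:
      install_github("amplab-extras/SparkR-pkg", subdir="pkg")

More information is available on GitHub or the AMPLab SparkR page.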
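
Once the package is installed, a first session might look like the minimal sketch below. It assumes a local Spark master and the RDD-style functions (sparkR.init, parallelize, lapply, reduce) from the amplab-extras SparkR-pkg; these names may differ in other SparkR versions, and the input vector is made up for illustration.

```r
# A minimal sketch, assuming the amplab-extras SparkR-pkg API and a local Spark master.
library(SparkR)

# Start a Spark context that runs locally inside this R session.
sc <- sparkR.init(master = "local")

# Distribute a small illustrative vector across 4 partitions.
rdd <- parallelize(sc, 1:100, 4)

# Square every element on the workers, then sum the results back on the driver.
squares <- lapply(rdd, function(x) x * x)
total <- reduce(squares, function(a, b) a + b)

print(total)

# Shut the Spark context down when finished.
sparkR.stop()
```

The value returned by reduce is an ordinary R number, so it can be passed straight into further statistical code in R.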




  
