
SparkR : Where R meets Spark

R is one of the most widely used statistical tools. It has gained a lot of popularity in recent years and is forecast to overtake SPSS and SAS in the near future.

R is

  • Open source, used and supported by millions of people
  • Home to thousands of packages, most of them developed and used by people working in statistics. (I feel most other products are developed by people who do not truly understand the end use case.)

Cons
  • By default it runs on a single processor core
  • It requires all data to fit in RAM
  • It does not handle big data processing on its own

This is where Spark comes in. Spark is designed for high-speed and near-real-time big data processing.

  • High-speed processing, claimed to be up to 100x faster than Hadoop MapReduce for in-memory workloads
  • Ease of use
  • Shark, MLlib and Spark Streaming make Spark really powerful

SparkR is a lightweight interface to Spark from R. It brings big data processing to R, and the results can be used for further statistical analysis.

How to install SparkR

   1. SparkR requires the devtools and rJava packages, so load them first:
      library(devtools)
      library(rJava)

   2. Install SparkR from GitHub. This is a fairly easy step:
      install_github("amplab-extras/SparkR-pkg", subdir="pkg")

More information is available on GitHub or on the AMPLab SparkR page.
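Once the package is installed, a first session might look like the sketch below. It is only an illustration, assuming the RDD-style functions that the amplab-extras/SparkR-pkg README describes (sparkR.init, parallelize, lapply, reduce), a working Java setup, and Spark running in local mode. The variable names are mine, and later SparkR releases changed this API.

```r
# Minimal sketch, assuming the amplab-extras/SparkR-pkg API and a local Spark.
library(SparkR)

# Start a Spark context on the local machine (assumption: local mode).
sc <- sparkR.init(master = "local")

# Distribute the numbers 1..100 across 4 partitions.
rdd <- parallelize(sc, 1:100, 4)

# Square each element on the workers, then sum the results.
squares <- lapply(rdd, function(x) x * x)
total <- reduce(squares, function(a, b) a + b)

# 'total' is now an ordinary R value, ready for further statistics.
print(total)
```

The point of the example is the last step: the heavy lifting happens in Spark, and only the small result comes back into the R session.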




  
