
SparkR : Where R meets Spark

R is one of the most widely used statistical tools. It has gained a lot of popularity in recent years, and it is forecast to overtake SPSS and SAS in the near future.
                                       
R is:

  • Open source, used and supported by millions of people
  • Backed by thousands of packages, most of them used and developed by statisticians. (I feel most other products are developed by people who do not truly understand the end use case.)

Cons:
  • By default, it runs on a single processor core
  • It requires all data to be held in RAM
  • It does not handle big data processing

This is where Spark comes in. Spark is designed for high-speed and real-time big data processing.

  • High-speed processing, claimed to be up to 100x faster than Hadoop MapReduce for in-memory workloads
  • Ease of use
  • Shark, MLlib, and Spark Streaming make Spark really powerful

SparkR is a lightweight interface to Spark from R. It brings big data processing to R, and the results can be used for further statistical analysis.

How to install SparkR

   1. SparkR requires the devtools and rJava packages:
      require(devtools)
      require(rJava)

   2. Install SparkR. This is a fairly easy step:
      install_github("amplab-extras/SparkR-pkg", subdir="pkg")
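
The two steps above can be combined into one script. This is only a minimal sketch: the helper names `missing_packages` and `install_sparkr` are made up for illustration, and it assumes a CRAN mirror is configured, network access is available, and Java is set up so that rJava can build.

```r
# Packages needed before SparkR can be installed (from the steps above).
required <- c("devtools", "rJava")

# Hypothetical helper: return the entries of `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install any missing prerequisites, then SparkR itself.
install_sparkr <- function() {
  to_install <- missing_packages(required)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  # Same call as in step 2, written with an explicit namespace.
  devtools::install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
}

# Uncomment to run the installation (requires network access):
# install_sparkr()
```

Checking for missing packages first avoids reinstalling devtools and rJava on every run, and calling `devtools::install_github` with the namespace prefix means the script does not depend on `require(devtools)` having succeeded.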
More information is available on GitHub or at AMPLab SparkR.




  
