Skip to main content

SparkR : Where R meets Spark

R is one of the widely used Statistical tools. In recent years R is gaining lot of popularity and it is forecasted to beat SPSS and SAS in near future.
                                       
 R is

  • Open Source, used and supported by millions of people
  • Thousands of packages, most of which are used/developed by people in the statistics. ( I feel most of the products are developed by people who truly do not understand the end use case)

Cons
  • It runs on single core of processor,
  • Requires all data to be stored in RAM
  • Does not handle big data processing

   Here is where Spark comes in, Spark is designed for high speed big data processing or real time big data processing.
                                             
  • High speed processing, 100X times faster than hadoop.
  • Ease of use
  • shark,MLib,spark streaming makes spark really powerful.

  SparkR is a lightweight interface for spark through R. This aid the big data processing in R and results can be used in further statistics. 

How to install SparkR

   1. SparkR requires packages 
     require(devtools)
               require(rJava)

  2.   Install SparkR - this is a fairly easy step,
      install_github("amplab-extras/SparkR-pkg", subdir="pkg") 
More Information on github or Amplab SparkR




  

Comments

Popular posts from this blog

Common issues on Shark with CDH5-beta2

Issues on Shark with CDH5-beta2 1. IncompatibleClassChangeError: Implementing class Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing class at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.ClassLoader.defineC...

Upgrading nodejs in ubuntu 14.04

My machine has 5.x installed and had lot of trouble updating it to 8.x. Below are the steps I followed to upgrade nodejs from 5.x to 8.x #add the new source list sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 68576280  sudo apt-add-repository "deb https://deb.nodesource.com/node_8.x $(lsb_release -sc) main" sudo apt-get update #Remove the previous installation sudo apt-get purge nodejs npm  #Verify if proper version is going to be installed apt-cache policy <package> #Install new version sudo apt-get install -y nodejs

Common Issues with Solr Data Import Handler (DIH)

1. Could not load driver: org.postgresql.Driver org.apache.solr.common.SolrException; Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Could not load driver: org.postgresql.Driver Solution : Put rmdbs driver, in my case postgres driver in $SOLR_HOME/dist folder and point it in solrconfig.xml <lib dir="${solr.install.dir:../../../..}/dist/" regex="postgresql.*\.jar" />  2. ERROR StreamingSolrClients org.apache.solr.common.SolrException: Bad Request request: http://host:7574/solr/collection_shard2_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fhost%3A8983%2Fsolr%2Fcollection_shard1_replica2%2F&wt=javabin&version=2 at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurr...