
Installing Shark in CDH5 beta2



How to install Shark in CDH5-beta2.


Requirements for Shark

1. Hadoop Cluster
2. Spark

   When you install CDH5 beta2, Spark and Hadoop are installed along with it.
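
   You can quickly verify that both are present (the paths assume a default CDH5 parcel install; adjust if your parcel directory differs):

   # Hadoop version bundled with CDH
   hadoop version
   # Spark home under the CDH parcel
   ls /opt/cloudera/parcels/CDH/lib/spark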

1. Download Shark source code 
 
sudo mkdir /opt/shark/
sudo chmod 777 /opt/shark/
git clone https://github.com/amplab/shark -b branch-0.9 /opt/shark/shark-0.9.1-bin-cdh5
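
You can verify the checkout before building (a quick optional check, not part of the original steps):

cd /opt/shark/shark-0.9.1-bin-cdh5
git branch   # should show * branch-0.9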

2. Build Shark

    Shark 0.9 requires JDK 7, which comes preinstalled with CDH5 beta.

export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
cd /opt/shark/
ln -s shark-0.9.1-bin-cdh5 shark
cd shark
export SCALA_HOME=/opt/cloudera/parcels/CDH/lib/spark/
SHARK_HADOOP_VERSION=2.2.0-cdh5.0.0-beta-2 ./sbt/sbt package
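
If the build succeeds, sbt leaves the Shark jar under target/ (the path below assumes Scala 2.10 output; adjust if your layout differs):

ls /opt/shark/shark/target/scala-2.10/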

3. Configure Shark
  •  Hive
    Download AMPLab's Hive 0.11: hive-0.11.0-bin
  •  Scala
    Download Scala 2.10.3 (sample download commands for both tarballs are sketched below)
  •  Configure Scala and Hive 
    cd /opt/shark/shark
    mkdir dep
    cd dep
    tar xvf ~/scala-2.10.3.tgz
    ln -s scala-2.10.3 scala
    tar xvf ~/hive-0.11.0-bin.tgz
    ln -s hive-0.11.0-bin hive
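
    If you need direct download commands for the two tarballs used above, something like the following should work (both URLs are assumptions; the Scala link is the standard scala-lang.org archive, and the Hive tarball location may have moved since AMPLab hosted it):

    # Scala 2.10.3 from the scala-lang.org archive
    wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz
    # AMPLab's patched Hive 0.11 binary tarball (assumed mirror)
    wget https://s3.amazonaws.com/spark-related-packages/hive-0.11.0-bin.tgz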
    
    Hive needs some additional configuration, described in the AMPLab wiki. Copy all configuration files from the Apache Hive installation into Shark:
     cp /etc/hive/conf/* /opt/shark/shark/conf/
    As described in AMPLab's configuration guide, append the lines below to hive-site.xml:
    
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master-address:8020/</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master-address:8020/</value>
      </property>
      <property>
        <name>mapreduce.framework.name</name>
        <value>NONE</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>NONE</value>
      </property>
      

    master-address is the NameNode address; replace it with your own.
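
    If you are unsure of the NameNode address, you can read it straight from the cluster configuration:

    # Prints the configured default filesystem, e.g. hdfs://namenode-host:8020
    hdfs getconf -confKey fs.defaultFS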

  • Configure shark-env.sh
    Set the following values in shark-env.sh:
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
    export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
    export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
    export SHARK_HOME=/opt/shark/shark
    export CDH_HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
    export HIVE_CONF_DIR=$SHARK_HOME/conf
    export SCALA_HOME=$SHARK_HOME/dep/scala
    export HIVE_HOME=$SHARK_HOME/dep/hive   # HIVE_CONF_DIR above already points at the copied configs
    export MASTER=spark://xxxxxxxx:7077
    
  • For Parquet support
    cd /opt/shark/shark/lib
    wget http://repo1.maven.org/maven2/com/twitter/parquet-hive/1.0.0/parquet-hive-1.0.0.jar
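
    You can confirm the jar landed in Shark's lib/ directory, which the Shark scripts pick up at startup:

    ls -l /opt/shark/shark/lib/parquet-hive-1.0.0.jar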
    
  • Distribute Shark to all the worker nodes.
    #MASTER
    cd /opt/shark/
    tar -czf shark.tgz shark-0.9.1-bin-cdh5/

    #WORKER
    sudo mkdir /opt/shark/
    sudo chmod 777 /opt/shark/
    scp user@master:/opt/shark/shark.tgz /opt/shark/
    cd /opt/shark/
    tar -xzf shark.tgz
    ln -s shark-0.9.1-bin-cdh5 shark
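
    With more than a couple of workers, a loop along these lines saves typing (hostnames are placeholders; sudo over ssh may prompt for a password depending on your setup):

    # Replace worker1 worker2 with your actual worker hostnames
    for host in worker1 worker2; do
      ssh $host 'sudo mkdir -p /opt/shark && sudo chmod 777 /opt/shark'
      scp /opt/shark/shark.tgz $host:/opt/shark/
      ssh $host 'cd /opt/shark && tar -xzf shark.tgz && ln -s shark-0.9.1-bin-cdh5 shark'
    done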
    


4. Run
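
To start the Shark shell on the master, use the launcher scripts that ship with Shark (./bin/shark-withinfo starts the same shell with INFO-level logging, which helps when debugging):

cd /opt/shark/shark
./bin/shark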
5. Common Issues

Check here.
Please excuse my poor syntax and explanation of the process. I will keep improving this page as I learn more about blogging.
