Installing Shark in CDH5 beta2

This post walks through installing Shark 0.9 on CDH5-beta2.

Requirements for Shark

1. Hadoop Cluster
2. Spark

   Both Spark and Hadoop are installed as part of CDH5-beta2.

1. Download Shark source code 
sudo mkdir /opt/shark/
sudo chmod 777 /opt/shark/
# <shark-repo-url> stands for the URL of the Shark Git repository
git clone -b branch-0.9 <shark-repo-url> /opt/shark/shark-0.9.1-bin-cdh5

2. Build Shark

    Shark 0.9 requires JDK 7, which comes preinstalled with the CDH5 beta.

export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
cd /opt/shark/
ln -s shark-0.9.1-bin-cdh5 shark
cd shark
export SCALA_HOME=/opt/cloudera/parcels/CDH/lib/spark/
SHARK_HADOOP_VERSION=2.2.0-cdh5.0.0-beta-2 ./sbt/sbt package
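Since the build depends on JDK 7, it can be worth confirming which Java version JAVA_HOME actually points at before running sbt. The following is a minimal sketch of such a check; the function names java_major_version and detect_java_version are hypothetical helpers introduced here for illustration, not part of Shark or CDH.

```shell
# Hypothetical helper: extract the major Java version from a version
# string such as "1.7.0_55". For the legacy "1.x" numbering scheme the
# major version is the second dot-separated field.
java_major_version() {
    version=$1
    case "$version" in
        1.*) echo "$version" | cut -d. -f2 ;;
        *)   echo "$version" | cut -d. -f1 ;;
    esac
}

# Hypothetical helper: read the version string reported by the JDK under
# JAVA_HOME. `java -version` writes to stderr, e.g.: java version "1.7.0_55"
detect_java_version() {
    "$JAVA_HOME/bin/java" -version 2>&1 | head -n 1 | cut -d'"' -f2
}

# Usage (run before ./sbt/sbt package):
#   [ "$(java_major_version "$(detect_java_version)")" = "7" ] \
#       || echo "JAVA_HOME does not point at a JDK 7 installation"
```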

3. Configure Shark
  •  Hive
    Download AMPLab's Hive 0.11: hive-0.11.0-bin
  •  Scala 
    Download Scala-2.10.3
  •  Configure Scala and Hive 
    cd /opt/shark/shark
    mkdir dep
    cd dep
    tar xvf ~/scala-2.10.3.tgz
    ln -s scala-2.10.3 scala
    tar xvf ~/hive-0.11.0-bin.tgz
    ln -s hive-0.11.0-bin hive
    Hive needs some additional configuration, described in the AMPLab wiki. First copy all configuration files from the existing Hive installation to Shark:
     cp /etc/hive/conf/* /opt/shark/shark/conf/
    Then, as described in AMPLab's configuration notes, append the lines below to hive-site.xml

    Here master-address is the NameNode address.

  • Configure
    Set the following values in Shark's environment configuration:
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
    export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
    export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
    export SHARK_HOME=/opt/shark/shark
    export CDH_HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
    export HIVE_CONF_DIR=$SHARK_HOME/conf
    export SCALA_HOME=$SHARK_HOME/dep/scala
    export HIVE_HOME=$SHARK_HOME/dep/hive/
    export HIVE_CONF_DIR="$HIVE_HOME/conf"
    export MASTER=spark://xxxxxxxx:7077
  • For Parquet support
    cd /opt/shark/shark/lib
  • Distribute Shark to all the worker nodes.
    # On the master:
    cd /opt/shark/
    tar -czf shark.tgz shark-0.9.1-bin-cdh5/
    # On each worker:
    sudo mkdir /opt/shark/
    sudo chmod 777 /opt/shark/
    scp user@master:/opt/shark/shark.tgz /opt/shark/
    cd /opt/shark/
    tar -xzf shark.tgz
    ln -s shark-0.9.1-bin-cdh5 shark
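With more than a couple of workers, repeating the steps above by hand gets tedious. Below is a rough sketch of driving the same steps from the master in a loop. Note that it pushes the archive from the master rather than pulling it from each worker, which is a variation on the steps above. The host names in WORKERS, the SSH_USER value, and the run and distribute_shark functions are all hypothetical, and the sketch assumes passwordless SSH and sudo on every worker. It defaults to a dry run that only prints the commands.

```shell
# Hypothetical worker host names and SSH user; replace with your own.
WORKERS="worker1 worker2"
SSH_USER="user"

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# For each worker: create /opt/shark, copy the archive, unpack and link it.
distribute_shark() {
    for host in $WORKERS; do
        run ssh "$SSH_USER@$host" "sudo mkdir -p /opt/shark && sudo chmod 777 /opt/shark"
        run scp /opt/shark/shark.tgz "$SSH_USER@$host:/opt/shark/"
        run ssh "$SSH_USER@$host" "cd /opt/shark && tar -xzf shark.tgz && ln -s shark-0.9.1-bin-cdh5 shark"
    done
}

# Usage:
#   distribute_shark              # dry run, prints the commands
#   DRY_RUN=0 distribute_shark    # actually runs them
```

Keeping the dry run as the default makes it easy to review the exact commands before anything touches the workers.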

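Once Shark is unpacked on a node, a quick way to catch path mistakes is to check that every directory exported in the Configure step exists there. This is a small sketch of such a check; check_dirs is a hypothetical helper name, and the variable list in the usage comment simply mirrors the exports shown earlier.

```shell
# Hypothetical helper: for each variable NAME given as an argument, report
# it if its value is empty or is not an existing directory.
# Returns non-zero when at least one variable fails the check.
check_dirs() {
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ] || [ ! -d "$value" ]; then
            echo "missing: $name=$value"
            missing=1
        fi
    done
    return $missing
}

# Usage, after the exports from the Configure step are in effect:
#   check_dirs JAVA_HOME HADOOP_HOME SPARK_HOME SHARK_HOME \
#              SCALA_HOME HIVE_HOME HIVE_CONF_DIR
```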
4. Run

5. Common Issues
Check here

Please excuse the rough formatting and explanation of the process. I will improve this page as I learn more about blogging.

