
Posts

Showing posts from 2014

spark java.lang.IllegalArgumentException: java.net.UnknownHostException: user

Today I faced an error while trying to use the Spark shell. This is how I resolved it.

scala> val file = sc.textFile("hdfs://...")
14/10/21 13:34:23 INFO MemoryStore: ensureFreeSpace(217085) called with curMem=0, maxMem=309225062
14/10/21 13:34:23 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 212.0 KB, free 294.7 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> file.count()
java.lang.IllegalArgumentException: java.net.UnknownHostException: user
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:237)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:141)

This error can be fixed by giving the proper NameNode hostname and port:

sc.textFile("hdfs://{hostname}:8020/{filepath}...")

scala> file.count()
14/10/21 13:44:23 INFO FileInputFormat: Tota…
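The exception appears because Spark treated "user" from the URI as if it were the NameNode hostname. If you are unsure which hostname and port to plug in, one way to look it up is to query the configured default filesystem. This is only a minimal sketch, assuming the Hadoop client tools and cluster configuration are installed on the machine running spark-shell; the hostname shown is a placeholder.

# Print the configured default filesystem, e.g. hdfs://namenode.example.com:8020
hdfs getconf -confKey fs.defaultFS

# Then use that prefix inside spark-shell (placeholder host shown):
#   scala> val file = sc.textFile("hdfs://namenode.example.com:8020/path/to/file")
#   scala> file.count()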

/lib/spark/bin/utils.sh: No such file or directory in CDH-5.2

In the latest version of CDH 5.2, trying to run spark-shell will hit this error.

user@spark-master:~# spark-shell
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/bin/../lib/spark/bin/spark-shell: line 44: /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/bin/utils.sh: No such file or directory

Solution: utils.sh can be downloaded from GitHub. I'm not sure this is the perfect solution, but things seem to be working after putting the file in place.
1. Get the file from https://github.com/apache/spark/blob/master/bin/utils.sh
2. Copy utils.sh to /opt/cloudera/parcels/CDH/lib/spark/bin/ (see the command sketch after this session)

user@spark-master:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/bin# spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/spark-assembly-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera…
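For reference, here is a minimal command-line sketch of steps 1 and 2 above. The raw URL is assumed to be the raw view of the GitHub link given in step 1, and the parcel path is an assumption that should be adjusted to your installation.

# Fetch utils.sh into the Spark bin directory of the active CDH parcel
# (path and raw URL are assumptions; if master no longer carries the file,
# use the branch matching the bundled Spark, e.g. branch-1.1).
cd /opt/cloudera/parcels/CDH/lib/spark/bin
curl -O https://raw.githubusercontent.com/apache/spark/master/bin/utils.sh
chmod +x utils.sh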

org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.

Recently installed the latest Cloudera Hadoop. The first issue I faced was while working with Hive.

Diagnostic Messages for this Task:
Container launch failed for container_1406173012885_0009_01_000021 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is 1406254943000 found 1406254938244
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
    at org.apache.had…

Common issues on Shark with CDH5-beta2

Issues on Shark with CDH5-beta2

1. IncompatibleClassChangeError: Implementing class

Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.ClassLoader.defineC…

SparkR : Where R meets Spark

R is one of the most widely used statistical tools. In recent years R has gained a lot of popularity, and it is forecast to overtake SPSS and SAS in the near future.

Pros:
- R is open source, used and supported by millions of people.
- Thousands of packages, most of which are used/developed by people working in statistics. (I feel most of the other products are developed by people who truly do not understand the end use case.)

Cons:
- It runs on a single core of the processor.
- It requires all data to be stored in RAM.
- It does not handle big data processing.

Here is where Spark comes in. Spark is designed for high-speed and real-time big data processing.
- High-speed processing, up to 100x faster than Hadoop.
- Ease of use: Shark, MLlib and Spark Streaming make Spark really powerful.

SparkR is a lightweight interface to Spark from R. This aids big data processing in R, and the results can be used in fu…

Upgrading to Cloudera 5 beta 2

How to upgrade Cloudera 5 beta 1 and install beta 2. There are two methods to upgrade Cloudera 5 beta 1 to beta 2. I'm using method B; those who can't uninstall can give method A a shot.

A. Upgrade manually (this is from the Cloudera forum):
1. Stop the CDH5b1 cluster in CM5b1.
2. Stop the Cloudera Monitoring Services.
3. Upgrade CM to 5b2.
4. Start CM.
5. Upgrade CDH to 5b2 (for packages, manually; for parcels, using the CM UI). If upgrading using parcels, DO NOT start the cluster.
6. (Only if YARN is present in a secure cluster) Go to "Settings" > "Administration", select ALL "yarn" principals and hit "Regenerate". Wait for the "Generate Credentials" command to finish. This is needed because CDH5b2 added the HTTP principal for YARN.
7. Start ZooKeeper.
8. Run the HDFS Metadata Upgrade command from the HDFS Actions menu.
9. Run the Oozie Database Upgrade command from the Oozie Actions menu.
10. Run the Oozie Install Sharelib command from the Oozie Actions menu.
11. Run the Upgrade Sqoop command from the Sqoop Actions menu.
12. Run the Upgrade H…

Spark: Next big thing in Big Data?

Spark has indeed gained a lot of popularity. Its mailing list is one of the most active among all the big data projects (which shows a lot of people are banging away at it). As one of the big data enthusiasts, I'm going to dig in ;-) Watch this space for benchmarking and optimization factors in Spark. Let's play with Spark.