
Posts

Upgrading nodejs in ubuntu 14.04

My machine had Node.js 5.x installed, and I had a lot of trouble updating it to 8.x. Below are the steps I followed to upgrade Node.js from 5.x to 8.x.

# Add the new source list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 68576280
sudo apt-add-repository "deb https://deb.nodesource.com/node_8.x $(lsb_release -sc) main"
sudo apt-get update

# Remove the previous installation
sudo apt-get purge nodejs npm

# Verify that the proper version is going to be installed
apt-cache policy nodejs

# Install the new version
sudo apt-get install -y nodejs
Recent posts

Spark & OpenStreetMap Data | How to read PBF data

Recently I started playing with OpenStreetMap data in Spark. Here are the steps to load the data:

1. Convert the PBF data into Parquet format using https://github.com/adrianulbona/osm-parquetizer
2. Read the data in Spark:

spark.sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

This ensures tags are properly read as strings instead of binary objects.
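For reference, here is a minimal spark-shell sketch of step 2. The output path is hypothetical, and the node columns (id, latitude, longitude, tags) are my assumption about the schema osm-parquetizer writes, so verify with printSchema first.

// spark-shell sketch: read the node Parquet file produced by osm-parquetizer
// the path and column names below are illustrative assumptions
spark.sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val nodes = spark.read.parquet("/data/osm/region.osm.pbf.node.parquet")
nodes.printSchema()  // confirm tags now come back as strings
nodes.select("id", "latitude", "longitude", "tags").show(5, false)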

Common Issues with Solr Data Import Handler (DIH)

1. Could not load driver: org.postgresql.Driver

org.apache.solr.common.SolrException; Full Import failed: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Could not load driver: org.postgresql.Driver

Solution: put the RDBMS driver (in my case the Postgres driver) in the $SOLR_HOME/dist folder and point to it in solrconfig.xml:

<lib dir="${solr.install.dir:../../../..}/dist/" regex="postgresql.*\.jar" />

2. ERROR StreamingSolrClients org.apache.solr.common.SolrException: Bad Request

request: http://host:7574/solr/collection_shard2_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fhost%3A8983%2Fsolr%2Fcollection_shard1_replica2%2F&wt=javabin&version=2
at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurr

solr 5.1 DIH

Recently I had to use the Data Import Handler to index data from a Postgres database. Unfortunately I encountered a few issues; I'm blogging the steps and the issues I faced.

Setting up DataImportHandler

Edit your solrconfig.xml to add the request handler:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

Create a data-config.xml file as follows and save it to the conf dir:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="org.postgresql.Driver" url="jdbc:postgresql://host:port/dbname" user="username" password="password"/>
  <document>
    <entity name="col_id" query="select * from report_ks">
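Once the handler and data-config.xml are in place, the import is triggered through the /dataimport endpoint (command=full-import to run it, command=status to poll progress). Here is a minimal Scala sketch; the host, port, and collection name are placeholders:

import scala.io.Source

// kick off a full import against a placeholder collection
val base = "http://localhost:8983/solr/mycollection/dataimport"
println(Source.fromURL(base + "?command=full-import").mkString)

// poll the handler to check whether the import has finished
println(Source.fromURL(base + "?command=status").mkString)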

rWordCloud - An htmlwidget interface for D3 word cloud

With htmlwidgets, it has become easy to bind D3 scripts to R; rWordCloud is one such package. To install rWordCloud:

require(devtools)
install_github('adymimos/rWordCloud')

The two main functions in rWordCloud are:

d3TextCloud - takes strings as input and performs a word count. Before counting, it does stemming and stop-word removal.

content <- c('R is a programming language and software environment for statistical computing and graphics open source',
             'The R language is widely used among statisticians and data miners for developing statistical software and data analysis',
             'Polls, surveys of data miners, and studies of scholarly literature databases show that R popularity has increased substantially in recent years',
             'languages programming study open source, analysis')
label <- c('a1','a2','a3','a4')
d3TextCloud(content = content, label = label)

d3Cloud - Function accepts word and it

spark java.lang.IllegalArgumentException: java.net.UnknownHostException: user

Today I faced an error while trying to use the Spark shell; this is how I resolved it.

scala> val file = sc.textFile("hdfs://...")
14/10/21 13:34:23 INFO MemoryStore: ensureFreeSpace(217085) called with curMem=0, maxMem=309225062
14/10/21 13:34:23 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 212.0 KB, free 294.7 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> file.count()
java.lang.IllegalArgumentException: java.net.UnknownHostException: user
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:237)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:141)

The exception appears because Hadoop parses the first segment after hdfs:// as the NameNode hostname, so a path like hdfs://user/... makes it look up a host named "user". The error can be fixed by giving the proper hostname and port:

sc.textFile("hdfs://{hostname}:8020/{filepath}...")

scala> file.count()
14/10/21 13:44:23 INFO FileInputFormat: Tota
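If you'd rather not hardcode the NameNode address, it can be pulled from the Hadoop configuration Spark already loads. A small spark-shell sketch, assuming fs.defaultFS is set in core-site.xml; the file path is hypothetical:

// read the NameNode URI from the Hadoop configuration instead of hardcoding it
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")  // e.g. hdfs://namenode:8020
val file = sc.textFile(defaultFs + "/user/me/input.txt")    // hypothetical path
file.count()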

/lib/spark/bin/utils.sh: No such file or directory in CDH-5.2

In the latest version of CDH 5.2, trying to run spark-shell will produce this error:

user@spark-master:~# spark-shell
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/bin/../lib/spark/bin/spark-shell: line 44: /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/bin/utils.sh: No such file or directory

Solution: utils.sh can be downloaded from GitHub. I'm not sure this is the perfect solution, but things seem to be working after putting the file in place.

1. Get the file from https://github.com/apache/spark/blob/master/bin/utils.sh
2. Copy utils.sh to /opt/cloudera/parcels/CDH/lib/spark/bin/

user@spark-master:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/bin# spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/spark-assembly-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera