A problem I ran into at work recently: RoaringBitmaps stored in a SequenceFile, where an oversized bitmap object triggered a bug in hadoop-common. This post covers the cause and the workaround.
1. The problem
Reading a SequenceFile whose value is a single Bitmap object fails with java.lang.NegativeArraySizeException:
$ hf -du -h /data/datacenter/bitmap/label/init/part-r-00000/data
688.5 M 2.0 G /data/datacenter/bitmap/label/init/part-r-00000/data
Reading the SequenceFile containing the large bitmap throws:
java.lang.NegativeArraySizeException
	at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:144)
	at org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:123)
	at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:179)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
	at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2245)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2218)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
2. The cause
This turns out to be a bug in the BytesWritable class in hadoop-common.
There is an int overflow when the new capacity is computed in BytesWritable.setSize:
public void setSize(int size) {
  if (size > getCapacity()) {
    setCapacity(size * 3 / 2);
  }
  this.size = size;
}
700 MB * 3 exceeds Integer.MAX_VALUE (~2 GB), so `size * 3` wraps around to a negative int. As a result you can write and serialize more than ~700 MB into a BytesWritable, but you cannot deserialize it back.
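To make the overflow concrete, here is a small standalone sketch (plain Java, no Hadoop dependency, class name `SetSizeOverflow` is just for illustration) that reproduces the arithmetic inside the old `setSize` and the fixed version side by side:

```java
public class SetSizeOverflow {
    public static void main(String[] args) {
        int size = 700 * 1024 * 1024;            // 734,003,200 bytes, roughly 700 MB

        // Old hadoop-common: the grow factor is computed in 32-bit int math.
        // 3 * 734,003,200 = 2,202,009,600 > Integer.MAX_VALUE, so it wraps negative.
        int oldCapacity = size * 3 / 2;
        System.out.println(oldCapacity);         // prints -1046478848
        // new byte[oldCapacity] is what throws NegativeArraySizeException

        // Fixed hadoop-common: promote to long first, then clamp to the int range.
        long fixedCapacity = Math.min(Integer.MAX_VALUE, (3L * size) / 2L);
        System.out.println(fixedCapacity);       // prints 1101004800
    }
}
```

The fix is simply doing the multiplication in 64-bit arithmetic before clamping, so any `size` up to Integer.MAX_VALUE yields a valid capacity.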
If you still want to use BytesWritable, one option is to set the capacity high enough up front, so that you can utilize the full 2 GB rather than only ~700 MB:
randomValue.setCapacity(numBytesToWrite);
randomValue.setSize(numBytesToWrite);  // will not resize now
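To see why pre-sizing sidesteps the bug, here is a minimal sketch with a hypothetical `ToyBytesWritable` class that copies the old resize logic (it is a stand-in for illustration, not the real Hadoop class): once the capacity already covers the requested size, `setSize` never enters the grow branch, so the overflowing multiplication is never evaluated.

```java
// Hypothetical stand-in mimicking the old BytesWritable resize logic,
// so the behavior can be checked without a Hadoop dependency.
public class ToyBytesWritable {
    private byte[] bytes = new byte[0];
    private int size = 0;

    public int getCapacity() { return bytes.length; }

    public void setCapacity(int capacity) {
        // new byte[negative] is exactly where NegativeArraySizeException comes from
        byte[] newBytes = new byte[capacity];
        System.arraycopy(bytes, 0, newBytes, 0, Math.min(size, capacity));
        bytes = newBytes;
    }

    public void setSize(int size) {
        if (size > getCapacity()) {
            setCapacity(size * 3 / 2);  // the overflowing grow path
        }
        this.size = size;
    }

    public static void main(String[] args) {
        ToyBytesWritable value = new ToyBytesWritable();
        int numBytesToWrite = 16 * 1024 * 1024;  // small here; same logic applies at 700 MB

        // Pre-sizing means setSize sees size <= capacity and skips the grow branch.
        value.setCapacity(numBytesToWrite);
        value.setSize(numBytesToWrite);          // will not resize now
        System.out.println(value.getCapacity()); // prints 16777216
    }
}
```

Note this helps when you control both sides of the serialization; the deserializer inside SequenceFile allocates its own BytesWritable, which is why the classpath fix below is needed for reading.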
This bug has been fixed in Hadoop recently, so newer versions should work even without that workaround:
public void setSize(int size) {
  if (size > getCapacity()) {
    // Avoid overflowing the int too early by casting to a long.
    long newSize = Math.min(Integer.MAX_VALUE, (3L * size) / 2L);
    setCapacity((int) newSize);
  }
  this.size = size;
}
3. The fix
The fix only landed in hadoop-common 2.8, while Spark 1.6 depends on hadoop-common 2.2.
The workaround is to change the classloading order: load the user-provided classpath first, then Spark's own classes. This is documented clearly on the official site; I should have read the docs first, and it is worth going through the official configuration page again. Following this thread, it is also instructive to look at how Spark sets its YARN parameters.
spark.driver.userClassPathFirst
false
(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only.
Specify:
--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
--jars /usr/local/hadoop/share/hadoop/common/hadoop-common-2.6.0-cdh5.5.2.jar,/home/mcloud/platform3/data-api_2.10-1.0.8.jar,/home/mcloud/platform3/JavaEWAH-1.0.2.jar,/home/mcloud/platform3/bitmap-ext_2.10-1.0.3.jar,/home/mcloud/platform3/commons-pool2-2.0.jar,/home/mcloud/platform3/jedis-2.5.1.jar,/home/mcloud/platform3/commons-dbutils-1.6.jar \
/home/xx/datacenter/jobcenter-job_2.10-1.0.jar \
Setting these three parameters solves the problem. I tried many approaches before thinking to check the official Spark documentation; an easy problem ended up far more complicated than it needed to be.