Spark throws NegativeArraySizeException when reading a large RoaringBitmap object

A problem I ran into at work recently: RoaringBitmap objects are stored in a SequenceFile, and because the bitmap objects are very large they trigger a bug in hadoop-common. This post covers the cause and the fix.

1. Problem:

Reading a SequenceFile that holds a single large bitmap (the SequenceFile's value is the bitmap object) fails with java.lang.NegativeArraySizeException.

$ hf -du -h /data/datacenter/bitmap/label/init/part-r-00000/data

688.5 M 2.0 G /data/datacenter/bitmap/label/init/part-r-00000/data

Reading the SequenceFile containing the large bitmap throws:

java.lang.NegativeArraySizeException
    at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:144)
    at org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:123)
    at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:179)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2245)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2218)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
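For reference, a minimal sketch of the kind of read that hits this code path. The original job's code is not shown in this post, so the key class (NullWritable) and object name below are assumptions; the value class must be BytesWritable, since that is what fails in the trace, and newAPIHadoopFile matches the NewHadoopRDD frame:

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object ReadBitmapSequenceFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-bitmap"))

    // newAPIHadoopFile is the code path shown in the stack trace above
    // (NewHadoopRDD -> SequenceFileRecordReader -> BytesWritable.readFields).
    val rdd = sc.newAPIHadoopFile(
      "/data/datacenter/bitmap/label/init/part-r-00000/data",
      classOf[SequenceFileInputFormat[NullWritable, BytesWritable]],
      classOf[NullWritable],
      classOf[BytesWritable])

    // Any action that deserializes the value (collect, count, ...) goes through
    // BytesWritable.setSize() and fails on hadoop-common < 2.8 once a single
    // value exceeds roughly 700 MB.
    rdd.map { case (_, v) => v.getLength }.collect().foreach(println)

    sc.stop()
  }
}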

2. Cause

This turns out to be a bug in the BytesWritable class of the hadoop-common package.

There is an int overflow when the new capacity is computed in BytesWritable here:

public void setSize(int size) {
  if (size > getCapacity()) {
    setCapacity(size * 3 / 2);
  }
  this.size = size;
}

700 MB * 3 > 2 GB = int overflow! Because size * 3 is evaluated before the division, any size above Integer.MAX_VALUE / 3 (about 715 million bytes, i.e. roughly 700 MB) wraps around to a negative number, which setCapacity then uses to allocate a byte array.
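Checking the numbers against the ~688.5 MB file above (treating the on-disk size as a rough proxy for the value's byte length; the exact in-memory length may differ):

// Taking the ~688.5 MB file above as a rough proxy for the value's byte length:
val size: Int = (688.5 * 1024 * 1024).toInt   // 721_944_576 bytes

// What the buggy setSize() computes, all in 32-bit Int arithmetic:
// size * 3 = 2_165_833_728, which wraps past Int.MaxValue (2_147_483_647).
val capacity = size * 3 / 2                   // -1_064_566_784

// setCapacity() then effectively does new Array[Byte](capacity),
// hence the NegativeArraySizeException.
println(s"threshold = ${Int.MaxValue / 3} bytes, requested = $size, capacity = $capacity")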

As a result, you cannot deserialize (although you can still write and serialize) more than about 700 MB into a BytesWritable.

If you want to keep using BytesWritable, one option is to set the capacity high enough up front, so that you can use the full 2 GB instead of only ~700 MB:

randomValue.setCapacity(numBytesToWrite);
randomValue.setSize(numBytesToWrite); // will not resize now
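If you control the read loop yourself (for example, reading the file directly with SequenceFile.Reader instead of going through Spark's SequenceFileRecordReader), the same pre-sizing trick can be applied on the read side. A minimal sketch; the key class (NullWritable) and the 1.5 GB pre-allocated capacity are assumptions for illustration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, NullWritable, SequenceFile}

val conf = new Configuration()
val reader = new SequenceFile.Reader(conf,
  SequenceFile.Reader.file(new Path("/data/datacenter/bitmap/label/init/part-r-00000/data")))

val key   = NullWritable.get()          // assumption: the file's key type
val value = new BytesWritable()
// Pre-size the backing buffer so setSize() never needs to grow it,
// which keeps the buggy setCapacity(size * 3 / 2) path from running.
value.setCapacity(1500 * 1024 * 1024)   // assumption: no single value exceeds ~1.5 GB

while (reader.next(key, value)) {
  // value.getBytes / value.getLength now hold the serialized bitmap
  println(value.getLength)
}
reader.close()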

This bug was fixed in Hadoop recently, so on newer versions it works even without that workaround:

public void setSize(int size) {
  if (size > getCapacity()) {
    // Avoid overflowing the int too early by casting to a long.
    long newSize = Math.min(Integer.MAX_VALUE, (3L * size) / 2L);
    setCapacity((int) newSize);
  }
  this.size = size;
}

3. Solution:

hadoop-common only fixed this in 2.8, while Spark 1.6 depends on hadoop-common 2.2.

The solution is not to patch any code: it is to specify that the user-supplied classpath is loaded first, ahead of the classes bundled with Spark. The official documentation explains this clearly; I only have myself to blame for not checking it, and it is worth going back over the configuration page. Following this thread, you can also look at how Spark passes these settings through to YARN.

spark.driver.userClassPathFirst (default: false)

(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only.

Specify the following when submitting the job:

--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
--jars /usr/local/hadoop/share/hadoop/common/hadoop-common-2.6.0-cdh5.5.2.jar,/home/mcloud/platform3/data-api_2.10-1.0.8.jar,/home/mcloud/platform3/JavaEWAH-1.0.2.jar,/home/mcloud/platform3/bitmap-ext_2.10-1.0.3.jar,/home/mcloud/platform3/commons-pool2-2.0.jar,/home/mcloud/platform3/jedis-2.5.1.jar,/home/mcloud/platform3/commons-dbutils-1.6.jar \
/home/xx/datacenter/jobcenter-job_2.10-1.0.jar \

Specifying these three parameters solves the problem. I tried many other approaches before finally going back to the official Spark documentation; a simple problem ended up far more complicated than it needed to be.
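As a quick sanity check that the user-supplied hadoop-common was actually picked up, you can ask the classloader which jar BytesWritable came from. This snippet is not from the original job, just a simple verification idea; sc is assumed to be an existing SparkContext:

import org.apache.hadoop.io.BytesWritable

// On the driver: print the jar BytesWritable was loaded from
// (getCodeSource is non-null for classes loaded from a jar).
println(classOf[BytesWritable].getProtectionDomain.getCodeSource.getLocation)

// The same check on the executors:
sc.parallelize(1 to 1, 1)
  .map(_ => classOf[BytesWritable].getProtectionDomain.getCodeSource.getLocation.toString)
  .collect()
  .foreach(println)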

