SQL Query in Spark/Scala: Size Exceeds Integer.MAX_VALUE


No Spark shuffle block can be larger than 2 GB (Integer.MAX_VALUE bytes), so you need more, smaller partitions.

You should adjust spark.default.parallelism and spark.sql.shuffle.partitions (default 200) so that the number of partitions can accommodate your data without hitting the 2 GB limit (you could aim for roughly 256 MB per partition, so for 200 GB you would get about 800 partitions). Thousands of partitions is very common, so don't be afraid to repartition to 1000 as suggested.
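As a rough sketch of how those settings can be applied (the app name and the value of 800 are just the rule-of-thumb numbers from above, not a recommendation):

import org.apache.spark.sql.SparkSession

// 200 GB of shuffle data at ~256 MB per partition works out to roughly 800 partitions.
val spark = SparkSession.builder()
  .appName("large-shuffle-job")
  .config("spark.sql.shuffle.partitions", "800")  // shuffles from DataFrame/SQL operations
  .config("spark.default.parallelism", "800")     // shuffles from RDD operations
  .getOrCreate()

// spark.sql.shuffle.partitions can also be changed later at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "800")

Note that spark.default.parallelism is read when the SparkContext is created, so set it at startup (or via --conf on spark-submit) rather than mid-job.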

FYI, you can check the number of partitions of an RDD with rdd.getNumPartitions (e.g. d2.rdd.getNumPartitions).

There's a JIRA issue tracking the effort to address the various 2 GB limits (it has been open for a while now): https://issues.apache.org/jira/browse/SPARK-6235

See http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications/25 for more info on this error.

Spark error: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

The problem here is that the ShuffleRDD block size after it materializes is greater than 2 GB; Spark has this limitation. You need to change the spark.sql.shuffle.partitions parameter, which is set to 200 by default.

You may also need to increase the number of partitions that your dataset has. Repartition it and save it first, then read the new dataset and perform the operation:

spark.sql("SET spark.sql.shuffle.partitions = 10000")
dataset.repartition(10000).write.parquet("/path/to/hdfs")
val newDataset = spark.read.parquet("/path/to/hdfs")
newDataset.filter(...).count

Alternatively, if you want to use a Hive table:

spark.sql("SET spark.sql.shuffle.partitions = 10000")
dataset.repartition(10000).write.saveAsTable("newTableName")
val newDataset = spark.table("newTableName")
newDataset.filter(...).count

Spark IllegalArgumentException: Size exceeds Integer.MAX_VALUE

It seems that there is an open issue for it...

https://issues.apache.org/jira/browse/SPARK-1476

From the description:

The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of a block to 2 GB.
This has implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped blocks are limited to 2 GB even though the API allows for long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
This is a severe limitation for the use of Spark on non-trivial datasets.

Why did I get the error "Size exceeds Integer.MAX_VALUE" when using Spark + Cassandra?

No Spark shuffle block can be greater than 2 GB.

Spark uses ByteBuffer as the abstraction for storing blocks, and its size is limited by Integer.MAX_VALUE (about 2.1 billion bytes).

A low number of partitions can lead to large shuffle blocks. To solve this issue, try increasing the number of partitions using rdd.repartition() or rdd.coalesce(n, shuffle = true) (plain coalesce can only reduce the partition count).
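As a minimal sketch of the two RDD APIs (the input path and the partition counts are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sketch").getOrCreate()
val rdd = spark.sparkContext.textFile("/path/to/input")  // hypothetical input

val repartitioned = rdd.repartition(2000)               // full shuffle; can increase or decrease partitions
val merged        = rdd.coalesce(200)                   // no shuffle; can only decrease partitions
val rebalanced    = rdd.coalesce(2000, shuffle = true)  // behaves like repartition(2000)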

If this doesn't help, it means that at least one of your partitions is still too big and you may need a more sophisticated approach to make it smaller, for example using randomness to equalize the distribution of RDD data between individual partitions.
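One common version of that trick is key salting. The sketch below is only illustrative (the String/Int key-value types, the salt factor of 100, and the sum aggregation are all assumptions, not part of the original answer):

import scala.util.Random
import org.apache.spark.rdd.RDD

// Spread each hot key over saltFactor random sub-keys so no single partition has to hold all of its rows.
def saltedSum(keyedRdd: RDD[(String, Int)], saltFactor: Int = 100): RDD[(String, Int)] = {
  val salted = keyedRdd.map { case (k, v) => ((k, Random.nextInt(saltFactor)), v) }
  salted.reduceByKey(_ + _)                         // partial sums per (key, salt)
        .map { case ((k, _), v) => (k, v) }         // drop the salt
        .reduceByKey(_ + _)                         // combine the partial sums per original key
}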


