Spark: Container killed by YARN for exceeding memory limits
I'm running Spark on AWS EMR, doing a simple PageRank computation on an 8 GB dataset.
I use a cluster of 6 m3.xlarge nodes, each with 16 GB of memory.
Here is my configuration:
spark.executor.instances  4
spark.executor.cores      8
spark.driver.memory       10473m
spark.executor.memory     9658m
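For reference, this is a sketch of how those settings map onto standard `spark-submit` flags; the jar name is a placeholder, not the actual build artifact:

```shell
# Hypothetical launch command; pagerank.jar is a placeholder name.
spark-submit \
  --class PageRank \
  --num-executors 4 \
  --executor-cores 8 \
  --driver-memory 10473m \
  --executor-memory 9658m \
  pagerank.jar
```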
I'm new to Spark and have no intuition for how memory-consuming it is. Is my cluster too small? Is there a hard memory limit I'm hitting, or is a small amount of memory fine and the computation just runs slowly?

Here is the code (for our homework we're not allowed to use GraphX, only plain RDD operations):
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object PageRank {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageRank")
    val sc = new SparkContext(conf)
    val iterNum = 10

    val file = sc.textFile("hdfs:///input")

    // every node that appears anywhere in the edge list
    val all = file.flatMap { line => line.split("\t") }.distinct()
    // nodes that have at least one outgoing edge
    val contributor = file.map(line => line.split("\t")(0)).distinct()
    // dangling nodes: no outgoing edges
    val dangling = all.subtract(contributor)
    // treat each dangling node as linking to every node
    val graphDangling = dangling.cartesian(all).groupByKey()

    val graph0 = file.map { line =>
      val temp = line.split("\t")
      (temp(0), temp(1))
    }.distinct().groupByKey()

    val graph = graph0.union(graphDangling)
    graph.cache()

    var ranks = graph.mapValues { _ => 1.0 }

    for (i <- 0 until iterNum) {
      val contriReceive = graph.join(ranks).values.flatMap {
        case (followees, rank) =>
          val size = followees.size
          followees.map(followee => (followee, rank / size))
      }
      ranks = contriReceive.reduceByKey(_ + _).mapValues { x => 0.15 + 0.85 * x }
    }

    val result = ranks.map { case (user, rank) => user + "\t" + rank }
    result.saveAsTextFile("hdfs:///pagerank-output")
    sc.stop()
  }
}
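To make the question concrete, here is a minimal pure-Scala sketch (no Spark) of the same per-iteration update rule on plain collections, which I used to sanity-check the logic on a toy graph; the object name, graph, and helper `step` are made up for illustration only:

```scala
// Hypothetical sanity-check harness; not part of the Spark job above.
object PageRankCheck {
  // graph: node -> its followees (dangling nodes already expanded to link everywhere)
  def step(graph: Map[String, Seq[String]], ranks: Map[String, Double]): Map[String, Double] = {
    // each node splits its rank evenly among its followees
    val contribs = graph.toSeq.flatMap { case (node, followees) =>
      val rank = ranks(node)
      val size = followees.size
      followees.map(f => (f, rank / size))
    }
    // sum received contributions and apply the damping formula
    contribs.groupBy(_._1).map { case (node, cs) =>
      (node, 0.15 + 0.85 * cs.map(_._2).sum)
    }
  }

  def main(args: Array[String]): Unit = {
    // toy graph: a -> {b, c}, b -> {c}, c is dangling so it links to everyone
    val graph = Map(
      "a" -> Seq("b", "c"),
      "b" -> Seq("c"),
      "c" -> Seq("a", "b", "c"))
    var ranks = graph.map { case (k, _) => (k, 1.0) }
    for (_ <- 0 until 10) ranks = step(graph, ranks)
    println(ranks)
  }
}
```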
I'm also not sure about RDDs and memory management. There are many intermediate RDDs here; do I have to explicitly release them to free resources? If so, how? Do I just assign null and let the GC deal with it?