Incorrect data read in Spark standalone cluster mode

I have set up a Spark standalone cluster with 1 master and 2 workers. I launched my Spark application (a Java JAR) with spark-submit, and as expected the application runs and produces output on both worker machines.
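For context, the submission looked roughly like this (the master host, main class, and JAR path are placeholders, since the question does not include them):

    spark-submit \
      --master spark://master-host:7077 \
      --class com.example.DoubleNumbers \
      /Users/abc/spark/app.jar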

I can find part files on both worker machines.

The Spark job works as follows:

  1. Worker node 1 has a text file containing the numbers 1 to 100 at /Users/abc/spark/a.txt.

  2. Worker node 2 has another text file containing the numbers 101 to 200 at the same path, with the same file name.

My Spark job reads the data with textFile, multiplies each number by 2, and then saves the output.
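The question doesn't include the job itself; a minimal sketch of what is described might look like this (the class name and output path are assumptions):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DoubleNumbers {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("DoubleNumbers");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // A plain local path is resolved on whichever executor reads each
            // split, so each worker reads its own copy of a.txt.
            JavaRDD<String> lines = sc.textFile("/Users/abc/spark/a.txt");

            // Parse each line as an integer and multiply it by 2.
            JavaRDD<Integer> doubled = lines.map(line -> Integer.parseInt(line.trim()) * 2);

            // Each worker writes its part files to its own local filesystem.
            doubled.saveAsTextFile("/Users/abc/spark/output");

            sc.stop();
        }
    }

Note that textFile with a non-shared local path is read independently by each executor, which is presumably why the file was placed at the same path on both workers.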

The problem is what I found when I checked the output on both worker nodes:

On worker node 1, Spark ignored the numbers 50-60 in the text file. It considered only 1-49 and 61-100, and produced output on that machine containing 2-98 and 122-200, spread across 8 part files.

On the second worker node, it considered only the numbers 150-160, and produced output there containing 300-320, in 2 part files.

I'm not sure why Spark ignored the other portions of the input data.

Am I storing the data in the wrong format? Or is it because I'm not using HDFS?

Question from: https://stackoverflow.com/questions/65651022/incorrect-data-read-in-spark-standalone-cluster-mode


1 Answer

Waiting for answers.
