Spark Installation and Hands-On Practice


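Since Hadoop will not be installed separately, choose a package pre-built with Hadoop in step 2 of the Spark download page. Here the newest option at the time, the build for Hadoop 2.7, was downloaded.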

Extract the tar archive, then run spark-shell.cmd from the bin folder.

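Run a sample that reads the README.md file in ${SPARK_HOME} and counts its words. The relative path used below assumes the shell was started from that directory.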

scala> val lines = sc.textFile("README.md")

lines: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:24

scala> val words = lines.flatMap(line => line.split(" "))

words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26

scala> val wordMap = words.map(word=>(word,1))

wordMap: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28

scala> val result = wordMap.reduceByKey((a,b) => a+b)

That completes a simple word count sample.
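The transcript stops at reduceByKey, which is a lazy transformation, so nothing is computed until an action such as result.collect() is called. To make the logic of the pipeline easier to follow, here is a minimal sketch of the same flatMap / map / reduce-by-key steps on ordinary Scala collections, with no Spark involved. The object name WordCountSketch and the helper wordCount are hypothetical names introduced only for this illustration, and groupBy plus sum stands in for Spark's reduceByKey.

```scala
// Plain-Scala analogue of the RDD word count above; it does not use Spark.
object WordCountSketch {
  // Hypothetical helper: counts whitespace-separated words across the given lines.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+").toSeq) // split every line into words
      .filter(_.nonEmpty)                        // drop empty tokens caused by leading whitespace
      .map(word => (word, 1))                    // pair each word with a count of 1
      .groupBy(_._1)                             // collect the pairs that share a word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // add up the 1s per word

  def main(args: Array[String]): Unit = {
    val sample = Seq("spark is fast", "spark is simple")
    // Print one "word<TAB>count" line per distinct word, sorted by word.
    wordCount(sample).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

In the shell itself, the equivalent last step would be something like result.collect().foreach(println), which forces the job to run and prints each (word, count) pair.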

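If the shell fails to start, refer to the notes below.

When Spark is installed on Windows, the error shown below is caused by the HADOOP_HOME environment variable not being set and by missing access permissions on the hive folder.

Setting HADOOP_HOME, downloading winutils.exe, and running it to fix the folder permissions resolves the problem.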

Error while running command to get file permissions : java.io.IOException: (null) entry in command string: null ls -F D:\tmp\hive

I used Spark 1.5.2 with Hadoop 2.6 and had similar problems. Solved by doing the following steps:

Download winutils.exe from the repository to some local folder, e.g. C:\hadoop\bin.
Set HADOOP_HOME to C:\hadoop.
Create c:\tmp\hive directory (using Windows Explorer or any other tool).
Open command prompt with admin rights.
Run C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive

With that, I am still getting some warnings, but no ERRORs and can ru
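Before retrying spark-shell, it can help to confirm that the environment matches the steps quoted above. The following is only a rough sketch under the assumption that winutils.exe sits in the bin folder under HADOOP_HOME, as in those steps. The names HadoopHomeCheck, winutilsPath, and diagnose are hypothetical and not part of Spark or Hadoop.

```scala
import java.nio.file.{Files, Path, Paths}

object HadoopHomeCheck {
  // Hypothetical helper: where winutils.exe is expected under a given Hadoop home.
  def winutilsPath(hadoopHome: String): Path =
    Paths.get(hadoopHome, "bin", "winutils.exe")

  // Returns a short message describing the state of the setup.
  def diagnose(hadoopHome: Option[String]): String = hadoopHome match {
    case None => "HADOOP_HOME is not set"
    case Some(home) =>
      val exe = winutilsPath(home)
      if (Files.isRegularFile(exe)) s"found $exe" else s"missing $exe"
  }

  def main(args: Array[String]): Unit =
    // Read HADOOP_HOME from the environment of the current process.
    println(diagnose(sys.env.get("HADOOP_HOME")))
}
```

Note that an environment variable set after a command prompt was opened is not visible in that prompt, so the check should be run from a newly opened one.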


IT.FARMER