Repost:
Setting Up a Spark Environment on Windows. I have been reading up on the Scala language recently and came across Spark, so yesterday afternoon at the office I set up a Spark environment. During the installation I ran into a problem that I could not find a solution for online, so I tracked down the cause myself. These are my notes.
1. The Spark download can be found on the official site at http://spark.incubator.apache.org/downloads.html. I installed the latest version available at the time of writing: 0.9.
2. After downloading, simply extract the archive to a path of your choice, e.g., d:/programs.
3. Install Scala and set the SCALA_HOME path; see the official Scala website for installation instructions.
4. Following the official Spark installation guide, run the following command in the extracted directory:
sbt/sbt package
That single command is all it takes. However, this applies to Linux and OS X; running the same command on Windows produces an error:
not a valid command
The cause is that the sbt script bundled with Spark cannot run on Windows. The fix is to download a Windows version of sbt (http://www.scala-sbt.org/), copy its files into the sbt directory under the Spark folder, and then rerun the command; the build then succeeds.
Let's try out spark-shell:
scala> val textFile = sc.textFile("README.md")
14/02/14 16:38:12 INFO MemoryStore: ensureFreeSpace(35480) called with curMem=177376, maxMem=308713881
14/02/14 16:38:12 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 34.6 KB, free 294.2 MB)

textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[16] at textFile at <console>:12

scala> textFile.count
14/02/14 16:38:14 INFO FileInputFormat: Total input paths to process : 1
14/02/14 16:38:14 INFO SparkContext: Starting job: count at <console>:15
14/02/14 16:38:14 INFO DAGScheduler: Got job 7 (count at <console>:15) with 1 output partitions (allowLocal=false)
14/02/14 16:38:14 INFO DAGScheduler: Final stage: Stage 7 (count at <console>:15)
14/02/14 16:38:14 INFO DAGScheduler: Parents of final stage: List()
14/02/14 16:38:14 INFO DAGScheduler: Missing parents: List()
14/02/14 16:38:14 INFO DAGScheduler: Submitting Stage 7 (MappedRDD[16] at textFile at <console>:12), which has no missing parents
14/02/14 16:38:14 INFO DAGScheduler: Submitting 1 missing tasks from Stage 7 (MappedRDD[16] at textFile at <console>:12)
14/02/14 16:38:14 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
14/02/14 16:38:14 INFO TaskSetManager: Starting task 7.0:0 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
14/02/14 16:38:14 INFO TaskSetManager: Serialized task 7.0:0 as 1560 bytes in 1 ms
14/02/14 16:38:14 INFO Executor: Running task ID 5
14/02/14 16:38:14 INFO BlockManager: Found block broadcast_5 locally
14/02/14 16:38:14 INFO HadoopRDD: Input split: file:/D:/program/spark-0.9.0-incubating/README.md:0+4491
14/02/14 16:38:14 INFO Executor: Serialized size of result for 5 is 563
14/02/14 16:38:14 INFO Executor: Sending result for 5 directly to driver
14/02/14 16:38:14 INFO Executor: Finished task ID 5
14/02/14 16:38:14 INFO TaskSetManager: Finished TID 5 in 6 ms on localhost (progress: 0/1)
14/02/14 16:38:14 INFO DAGScheduler: Completed ResultTask(7, 0)
14/02/14 16:38:14 INFO TaskSchedulerImpl: Remove TaskSet 7.0 from pool
14/02/14 16:38:14 INFO DAGScheduler: Stage 7 (count at <console>:15) finished in 0.009 s
14/02/14 16:38:14 INFO SparkContext: Job finished: count at <console>:15, took 0.012329265 s
res10: Long = 119

scala> textFile.first
14/02/14 16:38:24 INFO SparkContext: Starting job: first at <console>:15
14/02/14 16:38:24 INFO DAGScheduler: Got job 8 (first at <console>:15) with 1 output partitions (allowLocal=true)
14/02/14 16:38:24 INFO DAGScheduler: Final stage: Stage 8 (first at <console>:15)
14/02/14 16:38:24 INFO DAGScheduler: Parents of final stage: List()
14/02/14 16:38:24 INFO DAGScheduler: Missing parents: List()
14/02/14 16:38:24 INFO DAGScheduler: Computing the requested partition locally
14/02/14 16:38:24 INFO HadoopRDD: Input split: file:/D:/program/spark-0.9.0-incubating/README.md:0+4491
14/02/14 16:38:24 INFO SparkContext: Job finished: first at <console>:15, took 0.002671379 s
res11: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = FilteredRDD[17] at filter at <console>:14

scala> textFile.filter(line => line.contains("spark")).count
14/02/14 16:38:37 INFO SparkContext: Starting job: count at <console>:15
14/02/14 16:38:37 INFO DAGScheduler: Got job 9 (count at <console>:15) with 1 output partitions (allowLocal=false)
14/02/14 16:38:37 INFO DAGScheduler: Final stage: Stage 9 (count at <console>:15)
14/02/14 16:38:37 INFO DAGScheduler: Parents of final stage: List()
14/02/14 16:38:37 INFO DAGScheduler: Missing parents: List()
14/02/14 16:38:37 INFO DAGScheduler: Submitting Stage 9 (FilteredRDD[18] at filter at <console>:15), which has no missing parents
14/02/14 16:38:37 INFO DAGScheduler: Submitting 1 missing tasks from Stage 9 (FilteredRDD[18] at filter at <console>:15)
14/02/14 16:38:37 INFO TaskSchedulerImpl: Adding task set 9.0 with 1 tasks
14/02/14 16:38:37 INFO TaskSetManager: Starting task 9.0:0 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
14/02/14 16:38:37 INFO TaskSetManager: Serialized task 9.0:0 as 1642 bytes in 0 ms
14/02/14 16:38:37 INFO Executor: Running task ID 6
14/02/14 16:38:37 INFO BlockManager: Found block broadcast_5 locally
14/02/14 16:38:37 INFO HadoopRDD: Input split: file:/D:/program/spark-0.9.0-incubating/README.md:0+4491
14/02/14 16:38:37 INFO Executor: Serialized size of result for 6 is 563
14/02/14 16:38:37 INFO Executor: Sending result for 6 directly to driver
14/02/14 16:38:37 INFO Executor: Finished task ID 6
14/02/14 16:38:37 INFO TaskSetManager: Finished TID 6 in 10 ms on localhost (progress: 0/1)
14/02/14 16:38:37 INFO DAGScheduler: Completed ResultTask(9, 0)
14/02/14 16:38:37 INFO TaskSchedulerImpl: Remove TaskSet 9.0 from pool
14/02/14 16:38:37 INFO DAGScheduler: Stage 9 (count at <console>:15) finished in 0.010 s
14/02/14 16:38:37 INFO SparkContext: Job finished: count at <console>:15, took 0.020335125 s
res12: Long = 7
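For reference, the same operations can also be packaged as a standalone program, in the spirit of the fourth screencast linked below. This is only a minimal sketch, assuming the Spark 0.9 Scala API with a "local" master; the object name SimpleApp and the README path are placeholders I made up for illustration, not part of the original post.

import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]) {
    // Adjust this path to wherever Spark was extracted (placeholder).
    val readmePath = "D:/program/spark-0.9.0-incubating/README.md"

    // "local" runs Spark in-process, which is enough for experiments on Windows.
    val sc = new SparkContext("local", "Simple App")
    val textFile = sc.textFile(readmePath)

    // The same operations that were run interactively in spark-shell above.
    println("Total lines: " + textFile.count())
    println("First line: " + textFile.first())
    println("Lines containing 'Spark': " + textFile.filter(line => line.contains("Spark")).count())

    sc.stop()
  }
}

To compile it you would declare a dependency on the spark-core artifact matching the installed version (0.9.0-incubating here); that dependency choice is my assumption, not something stated in the original post.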
The Spark website also offers four introductory screencasts, but YouTube is blocked in mainland China, so I have re-uploaded the four videos to Tudou for anyone who wants to watch them:
Spark Screencast 1 – Setting up a Spark environment
Spark Screencast 2 – Overview of the Spark documentation
Spark Screencast 3 – Transformations and caching
Spark Screencast 4 – A standalone job in Scala
-----------------------------------------------------
Silence, the way to avoid many problems;
Smile, the way to solve many problems;