Why does Spark ask for a main class when I submit a Python file?

I have a file sql2.py with the following contents:

from __future__ import print_function

import os
import pyspark.sql
import pyspark.sql.types

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

def main(sc):
    # Placeholder: the actual work currently happens at module scope below
    pass

if __name__ == "__main__":
    sc = SparkContext(appName="PythonSQL")
    sqlContext = SQLContext(sc)
    # Build a small RDD of Rows; the schema is inferred from the Row fields
    some_rdd = sc.parallelize([Row(name="John", age=19),
                               Row(name="Smith", age=23),
                               Row(name="Sarah", age=18)])
    some_df = sqlContext.createDataFrame(some_rdd)
    some_df.printSchema()
    # Register the DataFrame as a temporary table so it can be queried via SQL
    some_df.registerAsTable("some")
    teenagers = sqlContext.sql("SELECT name FROM some WHERE age >= 13 AND age <= 19")
    for each in teenagers.collect():
        print(each[0])
    main(sc)

On my machine, I change into /apps/.../spark/bin and run:

./spark-submit ~/.../SparkProj/sql2.py
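As I understand it, spark-submit takes its options before the application file, and a .py application should not need --class at all; that flag only applies to JVM jars. Roughly, the two forms look like this (the --master value, class name, and jar name here are placeholders, not from my setup):

./spark-submit --master local[2] ~/.../SparkProj/sql2.py
./spark-submit --class com.example.Main --master local[2] app.jar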

I get this output:

error: Must specify a main class with --class

I would expect that message if I were submitting a Java or Scala job, but it makes no sense for a Python job. Has anyone else run into this?

Also, the Spark version I am currently running is 1.0.0.
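To double-check which build a given spark-submit actually belongs to, the --version flag prints a version banner (present on the Spark 1.x builds I have seen; treat this as a sketch):

./spark-submit --version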


asked by BitPusher16, 18.04.2015


Answers (1)


Works fine here on spark-1.3.0: your Python script, saved here as ./so2.py, ran as-is, with no changes and none of the odd behavior you report on spark-1.0.0. See the output below.

I built spark-1.3.0 from source against openjdk-8. A full description of my Spark setup (probably enough to reproduce it exactly) is in this earlier Spark answer.
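For reference, the build amounts to something like the standard Maven invocation from the Spark 1.3 build docs; this is a sketch, and the JAVA_HOME path is a placeholder for wherever openjdk-8 lives on your machine:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
mvn -DskipTests clean package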

paul@ki6cq:~/spark/spark-1.3.0$ ./bin/spark-submit --master spark://192.168.1.10:7077 ./so2.py 
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/04/18 04:02:03 INFO SparkContext: Running Spark version 1.3.0
15/04/18 04:02:03 WARN Utils: Your hostname, ki6cq resolves to a loopback address: 127.0.1.1; using 192.168.1.105 instead (on interface wlan0)
15/04/18 04:02:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/04/18 04:02:03 INFO SecurityManager: Changing view acls to: paul
15/04/18 04:02:03 INFO SecurityManager: Changing modify acls to: paul
15/04/18 04:02:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(paul); users with modify permissions: Set(paul)
15/04/18 04:02:03 INFO Slf4jLogger: Slf4jLogger started
15/04/18 04:02:03 INFO Remoting: Starting remoting
15/04/18 04:02:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:32834]
15/04/18 04:02:03 INFO Utils: Successfully started service 'sparkDriver' on port 32834.
15/04/18 04:02:03 INFO SparkEnv: Registering MapOutputTracker
15/04/18 04:02:03 INFO SparkEnv: Registering BlockManagerMaster
15/04/18 04:02:03 INFO DiskBlockManager: Created local directory at /tmp/spark-b583f6a0-5dba-4989-b439-35f8b9ab412b/blockmgr-f4afaee1-5f89-4d78-8963-8a3793374805
15/04/18 04:02:03 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/04/18 04:02:03 INFO HttpFileServer: HTTP File server directory is /tmp/spark-e101e559-cc23-4d3c-b797-b5ec3849d3e0/httpd-f4642274-8197-4f3d-a589-fa6cf31c42e6
15/04/18 04:02:03 INFO HttpServer: Starting HTTP Server
15/04/18 04:02:03 INFO Server: jetty-8.y.z-SNAPSHOT
15/04/18 04:02:03 INFO AbstractConnector: Started [email protected]:54950
15/04/18 04:02:03 INFO Utils: Successfully started service 'HTTP file server' on port 54950.
15/04/18 04:02:03 INFO SparkEnv: Registering OutputCommitCoordinator
15/04/18 04:02:04 INFO Server: jetty-8.y.z-SNAPSHOT
15/04/18 04:02:04 INFO AbstractConnector: Started [email protected]:4040
15/04/18 04:02:04 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/04/18 04:02:04 INFO SparkUI: Started SparkUI at http://192.168.1.105:4040
15/04/18 04:02:04 INFO Utils: Copying /home/paul/spark/spark-1.3.0/./so2.py to /tmp/spark-9b38ddf2-2fa8-478b-a2fc-10930f774824/userFiles-b32ad0f3-8599-4a34-acb2-72a149424908/so2.py
15/04/18 04:02:04 INFO SparkContext: Added file file:/home/paul/spark/spark-1.3.0/./so2.py at http://192.168.1.105:54950/files/so2.py with timestamp 1429344124054
15/04/18 04:02:04 INFO AppClient$ClientActor: Connecting to master akka.tcp://[email protected]:7077/user/Master...
15/04/18 04:02:04 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20150418080203-0000
15/04/18 04:02:04 INFO AppClient$ClientActor: Executor added: app-20150418080203-0000/0 on worker-20150418080134-spark1-37597 (spark1:37597) with 8 cores
15/04/18 04:02:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150418080203-0000/0 on hostPort spark1:37597 with 8 cores, 512.0 MB RAM
15/04/18 04:02:04 INFO AppClient$ClientActor: Executor updated: app-20150418080203-0000/0 is now RUNNING
15/04/18 04:02:04 INFO AppClient$ClientActor: Executor updated: app-20150418080203-0000/0 is now LOADING
15/04/18 04:02:04 INFO NettyBlockTransferService: Server created on 58441
15/04/18 04:02:04 INFO BlockManagerMaster: Trying to register BlockManager
15/04/18 04:02:04 INFO BlockManagerMasterActor: Registering block manager 192.168.1.105:58441 with 265.1 MB RAM, BlockManagerId(<driver>, 192.168.1.105, 58441)
15/04/18 04:02:04 INFO BlockManagerMaster: Registered BlockManager
15/04/18 04:02:04 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/04/18 04:02:04 INFO SparkContext: Starting job: runJob at PythonRDD.scala:356
15/04/18 04:02:04 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:356) with 1 output partitions (allowLocal=true)
15/04/18 04:02:04 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:356)
15/04/18 04:02:04 INFO DAGScheduler: Parents of final stage: List()
15/04/18 04:02:04 INFO DAGScheduler: Missing parents: List()
15/04/18 04:02:04 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at RDD at PythonRDD.scala:42), which has no missing parents
15/04/18 04:02:04 INFO MemoryStore: ensureFreeSpace(3464) called with curMem=0, maxMem=278019440
15/04/18 04:02:04 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.4 KB, free 265.1 MB)
15/04/18 04:02:04 INFO MemoryStore: ensureFreeSpace(2592) called with curMem=3464, maxMem=278019440
15/04/18 04:02:04 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.5 KB, free 265.1 MB)
15/04/18 04:02:04 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.105:58441 (size: 2.5 KB, free: 265.1 MB)
15/04/18 04:02:04 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/04/18 04:02:04 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:839
15/04/18 04:02:04 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[1] at RDD at PythonRDD.scala:42)
15/04/18 04:02:04 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/04/18 04:02:06 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@spark1:56010/user/Executor#1681345948] with ID 0
15/04/18 04:02:06 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, spark1, PROCESS_LOCAL, 1380 bytes)
15/04/18 04:02:06 INFO BlockManagerMasterActor: Registering block manager spark1:44331 with 265.1 MB RAM, BlockManagerId(0, spark1, 44331)
15/04/18 04:02:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark1:44331 (size: 2.5 KB, free: 265.1 MB)
15/04/18 04:02:07 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1105 ms on spark1 (1/1)
15/04/18 04:02:07 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/04/18 04:02:07 INFO DAGScheduler: Stage 0 (runJob at PythonRDD.scala:356) finished in 3.165 s
15/04/18 04:02:07 INFO DAGScheduler: Job 0 finished: runJob at PythonRDD.scala:356, took 3.342403 s
15/04/18 04:02:07 INFO SparkContext: Starting job: runJob at PythonRDD.scala:356
15/04/18 04:02:07 INFO DAGScheduler: Got job 1 (runJob at PythonRDD.scala:356) with 1 output partitions (allowLocal=true)
15/04/18 04:02:07 INFO DAGScheduler: Final stage: Stage 1(runJob at PythonRDD.scala:356)
15/04/18 04:02:07 INFO DAGScheduler: Parents of final stage: List()
15/04/18 04:02:07 INFO DAGScheduler: Missing parents: List()
15/04/18 04:02:07 INFO DAGScheduler: Submitting Stage 1 (PythonRDD[3] at RDD at PythonRDD.scala:42), which has no missing parents
15/04/18 04:02:07 INFO MemoryStore: ensureFreeSpace(4872) called with curMem=6056, maxMem=278019440
15/04/18 04:02:07 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 265.1 MB)
15/04/18 04:02:07 INFO MemoryStore: ensureFreeSpace(3756) called with curMem=10928, maxMem=278019440
15/04/18 04:02:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.7 KB, free 265.1 MB)
15/04/18 04:02:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.105:58441 (size: 3.7 KB, free: 265.1 MB)
15/04/18 04:02:07 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/04/18 04:02:07 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/04/18 04:02:07 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (PythonRDD[3] at RDD at PythonRDD.scala:42)
15/04/18 04:02:07 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/04/18 04:02:07 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, spark1, PROCESS_LOCAL, 1380 bytes)
15/04/18 04:02:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on spark1:44331 (size: 3.7 KB, free: 265.1 MB)
15/04/18 04:02:07 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 57 ms on spark1 (1/1)
15/04/18 04:02:07 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
15/04/18 04:02:07 INFO DAGScheduler: Stage 1 (runJob at PythonRDD.scala:356) finished in 0.058 s
15/04/18 04:02:07 INFO DAGScheduler: Job 1 finished: runJob at PythonRDD.scala:356, took 0.064265 s
15/04/18 04:02:07 INFO SparkContext: Starting job: runJob at PythonRDD.scala:356
15/04/18 04:02:07 INFO DAGScheduler: Got job 2 (runJob at PythonRDD.scala:356) with 1 output partitions (allowLocal=true)
15/04/18 04:02:07 INFO DAGScheduler: Final stage: Stage 2(runJob at PythonRDD.scala:356)
15/04/18 04:02:07 INFO DAGScheduler: Parents of final stage: List()
15/04/18 04:02:07 INFO DAGScheduler: Missing parents: List()
15/04/18 04:02:07 INFO DAGScheduler: Submitting Stage 2 (PythonRDD[4] at RDD at PythonRDD.scala:42), which has no missing parents
15/04/18 04:02:07 INFO MemoryStore: ensureFreeSpace(4872) called with curMem=14684, maxMem=278019440
15/04/18 04:02:07 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.8 KB, free 265.1 MB)
15/04/18 04:02:07 INFO MemoryStore: ensureFreeSpace(3756) called with curMem=19556, maxMem=278019440
15/04/18 04:02:07 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.7 KB, free 265.1 MB)
15/04/18 04:02:07 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.105:58441 (size: 3.7 KB, free: 265.1 MB)
15/04/18 04:02:07 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/04/18 04:02:07 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/04/18 04:02:07 INFO DAGScheduler: Submitting 1 missing tasks from Stage 2 (PythonRDD[4] at RDD at PythonRDD.scala:42)
15/04/18 04:02:07 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/04/18 04:02:07 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, spark1, PROCESS_LOCAL, 1467 bytes)
15/04/18 04:02:07 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on spark1:44331 (size: 3.7 KB, free: 265.1 MB)
15/04/18 04:02:07 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 82 ms on spark1 (1/1)
15/04/18 04:02:07 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
15/04/18 04:02:07 INFO DAGScheduler: Stage 2 (runJob at PythonRDD.scala:356) finished in 0.083 s
15/04/18 04:02:07 INFO DAGScheduler: Job 2 finished: runJob at PythonRDD.scala:356, took 0.088082 s
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

15/04/18 04:02:08 INFO BlockManager: Removing broadcast 2
15/04/18 04:02:08 INFO BlockManager: Removing block broadcast_2_piece0
15/04/18 04:02:08 INFO MemoryStore: Block broadcast_2_piece0 of size 3756 dropped from memory (free 277999884)
15/04/18 04:02:08 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.1.105:58441 in memory (size: 3.7 KB, free: 265.1 MB)
15/04/18 04:02:08 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/04/18 04:02:08 INFO BlockManager: Removing block broadcast_2
15/04/18 04:02:08 INFO MemoryStore: Block broadcast_2 of size 4872 dropped from memory (free 278004756)
15/04/18 04:02:08 INFO BlockManagerInfo: Removed broadcast_2_piece0 on spark1:44331 in memory (size: 3.7 KB, free: 265.1 MB)
15/04/18 04:02:08 INFO ContextCleaner: Cleaned broadcast 2
15/04/18 04:02:08 INFO BlockManager: Removing broadcast 1
15/04/18 04:02:08 INFO BlockManager: Removing block broadcast_1
15/04/18 04:02:08 INFO MemoryStore: Block broadcast_1 of size 4872 dropped from memory (free 278009628)
15/04/18 04:02:08 INFO BlockManagerInfo: Removed broadcast_1_piece0 on spark1:44331 in memory (size: 3.7 KB, free: 265.1 MB)
15/04/18 04:02:08 INFO BlockManager: Removing block broadcast_1_piece0
15/04/18 04:02:08 INFO MemoryStore: Block broadcast_1_piece0 of size 3756 dropped from memory (free 278013384)
15/04/18 04:02:08 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.1.105:58441 in memory (size: 3.7 KB, free: 265.1 MB)
15/04/18 04:02:08 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/04/18 04:02:08 INFO ContextCleaner: Cleaned broadcast 1
15/04/18 04:02:08 INFO SparkContext: Starting job: collect at /home/paul/spark/spark-1.3.0/./so2.py:25
15/04/18 04:02:08 INFO DAGScheduler: Got job 3 (collect at /home/paul/spark/spark-1.3.0/./so2.py:25) with 2 output partitions (allowLocal=false)
15/04/18 04:02:08 INFO DAGScheduler: Final stage: Stage 3(collect at /home/paul/spark/spark-1.3.0/./so2.py:25)
15/04/18 04:02:08 INFO DAGScheduler: Parents of final stage: List()
15/04/18 04:02:08 INFO DAGScheduler: Missing parents: List()
15/04/18 04:02:08 INFO DAGScheduler: Submitting Stage 3 (MapPartitionsRDD[14] at collect at /home/paul/spark/spark-1.3.0/./so2.py:25), which has no missing parents
15/04/18 04:02:08 INFO MemoryStore: ensureFreeSpace(9040) called with curMem=6056, maxMem=278019440
15/04/18 04:02:08 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 8.8 KB, free 265.1 MB)
15/04/18 04:02:08 INFO MemoryStore: ensureFreeSpace(6262) called with curMem=15096, maxMem=278019440
15/04/18 04:02:08 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 6.1 KB, free 265.1 MB)
15/04/18 04:02:08 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.1.105:58441 (size: 6.1 KB, free: 265.1 MB)
15/04/18 04:02:08 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/04/18 04:02:08 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:839
15/04/18 04:02:08 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3 (MapPartitionsRDD[14] at collect at /home/paul/spark/spark-1.3.0/./so2.py:25)
15/04/18 04:02:08 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
15/04/18 04:02:08 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, spark1, PROCESS_LOCAL, 1380 bytes)
15/04/18 04:02:08 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 4, spark1, PROCESS_LOCAL, 1467 bytes)
15/04/18 04:02:08 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on spark1:44331 (size: 6.1 KB, free: 265.1 MB)
15/04/18 04:02:09 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 905 ms on spark1 (1/2)
15/04/18 04:02:09 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 4) in 907 ms on spark1 (2/2)
15/04/18 04:02:09 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
15/04/18 04:02:09 INFO DAGScheduler: Stage 3 (collect at /home/paul/spark/spark-1.3.0/./so2.py:25) finished in 0.908 s
15/04/18 04:02:09 INFO DAGScheduler: Job 3 finished: collect at /home/paul/spark/spark-1.3.0/./so2.py:25, took 0.916314 s
John
Sarah
answered by Paul, 18.04.2015

Comment: OK, so my problem was trying to run 1.3.0 code on Spark 1.0.0. Thanks a lot. (BitPusher16, 18.04.2015)
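A small guard like the one below can make such version mismatches fail fast instead of producing confusing errors. This is a hedged sketch: it assumes sc is an existing SparkContext, sc.version is only available from Spark 1.1 onward, and require_spark is a hypothetical helper, not part of any Spark API.

def require_spark(sc, major, minor):
    # Compare only the major.minor components of the running Spark version
    actual = tuple(int(x) for x in sc.version.split(".")[:2])
    if actual < (major, minor):
        raise RuntimeError("Need Spark %d.%d+, found %s" % (major, minor, sc.version))

# Example: fail early if the DataFrame API this script uses (Spark 1.3+) is unavailable
require_spark(sc, 1, 3)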