Spark SQL reads Parquet tables and CSV tables differently

I have two external tables created in spark-sql. One uses the parquet file format, and the other uses the textfile file format.

When we print the query plan for the same query against these two tables, Spark handles the two tables differently.
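For readers who want to reproduce the comparison, here is a minimal PySpark sketch. The question does not show the exact command used, so `EXPLAIN EXTENDED` is an assumption (its output may not include the `Statistics(...)` annotations seen in the plans), the query is reconstructed from the `'Filter ('country = Korea)` node in the parsed plan, and the helper names `build_session`, `plan_query` and `show_plan` are hypothetical.

```python
def build_session():
    """Create a Hive-enabled SparkSession (assumes pyspark is installed)."""
    from pyspark.sql import SparkSession  # imported lazily, only when a session is needed
    return (
        SparkSession.builder
        .appName("plan-compare")   # hypothetical application name
        .enableHiveSupport()       # needed to see tables registered in the Hive metastore
        .getOrCreate()
    )


def plan_query(table, country="Korea"):
    """Build the EXPLAIN statement for the filter query (reconstructed, an assumption)."""
    return f"EXPLAIN EXTENDED SELECT * FROM {table} WHERE country = '{country}'"


def show_plan(spark, table):
    """Run EXPLAIN through spark.sql and return the plan text.

    The EXPLAIN result is a DataFrame with a single row and a single
    string column holding the whole plan.
    """
    rows = spark.sql(plan_query(table)).collect()
    text = rows[0][0]
    print(text)
    return text
```

With the tables from the question in place, `show_plan(build_session(), "test_p")` and `show_plan(build_session(), "test_p3")` would print the two plans for side-by-side comparison.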

Query plan output for the Parquet table:

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p
      +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet

== Optimized Logical Plan ==
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], Statistics(sizeInBytes=2.2 KB, hints=none)
+- Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=2.2 KB, hints=none)
   +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet, Statistics(sizeInBytes=2.2 KB, hints=none)

== Physical Plan ==
*FileScan parquet default.test_p[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/C:/dev/tests2/Country=Korea], PartitionCount: 1, PartitionFilters: [isnotnull(Country#9), (Country#9 = Korea)], PushedFilters: [], ReadSchema: struct<Address:string,Age:string,CustomerID:string,CustomerName:string,CustomerSuffix:string,Loca...

Query plan output for the CSV table:

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p3`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p3
      +- HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9]

== Optimized Logical Plan ==
Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=1134.0 B, rowCount=3, hints=none)
+- HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9], Statistics(sizeInBytes=9.6 KB, rowCount=128, hints=none)

== Physical Plan ==
HiveTableScan [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9], [isnotnull(country#9), (country#9 = Korea)]

Why is there such a difference? Spark version: 2.2.1

apache-spark apache-spark-sql cost-based-optimizer

Rajat Mishra 21.08.2018

Answers (1)

Logically, Spark does not treat the two tables differently.

But the two tables have different on-disk formats. Parquet is a column-oriented format, so Spark can take a different approach when reading it, for example pruning on Parquet.
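To make the point about a column-oriented format concrete, here is a toy Python sketch. It is not how Spark or Parquet are implemented, and the sample rows and the helper names `read_text` and `read_columnar` are made up for illustration. It only shows why a columnar layout lets a reader skip columns it does not need, while a delimited text layout has to parse every line in full.

```python
# Toy model only: made-up sample data, not Spark's or Parquet's actual implementation.
columns = ("Location", "Age", "Country")
rows = [
    ("Seoul", "31", "Korea"),
    ("Busan", "45", "Korea"),
    ("Tokyo", "28", "Japan"),
]

# Row-oriented text layout: one delimited line per record.
text_lines = [",".join(r) for r in rows]

# Column-oriented layout: one contiguous list per column.
columnar = {name: [r[i] for r in rows] for i, name in enumerate(columns)}


def read_text(lines, wanted):
    """Parse every field of every line, then keep only the wanted columns."""
    fields_parsed = 0
    out = []
    for line in lines:
        fields = line.split(",")
        fields_parsed += len(fields)  # the whole line must be split
        out.append(tuple(fields[columns.index(c)] for c in wanted))
    return out, fields_parsed


def read_columnar(store, wanted):
    """Touch only the wanted columns; the other columns are never read."""
    values_read = sum(len(store[c]) for c in wanted)
    out = list(zip(*(store[c] for c in wanted)))
    return out, values_read
```

Asking both readers for only the `Age` column returns the same values, but the text reader parses all nine fields while the columnar reader touches three. A `SELECT *` query like the one in the question requests every column, so this particular saving would not apply there.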

thebluephantom 21.08.2018
