Reading Parquet Data from the Spark Command Line

Inspect the HDFS data

[root@node-master]# hadoop fs -ls /
Found 12 items
drwxrwxrwx   - hdfs   hadoop            0 2020-11-24 17:59 /app-logs
drwxrwxrwx   - hdfs   hadoop            0 2020-11-24 17:59 /ats
drwxr-xr-x   - hdfs   hadoop            0 2020-11-24 17:59 /datasets
drwxrwxrwx   - flink  hadoop            0 2020-11-24 18:00 /flink
drwxrwxrwx   - mapred hadoop            0 2020-11-24 17:59 /mr-history
drwxrwxrwx   - hdfs   hadoop            0 2020-11-24 17:59 /mrs
drwxrwxrwx   - hdfs   hadoop            0 2020-11-24 18:03 /tmp
drwxr-xr-x   - root   ficommon          0 2020-12-07 17:41 /aka
drwxrwxrwx   - hdfs   hadoop            0 2020-12-07 17:40 /user

Read the table

val db = spark.read.parquet("/aka/test")
db: org.apache.spark.sql.DataFrame = [value: string]
db.show(false)
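
Beyond show(), you can inspect the inferred schema or query the DataFrame with SQL by registering a temporary view. A minimal sketch in spark-shell (the view name tbl is illustrative, not from the original post):

```scala
// Print the schema Spark inferred from the parquet footer
db.printSchema()

// Register the DataFrame as a temporary view so it can be queried with SQL
db.createOrReplaceTempView("tbl")
spark.sql("SELECT * FROM tbl LIMIT 10").show(false)
```

Passing false to show() disables column truncation, which is useful when the value column holds long strings.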

Inspect the data

// Copy the files to HDFS first — I have already copied everything under /train_data/
// Start spark-shell and enter the following
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.read.parquet("/data/test/*.parquet")
// Print 150 rows
parquetFile.take(150).foreach(println)
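
Note that the legacy sqlContext.parquetFile(...) method was removed in Spark 2.x; in modern versions you read directly through the SparkSession and get back a DataFrame that supports the usual transformations. A sketch (the column name value is assumed from the schema shown earlier):

```scala
// Spark 2.x+ style: read parquet via the SparkSession
val df = spark.read.parquet("/data/test/*.parquet")

// Select a column, drop nulls, and show up to 20 untruncated rows
df.select("value").filter($"value".isNotNull).show(20, false)

// Count the total number of rows across all matched parquet files
df.count()
```

In spark-shell the $"col" syntax works out of the box because spark.implicits._ is imported automatically; in a standalone application you would need to import it yourself.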