注:
本文需要用到pyspark模块,请自行在PyCharm中添加。
本文用到的数据会在下方代码块贴出 [ 请自行复制粘贴到本地文件 ]
Admin_Log
数据文件[ 请自行复制粘贴到本地文件 ]:
{"id":1,"timestamp":"2019-05-08T01:03.00Z","category":"平板电脑","areaName":"北京","money":"1450"}|{"id":2,"timestamp":"2019-05-08T01:01.00Z","category":"手机","areaName":"北京","money":"1450"}|{"id":3,"timestamp":"2019-05-08T01:03.00Z","category":"手机","areaName":"北京","money":"8412"} {"id":4,"timestamp":"2019-05-08T05:01.00Z","category":"电脑","areaName":"上海","money":"1513"}|{"id":5,"timestamp":"2019-05-08T01:03.00Z","category":"家电","areaName":"北京","money":"1550"}|{"id":6,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"杭州","money":"1550"} {"id":7,"timestamp":"2019-05-08T01:03.00Z","category":"电脑","areaName":"北京","money":"5611"}|{"id":8,"timestamp":"2019-05-08T03:01.00Z","category":"家电","areaName":"北京","money":"4410"}|{"id":9,"timestamp":"2019-05-08T01:03.00Z","category":"家具","areaName":"郑州","money":"1120"} {"id":10,"timestamp":"2019-05-08T01:01.00Z","category":"家具","areaName":"北京","money":"6661"}|{"id":11,"timestamp":"2019-05-08T05:03.00Z","category":"家具","areaName":"杭州","money":"1230"}|{"id":12,"timestamp":"2019-05-08T01:01.00Z","category":"书籍","areaName":"北京","money":"5550"} {"id":13,"timestamp":"2019-05-08T01:03.00Z","category":"书籍","areaName":"北京","money":"5550"}|{"id":14,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"北京","money":"1261"}|{"id":15,"timestamp":"2019-05-08T03:03.00Z","category":"电脑","areaName":"杭州","money":"6660"} {"id":16,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"天津","money":"6660"}|{"id":17,"timestamp":"2019-05-08T01:03.00Z","category":"书籍","areaName":"北京","money":"9000"}|{"id":18,"timestamp":"2019-05-08T05:01.00Z","category":"书籍","areaName":"北京","money":"1230"} {"id":19,"timestamp":"2019-05-08T01:03.00Z","category":"电脑","areaName":"杭州","money":"5551"}|{"id":20,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"北京","money":"2450"} {"id":21,"timestamp":"2019-05-08T01:03.00Z","category":"食品","areaName":"北京","money":"5520"}|{"id":22,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"北京","money":"6650"} {"id":23,"timestamp":"2019-05-08T01:03.00Z","category":"服饰","areaName":"杭州","money":"1240"}|{"id":24,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"天津","money":"5600"} {"id":25,"timestamp":"2019-05-08T01:03.00Z","category":"食品","areaName":"北京","money":"7801"}|{"id":26,"timestamp":"2019-05-08T01:01.00Z","category":"服饰","areaName":"北京","money":"9000"} {"id":27,"timestamp":"2019-05-08T01:03.00Z","category":"服饰","areaName":"杭州","money":"5600"}|{"id":28,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"北京","money":"8000"}|{"id":29,"timestamp":"2019-05-08T02:03.00Z","category":"服饰","areaName":"杭州","money":"7000"}
代码块
# 导包 from pyspark import SparkConf, SparkContext import os import json # os.environ["PYSPARK_PYTHON"] = "D:/program files/Python310/python.exe" # 创建SparkConf类对象 conf = SparkConf().setMaster("local[*]").setAppName("test_spark") # 基于SparkConf类对象创建SparkContext对象 sc = SparkContext(conf=conf) # TODO 需求1:城市销售额排名 # 1.1 读取文件得到RDD file_rdd = sc.textFile("E:/PycharmProjects/orders.txt") # 1.2 取出一个个JSON字符串 json_str_rdd = file_rdd.flatMap(lambda x: x.split("|")) # 1.3 将一个个JSON字符串转换为字典 dict_rdd = json_str_rdd.map(lambda x: json.loads(x)) # print(dict_rdd.collect()) # 1.4 取出城市和销售额集合 # (城市,销售额) city_with_money_rdd = dict_rdd.map(lambda x: (x['areaName'], int(x['money']))) # 方便作累加操作将money转换为int类型 # 1.5 按城市分组按销售额聚合 city_result_rdd = city_with_money_rdd.reduceByKey(lambda a, b: a + b) # 1.6 按销售额聚合结果进行排序 result1_rdd = city_result_rdd.sortBy(lambda x: x[1], ascending=False, numPartitions=1) print("城市销售额如下(降序):", result1_rdd.collect()) # 城市销售额如下(降序): [('北京', 91556), ('杭州', 28831), ('天津', 12260), ('上海', 1513), ('郑州', 1120)] # TODO 需求2:全部城市有哪些商品类别在售卖 # 2.1 取出全部的商品类别 category_rdd = dict_rdd.map(lambda x: x['category']).distinct() print("有如下商品类别在售卖:", category_rdd.collect()) # 有如下商品类别在售卖: ['平板电脑', '家电', '书籍', '手机', '电脑', '家具', '食品', '服饰'] # 2.2 对全部商品类别进行去重 [ 链式调用 ↑ ] # category_rdd = dict_rdd.map(lambda x: x['category']) # TODO 需求3:北京市有哪些商品类别在售卖 # 3.1 过滤北京市的数据 beijing_data_rdd = dict_rdd.filter(lambda x: x['areaName'] == '北京') # 3.2 取出全部商品类别 result3_rdd = beijing_data_rdd.map(lambda x: x['category']).distinct() print("北京市有如下商品类别在售卖:", result3_rdd.collect()) # 北京市有如下商品类别在售卖: ['平板电脑', '家电', '书籍', '手机', '电脑', '家具', '食品', '服饰'] # 3.3 进行商品类别去重 [ 链式调用 ↑ ] # result3_rdd = beijing_data_rdd.map(lambda x: x['category']) # result3_rdd.distinct()
三次[ print ]输出结果
- 需求1:城市销售额排名
城市销售额如下(降序): [('北京', 91556), ('杭州', 28831), ('天津', 12260), ('上海', 1513), ('郑州', 1120)]
- 需求2:全部城市有哪些商品类别在售卖
有如下商品类别在售卖: ['平板电脑', '家电', '书籍', '手机', '电脑', '家具', '食品', '服饰']
- 需求3:北京市有哪些商品类别在售卖
北京市有如下商品类别在售卖: ['平板电脑', '家电', '书籍', '手机', '电脑', '家具', '食品', '服饰']