2023 Data Collection and Fusion Technology Practice: Assignment 1

Published: 2024/5/19 19:22:41

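Task 1

Requirements

Specific requirements
Use the requests and BeautifulSoup libraries to crawl data from the given URL and print the university ranking information to the screen.
Output

Rank	University	Province/City	Type	Total score
1	清华大学	北京	综合	852.5
2...

Code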

import bs4 as bs
import urllib.request

url = "https://www.shanghairanking.cn/rankings/bcur/2020"
html = urllib.request.urlopen(url).read().decode('utf-8')
soup = bs.BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='rk-table')
tbody = table.find("tbody")
tr = tbody.find_all("tr")
limit = -1  # maximum number of rows to print; -1 means no limit
print("Rank", "University", "City", "Type", "Total score", sep="\t")
for rows in tr:
    if limit == 0:
        break
    # Rank and name have dedicated elements; the other fields are read
    # by their position counted from the end of the row
    rank = rows.find("div", class_='ranking').string.strip(" \n")
    name = rows.find("a", class_='name-cn').string.rstrip(" \n")
    city = rows.find_all("td")[-4].text.strip(" \n")
    type_ = rows.find_all("td")[-3].text.strip(" \n")
    score = rows.find_all("td")[-2].text.strip(" \n")
    print(rank, name, city, type_, score, sep="\t")
    limit -= 1
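
The row extraction above can also be written as a function that takes the HTML as a string, which makes it possible to check the parsing logic without hitting the site. The following is only a minimal sketch: `SAMPLE_HTML` and `parse_ranking` are hypothetical names, and the sample markup is an assumption that mimics the `rk-table` layout the code above relies on, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the layout the scraper expects:
# rank and name in dedicated elements, then city, type, score, one extra column.
SAMPLE_HTML = """
<table class="rk-table">
  <tbody>
    <tr>
      <td><div class="ranking"> 1 </div></td>
      <td><a class="name-cn">清华大学 </a></td>
      <td> 北京 </td>
      <td> 综合 </td>
      <td> 852.5 </td>
      <td> 38.2 </td>
    </tr>
  </tbody>
</table>
"""


def parse_ranking(html, limit=None):
    """Return (rank, name, city, type, score) tuples found in the ranking table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="rk-table")
    if table is None or table.find("tbody") is None:
        return []  # the page layout is not what we expected
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        if limit is not None and len(rows) >= limit:
            break
        cells = tr.find_all("td")  # look the cells up once per row
        rank = tr.find("div", class_="ranking").get_text(strip=True)
        name = tr.find("a", class_="name-cn").get_text(strip=True)
        city, type_, score = (cells[i].get_text(strip=True) for i in (-4, -3, -2))
        rows.append((rank, name, city, type_, score))
    return rows


if __name__ == "__main__":
    for row in parse_ranking(SAMPLE_HTML):
        print(*row, sep="\t")
```

Since the assignment asks for requests, the HTML passed to `parse_ranking` could come from `requests.get(url, timeout=10).text` instead of `urllib.request`; the parsing itself would stay the same.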

Reflections

I gained a deeper understanding of urllib and a first working understanding of web crawlers.

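Task 2

Requirements

Specific requirements
Use the requests and re libraries to build a targeted price-comparison crawler for an online store of your choice. Crawl the store's search results for the keyword "书包" (schoolbag) and extract the product names and prices.
Output

No.	Price	Product name
1	60.5	...
2...

Code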

import requests, json, os, re
from copy import deepcopy

url = "https://api2.order.mi.com/search/index"
headers = {
    'Referer': 'https://www.mi.com/',
}
params = {
    "query": "家电",  # search keyword ("home appliances"); the assignment asks for "书包"
    "page_index": 1,
    "page_size": 20,
    "filter_tag": 0,
    "classIndex": 0,
    "callback": "__jp6"
}


def save_img(img_url, img_name):
    # Determine the image format from the URL's file extension
    extension = img_url.split('.')[-1]
    response = requests.get(img_url, headers=headers)
    with open(img_name + "." + extension, 'wb') as f:
        f.write(response.content)


def get_data(url, headers, params):
    response = requests.request("GET", url, headers=headers, params=params)
    # Strip the leading "__jp6(" and the trailing ");" so the body is standard JSON
    response_text = response.text[6:-2]
    js = json.loads(response_text)
    return js


def get_img(limit):
    _params = deepcopy(params)
    img_list = []
    total_page = get_data(url, headers, _params)["data"]["total"] // 20 + 1
    count = 0
    for i in range(1, total_page + 1):
        if count >= limit:
            break
        _params["page_index"] = i
        js = get_data(url, headers, _params)
        # Each entry of pc_list holds a group of commodities
        for group in js["data"]["pc_list"]:
            for j in group["commodity_list"]:
                if count >= limit:
                    break
                img_list.append({"name": j["name"], "price": j["price"]})
                print(str(count) + " name:" + j["name"] + " price:" + j["price"])
                count += 1
    return img_list


img_list = get_img(100)
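
The fixed slice `[6:-2]` in `get_data` only works while the callback name is exactly `__jp6` and the response ends in `);`. Since the assignment calls for the re library, the unwrapping step could be sketched with a regular expression instead. The helper names `unwrap_jsonp` and `page_count` below are hypothetical, and the sample response in the usage lines is made up for illustration; only the `data` and `total` field names come from the code above.

```python
import json
import math
import re

# Matches "callback( ... )" with an optional trailing ";" and captures the payload.
JSONP_PATTERN = re.compile(r"^\s*[\w$.]+\s*\(\s*(.*)\s*\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Parse a JSONP response such as '__jp6({...});' (or plain JSON) into Python data."""
    match = JSONP_PATTERN.match(text)
    payload = match.group(1) if match else text  # fall back to plain JSON
    return json.loads(payload)


def page_count(total, page_size=20):
    """Number of pages needed for `total` items, without an extra empty page."""
    return math.ceil(total / page_size)


if __name__ == "__main__":
    # Made-up response body, shaped like the fields read by get_img above
    sample = '__jp6({"data": {"total": 41, "pc_list": []}});'
    data = unwrap_jsonp(sample)
    print(page_count(data["data"]["total"]))
```

Note that `total // 20 + 1` requests one page too many whenever the total is an exact multiple of 20, which is why the sketch uses a ceiling division.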

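Reflections

I learned more Python, became much more familiar with the common crawling libraries, and gained some understanding of how a website's API works.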

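Task 3

Requirements

Specific requirements
Crawl all JPEG and JPG files from a given web page or a page of your choice.
Output
Save all JPEG and JPG files from the chosen page in a single folder.

Code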

import urllib.request
import os, requests, re

url = "https://xcb.fzu.edu.cn/info/1071/4481.htm"
base_url = "https://xcb.fzu.edu.cn"
img_tag_pattern = r'<img[^>]+>'
img_src_pattern = r'src="([^"]*)"'
html = urllib.request.urlopen(url).read().decode('utf-8')
img_tag = re.findall(img_tag_pattern, html)


def save_img(img_url, img_name):
    # Determine the image format from the URL's file extension
    extension = img_url.split('.')[-1]
    response = requests.get(img_url)
    with open(img_name + "." + extension, 'wb') as f:
        f.write(response.content)


# Collect the src attribute of every <img> tag on the page
images_list = []
for i in img_tag:
    images_list.append(re.findall(img_src_pattern, i)[0])
    print(re.findall(img_src_pattern, i)[0])

# Save everything into the "down4" folder
if not os.path.exists("down4"):
    os.mkdir("down4")
os.chdir("down4")
count = 1
for i in images_list:
    save_img(base_url+i, str(count))
    print("Downloading: " + str(count))
    count += 1
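
The code above downloads every `<img>` it finds and assumes each `src` is a path that can be appended to `base_url`. A possible refinement, sketched here with the standard library only, is to resolve each `src` against the page URL and keep only JPEG/JPG files, as the assignment asks. The function name `find_jpeg_urls` and the HTML in the usage lines are hypothetical and only illustrate the idea.

```python
import os
import re
from urllib.parse import urljoin, urlparse

IMG_TAG_PATTERN = re.compile(r"<img[^>]+>", re.IGNORECASE)
# The lookbehind skips attributes such as data-src that merely end in "src"
IMG_SRC_PATTERN = re.compile(r'(?<![\w-])src\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
JPEG_EXTENSIONS = {".jpg", ".jpeg"}


def find_jpeg_urls(html, page_url):
    """Return absolute URLs of the JPEG/JPG images referenced by <img> tags."""
    urls = []
    for tag in IMG_TAG_PATTERN.findall(html):
        match = IMG_SRC_PATTERN.search(tag)
        if match is None:
            continue  # <img> tag without a src attribute
        # urljoin handles absolute URLs, root-relative and page-relative paths
        absolute = urljoin(page_url, match.group(1))
        # Take the extension from the URL path so query strings are ignored
        extension = os.path.splitext(urlparse(absolute).path)[1].lower()
        if extension in JPEG_EXTENSIONS and absolute not in urls:
            urls.append(absolute)
    return urls


if __name__ == "__main__":
    # Made-up snippet; the page URL is the one used in the code above
    sample = '<img src="/images/photo1.jpg"><img src="logo.png">'
    for found in find_jpeg_urls(sample, "https://xcb.fzu.edu.cn/info/1071/4481.htm"):
        print(found)
```

The resulting list could then be passed to `save_img` in place of `base_url+i`.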

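Reflections

This task left me with a much firmer grasp of regular expressions, and applying them in practice gave me some experience with the re library.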
