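I am trying to crawl a list of URLs contained in a CSV file. The URLs are in the 6th column of the CSV. The URL format is: https://www.targetdomain.com/mainDirectoryName/subDirectoryName/pageName.
The code below does not read the data from the CSV correctly. Where is my coding mistake?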
    list_of_urls = open(filename).read()
    for i in range(6,len(list_of_urls)):
        try:
            url=str(list_of_urls[i][0])
            #crawl urls
            secondCrawlRequest = requests.get(url, headers=http_headers, timeout=5)
            raw_html = secondCrawlRequest.text
        except requests.ConnectionError as e:
            logging.exception(e)
        except requests.HTTPError as e:
            logging.exception(e)
        except requests.Timeout as e:
            logging.exception(e)
        except requests.RequestException as e:
            logging.exception(e)
            sys.exit(1)

Posted on 2016-03-20 18:20:43
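You should use csv.reader. In the question's code, open(filename).read() returns the whole file as a single string, so list_of_urls[i][0] is one character rather than a URL, and range(6, ...) skips the first six characters rather than selecting the 6th column.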
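To make the difference concrete, here is a minimal sketch that runs both kinds of indexing against the same in-memory text. The sample CSV content and the names SAMPLE_CSV, first_item, and sixth_column are made up for illustration; only the URL format comes from the question.

```python
import csv
import io

# Made-up sample data: a header row plus one data row, URL in the 6th column.
SAMPLE_CSV = (
    "id,name,a,b,c,url\n"
    "1,foo,x,y,z,https://www.targetdomain.com/mainDirectoryName/subDirectoryName/pageName\n"
)

# What the question's code effectively does: index into one big string.
raw_text = SAMPLE_CSV               # same type as open(filename).read()
first_item = str(raw_text[6][0])    # a single character, not a URL

# What csv.reader does: yield each row as a list of column values.
reader = csv.reader(io.StringIO(SAMPLE_CSV))
next(reader)                        # skip the header row
sixth_column = [row[5] for row in reader]
```

Here first_item is a one-character string, while sixth_column holds the full URL from each data row.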
    import csv
    import logging
    import sys

    import requests

    # filename and http_headers are defined as in the question
    with open(filename, newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            try:
                # 0-based column numbering, so the 6th column is index 5
                response = requests.get(row[5], headers=http_headers, timeout=5)
                print(response.text)
            except (requests.ConnectionError, requests.HTTPError, requests.Timeout) as e:
                logging.exception(e)
            except requests.RequestException as e:
                logging.exception(e)
                sys.exit(1)

If you need to skip the header row, you can do so by calling next(reader):
    reader = csv.reader(csvfile)
    next(reader)  # consumes one input row, discarding it
    for row in reader:
        ...

Posted on 2016-03-20 20:38:34
If the URL does not appear in a fixed column or row of the CSV, you can read the file line by line and pick it out with a regex, as follows:
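    import re
    import requests

    filename = 'shitty_url.csv'
    # Match from "https://" up to the next whitespace, comma, or double quote,
    # so a URL followed by a comma or a newline is still found.
    # Note: a URL that itself contains a comma would be cut short.
    url_regex = re.compile(r'https://[^\s,"]+')
    with open(filename, 'r') as csvfile:
        for line in csvfile:
            match = url_regex.search(line)
            if match:
                url = match.group(0)
                crawler = requests.get(url, timeout=5)

Hope this helps :)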
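The regex idea can be tried without any network access by separating URL extraction from the request itself. The following is only a sketch under stated assumptions: the helper name extract_urls, the sample lines, and the second URL are hypothetical, and the pattern assumes a URL ends at the first whitespace, comma, or double quote.

```python
import re

# Assumption: a URL ends at the first whitespace, comma, or double quote.
URL_REGEX = re.compile(r'https://[^\s,"]+')


def extract_urls(lines):
    """Return every https:// URL found in an iterable of text lines."""
    urls = []
    for line in lines:
        urls.extend(URL_REGEX.findall(line))
    return urls


# Made-up lines standing in for the contents of a CSV file.
sample_lines = [
    'id,name,https://www.targetdomain.com/mainDirectoryName/subDirectoryName/pageName\n',
    'no url on this line\n',
    '"quoted","https://www.targetdomain.com/other"\n',
]
found = extract_urls(sample_lines)
```

Passing an open file object instead of sample_lines should work the same way, since iterating a file yields its lines; each URL in the result can then be handed to requests.get.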
https://stackoverflow.com/questions/36117635