Scraping Links from Wikipedia Using Beautiful Soup

Python

Here we are going to scrape the list of constituents of the FTSE 100 index. To do this, we will scrape each company's ticker and Wikipedia page link from the table with the 'constituents' ID on the FTSE 100 Wikipedia page (https://en.wikipedia.org/wiki/FTSE_100_Index).

Each data point we require lies along the path #constituents > td > a.
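To see how that path translates into Beautiful Soup, here is a minimal sketch run against a made-up HTML fragment that mimics the table's structure (the fragment and its contents are illustrative, not the real page). `select()` accepts the CSS path directly; using a descendant combinator (a space) instead of the child combinator (`>`) is more forgiving if the table nests its rows inside a `<tbody>`:

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the shape of the constituents table
html = """
<table id="constituents">
  <tr><th>Company</th><th>Ticker</th></tr>
  <tr><td><a href="/wiki/AstraZeneca">AZN</a></td></tr>
  <tr><td>no link in this cell</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Matches every <a> inside a <td> inside the #constituents table
for a in soup.select('#constituents td a'):
    print(a.text, a['href'])
```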

import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get('https://en.wikipedia.org/wiki/FTSE_100_Index')
parser = BeautifulSoup(response.content, 'html.parser')

# Get the table with the 'constituents' id
constituents_table = parser.select('#constituents')

# Find all <td> tags in the constituents table
td = constituents_table[0].find_all('td')

# Create empty dataframe to store scraped stock and url data
output = pd.DataFrame(columns=['stock', 'url'])

# Loop through all <td> tags, find all <a> tags and extract text and href from the tag
# Note: We catch IndexError/KeyError to skip <td> tags that don't contain an <a> tag
for tag in td:
    try:
        a = tag.find_all('a')
        stock = a[0].text
        url = a[0]['href']
        url = 'https://en.wikipedia.org' + url
        print(stock, url)
        # Add current stock to output dataframe
        current_stock = pd.DataFrame([{'stock': stock, 'url': url}])
        output = pd.concat([output, current_stock], ignore_index=True)
    except (IndexError, KeyError):
        continue

output.to_csv('ftse_100_stocks.csv', index=False)
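The same scrape can be sketched as a single pass with `select()` and one DataFrame construction, which avoids calling `pd.concat` inside the loop. This version is wrapped in a function so it works on any HTML string; the function name is ours, and the `/wiki/` filter is an assumption we add to skip footnote and citation links that can also appear inside table cells:

```python
import pandas as pd
from bs4 import BeautifulSoup

def scrape_constituent_links(html, base='https://en.wikipedia.org'):
    """Collect the text and absolute URL of article links in the #constituents table."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = [{'stock': a.text, 'url': base + a['href']}
            for a in soup.select('#constituents td a')
            # Assumed filter: keep only links into Wikipedia article space
            if a.get('href', '').startswith('/wiki/')]
    return pd.DataFrame(rows)

# Usage against the live page (assumes the page layout is unchanged):
# import requests
# df = scrape_constituent_links(requests.get('https://en.wikipedia.org/wiki/FTSE_100_Index').content)
# df.to_csv('ftse_100_stocks.csv', index=False)
```

Note that unlike the loop above, this grabs every link in each cell rather than only the first, which is why the `/wiki/` filter matters.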