Scraping Links from Wikipedia Using Beautiful Soup
Here we are going to scrape the list of constituents of the FTSE 100 index. To do this, we will scrape each company's ticker and Wikipedia page link from the table with the 'constituents' ID on the FTSE 100 Wikipedia page (https://en.wikipedia.org/wiki/FTSE_100_Index).
Each data point we require sits on the path #constituents > td > a, i.e. an <a> tag inside a <td> cell of the 'constituents' table.
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get('https://en.wikipedia.org/wiki/FTSE_100_Index')
parser = BeautifulSoup(response.content, 'html.parser')

# Get the table with the 'constituents' id
constituents_table = parser.select('#constituents')

# Find all <td> tags in the constituents table
td = constituents_table[0].find_all('td')

# Create empty dataframe to store scraped stock and url data
output = pd.DataFrame(columns=['stock', 'url'])

# Loop through all <td> tags, find all <a> tags and extract text and href
# Note: we use try/except to skip over <td> tags that don't contain an <a> tag
for tag in td:
    try:
        a = tag.find_all('a')
        stock = a[0].text
        url = 'https://en.wikipedia.org' + a[0]['href']
        print(stock, url)
        # Add the current stock to the output dataframe
        current_stock = pd.DataFrame([{'stock': stock, 'url': url}])
        output = pd.concat([output, current_stock])
    except (IndexError, KeyError):
        continue

output.to_csv('ftse_100_stocks.csv', index=False)
```
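If you prefer to express the whole path from the previous section as a single CSS selector, the loop can be condensed. Here is a minimal sketch of that alternative, assuming (as the try/except loop above does) that the first <a> inside each cell is the link we want:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get('https://en.wikipedia.org/wiki/FTSE_100_Index')
parser = BeautifulSoup(response.content, 'html.parser')

# Select every <td> under the element with the 'constituents' id in one pass
rows = []
for cell in parser.select('#constituents td'):
    a = cell.find('a')  # first <a> in the cell, if there is one
    if a is not None and a.get('href'):
        rows.append({'stock': a.text,
                     'url': 'https://en.wikipedia.org' + a['href']})

# Build the dataframe once from the collected rows
output = pd.DataFrame(rows, columns=['stock', 'url'])
output.to_csv('ftse_100_stocks.csv', index=False)
```

Collecting the rows in a list and constructing the DataFrame once is also generally faster than calling pd.concat inside the loop, which copies the growing frame on every iteration.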