
Webscraping Finviz's Screener

Posted by Daniel Cano

Edited on May 15, 2024, 5:21 p.m.



What is web scraping?

It is a technique used to extract data from web pages in an automated manner. Essentially, it involves a process where a program or script traverses and analyzes the HTML code of a web page to extract relevant and structured information.

This technique is widely used in various areas, from data analysis and research to application development and monitoring prices in e-commerce. Web scraping allows access to data that would otherwise be difficult or tedious to obtain manually and can be used to gather information such as product prices, news, user reviews, contact information, among others.

The web scraping process generally involves several stages. First, the relevant data sources are identified, that is, the web pages from which the information is to be extracted. Then, a script or program is developed to access these pages, download their HTML content, and analyze it in search of the specific data of interest. This analysis may involve searching for text patterns or HTML tags, or even using more advanced techniques such as natural language processing.

It is important to note that web scraping should be done ethically and in compliance with the terms of service of the websites being scraped. Some sites may have policies against web scraping or may limit access to their content through measures such as blocking IP addresses. Therefore, it is fundamental to follow best practices and respect the usage policies of websites to avoid legal or technical issues.

While many people tend to jump straight into scraping, it is usually better to conduct a preliminary analysis of how the website works. Typically, three cases may arise:

  1. The website is loaded as HTML directly by the backend before being sent to the client.
  2. The website is fully loaded but requires the execution of JavaScript to fill HTML elements.
  3. The website makes calls to the backend to obtain data instead of loading it beforehand.

In the first case, which is the one we will focus on today, we work directly with the HTML file, searching for the elements of interest. This type of webpage is becoming less common precisely because obtaining the data is so straightforward.

In the second case, where JavaScript fills in the content, we must find where the data is stored and extract it from there. An example would be Benzinga, which stores its data in a script element with the id "__NEXT_DATA__". Extracting it is straightforward, but many people approach the problem with the first method in mind and get stuck when they don't see the data in the page's source HTML.

Lastly, there is the possibility of replicating backend calls, which is my favorite method. Often, to discourage web scraping, companies do not load their data in either of the previously mentioned ways but instead fetch it through backend calls as the client uses the page. The parameters of these calls can be replicated in a script. This can be quite convoluted, and some sites use measures to prevent third-party connections, such as cookies or tokens that need to be renewed periodically. An example of this case would be Yahoo Finance.
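To illustrate the second case, here is a minimal sketch of extracting the embedded JSON from a "__NEXT_DATA__" script element. The URL is only an example, the requests library is used here for brevity, and the keys inside the JSON payload depend entirely on the page being scraped.

import json

import requests
from bs4 import BeautifulSoup

# Example URL; any page that embeds its data in a "__NEXT_DATA__" script works the same way
html = requests.get('https://www.benzinga.com/quote/AAPL', timeout=30).text
soup = BeautifulSoup(html, 'html.parser')

script = soup.find('script', {'id': '__NEXT_DATA__'})
if script is not None:
    payload = json.loads(script.get_text())
    # The useful data usually hangs somewhere under payload['props'], but the exact
    # path varies by site and must be inspected manually
    print(list(payload.keys()))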

What is Finviz?

Finviz.com is a website widely used by investors and traders to obtain financial information, conduct market analysis, and make investment decisions. Its name comes from "Financial Visualization." It offers a wide range of tools and resources for analyzing stocks, futures, currencies, and other financial instruments, although some tools are paid.

One of Finviz.com's most prominent features is its ability to provide quick and accessible visualization of a large amount of financial data. Through its online platform, users can access interactive charts, performance tables, fundamental data, financial news, and technical analysis of thousands of financial assets worldwide.

Among the most popular tools offered by Finviz.com are:

  1. Market Map: A global visualization showing the performance of different sectors and assets in real-time.

  2. Stock Screener: Allows users to filter stocks based on a wide range of criteria, such as market capitalization, price-to-earnings ratio (P/E ratio), trading volume, and technical patterns.

  3. Heatmap: Provides a visual representation of the relative strength of stocks within an index or sector.

  4. Portfolio Analyzer: Enables users to track and analyze the performance of their investment portfolios, as well as make comparisons with benchmark indices.

  5. Fundamental and Technical Data: Offers access to fundamental financial data such as revenue, earnings, and price-to-book ratio (P/B ratio), as well as technical indicators such as moving averages, RSI (Relative Strength Index), and MACD (Moving Average Convergence Divergence).

In this article, we will focus on obtaining data from the stock screener only, although the same techniques can be applied to obtain all the details of a ticker or sectors and industries.

Approach

The first step will be to see whether we can obtain the data by replicating calls to the API. This is preferable because the appearance of a webpage changes more often than its API, and a layout change forces us to modify the scraping script. To do this, we open the Chrome developer panel, either by pressing F12 (or Fn + F12 on some keyboards) or by right-clicking and selecting "Inspect". In the top bar we look for the "Network" tab, where backend calls can be seen and filtered by type. Although we're interested in Fetch/XHR calls, I prefer not to filter them. In this case, the call we're interested in is to "https://finviz.com/screener.ashx?v=111&o=-change", where "o=-change" is simply the parameter used to sort the table by change.

Since, in this case, the HTML obtained from the previous URL is the same as what is seen on the page (without modifications by JavaScript, etc.), we can directly inspect the webpage without needing to delve into the pure HTML. Upon a closer look, we notice that the information we want is located in a table with a class called "screener_table." This class probably only applies to the table we want to extract, allowing us to easily identify it; we just need to iterate over each value in each row to obtain its data.

However, not all data is displayed in a single table; instead, pagination is applied. Specifically, in the previous image it can be seen that, without applying any filters, there are 476 pages. We will need to obtain the table from each of these pages, so we must first determine the total number of pages. Additionally, when moving to the next page, the URL changes, adding a parameter "r" with a value of 21 for the second page, 41 for the third, 61 for the fourth, and so on. We can conclude that the pattern is to add 20 for each page, since there are 20 rows per page. Having identified this, a script can be created to iterate over the pages.

In case any filters are selected, an "f" parameter will be added to the URL, followed by a comma-separated list of the ids of the selected filters. This will need to be included in the script if the intention is to filter the assets.
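To make this pagination pattern concrete, here is a small sketch that builds the list of page URLs. The function name and the filter id are illustrative; only the parameters discussed above ("v", "o", "f" and "r") are used.

def build_screener_urls(n_pages: int, filters: list = None) -> list:
    '''Builds the screener URL for each page, following the "r" pattern described above.'''
    base: str = 'https://finviz.com/screener.ashx?v=111&o=-change'
    if filters:
        base += '&f=' + ','.join(filters)
    # Page 1 needs no "r" parameter; page k starts at row 20*(k-1) + 1
    return [base] + [f'{base}&r={20*k + 1}' for k in range(1, n_pages)]

print(build_screener_urls(3, ['cap_largeunder']))
# ['https://finviz.com/screener.ashx?v=111&o=-change&f=cap_largeunder',
#  'https://finviz.com/screener.ashx?v=111&o=-change&f=cap_largeunder&r=21',
#  'https://finviz.com/screener.ashx?v=111&o=-change&f=cap_largeunder&r=41']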

Coding

The first step will be to import the necessary libraries, which in our case are certifi, urllib3, BeautifulSoup, and pandas. Additionally, a function will be created to handle the calls and return the HTML in a usable format. This function has two main parts: first, the call to the URL is made and a response is obtained; second, the content of the response is transformed into a BeautifulSoup element that can be searched for the desired HTML. To avoid potential blocking, it may be convenient to introduce measures such as a random "user-agent" in the request headers or a random wait between calls (a sketch of this is shown after the function).

import certifi
import pandas as pd
import urllib3
from bs4 import BeautifulSoup

def makeRequest(url:str) -> BeautifulSoup:

    '''
    Makes a request with the correct header and certifications.

    Parameters
    ----------
    url: str
        String with the url from which to get the HTML.

    Returns
    -------
    soup: bs4.BeautifulSoup
        Contains the HTML of the url.
    '''
    headers: dict = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0'}

    # Perform the GET request with certificate verification, then parse the HTML
    content: urllib3.response.HTTPResponse = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where()).urlopen('GET', url, headers=headers)
    soup: BeautifulSoup = BeautifulSoup(content.data, 'html.parser')

    return soup
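As mentioned above, a random user-agent and a random pause between calls can reduce the chance of being blocked. The following is only a sketch of such a variant: the makeRequestPolite name and the user-agent strings are illustrative, and it reuses the certifi, urllib3, and BeautifulSoup imports from above.

import random
import time

# Illustrative pool of user-agent strings; any realistic set will do
USER_AGENTS: list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0',
]

def makeRequestPolite(url:str) -> BeautifulSoup:

    '''
    Variant of makeRequest with a random user-agent and a random
    pause before the call.
    '''
    time.sleep(random.uniform(1, 3))  # Wait between 1 and 3 seconds
    headers: dict = {'User-Agent': random.choice(USER_AGENTS)}

    content = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where()).urlopen('GET', url, headers=headers)

    return BeautifulSoup(content.data, 'html.parser')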

The next function we will define will obtain the data from the table (getTable). In this function, first, we will have to obtain the table by searching for the "table" tag and filtering the table that has the "screener_table" class. Once we have the table, we will proceed to identify the columns by iterating over the elements with the "th" tag and selecting the text they contain, additionally removing unnecessary characters such as line breaks.

Once the table structure is available, we will iterate over the rows of the table (ignoring the one containing the column names) and over the cells within each row to extract their text. Since all of these cells contain elements of type "a", we will select those. With the data from the columns and rows, a DataFrame will be generated, since this format is easier and more efficient to manipulate than plain Python lists.

The function would be defined as follows, where the only parameter would be the HTML of the page and the loops that have been previously explained will be performed in a single line to improve speed and efficiency. If the table being searched for does not exist, an empty DataFrame will be returned.

def getTable(soup:BeautifulSoup) -> pd.DataFrame:

    '''
    Extraction of screener table.

    Parameters
    ----------
    soup: bs4.BeautifulSoup
        HTML of a webpage.

    Returns
    -------
    table: pd.DataFrame
        Contains the screener table.
    '''
    
    # Locate the screener table by its class
    table = soup.find('table', {'class':'screener_table'})
    if table:
        # Column names come from the header ("th") cells; each row's values come from the "a" elements in its cells
        cols: list = [col.get_text().replace('\n', '') for col in table.find_all('th')]
        rows: list = [[a.get_text() for a in row.find_all('a')] for row in table.find_all('tr')[1:]]
        return pd.DataFrame(data=rows, columns=cols)
    else:
        return pd.DataFrame()

The next step will be to create the function that generates the URLs and iterates over them to obtain the HTML from the different pages. Since the number of pages cannot be known in advance, it will always be necessary to make the first call and obtain it from there. This turns out to be a simple task, since the buttons to change pages are located in an element of type "td" with the id "screener_pagination". This element contains all the page links; we just have to keep the links whose text is numeric and select the highest value.

Additionally, we will need to store the tables from each page somehow. For this purpose, we will define a list called "complete_data" where we will include the DataFrames, and in the end, we will concatenate them all to generate a single table.

def getPages(filters:list=None) -> pd.DataFrame:

    '''
    Extraction of screener table.

    Parameters
    ----------
    filters: list
        List of filters to apply. The filters must be the ids of 
        the ones from finviz.com.

    Returns
    -------
    table: pd.DataFrame
        Contains the screener table.
    '''

    filters: list = [] if filters is None else filters
    complete_data: list = []

    # Defining the url and connecting to obtain html
    tempurl: str = f"https://finviz.com/screener.ashx?v=111&ft=4&o=-change&f={','.join(filters)}"
    soup: BeautifulSoup = makeRequest(tempurl)
    complete_data.append(getTable(soup))

    # Total number of pages: highest numeric link inside the pagination element
    page: int = max([int(a.get_text().replace('\n', '')) for a in soup.find('td', {'id': 'screener_pagination'}).find_all('a') if a.get_text().replace('\n', '') != ''])

    if page > 1:
        # Pages 2, 3, ... start at rows 21, 41, 61, ... (20 rows per page)
        for i in range(1, page):
            tempurl: str = f"https://finviz.com/screener.ashx?v=111&ft=4&o=-change&f={','.join(filters)}&r={i*20 + 1}"
            soup: BeautifulSoup = makeRequest(tempurl)
            complete_data.append(getTable(soup))
                
    return pd.concat(complete_data)

The last step will be to format the columns of the table. For that, we will create a helper function that converts numbers written in financial notation (such as "8.78M" or "1.41B") into plain Python numbers. Since the function is quite simple, it will not be explained in detail.

def _to_numeric(string:str) -> (int | float):

    '''
    Converts a string to a number of type float 
    or int.

    Parameters
    ----------
    string: str
        String containing a number.

    Returns
    -------
    number: int | float
        Number in correct format.
    '''

    string: str = string.replace(',','')
    float_cond: bool = '.' in string or string == 'nan'
    number: (int | float)
    if 'k' in string or 'K' in string:
        number = float(string[:-1]) if float_cond else int(string[:-1])
        number = number * 1000
    elif 'm' in string or 'M' in string:
        number = float(string[:-1]) if float_cond else int(string[:-1])
        number = number * 1000000
    elif 'b' in string or 'B' in string:
        number = float(string[:-1]) if float_cond else int(string[:-1])
        number = number * 1000000000
    elif 't' in string or 'T' in string:
        number = float(string[:-1]) if float_cond else int(string[:-1])
        number = number * 1000000000000
    elif '%' in string:
        number = float(string[:-1]) if float_cond else int(string[:-1])
    else:
        number = float(string) if float_cond else int(string)

    return number
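As a quick sanity check, a few illustrative calls and the values they produce:

# Illustrative examples of the conversion (inputs are made up)
print(_to_numeric('1.5M'))    # 1500000.0
print(_to_numeric('8,784'))   # 8784
print(_to_numeric('2B'))      # 2000000000
print(_to_numeric('12.3%'))   # 12.3 (the percent sign is stripped; no division by 100)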

Once the screener DataFrame has been obtained (see the call to getPages() below), the conversions are applied to the desired columns as follows.

screener_df: pd.DataFrame = screener_df.drop_duplicates()
screener_df['Market Cap'] = screener_df['Market Cap'].apply(lambda x: _to_numeric(x))
screener_df['P/E'] = screener_df['P/E'].apply(lambda x: float('nan' if x == '-' else x))
screener_df['Price'] = screener_df['Price'].astype(float)
screener_df['Change'] = screener_df['Change'].apply(lambda x: float(x.replace('%', ''))/100)
screener_df['Volume'] = screener_df['Volume'].str.replace(',','').astype(float)

When we execute the "getPages()" function with some filters, we obtain a table with data such as the symbol, company name, sector, industry, country of origin, market capitalization, price, and volume.

stock_filters: list = ['cap_largeunder', 'sh_avgvol_o1000', 'ta_highlow20d_b0to10h', 'ta_perf_4w30o', 'ta_sma20_pa10', 'ta_sma50_pa']
screener_df: pd.DataFrame = getPages(stock_filters)
No. | Ticker | Company | Sector | Industry | Country | Market Cap | P/E | Price | Change | Volume
1 | JAGX | Jaguar Health Inc | Healthcare | Biotechnology | USA | 8.784000e+07 | NaN | 0.32 | 0.0846 | 140586417.0
2 | CVNA | Carvana Co. | Consumer Cyclical | Auto & Truck Dealerships | USA | 1.415000e+10 | 49.01 | 121.67 | 0.0444 | 8515304.0
3 | JANX | Janux Therapeutics Inc | Healthcare | Biotechnology | USA | 3.360000e+09 | NaN | 64.78 | 0.0376 | 1044152.0
4 | PHG | Koninklijke Philips N.V. ADR | Healthcare | Medical Devices | Netherlands | 2.431000e+10 | NaN | 26.82 | 0.0098 | 1661489.0
5 | BILI | Bilibili Inc ADR | Communication Services | Electronic Gaming & Multimedia | China | 4.740000e+09 | NaN | 14.83 | 0.0075 | 7349068.0
6 | MTTR | Matterport Inc | Technology | Software - Application | USA | 1.430000e+09 | NaN | 4.54 | 0.0067 | 3069818.0
7 | CDMO | Avid Bioservices Inc | Healthcare | Biotechnology | USA | 5.237100e+08 | NaN | 8.25 | 0.0061 | 1300201.0
8 | HUMA | Humacyte Inc | Healthcare | Biotechnology | USA | 5.251400e+08 | NaN | 4.41 | 0.0023 | 802873.0
9 | DCPH | Deciphera Pharmaceuticals Inc | Healthcare | Drug Manufacturers - Specialty & Generic | USA | 2.090000e+09 | NaN | 25.38 | 0.0000 | 1851787.0
10 | SNAP | Snap Inc | Communication Services | Internet Content & Information | USA | 2.682000e+10 | NaN | 16.25 | -0.0031 | 23736909.0
11 | CUTR | Cutera Inc | Healthcare | Medical Devices | USA | 5.050000e+07 | NaN | 2.53 | -0.0156 | 454805.0
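If the cleaned table is to be reused later, it can be persisted to disk; a one-line example (the file name is arbitrary):

# Save the screener table for later analysis
screener_df.to_csv('finviz_screener.csv', index=False)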


Conclusion

Web scraping of the Finviz screener emerges as a valuable strategy for investors and traders looking to optimize their analysis and decision-making process in the financial market. This technique offers the advantage of automating data collection, saving time and effort for investors by avoiding the need to manually enter each search criterion into the Finviz screener. Additionally, web scraping allows for greater flexibility in handling and analyzing the extracted data, facilitating the identification of investment opportunities, conducting detailed comparisons, and tracking market trends in real time. However, it is important to note that web scraping must be done ethically and legally, respecting the terms of service of Finviz and any other website from which data is extracted.