Can we find patterns in a large enough dataset of trades?

This is an overview of an analysis that uses stock price movements to try to predict where the price will be at the end of the day. The analysis is further broken down by the profile of the company and by price movement during prior periods. The aim is to find out whether certain features of the data have some predictive power over price movement and, if they do, to build a trading strategy that exploits this information.

All the analysis is done with the Python Pandas library, the data comes from Quantopian, and the reporting is built on the Python Django platform.

In day trading, there are times when your knowledge and experience go against you.

If I could get a large enough dataset of trades, there may be patterns that repeat that are not noticeable in the trenches but may become apparent when zoomed out and presented well. Of course there is no sure-thing strategy when trading, but if there are patterns that repeat a little more often than not, then doing this type of analysis may be a worthwhile exercise.

A data analysis project starts with a question or a series of questions. This frames the project and makes sure any work you do is faithful to the reason you started it in the first place. Although this seems obvious at first, as projects get more and more complex, having a unifying reason keeps them from going off on tangents that don't serve the original project specification. During the project, it might turn out that the initial specification is inaccurate or unworkable, which may lead to a different specification. If that is the case, the original needs to be changed and it will be back to the drawing board.


The particular day trading strategy I will focus on at the beginning is trading gappers. A gapper is a stock whose price at the day's open is higher or lower than its prior day close. Gappers flag up on stock scanners as a percentage change against the previous close. I have selected an arbitrary threshold of 4%, meaning the analysis will be done on stocks that have gapped up by more than 4% or gapped down by more than 4% from the previous day's close. To filter this further, I will look at stocks on the NASDAQ which have entries for float and market capitalization on the Yahoo Finance website. At this point I believe there could be a relationship between the size of the company and the way the stock moves after a gap happens.
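
The gap definition above can be sketched in a few lines of plain Python. This is only an illustration: the function names `gap_percent` and `classify_gap` and the sample prices are made up for the example, and the 4% threshold is the arbitrary one chosen above.

```python
def gap_percent(prev_close, open_price):
    """Fractional change from the prior day's close to today's open."""
    return (open_price - prev_close) / prev_close


def classify_gap(prev_close, open_price, threshold=0.04):
    """Return 1 for a gap up, -1 for a gap down, 0 if inside the threshold."""
    change = gap_percent(prev_close, open_price)
    if change > threshold:
        return 1
    if change < -threshold:
        return -1
    return 0


# Made-up prices, for illustration only
print(classify_gap(10.00, 10.50))  # open is 5% above the prior close
print(classify_gap(10.00, 9.50))   # open is 5% below the prior close
print(classify_gap(10.00, 10.20))  # a 2% move, inside the threshold
```

Keeping the threshold as a parameter makes it easy to test whether the results are sensitive to the choice of 4%.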

With regard to prediction, there are a number of ways price can move after the open:

  1. The gap could close, meaning the price returns to the close of the previous day by the end of trading.
  2. The gap could run, meaning the price moves in the direction of the gap over the day.
  3. The price could reverse multiple times during the day, finally settling on one of the above.

Most of the time, though, price movement follows number 3 above. The price may look as though it is closing the gap, only to move in the opposite direction and close beyond the opening price. The opposite also happens, where the price continues in the direction of the gap but changes direction towards the end of the day.


Some of the questions I want to put forward to the model are:

  1. Do stocks with different market caps and floats behave differently when gapped after open?
  2. Does the prior day trend affect movement after the open? As the trend could differ at different times of the day, is it possible to look at the last 15 to 30 minutes of the prior day's trend?
  3. Does the trend of the index the stock is on affect the direction of the gap?
  4. Can movement in the first minute predict movement in the subsequent 5 minutes?
  5. Can the above be applied to the next 10, 15, 30 and 60 minutes? If a trade is entered this long after the open, will it still be profitable?


This will be done as a proper data analysis project, implementing best practices for organisation and making the approach as modular as possible. The directory structure will be based on the format where possible and will follow the sequence of:

  1. Getting and formatting data for analysis
  2. Creating features
  3. Implementing algorithms
  4. Presentation

Each of these will be further broken down depending on the scale and complexity of the task.

It is unlikely that the process will be a linear one, but rather an iterative one as discoveries in the future may lead to changing the original frame of the project or even refining some of the questions.

As the data is proprietary and not shareable, it cannot be hosted in the repository, but the code for the data processing will be available either as a Python script or an IPython notebook where available.

Getting the Data

The code in the posts tries to stick as closely as possible to the best-known libraries in the Python data analysis stack. It has been set out so that anyone following the posts can easily recreate it in their own programming environment. Unfortunately, good quality data is hard to come by. Yahoo and Google Finance no longer open their data for public consumption, and the data for lesser-known stocks is either nonexistent or incomplete. Because of this, the analysis has been done within a Quantopian research notebook. Quantopian provides intraday data at minute resolution for a massive number of securities. All analysis has to be done within a Quantopian IPython Notebook, as data cannot be exported to be processed locally.

The analysis starts by getting a list of NASDAQ securities. These will be used to search for fundamental data in Yahoo Finance. Getting the original list of securities will be done locally and then exported to the Quantopian environment.

The process will be:

  • Get the list of securities traded on the NASDAQ
  • Get fundamental data for these for future categorization
  • Search Quantopian for any day they gapped, based on a set of criteria
  • Organize the data in such a way that you can ask it questions

List of Securities Traded on the NASDAQ

The ADVFN website has an updated list of NASDAQ securities. Because of the sheer number, the securities are organized by letter. The script below downloads the page for each letter. The Python library BeautifulSoup reads the HTML document and extracts the ticker symbols into a list for processing. Finally the Pickle library is used to save the list for later processing.

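import pickle
import string

import bs4 as bs
import requests

tickers = []
letters = []
# One listing page per letter of the alphabet
for i in string.ascii_uppercase:
    letters.append(i)

for letter in letters:
    resp = requests.get('{}'.format(letter))
    soup = bs.BeautifulSoup(resp.text, 'html5lib')
    table = soup.find('table', {'class': 'market tab1'})
    # Skip the two header rows, then read one ticker per row
    for row in table.find_all('tr')[2:]:
        # Assumption: the ticker symbol is in the second cell of each row
        tickers.append(row.find_all('td')[1].text)

# Save the full list once every page has been read
with open('nasdaq.pickle', 'wb') as f:
    pickle.dump(tickers, f)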

On the whole, there are about 3500 ticker symbols available from this process. Not all will be used in the final analysis as only the most tradeable will eventually be selected.

Include Market Cap and Float

Yahoo Finance contains comprehensive fundamental data for companies, organized by stock market symbol. It is easily retrieved by scraping the web page for each symbol, in the same way the list of symbols was obtained. As the analysis needs the market cap and float, if this data is not available on Yahoo Finance, the security will be dropped from the analysis.

Get Float

nasdaq_stocks = {}
with open('nasdaq.pickle', 'rb') as f:
    nasdaq = pickle.load(f)

URL = '{}/key-statistics?p={}'
for stock in nasdaq:
    print('Getting float for {}'.format(stock))
    resp = requests.get(URL.format(stock, stock))
    soup = bs.BeautifulSoup(resp.text, 'html5lib')
    try:
        stock_float = soup.find('td', {'class': 'Fz(s) Fw(500) Ta(end)'}).text
    except AttributeError:
        # soup.find returned None, so the value is not on the page
        stock_float = 'UNK'
    nasdaq_stocks[stock] = {}
    nasdaq_stocks[stock]['float'] = stock_float

with open('nasdaq_stocks.pickle', 'wb') as f:
    pickle.dump(nasdaq_stocks, f)

# Remove tickers where float is unavailable
with open('nasdaq_stocks.pickle', 'rb') as f:
    tickers = pickle.load(f)

tickers_final = {}

for key in tickers:
    if tickers[key]['float'] not in ('N/A', 'UNK'):
        tickers_final[key] = {'float': tickers[key]['float']}

# Write the filtered dictionary once, after the loop has finished
with open('nasdaq_stocks_final.pickle', 'wb') as f:
    pickle.dump(tickers_final, f)

Get Market Cap

URL = '{}/key-statistics?p={}'

with open('nasdaq_stocks_final.pickle', 'rb') as f:
    nasdaq_stocks_final = pickle.load(f)

for stock in nasdaq_stocks_final:
    print('Getting Market Cap for {}'.format(stock))
    resp = requests.get(URL.format(stock, stock))
    soup = bs.BeautifulSoup(resp.text, 'html5lib')
    try:
        market_cap = soup.find('td', {'class': 'Fz(s) Fw(500) Ta(end)'}).text
    except AttributeError:
        # soup.find returned None, so the value is not on the page
        market_cap = 'UNK'
    nasdaq_stocks_final[stock]['mktcap'] = market_cap

with open('nasdaq_stocks_final.pickle', 'wb') as f:
    pickle.dump(nasdaq_stocks_final, f)

The values for market cap and float on Yahoo are abbreviated strings such as 1.14B or 92.07M. Because they are strings, they cannot be compared with each other numerically. They need to be converted into numbers and then binned so they are easier to deal with and analyse.

import pickle

import pandas as pd


def get_value_from_string(val='UNK'):
    # Convert Yahoo notation for float and market cap to numbers, in millions
    if val == 'UNK':
        return 0
    if val[-1] == 'B':
        return float(val[:-1]) * 1000
    if val[-1] == 'M':
        return float(val[:-1]) * 1
    # Any other format gets a large sentinel value
    return 9999999999


with open('nasdaq_stocks_final.pickle', 'rb') as f:
    nasdaq_stocks_final = pickle.load(f)

dfStocks = pd.DataFrame(index=list(nasdaq_stocks_final), columns=['FLOAT', 'MKTCAP'])
for stock in nasdaq_stocks_final:
    # A single .loc call avoids chained assignment
    dfStocks.loc[stock, 'FLOAT'] = nasdaq_stocks_final[stock]['float']
    dfStocks.loc[stock, 'MKTCAP'] = nasdaq_stocks_final[stock]['mktcap']
dfStocks['FLOAT_VAL'] = dfStocks['FLOAT'].apply(get_value_from_string)
dfStocks['MKTCAP_VAL'] = dfStocks['MKTCAP'].apply(get_value_from_string)
# Bin the numeric values into deciles, labelled 0 to 9
dfStocks['FLOAT_SCALED'] = pd.qcut(dfStocks['FLOAT_VAL'], 10, labels=False)
dfStocks['MKTCAP_SCALED'] = pd.qcut(dfStocks['MKTCAP_VAL'], 10, labels=False)
with open('dfStocks.pickle', 'wb') as f:
    pickle.dump(dfStocks, f)

       FLOAT  MKTCAP  FLOAT_VAL  MKTCAP_VAL  FLOAT_SCALED  MKTCAP_SCALED
SHLM   1.14B   1.04B    1140.00     1040.00             6              6
ACMR  92.07M  98.82M      92.07       98.82             2              2
AAON    1.9B   1.73B    1900.00     1730.00             7              7
ABAX   1.65B   1.43B    1650.00     1430.00             6              6
ABMD  11.19B  10.16B   11190.00    10160.00             8              8
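
As a possible refinement, here is a sketch of a more defensive converter. It is not the function used for the results in this series. The name `parse_abbreviated`, the extra `K` and `T` suffixes and the choice to return `None` for unreadable values are all assumptions made for illustration.

```python
# Multipliers that convert each suffix to millions
SUFFIX_TO_MILLIONS = {'K': 0.001, 'M': 1.0, 'B': 1000.0, 'T': 1000000.0}


def parse_abbreviated(val):
    """Convert strings such as '92.07M' or '1.14B' to a number in millions.

    Returns None when the value is missing or cannot be read.
    """
    if not val:
        return None
    text = val.strip().replace(',', '')
    if text in ('', 'N/A', 'UNK'):
        return None
    suffix = text[-1].upper()
    try:
        if suffix in SUFFIX_TO_MILLIONS:
            return float(text[:-1]) * SUFFIX_TO_MILLIONS[suffix]
        # No suffix: assume a plain count in single units
        return float(text) / 1000000.0
    except ValueError:
        return None


print(parse_abbreviated('92.07M'))
print(parse_abbreviated('N/A'))
```

Returning `None` rather than 0 or a large sentinel keeps unreadable values out of the lowest and highest bins, at the cost of having to drop or fill the missing rows before binning.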

The final result is a Pandas DataFrame with some fundamental data that can be imported into a Quantopian IPython notebook.

Getting the Stock Price Movements from Quantopian

We now have a dataset of NASDAQ stocks, including market caps and floats, with which to carry out an event analysis. Quantopian is a great platform for doing this because it provides high quality minute data on a massive list of securities. Quantopian also provides the environment for this analysis in the form of an IPython Notebook, along with built-in functions that help with stock research.

The Pandas DataFrame created in Part 1 of this series is used as the basis for the stocks we are interested in, because it already contains the share float and market cap. Although company fundamentals are available in Quantopian, this process allows more features to be added wherever they are available instead of relying solely on Quantopian.

Some definitions.

A Pipeline is a feature in Quantopian that lets you look at a large number of stocks and associated data on them. A Pipeline is useful here to confirm if the stock from the list is available in the Quantopian database.

A Symbol is an object in Quantopian that is associated with a stock. Each symbol is associated with a unique number. Relying on the ticker symbol alone won't do in this case, as exchanges recycle ticker symbols, and a query on a ticker over an extended period of time may relate to different companies.

The initial part of the analysis will be converting the symbols in the list to symbol objects in Quantopian.

from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import SimpleMovingAverage
from quantopian.pipeline.filters import StaticAssets
from quantopian.pipeline.filters import Q1500US
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = local_csv('stocks.csv')
watchSymbols = symbols(df.columns[:-1])
watchAssets = StaticAssets(watchSymbols)

From here, we need the final list of stocks which Quantopian has data for. A simple way of getting it is to run a pipeline with the stocks in the list. If there is data available, the stocks will be returned in the pipeline.

# Get a list of assets with data available in Quantopian

def make_pipeline():
    return Pipeline(screen=watchAssets)

my_pipe = make_pipeline()
result = run_pipeline(my_pipe, start_date='2018-01-05', end_date='2018-01-05')

stocks = result.index.get_level_values(1).values

From the original list of 1500 stocks, there is data available for 1,300 securities. Once we have the stocks we can look for events. An event is defined as a difference between one day's close and the next day's open that exceeds a predetermined percentage in either direction. Only events that have a certain level of volume can be traded, so we need to bring in volume as well. In order to get a comprehensive dataset, 3 years of data will be used. This way different market environments can be looked at, including month-by-month and week-by-week comparisons, to ensure that any models are not overfitted and can be generalized to any type of market.

# Get data for analysis
# This will be daily data to compare the close price for the day with the open price for the next day

start_date = '2015-01-01'
end_date = '2017-12-31'

dfOpen = get_pricing(stocks, start_date=start_date, end_date=end_date, fields='open_price', frequency='daily')
dfClose = get_pricing(stocks, start_date=start_date, end_date=end_date, fields='close_price', frequency='daily')
dfVolume = get_pricing(stocks, start_date=start_date, end_date=end_date, fields='volume', frequency='daily')

With the data, create a list of the days on which stocks gapped, based on a set of criteria. The criteria chosen are arbitrary: look for stocks that have gapped up by more than 4% or gapped down by more than 4%. In order to ensure a reasonable level of activity, we look for stocks that have traded more than 300K volume.

# List of events that satisfy a criteria
# Greater than 4% and less than -4% have been arbitrarily chosen
# The output of this step is to get a list of events by date and security

intEvent = 0.04
lEvents = []

for equity in dfOpen.columns:
    for i in range(1, len(dfOpen.index)):
        # The three frames share one date index, so position i is the same day in each
        # .iloc is used because .ix was removed in pandas 1.0
        price_today = dfOpen[equity].iloc[i]
        price_yest = dfClose[equity].iloc[i - 1]
        volume = dfVolume[equity].iloc[i]

        fPriceChange = (price_today - price_yest) / price_yest
        if (fPriceChange > intEvent or fPriceChange < -intEvent) and (volume > 300000):
            date_0 = dfOpen.index[i]
            # The prior trading day and the gap day bound the event window
            date_start = dfOpen.index[i - 1]
            date_end = dfOpen.index[i]
            lEvents.append([date_0, date_start, date_end, equity, price_yest, price_today, fPriceChange])
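
The same scan can also be written without explicit loops. The version below is a sketch on a tiny made-up dataset, not the code that was run on Quantopian. The function name `find_gap_events`, the tickers `AAA` and `BBB` and all of the prices and volumes are hypothetical.

```python
import pandas as pd

# Made-up daily data for two hypothetical tickers
dates = pd.date_range('2017-01-02', periods=4, freq='B')
open_px = pd.DataFrame({'AAA': [10.0, 10.5, 10.4, 9.8],
                        'BBB': [20.0, 20.1, 20.2, 20.3]}, index=dates)
close_px = pd.DataFrame({'AAA': [10.0, 10.4, 10.3, 9.9],
                         'BBB': [20.0, 20.1, 20.2, 20.3]}, index=dates)
volume = pd.DataFrame({'AAA': [500000, 400000, 350000, 100000],
                       'BBB': [900000, 900000, 900000, 900000]}, index=dates)


def find_gap_events(open_px, close_px, volume, threshold=0.04, min_volume=300000):
    """Return one row per (date, equity) where the open gapped past the threshold."""
    prev_close = close_px.shift(1)              # prior day's close, aligned to today
    gap = (open_px - prev_close) / prev_close   # fractional gap at the open
    mask = (gap.abs() > threshold) & (volume > min_volume)
    events = gap.where(mask).stack().dropna()   # keep only the flagged cells
    events.index.names = ['date', 'equity']
    return events.rename('gap').reset_index()


events = find_gap_events(open_px, close_px, volume)
print(events)
```

In this toy data only the second day of `AAA` qualifies: the last day also gaps by more than 4% but fails the volume filter. Like the loop above, this uses the gap day's full-session volume, which would not be known at the open.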

Over the course of 3 years, for 1,300 securities, there are 9,000 instances where a stock moved in the way the events were defined. 9,000 is an excellent sample of price movements that can be used for analysis. The next posts will deal with initial analysis, including refining the analysis to look at the profitability of particular trades and ways of optimising variables to make those trades the most profitable.

Now that we have a list of stocks and the dates on which the events happened, we can proceed to analyse the events and look for patterns. The output of this step is six data frames, one each for open, high, low, close and volume (OHLCV) and one for price, which in Quantopian is the cleanest dataset, with the fewest gaps in the data.

Initial Analysis

The previous posts in this series focused on getting datasets for analysis. This one looks at doing preliminary analysis on the datasets and getting them into a format that lends itself easily to analysis.

Number of Gaps

The initial dataset was created to bring in securities that gapped up by more than 4% or gapped down by more than 4%. In order to make these categorical variables, gapping up events are defined as 1 and gapping down events as -1. An overall count is given below.

def get_change(change):
    # 1 for a gap up, -1 for a gap down
    if change > 0:
        return 1
    return -1


The count shows 5,000 instances of a gap up and 4,000 instances of a gap down. These are broken down by size of float and by market cap below.

pd.pivot_table(df[['FLOAT_LABEL', 'GAP_VAL']],  

Distribution by Float

                             -1      1
(0.999, 25.125]             368    964
(25.125, 77.548]            616   1226
(77.548, 171.691]           470    822
(171.691, 322.182]          604    796
(322.182, 550.745]         1036   1176
(550.745, 977.414]          858    926
(977.414, 1790.0]          1024   1150
(1790.0, 3602.0]           1076   1000
(3602.0, 13381.0]          1142   1066
(13381.0, 9999999999.0]     864    796

Distribution by Market Cap

                             -1      1
(-0.001, 26.326]            434   1042
(26.326, 76.508]            554   1194
(76.508, 168.097]           510    814
(168.097, 310.692]          574    770
(310.692, 533.895]         1018   1160
(533.895, 926.272]          926    972
(926.272, 1663.0]           980   1112
(1663.0, 3520.0]           1144   1098
(3520.0, 12650.0]          1096   1018
(12650.0, 9999999999.0]     822    742
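
Counts of this shape, with one row per bin and one column per gap direction, can be produced with a cross-tabulation. The sketch below uses `pd.crosstab` on a handful of made-up rows. The column names `FLOAT_LABEL` and `GAP_VAL` follow the ones used above, but the labels `small` and `large` and the data itself are hypothetical, and this is not necessarily how the tables above were generated.

```python
import pandas as pd

# Made-up events: a float bin label and a gap direction for each one
events = pd.DataFrame({
    'FLOAT_LABEL': ['small', 'small', 'small', 'large', 'large'],
    'GAP_VAL': [1, 1, -1, -1, 1],
})

# Rows are float bins, columns are gap directions, cells are event counts
counts = pd.crosstab(events['FLOAT_LABEL'], events['GAP_VAL'])
print(counts)
```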


From the initial analysis, this looks like a good spread of events across a good range of company sizes.

Create Features

The previous posts in this series involved creating the initial dataset for analysis. The analysis done so far was on the raw data. In this post we take it a step further and derive user-defined fields (features) from the raw data that will aid in the analysis.

The outcome of this post will be to create a final dataset with all the features for analysis.

The initial datasets created on the Quantopian platform contain a list of NASDAQ stocks. These have gapped since the previous day by at least 4%. When this has happened, minute price movements from the day before and after the gap are put into a Pandas data frame. The features created and their reasons are given below.

  • GAP_DIRECTION - the direction the stock moved from the previous day (1 for a move up, -1 for a move down).
  • FLOAT - The size of the company's outstanding share float. To properly feed the algorithms this has been scaled and labelled for proper reporting.
  • MKTCAP - The market cap of the company from Yahoo data.
  • PD_VOLUME - Prior day volume. There are also features for trends in the last 15 and 30 minutes.
  • PD_PRICE - Prior day price movements. For the last 15 and 30 minutes for the previous day.
  • OPENING_CANDLES - Shows how price behaved on the open after the gap. There are candles for the first minute and 5 - 60 mins.
  • RETURN_PERCENT - Percent return if applying gap strategy for particular trade.


End of day movement shows whether, at the end of the day, the stock was a gap runner, a gap closer or just flat. This is the target variable, the outcome to be predicted. The final outcome of this project is to see if it is possible to predict where the stock will move during the day after the gap occurs. The stronger the prediction, the higher the amount invested in the particular trade.
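
To make the window-based features (PD_PRICE and OPENING_CANDLES) more concrete, here is a minimal sketch of how a move over the first or last few minutes of a session might be measured. It assumes `minute_prices` is a plain list with one price per minute, starting at the open. The function names and the sample prices are made up, and the actual features in this series may be calculated differently.

```python
def percent_change(prices):
    """Percent change from the first to the last price in a sequence."""
    if len(prices) < 2 or prices[0] == 0:
        return None
    return (prices[-1] - prices[0]) / prices[0] * 100.0


def opening_move(minute_prices, minutes):
    """Percent move over the first `minutes` minutes after the open."""
    return percent_change(minute_prices[:minutes + 1])


def closing_move(minute_prices, minutes):
    """Percent move over the last `minutes` minutes of a session."""
    return percent_change(minute_prices[-(minutes + 1):])


# Made-up minute prices for one session
prices = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0]
print(opening_move(prices, 1))  # move over the first minute
print(closing_move(prices, 2))  # move over the last two minutes
```

The same two helpers cover the 5 to 60 minute opening candles and the last 15 and 30 minutes of the prior day by changing `minutes` and the session passed in.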

Analysis After Features

The purpose of this step is to see if any of the features created in the previous steps have any predictive power on the movement of the price during the day.

In order to label the outcomes effectively, three simplistic states have been defined for how the price moves over the day:

  • Closer - where the end of day price moves in the opposite direction to the gap, that is, it tries to close the gap.
  • Flat - where the end of day price is within +/- 1% of the open and the trade therefore makes neither a profit nor a loss.
  • Runner - where the closing price has moved in the direction of the gap.
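
The three states can be expressed as a small labelling function. This is a sketch under assumptions: `gap_direction` is taken to be 1 or -1 as in GAP_DIRECTION, the move is measured from the day's open to its close, and the name `label_outcome` and the sample prices are made up.

```python
def label_outcome(gap_direction, open_price, close_price, flat_band=0.01):
    """Label a gap day as 'Closer', 'Flat' or 'Runner'.

    gap_direction is 1 for a gap up and -1 for a gap down.
    """
    move = (close_price - open_price) / open_price
    if abs(move) <= flat_band:
        return 'Flat'
    # A runner keeps moving the same way as the gap
    if (move > 0) == (gap_direction > 0):
        return 'Runner'
    return 'Closer'


# Made-up prices, for illustration only
print(label_outcome(1, 10.0, 10.5))   # gap up, closes higher than the open
print(label_outcome(1, 10.0, 9.5))    # gap up, closes lower than the open
print(label_outcome(-1, 10.0, 10.05)) # gap down, closes within 1% of the open
```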

The analysis will look at all the created features to see whether any of them is associated with a higher likelihood of closers over runners.

Gappers Up

Below are the relationships the created features have with the movement of price by the end of the day, for stocks that have gapped up.

[Figure: Gappers Up]

Prior Day Trends

On a cursory analysis, prior day trends don't seem to have any impact on price direction. Gap closers and runners are equally distributed across them.

Time Based Factors

There doesn't seem to be any predictive power in time based factors either. Month, weekday and year have relatively equal distributions of gap closers and runners.

Company Size

Market cap and float both seem to show a higher number of closers among the smaller companies. This reverses to a smaller number of closers among the larger companies.

Market Open

This looks like the most predictive feature of how the price will move over the day. If the price falls in the first 15 minutes after the open, there is a high likelihood that the gap will close over the day.

So in summary, small cap companies that have gapped up but with a price fall after open are likely to move in the direction of closing the initial gap by the end of the day.
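
One way to put a number on observations like these is to compare the share of closers across the values of a feature. The sketch below does this for a list of event records. The field names `open_15m` and `outcome`, the function name `closer_share` and the six sample events are all invented for illustration and are not results from the dataset.

```python
from collections import defaultdict


def closer_share(events, feature):
    """Share of closers among closers and runners, for each value of a feature."""
    totals = defaultdict(int)
    closers = defaultdict(int)
    for event in events:
        if event['outcome'] == 'Flat':
            continue  # flat days are neither closers nor runners
        totals[event[feature]] += 1
        if event['outcome'] == 'Closer':
            closers[event[feature]] += 1
    return {key: closers[key] / totals[key] for key in totals}


# Made-up events, for illustration only
sample = [
    {'open_15m': 'down', 'outcome': 'Closer'},
    {'open_15m': 'down', 'outcome': 'Closer'},
    {'open_15m': 'down', 'outcome': 'Runner'},
    {'open_15m': 'up', 'outcome': 'Runner'},
    {'open_15m': 'up', 'outcome': 'Closer'},
    {'open_15m': 'up', 'outcome': 'Flat'},
]
shares = closer_share(sample, 'open_15m')
print(shares)
```

A feature with no predictive power would show roughly the same share for every value, while a useful one would show shares that differ clearly between values.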

Gappers Down

The same analysis can be applied to gappers down.

[Figure: Gappers Down]

Gappers down seem to behave in the opposite way to gappers up. The larger the company, the more likely the gap will close over the day.

A trading strategy for opening gaps should therefore take company size into account when predicting movement during the day.