Useful Sentiment Analysis: Mining SEC Filings (Part 1)

This is how a computer reads. I think.

Ugh. Sentiment analysis. To be honest, I roll my eyes every time I hear about the use of sentiment analysis for trading stocks. It’s not that it doesn’t work; it’s that most of the products I’ve come across in my professional capacity (data science/research for trade idea generation at a fund) have been gimmicky: tracking tweets, parsing news headlines, parsing earnings statements after release, parsing Fed notes… the list goes on and on. These products have had varying efficacy and large price tags. Without naming names, a recent product that analyzes tweets incorrectly tagged Twitter memes and jokes about Red Dead Redemption 2’s many bugs as an “insight” that unit sales were weak. They were Red Dead Wrong, now focusing on Redemption, and I can keep making these puns all day.

So, what’s the purpose of this piece, and why am I looking at sentiment analysis if my sentiment on it is so negative to begin with? Well, textual data has a lot of power. One of the interesting things about modeling and forecasting is that data mass (the amount of data you are working with) often has more impact than algorithmic complexity.

For more info:

With that in mind, there is a lot more text data available compared to numerical data. Best of all, it’s mostly free and very messy. Cleaning that data and doing the tedious work that often gets avoided or overlooked is where profits lie.

Ways to approach sentiment analysis

Sentiment analysis uses a variety of methods to gauge the tone, opinion, and emotion embedded in a piece of text. It has gained popularity in finance because markets react to changes in sentiment, and because many of the advances in Natural Language Processing (NLP) are easily accessible to anyone with some coding experience.

Motivation and Data Collection

As an aside, another reason I like text data is a formative experience on my career path as a data scientist. In 2008–2009 I was a senior in high school and attended a talk by Vinton Cerf at BBN Technologies in Cambridge, MA. In his talk he went over his work as a tech evangelist at Google, as well as some stories and lessons from his work on TCP/IP. BBN Technologies was a defense contractor that had worked on ARPANET, acoustical analysis of the JFK assassination tapes, and Shotspot, a tool that uses acoustic analysis to determine where a bullet was fired from. The talk was amazing, and the history of science and technology in that office was inspiring.

I got to talking with some of the engineers and, fast forward some months, in the summer of 2009 I was working at BBN Technologies in their Speech and Language Technologies department. Without going into too much detail, the main project was to create software that let a user speak in English and have the computer or smartphone translate and speak in a foreign language, all while working offline. Since this was for the U.S. military, the focus was on translating to Pashto and Iraqi. Working on NLP before Siri existed was wild. It was the first time I experienced a practical application of mathematics and programming (most of the coding I had done up to that point had been fairly dull textbook examples). This basically set me on the meandering path of a data scientist.

So we fast forward many years, and I found myself fairly adept at coding in R but with minimal ability to program in Python. Wanting to grow my skillset, I decided to attempt a project in Python. I had read the paper Lazy Prices, which describes a methodology for parsing the Management Discussion & Analysis (MD&A) section of 10-K and 10-Q SEC filings. From the abstract:

Full paper can be found here:

If this is your first attempt at some form of sentiment analysis, Lazy Prices is a great way to start. The logic is laid out cleanly and there are many variations on the final analysis that can be done. Additionally, SEC filings are fairly structured, with a lot of repeated boilerplate language. The rest of this article focuses on implementing this paper, so I would highly recommend reading it before continuing.

I started by downloading a CSV that had CIK, ticker, company name, trading exchange, and some other fields. CIK stands for Central Index Key, and it’s how the SEC identifies corporations or individuals that have filed disclosures. I filtered the list to keep only the stocks listed on the NYSE.
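The filtering step can be sketched in a few lines of Python. This is a minimal illustration only: the sample rows and the column names (cik, ticker, name, exchange) are assumptions standing in for the real CSV, whose header may differ.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded CSV; the column names
# (cik, ticker, name, exchange) are assumptions, not the real file's header.
SAMPLE_CSV = """cik,ticker,name,exchange
0000000001,AAA,Alpha Corp,NYSE
0000000002,BBB,Beta Inc,NASDAQ
0000000003,CCC,Gamma Ltd,NYSE
"""


def nyse_rows(csv_file, exchange="NYSE"):
    """Return the rows (as dicts) whose exchange column matches."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["exchange"].strip().upper() == exchange]


rows = nyse_rows(io.StringIO(SAMPLE_CSV))
ciks = [row["cik"] for row in rows]  # kept as strings to preserve leading zeros
print(ciks)
```

Reading the CIK as a string rather than a number is a deliberate choice here, since the leading zeros would otherwise be lost.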

Instead of scraping EDGAR, the SEC’s online portal for retrieving filings, I used an R package called edgar. I don’t know if there is a newer version, but I would not recommend using this package, as the documentation wasn’t great and contained code errors. Also, the setup was very clunky. There seems to be a Python package with the same name and, judging by some code samples, it looks much cleaner. I plan on moving all the preprocessing code I wrote in R to Python.

The R code downloaded 10-K filings for each CIK for 2014, 2015, and 2016 (this code was written in the summer of 2017). For a production model, I would certainly run this process on 10-Qs, as quarterly filings are more frequent and have a more immediate impact on a stock’s price. Both filings have an MD&A section, so the framework built here should apply with minimal changes in code.

Pre-processing and cleansing

Earlier, I mentioned that working with text data requires a lot of cleansing. I cleansed each filing using basic R functions such as grep and gsub:

  1. remove numerical values, as we want to focus on changes in text as opposed to changes in reported values
  2. remove all punctuation and table-style tags
  3. remove special characters using gsub("[^[:alnum:][:blank:]+?&/\\-]", "", text)

The following R code shows how to read a specific filing and clean it. The output is a cleaned list of lines (a character vector in R):

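# URL of the filing to read (left blank here)
url <- ""
exploreDoc <- try(readLines(URLencode(url)))
# strip HTML/table tags and non-breaking-space entities
cleanedDoc <- gsub("<.*?>", "", exploreDoc)
cleanedDoc <- gsub("&nbsp;", " ", cleanedDoc)
# collapse runs of spaces to a single space so adjacent words are not merged
cleanedDoc <- gsub(" {2,}", " ", cleanedDoc)
# trim leading and trailing whitespace
cleanedDoc <- gsub("^\\s+|\\s+$", "", cleanedDoc)
# drop numerical values; numeric entities such as &#160; are left as "&#;"
cleanedDoc <- gsub("\\d+", "", cleanedDoc)
cleanedDoc <- gsub("&#;", "", cleanedDoc)
# keep only alphanumerics, blanks, and a few symbols (+ ? & / -)
cleanedDoc <- gsub("[^[:alnum:][:blank:]+?&/\\-]", "", cleanedDoc)
# drop lines that are empty or hold only leftover separators
cleanedDoc <- cleanedDoc[!cleanedDoc %in% c("", " ", ",", ",,")]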
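For readers who would rather stay in Python, the same cleaning steps might look roughly like the sketch below. It is an approximation rather than a line-by-line port of the R code: it keeps ASCII letters only, and the function name clean_filing_lines and the sample lines are made up for illustration.

```python
import html
import re


def clean_filing_lines(lines):
    """Rough Python counterpart of the R cleaning steps (a sketch, not a port)."""
    cleaned = []
    for line in lines:
        line = re.sub(r"<.*?>", "", line)                # strip HTML/table tags
        line = html.unescape(line)                       # decode entities such as &nbsp;
        line = line.replace("\xa0", " ")                 # non-breaking space -> space
        line = re.sub(r"\d+", "", line)                  # drop numerical values
        line = re.sub(r"[^A-Za-z \t+?&/\-]", "", line)   # drop other special characters
        line = re.sub(r"[ \t]{2,}", " ", line).strip()   # tidy up whitespace
        if line:                                         # skip lines left empty
            cleaned.append(line)
    return cleaned


# Made-up sample lines standing in for a downloaded filing.
sample = [
    "<td>Revenue&nbsp;increased 12% to $4,500</td>",
    "<tr>  </tr>",
    "Risk factors: litigation & competition",
]
print(clean_filing_lines(sample))
```

Decoding entities with html.unescape before removing digits avoids the leftover "&#;" fragments that the R version has to clean up in a separate step.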

Keeping the cleaned data as a list of lines makes it much easier to work with. We made an intuitive leap and figured that, since most of the non-numerical, non-MD&A text is boilerplate, we are probably fine using as much text as possible: it will either stay constant across filing years and so wash out of the comparison, or it may carry some additional value. For each filing, we stored the cleaned text in a separate folder, with each file name labeled with the CIK, filing type, and year.
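As a small illustration of that storage step, here is one possible naming scheme in Python. The pattern cik_form_year.txt and the helper names are assumptions for this sketch; the article does not specify the exact file name format that was used.

```python
import tempfile
from pathlib import Path


def cleaned_path(folder, cik, form_type, year):
    """Build a path like <folder>/<cik>_<form>_<year>.txt (assumed naming scheme)."""
    return Path(folder) / f"{cik}_{form_type}_{year}.txt"


def save_cleaned(folder, cik, form_type, year, lines):
    """Write the cleaned lines to disk, one line of text per row."""
    path = cleaned_path(folder, cik, form_type, year)
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


def parse_name(path):
    """Recover (cik, form_type, year) from a file name built by cleaned_path."""
    cik, form_type, year = Path(path).stem.split("_")
    return cik, form_type, int(year)


# Round trip through a temporary folder with made-up content.
with tempfile.TemporaryDirectory() as folder:
    saved = save_cleaned(folder, "0000000001", "10-K", 2015,
                         ["revenue increased", "risk factors"])
    round_trip = saved.read_text(encoding="utf-8").splitlines()
    parsed = parse_name(saved)
print(parsed, round_trip)
```

Encoding the CIK, form type, and year in the file name means the later comparison step can pair up consecutive years without opening any files.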

Now that we have our cleaned filings, we can use Python to do our sentiment analysis (this is where I switched from writing in R to writing in Python). Our code will read the cleaned text and use sklearn’s CountVectorizer to tokenize it. CountVectorizer basically counts occurrences of each distinct word in a piece of text; other approaches, such as word2vec, instead represent words as learned embedding vectors. To learn about word embeddings, I recommend this resource:
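To make the counting step concrete, here is a minimal standard-library sketch of what a count vectorizer produces. It is a stand-in for sklearn’s CountVectorizer, not a reproduction of it: the token pattern below only approximates the library’s default tokenization, and the two sample sentences are made up.

```python
import re
from collections import Counter

# Lowercased tokens of two or more word characters; this approximates
# CountVectorizer's default tokenization but is not guaranteed to match it.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def count_words(text):
    """Count occurrences of each distinct token in one document."""
    return Counter(TOKEN_RE.findall(text.lower()))


def to_vectors(texts):
    """Return (vocabulary, count vectors) over a shared, sorted vocabulary."""
    counts = [count_words(t) for t in texts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]


vocab, vectors = to_vectors([
    "revenue increased and margins increased",
    "revenue decreased",
])
print(vocab)
print(vectors)
```

Each document becomes a row of counts over the same vocabulary, which is what makes two filings directly comparable.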

As an initial analysis, we look for documents whose language changed from one year to the next. Specifically, we look for changes in cosine similarity:
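A minimal sketch of this comparison in plain Python follows. It is an illustration under stated assumptions, not the code used for the original analysis: the filings dictionary holds made-up text keyed by (CIK, year), and word counts are built with a simple regular expression instead of sklearn.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Bag-of-words counts for one cleaned filing."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count mappings (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical cleaned filings keyed by (CIK, year).
filings = {
    ("0000000001", 2014): "we expect stable demand for our products",
    ("0000000001", 2015): "we expect stable demand for our products",
    ("0000000002", 2014): "we expect stable demand for our products",
    ("0000000002", 2015): "we face litigation and declining demand",
}


def year_over_year(filings):
    """Similarity of each filing to the same CIK's filing one year earlier."""
    scores = {}
    for (cik, year), text in filings.items():
        previous = filings.get((cik, year - 1))
        if previous is not None:
            scores[(cik, year)] = cosine_similarity(
                word_counts(previous), word_counts(text))
    return scores


scores = year_over_year(filings)
# Lowest similarity first: these are the filings whose language changed most.
ranked = sorted(scores, key=scores.get)
print(ranked)
```

A similarity near 1 means the filing is almost a copy of the prior year’s; the filings at the front of ranked are the ones worth a closer look.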

Conclusion and next steps

As this was an initial introduction to sentiment analysis, we are going to stop here. The code above should help you process filings and find which companies have the largest change in language. You can then isolate those CIKs for further sentiment analysis, using financial dictionaries to help the system classify words as positive or negative. Of course, should you decide to put this code into production in a trading engine, a backtest is necessary. For that, it would make the most sense to map the CIKs back to tickers and tabulate returns.
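As a rough idea of what dictionary-based scoring could look like, here is a small Python sketch. The word lists are tiny, hypothetical placeholders, and the net tone formula (positive minus negative counts, divided by total words) is just one simple choice among many.

```python
import re

# Hypothetical, tiny word lists for illustration only; a real analysis would
# load a published financial sentiment dictionary instead.
POSITIVE = {"growth", "improved", "strong"}
NEGATIVE = {"litigation", "decline", "impairment"}


def net_tone(text):
    """(positive - negative) / total words; 0.0 for empty text."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


tone = net_tone("litigation and impairment offset strong results")
print(tone)
```

Running this on only the filings flagged by the similarity step keeps the dictionary pass focused on the companies whose language actually changed.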

Remember, sentiment analysis can be done in many different ways. This is a starting point, so experiment with different similarity measures, different dictionaries, and, of course, look for ways to parse the text even more precisely.

I plan to follow up with a Part 2 as I explore more extensions.

Useful Sentiment Analysis: Mining SEC Filings (Part 1) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
