CRSP Data Summary Statistics by Industry

1. Introduction

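In this post, I compute industry-level summary statistics for the CRSP monthly file using $2$ different industry classification schemes: the Fama and French (1988) scheme with $49$ industries and the Moskowitz and Grinblatt (1999) scheme with $20$ industries.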

All of the code for the results below, as well as a JSON file containing the industry classification schemes, can be found at my GitHub page. I use the Zoom.it API to make it convenient to scroll around and inspect the large summary statistic plots I create. Each of these plots can be expanded to full-screen mode using the controls at the lower right-hand corner of the figure.

2. Data

In this section, I describe my data sources for the plots below.

CRSP Monthly File

I gather my stock data from the CRSP monthly file via the WRDS database. Thus, the unit of observation is a firm $\times$ month pair. I restrict my attention to the period from January 1988 to December 2010 in order to focus on the years over which the Fama and French (1988) industry classification scheme would have been widely known. I keep only actively traded firms listed on the NYSE, NASDAQ and AMEX exchanges. I require that each firm report a non-missing price, return, share count and SIC code for a given month. I also remove any observations that lack valid data in the previous month. This leaves me with $1,916,707$ total firm $\times$ month observations covering $20,686$ firms. The figure below plots the total number of firms in the dataset each month.

Number of firms in the monthly CRSP database from January 1988 to December 2010.
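The sample filters described in this section can be sketched in plain Python. This is a minimal illustration and not the code behind the figure: the field names (`permno`, `month`, `prc`, `ret`, `shrout`, `siccd`, `exchcd`) and the toy records are assumptions, and so is my reading of "valid data in the previous month" as "the same firm also passes every filter one month earlier". The exchange codes follow the CRSP convention of 1 for NYSE, 2 for AMEX and 3 for NASDAQ.

```python
from collections import defaultdict

VALID_EXCHANGES = {1, 2, 3}  # CRSP codes: 1 = NYSE, 2 = AMEX, 3 = NASDAQ
REQUIRED = ("prc", "ret", "shrout", "siccd")  # assumed field names


def prev_month(month):
    """Return the (year, month) tuple immediately before `month`."""
    year, m = month
    return (year - 1, 12) if m == 1 else (year, m - 1)


def is_valid(r):
    """True if the record is on a kept exchange and has no missing field."""
    return r["exchcd"] in VALID_EXCHANGES and all(r[k] is not None for k in REQUIRED)


def filter_rows(rows, start=(1988, 1), end=(2010, 12)):
    """Keep in-window records that are valid now and one month earlier."""
    valid = {(r["permno"], r["month"]) for r in rows if is_valid(r)}
    return [
        r for r in rows
        if start <= r["month"] <= end
        and (r["permno"], r["month"]) in valid
        and (r["permno"], prev_month(r["month"])) in valid
    ]


def firms_per_month(rows):
    """Count distinct firms in each month."""
    firms = defaultdict(set)
    for r in rows:
        firms[r["month"]].add(r["permno"])
    return {month: len(ids) for month, ids in sorted(firms.items())}


def row(permno, month, ret=0.01, exchcd=1):
    """Build a hypothetical firm-month record for the toy example."""
    return {"permno": permno, "month": month, "prc": 10.0, "ret": ret,
            "shrout": 1000, "siccd": 2048, "exchcd": exchcd}


ROWS = [
    row(10001, (1987, 12)), row(10001, (1988, 1)), row(10001, (1988, 2)),
    row(10001, (1988, 3), ret=None),   # missing return
    row(10001, (1988, 4)),             # dropped: previous month invalid
    row(10002, (1988, 1), exchcd=4), row(10002, (1988, 2), exchcd=4),
    row(10003, (1988, 1)),             # dropped: no record for 1987-12
    row(10003, (1988, 2)),
]

counts = firms_per_month(filter_rows(ROWS))
print(counts)  # {(1988, 1): 1, (1988, 2): 2}
```

Building the set of valid firm-month keys first keeps the previous-month check to a single lookup per record, which matters with roughly two million observations.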

Industry Classifications

I created a JSON file to house CRSP-COMPUSTAT industry classification data. The data can be found in various places on the web; e.g., Ken French’s website hosts the classification used in Fama and French (1988). However, everywhere I looked, the data came as a plain-text file with quirky formatting. For example, below is the first industry coding from the file on Ken French’s site:

 1 Agric  Agriculture
          0100-0199 Agric production - crops
          0200-0299 Agric production - livestock
          0700-0799 Agricultural services
          0910-0919 Commercial fishing
          2048-2048 Prepared feeds for animals
 ...

This format is particularly difficult to parse because it is irregularly spaced and has little mark-up around the data. In response to this problem, I used Emacs regular expressions to convert the file on Ken French’s website into JSON. I also coded up the $20$-industry classification used by Moskowitz and Grinblatt (1999). The JSON file contains $2$ top-level entries, one for the Fama and French (1988) industry classification scheme with $49$ different clusters and one for the Moskowitz and Grinblatt (1999) scheme with $20$ different clusters. The industry groupings are based on SIC codes. Below I post a sample entry for the $\mathtt{Agriculture}$ industry from the Fama and French (1988) scheme:

{"Fama and French (1988)": {
    "Agriculture": {
        "Agric production - crops": {"start":100, "end":199},
        "Agric production - livestock": {"start":200, "end":299},
        "Agricultural services": {"start":700, "end":799},
        "Commercial fishing": {"start":910, "end":919},
        "Prepared feeds for animals": {"start":2048, "end":2048}
    },
    ...
}
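As an aside, the conversion from the text layout to this nested structure can also be sketched in Python. I did the actual conversion with Emacs regular expressions, so the snippet is only an illustration of the same idea: the two patterns and the function name `parse_industries` are assumptions written against the Agriculture excerpt, and lines that match neither pattern are simply skipped.

```python
import json
import re

SAMPLE = """\
 1 Agric  Agriculture
          0100-0199 Agric production - crops
          0200-0299 Agric production - livestock
          0700-0799 Agricultural services
          0910-0919 Commercial fishing
          2048-2048 Prepared feeds for animals
"""

# Industry header line: number, short name, long name.
HEADER = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.+?)\s*$")
# Subindustry line: four-digit SIC range followed by a description.
RANGE = re.compile(r"^\s*(\d{4})-(\d{4})\s+(.+?)\s*$")


def parse_industries(text):
    """Map each industry's long name to its {description: {start, end}} ranges."""
    industries = {}
    current = None
    for line in text.splitlines():
        m = RANGE.match(line)
        if m:
            if current is None:
                raise ValueError("SIC range found before any industry header")
            start, end, label = m.groups()
            current[label] = {"start": int(start), "end": int(end)}
            continue
        m = HEADER.match(line)
        if m:
            current = industries.setdefault(m.group(3), {})
    return industries


parsed = parse_industries(SAMPLE)
print(json.dumps({"Fama and French (1988)": parsed}, indent=4))
```

Converting the SIC bounds to integers drops the leading zeros (`0100` becomes `100`), which matches the sample entry and lets the bounds be compared directly with numeric SIC codes.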

Note that in the Fama and French (1988) entry, each main industry heading contains several subindustry headings. The $\mathtt{start}$ and $\mathtt{end}$ tags denote the first and last SIC codes for each subindustry. The Moskowitz and Grinblatt (1999) scheme is less complex. There is a simple start and end SIC code for each of the $20$ broad industry groupings:

 "Moskowitz and Grinblatt (1999)": {
     "Mining": {"start":1000, "end":1499},
     "Food": {"start":2000, "end":2099},
     "Apparel": {"start":2200, "end":2399},
     "Paper": {"start":2600, "end":2699},
     "Chemical": {"start":2800, "end":2899},
     "Petroleum": {"start":2900, "end":2999},
     "Construction": {"start":3200, "end":3299},
     "Prim. Metals": {"start":3300, "end":3399},
     "Fab. Metals": {"start":3400, "end":3499},
     "Machinery": {"start":3500, "end":3599},
     "Electrical Eq.": {"start":3600, "end":3699},
     "Transportation Eq.": {"start":3700, "end":3799},
     "Manufacturing": {"start":3800, "end":3999},
     "Railroads": {"start":4000, "end":4099},
     "Other Transport.": {"start":4100, "end":4799},
     "Utilities": {"start":4900, "end":4999},
     "Retail": {"start":5000, "end":5299},
     "Dept. Stores": {"start":5300, "end":5399},
     "Retail": {"start":5400, "end":5999},
     "Financial": {"start":6000, "end":6999}
 }
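To assign a firm to an industry, its SIC code has to be matched against these ranges. The sketch below shows one way to do this in Python; the helper names (`load_pairs`, `sic_ranges`, `classify`) are hypothetical and the embedded JSON is a shortened copy of the listing. One detail needs care: "Retail" appears twice in the listing because its SIC codes fall in two separate ranges, and Python's `json.loads` keeps only the last value for a repeated key. Passing `object_pairs_hook=list` preserves both entries.

```python
import json

MG_JSON = """
{"Moskowitz and Grinblatt (1999)": {
    "Mining": {"start":1000, "end":1499},
    "Retail": {"start":5000, "end":5299},
    "Dept. Stores": {"start":5300, "end":5399},
    "Retail": {"start":5400, "end":5999}
}}
"""


def load_pairs(text):
    """Parse JSON, keeping every object as a list of (key, value) pairs."""
    return json.loads(text, object_pairs_hook=list)


def sic_ranges(pairs, path=()):
    """Yield (start, end, path) for every start/end leaf under `pairs`."""
    if sorted(key for key, _ in pairs) == ["end", "start"]:
        leaf = dict(pairs)
        yield leaf["start"], leaf["end"], path
        return
    for key, value in pairs:
        yield from sic_ranges(value, path + (key,))


def classify(sic, ranges):
    """Return the path of the first range containing `sic`, or None."""
    for start, end, path in ranges:
        if start <= sic <= end:
            return path
    return None


ranges = list(sic_ranges(load_pairs(MG_JSON)))
print(classify(5211, ranges))  # ('Moskowitz and Grinblatt (1999)', 'Retail')
print(classify(5712, ranges))  # ('Moskowitz and Grinblatt (1999)', 'Retail')
print(classify(8000, ranges))  # None

# A plain load silently keeps only the second "Retail" range.
plain = json.loads(MG_JSON)["Moskowitz and Grinblatt (1999)"]["Retail"]
print(plain)  # {'start': 5400, 'end': 5999}
```

Because the walk is recursive, the same functions should also handle the deeper Fama and French (1988) entry, where the returned path would include the subindustry name.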

3. Fama and French (1988) Classification

In this section, I present $4$ different summary plots of the CRSP data split by the Fama and French (1988) industry classification. First, I plot the number of firms in each industry. In all of the plots, I omit the “Other” industry containing firms with no clear industry classification. All of the $48$ remaining industries except for Candy and Soda, Coal, Non-Metallic and Industrial Mining, Pharmaceutical Products, Precious Metals and Trading display a single-peaked pattern, indicating that the number of firms in each industry expanded dramatically in the run-up to 2000 and declined afterwards.

Number of firms in the monthly CRSP database from January 1988 to December 2010 by Fama and French (1988) industry classification system.

Next, I break the firm counts down further into sub-industries in the figure below. This plot reveals that the number of sub-industries varies widely across industries. What’s more, the single-peaked pattern does not persist as strongly at the sub-industry level.

Number of firms in the monthly CRSP database from January 1988 to December 2010 by Fama and French (1988) industry classification system split by sub-industry.

I then turn to market capitalization by industry rather than firm counts. In the figure below, we see that while the number of firms in most industries has been shrinking since 2000, each industry’s market capitalization has been rising steadily. Fewer firms therefore account for a larger total market value, so the combination of the first figure with the figure below reveals that industries have been consolidating.

Market capitalization in the monthly CRSP database from January 1988 to December 2010 by Fama and French (1988) industry classification system.
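The consolidation argument comes down to simple arithmetic: if an industry's firm count falls while its total market capitalization rises, the average firm must be getting larger. The sketch below makes that explicit on toy numbers. It assumes the usual CRSP conventions that a negative price marks a bid/ask average (so the absolute value is used) and that shares outstanding are recorded in thousands; the field names and the "Banks" records are hypothetical.

```python
from collections import defaultdict


def market_cap(prc, shrout):
    """Market cap in dollars, assuming |price| times shares in thousands."""
    return abs(prc) * shrout * 1_000


def summarize(rows):
    """Map (industry, month) to (firm count, total cap, mean cap per firm)."""
    firms = defaultdict(set)
    totals = defaultdict(float)
    for r in rows:
        key = (r["industry"], r["month"])
        firms[key].add(r["permno"])
        totals[key] += market_cap(r["prc"], r["shrout"])
    return {
        key: (len(firms[key]), totals[key], totals[key] / len(firms[key]))
        for key in totals
    }


# Toy records: three firms in 2000, two larger firms in 2010.
ROWS = [
    {"permno": 1, "industry": "Banks", "month": (2000, 1), "prc": 10.0, "shrout": 1000},
    {"permno": 2, "industry": "Banks", "month": (2000, 1), "prc": 10.0, "shrout": 1000},
    {"permno": 3, "industry": "Banks", "month": (2000, 1), "prc": 10.0, "shrout": 1000},
    {"permno": 1, "industry": "Banks", "month": (2010, 1), "prc": -25.0, "shrout": 1000},
    {"permno": 2, "industry": "Banks", "month": (2010, 1), "prc": 25.0, "shrout": 1000},
]

summary = summarize(ROWS)
print(summary[("Banks", (2000, 1))])  # (3, 30000000.0, 10000000.0)
print(summary[("Banks", (2010, 1))])  # (2, 50000000.0, 25000000.0)
```

In the toy data the firm count drops from 3 to 2 while total capitalization rises from 30 to 50 million dollars, so the mean firm size rises from 10 to 25 million dollars.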

Finally, I look at the distribution of monthly excess returns by industry. I define the excess return as $r_{a,t} - r_{f,t}$, where $r_{a,t}$ is the return on firm $a$ in month $t$ and $r_{f,t}$ is the $3$-month T-Bill rate. Due to space constraints, it was not possible to plot $1$ box plot for each month of observations, so instead I first computed the mean monthly excess return for each firm in each year and then computed yearly box plots. Thus, a data point in the plot below is the mean monthly excess return for a particular firm over a whole year. This figure reveals that there are large outliers in the return distribution that need to be addressed before any further data work can be done. For instance, in 1992 a firm in the entertainment industry earned an average monthly excess return of over $800\%$.

Mean return by year for firms in the monthly CRSP database from January 1988 to December 2010 by Fama and French (1988) industry classification system.
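The two-step aggregation behind these box plots can be sketched as follows: average $r_{a,t} - r_{f,t}$ within each firm-year, then look at the cross-section of those averages. The field names and numbers are made up for illustration. The outlier rule is also an assumption on my part: it is the conventional box-plot fence of $1.5$ interquartile ranges beyond the quartiles, not a statement of how the outliers should ultimately be handled.

```python
from collections import defaultdict
from statistics import mean, quantiles


def firm_year_means(rows):
    """Map (permno, year) to the mean monthly excess return ret - rf."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[(r["permno"], r["year"])].append(r["ret"] - r["rf"])
    return {key: mean(values) for key, values in buckets.items()}


def fence_outliers(values, k=1.5):
    """Return the values lying more than k IQRs outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical monthly records for one firm in 1992 (decimal returns).
ROWS = [
    {"permno": 1, "year": 1992, "ret": 0.05, "rf": 0.003},
    {"permno": 1, "year": 1992, "ret": 0.01, "rf": 0.003},
]
means = firm_year_means(ROWS)

# Hypothetical cross-section of firm-year means; 8.0 stands for 800%.
cross_section = [-0.02, -0.01, 0.0, 0.01, 0.01, 0.02, 0.03, 8.0]
print(fence_outliers(cross_section))  # [8.0]
```

Averaging within the firm-year first means a single extreme month is diluted by the other months of that year, so any firm-year that still lands outside the fence reflects a very large monthly return or a data problem.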

4. Moskowitz and Grinblatt (1999) Classification

I also create similar plots for the industry classification system used in Moskowitz and Grinblatt (1999), which contains only $20$ industries rather than the $48$ in the Fama and French (1988) system. These charts generally mirror the insights from above, just at a coarser level of aggregation. First, I plot the number of firms in each of the $20$ industries. These industry groupings were chosen in part to balance out the partition of firms across industries and, as a result, display a much more even cross-sectional distribution.

Number of firms in the monthly CRSP database from January 1988 to December 2010 by Moskowitz and Grinblatt (1999) industry classification system.

Again, the market capitalization plot reveals that firms have been consolidating within each industry since 2000.

Market capitalization in the monthly CRSP database from January 1988 to December 2010 by Moskowitz and Grinblatt (1999) industry classification system.

Finally, I plot the distribution of excess returns for each firm within industry as above.

Mean return by year for firms in the monthly CRSP database from January 1988 to December 2010 by Moskowitz and Grinblatt (1999) industry classification system.