SEC disclosure text mining update
As noted in previous posts, I have been working on a way to automatically identify those 10-Q reports that have a “risk factors” section and to extract that section into what I have been calling “risk files.” My initial, very rough estimate was that 42,601 of 58,798 files (about 72.5%) contained such a section.
I’ve since developed a slightly less crude means of making this determination, which has revised my estimate upward to 42,878 of 58,798 (about 72.9%).
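For readers curious what this kind of detection might look like, here is a minimal sketch in Python. It is not my actual Perl script. It assumes the section is introduced by a heading along the lines of “Item 1A. Risk Factors” and runs until the next “Item N” heading; the patterns and the function name `extract_risk_section` are illustrative only, and real filings are messier than this.

```python
import re

# Assumed heading form: "Item 1A. Risk Factors", tolerant of case,
# spacing, and the punctuation after "1A". Real filings vary more.
RISK_HEADING = re.compile(
    r"^\s*item\s+1a\s*[.:\-]?\s*risk\s+factors\b",
    re.IGNORECASE | re.MULTILINE,
)

# Assumed end marker: the next line that starts with "Item <number>".
NEXT_ITEM = re.compile(
    r"^\s*item\s+\d+[a-z]?\s*[.:\-]",
    re.IGNORECASE | re.MULTILINE,
)


def extract_risk_section(text):
    """Return the body of the risk factors section, or None if absent."""
    headings = list(RISK_HEADING.finditer(text))
    if not headings:
        return None
    # Heuristic: use the last matching heading, since an earlier one
    # may be a table-of-contents entry rather than the section itself.
    heading = headings[-1]
    following = NEXT_ITEM.search(text, heading.end())
    end = following.start() if following else len(text)
    return text[heading.end():end].strip()


sample = """PART II
Item 1. Legal Proceedings
None.
Item 1A. Risk Factors
Our revenue depends on a single customer.
Item 2. Unregistered Sales of Equity Securities
None.
"""

print(extract_risk_section(sample))
```

A filing counts as “having a risk factors section” when the function returns something other than `None`; the returned text is what would go into a risk file.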
To gauge how accurate my little Perl script is, I randomly selected 15 files that it identified as having NO risk factors section and examined them manually. My program had misclassified one of them: it did indeed have a risk factors section.
I also randomly selected, for manual review, 20 files identified as having a risk factors section. Here, the results show that additional refinement is needed:
- 9 files were properly identified: They had a risk factors section, with actual risk factors within it.
- 11 files were improperly identified: They had a risk factors section, but it had something else in it:
  - 5 had references to risk factors previously listed in the firm’s annual report
  - 4 had boilerplate saying, in effect “there are no reportable risk factors”
  - 2 noted that as smaller companies, they were not obligated to report risk factors
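To put the two samples in perspective, the arithmetic can be sketched as below. This is only a back-of-the-envelope extrapolation: with samples of 15 and 20 files the uncertainty is very wide, and the variable names are mine. Note also that for the one missed file I only know it had a risk factors section, not what was in it.

```python
# Counts taken from the estimates and manual samples described above.
TOTAL_FILES = 58_798
FLAGGED = 42_878                   # files the script says have the section
UNFLAGGED = TOTAL_FILES - FLAGGED  # files the script says do not

# Share of sampled flagged files with actual risk factors (9 of 20).
genuine_rate = 9 / 20

# Share of sampled unflagged files that did have the section (1 of 15).
miss_rate = 1 / 15

# Naive extrapolation of the sample rates to the full collection.
est_genuine = FLAGGED * genuine_rate
est_missed = UNFLAGGED * miss_rate

print(f"Unflagged files: {UNFLAGGED:,}")
print(f"Flagged files with actual risk factors (estimate): {est_genuine:,.0f}")
print(f"Unflagged files that may have the section (estimate): {est_missed:,.0f}")
```

Taken at face value, roughly 19,300 of the flagged files would contain actual risk factors, and something like 1,000 sections may be hiding among the unflagged files. I would not lean hard on either number until the samples are larger.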
I am confident that I can weed out the smaller companies that do not need to report risk factors.* How successful I can be at eliminating the “nothing to see here” noise without throwing away signal remains to be seen. As I pursue this, I am also making some headway learning to create and analyze corpora. My confidence is growing that results will come of this, although I do not know how interesting they will be!
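One way the weeding-out might work is sketched below, in Python rather than my Perl. Everything in it is an assumption for illustration: the phrases, the labels, the function name `classify_risk_section`, and especially the 150-word cutoff. The idea is that only short sections are candidates for the “nothing to see here” bin, so a long section that merely mentions the annual report is not thrown away.

```python
import re

# Hypothetical cutoff: sections longer than this are assumed to carry
# real content, even if they also contain boilerplate phrases.
MAX_BOILERPLATE_WORDS = 150

# Illustrative phrase patterns, checked in order. None of these were
# derived from the filings themselves.
PATTERNS = [
    ("smaller_company",
     re.compile(r"smaller\s+reporting\s+compan", re.IGNORECASE)),
    ("none_reported",
     re.compile(r"\bnot\s+applicable\b|\bnone\b"
                r"|\bno\s+(?:reportable\s+)?risk\s+factors\b",
                re.IGNORECASE)),
    ("see_annual_report",
     re.compile(r"\bannual\s+report\b|\bform\s+10-k\b", re.IGNORECASE)),
]


def classify_risk_section(section):
    """Label a risk factors section as boilerplate or 'substantive'."""
    if len(section.split()) > MAX_BOILERPLATE_WORDS:
        return "substantive"
    for label, pattern in PATTERNS:
        if pattern.search(section):
            return label
    return "substantive"


examples = [
    "As a smaller reporting company, we are not required to provide "
    "this information.",
    "Not applicable.",
    "There have been no material changes from the risk factors "
    "disclosed in our Annual Report on Form 10-K.",
    "We face new competition in our core market.",
]

for text in examples:
    print(classify_risk_section(text), "<-", text[:40])
```

The trade-off sits in the cutoff: set it too low and short boilerplate slips through as signal, set it too high and brief but genuine updates get discarded as noise.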
Administrivia: I have created an “SEC Project” category, placing this and all other relevant posts within it.
*Based on some preliminary testing, I’ve found a way to get rid of about 3/4 of the smaller companies that don’t need to report. This will only improve, I suspect.