Form 144 must be filed when an executive officer, director, or affiliate of a company sells more than 5,000 shares of stock, or stock with an aggregate sales price above $50,000, in any three-month period.
Conveniently, we can access the 100 most recent Form 144 filings by using the RSS feed, which provides some basic information about each filing as well as a link to the filing’s database entry. An example of the RSS feed downloaded on July 28, 2025 is provided in the sample-feed.rss file for demonstration purposes.
An RSS feed is just another XML-like file, as you can see from the RSS 2.0 Specification.
The Transparency in Learning and Teaching (TILT) document in this repository describes the purpose of this assignment, as well as the skills and knowledge you should be using and acquiring. The TILT document also contains a checklist for self-reflection that will provide some guidance as to how the assignment will be graded.
1 Warming Up
1.1 Parsing the File
Add a code chunk that will read in the RSS feed as an XML file.
library(xml2)
library(rvest)
library(tidyverse)
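If you are unsure where to start, here is a minimal sketch of reading a feed with xml2. It parses a small inline string rather than the real file, so the feed contents, the Atom-style default namespace, and the name `toy_feed` are all assumptions made for illustration. For the assignment you would pass the path to sample-feed.rss to `read_xml()` instead.

```r
library(xml2)

# Hypothetical two-entry feed; the real file would be read with
# read_xml("sample-feed.rss") instead of a literal string.
toy_feed <- '<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Toy feed</title>
  <entry><title>First</title><id>urn:toy:1</id></entry>
  <entry><title>Second</title><id>urn:toy:2</id></entry>
</feed>'

feed <- read_xml(toy_feed)

# Inspect the root node and any declared namespaces before writing XPath.
root_name  <- xml_name(feed)
namespaces <- xml_ns(feed)

# A default namespace makes un-prefixed XPath such as "//entry" match nothing,
# so either use the d1: prefix reported by xml_ns() or strip the namespace.
n_prefixed <- length(xml_find_all(feed, "//d1:entry", namespaces))
xml_ns_strip(feed)  # modifies the document in place
n_stripped <- length(xml_find_all(feed, "//entry"))
```

Checking `xml_ns()` first is a cheap way to find out why an XPath query that looks correct returns an empty node set.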
1.2 Extracting Information
Add a code chunk that will convert each entry into a data frame row with columns title, link, summary, updated, category, and id. Note that you may need to access attributes to get the important data out of some nodes, while for others you may only need to access the content.
Show the first 5 and last 5 rows of your data, using a function like knitr::kable, DT::datatable, or IPython’s display() and Markdown(). See this page for more information about rendering “pretty” tabular data using Quarto.
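One possible shape for this step is sketched below, assuming an Atom-style feed in which `link` and `category` carry their data in attributes (`href` and `term`) and the other fields carry it as text. The toy feed, the helper names `child_text` and `child_attr`, and the attribute names are illustrative assumptions; check them against the actual file before relying on them.

```r
library(xml2)
library(tibble)

# Hypothetical feed: the second entry has no <summary> on purpose.
toy_feed <- '<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>144 - EXAMPLE CO (0000000001) (Subject)</title>
    <link rel="alternate" type="text/html" href="https://example.com/filing-1"/>
    <summary type="html">Filed: 2025-07-28</summary>
    <updated>2025-07-28T12:00:00-04:00</updated>
    <category label="form type" term="144"/>
    <id>urn:toy:filing-1</id>
  </entry>
  <entry>
    <title>144 - SAMPLE INC (0000000002) (Subject)</title>
    <link rel="alternate" type="text/html" href="https://example.com/filing-2"/>
    <updated>2025-07-28T11:00:00-04:00</updated>
    <category label="form type" term="144"/>
    <id>urn:toy:filing-2</id>
  </entry>
</feed>'

feed <- read_xml(toy_feed)
xml_ns_strip(feed)
entries <- xml_find_all(feed, "//entry")

# xml_find_first() returns one result per entry (a missing node if absent),
# so every column has the same length even when a child is missing.
child_text <- function(name) {
  xml_text(xml_find_first(entries, paste0("./", name)))
}
child_attr <- function(name, attr) {
  xml_attr(xml_find_first(entries, paste0("./", name)), attr)
}

filings <- tibble(
  title    = child_text("title"),
  link     = child_attr("link", "href"),
  summary  = child_text("summary"),
  updated  = child_text("updated"),
  category = child_attr("category", "term"),
  id       = child_text("id")
)

# First and last rows; with only two toy entries the two halves overlap.
preview <- rbind(head(filings, 5), tail(filings, 5))
```

Passing `preview` to `knitr::kable()` would render it as a formatted table in the knitted document.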
1.3 Is it XML?
Is an RSS feed actually an XML file? Why or why not? What requirements of an XML file does an RSS feed meet, and what requirements are lacking?
2 Cleaning the Data
The data frame you created in Section 1.2 is not exactly in tidy form. Let’s see if we can fix that!
2.1 Tidy Data
What aspects of the data frame need work in order to meet the criteria for tidy data?
2.2 Tidy Planning
Write a step-by-step plan to convert your data into tidy form.
2.3 Tidying, for real!
Execute your step-by-step plan. Show a nicely formatted data table with up to 10 entries, and explain why your data is now tidy.
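A common tidy-data problem is a column that packs several values into one string. The sketch below shows one way to split such a column with `tidyr::extract()`. The title format (form type, company name, a numeric identifier, and a role) is an assumption modeled on the toy data here, and the column names `form`, `company`, `cik`, and `role` are hypothetical, so adapt the regular expression to what you actually see in your data frame.

```r
library(tibble)
library(tidyr)

# Hypothetical rows with four values packed into one title string.
raw <- tibble(
  title = c("144 - EXAMPLE CO (0000000001) (Subject)",
            "144 - SAMPLE INC (0000000002) (Reporting)"),
  id    = c("urn:toy:filing-1", "urn:toy:filing-2")
)

# Each capture group becomes its own column; convert = FALSE (the default)
# keeps the identifier as text so leading zeros are preserved.
tidy <- tidyr::extract(
  raw,
  col   = title,
  into  = c("form", "company", "cik", "role"),
  regex = "^([^ ]+) - (.+) \\(([0-9]+)\\) \\((.+)\\)$"
)
```

After the split, each column holds one variable and each cell holds one value, which is the property to point to when you explain why the result is tidy.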
3 Filing Forms
The RSS feed provides bare-bones data about SEC filings, but also links to the actual forms. The link column of your data should contain a URI for the form which was actually filed. Let’s work on pulling out the form data, with the idea that we might want to analyze SEC filing data.
For instance, the following entry describes a Tractor Supply filing.
Open one link from your table and examine the HTML page.
Identify the CSS selector you believe would be most efficient to obtain the table of document format files.
Your answer here
Identify the CSS selector you believe would be most efficient for obtaining the first link in the table of document format files.
Your answer here
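To test a candidate selector before committing to an answer, you can try it against a parsed page with rvest. The sketch below uses a toy page, so the class name `docs` and the file paths are hypothetical and are not the selectors for the real Filing Detail page. It only illustrates how a table selector and a "first link in that table" selector behave.

```r
library(rvest)

# Hypothetical page with two tables; only the second holds documents.
toy_page <- '<html><body>
  <table class="nav"><tr><td><a href="/home">Home</a></td></tr></table>
  <table class="docs">
    <tr><th>Seq</th><th>Document</th><th>Type</th></tr>
    <tr><td>1</td><td><a href="/files/primary_doc.html">primary_doc.html</a></td><td>144</td></tr>
    <tr><td>2</td><td><a href="/files/primary_doc.xml">primary_doc.xml</a></td><td>144</td></tr>
  </table>
</body></html>'

page <- read_html(toy_page)

# html_element() returns only the first match, which is what makes a
# selector for "the first link in the table" so short.
docs_table <- html_table(html_element(page, "table.docs"))
first_href <- html_attr(html_element(page, "table.docs a"), "href")
```

A selector is efficient when it is as short as possible while still skipping unrelated elements, such as the navigation table in this toy page.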
3.2 Reading HTML Tables
Write a function that takes a URI to the Filing Detail page and extracts the link for the document corresponding to the filing entry (I recommend going for the .htm link). Use your function to obtain the HTML addresses of the filed forms for all of the links in your RSS feed.
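One way to structure such a function is to separate parsing from fetching, so that the parsing step can be tested offline. This is only a sketch: the function names `doc_link_from_page` and `filing_doc_link`, the default selector, and the example URLs are hypothetical, and you will need to substitute the selector you identified for the real page.

```r
library(rvest)
library(xml2)

# Parsing step: works on an already-parsed page, so it can be tested offline.
doc_link_from_page <- function(page, base_url,
                               selector = "table.docs a[href*='.htm']") {
  href <- html_attr(html_element(page, selector), "href")
  if (is.na(href)) return(NA_character_)
  # Links on a page are often relative; resolve them against the page's URI.
  url_absolute(href, base_url)
}

# Fetching step: returns NA instead of stopping when a page cannot be read.
filing_doc_link <- function(uri, ...) {
  page <- tryCatch(read_html(uri), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)
  doc_link_from_page(page, base_url = uri, ...)
}

# Offline check against a hypothetical page where the .xml link comes first.
toy_page <- read_html('<table class="docs">
  <tr><td><a href="/files/primary_doc.xml">xml</a></td></tr>
  <tr><td><a href="/files/primary_doc.htm">htm</a></td></tr>
</table>')

link    <- doc_link_from_page(toy_page, "https://example.com/filings/1/index.htm")
missing <- filing_doc_link("no-such-file-for-this-sketch.html")
```

To apply it to every row, something like `vapply(links, filing_doc_link, character(1))` would work, ideally with a short `Sys.sleep()` between requests so the server is not hit in a tight loop.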
3.3 Reading HTML Pages
Use the links to the filed forms and extract the following information from each page, storing the information for each form in a data frame.
Name, SEC File Number, Address, and Phone Number of Issuer
Name of person selling the securities, and relationship to the Issuer
Securities Information: all values in the table
Securities to be Sold: all values in the table
Other sales in the last 3 months, if any
Notice date
You may want to write one or more functions to extract the necessary data, with error handling in order to ensure that if the data does not exist the function still returns the information that is available.
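The error-handling idea can be sketched as a small helper that returns `NA` whenever a field is absent or the page is unusable, so that one missing value never discards the rest of the row. The element ids (`#issuer-name` and so on) and the function names below are hypothetical stand-ins, since the structure of the real form has to be worked out by inspecting it.

```r
library(rvest)
library(tibble)

# Return the trimmed text of the first match, or NA if nothing matches
# or if the input is not a usable page at all.
text_or_na <- function(page, selector) {
  tryCatch(
    html_text(html_element(page, selector), trim = TRUE),
    error = function(e) NA_character_
  )
}

# Always returns a one-row tibble with the same columns.
extract_issuer <- function(page) {
  tibble(
    issuer_name     = text_or_na(page, "#issuer-name"),
    sec_file_number = text_or_na(page, "#sec-file-number"),
    phone           = text_or_na(page, "#issuer-phone")
  )
}

# Hypothetical form that is missing the phone number.
toy_form <- read_html('<div>
  <span id="issuer-name"> EXAMPLE CO </span>
  <span id="sec-file-number">001-00001</span>
</div>')

issuer  <- extract_issuer(toy_form)
nothing <- extract_issuer(NULL)  # a failed download still yields a row of NAs
```

Because every call returns the same columns, the per-form results can be row-bound into a single data frame without special cases.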
3.4 Assessing Market Value of Stock Sales
Make a histogram of the aggregate market value of stock to be sold for the entries in your RSS feed. Make sure your plot has descriptive titles, axis labels, and uses appropriate scales.
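Dollar amounts of this kind tend to span several orders of magnitude, in which case a log-scaled axis is usually easier to read than a linear one. The sketch below uses simulated values, not real filing data, and the column name `market_value` is a hypothetical placeholder for whatever you named the aggregate market value column.

```r
library(ggplot2)

# Simulated values between $10,000 and $100,000,000; NOT real filing data.
set.seed(144)
toy <- data.frame(market_value = 10^runif(100, min = 4, max = 8))

p <- ggplot(toy, aes(x = market_value)) +
  geom_histogram(bins = 30) +
  # Log scale with dollar-formatted tick labels.
  scale_x_log10(labels = scales::label_dollar()) +
  labs(
    title = "Aggregate market value of stock to be sold (simulated data)",
    x     = "Aggregate market value (USD, log scale)",
    y     = "Number of filings"
  )
```

Printing `p` draws the plot. Whether a log scale is appropriate for your data depends on how skewed the real values turn out to be.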
3.5 Efficiency
Imagine you’d saved the RSS feed from the SEC each day for the past 30 days. The SEC does not necessarily receive more than 100 Form 144 filings each day, so consecutive feeds may contain some of the same filings, and the feed is not updated on federal holidays or weekends.
Describe how you would approach generating a database of the unique Form 144 filings over all 30 days of RSS feed files. Make an ordered list of steps indicating the order in which you would read in the RSS entries, acquire the corresponding data from the SEC filing form, and deduplicate the potentially repeated filings, along with any other intermediate steps you might take.
Your solution should minimize the amount of load on the SEC’s server both because it minimizes the running time of the code and because it’s more polite.
Calculate the total number of calls you have to make to the SEC’s server to implement your solution, showing your work.
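The counting logic can be prototyped on toy data before you do the arithmetic for the real scenario. The sketch below assumes that the saved feeds need no further requests, that entries are deduplicated on `id` before anything is downloaded, and that each unique filing costs two requests (the Filing Detail page and the form itself). The feeds, ids, and the two-requests figure are illustrative assumptions, not the expected answer.

```r
# Three hypothetical saved feeds; day3 is an unchanged copy of day2,
# as would happen over a weekend.
feeds <- list(
  day1 = data.frame(id = c("f1", "f2", "f3")),
  day2 = data.frame(id = c("f3", "f4")),
  day3 = data.frame(id = c("f3", "f4"))
)

# Combine first, deduplicate second, and only then contact the server.
all_entries    <- do.call(rbind, feeds)
unique_entries <- all_entries[!duplicated(all_entries$id), , drop = FALSE]

calls_per_filing <- 2  # assumption: one detail page plus one form per filing
total_calls <- nrow(unique_entries) * calls_per_filing
naive_calls <- nrow(all_entries) * calls_per_filing
```

In this toy case, deduplicating first brings the request count down from `naive_calls` to `total_calls`. The same comparison, with the real number of feeds and entries, is the work you are asked to show.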