Since 2013, following the "Big Short," Fannie Mae has been issuing mortgage-backed bonds as part of its "Connecticut Avenue Securities" (CAS) program. This typically occurs several times a year in the form of so-called "deals." For each deal, Fannie Mae selects a pool of qualified mortgage loans and issues bonds that enable investors to receive a portion of the interest payments from the mortgages in that pool. In return, investors are exposed to the risks associated with those mortgages.
For each deal, Fannie Mae publishes data about the mortgage pool. The data comes in the form of a press release like this one and as a dataset that can be downloaded for free from this website. The full directory of all deals is available here.
Every month, the datasets are updated with payment information for each mortgage. Each row in the dataset is a record of the scheduled payment for a specific mortgage. Here is a sample fragment of a dataset for the 2019 R04 G2 deal, downloaded in January 2024, with payment information updated on December 20th, 2023, and the scheduled payment date is October 20th, 2023 (the latest reported payment as of December 20, 2023 - there is a time lag in reporting).
Fannie Mae's Data Dynamics application provides monthly updates on CAS deal performance. But what if an investor wants to take a closer look at underlying mortgage performance?
Capillaries provides an integration test called fannie_mae that processes Fannie Mae datasets for 2022 (24,071,388 payment records) and 2023 (5,102,932 payment records) deals, focusing on payment accuracy for mortgages sold by different sellers. Data processing diagrams:
The full transcript of what the result of each script node looks like in Cassandra is here.This test gives a realistic estimate of the time and money an organization may need to allocate to analyze this amount of Fannie Mae data.
The following tables and charts tell the story of the AWS-based test environment running the fannie_mae script against the 2022-2023 datasets mentioned above, presented as 287 Parquet files, totaling 459MB, with approximately 25 million payment records.
The charts show Prometheus statistics for the AWS environments. The top graph displays CPU usage, the middle one shows Cassandra writes, and the bottom one illustrates Cassandra reads. Open the graph in a separate browser tab to enlarge them.
Flavor | Cores | RAM | Hourly USD | Qty | Hourly rate x qty | Total cores | Total RAM | |
---|---|---|---|---|---|---|---|---|
Prometheus | c7g.medium | 1 | 2 | 0.0363 | 1 | $0.0363 | 1 | 2 |
RabbitMQ | c7g.medium | 1 | 2 | 0.0363 | 1 | $0.0363 | 1 | 2 |
Bastion | c7g.large | 2 | 4 | 0.0725 | 1 | $0.0725 | 2 | 4 |
Cassandra | c7g.16xlarge | 64 | 128 | 2.903 | 4 | $11.6120 | 256 | 512 |
Daemon (24 workers, 16 writers) |
c7g.4xlarge | 16 | 32 | 0.58 | 4 | $2.32 | 64 | 128 |
Total | 11 | $14.0771 | 324 | 648 |
Run times (sec) | 1380 |
Avg run time (sec) | 1380 |
Avg cost of run | $5.3962 |
Cost of setup (10 min) | $2.3462 |
There are a few distinct areas on the graph.
This shows an initial (uneven) CPU plateau matching the elevated areas in writes (peaking at around 450k w/s) and reads (peaking at around 220k r/s).
30% CPU usage, very few writes, ~75k reads/s.
Daemon instances are at 100% CPU, running Python code for around 1.65 million JSON strings (total number of loans in 2022-2023 deals). Database activity is almost zero.
Daemons show a flat 20% CPU usage, with minimal activity on the database. Joins on deals and sellers reveal that many batches end up doing nothing due to the data skew issue discussed in a previous post.
The process produces 3 Parquet files:
Just for the sake of the demo, let's show someting that Fannie Mae reports will not tell you - the percentage of missing payments per deal per seller. A simple Excel pivot table using data from deal_seller_summaries.parquet shows this:
Is it possible to tell that some sellers tend to sell mortgages to less responsible borrowers? From this dataset - hardly, but running this calculation for older mortgages may reveal something. Just a reminder: this test dataset was downloaded from the Fannie Mae website in January 2024. The monthly reporting period range for mortgages in these 2022-2023 deals is November 2021 to October 2023. This means the longest payment sequence on hand is only 24 months.
Can we tell a difference between G1 and G2 deals? We probably can: G2 mortgages tend to have a slightly higher rate of missing payments.