Commit: bulk_update_mappings adr

Kenneth Kehl committed Jan 22, 2025
1 parent e6899ef commit 669f4e9

Showing 2 changed files with 37 additions and 2 deletions.
8 changes: 6 additions & 2 deletions docs/adrs/0010-adr-celery-pool-support-best-practice.md
@@ -1,7 +1,7 @@
# Make best use of celery worker pools

- Status: N/A
- Date: N/A
+ Status: Accepted
+ Date: 7 January 2025

### Context
Our API application started with celery's default pool support of 'prefork' and a concurrency of 4. We continuously encountered instability, which we initially attributed to a resource leak. As a result, we added the configuration `worker-max-tasks-per-child=500`, which is a best practice. When we ran a load test of 25,000 simulated messages, however, we continued to see stability issues, culminating in a crash of the app after 4 hours that required a restage. After running `cf app notify-api-production` and observing that `cpu entitlement` was off the charts at 10000% to 12000% for the workers, and after doing some further reading, we came to the conclusion that perhaps `prefork` is not the best type of pool support for the API application.
@@ -10,8 +10,12 @@ The problem with `prefork` is that each process has a tendency to hang onto the

### Decision

We decided to try the 'threads' pool support with increased concurrency.
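Concretely, the change amounts to swapping the worker's pool flag and raising concurrency. A sketch of the invocation (the `-A run_celery` app module and the concurrency value here are illustrative, not necessarily our exact entrypoint or setting):

```shell
# Before: default prefork pool, concurrency 4, with process recycling
# celery -A run_celery worker --pool=prefork --concurrency=4 --max-tasks-per-child=500

# After: thread pool with higher concurrency
# (--max-tasks-per-child applies to prefork child processes, so it is dropped here)
celery -A run_celery worker --pool=threads --concurrency=10
```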

### Consequences

We saw an immediate decrease in CPU usage of about 70% with no adverse consequences.

### Author
@kenkehl

31 changes: 31 additions & 0 deletions docs/adrs/0011-adr-delivery-receipts-updates.md
@@ -0,0 +1,31 @@
# Optimize processing of delivery receipts

Status: Accepted
Date: 22 January 2025

### Context
Our original effort to get delivery receipts for text messages was very object oriented and conformed to other patterns in the app. After an individual message was sent, we would kick off a new task on a delay, and this task would search the CloudWatch logs for the given phone number.
On paper this looked good, but when one customer did a big send of 25k messages, we suddenly realized this was a bad idea. We overloaded the AWS API and got massive throttling as a result. Although we ultimately did get most of the delivery receipts, it took hours and the logs were filled with errors.

In refactoring this, there were two possible approaches we considered:

1. Batch updates in the db (up to 1000 messages at a time). This involves running update queries with CASE statements, and there is some practical limit on how large those statements can get while remaining efficient.

2. bulk_update_mappings(). This would be a raw update, similar to COPY, that could handle millions of rows at a time.
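As a sketch of approach 1, a single UPDATE with a CASE expression keyed on `message_id` can apply a whole batch in one statement. The table and column names here are illustrative, not our actual schema, and the demo runs against an in-memory SQLite database:

```python
import sqlite3

def batch_update_status(conn, receipts):
    """Apply one UPDATE for a batch of (message_id, status) pairs,
    using a CASE expression keyed on message_id."""
    case_parts = " ".join("WHEN ? THEN ?" for _ in receipts)
    params = [p for mid, status in receipts for p in (mid, status)]
    ids = [mid for mid, _ in receipts]
    placeholders = ",".join("?" for _ in ids)
    sql = (
        f"UPDATE notifications "
        f"SET status = CASE message_id {case_parts} END "
        f"WHERE message_id IN ({placeholders})"
    )
    conn.execute(sql, params + ids)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications (id INTEGER PRIMARY KEY, message_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO notifications (message_id, status) VALUES (?, ?)",
    [("m1", "sending"), ("m2", "sending"), ("m3", "sending")],
)
batch_update_status(conn, [("m1", "delivered"), ("m2", "failed")])
print(dict(conn.execute("SELECT message_id, status FROM notifications")))
# {'m1': 'delivered', 'm2': 'failed', 'm3': 'sending'}
```

Note that this keys the update on `message_id` alone, which is exactly the property that makes approach 1 workable for us (see the Decision below).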

### Decision

We decided to use batch updates. Even though they don't theoretically scale to the same level as bulk_update_mappings(), our app has a potential problem with using bulk_update_mappings(). In order for it to work, we would need to know the "id" of each notification, which is the primary key of the notifications table. We do NOT know the "id" when we process the delivery receipts. We do know the "message_id", but in order to get the "id" we would either have to run a select query, or we would have to maintain some mapping in redis, etc.

It is not clear, given the extra work required, that bulk_update_mappings() would be greatly superior to batch updates for our purposes. And batch updates currently allow us to scale at least 100x above where we are now.
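The 1000-row cap just means splitting the incoming receipts into fixed-size batches before issuing each UPDATE. A minimal sketch (the batch size and receipt shape are illustrative):

```python
def chunked(items, size=1000):
    """Yield successive fixed-size batches from a list of receipts."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 2500 fake (message_id, status) receipts -> three batches
receipts = [(f"m{n}", "delivered") for n in range(2500)]
batches = list(chunked(receipts, 1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```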

### Consequences

Batch updates greatly cleaned up the logs (no more throttling errors) and reduced CPU consumption. It was a very positive change.

### Author
@kenkehl

### Stakeholders
@ccostino
@stvnrlly
