From: https://openai.com/index/scaling-postgresql/
Properly scaling PostgreSQL is important, as it is easy to overload if we are not careful. OpenAI's deployment currently serves millions of queries per second.
Architecture
- A single primary for writes and many read replicas (Azure PostgreSQL)
- For read-heavy load, scaling is good, but write-heavy load can be an issue
- PostgreSQL's Multi-Version Concurrency Control (MVCC) makes writes comparatively expensive
- Code is optimized to minimize writes
- Shardable data is moved to another service - Azure Cosmos DB (a NoSQL database)
- Postgres itself remains unsharded.
Challenges Faced and Optimizations
Reduce Load on Primary
- Offload read traffic to replicas whenever possible
- Only read queries that are part of a write transaction run on the primary
- Writes
  - If shardable
    - Send to Azure Cosmos DB
  - If not
    - Optimize the queries
    - Remove redundant writes
    - Introduce lazy writes
    - When backfilling (say, adding new columns), apply rate limits
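The read-offloading rule above can be sketched as a small routing function. This is a minimal illustration, not OpenAI's actual code; the function and instance names are hypothetical.

```python
# Hypothetical sketch: route queries so only writes, and reads that are
# part of a write transaction, touch the primary; all other reads are
# offloaded to a read replica.

def choose_target(is_write: bool, in_write_transaction: bool) -> str:
    """Return which instance class should serve this query."""
    if is_write or in_write_transaction:
        # Reads inside a write transaction must see the transaction's
        # own uncommitted writes, so they cannot go to a replica.
        return "primary"
    return "replica"
```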
Query optimization
The good old tricks:
- Reduce the number of joins
- Avoid Online Transaction Processing (OLTP) anti-patterns
- Break heavy DB queries into smaller ones, with the joins done in the application layer
  - ORMs can make these queries hard to find
- Set `idle_in_transaction_session_timeout` so idle transactions don't block autovacuum
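The "joins in the application layer" idea can be illustrated with a small sketch. The table names and row shapes here are made up; the two input lists stand in for the result sets of two cheap, index-friendly queries that replace one heavy JOIN.

```python
# Hypothetical sketch: instead of one heavy SQL JOIN, run two small
# queries (e.g. fetch users, then fetch their orders by user_id) and
# join the rows in application code.

def app_layer_join(users, orders):
    """Attach each user's orders to the user row, in Python."""
    orders_by_user = {}
    for order in orders:
        orders_by_user.setdefault(order["user_id"], []).append(order)
    return [
        {**user, "orders": orders_by_user.get(user["id"], [])}
        for user in users
    ]
```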
Single Point of Failure Mitigation
- The write primary is a single point of failure
- Run it in High Availability mode - with a hot standby
Workload isolation
- Requests are split into categories and sent to different instances
- The same isolation is applied per feature and per product
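In its simplest form, workload isolation is just a static mapping from request category to a dedicated set of instances. The category names and pool layout below are hypothetical, for illustration only.

```python
# Hypothetical sketch: each workload category gets its own instance
# pool, so a spike in one category cannot starve the others.

INSTANCE_POOLS = {
    "interactive": ["replica-a", "replica-b"],  # latency-sensitive
    "batch": ["replica-c"],                     # background jobs
    "analytics": ["replica-d"],                 # heavy reporting reads
}

def pool_for(category: str) -> list:
    # Unknown categories fall back to the batch pool rather than
    # touching latency-sensitive instances.
    return INSTANCE_POOLS.get(category, INSTANCE_POOLS["batch"])
```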
Connection Pooling
PgBouncer as a proxy to pool database connections - efficient reuse of connections
Caching Layer
- Cache layer serves most read traffic
- To avoid hammering the database during cache-miss storms, a cache locking (and leasing) mechanism is used: when many requests miss the cache, only one of them acquires the lock and fetches the data from the DB
  - All other requests wait for the data to appear in the cache
  - This protects against load spikes
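The cache-locking idea can be sketched with a per-key lock and a re-check after acquiring it. This is a single-process illustration of the pattern (in production this would typically use a distributed lock or lease in the cache itself); the class and method names are hypothetical.

```python
import threading

# Sketch of cache locking: on a miss, only one caller fetches from the
# database; concurrent missers wait on the per-key lock and then find
# the value already cached.

class LockedCache:
    def __init__(self, fetch):
        self._fetch = fetch            # expensive DB lookup
        self._cache = {}
        self._locks = {}
        self._guard = threading.Lock() # protects the lock table

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the cache while
            # we waited for the per-key lock.
            if key not in self._cache:
                self._cache[key] = self._fetch(key)
            return self._cache[key]
```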
Properly scaling read replicas
The primary streams the Write-Ahead Log (WAL) to every read replica.
- This introduces latency, as the primary has to stream all write operations to all replicas
- They acknowledge that this approach will probably not scale to 100+ replicas, but at around 50 it is scaling fine
- OpenAI plans to introduce Cascading Replication for better scalability in the future.
Rate Limits
Rate limits on
- Connection pooler (probably PgBouncer or SQLAlchemy's built-in pooler)
- Proxy (probably at the endpoint level: requests / sec / client)
- Queries
  - The ORM layer is modified to rate-limit requests / block queries based on their digest
  - Protects against a surge of a particular query shape
And keeping retry intervals sufficiently large
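The query-level rate limiting described above can be sketched as a token bucket keyed by query digest. This is an illustration of the mechanism only; the class name and parameters are hypothetical, not OpenAI's ORM code.

```python
import time

# Hypothetical sketch: a token bucket per query digest, so a surge of
# one query shape is throttled without affecting other queries.

class DigestRateLimiter:
    def __init__(self, rate, burst):
        self.rate = rate      # tokens replenished per second
        self.burst = burst    # bucket capacity
        self._buckets = {}    # digest -> (tokens, last_refill_time)

    def allow(self, digest, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(digest, (self.burst, now))
        # Refill proportionally to elapsed time, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[digest] = (tokens - 1.0, now)
            return True
        self._buckets[digest] = (tokens, now)
        return False
```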
Schema Management
Queries that, say, change the datatype of a column can trigger a full table rewrite, so they are very careful with schema changes.
- Full table rewrite: copy all the data to a new table and delete the existing one
- To test whether an `ALTER` will do a full table rewrite, one approach is to make a copy of the table and perform the `ALTER` on the copy - the Try It And See (TIAS) approach
  - Before running the `ALTER`, note the file location of the table
  - Postgres stores each table in its own file
  - If the file location changes, then we know the `ALTER` triggered a full table rewrite
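The TIAS check can be sketched as follows. `pg_relation_filepath` is a real Postgres function; the table name and the Python wrapper around the comparison are hypothetical (actually running the SQL would require a database connection).

```python
# Hypothetical TIAS sketch: run this SQL on a *copy* of the table before
# and after the ALTER, then compare the two paths.
#   SELECT pg_relation_filepath('my_table_copy');
# Postgres stores each table in its own file, so a changed path means
# the ALTER rewrote the whole table.

def triggered_full_rewrite(path_before: str, path_after: str) -> bool:
    return path_before != path_after
```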
A 5-second timeout rule applies to schema changes.
Only permitted:
- Adding / removing certain columns
- Creating and dropping indices concurrently
No new tables are permitted. Backfilling a table field is done with a rate limit to prevent write spikes.