I have an interesting technical situation facing me:
- A docker container which runs persistently and writes JSON files to a directory
- Each JSON file contains short form messages with JSON encoding along a variety of metadata fields
- Each file is idempotent i.e. once it exists in the directory will never be written to again
- Each file is named for a timestamp
Each file needs to:
- Have ridiculous, non-JSON comments removed from the beginning
- Be parsed into json
- There is roughly 6000 of these JSON things coming in per second
- Have only a selection of the JSON internally be selected
- Be batched into a collection up to the maximum size the SQS supports and then committed to SQS
- Be processed only once
Note: SQS is the AWS "Simple Queue Service". A queue is a specialized data structure which hands data off for processing by other tasks.
My initial thinking to handle this was to use Rust to write a high performance file processor. This appealed to me:
- New technical challenge; I don't know Rust
- Manly - writing fast code is a conveniently macho challenge
- Simple architecture - one box, one process watching a directory and ripping through files
There are, naturally, problems here:
- The Rust program represents a SPOF or "Single Point of Failure"; things break
- This architecture is simple but making it not be a SPOF means the architecture gets complex
- We don't actually even know if a single Rust executable is fast enough to handle all the data
- Rust is complex and even simple things take dramatically more code than a scripting language like Ruby or Python. There isn't anything wrong with Rust (I really like the language but learning it is non-trivial).
Thinking through all these issues as well as learning, by chance, that the runtime execution period for AWS lambda serverless functions had increased from 5 minutes to 15 minutes made me think in terms of a different architecture focused on using lambdas.
Note: A lambda is a self contained bit of code that you give over to AWS to manage on your behalf. Another term for Lambda is "functions as a service". You don't have to focus at all on servers, DevOps administration or the like.
Here is what I'm thinking:
- Add a network API to the filesystem of JSON files. This could literally be as simple as an NGINX server that listed the files.
- A lambda that requests a JSON file for processing per the description above and relies on a Redis dictionary to track files that have already been processed. Two dictionaries would be needed – json_files_processed and v (and, yes, there would need to be a way to expire things from json_files_processing in case a lambda crashes or is terminated; this would be a separate lambda).
- A CloudWatch Scheduler rule that triggers the lambda every 5 minutes
Fleshing this out further gives three lambda functions:
- sqs_loader - the main thing which processes files and sends their contents to SQS
- json_files_processed_archiver - pulls files that have been processed into a different directory so that they don't have to be ever checked again; alternatively this might delete them after some period of time
- json_files_processing_checker - checks to see if files have been in the json_files_processing queue for too long
There would likely need to be 3 CloudWatch Scheduler rules one for each lambda.
The sqs_loader would need the ability to self terminate / exit if all files are currently being processed.