Without using any costly database, this solution complements Amazon Kinesis with the following capabilities:
- Long-term archival of records.
- Making both current and archived records accessible to SQL-based exploration and analysis.
- Replay of archived records:
  - which supports key-value compaction so only the last record for a key is replayed.
  - which supports bounded replay (one needn’t replay the full archive).
  - which supports filtered replay (only replay records matching some criteria).
  - which supports annotating records as they are replayed in order to alter consumer behavior, such as to force overwrite.
  - which, with consumer cooperation, provides some definition of eventual consistency with respect to records that arrive on a stream concurrently with a replay operation, without requiring this solution to mediate the flow of the stream.
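
To make the idea of key-value compaction concrete, here is a minimal sketch. It assumes a hypothetical record layout of one record per line, `<key><TAB><payload>`, in arrival order; this is an illustration only and not the project's archive format or replay implementation. The function name `compact_last_per_key` is likewise made up for this example.

```bash
# Sketch only: keep the last record seen for each key, preserving the
# relative order in which those surviving records arrived.
# Hypothetical input layout: "<key><TAB><payload>", one record per line.
compact_last_per_key() {
  awk -F '\t' '
    # Remember every record, its key, and the line number of the latest
    # record for each key.
    { last[$1] = NR; rec[NR] = $0; key[NR] = $1 }
    END {
      # Emit a record only if it is the latest one for its key.
      for (i = 1; i <= NR; i++)
        if (last[key[i]] == i) print rec[i]
    }'
}
```

For example, `printf 'a\t1\nb\t2\na\t3\n' | compact_last_per_key` would replay only `b 2` and `a 3`. The sketch buffers all input in memory, so it only suits small batches; a real replay over a large archive would need a different strategy.
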
- In active development for use at CommerceHub.
- Capable of using SQL to search a stream archive and a live stream.
- Stream archival capability to come next.
- Message replay capability to follow.
- This is mostly an integration project, light on actual software. The AWS CLI will be used, and is assumed to be installed and configured.
- This will probably be more of an ephemeral tool than a service, but the archival portion will have to run at least once every 24 hours (the default Kinesis record retention period) in order to not miss any records.
- The initial implementation might only support JSON records, but further contributions should be able to remove that as a requirement.
- The initial implementation might only support a single Kinesis stream, but further contributions should be able to remove that as a requirement.
- Data and cluster security are currently left to the user.
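
Because the archival schedule depends on how long the stream keeps its records, it can help to check the retention period before choosing an interval. The following is a rough sketch using the AWS CLI; the function names are hypothetical and not part of this project, and the stream name is whatever you pass in.

```bash
# Sketch only: print the retention period (in hours) of the named stream.
stream_retention_hours() {
  aws kinesis describe-stream-summary \
    --stream-name "$1" \
    --query 'StreamDescriptionSummary.RetentionPeriodHours' \
    --output text
}

# Succeed only if archiving every $2 hours leaves a margin before records
# in stream $1 expire (interval strictly shorter than the retention period).
archive_interval_is_safe() {
  local retention
  retention="$(stream_retention_hours "$1")" || return 2
  [ "$2" -lt "$retention" ]
}
```

A call such as `archive_interval_is_safe my-stream 12` (with `my-stream` as a placeholder name) would succeed for a stream with the default 24-hour retention.
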
- Configure and launch a process (TBD, there are many options) to archive blocks of Amazon Kinesis records to Amazon S3 before they expire, possibly via Amazon EMRFS.
- Launch an Amazon EMR cluster including the Hive application.
- Deploy Apache Drill to the cluster.
- Configure Apache Drill to read archived records from Amazon S3, possibly via EMRFS.
- Configure Amazon EMR Hive to expose an Amazon Kinesis stream as an externally-stored table.
- Configure the Amazon EMR Hive Metastore for consumption by Apache Drill.
- Configure Apache Drill to read from Amazon Kinesis via Amazon EMR Hive.
- To the greatest extent possible without storing another copy of the data, provide a unified and de-duplicated view spanning current and archived Amazon Kinesis records.
- (TBD) Provide a basic UI or API to initiate search and replay operations, and monitor progress.
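
As an illustration of the "launch an Amazon EMR cluster including the Hive application" step, the sketch below shows roughly what such a launch looks like with the AWS CLI. It is not the project's launch-cluster script, which may use different options. The function name, the instance type, and the instance count are placeholders chosen for the example; the cluster name, release label, and EC2 key pair name are supplied by the caller.

```bash
# Sketch only: start an EMR cluster with Hive and print its cluster-id.
# Arguments: <cluster name> <EMR release label> <EC2 key pair name>
launch_hive_cluster() {
  local name="$1" release_label="$2" key_name="$3"
  aws emr create-cluster \
    --name "$name" \
    --release-label "$release_label" \
    --applications Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes "KeyName=$key_name" \
    --query 'ClusterId' \
    --output text
}
```

Deploying Apache Drill and configuring the Hive Metastore are separate steps that this sketch does not cover.
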
- Bash shell installed at /bin/bash
- AWS CLI installed and configured with your credentials and default region (you can run aws configure to do so interactively)
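
The prerequisites above can be checked before running anything else. This is a small sketch, not a script shipped with the project; the function name `check_prerequisites` is hypothetical, and the credentials check makes a network call to AWS.

```bash
# Sketch only: report missing prerequisites on stderr; return non-zero if
# anything is missing.
check_prerequisites() {
  local ok=0
  [ -x /bin/bash ] || { echo "missing: /bin/bash" >&2; ok=1; }
  if ! command -v aws >/dev/null 2>&1; then
    echo "missing: AWS CLI" >&2
    ok=1
  else
    # A default region must be configured.
    [ -n "$(aws configure get region)" ] ||
      { echo "no default region; run: aws configure" >&2; ok=1; }
    # Credentials must be usable (this calls AWS STS over the network).
    aws sts get-caller-identity >/dev/null 2>&1 ||
      { echo "credentials not usable; run: aws configure" >&2; ok=1; }
  fi
  return "$ok"
}
```
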
- Create a config file. Either:
  - Make a copy of conf/defaults.conf and edit the copy, or
  - Create a new file that will contain only overrides, and import the defaults by following the directions at the top of conf/defaults.conf
- Run ./upload-resources <config file>
- Run ./launch-cluster <config file> and note the cluster-id that is printed to stdout; future commands will require it.
- Run ./wait-until-ready <cluster-id>
- Run ./forward-local-ports <cluster-id> <private-key-file>
  - As with any new SSH host, you will have to accept an authenticity warning the first time you connect to a cluster.
  - Once it's forwarding, this process will not exit, nor print any output.
- Run ./terminate-clusters <cluster-id> when done to avoid recurring charges.
- For additional advanced operations, explore the emr subcommand of the AWS CLI.
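
The steps above can be strung together into a single session. The sketch below is one possible wrapper, not part of the project. It assumes it is run from the repository root and that launch-cluster prints only the cluster-id to stdout, which may not hold exactly; the function name `run_session` is hypothetical.

```bash
# Sketch only: upload, launch, wait, forward ports, and always terminate.
# Arguments: <config file> <private-key-file>
# The body runs in a subshell so its traps do not affect the calling shell.
run_session() (
  config="$1"
  key_file="$2"
  ./upload-resources "$config" || exit 1
  # Assumption: stdout of launch-cluster is exactly the cluster-id.
  cluster_id="$(./launch-cluster "$config")" || exit 1
  # From here on, terminate the cluster however the session ends,
  # to avoid recurring charges.
  trap './terminate-clusters "$cluster_id"' EXIT
  trap 'exit 130' INT TERM
  ./wait-until-ready "$cluster_id" || exit 1
  # Blocks silently until the forwarding process ends (for example, Ctrl-C).
  ./forward-local-ports "$cluster_id" "$key_file"
  echo "forwarding ended; terminating $cluster_id" >&2
)
```

Usage would look like `run_session my.conf ~/.ssh/my-key.pem`, where both file names are placeholders.
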