Archival, search, and replay of Amazon Kinesis records

aheiss1/flux-capacitor

flux-capacitor

Without using any costly database, this solution complements Amazon Kinesis with the following capabilities:

  • Long-term archival of records.
  • Making both current and archived records accessible to SQL-based exploration and analysis.
  • Replay of archived records, with support for:
    • key-value compaction, so that only the last record for each key is replayed.
    • bounded replay (one needn’t replay the full archive).
    • filtered replay (only records matching some criteria are replayed).
    • annotating records as they are replayed in order to alter consumer behavior, such as forcing an overwrite.
    • a form of eventual consistency (given consumer cooperation) with respect to records that arrive on a stream concurrently with a replay operation, without requiring this solution to mediate the flow of the stream.

Project Status

  • In active development for use at CommerceHub.
  • Capable of using SQL to search a stream archive and a live stream.
  • Stream archival capability to come next.
  • Message replay capability to follow.

Assumptions and Applicability Constraints

  • This is mostly an integration project, light on actual software. The AWS CLI will be used, and is assumed to be installed and configured.
  • This will probably be more of an ephemeral tool than a service, but the archival portion will have to run at least once every 24 hours (the default Kinesis record retention period) so that no records expire before they are archived.
  • The initial implementation might only support JSON records, but further contributions should be able to remove that as a requirement.
  • The initial implementation might only support a single Kinesis stream, but further contributions should be able to remove that as a requirement.
  • Data and cluster security is currently left to the user.
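
The 24-hour constraint on archival runs can be expressed as a small check. This is only a sketch: the function name archive_is_overdue and its arguments are hypothetical and not part of this project, and the default of 24 hours assumes the stream uses the default Kinesis retention period.

```bash
#!/bin/bash
# Hypothetical helper, not part of flux-capacitor.
# Succeeds (exit status 0) when the time since the last archival run has
# reached the retention window, meaning records may already have expired.
#   $1: epoch seconds of the last successful archival run
#   $2: current epoch seconds (for example, from `date +%s`)
#   $3: retention period in hours (ASSUMPTION: defaults to 24)
archive_is_overdue() {
  local last_run_epoch="$1"
  local now_epoch="$2"
  local retention_hours="${3:-24}"
  local age_seconds=$(( now_epoch - last_run_epoch ))
  [ "$age_seconds" -ge "$(( retention_hours * 3600 ))" ]
}
```

A scheduler wrapper could call archive_is_overdue "$last_run" "$(date +%s)" and raise an alert when it succeeds. In practice the archival job should run well inside the window rather than right at its edge.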

Technical Goals

  • Configure and launch a process (TBD, there are many options) to archive blocks of Amazon Kinesis records to Amazon S3 before they expire, possibly via Amazon EMRFS.
  • Launch an Amazon EMR cluster including the Hive application.
  • Deploy Apache Drill to the cluster.
  • Configure Apache Drill to read archived records from Amazon S3, possibly via EMRFS.
  • Configure Amazon EMR Hive to expose an Amazon Kinesis stream as an externally-stored table.
  • Configure the Amazon EMR Hive Metastore for consumption by Apache Drill.
  • Configure Apache Drill to read from Amazon Kinesis via Amazon EMR Hive.
  • To the greatest extent possible without storing another copy of the data, provide a unified and de-duplicated view spanning current and archived Amazon Kinesis records.
  • (TBD) Provide a basic UI or API to initiate search and replay operations, and monitor progress.

Prerequisites

  • Bash shell installed at /bin/bash
  • AWS CLI installed and configured with your credentials and default region (you can run aws configure to do so interactively)
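
The prerequisites above can be verified before running anything else. The following is a minimal sketch; the function names require_commands and bash_at_expected_path are hypothetical helpers, not scripts shipped with this project. It only checks that the tools are present, not that the AWS CLI has valid credentials.

```bash
#!/bin/bash
# Hypothetical preflight helpers, not part of flux-capacitor.

# Prints a message for each named command that is not on PATH.
# Returns 0 if all are present, 1 if any is missing.
require_commands() {
  local missing=0
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing prerequisite: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Succeeds only if an executable Bash exists at the path listed in the prerequisites.
bash_at_expected_path() {
  [ -x /bin/bash ]
}
```

For example, running bash_at_expected_path && require_commands aws succeeds only when both prerequisites are satisfied.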

Getting Started

  • Create a config file. Either:
    • Make a copy of conf/defaults.conf and edit the copy, or
    • Create a new file that will contain only overrides, and import the defaults by following the directions at the top of conf/defaults.conf
  • Run ./upload-resources <config file>
  • Run ./launch-cluster <config file> and note the cluster-id that is printed to stdout; future commands will require it.
  • Run ./wait-until-ready <cluster-id>
  • Run ./forward-local-ports <cluster-id> <private-key-file>
    • As with any new SSH host, you will have to accept an authenticity warning the first time you connect to a cluster.
    • Once it's forwarding, this process will not exit, nor print any output.
  • Run ./terminate-clusters <cluster-id> when done to avoid recurring charges.
  • For additional advanced operations, explore the emr subcommand of the AWS CLI.
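
The first three steps above can be chained so that the cluster-id does not have to be copied by hand. This is a sketch under stated assumptions: the function start_cluster and the variable FLUX_DIR are hypothetical, and it assumes that the cluster-id is the last line ./launch-cluster writes to stdout, which the scripts themselves do not guarantee here.

```bash
#!/bin/bash
# Hypothetical wrapper, not part of flux-capacitor.
# FLUX_DIR: directory containing the project's scripts (defaults to the current directory).
FLUX_DIR="${FLUX_DIR:-.}"

# Uploads resources, launches a cluster, waits for it to be ready,
# and prints the cluster-id on stdout. Output of the other steps goes to stderr.
#   $1: path to the config file
start_cluster() {
  local config_file="$1"
  local output
  local cluster_id

  "$FLUX_DIR/upload-resources" "$config_file" >&2 || return 1

  # ASSUMPTION: the cluster-id is the last line printed to stdout by launch-cluster.
  output="$("$FLUX_DIR/launch-cluster" "$config_file")" || return 1
  cluster_id="$(printf '%s\n' "$output" | tail -n 1)"
  if [ -z "$cluster_id" ]; then
    echo "launch-cluster printed no cluster-id" >&2
    return 1
  fi

  "$FLUX_DIR/wait-until-ready" "$cluster_id" >&2 || return 1
  printf '%s\n' "$cluster_id"
}
```

With a config file named my.conf (a placeholder), cluster_id="$(start_cluster my.conf)" captures the id for the later ./forward-local-ports and ./terminate-clusters steps. Port forwarding is left out of the wrapper because that process does not exit on its own.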
