diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index b5fb80d3aa..0ec6c5e009 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1 +1 @@ -* @hdhalter @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @scrawfor99 +* @hdhalter @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @scrawfor99 @epugh diff --git a/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt b/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt index 0a14e17e7d..091f2d2534 100644 --- a/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt +++ b/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt @@ -26,6 +26,7 @@ Boolean Dev [Dd]iscoverability Distro +[Dd]ownvote(s|d)? [Dd]uplicative [Ee]gress [Ee]num @@ -122,6 +123,7 @@ stdout [Ss]ubvector [Ss]ubwords? [Ss]uperset +[Ss]yslog tebibyte [Tt]emplated [Tt]okenization @@ -138,6 +140,7 @@ tebibyte [Uu]nregister(s|ed|ing)? [Uu]pdatable [Uu]psert +[Uu]pvote(s|d)? [Ww]alkthrough [Ww]ebpage xy \ No newline at end of file diff --git a/.github/workflows/vale.yml b/.github/workflows/vale.yml index 2eee5d82fb..515d974133 100644 --- a/.github/workflows/vale.yml +++ b/.github/workflows/vale.yml @@ -20,4 +20,5 @@ jobs: reporter: github-pr-check filter_mode: added vale_flags: "--no-exit" - version: 2.28.0 \ No newline at end of file + version: 2.28.0 + continue-on-error: true diff --git a/MAINTAINERS.md b/MAINTAINERS.md index 921e46ab09..1bf2a1d219 100644 --- a/MAINTAINERS.md +++ b/MAINTAINERS.md @@ -1,6 +1,6 @@ ## Overview -This document contains a list of maintainers in this repo. See [opensearch-project/.github/RESPONSIBILITIES.md](https://github.com/opensearch-project/.github/blob/main/RESPONSIBILITIES.md#maintainer-responsibilities) that explains what the role of maintainer means, what maintainers do in this and other repos, and how they should be doing it. If you're interested in contributing, and becoming a maintainer, see [CONTRIBUTING](CONTRIBUTING.md). +This document lists the maintainers in this repo. See [opensearch-project/.github/RESPONSIBILITIES.md](https://github.com/opensearch-project/.github/blob/main/RESPONSIBILITIES.md#maintainer-responsibilities) for information about the role of a maintainer, what maintainers do in this and other repos, and how they should be doing it. If you're interested in contributing or becoming a maintainer, see [CONTRIBUTING](CONTRIBUTING.md). ## Current Maintainers @@ -9,8 +9,9 @@ This document contains a list of maintainers in this repo. See [opensearch-proje | Heather Halter | [hdhalter](https://github.com/hdhalter) | Amazon | | Fanit Kolchina | [kolchfa-aws](https://github.com/kolchfa-aws) | Amazon | | Nate Archer | [Naarcha-AWS](https://github.com/Naarcha-AWS) | Amazon | -| Nate Bower | [natebower](https://github.com/natebower) | Amazon | +| Nathan Bower | [natebower](https://github.com/natebower) | Amazon | | Melissa Vagi | [vagimeli](https://github.com/vagimeli) | Amazon | | Miki Barahmand | [AMoo-Miki](https://github.com/AMoo-Miki) | Amazon | | David Venable | [dlvenable](https://github.com/dlvenable) | Amazon | | Stephen Crawford | [scraw99](https://github.com/scrawfor99) | Amazon | +| Eric Pugh | [epugh](https://github.com/epugh) | OpenSource Connections | diff --git a/TERMS.md b/TERMS.md index 8fc1ba0162..e12cc171ed 100644 --- a/TERMS.md +++ b/TERMS.md @@ -236,6 +236,8 @@ Do not use *disable* to refer to users. Always hyphenated. Don’t use _double click_. 
+**downvote** + **dropdown list** **due to** @@ -586,6 +588,10 @@ Use % in headlines, quotations, and tables or in technical copy. An agent and REST API that allows you to query numerous performance metrics for your cluster, including aggregations of those metrics, independent of the Java Virtual Machine (JVM). +**plaintext, plain text** + +Use *plaintext* only to refer to nonencrypted or decrypted text in content about encryption. Use *plain text* to refer to ASCII files. + **please** Avoid using except in quoted text. @@ -700,6 +706,8 @@ Never hyphenated. Use _startup_ as a noun (for example, “The following startup **Stochastic Gradient Descent (SGD)** +**syslog** + ## T **term frequency–inverse document frequency (TF–IDF)** @@ -746,6 +754,8 @@ A storage tier that you can use to store and analyze your data with Elasticsearc Hyphenate as adjectives. Use instead of *top left* and *top right*, unless the field name uses *top*. For example, "The upper-right corner." +**upvote** + **US** No periods, as specified in the Chicago Manual of Style. diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index ba09a7fa30..e6d9875736 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -13,52 +13,53 @@ Token filters receive the stream of tokens from the tokenizer and add, remove, o The following table lists all token filters that OpenSearch supports. Token filter | Underlying Lucene token filter| Description -`apostrophe` | [ApostropheFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token that contains an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following the apostrophe. -`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters. -`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. -`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. -`classic` | [ClassicFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/standard/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. -`common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. -`conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script. -`decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9). -`delimited_payload` | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters before the delimiter, and a payload consists of all characters after the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload. +`apostrophe` | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token that contains an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following the apostrophe. +`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters. +`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. +`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. +`classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. +`common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. +`conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script. +`decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9). +`delimited_payload` | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters before the delimiter, and a payload consists of all characters after the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload. [`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | [DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency. -`dictionary_decompounder` | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages. -`edge_ngram` | [EdgeNGramTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. -`elision` | [ElisionFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). -`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. 
-`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. -`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. +`dictionary_decompounder` | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages. +`edge_ngram` | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. +`elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). +`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. +`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. +`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. -`keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. -`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. 
-`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed. -`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword. -`kstem` | [KStemFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary. -`length` | [LengthFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens whose lengths are shorter or longer than the length range specified by `min` and `max`. -`limit` | [LimitTokenCountFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. A common use case is to limit the size of document field values based on token count. -`lowercase` | [LowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)). -`min_hash` | [MinHashFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. Performs the following operations on a token stream sequentially:
1. Hashes each token in the stream.
2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket.
3. Outputs the smallest hash from each bucket as a token stream. +`keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. +`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. +`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed. +`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword. +`kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary. +`kuromoji_completion` | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a `mode` parameter, which should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins). +`length` | [LengthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens whose lengths are shorter or longer than the length range specified by `min` and `max`. +`limit` | [LimitTokenCountFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. A common use case is to limit the size of document field values based on token count. +`lowercase` | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)). +`min_hash` | [MinHashFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. 
Performs the following operations on a token stream sequentially:
1. Hashes each token in the stream.
2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket.
3. Outputs the smallest hash from each bucket as a token stream. `multiplexer` | N/A | Emits multiple tokens at the same position. Runs each token through each of the specified filter lists separately and outputs the results as separate tokens. -`ngram` | [NGramTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) | Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`. -Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizer.html)
`german_normalization`: [GermanNormalizationFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html)
`hindi_normalization`: [HindiNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizer.html)
`indic_normalization`: [IndicNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizer.html)
`sorani_normalization`: [SoraniNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizer.html)
`persian_normalization`: [PersianNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizer.html)
`scandinavian_normalization` : [ScandinavianNormalizationFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html)
`scandinavian_folding`: [ScandinavianFoldingFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html)
`serbian_normalization`: [SerbianNormalizationFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/sr/SerbianNormalizationFilter.html) | Normalizes the characters of one of the listed languages. +`ngram` | [NGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) | Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`. +Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ar/ArabicNormalizer.html)
`german_normalization`: [GermanNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html)
`hindi_normalization`: [HindiNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hi/HindiNormalizer.html)
`indic_normalization`: [IndicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/in/IndicNormalizer.html)
`sorani_normalization`: [SoraniNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ckb/SoraniNormalizer.html)
`persian_normalization`: [PersianNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/fa/PersianNormalizer.html)
`scandinavian_normalization`: [ScandinavianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html)
`scandinavian_folding`: [ScandinavianFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html)
`serbian_normalization`: [SerbianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/sr/SerbianNormalizationFilter.html) | Normalizes the characters of one of the listed languages. `pattern_capture` | N/A | Generates a token for every capture group in the provided regular expression. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). `pattern_replace` | N/A | Matches a pattern in the provided regular expression and replaces matching substrings. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). `phonetic` | N/A | Uses a phonetic encoder to emit a metaphone token for each token in the token stream. Requires installing the `analysis-phonetic` plugin. -`porter_stem` | [PorterStemFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language. +`porter_stem` | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language. `predicate_token_filter` | N/A | Removes tokens that don’t match the specified predicate script. Supports inline Painless scripts only. -`remove_duplicates` | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position. -`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`. -`shingle` | [ShingleFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`]. +`remove_duplicates` | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position. +`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`. +`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`]. `snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). 
You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`. `stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`. `stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed. `stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream. `synonym` | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file. `synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process. -`trim` | [TrimFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space from each token in a stream. -`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit. +`trim` | [TrimFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space from each token in a stream. +`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit. `unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream. -`uppercase` | [UpperCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to uppercase. -`word_delimiter` | [WordDelimiterFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. -`word_delimiter_graph` | [WordDelimiterGraphFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. Assigns multi-position tokens a `positionLength` attribute. 
+`uppercase` | [UpperCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to uppercase. +`word_delimiter` | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. +`word_delimiter_graph` | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. Assigns multi-position tokens a `positionLength` attribute. diff --git a/_api-reference/document-apis/reindex.md b/_api-reference/document-apis/reindex.md index 766f5b2872..4a0346ede3 100644 --- a/_api-reference/document-apis/reindex.md +++ b/_api-reference/document-apis/reindex.md @@ -73,10 +73,11 @@ slice | Whether to manually or automatically slice the reindex operation so it e _source | Whether to reindex source fields. Specify a list of fields to reindex or true to reindex all fields. Default is true. id | The ID to associate with manual slicing. max | Maximum number of slices. -dest | Information about the destination index. Valid values are `index`, `version_type`, and `op_type`. +dest | Information about the destination index. Valid values are `index`, `version_type`, `op_type`, and `pipeline`. index | Name of the destination index. version_type | The indexing operation's version type. Valid values are `internal`, `external`, `external_gt` (retrieve the document if the specified version number is greater than the document’s current version), and `external_gte` (retrieve the document if the specified version number is greater or equal to than the document’s current version). op_type | Whether to copy over documents that are missing in the destination index. Valid values are `create` (ignore documents with the same ID from the source index) and `index` (copy everything from the source index). +pipeline | Which ingest pipeline to utilize during the reindex. script | A script that OpenSearch uses to apply transformations to the data during the reindex operation. source | The actual script that OpenSearch runs. lang | The scripting language. Valid options are `painless`, `expression`, `mustache`, and `java`. diff --git a/_api-reference/index-apis/force-merge.md b/_api-reference/index-apis/force-merge.md index 6ad2e7f23c..6c2a61bef3 100644 --- a/_api-reference/index-apis/force-merge.md +++ b/_api-reference/index-apis/force-merge.md @@ -72,6 +72,7 @@ The following table lists the available query parameters. All query parameters a | `ignore_unavailable` | Boolean | If `true`, OpenSearch ignores missing or closed indexes. If `false`, OpenSearch returns an error if the force merge operation encounters missing or closed indexes. Default is `false`. | | `max_num_segments` | Integer | The number of larger segments into which smaller segments are merged. Set this parameter to `1` to merge all segments into one segment. The default behavior is to perform the merge as necessary. | | `only_expunge_deletes` | Boolean | If `true`, the merge operation only expunges segments containing a certain percentage of deleted documents. The percentage is 10% by default and is configurable in the `index.merge.policy.expunge_deletes_allowed` setting. 
Prior to OpenSearch 2.12, `only_expunge_deletes` ignored the `index.merge.policy.max_merged_segment` setting. Starting with OpenSearch 2.12, using `only_expunge_deletes` does not produce segments larger than `index.merge.policy.max_merged_segment` (by default, 5 GB). For more information, see [Deleted documents](#deleted-documents). Default is `false`. | +| `primary_only` | Boolean | If set to `true`, then the merge operation is performed only on the primary shards of an index. This can be useful when you want to take a snapshot of the index after the merge is complete. Snapshots only copy segments from the primary shards. Merging the primary shards can reduce resource consumption. Default is `false`. | #### Example request: Force merge a specific index @@ -101,6 +102,13 @@ POST /.testindex-logs/_forcemerge?max_num_segments=1 ``` {% include copy-curl.html %} +#### Example request: Force merge primary shards + +```json +POST /.testindex-logs/_forcemerge?primary_only=true +``` +{% include copy-curl.html %} + #### Example response ```json diff --git a/_api-reference/nodes-apis/nodes-stats.md b/_api-reference/nodes-apis/nodes-stats.md index 4fdb5c3cb8..87365fa900 100644 --- a/_api-reference/nodes-apis/nodes-stats.md +++ b/_api-reference/nodes-apis/nodes-stats.md @@ -731,7 +731,10 @@ Select the arrow to view the example response. "nxLWtMdXQmWA-ZBVWU8nwA": { "timestamp": 1698401391000, "cpu_utilization_percent": "0.1", - "memory_utilization_percent": "3.9" + "memory_utilization_percent": "3.9", + "io_usage_stats": { + "max_io_utilization_percent": "99.6" + } } }, "admission_control": { @@ -742,6 +745,14 @@ Select the arrow to view the example response. "indexing": 1 } } + }, + "global_io_usage": { + "transport": { + "rejection_count": { + "search": 3, + "indexing": 1 + } + } } } } @@ -1252,16 +1263,20 @@ The `resource_usage_stats` object contains the resource usage statistics. Each e Field | Field type | Description :--- |:-----------| :--- timestamp | Integer | The last refresh time for the resource usage statistics, in milliseconds since the epoch. -cpu_utilization_percent | Float | Statistics for the average CPU usage of OpenSearch process within the time period configured in the `node.resource.tracker.global_cpu_usage.window_duration` setting. +cpu_utilization_percent | Float | Statistics for the average CPU usage of any OpenSearch processes within the time period configured in the `node.resource.tracker.global_cpu_usage.window_duration` setting. memory_utilization_percent | Float | The node JVM memory usage statistics within the time period configured in the `node.resource.tracker.global_jvmmp.window_duration` setting. +max_io_utilization_percent | Float | (Linux only) Statistics for the average IO usage of any OpenSearch processes within the time period configured in the `node.resource.tracker.global_io_usage.window_duration` setting. ### `admission_control` The `admission_control` object contains the rejection count of search and indexing requests based on resource consumption and has the following properties. + Field | Field type | Description :--- | :--- | :--- -admission_control.global_cpu_usage.transport.rejection_count.search | Integer | The total number of search rejections in the transport layer when the node CPU usage limit was breached. In this case, additional search requests are rejected until the system recovers. 
-admission_control.global_cpu_usage.transport.rejection_count.indexing | Integer | The total number of indexing rejections in the transport layer when the node CPU usage limit was breached. In this case, additional indexing requests are rejected until the system recovers. +admission_control.global_cpu_usage.transport.rejection_count.search | Integer | The total number of search rejections in the transport layer when the node CPU usage limit was met. In this case, additional search requests are rejected until the system recovers. The CPU usage limit is configured in the `admission_control.search.cpu_usage.limit` setting. +admission_control.global_cpu_usage.transport.rejection_count.indexing | Integer | The total number of indexing rejections in the transport layer when the node CPU usage limit was met. Any additional indexing requests are rejected until the system recovers. The CPU usage limit is configured in the `admission_control.indexing.cpu_usage.limit` setting. +admission_control.global_io_usage.transport.rejection_count.search | Integer | The total number of search rejections in the transport layer when the node IO usage limit was met. Any additional search requests are rejected until the system recovers. The CPU usage limit is configured in the `admission_control.search.io_usage.limit` setting (Linux only). +admission_control.global_io_usage.transport.rejection_count.indexing | Integer | The total number of indexing rejections in the transport layer when the node IO usage limit was met. Any additional indexing requests are rejected until the system recovers. The IO usage limit is configured in the `admission_control.indexing.io_usage.limit` setting (Linux only). ## Required permissions diff --git a/_api-reference/snapshots/get-snapshot-status.md b/_api-reference/snapshots/get-snapshot-status.md index 02aa419042..6f8320d0b0 100644 --- a/_api-reference/snapshots/get-snapshot-status.md +++ b/_api-reference/snapshots/get-snapshot-status.md @@ -29,9 +29,9 @@ Three request variants provide flexibility: * `GET _snapshot/_status` returns the status of all currently running snapshots in all repositories. -* `GET _snapshot//_status` returns the status of only currently running snapshots in the specified repository. This is the preferred variant. +* `GET _snapshot//_status` returns all currently running snapshots in the specified repository. This is the preferred variant. -* `GET _snapshot///_status` returns the status of all snapshots in the specified repository whether they are running or not. +* `GET _snapshot///_status` returns detailed status information for a specific snapshot in the specified repository, regardless of whether it's currently running or not. Using the API to return state for other than currently running snapshots can be very costly for (1) machine machine resources and (2) processing time if running in the cloud. For each snapshot, each request causes file reads from all a snapshot's shards. {: .warning} @@ -420,4 +420,4 @@ All property values are Integers. :--- | :--- | :--- | | shards_stats | Object | See [Shard stats](#shard-stats). | | stats | Object | See [Snapshot file stats](#snapshot-file-stats). | -| shards | list of Objects | List of objects containing information about the shards that include the snapshot. Properies of the shards are listed below in bold text.

**stage**: Current state of shards in the snapshot. Shard states are:

* DONE: Number of shards in the snapshot that were successfully stored in the repository.

* FAILURE: Number of shards in the snapshot that were not successfully stored in the repository.

* FINALIZE: Number of shards in the snapshot that are in the finalizing stage of being stored in the repository.

* INIT: Number of shards in the snapshot that are in the initializing stage of being stored in the repository.

* STARTED: Number of shards in the snapshot that are in the started stage of being stored in the repository.

**stats**: See [Snapshot file stats](#snapshot-file-stats).

**total**: Total number and size of files referenced by the snapshot.

**start_time_in_millis**: Time (in milliseconds) when snapshot creation began.

**time_in_millis**: Total time (in milliseconds) that the snapshot took to complete. | \ No newline at end of file +| shards | list of Objects | List of objects containing information about the shards that include the snapshot. OpenSearch returns the following properties about the shards.

**stage**: Current state of shards in the snapshot. Shard states are:

* DONE: Number of shards in the snapshot that were successfully stored in the repository.

* FAILURE: Number of shards in the snapshot that were not successfully stored in the repository.

* FINALIZE: Number of shards in the snapshot that are in the finalizing stage of being stored in the repository.

* INIT: Number of shards in the snapshot that are in the initializing stage of being stored in the repository.

* STARTED: Number of shards in the snapshot that are in the started stage of being stored in the repository.

**stats**: See [Snapshot file stats](#snapshot-file-stats).

**total**: Total number and size of files referenced by the snapshot.

**start_time_in_millis**: Time (in milliseconds) when snapshot creation began.

**time_in_millis**: Total time (in milliseconds) that the snapshot took to complete. | diff --git a/_automating-configurations/api/create-workflow.md b/_automating-configurations/api/create-workflow.md index 9353054113..e99a421fb9 100644 --- a/_automating-configurations/api/create-workflow.md +++ b/_automating-configurations/api/create-workflow.md @@ -7,9 +7,6 @@ nav_order: 10 # Create or update a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - Creating a workflow adds the content of a workflow template to the flow framework system index. You can provide workflows in JSON format (by specifying `Content-Type: application/json`) or YAML format (by specifying `Content-Type: application/yaml`). By default, the workflow is validated to help identify invalid configurations, including: * Workflow steps requiring an OpenSearch plugin that is not installed. @@ -19,6 +16,8 @@ Creating a workflow adds the content of a workflow template to the flow framewor To obtain the validation template for workflow steps, call the [Get Workflow Steps API]({{site.url}}{{site.baseurl}}/automating-configurations/api/get-workflow-steps/). +You can include placeholder expressions in the value of workflow step fields. For example, you can specify a credential field in a template as `openAI_key: '${{ openai_key }}'`. The expression will be substituted with the user-provided value during provisioning, using the format `${{ }}`. You can pass the actual key as a parameter using the [Provision Workflow API]({{site.url}}{{site.baseurl}}/automating-configurations/api/provision-workflow/) or using this API with the `provision` parameter set to `true`. + Once a workflow is created, provide its `workflow_id` to other APIs. The `POST` method creates a new workflow. The `PUT` method updates an existing workflow. @@ -59,12 +58,13 @@ POST /_plugins/_flow_framework/workflow?validation=none ``` {% include copy-curl.html %} -The following table lists the available query parameters. All query parameters are optional. +The following table lists the available query parameters. All query parameters are optional. User-provided parameters are only allowed if the `provision` parameter is set to `true`. | Parameter | Data type | Description | | :--- | :--- | :--- | | `provision` | Boolean | Whether to provision the workflow as part of the request. Default is `false`. | | `validation` | String | Whether to validate the workflow. Valid values are `all` (validate the template) and `none` (do not validate the template). Default is `all`. | +| User-provided substitution expressions | String | Parameters matching substitution expressions in the template. Only allowed if `provision` is set to `true`. Optional. If `provision` is set to `false`, you can pass these parameters in the [Provision Workflow API query parameters]({{site.url}}{{site.baseurl}}/automating-configurations/api/provision-workflow/#query-parameters). 
| ## Request fields diff --git a/_automating-configurations/api/delete-workflow.md b/_automating-configurations/api/delete-workflow.md index c1cee296f8..db3a340cee 100644 --- a/_automating-configurations/api/delete-workflow.md +++ b/_automating-configurations/api/delete-workflow.md @@ -7,9 +7,6 @@ nav_order: 80 # Delete a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - When you no longer need a workflow template, you can delete it by calling the Delete Workflow API. Note that deleting a workflow only deletes the stored template but does not deprovision its resources. diff --git a/_automating-configurations/api/deprovision-workflow.md b/_automating-configurations/api/deprovision-workflow.md index cdd85ef4e9..e9219536ce 100644 --- a/_automating-configurations/api/deprovision-workflow.md +++ b/_automating-configurations/api/deprovision-workflow.md @@ -7,9 +7,6 @@ nav_order: 70 # Deprovision a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - When you no longer need a workflow, you can deprovision its resources. Most workflow steps that create a resource have corresponding workflow steps to reverse that action. To retrieve all resources currently created for a workflow, call the [Get Workflow Status API]({{site.url}}{{site.baseurl}}/automating-configurations/api/get-workflow-status/). When you call the Deprovision Workflow API, resources included in the `resources_created` field of the Get Workflow Status API response will be removed using a workflow step corresponding to the one that provisioned them. The workflow executes the provisioning workflow steps in reverse order. If failures occur because of resource dependencies, such as preventing deletion of a registered model if it is still deployed, the workflow attempts retries. diff --git a/_automating-configurations/api/get-workflow-status.md b/_automating-configurations/api/get-workflow-status.md index 03870af174..280fb52195 100644 --- a/_automating-configurations/api/get-workflow-status.md +++ b/_automating-configurations/api/get-workflow-status.md @@ -7,9 +7,6 @@ nav_order: 40 # Get a workflow status -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - [Provisioning a workflow]({{site.url}}{{site.baseurl}}/automating-configurations/api/provision-workflow/) may take a significant amount of time, particularly when the action is associated with OpenSearch indexing operations. The Get Workflow State API permits monitoring of the provisioning deployment status until it is complete. 
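For illustration, you can poll the provisioning state of a workflow with a request like the following sketch. The workflow ID is a placeholder; see the path and parameter details in the section below.

```json
GET /_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50/_status
```
{% include copy-curl.html %}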
## Path and HTTP methods diff --git a/_automating-configurations/api/get-workflow-steps.md b/_automating-configurations/api/get-workflow-steps.md index b4859da776..38059ec80c 100644 --- a/_automating-configurations/api/get-workflow-steps.md +++ b/_automating-configurations/api/get-workflow-steps.md @@ -7,10 +7,7 @@ nav_order: 50 # Get workflow steps -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - -OpenSearch validates workflows by using the validation template that lists the required inputs, generated outputs, and required plugins for all steps. For example, for the `register_remote_model` step, the validation template appears as follows: +This API returns a list of workflow steps, including their required inputs, outputs, default timeout values, and required plugins. For example, for the `register_remote_model` step, the Get Workflow Steps API returns the following information: ```json { @@ -28,36 +25,52 @@ OpenSearch validates workflows by using the validation template that lists the r ] } } -``` - -The Get Workflow Steps API retrieves this file. +``` ## Path and HTTP methods ```json GET /_plugins/_flow_framework/workflow/_steps +GET /_plugins/_flow_framework/workflow/_step?workflow_step= ``` +## Query parameters + +The following table lists the available query parameters. All query parameters are optional. + +| Parameter | Data type | Description | +| :--- | :--- | :--- | +| `workflow_step` | String | The name of the step to retrieve. Specify multiple step names as a comma-separated list. For example, `create_connector,delete_model,deploy_model`. | + #### Example request +To fetch all workflow steps, use the following request: + ```json GET /_plugins/_flow_framework/workflow/_steps +``` +{% include copy-curl.html %} + +To fetch specific workflow steps, pass the step names to the request as a query parameter: + +```json +GET /_plugins/_flow_framework/workflow/_step?workflow_step=create_connector,delete_model,deploy_model ``` {% include copy-curl.html %} #### Example response -OpenSearch responds with the validation template containing the steps. The order of fields in the returned steps may not exactly match the original JSON but will function identically. +OpenSearch responds with the workflow steps. The order of fields in the returned steps may not exactly match the original JSON but will function identically. 
To retrieve the template in YAML format, specify `Content-Type: application/yaml` in the request header: ```bash -curl -XGET "http://localhost:9200/_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50" -H 'Content-Type: application/yaml' +curl -XGET "http://localhost:9200/_plugins/_flow_framework/workflow/_steps" -H 'Content-Type: application/yaml' ``` To retrieve the template in JSON format, specify `Content-Type: application/json` in the request header: ```bash -curl -XGET "http://localhost:9200/_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50" -H 'Content-Type: application/json' +curl -XGET "http://localhost:9200/_plugins/_flow_framework/workflow/_steps" -H 'Content-Type: application/json' ``` \ No newline at end of file diff --git a/_automating-configurations/api/get-workflow.md b/_automating-configurations/api/get-workflow.md index b49858ffd9..7b1d5987c4 100644 --- a/_automating-configurations/api/get-workflow.md +++ b/_automating-configurations/api/get-workflow.md @@ -7,9 +7,6 @@ nav_order: 20 # Get a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - The Get Workflow API retrieves the workflow template. ## Path and HTTP methods diff --git a/_automating-configurations/api/index.md b/_automating-configurations/api/index.md index 5fb050539b..716e19c41f 100644 --- a/_automating-configurations/api/index.md +++ b/_automating-configurations/api/index.md @@ -8,9 +8,6 @@ has_toc: false # Workflow APIs -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - OpenSearch supports the following workflow APIs: * [Create or update workflow]({{site.url}}{{site.baseurl}}/automating-configurations/api/create-workflow/) diff --git a/_automating-configurations/api/provision-workflow.md b/_automating-configurations/api/provision-workflow.md index 5d2b59364c..62c4954ee9 100644 --- a/_automating-configurations/api/provision-workflow.md +++ b/_automating-configurations/api/provision-workflow.md @@ -7,9 +7,6 @@ nav_order: 30 # Provision a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - Provisioning a workflow is a one-time setup process usually performed by a cluster administrator to create resources that will be used by end users. The `workflows` template field may contain multiple workflows. The workflow with the `provision` key can be executed with this API. This API is also executed when the [Create or Update Workflow API]({{site.url}}{{site.baseurl}}/automating-configurations/api/create-workflow/) is called with the `provision` parameter set to `true`. @@ -31,10 +28,39 @@ The following table lists the available path parameters. | :--- | :--- | :--- | | `workflow_id` | String | The ID of the workflow to be provisioned. Required. 
| -#### Example request +## Query parameters + +If you have included a substitution expression in the template, you may pass it as a query parameter or as a string value of a request body field. For example, if you specified a credential field in a template as `openAI_key: '${{ openai_key }}'`, then you can include the `openai_key` parameter as a query parameter or body field so it can be substituted during provisioning. For example, the following request provides a query parameter: + +```json +POST /_plugins/_flow_framework/workflow//_provision?= +``` + +| Parameter | Data type | Description | +| :--- | :--- | :--- | +| User-provided substitution expressions | String | Parameters matching substitution expressions in the template. Optional. | + +#### Example requests + +```json +POST /_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50/_provision +``` +{% include copy-curl.html %} + +The following request substitutes the expression `${{ openai_key }}` with the value "12345" using a query parameter: + +```json +POST /_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50/_provision?openai_key=12345 +``` +{% include copy-curl.html %} + +The following request substitutes the expression `${{ openai_key }}` with the value "12345" using the request body: ```json POST /_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50/_provision +{ + "openai_key" : "12345" +} ``` {% include copy-curl.html %} diff --git a/_automating-configurations/api/search-workflow-state.md b/_automating-configurations/api/search-workflow-state.md index 9e21f14392..1cacb3a32b 100644 --- a/_automating-configurations/api/search-workflow-state.md +++ b/_automating-configurations/api/search-workflow-state.md @@ -7,9 +7,6 @@ nav_order: 65 # Search for a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - You can search for resources created by workflows by matching a query to a field. The fields you can search correspond to those returned by the [Get Workflow Status API]({{site.url}}{{site.baseurl}}/automating-configurations/api/get-workflow-status/). ## Path and HTTP methods diff --git a/_automating-configurations/api/search-workflow.md b/_automating-configurations/api/search-workflow.md index 7eb8890f7e..b78de9e9d2 100644 --- a/_automating-configurations/api/search-workflow.md +++ b/_automating-configurations/api/search-workflow.md @@ -7,9 +7,6 @@ nav_order: 60 # Search for a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - You can retrieve created workflows with their `workflow_id` or search for workflows by using a query matching a field. You can use the `use_case` field to search for similar workflows. 
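As a sketch, a search on the `use_case` field might look like the following; the path follows the Search Workflow API described below, and the query value is illustrative only.

```json
GET /_plugins/_flow_framework/workflow/_search
{
  "query": {
    "match": {
      "use_case": "deploy_model"
    }
  }
}
```
{% include copy-curl.html %}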
## Path and HTTP methods diff --git a/_automating-configurations/index.md b/_automating-configurations/index.md index 2b9ffdcf34..ef9cb4f850 100644 --- a/_automating-configurations/index.md +++ b/_automating-configurations/index.md @@ -8,12 +8,9 @@ redirect_from: /automating-configurations/ --- # Automating configurations -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - You can automate complex OpenSearch setup and preprocessing tasks by providing templates for common use cases. For example, automating machine learning (ML) setup tasks streamlines the use of OpenSearch ML offerings. In OpenSearch 2.12, configuration automation is limited to ML tasks. diff --git a/_automating-configurations/workflow-settings.md b/_automating-configurations/workflow-settings.md index f3138d0ddc..78762fdfbb 100644 --- a/_automating-configurations/workflow-settings.md +++ b/_automating-configurations/workflow-settings.md @@ -6,9 +6,6 @@ nav_order: 30 # Workflow settings -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - The following keys represent configurable workflow settings. |Setting |Data type |Default value |Description | diff --git a/_automating-configurations/workflow-steps.md b/_automating-configurations/workflow-steps.md index 8565ccc29b..99c1f57993 100644 --- a/_automating-configurations/workflow-steps.md +++ b/_automating-configurations/workflow-steps.md @@ -6,9 +6,6 @@ nav_order: 10 # Workflow steps -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - _Workflow steps_ form basic "building blocks" for process automation. Most steps directly correspond to OpenSearch or plugin API operations, such as CRUD operations on machine learning (ML) connectors, models, and agents. Some steps simplify the configuration by reusing the body expected by these APIs across multiple steps. For example, once you configure a _tool_, you can use it with multiple _agents_. ## Workflow step fields @@ -42,6 +39,9 @@ The following table lists the workflow step types. The `user_inputs` fields for |`register_agent` |[Register Agent API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/) |Registers an agent as part of the ML Commons Agent Framework. | |`delete_agent` |[Delete Agent API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/) |Deletes an agent. | |`create_tool` |No API | A special-case non-API step encapsulating the specification of a tool for an agent in the ML Commons Agent Framework. These will be listed as `previous_node_inputs` for the appropriate register agent step, with the value set to `tools`. | +|`create_index`|[Create Index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/) | Creates a new OpenSearch index. 
The inputs include `index_name`, which should be the name of the index to be created, and `configurations`, which contains the payload body of a regular REST request for creating an index. +|`create_ingest_pipeline`|[Create Ingest Pipeline]({{site.url}}{{site.baseurl}}/ingest-pipelines/create-ingest/) | Creates or updates an ingest pipeline. The inputs include `pipeline_id`, which should be the ID of the pipeline, and `configurations`, which contains the payload body of a regular REST request for creating an ingest pipeline. +|`create_search_pipeline`|[Create Search Pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/creating-search-pipeline/) | Creates or updates a search pipeline. The inputs include `pipeline_id`, which should be the ID of the pipeline, and `configurations`, which contains the payload body of a regular REST request for creating a search pipeline. ## Additional fields diff --git a/_automating-configurations/workflow-tutorial.md b/_automating-configurations/workflow-tutorial.md index 99d84501e2..0074ad4691 100644 --- a/_automating-configurations/workflow-tutorial.md +++ b/_automating-configurations/workflow-tutorial.md @@ -6,9 +6,6 @@ nav_order: 20 # Workflow tutorial -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - You can automate the setup of common use cases, such as conversational chat, using a Chain-of-Thought (CoT) agent. An _agent_ orchestrates and runs ML models and tools. A _tool_ performs a set of specific tasks. This page presents a complete example of setting up a CoT agent. For more information about agents and tools, see [Agents and tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/) The setup requires the following sequence of API requests, with provisioned resources used in subsequent requests. The following list provides an overview of the steps required for this workflow. The step names correspond to the names in the template: diff --git a/_dashboards/csp/csp-dynamic-configuration.md b/_dashboards/csp/csp-dynamic-configuration.md new file mode 100644 index 0000000000..2101a83734 --- /dev/null +++ b/_dashboards/csp/csp-dynamic-configuration.md @@ -0,0 +1,50 @@ +--- +layout: default +title: Configuring Content Security Policy rules dynamically +nav_order: 110 +has_children: false +--- + +# Configuring Content Security Policy rules dynamically +Introduced 2.13 +{: .label .label-purple } + +Content Security Policy (CSP) is a security standard intended to prevent cross-site scripting (XSS), `clickjacking`, and other code injection attacks resulting from the execution of malicious content in the trusted webpage context. OpenSearch Dashboards supports configuring CSP rules in the `opensearch_dashboards.yml` file by using the `csp.rules` key. A change in the YAML file requires a server restart, which may interrupt service availability. You can, however, configure the CSP rules dynamically through the `applicationConfig` plugin without restarting the server. + +## Configuration + +The `applicationConfig` plugin provides read and write APIs that allow OpenSearch Dashboards users to manage dynamic configurations as key-value pairs in an index. 
The `cspHandler` plugin registers a pre-response handler to `HttpServiceSetup`, which gets CSP rules from the dependent `applicationConfig` plugin and then rewrites to the CSP header. Enable both plugins within your `opensearch_dashboards.yml` file to use this feature. The configuration is shown in the following example. Refer to the `cspHandler` plugin [README](https://github.com/opensearch-project/OpenSearch-Dashboards/blob/main/src/plugins/csp_handler/README.md) for configuration details. + +``` +application_config.enabled: true +csp_handler.enabled: true +``` + +## Enable site embedding for OpenSearch Dashboards + +To enable site embedding for OpenSearch Dashboards, update the CSP rules using CURL. When using CURL commands with single quotation marks inside the `data-raw` parameter, escape them with a backslash (`\`). For example, use `'\''` to represent `'`. The configuration is shown in the following example. Refer to the `applicationConfig` plugin [README](https://github.com/opensearch-project/OpenSearch-Dashboards/blob/main/src/plugins/application_config/README.md) for configuration details. + +``` +curl '{osd endpoint}/api/appconfig/csp.rules' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' -H 'osd-xsrf: osd-fetch' -H 'Sec-Fetch-Dest: empty' --data-raw '{"newValue":"script-src '\''unsafe-eval'\'' '\''self'\''; worker-src blob: '\''self'\''; style-src '\''unsafe-inline'\'' '\''self'\''; frame-ancestors '\''self'\'' {new site}"}' +``` + +## Delete CSP rules + +Use the following CURL command to delete CSP rules: + +``` +curl '{osd endpoint}/api/appconfig/csp.rules' -X DELETE -H 'osd-xsrf: osd-fetch' -H 'Sec-Fetch-Dest: empty' +``` + +## Get CSP rules + +Use the following CURL command to get CSP rules: + +``` +curl '{osd endpoint}/api/appconfig/csp.rules' + +``` + +## Precedence + +Dynamic configurations override YAML configurations, except for empty CSP rules. To prevent `clickjacking`, a `frame-ancestors: self` directive is automatically added to YAML-defined rules when necessary. diff --git a/_dashboards/dashboards-assistant/index.md b/_dashboards/dashboards-assistant/index.md index dd62347c31..d44e6b58e8 100644 --- a/_dashboards/dashboards-assistant/index.md +++ b/_dashboards/dashboards-assistant/index.md @@ -6,14 +6,11 @@ has_children: false has_toc: false --- -This is an experimental feature and is not recommended for use in a production environment. For updates on the feature's progress or to leave feedback, go to the [`dashboards-assistant` repository](https://github.com/opensearch-project/dashboards-assistant) on GitHub or the associated [OpenSearch forum thread](https://forum.opensearch.org/t/feedback-opensearch-assistant/16741). -{: .warning} - Note that machine learning models are probabilistic and that some may perform better than others, so the OpenSearch Assistant may occasionally produce inaccurate information. We recommend evaluating outputs for accuracy as appropriate to your use case, including reviewing the output or combining it with other verification factors. {: .important} # OpenSearch Assistant for OpenSearch Dashboards -Introduced 2.12 +**Introduced 2.13** {: .label .label-purple } The OpenSearch Assistant toolkit helps you create AI-powered assistants for OpenSearch Dashboards without requiring you to have specialized query tools or skills. @@ -49,9 +46,6 @@ A screenshot of the interface is shown in the following image. 
OpenSearch Assistant interface -For more information about ways to enable experimental features, see [Experimental feature flags]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). -{: .note} - ## Configuring OpenSearch Assistant You can use the OpenSearch Dashboards interface to configure OpenSearch Assistant. Go to the [Getting started guide](https://github.com/opensearch-project/dashboards-assistant/blob/main/GETTING_STARTED_GUIDE.md) for step-by-step instructions. For the chatbot template, go to the [Flow Framework plugin](https://github.com/opensearch-project/flow-framework) documentation. You can modify this template to use your own model and customize the chatbot tools. @@ -60,7 +54,7 @@ For information about configuring OpenSearch Assistant through the REST API, see ## Using OpenSearch Assistant in OpenSearch Dashboards -The following tutorials guide you through using OpenSearch Assistant in OpenSearch Dashboards. OpenSearch Assistant can be viewed full frame or in the right sidebar. The default is sidebar. To view full frame, select the frame icon {::nomarkdown}frame icon{:/} in the toolbar. +The following tutorials guide you through using OpenSearch Assistant in OpenSearch Dashboards. OpenSearch Assistant can be viewed in full frame or in the sidebar. The default view is in the right sidebar. To view the assistant in the left sidebar or in full frame, select the {::nomarkdown}frame icon{:/} icon in the toolbar and choose the preferred option. ### Start a conversation diff --git a/_dashboards/management/index-patterns.md b/_dashboards/management/index-patterns.md index 590a9675a2..37baa210e9 100644 --- a/_dashboards/management/index-patterns.md +++ b/_dashboards/management/index-patterns.md @@ -56,7 +56,7 @@ An example of step 1 is shown in the following image. Note that the index patter Once the index pattern has been created, you can view the mapping of the matching indexes. Within the table, you can see the list of fields, along with their data type and properties. An example is shown in the following image. -Index pattern table UI +Index pattern table UI ## Next steps diff --git a/_dashboards/management/multi-data-sources.md b/_dashboards/management/multi-data-sources.md index 0447348648..dd66101f80 100644 --- a/_dashboards/management/multi-data-sources.md +++ b/_dashboards/management/multi-data-sources.md @@ -3,7 +3,7 @@ layout: default title: Configuring and using multiple data sources parent: Data sources nav_order: 10 -redirect_from: +redirect_from: - /dashboards/discover/multi-data-sources/ --- @@ -11,23 +11,22 @@ redirect_from: You can ingest, process, and analyze data from multiple data sources in OpenSearch Dashboards. You configure the data sources in the **Dashboards Management** > **Data sources** app, as shown in the following image. - Dashboards Management Data sources main screen ## Getting started -The following tutorial guides you through configuring and using multiple data sources. +The following tutorial guides you through configuring and using multiple data sources. ### Step 1: Modify the YAML file settings To use multiple data sources, you must enable the `data_source.enabled` setting. It is disabled by default. To enable multiple data sources: 1. Open your local copy of the OpenSearch Dashboards configuration file, `opensearch_dashboards.yml`. 
If you don't have a copy, [`opensearch_dashboards.yml`](https://github.com/opensearch-project/OpenSearch-Dashboards/blob/main/config/opensearch_dashboards.yml) is available on GitHub. -2. Set `data_source.enabled:` to `true` and save the YAML file. +2. Set `data_source.enabled:` to `true` and save the YAML file. 3. Restart the OpenSearch Dashboards container. 4. Verify that the configuration settings were configured properly by connecting to OpenSearch Dashboards and viewing the **Dashboards Management** navigation menu. **Data sources** appears in the sidebar. You'll see a view similar to the following image. - Data sources in sidebar within Dashboards Management +Data sources in sidebar within Dashboards Management ### Step 2: Create a new data source connection @@ -36,16 +35,17 @@ A data source connection specifies the parameters needed to connect to a data so To create a new data source connection: 1. From the OpenSearch Dashboards main menu, select **Dashboards Management** > **Data sources** > **Create data source connection**. -2. Add the required information to each field to configure **Connection Details** and **Authentication Method**. - + +2. Add the required information to each field to configure the **Connection Details** and **Authentication Method**. + - Under **Connection Details**, enter a title and endpoint URL. For this tutorial, use the URL `http://localhost:5601/app/management/opensearch-dashboards/dataSources`. Entering a description is optional. - Under **Authentication Method**, select an authentication method from the dropdown list. Once an authentication method is selected, the applicable fields for that method appear. You can then enter the required details. The authentication method options are: - - **No authentication**: No authentication is used to connect to the data source. - - **Username & Password**: A basic username and password are used to connect to the data source. - - **AWS SigV4**: An AWS Signature Version 4 authenticating request is used to connect to the data source. AWS Signature Version 4 requires an access key and a secret key. - - For AWS Signature Version 4 authentication, first specify the **Region**. Next, select the OpenSearch service in the **Service Name** list. The options are **Amazon OpenSearch Service** and **Amazon OpenSearch Serverless**. Last, enter the **Access Key** and **Secret Key** for authorization. - + - **No authentication**: No authentication is used to connect to the data source. + - **Username & Password**: A basic username and password are used to connect to the data source. + - **AWS SigV4**: An AWS Signature Version 4 authenticating request is used to connect to the data source. AWS Signature Version 4 requires an access key and a secret key. + - For AWS Signature Version 4 authentication, first specify the **Region**. Next, select the OpenSearch service from the **Service Name** list. The options are **Amazon OpenSearch Service** and **Amazon OpenSearch Serverless**. Last, enter the **Access Key** and **Secret Key** for authorization. + For information about available AWS Regions for AWS accounts, see [Available Regions](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions). For more information about AWS Signature Version 4 authentication requests, see [Authenticating Requests (AWS Signature Version 4)](https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html). 
{: .note} @@ -58,12 +58,11 @@ To create a new data source connection: - To make changes to the data source connection, select a connection in the list on the **Data Sources** main page. The **Connection Details** window opens. - To make changes to **Connection Details**, edit one or both of the **Title** and **Description** fields and select **Save changes** in the lower-right corner of the screen. You can also cancel changes here. To change the **Authentication Method**, choose a different authentication method, enter your credentials (if applicable), and then select **Save changes** in the lower-right corner of the screen. The changes are saved. - + - When **Username & Password** is the selected authentication method, you can update the password by choosing **Update stored password** next to the **Password** field. In the pop-up window, enter a new password in the first field and then enter it again in the second field to confirm. Select **Update stored password** in the pop-up window. The new password is saved. Select **Test connection** to confirm that the connection is valid. - - When **AWS SigV4** is the selected authentication method, you can update the credentials by selecting **Update stored AWS credential**. In the pop-up window, enter a new access key in the first field and a new secret key in the second field. Select **Update stored AWS credential** in the pop-up window. The new credentials are saved. Select **Test connection** in the upper-right corner of the screen to confirm that the connection is valid. -5. Delete the data source connection by selecting the check box to the left of the title and then choosing **Delete 1 connection**. Selecting multiple check boxes for multiple connections is supported. Alternatively, select the trash can icon ({::nomarkdown}trash can icon{:/}). +5. Delete the data source connection by selecting the check box to the left of the title and then choosing **Delete 1 connection**. Selecting multiple check boxes for multiple connections is supported. Alternatively, select the {::nomarkdown}trash can icon{:/} icon. An example data source connection screen is shown in the following image. @@ -71,7 +70,7 @@ An example data source connection screen is shown in the following image. ### Selecting multiple data sources through the Dev Tools console -Alternatively, you can select multiple data sources through the [Dev Tools]({{site.url}}{{site.baseurl}}/dashboards/dev-tools/index-dev/) console. This option provides for working with a broader range of data and gaining deeper insight into your code and applications. +Alternatively, you can select multiple data sources through the [Dev Tools]({{site.url}}{{site.baseurl}}/dashboards/dev-tools/index-dev/) console. This option allows you to work with a broader range of data and gaining a deeper understanding of your code and applications. Watch the following 10-second video to see it in action. @@ -79,7 +78,7 @@ Watch the following 10-second video to see it in action. To select a data source through the Dev Tools console, follow these steps: -1. Locate your copy of `opensearch_dashboards.yml` and open it in the editor of your choice. +1. Locate your copy of `opensearch_dashboards.yml` and open it in the editor of your choice. 2. Set `data_source.enabled` to `true`. 3. Connect to OpenSearch Dashboards and select **Dev Tools** in the menu. 4. 
Enter the following query in the editor pane of the **Console** and then select the play button: @@ -93,19 +92,55 @@ To select a data source through the Dev Tools console, follow these steps: 6. Repeat the preceding steps for each data source you want to select. ### Upload saved objects to a dashboard from connected data sources -To upload saved objects from connected data sources to a dashboard with multiple data sources, export them as an NDJSON file from the data source's **Saved object management** page. Then upload the file to the dashboard's **Saved object management** page. This method can make it easier to transfer saved objects between dashboards. The following 20-second video shows this feature in action. +To upload saved objects from connected data sources to a dashboard with multiple data sources, export them as an NDJSON file from the data source's **Saved object management** page. Then upload the file to the dashboard's **Saved object management** page. This method can simplify the transfer of saved objects between dashboards. The following 20-second video shows this feature in action. Multiple data sources in Saved object management{: .img-fluid} +#### Import saved objects from a connected data source + Follow these steps to import saved objects from a connected data source: -1. Locate your `opensearch_dashboards.yml` file and open it in your preferred text editor. +1. Locate your `opensearch_dashboards.yml` file and open it in your preferred text editor. 2. Set `data_source.enabled` to `true`. 3. Connect to OpenSearch Dashboards and go to **Dashboards Management** > **Saved objects**. 4. Select **Import** > **Select file** and upload the file acquired from the connected data source. 5. Choose the appropriate **Data source** from the dropdown menu, set your **Conflict management** option, and then select the **Import** button. +### Show or hide authentication methods for multiple data sources +Introduced 2.13 +{: .label .label-purple } + +A feature flag in your `opensearch_dashboards.yml` file allows you to show or hide authentication methods within the `data_source` plugin. The following example setting, shown in a 10-second demo, hides the authentication method for `AWSSigV4`. + +```` +# Set enabled to false to hide the authentication method from multiple data source in OpenSearch Dashboards. +# If this setting is commented out, then all three options will be available in OpenSearch Dashboards. +# The default value will be considered as true. +data_source.authTypes: + NoAuthentication: + enabled: true + UsernamePassword: + enabled: true + AWSSigV4: + enabled: false +```` + +Multiple data sources hide and show authentication{: .img-fluid} + +### Hide the local cluster option for multiple data sources +Introduced 2.13 +{: .label .label-purple } + +A feature flag in your `opensearch_dashboards.yml` file allows you to hide the local cluster option within the `data_source` plugin. This option hides the local cluster from the data source dropdown menu and index creation page, which is ideal for environments with or without a local OpenSearch cluster. The following example setting, shown in a 20-second demo, hides the local cluster. + +```` +# hide local cluster in the data source dropdown and index pattern creation page. +data_source.hideLocalCluster: true +```` + +Multiple data sources hide local cluster{: .img-fluid} + ## Next steps Once you've configured your multiple data sources, you can start exploring that data. 
See the following resources to learn more: @@ -120,5 +155,5 @@ Once you've configured your multiple data sources, you can start exploring that This feature has some limitations: * The multiple data sources feature is supported for index-pattern-based visualizations only. -* The visualization types Time Series Visual Builder (TSVB), Vega and Vega-Lite, and timeline are not supported. -* External plugins, such as Gantt chart, and non-visualization plugins, such as the developer console, are not supported. +* The Time Series Visual Builder (TSVB) and timeline visualization types are not supported. +* External plugins, such as `gantt-chart`, and non-visualization plugins are not supported. diff --git a/_dashboards/visualize/vega.md b/_dashboards/visualize/vega.md new file mode 100644 index 0000000000..7764d583a6 --- /dev/null +++ b/_dashboards/visualize/vega.md @@ -0,0 +1,192 @@ +--- +layout: default +title: Using Vega +parent: Building data visualizations +nav_order: 45 +--- + +# Using Vega + +[Vega](https://vega.github.io/vega/) and [Vega-Lite](https://vega.github.io/vega-lite/) are open-source, declarative language visualization tools that you can use to create custom data visualizations with your OpenSearch data and [Vega Data](https://vega.github.io/vega/docs/data/). These tools are ideal for advanced users comfortable with writing OpenSearch queries directly. Enable the `vis_type_vega` plugin in your `opensearch_dashboards.yml` file to write your [Vega specifications](https://vega.github.io/vega/docs/specification/) in either JSON or [HJSON](https://hjson.github.io/) format or to specify one or more OpenSearch queries within your Vega specification. By default, the plugin is set to `true`. The configuration is shown in the following example. For configuration details, refer to the `vis_type_vega` [README](https://github.com/opensearch-project/OpenSearch-Dashboards/blob/main/src/plugins/vis_type_vega/README.md). + +``` +vis_type_vega.enabled: true +``` + +The following image shows a custom Vega map created in OpenSearch. + +Map created using Vega visualization in OpenSearch Dashboards + +## Querying from multiple data sources + +If you have configured [multiple data sources]({{site.url}}{{site.baseurl}}/dashboards/management/multi-data-sources/) in OpenSearch Dashboards, you can use Vega to query those data sources. Within your Vega specification, add the `data_source_name` field under the `url` property to target a specific data source by name. By default, queries use data from the local cluster. You can assign individual `data_source_name` values to each OpenSearch query within your Vega specification. This allows you to query multiple indexes across different data sources in a single visualization. 
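Condensed to just the relevant `url` block, and using the same sample index and data source name as the full specification that follows, the query portion looks like the following sketch:

```
data: [
  {
    name: table
    url: {
      index: opensearch_dashboards_sample_data_flights
      // Queries the "Demo US Cluster" data source instead of the local cluster
      data_source_name: Demo US Cluster
      body: {
        // OpenSearch query body
      }
    }
  }
]
```
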
+ +The following is an example Vega specification with `Demo US Cluster` as the specified `data_source_name`: + +``` +{ + $schema: https://vega.github.io/schema/vega/v5.json + config: { + kibana: {type: "map", latitude: 25, longitude: -70, zoom: 3} + } + data: [ + { + name: table + url: { + index: opensearch_dashboards_sample_data_flights + // This OpenSearchQuery will query from the Demo US Cluster datasource + data_source_name: Demo US Cluster + %context%: true + // Uncomment to enable time filtering + // %timefield%: timestamp + body: { + size: 0 + aggs: { + origins: { + terms: {field: "OriginAirportID", size: 10000} + aggs: { + originLocation: { + top_hits: { + size: 1 + _source: { + includes: ["OriginLocation", "Origin"] + } + } + } + distinations: { + terms: {field: "DestAirportID", size: 10000} + aggs: { + destLocation: { + top_hits: { + size: 1 + _source: { + includes: ["DestLocation"] + } + } + } + } + } + } + } + } + } + } + format: {property: "aggregations.origins.buckets"} + transform: [ + { + type: geopoint + projection: projection + fields: [ + originLocation.hits.hits[0]._source.OriginLocation.lon + originLocation.hits.hits[0]._source.OriginLocation.lat + ] + } + ] + } + { + name: selectedDatum + on: [ + {trigger: "!selected", remove: true} + {trigger: "selected", insert: "selected"} + ] + } + ] + signals: [ + { + name: selected + value: null + on: [ + {events: "@airport:mouseover", update: "datum"} + {events: "@airport:mouseout", update: "null"} + ] + } + ] + scales: [ + { + name: airportSize + type: linear + domain: {data: "table", field: "doc_count"} + range: [ + {signal: "zoom*zoom*0.2+1"} + {signal: "zoom*zoom*10+1"} + ] + } + ] + marks: [ + { + type: group + from: { + facet: { + name: facetedDatum + data: selectedDatum + field: distinations.buckets + } + } + data: [ + { + name: facetDatumElems + source: facetedDatum + transform: [ + { + type: geopoint + projection: projection + fields: [ + destLocation.hits.hits[0]._source.DestLocation.lon + destLocation.hits.hits[0]._source.DestLocation.lat + ] + } + {type: "formula", expr: "{x:parent.x, y:parent.y}", as: "source"} + {type: "formula", expr: "{x:datum.x, y:datum.y}", as: "target"} + {type: "linkpath", shape: "diagonal"} + ] + } + ] + scales: [ + { + name: lineThickness + type: log + clamp: true + range: [1, 8] + } + { + name: lineOpacity + type: log + clamp: true + range: [0.2, 0.8] + } + ] + marks: [ + { + from: {data: "facetDatumElems"} + type: path + interactive: false + encode: { + update: { + path: {field: "path"} + stroke: {value: "black"} + strokeWidth: {scale: "lineThickness", field: "doc_count"} + strokeOpacity: {scale: "lineOpacity", field: "doc_count"} + } + } + } + ] + } + { + name: airport + type: symbol + from: {data: "table"} + encode: { + update: { + size: {scale: "airportSize", field: "doc_count"} + xc: {signal: "datum.x"} + yc: {signal: "datum.y"} + tooltip: { + signal: "{title: datum.originLocation.hits.hits[0]._source.Origin + ' (' + datum.key + ')', connnections: length(datum.distinations.buckets), flights: datum.doc_count}" + } + } + } + } + ] +} +``` +{% include copy-curl.html %} diff --git a/_data-prepper/common-use-cases/trace-analytics.md b/_data-prepper/common-use-cases/trace-analytics.md index 1f6c3b7cc4..033830351a 100644 --- a/_data-prepper/common-use-cases/trace-analytics.md +++ b/_data-prepper/common-use-cases/trace-analytics.md @@ -38,9 +38,9 @@ The [OpenTelemetry source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/c There are three processors for the trace analytics feature: -* 
*otel_traces_raw* - The *otel_traces_raw* processor receives a collection of [span](https://github.com/opensearch-project/data-prepper/blob/fa65e9efb3f8d6a404a1ab1875f21ce85e5c5a6d/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/trace/Span.java) records from [*otel-trace-source*]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/otel-trace/), and performs stateful processing, extraction, and completion of trace-group-related fields. -* *otel_traces_group* - The *otel_traces_group* processor fills in the missing trace-group-related fields in the collection of [span](https://github.com/opensearch-project/data-prepper/blob/298e7931aa3b26130048ac3bde260e066857df54/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/trace/Span.java) records by looking up the OpenSearch backend. -* *service_map_stateful* – The *service_map_stateful* processor performs the required preprocessing for trace data and builds metadata to display the `service-map` dashboards. +* otel_traces_raw -- The *otel_traces_raw* processor receives a collection of [span](https://github.com/opensearch-project/data-prepper/blob/fa65e9efb3f8d6a404a1ab1875f21ce85e5c5a6d/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/trace/Span.java) records from [*otel-trace-source*]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/otel-trace-source/), and performs stateful processing, extraction, and completion of trace-group-related fields. +* otel_traces_group -- The *otel_traces_group* processor fills in the missing trace-group-related fields in the collection of [span](https://github.com/opensearch-project/data-prepper/blob/298e7931aa3b26130048ac3bde260e066857df54/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/trace/Span.java) records by looking up the OpenSearch backend. +* service_map_stateful -- The *service_map_stateful* processor performs the required preprocessing for trace data and builds metadata to display the `service-map` dashboards. ### OpenSearch sink @@ -49,8 +49,8 @@ OpenSearch provides a generic sink that writes data to OpenSearch as the destina The sink provides specific configurations for the trace analytics feature. These configurations allow the sink to use indexes and index templates specific to trace analytics. The following OpenSearch indexes are specific to trace analytics: -* *otel-v1-apm-span* – The *otel-v1-apm-span* index stores the output from the [otel_traces_raw]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/otel-trace-raw/) processor. -* *otel-v1-apm-service-map* – The *otel-v1-apm-service-map* index stores the output from the [service_map_stateful]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/service-map-stateful/) processor. +* otel-v1-apm-span –- The *otel-v1-apm-span* index stores the output from the [otel_traces_raw]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/otel-trace-raw/) processor. +* otel-v1-apm-service-map –- The *otel-v1-apm-service-map* index stores the output from the [service_map_stateful]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/service-map-stateful/) processor. ## Trace tuning @@ -374,4 +374,4 @@ Starting with Data Prepper version 1.4, trace processing uses Data Prepper's eve * `otel_traces_group` replaces `otel_traces_group_prepper` for event-based spans. In Data Prepper version 2.0, `otel_traces_source` will only output events. 
Data Prepper version 2.0 also removes `otel_traces_raw_prepper` and `otel_traces_group_prepper` entirely. To migrate to Data Prepper version 2.0, you can configure your trace pipeline using the event model. - \ No newline at end of file + diff --git a/_data-prepper/managing-data-prepper/configuring-data-prepper.md b/_data-prepper/managing-data-prepper/configuring-data-prepper.md index bcff65ed4c..d6750daba4 100644 --- a/_data-prepper/managing-data-prepper/configuring-data-prepper.md +++ b/_data-prepper/managing-data-prepper/configuring-data-prepper.md @@ -128,6 +128,7 @@ extensions: region: sts_role_arn: refresh_interval: + disable_refresh: false : ... ``` @@ -148,7 +149,8 @@ Option | Required | Type | Description secret_id | Yes | String | The AWS secret name or ARN. | region | No | String | The AWS region of the secret. Defaults to `us-east-1`. sts_role_arn | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to the AWS Secrets Manager. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). -refresh_interval | No | Duration | The refreshment interval for AWS secrets extension plugin to poll new secret values. Defaults to `PT1H`. See [Automatically refreshing secrets](#automatically-refreshing-secrets) for details. +refresh_interval | No | Duration | The refreshment interval for the AWS Secrets extension plugin to poll new secret values. Defaults to `PT1H`. For more information, see [Automatically refreshing secrets](#automatically-refreshing-secrets). +disable_refresh | No | Boolean | Disables regular polling on the latest secret values inside the AWS secrets extension plugin. Defaults to `false`. When set to `true`, `refresh_interval` will not be used. #### Reference secrets ß diff --git a/_data-prepper/managing-data-prepper/extensions/extensions.md b/_data-prepper/managing-data-prepper/extensions/extensions.md new file mode 100644 index 0000000000..8cbfc602c7 --- /dev/null +++ b/_data-prepper/managing-data-prepper/extensions/extensions.md @@ -0,0 +1,15 @@ +--- +layout: default +title: Extensions +parent: Managing Data Prepper +has_children: true +nav_order: 18 +--- + +# Extensions + +Data Prepper extensions provide Data Prepper functionality outside of core Data Prepper pipeline components. +Many extensions provide configuration options that give Data Prepper administrators greater flexibility over Data Prepper's functionality. + +Extension configurations can be configured in the `data-prepper-config.yaml` file under the `extensions:` YAML block. + diff --git a/_data-prepper/managing-data-prepper/extensions/geoip_service.md b/_data-prepper/managing-data-prepper/extensions/geoip_service.md new file mode 100644 index 0000000000..53c21a08ff --- /dev/null +++ b/_data-prepper/managing-data-prepper/extensions/geoip_service.md @@ -0,0 +1,67 @@ +--- +layout: default +title: geoip_service +nav_order: 5 +parent: Extensions +grand_parent: Managing Data Prepper +--- + +# geoip_service + +The `geoip_service` extension configures all [`geoip`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/geoip) processors in Data Prepper. + +## Usage + +You can configure the GeoIP service that Data Prepper uses for the `geoip` processor. +By default, the GeoIP service comes with the [`maxmind`](#maxmind) option configured. 
+ +The following example shows how to configure the `geoip_service` in the `data-prepper-config.yaml` file: + +``` +extensions: + geoip_service: + maxmind: + database_refresh_interval: PT1H + cache_count: 16_384 +``` + +## maxmind + +The GeoIP service supports the MaxMind [GeoIP and GeoLite](https://dev.maxmind.com/geoip) databases. +By default, Data Prepper will use all three of the following [MaxMind GeoLite2](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data) databases: + +* City +* Country +* ASN + +The service also downloads databases automatically to keep Data Prepper up to date with changes from MaxMind. + +You can use the following options to configure the `maxmind` extension. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`databases` | No | [database](#database) | The database configuration. +`database_refresh_interval` | No | Duration | How frequently to check for updates from MaxMind. This can be any duration in the range of 15 minutes to 30 days. Default is `PT7D`. +`cache_count` | No | Integer | The maximum cache count by number of items in the cache, with a range of 100--100,000. Default is `4096`. +`database_destination` | No | String | The name of the directory in which to store downloaded databases. Default is `{data-prepper.dir}/data/geoip`. +`aws` | No | [aws](#aws) | Configures the AWS credentials for downloading the database from Amazon Simple Storage Service (Amazon S3). +`insecure` | No | Boolean | When `true`, this options allows you to download database files over HTTP. Default is `false`. + +## database + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`city` | No | String | The URL of the city in which the database resides. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL. +`country` | No | String | The URL of the country in which the database resides. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL. +`asn` | No | String | The URL of the Autonomous System Number (ASN) of where the database resides. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL. +`enterprise` | No | String | The URL of the enterprise in which the database resides. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL. + + +## aws + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`region` | No | String | The AWS Region to use for the credentials. Default is the [standard SDK behavior for determining the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). +`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon S3. Default is `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). +`aws_sts_header_overrides` | No | Map | A map of header overrides that the AWS Identity and Access Management (IAM) role assumes when downloading from Amazon S3. +`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the STS role. For more information, see the `ExternalID` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference. 
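As an additional illustration, the following sketch combines the `databases` and `aws` options to download the city and ASN databases from Amazon S3. The bucket name and role ARN are hypothetical placeholders:

```
extensions:
  geoip_service:
    maxmind:
      database_refresh_interval: PT7D
      databases:
        # Hypothetical S3 locations for the MaxMind database files
        city: s3://example-geoip-bucket/GeoLite2-City.mmdb
        asn: s3://example-geoip-bucket/GeoLite2-ASN.mmdb
      aws:
        region: us-east-1
        sts_role_arn: arn:aws:iam::123456789012:role/example-geoip-access
```
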
diff --git a/_data-prepper/pipelines/configuration/buffers/kafka.md b/_data-prepper/pipelines/configuration/buffers/kafka.md index 675a0c9775..f641874a91 100644 --- a/_data-prepper/pipelines/configuration/buffers/kafka.md +++ b/_data-prepper/pipelines/configuration/buffers/kafka.md @@ -41,11 +41,12 @@ Use the following configuration options with the `kafka` buffer. Option | Required | Type | Description --- | --- | --- | --- -`bootstrap_servers` | Yes | String list | The host and port for the initial connection to the Kafka cluster. You can configure multiple Kafka brokers by using the IP address or the port number for each broker. When using [Amazon Managed Streaming for Apache Kafka (Amazon MSK)](https://aws.amazon.com/msk/) as your Kafka cluster, the bootstrap server information is obtained from Amazon MSK using the Amazon Resource Name (ARN) provided in the configuration. -`topics` | Yes | List | A list of [topics](#topic) to use. You must supply one topic per buffer. `authentication` | No | [Authentication](#authentication) | Sets the authentication options for both the pipeline and Kafka. For more information, see [Authentication](#authentication). -`encryption` | No | [Encryption](#encryption) | The encryption configuration for encryption in transit. For more information, see [Encryption](#encryption). `aws` | No | [AWS](#aws) | The AWS configuration. For more information, see [aws](#aws). +`bootstrap_servers` | Yes | String list | The host and port for the initial connection to the Kafka cluster. You can configure multiple Kafka brokers by using the IP address or the port number for each broker. When using [Amazon Managed Streaming for Apache Kafka (Amazon MSK)](https://aws.amazon.com/msk/) as your Kafka cluster, the bootstrap server information is obtained from Amazon MSK using the Amazon Resource Name (ARN) provided in the configuration. +`encryption` | No | [Encryption](#encryption) | The encryption configuration for encryption in transit. For more information, see [Encryption](#encryption). +`producer_properties` | No | [Producer Properties](#producer_properties) | A list of configurable Kafka producer properties. +`topics` | Yes | List | A list of [topics](#topic) for the buffer to use. You must supply one topic per buffer. ### topic @@ -73,6 +74,7 @@ Option | Required | Type | Description `retry_backoff` | No | Integer | The amount of time to wait before attempting to retry a failed request to a given topic partition. Default is `10s`. `max_poll_interval` | No | Integer | The maximum delay between invocations of a `poll()` when using group management through Kafka's `max.poll.interval.ms` option. Default is `300s`. `consumer_max_poll_records` | No | Integer | The maximum number of records returned in a single `poll()` call through Kafka's `max.poll.records` setting. Default is `500`. +`max_message_bytes` | No | Integer | The maximum size of the message, in bytes. Default is 1 MB. ### kms @@ -123,6 +125,13 @@ Option | Required | Type | Description `type` | No | String | The encryption type. Use `none` to disable encryption. Default is `ssl`. `insecure` | No | Boolean | A Boolean flag used to turn off SSL certificate verification. If set to `true`, certificate authority (CA) certificate verification is turned off and insecure HTTP requests are sent. Default is `false`. +#### producer_properties + +Use the following configuration options to configure a Kafka producer. 
+Option | Required | Type | Description +:--- | :--- | :--- | :--- +`max_request_size` | No | Integer | The maximum size of the request that the producer sends to Kafka. Default is 1 MB. + #### aws diff --git a/_data-prepper/pipelines/configuration/processors/date.md b/_data-prepper/pipelines/configuration/processors/date.md index 27b571df04..7ac1040c26 100644 --- a/_data-prepper/pipelines/configuration/processors/date.md +++ b/_data-prepper/pipelines/configuration/processors/date.md @@ -9,24 +9,32 @@ nav_order: 50 # date -The `date` processor adds a default timestamp to an event, parses timestamp fields, and converts timestamp information to the International Organization for Standardization (ISO) 8601 format. This timestamp information can be used as an event timestamp. +The `date` processor adds a default timestamp to an event, parses timestamp fields, and converts timestamp information to the International Organization for Standardization (ISO) 8601 format. This timestamp information can be used as an event timestamp. ## Configuration The following table describes the options you can use to configure the `date` processor. + Option | Required | Type | Description :--- | :--- | :--- | :--- -match | Conditionally | List | List of `key` and `patterns` where patterns is a list. The list of match can have exactly one `key` and `patterns`. There is no default value. This option cannot be defined at the same time as `from_time_received`. Include multiple date processors in your pipeline if both options should be used. -from_time_received | Conditionally | Boolean | A boolean that is used for adding default timestamp to event data from event metadata which is the time when source receives the event. Default value is `false`. This option cannot be defined at the same time as `match`. Include multiple date processors in your pipeline if both options should be used. -destination | No | String | Field to store the timestamp parsed by date processor. It can be used with both `match` and `from_time_received`. Default value is `@timestamp`. -source_timezone | No | String | Time zone used to parse dates. It is used in case the zone or offset cannot be extracted from the value. If the zone or offset are part of the value, then timezone is ignored. Find all the available timezones [the list of database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List) in the **TZ database name** column. -destination_timezone | No | String | Timezone used for storing timestamp in `destination` field. The available timezone values are the same as `source_timestamp`. -locale | No | String | Locale is used for parsing dates. It's commonly used for parsing month names(`MMM`). It can have language, country and variant fields using IETF BCP 47 or String representation of [Locale](https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html) object. For example `en-US` for IETF BCP 47 and `en_US` for string representation of Locale. Full list of locale fields which includes language, country and variant can be found [the language subtag registry](https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry). Default value is `Locale.ROOT`. +`match` | Conditionally | [Match](#Match) | The date match configuration. This option cannot be defined at the same time as `from_time_received`. There is no default value. 
+`from_time_received` | Conditionally | Boolean | When `true`, the timestamp from the event metadata, which is the time at which the source receives the event, is added to the event data. This option cannot be defined at the same time as `match`. Default is `false`. +`date_when` | No | String | Specifies under what condition the `date` processor should perform matching. Default is no condition. +`to_origination_metadata` | No | Boolean | When `true`, the matched time is also added to the event's metadata as an instance of `Instant`. Default is `false`. +`destination` | No | String | The field used to store the timestamp parsed by the date processor. Can be used with both `match` and `from_time_received`. Default is `@timestamp`. +`output_format` | No | String | Determines the format of the timestamp added to an event. Default is `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. +`source_timezone` | No | String | The time zone used to parse dates, including when the zone or offset cannot be extracted from the value. If the zone or offset are part of the value, then the time zone is ignored. A list of all the available time zones is contained in the **TZ database name** column of [the list of database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List). +`destination_timezone` | No | String | The time zone used for storing the timestamp in the `destination` field. A list of all the available time zones is contained in the **TZ database name** column of [the list of database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List). +`locale` | No | String | The location used for parsing dates. Commonly used for parsing month names (`MMM`). The value can contain language, country, or variant fields in IETF BCP 47, such as `en-US`, or a string representation of the [locale](https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html) object, such as `en_US`. A full list of locale fields, including language, country, and variant, can be found in [the language subtag registry](https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry). Default is `Locale.ROOT`. + - +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`key` | Yes | String | Represents the event key against which to match patterns. Required if `match` is configured. +`patterns` | Yes | List | A list of possible patterns that the timestamp value of the key can have. The patterns are based on a sequence of letters and symbols. The `patterns` support all the patterns listed in the Java [DatetimeFormatter](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html) reference. The timestamp value also supports `epoch_second`, `epoch_milli`, and `epoch_nano` values, which represent the timestamp as the number of seconds, milliseconds, and nanoseconds since the epoch. Epoch values always use the UTC time zone. ## Metrics @@ -40,5 +48,29 @@ The following table describes common [Abstract processor](https://github.com/ope The `date` processor includes the following custom metrics. -* `dateProcessingMatchSuccessCounter`: Returns the number of records that match with at least one pattern specified by the `match configuration` option. -* `dateProcessingMatchFailureCounter`: Returns the number of records that did not match any of the patterns specified by the `patterns match` configuration option. 
\ No newline at end of file +* `dateProcessingMatchSuccessCounter`: Returns the number of records that match at least one pattern specified by the `match configuration` option. +* `dateProcessingMatchFailureCounter`: Returns the number of records that did not match any of the patterns specified by the `patterns match` configuration option. + +## Example: Add the default timestamp to an event +The following `date` processor configuration can be used to add a default timestamp in the `@timestamp` filed applied to all events: + +```yaml +- date: + from_time_received: true + destination: "@timestamp" +``` + +## Example: Parse a timestamp to convert its format and time zone +The following `date` processor configuration can be used to parse the value of the timestamp applied to `dd/MMM/yyyy:HH:mm:ss` and write it in `yyyy-MM-dd'T'HH:mm:ss.SSSXXX` format: + +```yaml +- date: + match: + - key: timestamp + patterns: ["dd/MMM/yyyy:HH:mm:ss"] + destination: "@timestamp" + output_format: "yyyy-MM-dd'T'HH:mm:ss.SSSXXX" + source_timezone: "America/Los_Angeles" + destination_timezone: "America/Chicago" + locale: "en_US" +``` diff --git a/_data-prepper/pipelines/configuration/processors/decompress.md b/_data-prepper/pipelines/configuration/processors/decompress.md new file mode 100644 index 0000000000..d03c236ac5 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/decompress.md @@ -0,0 +1,49 @@ +--- +layout: default +title: decompress +parent: Processors +grand_parent: Pipelines +nav_order: 40 +--- + +# decompress + +The `decompress` processor decompresses any Base64-encoded compressed fields inside of an event. + +## Configuration + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`keys` | Yes | List | The fields in the event that will be decompressed. +`type` | Yes | Enum | The type of decompression to use for the `keys` in the event. Only `gzip` is supported. +`decompress_when` | No | String| A [Data Prepper conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/) that determines when the `decompress` processor will run on certain events. +`tags_on_failure` | No | List | A list of strings with which to tag events when the processor fails to decompress the `keys` inside an event. Defaults to `_decompression_failure`. + +## Usage + +The following example shows the `decompress` processor used in `pipelines.yaml`: + +```yaml +processor: + - decompress: + decompress_when: '/some_key == null' + keys: [ "base_64_gzip_key" ] + type: gzip +``` + +## Metrics + +The following table describes common [abstract processor](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/processor/AbstractProcessor.java) metrics. + +| Metric name | Type | Description | +| ------------- | ---- | -----------| +| `recordsIn` | Counter | The ingress of records to a pipeline component. | +| `recordsOut` | Counter | The egress of records from a pipeline component. | +| `timeElapsed` | Timer | The time elapsed during execution of a pipeline component. | + +### Counter + +The `decompress` processor accounts for the following metrics: + +* `processingErrors`: The number of processing errors that have occurred in the `decompress` processor. 
+ diff --git a/_data-prepper/pipelines/configuration/processors/delete-entries.md b/_data-prepper/pipelines/configuration/processors/delete-entries.md index 0546ed67c4..33c54a0b29 100644 --- a/_data-prepper/pipelines/configuration/processors/delete-entries.md +++ b/_data-prepper/pipelines/configuration/processors/delete-entries.md @@ -3,7 +3,7 @@ layout: default title: delete_entries parent: Processors grand_parent: Pipelines -nav_order: 51 +nav_order: 41 --- # delete_entries diff --git a/_data-prepper/pipelines/configuration/processors/dissect.md b/_data-prepper/pipelines/configuration/processors/dissect.md index 2d32ba47ae..a8258bee4e 100644 --- a/_data-prepper/pipelines/configuration/processors/dissect.md +++ b/_data-prepper/pipelines/configuration/processors/dissect.md @@ -3,7 +3,7 @@ layout: default title: dissect parent: Processors grand_parent: Pipelines -nav_order: 52 +nav_order: 45 --- # dissect diff --git a/_data-prepper/pipelines/configuration/processors/drop-events.md b/_data-prepper/pipelines/configuration/processors/drop-events.md index d030f14a27..1f601c9743 100644 --- a/_data-prepper/pipelines/configuration/processors/drop-events.md +++ b/_data-prepper/pipelines/configuration/processors/drop-events.md @@ -3,7 +3,7 @@ layout: default title: drop_events parent: Processors grand_parent: Pipelines -nav_order: 53 +nav_order: 46 --- # drop_events diff --git a/_data-prepper/pipelines/configuration/processors/flatten.md b/_data-prepper/pipelines/configuration/processors/flatten.md new file mode 100644 index 0000000000..43793c2b83 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/flatten.md @@ -0,0 +1,239 @@ +--- +layout: default +title: flatten +parent: Processors +grand_parent: Pipelines +nav_order: 48 +--- + +# flatten + +The `flatten` processor transforms nested objects inside of events into flattened structures. + +## Configuration + +The following table describes configuration options for the `flatten` processor. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`source` | Yes | String | The source key on which to perform the operation. If set to an empty string (`""`), then the processor uses the root of the event as the source. +`target` | Yes | String | The target key to put into the flattened fields. If set to an empty string (`""`), then the processor uses the root of the event as the target. +`exclude_keys` | No | List | The keys from the source field that should be excluded from processing. Default is an empty list (`[]`). +`remove_processed_fields` | No | Boolean | When `true`, the processor removes all processed fields from the source. Default is `false`. +`remove_list_indices` | No | Boolean | When `true`, the processor converts the fields from the source map into lists and puts the lists into the target field. Default is `false`. +`flatten_when` | No | String | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that determines whether the `flatten` processor will be run on the event. Default is `null`, which means that all events will be processed unless otherwise stated. +`tags_on_failure` | No | List | A list of tags to add to the event metadata when the event fails to process. + +## Usage + +The following examples show how the `flatten` processor can be used in Data Prepper pipelines. 
+ +### Minimum configuration + +The following example shows only the parameters that are required for using the `flatten` processor, `source` and `target`: + +```yaml +... + processor: + - flatten: + source: "key2" + target: "flattened-key2" +... +``` +{% include copy.html %} + +For example, when the input event contains the following nested objects: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + } +} +``` + +The `flatten` processor creates a flattened structure under the `flattened-key2` object, as shown in the following output: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "flattened-key2": { + "key3.key4": "val2" + } +} +``` + +### Remove processed fields + +Use the `remove_processed_fields` option when flattening all of an event's nested objects. This removes all the event's processed fields, as shown in the following example: + +```yaml +... + processor: + - flatten: + source: "" # empty string represents root of event + target: "" # empty string represents root of event + remove_processed_fields: true +... +``` + +For example, when the input event contains the following nested objects: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "list1": [ + { + "list2": [ + { + "name": "name1", + "value": "value1" + }, + { + "name": "name2", + "value": "value2" + } + ] + } + ] +} +``` + + +The `flatten` processor creates a flattened structure in which all processed fields are absent, as shown in the following output: + +```json +{ + "key1": "val1", + "key2.key3.key4": "val2", + "list1[0].list2[0].name": "name1", + "list1[0].list2[0].value": "value1", + "list1[0].list2[1].name": "name2", + "list1[0].list2[1].value": "value2", +} +``` + +### Exclude specific keys from flattening + +Use the `exclude_keys` option to prevent specific keys from being flattened in the output, as shown in the following example, where the `key2` value is excluded: + +```yaml +... + processor: + - flatten: + source: "" # empty string represents root of event + target: "" # empty string represents root of event + remove_processed_fields: true + exclude_keys: ["key2"] +... +``` + +For example, when the input event contains the following nested objects: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "list1": [ + { + "list2": [ + { + "name": "name1", + "value": "value1" + }, + { + "name": "name2", + "value": "value2" + } + ] + } + ] +} +``` + +All other nested objects in the input event, excluding the `key2` key, will be flattened, as shown in the following example: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "list1[0].list2[0].name": "name1", + "list1[0].list2[0].value": "value1", + "list1[0].list2[1].name": "name2", + "list1[0].list2[1].value": "value2", +} +``` + +### Remove list indexes + +Use the `remove_list_indices` option to convert the fields from the source map into lists and put the lists into the target field, as shown in the following example: + +```yaml +... + processor: + - flatten: + source: "" # empty string represents root of event + target: "" # empty string represents root of event + remove_processed_fields: true + remove_list_indices: true +... 
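+# remove_processed_fields drops the original nested keys from the event.
+# remove_list_indices collapses numeric list indexes (list1[0], list1[1], ...) into a
+# single list1[] key whose value is a list, as shown in the example output below.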
+``` + +For example, when the input event contains the following nested objects: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "list1": [ + { + "list2": [ + { + "name": "name1", + "value": "value1" + }, + { + "name": "name2", + "value": "value2" + } + ] + } + ] +} +``` + +The processor removes all indexes from the output and places them into the source map as a flattened, structured list, as shown in the following example: + +```json +{ + "key1": "val1", + "key2.key3.key4": "val2", + "list1[].list2[].name": ["name1","name2"], + "list1[].list2[].value": ["value1","value2"] +} +``` diff --git a/_data-prepper/pipelines/configuration/processors/geoip.md b/_data-prepper/pipelines/configuration/processors/geoip.md new file mode 100644 index 0000000000..b7418c66c6 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/geoip.md @@ -0,0 +1,67 @@ +--- +layout: default +title: geoip +parent: Processors +grand_parent: Pipelines +nav_order: 49 +--- + +# geoip + +The `geoip` processor enriches events with geographic information extracted from IP addresses contained in the events. +By default, Data Prepper uses the [MaxMind GeoLite2](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data) geolocation database. +Data Prepper administrators can configure the databases using the [`geoip_service`]({{site.url}}{{site.baseurl}}/data-prepper/managing-data-prepper/extensions/geoip_service) extension configuration. + +## Usage + +You can configure the `geoip` processor to work on entries. + +The minimal configuration requires at least one entry, and each entry at least one source field. + +The following configuration extracts all available geolocation data from the IP address provided in the field named `clientip`. +It will write the geolocation data to a new field named `geo`, the default source when none is configured: + +``` +my-pipeline: + processor: + - geoip: + entries: + - source: clientip +``` + +The following example excludes Autonomous System Number (ASN) fields and puts the geolocation data into a field named `clientlocation`: + +``` +my-pipeline: + processor: + - geoip: + entries: + - source: clientip + target: clientlocation + include_fields: [asn, asn_organization, network] +``` + + +## Configuration + +You can use the following options to configure the `geoip` processor. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`entries` | Yes | [entry](#entry) list | The list of entries marked for enrichment. +`geoip_when` | No | String | Specifies under what condition the `geoip` processor should perform matching. Default is no condition. +`tags_on_no_valid_ip` | No | String | The tags to add to the event metadata if the source field is not a valid IP address. This includes the localhost IP address. +`tags_on_ip_not_found` | No | String | The tags to add to the event metadata if the `geoip` processor is unable to find a location for the IP address. +`tags_on_engine_failure` | No | String | The tags to add to the event metadata if the `geoip` processor is unable to enrich an event due to an engine failure. + +## entry + +The following parameters allow you to configure a single geolocation entry. Each entry corresponds to a single IP address. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`source` | Yes | String | The key of the source field containing the IP address to geolocate. +`target` | No | String | The key of the target field in which to save the geolocation data. Default is `geo`. 
+`include_fields` | No | String list | The list of geolocation fields to include in the `target` object. By default, this is all the fields provided by the configured databases. +`exclude_fields` | No | String list | The list of geolocation fields to exclude from the `target` object. + diff --git a/_data-prepper/pipelines/configuration/processors/grok.md b/_data-prepper/pipelines/configuration/processors/grok.md index d1eea278d2..16f72c4968 100644 --- a/_data-prepper/pipelines/configuration/processors/grok.md +++ b/_data-prepper/pipelines/configuration/processors/grok.md @@ -3,7 +3,7 @@ layout: default title: Grok parent: Processors grand_parent: Pipelines -nav_order: 54 +nav_order: 50 --- # Grok @@ -15,26 +15,25 @@ The Grok processor uses pattern matching to structure and extract important keys The following table describes options you can use with the Grok processor to structure your data and make your data easier to query. Option | Required | Type | Description -:--- | :--- | :--- | :--- -break_on_match | No | Boolean | Specifies whether to match all patterns or stop once the first successful match is found. Default value is `true`. -grok_when | No | String | Specifies under what condition the `Grok` processor should perform matching. Default is no condition. -keep_empty_captures | No | Boolean | Enables the preservation of `null` captures. Default value is `false`. -keys_to_overwrite | No | List | Specifies which existing keys will be overwritten if there is a capture with the same key value. Default value is `[]`. -match | No | Map | Specifies which keys to match specific patterns against. Default value is an empty body. -named_captures_only | No | Boolean | Specifies whether to keep only named captures. Default value is `true`. -pattern_definitions | No | Map | Allows for custom pattern use inline. Default value is an empty body. -patterns_directories | No | List | Specifies the path of directories that contain customer pattern files. Default value is an empty list. -pattern_files_glob | No | String | Specifies which pattern files to use from the directories specified for `pattern_directories`. Default value is `*`. -target_key | No | String | Specifies a parent-level key used to store all captures. Default value is `null`. -timeout_millis | No | Integer | The maximum amount of time during which matching occurs. Setting to `0` disables the timeout. Default value is `30,000`. - - +:--- | :--- |:--- | :--- +`break_on_match` | No | Boolean | Specifies whether to match all patterns (`true`) or stop once the first successful match is found (`false`). Default is `true`. +`grok_when` | No | String | Specifies under what condition the `grok` processor should perform matching. Default is no condition. +`keep_empty_captures` | No | Boolean | Enables the preservation of `null` captures from the processed output. Default is `false`. +`keys_to_overwrite` | No | List | Specifies which existing keys will be overwritten if there is a capture with the same key value. Default is `[]`. +`match` | No | Map | Specifies which keys should match specific patterns. Default is an empty response body. +`named_captures_only` | No | Boolean | Specifies whether to keep only named captures. Default is `true`. +`pattern_definitions` | No | Map | Allows for a custom pattern that can be used inline inside the response body. Default is an empty response body. +`patterns_directories` | No | List | Specifies which directory paths contain the custom pattern files. Default is an empty list. 
+`pattern_files_glob` | No | String | Specifies which pattern files to use from the directories specified for `pattern_directories`. Default is `*`. +`target_key` | No | String | Specifies a parent-level key used to store all captures. Default value is `null`. +`timeout_millis` | No | Integer | The maximum amount of time during which matching occurs. Setting to `0` prevents any matching from occurring. Default is `30,000`. +`performance_metadata` | No | Boolean | Whether or not to add the performance metadata to events. Default is `false`. For more information, see [Grok performance metadata](#grok-performance-metadata). + ## Conditional grok -The Grok processor can be configured to run conditionally by using the `grok_when` option. The following is an example Grok processor configuration that uses `grok_when`: +The `grok` processor can be configured to run conditionally by using the `grok_when` option. The following is an example Grok processor configuration that uses `grok_when`: + ``` processor: - grok: @@ -46,8 +45,36 @@ processor: match: message: ['%{IPV6:clientip} %{WORD:request} %{POSINT:bytes}'] ``` +{% include copy.html %} + The `grok_when` option can take a conditional expression. This expression is detailed in the [Expression syntax](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/) documentation. +## Grok performance metadata + +When the `performance_metadata` option is set to `true`, the `grok` processor adds the following metadata keys to each event: + +* `_total_grok_processing_time`: The total amount of time, in milliseconds, that the `grok` processor takes to match the event. This is the sum of the processing time based on all of the `grok` processors that ran on the event and have the `performance_metadata` option enabled. +* `_total_grok_patterns_attempted`: The total number of `grok` pattern match attempts across all `grok` processors that ran on the event. + +To include Grok performance metadata when the event is sent to the sink inside the pipeline, use the `add_entries` processor to describe the metadata you want to include, as shown in the following example: + + +```yaml +processor: + - grok: + performance_metadata: true + match: + log: "%{COMMONAPACHELOG"} + - add_entries: + entries: + - add_when: 'getMetadata("_total_grok_patterns_attempted") != null' + key: "grok_patterns_attempted" + value_expression: 'getMetadata("_total_grok_patterns_attempted")' + - add_when: 'getMetadata("_total_grok_processing_time") != null' + key: "grok_time_spent" + value_expression: 'getMetadata("_total_grok_processing_time")' +``` + ## Metrics The following table describes common [Abstract processor](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/processor/AbstractProcessor.java) metrics. diff --git a/_data-prepper/pipelines/configuration/processors/list-to-map.md b/_data-prepper/pipelines/configuration/processors/list-to-map.md index 4b137f5ce8..15a90ffc24 100644 --- a/_data-prepper/pipelines/configuration/processors/list-to-map.md +++ b/_data-prepper/pipelines/configuration/processors/list-to-map.md @@ -16,10 +16,12 @@ The following table describes the configuration options used to generate target Option | Required | Type | Description :--- | :--- | :--- | :--- -`key` | Yes | String | The key of the fields to be extracted as keys in the generated mappings. `source` | Yes | String | The list of objects with `key` fields to be converted into keys for the generated map. 
`target` | No | String | The target for the generated map. When not specified, the generated map will be placed in the root node. +`key` | Conditionally | String | The key of the fields to be extracted as keys in the generated mappings. Must be specified if `use_source_key` is `false`. +`use_source_key` | No | Boolean | When `true`, keys in the generated map will use original keys from the source. Default is `false`. `value_key` | No | String | When specified, values given a `value_key` in objects contained in the source list will be extracted and converted into the value specified by this option based on the generated map. When not specified, objects contained in the source list retain their original value when mapped. +`extract_value` | No | Boolean | When `true`, object values from the source list will be extracted and added to the generated map. When `false`, object values from the source list are added to the generated map as they appear in the source list. Default is `false` `flatten` | No | Boolean | When `true`, values in the generated map output flatten into single items based on the `flattened_element`. Otherwise, objects mapped to values from the generated map appear as lists. `flattened_element` | Conditionally | String | The element to keep, either `first` or `last`, when `flatten` is set to `true`. @@ -302,4 +304,52 @@ Some objects in the response may have more than one element in their values, as "val-c" ] } +``` + +### Example: `use_source_key` and `extract_value` set to `true` + +The following example `pipeline.yaml` file sets `flatten` to `false`, causing the processor to output values from the generated map as a list: + +```yaml +pipeline: + source: + file: + path: "/full/path/to/logs_json.log" + record_type: "event" + format: "json" + processor: + - list_to_map: + source: "mylist" + use_source_key: true + extract_value: true + sink: + - stdout: +``` +{% include copy.html %} + +Object values from `mylist` are extracted and added to fields with the source keys `name` and `value`, as shown in the following response: + +```json +{ + "mylist": [ + { + "name": "a", + "value": "val-a" + }, + { + "name": "b", + "value": "val-b1" + }, + { + "name": "b", + "value": "val-b2" + }, + { + "name": "c", + "value": "val-c" + } + ], + "name": ["a", "b", "b", "c"], + "value": ["val-a", "val-b1", "val-b2", "val-c"] +} ``` \ No newline at end of file diff --git a/_data-prepper/pipelines/configuration/processors/map-to-list.md b/_data-prepper/pipelines/configuration/processors/map-to-list.md new file mode 100644 index 0000000000..f3393e6c46 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/map-to-list.md @@ -0,0 +1,277 @@ +--- +layout: default +title: map_to_list +parent: Processors +grand_parent: Pipelines +nav_order: 63 +--- + +# map_to_list + +The `map_to_list` processor converts a map of key-value pairs to a list of objects. Each object contains the key and value in separate fields. + +## Configuration + +The following table describes the configuration options for the `map_to_list` processor. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`source` | Yes | String | The source map used to perform the mapping operation. When set to an empty string (`""`), it will use the root of the event as the `source`. +`target` | Yes | String | The target for the generated list. +`key_name` | No | String | The name of the field in which to store the original key. Default is `key`. 
+`value_name` | No | String | The name of the field in which to store the original value. Default is `value`. +`exclude_keys` | No | List | The keys in the source map that will be excluded from processing. Default is an empty list (`[]`). +`remove_processed_fields` | No | Boolean | When `true`, the processor will remove the processed fields from the source map. Default is `false`. +`convert_field_to_list` | No | Boolean | If `true`, the processor will convert the fields from the source map into lists and place them in fields in the target list. Default is `false`. +`map_to_list_when` | No | String | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that will be evaluated to determine whether the processor will be run on the event. Default is `null`. All events will be processed unless otherwise stated. +`tags_on_failure` | No | List | A list of tags to add to the event metadata when the event fails to process. + +## Usage + +The following examples show how the `map_to_list` processor can be used in your pipeline. + +### Example: Minimum configuration + +The following example shows the `map_to_list` processor with only the required parameters, `source` and `target`, configured: + +```yaml +... + processor: + - map_to_list: + source: "my-map" + target: "my-list" +... +``` +{% include copy.html %} + +When the input event contains the following data: + +```json +{ + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + + +The processed event will contain the following output: + +```json +{ + "my-list": [ + { + "key": "key1", + "value": "value1" + }, + { + "key": "key2", + "value": "value2" + }, + { + "key": "key3", + "value": "value3" + } + ], + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +### Example: Custom key name and value name + +The following example shows how to configure a custom key name and value name: + +```yaml +... + processor: + - map_to_list: + source: "my-map" + target: "my-list" + key_name: "name" + value_name: "data" +... +``` +{% include copy.html %} + +When the input event contains the following data: + +```json +{ + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +The processed event will contain the following output: + +```json +{ + "my-list": [ + { + "name": "key1", + "data": "value1" + }, + { + "name": "key2", + "data": "value2" + }, + { + "name": "key3", + "data": "value3" + } + ], + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +### Example: Exclude specific keys from processing and remove any processed fields + +The following example shows how to exclude specific keys and remove any processed fields from the output: + +```yaml +... + processor: + - map_to_list: + source: "my-map" + target: "my-list" + exclude_keys: ["key1"] + remove_processed_fields: true +... 
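+# key1 stays inside my-map because it is listed in exclude_keys.
+# key2 and key3 become entries in my-list and, because remove_processed_fields
+# is true, they are removed from my-map in the processed event shown below.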
+``` +{% include copy.html %} + +When the input event contains the following data: +```json +{ + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +The processed event will remove the "key2" and "key3" fields, but the "my-map" object, "key1", will remain, as shown in the following output: + +```json +{ + "my-list": [ + { + "key": "key2", + "value": "value2" + }, + { + "key": "key3", + "value": "value3" + } + ], + "my-map": { + "key1": "value1" + } +} +``` + +### Example: Use convert_field_to_list + +The following example shows how to use the `convert_field_to_list` option in the processor: + +```yaml +... + processor: + - map_to_list: + source: "my-map" + target: "my-list" + convert_field_to_list: true +... +``` +{% include copy.html %} + +When the input event contains the following data: + +```json +{ + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +The processed event will convert all fields into lists, as shown in the following output: + +```json +{ + "my-list": [ + ["key1", "value1"], + ["key2", "value2"], + ["key3", "value3"] + ], + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +### Example: Use the event root as the source + +The following example shows how you can use an event's root as the source by setting the `source` setting to an empty string (`""`): + +```yaml +... + processor: + - map_to_list: + source: "" + target: "my-list" +... +``` +{% include copy.html %} + +When the input event contains the following data: + +```json +{ + "key1": "value1", + "key2": "value2", + "key3": "value3" +} +``` + +The processed event will contain the following output: + +```json +{ + "my-list": [ + { + "key": "key1", + "value": "value1" + }, + { + "key": "key2", + "value": "value2" + }, + { + "key": "key3", + "value": "value3" + } + ], + "key1": "value1", + "key2": "value2", + "key3": "value3" +} +``` diff --git a/_data-prepper/pipelines/configuration/processors/mutate-event.md b/_data-prepper/pipelines/configuration/processors/mutate-event.md index 032bc89fcd..9b3b2afb33 100644 --- a/_data-prepper/pipelines/configuration/processors/mutate-event.md +++ b/_data-prepper/pipelines/configuration/processors/mutate-event.md @@ -11,11 +11,14 @@ nav_order: 65 Mutate event processors allow you to modify events in Data Prepper. The following processors are available: * [add_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/add-entries/) allows you to add entries to an event. +* [convert_entry_type]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/convert_entry_type/) allows you to convert value types in an event. * [copy_values]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/copy-values/) allows you to copy values within an event. * [delete_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/delete-entries/) allows you to delete entries from an event. -* [rename_keys]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/rename-keys/) allows you to rename keys in an event. -* [convert_entry_type]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/convert_entry_type/) allows you to convert value types in an event. 
* [list_to_map]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/list-to-map) allows you to convert list of objects from an event where each object contains a `key` field into a map of target keys. +* `map_to_list` allows you to convert a map of objects from an event, where each object contains a `key` field, into a list of target keys. +* [rename_keys]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/rename-keys/) allows you to rename keys in an event. +* [select_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/select-entries/) allows you to select entries from an event. + diff --git a/_data-prepper/pipelines/configuration/processors/obfuscate.md b/_data-prepper/pipelines/configuration/processors/obfuscate.md index 4c33d8baab..13d906acb3 100644 --- a/_data-prepper/pipelines/configuration/processors/obfuscate.md +++ b/_data-prepper/pipelines/configuration/processors/obfuscate.md @@ -67,6 +67,8 @@ Use the following configuration options with the `obfuscate` processor. | `source` | Yes | The source field to obfuscate. | | `target` | No | The new field in which to store the obfuscated value. This leaves the original source field unchanged. When no `target` is provided, the source field updates with the obfuscated value. | | `patterns` | No | A list of regex patterns that allow you to obfuscate specific parts of a field. Only parts that match the regex pattern will obfuscate. When not provided, the processor obfuscates the whole field. | +| `obfuscate_when` | No | Specifies under what condition the Obfuscate processor should perform matching. Default is no condition. | +| `tags_on_match_failure` | No | The tag to add to an event if the obfuscate processor fails to match the pattern. | | `action` | No | The obfuscation action. As of Data Prepper 2.3, only the `mask` action is supported. | You can customize the `mask` action with the following optional configuration options. diff --git a/_data-prepper/pipelines/configuration/processors/parse-ion.md b/_data-prepper/pipelines/configuration/processors/parse-ion.md new file mode 100644 index 0000000000..0edd446c42 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/parse-ion.md @@ -0,0 +1,56 @@ +--- +layout: default +title: parse_ion +parent: Processors +grand_parent: Pipelines +nav_order: 79 +--- + +# parse_ion + +The `parse_ion` processor parses [Amazon Ion](https://amazon-ion.github.io/ion-docs/) data. + +## Configuration + +You can configure the `parse_ion` processor with the following options. + +| Option | Required | Type | Description | +| :--- | :--- | :--- | :--- | +| `source` | No | String | The field in the `event` that is parsed. Default value is `message`. | +| `destination` | No | String | The destination field of the parsed JSON. Defaults to the root of the `event`. Cannot be `""`, `/`, or any white-space-only `string` because these are not valid `event` fields. | +| `pointer` | No | String | A JSON pointer to the field to be parsed. There is no `pointer` by default, meaning that the entire `source` is parsed. The `pointer` can access JSON array indexes as well. If the JSON pointer is invalid, then the entire `source` data is parsed into the outgoing `event`. If the key that is pointed to already exists in the `event` and the `destination` is the root, then the pointer uses the entire path of the key. 
| +| `tags_on_failure` | No | String | A list of strings that specify the tags to be set in the event that the processors fails or an unknown exception occurs while parsing. + +## Usage + +The following examples show how to use the `parse_ion` processor in your pipeline. + +### Example: Minimum configuration + +The following example shows the minimum configuration for the `parse_ion` processor: + +```yaml +parse-json-pipeline: + source: + stdin: + processor: + - parse_json: + source: "my_ion" + sink: + - stdout: +``` +{% include copy.html %} + +When the input event contains the following data: + +``` +{"my_ion": "{ion_value1: \"hello\", ion_value2: \"world\"}"} +``` + +The processor parses the event into the following output: + +``` +{"ion_value1": "hello", "ion_value2" : "world"} +``` + + diff --git a/_data-prepper/pipelines/configuration/processors/select-entries.md b/_data-prepper/pipelines/configuration/processors/select-entries.md new file mode 100644 index 0000000000..39b79a5bcc --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/select-entries.md @@ -0,0 +1,51 @@ +--- +layout: default +title: select_entries +parent: Processors +grand_parent: Pipelines +nav_order: 59 +--- + +# select_entries + +The `select_entries` processor selects entries from a Data Prepper event. Only the selected entries will remain in the event, and all other entries will be removed from the event. + +## Configuration + +You can configure the `select_entries` processor using the following options. + +| Option | Required | Description | +| :--- | :--- | :--- | +| `include_keys` | Yes | A list of keys to be selected from an event. | +| `select_when` | No | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that will be evaluated to determine whether the processor will be run on the event. | + +### Usage + +The following example shows how to configure the `select_entries` processor in the `pipeline.yaml` file: + +```yaml +pipeline: + source: + ... + .... + processor: + - select_entries: + entries: + - include_keys: [ "key1", "key2" ] + add_when: '/some_key == "test"' + sink: +``` +{% include copy.html %} + + +For example, when your source contains the following event record: + +```json +{"message": "hello", "key1" : "value1", "key2" : "value2", "some_key" : "test"} +``` + +The `select_entries` processor includes only `key1` and `key2` in the processed output: + +```json +{"key1": "value1", "key2": "value2"} +``` diff --git a/_data-prepper/pipelines/configuration/processors/split-event.md b/_data-prepper/pipelines/configuration/processors/split-event.md new file mode 100644 index 0000000000..f059fe5b95 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/split-event.md @@ -0,0 +1,52 @@ +--- +layout: default +title: split-event +parent: Processors +grand_parent: Pipelines +nav_order: 96 +--- + +# split-event + +The `split-event` processor is used to split events based on a delimiter and generates multiple events from a user-specified field. + +## Configuration + +The following table describes the configuration options for the `split-event` processor. + +| Option | Type | Description | +|------------------|---------|-----------------------------------------------------------------------------------------------| +| `field` | String | The event field to be split. | +| `delimiter_regex`| String | The regular expression used as the delimiter for splitting the field. 
| +| `delimiter` | String | The delimiter used for splitting the field. If not specified, the default delimiter is used. | + +# Usage + +To use the `split-event` processor, add the following to your `pipelines.yaml` file: + +``` +split-event-pipeline: + source: + http: + processor: + - split_event: + field: query + delimiter: ' ' + sink: + - stdout: +``` +{% include copy.html %} + +When an event contains the following example input: + +``` +{"query" : "open source", "some_other_field" : "abc" } +``` + +The input will be split into multiple events based on the `query` field, with the delimiter set as white space, as shown in the following example: + +``` +{"query" : "open", "some_other_field" : "abc" } +{"query" : "source", "some_other_field" : "abc" } +``` + diff --git a/_data-prepper/pipelines/configuration/processors/truncate.md b/_data-prepper/pipelines/configuration/processors/truncate.md new file mode 100644 index 0000000000..3714d80847 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/truncate.md @@ -0,0 +1,107 @@ +--- +layout: default +title: truncate +parent: Processors +grand_parent: Pipelines +nav_order: 121 +--- + +# truncate + +The `truncate` processor truncates a key's value at the beginning, the end, or on both sides of the value string, based on the processor's configuration. If the key's value is a list, then each member in the string list is truncated. Non-string members of the list are not truncated. When the `truncate_when` option is provided, input is truncated only when the condition specified is `true` for the event being processed. + +## Configuration + +You can configure the `truncate` processor using the following options. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`entries` | Yes | String list | A list of entries to add to an event. +`source_keys` | No | String list | The list of source keys that will be modified by the processor. The default value is an empty list, which indicates that all values will be truncated. +`truncate_when` | No | Conditional expression | A condition that, when met, determines when the truncate operation is performed. +`start_at` | No | Integer | Where in the string value to start truncation. Default is `0`, which specifies to start truncation at the beginning of each key's value. +`length` | No | Integer| The length of the string after truncation. When not specified, the processor will measure the length based on where the string ends. + +Either the `start_at` or `length` options must be present in the configuration in order for the `truncate` processor to run. You can define both values in the configuration in order to further customize where truncation occurs in the string. 
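As a quick illustration of how the two options combine (the key and values in this sketch are hypothetical and follow the behavior shown in the examples in the next section), setting both options keeps the substring that starts at `start_at` and is at most `length` characters long:

```yaml
  processor:
    - truncate:
        entries:
          - source_keys: ["note"]   # hypothetical key
            start_at: 2             # drop the first two characters
            length: 4               # then keep at most four characters
```

For an input value of `"abcdefghij"`, the truncated value is `"cdef"`.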
+ +## Usage + +The following examples show how to configure the `truncate` processor in the `pipeline.yaml` file: + +## Example: Minimum configuration + +The following example shows the minimum configuration for the `truncate` processor: + +```yaml +pipeline: + source: + file: + path: "/full/path/to/logs_json.log" + record_type: "event" + format: "json" + processor: + - truncate: + entries: + - source_keys: ["message1", "message2"] + length: 5 + - source_keys: ["info"] + length: 6 + start_at: 4 + - source_keys: ["log"] + start_at: 5 + sink: + - stdout: +``` + +For example, the following event contains several keys with string values: + +```json +{"message1": "hello,world", "message2": "test message", "info", "new information", "log": "test log message"} +``` + +The `truncate` processor produces the following output, where: + +- The `start_at` setting is `0` for the `message1` and `message 2` keys, indicating that truncation will begin at the start of the string, with the string itself truncated to a length of `5`. +- The `start_at` setting is `4` for the `info` key, indicating that truncation will begin at letter `i` of the string, with the string truncated to a length of `6`. +- The `start_at` setting is `5` for the `log` key, with no length specified, indicating that truncation will begin at letter `l` of the string. + +```json +{"message1":"hello", "message2":"test ", "info":"inform", "log": "log message"} +``` + + +## Example: Using `truncate_when` + +The following example configuration shows the `truncate` processor with the `truncate_when` option configured: + +```yaml +pipeline: + source: + file: + path: "/full/path/to/logs_json.log" + record_type: "event" + format: "json" + processor: + - truncate: + entries: + - source_keys: ["message"] + length: 5 + start_at: 8 + truncate_when: '/id == 1' + sink: + - stdout: +``` + +The following example contains two events: + +```json +{"message": "hello, world", "id": 1} +{"message": "hello, world,not-truncated", "id": 2} +``` + +When the `truncate` processor runs on the events, only the first event is truncated because the `id` key contains a value of `1`: + +```json +{"message": "world", "id": 1} +{"message": "hello, world,not-truncated", "id": 2} +``` diff --git a/_data-prepper/pipelines/configuration/sinks/file.md b/_data-prepper/pipelines/configuration/sinks/file.md index 74af5a1803..bd4fec1865 100644 --- a/_data-prepper/pipelines/configuration/sinks/file.md +++ b/_data-prepper/pipelines/configuration/sinks/file.md @@ -17,6 +17,7 @@ The following table describes options you can configure for the `file` sink. Option | Required | Type | Description :--- | :--- | :--- | :--- path | Yes | String | Path for the output file (e.g. `logs/my-transformed-log.log`). +append | No | Boolean | When `true`, the sink file is opened in append mode. ## Usage diff --git a/_data-prepper/pipelines/configuration/sinks/opensearch.md b/_data-prepper/pipelines/configuration/sinks/opensearch.md index b4861f68fd..d485fbb2b9 100644 --- a/_data-prepper/pipelines/configuration/sinks/opensearch.md +++ b/_data-prepper/pipelines/configuration/sinks/opensearch.md @@ -50,45 +50,82 @@ pipeline: The following table describes options you can configure for the `opensearch` sink. + +Option | Required | Type | Description +:--- | :--- |:---| :--- +`hosts` | Yes | List | A list of OpenSearch hosts to write to, such as `["https://localhost:9200", "https://remote-cluster:9200"]`. +`cert` | No | String | The path to the security certificate. 
For example, `"config/root-ca.pem"` if the cluster uses the OpenSearch Security plugin. +`username` | No | String | The username for HTTP basic authentication. +`password` | No | String | The password for HTTP basic authentication. +`aws` | No | AWS | The [AWS](#aws) configuration. +[max_retries](#configure-max_retries) | No | Integer | The maximum number of times that the `opensearch` sink should try to push data to the OpenSearch server before considering it to be a failure. Defaults to `Integer.MAX_VALUE`. When not provided, the sink will try to push data to the OpenSearch server indefinitely and exponential backoff will increase the waiting time before a retry. +`aws_sigv4` | No | Boolean | **Deprecated in Data Prepper 2.7.** Default is `false`. Whether to use AWS Identity and Access Management (IAM) signing to connect to an Amazon OpenSearch Service domain. For your access key, secret key, and optional session token, Data Prepper uses the default credential chain (environment variables, Java system properties, `~/.aws/credential`). +`aws_region` | No | String | **Deprecated in Data Prepper 2.7.** The AWS Region (for example, `"us-east-1"`) for the domain when you are connecting to Amazon OpenSearch Service. +`aws_sts_role_arn` | No | String | **Deprecated in Data Prepper 2.7.** The IAM role that the plugin uses to sign requests sent to Amazon OpenSearch Service. If this information is not provided, then the plugin uses the default credentials. +`socket_timeout` | No | Integer | The timeout value, in milliseconds, when waiting for data to be returned (the maximum period of inactivity between two consecutive data packets). A timeout value of `0` is interpreted as an infinite timeout. If this timeout value is negative or not set, then the underlying Apache HttpClient will rely on operating system settings to manage socket timeouts. +`connect_timeout` | No | Integer| The timeout value, in milliseconds, when requesting a connection from the connection manager. A timeout value of `0` is interpreted as an infinite timeout. If this timeout value is negative or not set, the underlying Apache HttpClient will rely on operating system settings to manage connection timeouts. +`insecure` | No | Boolean | Whether or not to verify SSL certificates. If set to `true`, then certificate authority (CA) certificate verification is disabled and insecure HTTP requests are sent instead. Default is `false`. +`proxy` | No | String | The address of the [forward HTTP proxy server](https://en.wikipedia.org/wiki/Proxy_server). The format is `"<hostname or IP>:<port>"` (for example, `"example.com:8100"`, `"http://example.com:8100"`, `"112.112.112.112:8100"`). The port number cannot be omitted. +`index` | Conditionally | String | The name of the export index. Only required when the `index_type` is `custom`. The index can be a plain string, such as `my-index-name`, contain [Java date-time patterns](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html), such as `my-index-${yyyy.MM.dd}` or `my-${yyyy-MM-dd-HH}-index`, be formatted using field values, such as `my-index-${/my_field}`, or use [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `my-index-${getMetadata(\"my_metadata_field\"}`. All formatting options can be combined to provide flexibility when creating static, dynamic, and rolling indexes. +`index_type` | No | String | Tells the sink plugin what type of data it is handling. 
Valid values are `custom`, `trace-analytics-raw`, `trace-analytics-service-map`, or `management-disabled`. Default is `custom`. +`template_type` | No | String | Defines what type of OpenSearch template to use. Available options are `v1` and `index-template`. The default value is `v1`, which uses the original OpenSearch templates available at the `_template` API endpoints. The `index-template` option uses composable [index templates]({{site.url}}{{site.baseurl}}/opensearch/index-templates/), which are available through the OpenSearch `_index_template` API. Composable index types offer more flexibility than the default and are necessary when an OpenSearch cluster contains existing index templates. Composable templates are available for all versions of OpenSearch and some later versions of Elasticsearch. When `distribution_version` is set to `es6`, Data Prepper enforces the `template_type` as `v1`. +`template_file` | No | String | The path to a JSON [index template]({{site.url}}{{site.baseurl}}/opensearch/index-templates/) file, such as `/your/local/template-file.json`, when `index_type` is set to `custom`. For an example template file, see [otel-v1-apm-span-index-template.json](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/opensearch/src/main/resources/otel-v1-apm-span-index-template.json). If you supply a template file, then it must match the template format specified by the `template_type` parameter. +`template_content` | No | JSON | Contains all the inline JSON found inside of the index [index template]({{site.url}}{{site.baseurl}}/opensearch/index-templates/). For an example of template content, see [the example template content](#example_template_content). +`document_id_field` | No | String | **Deprecated in Data Prepper 2.7 in favor of `document_id`.** The field from the source data to use for the OpenSearch document ID (for example, `"my-field"`) if `index_type` is `custom`. +`document_id` | No | String | A format string to use as the `_id` in OpenSearch documents. To specify a single field in an event, use `${/my_field}`. You can also use Data Prepper expressions to construct the `document_id`, for example, `${getMetadata(\"some_metadata_key\")}`. These options can be combined into more complex formats, such as `${/my_field}-test-${getMetadata(\"some_metadata_key\")}`. +`document_version` | No | String | A format string to use as the `_version` in OpenSearch documents. To specify a single field in an event, use `${/my_field}`. You can also use Data Prepper expressions to construct the `document_version`, for example, `${getMetadata(\"some_metadata_key\")}`. These options can be combined into more complex versions, such as `${/my_field}${getMetadata(\"some_metadata_key\")}`. The `document_version` format must evaluate to a long type and can only be used when `document_version_type` is set to either `external` or `external_gte`. +`document_version_type` | No | String | The document version type for index operations. Must be one of `external`, `external_gte`, or `internal`. If set to `external` or `external_gte`, then `document_version` is required. +`dlq_file` | No | String | The path to your preferred dead letter queue file (such as `/your/local/dlq-file`). Data Prepper writes to this file when it fails to index a document on the OpenSearch cluster. +`dlq` | No | N/A | [DLQ configurations]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/dlq/). 
+`bulk_size` | No | Integer (long) | The maximum size (in MiB) of bulk requests sent to the OpenSearch cluster. Values below `0` indicate an unlimited size. If a single document exceeds the maximum bulk request size, then Data Prepper sends each request individually. Default value is `5`. +`ism_policy_file` | No | String | The absolute file path for an Index State Management (ISM) policy JSON file. This policy file is effective only when there is no built-in policy file for the index type. For example, the `custom` index type is currently the only type without a built-in policy file, so it will use this policy file if it is provided through this parameter. For more information about the policy JSON file, see [ISM policies]({{site.url}}{{site.baseurl}}/im-plugin/ism/policies/). +`number_of_shards` | No | Integer | The number of primary shards that an index should have on the destination OpenSearch server. This parameter is effective only when `template_file` is either explicitly provided in the sink configuration or built in. If this parameter is set, then it will override the value in the index template file. For more information, see [Create index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/). +`number_of_replicas` | No | Integer | The number of replica shards that each primary shard should have on the destination OpenSearch server. For example, if you have 4 primary shards and set `number_of_replicas` to `3`, then the index has 12 replica shards. This parameter is effective only when `template_file` is either explicitly provided in the sink configuration or built in. If this parameter is set, then it will override the value in the index template file. For more information, see [Create index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/). +`distribution_version` | No | String | Indicates whether the backend version of the sink is Elasticsearch 6 or later. `es6` represents Elasticsearch 6. `default` represents the latest compatible backend version, such as Elasticsearch 7.x, OpenSearch 1.x, or OpenSearch 2.x. Default is `default`. +`enable_request_compression` | No | Boolean | Whether to enable compression when sending requests to OpenSearch. When `distribution_version` is set to `es6`, default is `false`. For all other distribution versions, default is `true`. +`action` | No | String | The OpenSearch bulk action to use for documents. Must be one of `create`, `index`, `update`, `upsert`, or `delete`. Default is `index`. +`actions` | No | List | A [list of actions](#actions) that can be used as an alternative to `action`, which reads as a switch case statement that conditionally determines the bulk action to take for an event. +`flush_timeout` | No | Long | A long class that contains the amount of time, in milliseconds, to try packing a bulk request up to the `bulk_size` before flushing the request. If this timeout expires before a bulk request has reached the `bulk_size`, the request will be flushed. Set to `-1` to disable the flush timeout and instead flush whatever is present at the end of each batch. Default is `60,000`, or 1 minute. +`normalize_index` | No | Boolean | If true, then the OpenSearch sink will try to create dynamic index names. Index names with format options specified in `${})` are valid according to the [index naming restrictions]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/#index-naming-restrictions). Any invalid characters will be removed. Default value is `false`. 
+`routing` | No | String | A string used as a hash for generating the `shard_id` for a document when it is stored in OpenSearch. Each incoming record is searched. When present, the string is used as the routing field for the document. When not present, the default routing mechanism (`document_id`) is used by OpenSearch when storing the document. Supports formatting with fields in events and [Data Prepper expressions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), such as `${/my_field}-test-${getMetadata(\"some_metadata_key\")}`. +`document_root_key` | No | String | The key in the event that will be used as the root in the document. The default is the root of the event. If the key does not exist, then the entire event is written as the document. If `document_root_key` is of a basic value type, such as a string or integer, then the document will have a structure of `{"data": }`. +`serverless` | No | Boolean | Determines whether the OpenSearch backend is Amazon OpenSearch Serverless. Set this value to `true` when the destination for the `opensearch` sink is an Amazon OpenSearch Serverless collection. Default is `false`. +`serverless_options` | No | Object | The network configuration options available when the backend of the `opensearch` sink is set to Amazon OpenSearch Serverless. For more information, see [Serverless options](#serverless-options). + + + +## aws + Option | Required | Type | Description :--- | :--- | :--- | :--- -hosts | Yes | List | List of OpenSearch hosts to write to (for example, `["https://localhost:9200", "https://remote-cluster:9200"]`). -cert | No | String | Path to the security certificate (for example, `"config/root-ca.pem"`) if the cluster uses the OpenSearch Security plugin. -username | No | String | Username for HTTP basic authentication. -password | No | String | Password for HTTP basic authentication. -aws_sigv4 | No | Boolean | Default value is false. Whether to use AWS Identity and Access Management (IAM) signing to connect to an Amazon OpenSearch Service domain. For your access key, secret key, and optional session token, Data Prepper uses the default credential chain (environment variables, Java system properties, `~/.aws/credential`, etc.). -aws_region | No | String | The AWS region (for example, `"us-east-1"`) for the domain if you are connecting to Amazon OpenSearch Service. -aws_sts_role_arn | No | String | IAM role that the plugin uses to sign requests sent to Amazon OpenSearch Service. If this information is not provided, the plugin uses the default credentials. -[max_retries](#configure-max_retries) | No | Integer | The maximum number of times the OpenSearch sink should try to push data to the OpenSearch server before considering it to be a failure. Defaults to `Integer.MAX_VALUE`. If not provided, the sink will try to push data to the OpenSearch server indefinitely because the default value is high and exponential backoff would increase the waiting time before retry. -socket_timeout | No | Integer | The timeout, in milliseconds, waiting for data to return (or the maximum period of inactivity between two consecutive data packets). A timeout value of zero is interpreted as an infinite timeout. If this timeout value is negative or not set, the underlying Apache HttpClient would rely on operating system settings for managing socket timeouts. -connect_timeout | No | Integer | The timeout in milliseconds used when requesting a connection from the connection manager. A timeout value of zero is interpreted as an infinite timeout. 
If this timeout value is negative or not set, the underlying Apache HttpClient would rely on operating system settings for managing connection timeouts. -insecure | No | Boolean | Whether or not to verify SSL certificates. If set to true, certificate authority (CA) certificate verification is disabled and insecure HTTP requests are sent instead. Default value is `false`. -proxy | No | String | The address of a [forward HTTP proxy server](https://en.wikipedia.org/wiki/Proxy_server). The format is "<host name or IP>:<port>". Examples: "example.com:8100", "http://example.com:8100", "112.112.112.112:8100". Port number cannot be omitted. -index | Conditionally | String | Name of the export index. Applicable and required only when the `index_type` is `custom`. -index_type | No | String | This index type tells the Sink plugin what type of data it is handling. Valid values: `custom`, `trace-analytics-raw`, `trace-analytics-service-map`, `management-disabled`. Default value is `custom`. -template_type | No | String | Defines what type of OpenSearch template to use. The available options are `v1` and `index-template`. The default value is `v1`, which uses the original OpenSearch templates available at the `_template` API endpoints. The `index-template` option uses composable [index templates]({{site.url}}{{site.baseurl}}/opensearch/index-templates/) which are available through OpenSearch's `_index_template` API. Composable index types offer more flexibility than the default and are necessary when an OpenSearch cluster has already existing index templates. Composable templates are available for all versions of OpenSearch and some later versions of Elasticsearch. When `distribution_version` is set to `es6`, Data Prepper enforces the `template_type` as `v1`. -template_file | No | String | The path to a JSON [index template]({{site.url}}{{site.baseurl}}/opensearch/index-templates/) file such as `/your/local/template-file.json` when `index_type` is set to `custom`. For an example template file, see [otel-v1-apm-span-index-template.json](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/opensearch/src/main/resources/otel-v1-apm-span-index-template.json). If you supply a template file it must match the template format specified by the `template_type` parameter. -document_id_field | No | String | The field from the source data to use for the OpenSearch document ID (for example, `"my-field"`) if `index_type` is `custom`. -dlq_file | No | String | The path to your preferred dead letter queue file (for example, `/your/local/dlq-file`). Data Prepper writes to this file when it fails to index a document on the OpenSearch cluster. -dlq | No | N/A | DLQ configurations. See [Dead Letter Queues]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/dlq/) for details. If the `dlq_file` option is also available, the sink will fail. -bulk_size | No | Integer (long) | The maximum size (in MiB) of bulk requests sent to the OpenSearch cluster. Values below 0 indicate an unlimited size. If a single document exceeds the maximum bulk request size, Data Prepper sends it individually. Default value is 5. -ism_policy_file | No | String | The absolute file path for an ISM (Index State Management) policy JSON file. This policy file is effective only when there is no built-in policy file for the index type. For example, `custom` index type is currently the only one without a built-in policy file, thus it would use the policy file here if it's provided through this parameter. 
For more information, see [ISM policies]({{site.url}}{{site.baseurl}}/im-plugin/ism/policies/). -number_of_shards | No | Integer | The number of primary shards that an index should have on the destination OpenSearch server. This parameter is effective only when `template_file` is either explicitly provided in Sink configuration or built-in. If this parameter is set, it would override the value in index template file. For more information, see [Create index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/). -number_of_replicas | No | Integer | The number of replica shards each primary shard should have on the destination OpenSearch server. For example, if you have 4 primary shards and set number_of_replicas to 3, the index has 12 replica shards. This parameter is effective only when `template_file` is either explicitly provided in Sink configuration or built-in. If this parameter is set, it would override the value in index template file. For more information, see [Create index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/). -distribution_version | No | String | Indicates whether the sink backend version is Elasticsearch 6 or later. `es6` represents Elasticsearch 6. `default` represents the latest compatible backend version, such as Elasticsearch 7.x, OpenSearch 1.x, or OpenSearch 2.x. Default is `default`. -enable_request_compression | No | Boolean | Whether to enable compression when sending requests to OpenSearch. When `distribution_version` is set to `es6`, default is `false`. For all other distribution versions, default is `true`. -serverless | No | Boolean | Determines whether the OpenSearch backend is Amazon OpenSearch Serverless. Set this value to `true` when the destination for the `opensearch` sink is an Amazon OpenSearch Serverless collection. Default is `false`. -serverless_options | No | Object | The network configuration options available when the backend of the `opensearch` sink is set to Amazon OpenSearch Serverless. For more information, see [Serverless options](#serverless-options). - -### Serverless options +`region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). +`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). +`sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. +`sts_external_id` | No | String | The external ID to attach to AssumeRole requests from AWS STS. +`serverless` | No | Boolean | **Deprecated in Data Prepper 2.7. Use this option with the `aws` configuration instead.** Determines whether the OpenSearch backend is Amazon OpenSearch Serverless. Set this value to `true` when the destination for the `opensearch` sink is an Amazon OpenSearch Serverless collection. Default is `false`. +`serverless_options` | No | Object | **Deprecated in Data Prepper 2.7. Use this option with the `aws` configuration instead.** The network configuration options available when the backend of the `opensearch` sink is set to Amazon OpenSearch Serverless. For more information, see [Serverless options](#serverless-options). + + +## actions + + +The following options can be used inside the `actions` option. 
+ +Option | Required | Type | Description +:--- |:---| :--- | :--- +`type` | Yes | String | The type of bulk action to use if the `when` condition evaluates to true. Must be either `create`, `index`, `update`, `upsert`, or `delete`. +`when` | No | String | A [Data Prepper expression]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/) that conditionally evaluates whether an event will be sent to OpenSearch using the bulk action configured in `type`. When empty, the bulk action will be chosen automatically when the event is sent to OpenSearch. + + +## Serverless options The following options can be used in the `serverless_options` object. Option | Required | Type | Description :--- | :--- | :---| :--- -network_policy_name | Yes | String | The name of the network policy to create. -collection_name | Yes | String | The name of the Amazon OpenSearch Serverless collection to configure. -vpce_id | Yes | String | The virtual private cloud (VPC) endpoint to which the source connects. +`network_policy_name` | Yes | String | The name of the network policy to create. +`collection_name` | Yes | String | The name of the Amazon OpenSearch Serverless collection to configure. +`vpce_id` | Yes | String | The virtual private cloud (VPC) endpoint to which the source connects. ### Configure max_retries @@ -191,7 +228,6 @@ If your domain uses a master user in the internal user database, specify the mas sink: opensearch: hosts: ["https://your-fgac-amazon-opensearch-service-endpoint"] - aws_sigv4: false username: "master-username" password: "master-password" ``` @@ -302,3 +338,53 @@ log-pipeline: sts_role_arn: "arn:aws:iam:::role/PipelineRole" region: "us-east-1" ``` + +### Example with template_content and actions + +The following example pipeline contains both `template_content` and a list of conditional `actions`: + +```yaml +log-pipeline: + source: + http: + processor: + - date: + from_time_received: true + destination: "@timestamp" + sink: + - opensearch: + hosts: [ "https://" ] + index: "my-serverless-index" + template_type: index-template + template_content: > + { + "template" : { + "mappings" : { + "properties" : { + "Data" : { + "type" : "binary" + }, + "EncodedColors" : { + "type" : "binary" + }, + "Type" : { + "type" : "keyword" + }, + "LargeDouble" : { + "type" : "double" + } + } + } + } + } + # index is the default case + actions: + - type: "delete" + when: '/operation == "delete"' + - type: "update" + when: '/operation == "update"' + - type: "index" + aws: + sts_role_arn: "arn:aws:iam:::role/PipelineRole" + region: "us-east-1" +``` diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index cb881e814a..c752bf6b3d 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -8,7 +8,22 @@ nav_order: 55 # s3 -The `s3` sink saves batches of events to [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) objects. +The `s3` sink saves and writes batches of Data Prepper events to Amazon Simple Storage Service (Amazon S3) objects. The configured `codec` determines how the `s3` sink serializes the data into Amazon S3. 
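+For example (an illustrative sketch, not taken from the plugin documentation), with the `ndjson` codec each event in a batch is written to the S3 object as one JSON line; the field names shown here are placeholders:
+
+```
+{"message": "GET /index.html HTTP/1.1 200", "@timestamp": "2023-06-09T06:00:01Z"}
+{"message": "GET /style.css HTTP/1.1 200", "@timestamp": "2023-06-09T06:00:02Z"}
+```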
+ +The `s3` sink uses the following format when batching events: + +``` +${pathPrefix}events-%{yyyy-MM-dd'T'HH-mm-ss'Z'}-${currentTimeInNanos}-${uniquenessId}.${codecSuppliedExtension} +``` + +When a batch of objects is written to S3, the objects are formatted similarly to the following: + +``` +my-logs/2023/06/09/06/events-2023-06-09T06-00-01-1686290401871214927-ae15b8fa-512a-59c2-b917-295a0eff97c8.json +``` + + +For more information about how to configure an object, see the [Object key](#object-key-configuration) section. ## Usage @@ -22,14 +37,12 @@ pipeline: aws: region: us-east-1 sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper - sts_header_overrides: max_retries: 5 - bucket: - name: bucket_name - object_key: - path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/ + bucket: mys3bucket + object_key: + path_prefix: my-logs/%{yyyy}/%{MM}/%{dd}/ threshold: - event_count: 2000 + event_count: 10000 maximum_size: 50mb event_collect_timeout: 15s codec: @@ -37,17 +50,37 @@ pipeline: buffer_type: in_memory ``` +## IAM permissions + +In order to use the `s3` sink, configure AWS Identity and Access Management (IAM) to grant Data Prepper permissions to write to Amazon S3. You can use a configuration similar to the following JSON configuration: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "s3-access", + "Effect": "Allow", + "Action": [ + "s3:PutObject" + ], + "Resource": "arn:aws:s3:::/*" + } + ] +} +``` + ## Configuration Use the following options when customizing the `s3` sink. Option | Required | Type | Description :--- | :--- | :--- | :--- -`bucket` | Yes | String | The object from which the data is retrieved and then stored. The `name` must match the name of your object store. -`codec` | Yes | [Buffer type](#buffer-type) | Determines the buffer type. +`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. +`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object. `aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information. `threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3. -`object_key` | No | Sets the `path_prefix` and the `file_pattern` of the object store. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found inside the root directory of the bucket. +`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` of the object in S3. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found in the root directory of the bucket. `compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`. `buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type. `max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`. @@ -59,33 +92,34 @@ Option | Required | Type | Description `region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). `sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). 
`sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. -`sts_external_id` | No | String | The external ID to attach to AssumeRole requests from AWS STS. +`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the role. For more information, see the `ExternalId` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference. + ## Threshold configuration -Use the following options to set ingestion thresholds for the `s3` sink. +Use the following options to set ingestion thresholds for the `s3` sink. When any of these conditions are met, Data Prepper will write events to an S3 object. Option | Required | Type | Description :--- | :--- | :--- | :--- -`event_count` | Yes | Integer | The maximum number of events the S3 bucket can ingest. -`maximum_size` | Yes | String | The maximum number of bytes that the S3 bucket can ingest after compression. Defaults to `50mb`. -`event_collect_timeout` | Yes | String | Sets the time period during which events are collected before ingestion. All values are strings that represent duration, either an ISO_8601 notation string, such as `PT20.345S`, or a simple notation, such as `60s` or `1500ms`. +`event_count` | Yes | Integer | The number of Data Prepper events to accumulate before writing an object to S3. +`maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`. +`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`. ## Buffer type -`buffer_type` is an optional configuration that records stored events temporarily before flushing them into an S3 bucket. The default value is `in_memory`. Use one of the following options: +`buffer_type` is an optional configuration that determines how Data Prepper temporarily stores data before writing an object to S3. The default value is `in_memory`. Use one of the following options: - `in_memory`: Stores the record in memory. -- `local_file`: Flushes the record into a file on your machine. +- `local_file`: Flushes the record into a file on your local machine. This uses your machine's temporary directory. - `multipart`: Writes using the [S3 multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html). Every 10 MB is written as a part. ## Object key configuration Option | Required | Type | Description :--- | :--- | :--- | :--- -`path_prefix` | Yes | String | The S3 key prefix path to use. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. By default, events write to the root of the bucket. +`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket. ## codec @@ -156,3 +190,49 @@ Option | Required | Type | Description `schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to true. 
`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event. +### Setting a schema with Parquet + +The following example shows you how to configure the `s3` sink to write Parquet data into a Parquet file using a schema for [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records): + +``` +pipeline: + ... + sink: + - s3: + aws: + region: us-east-1 + sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper + bucket: mys3bucket + object_key: + path_prefix: vpc-flow-logs/%{yyyy}/%{MM}/%{dd}/%{HH}/ + codec: + parquet: + schema: > + { + "type" : "record", + "namespace" : "org.opensearch.dataprepper.examples", + "name" : "VpcFlowLog", + "fields" : [ + { "name" : "version", "type" : ["null", "string"]}, + { "name" : "srcport", "type": ["null", "int"]}, + { "name" : "dstport", "type": ["null", "int"]}, + { "name" : "accountId", "type" : ["null", "string"]}, + { "name" : "interfaceId", "type" : ["null", "string"]}, + { "name" : "srcaddr", "type" : ["null", "string"]}, + { "name" : "dstaddr", "type" : ["null", "string"]}, + { "name" : "start", "type": ["null", "int"]}, + { "name" : "end", "type": ["null", "int"]}, + { "name" : "protocol", "type": ["null", "int"]}, + { "name" : "packets", "type": ["null", "int"]}, + { "name" : "bytes", "type": ["null", "int"]}, + { "name" : "action", "type": ["null", "string"]}, + { "name" : "logStatus", "type" : ["null", "string"]} + ] + } + threshold: + event_count: 500000000 + maximum_size: 20mb + event_collect_timeout: PT15M + buffer_type: in_memory +``` + diff --git a/_data-prepper/pipelines/configuration/sources/dynamo-db.md b/_data-prepper/pipelines/configuration/sources/dynamo-db.md index 597e835151..f75489f103 100644 --- a/_data-prepper/pipelines/configuration/sources/dynamo-db.md +++ b/_data-prepper/pipelines/configuration/sources/dynamo-db.md @@ -31,6 +31,7 @@ cdc-pipeline: s3_prefix: "myprefix" stream: start_position: "LATEST" # Read latest data from streams (Default) + view_on_remove: NEW_IMAGE aws: region: "us-west-2" sts_role_arn: "arn:aws:iam::123456789012:role/my-iam-role" @@ -84,12 +85,112 @@ Option | Required | Type | Description The following option lets you customize how the pipeline reads events from the DynamoDB table. -Option | Required | Type | Description +Option | Required | Type | Description :--- | :--- | :--- | :--- `start_position` | No | String | The position from where the source starts reading stream events when the DynamoDB stream option is enabled. `LATEST` starts reading events from the most recent stream record. +`view_on_remove` | No | Enum | The [stream record view](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html) to use for REMOVE events from DynamoDB streams. Must be either `NEW_IMAGE` or `OLD_IMAGE` . Defaults to `NEW_IMAGE`. If the `OLD_IMAGE` option is used and the old image can not be found, the source will find the `NEW_IMAGE`. + +## Exposed metadata attributes + +The following metadata will be added to each event that is processed by the `dynamodb` source. These metadata attributes can be accessed using the [expression syntax `getMetadata` function](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/#getmetadata). + +* `primary_key`: The primary key of the DynamoDB item. For tables that only contain a partition key, this value provides the partition key. 
For tables that contain both a partition and sort key, the `primary_key` attribute will be equal to the partition and sort key, separated by a `|`, for example, `partition_key|sort_key`. +* `partition_key`: The partition key of the DynamoDB item. +* `sort_key`: The sort key of the DynamoDB item. This will be null if the table does not contain a sort key. +* `dynamodb_timestamp`: The timestamp of the DynamoDB item. This will be the export time for export items and the DynamoDB stream event time for stream items. This timestamp is used by sinks to emit an `EndtoEndLatency` metric for DynamoDB stream events that tracks the latency between a change occurring in the DynamoDB table and that change being applied to the sink. +* `document_version`: Uses the `dynamodb_timestamp` to modify break ties between stream items that are received in the same second. Recommend for use with the `opensearch` sink's `document_version` setting. +* `opensearch_action`: A default value for mapping DynamoDB event actions to OpenSearch actions. This action will be `index` for export items, and `INSERT` or `MODIFY` for stream events, and `REMOVE` stream events when the OpenSearch action is `delete`. +* `dynamodb_event_name`: The exact event type for the item. Will be `null` for export items and either `INSERT`, `MODIFY`, or `REMOVE` for stream events. +* `table_name`: The name of the DynamoDB table that an event came from. + + +## Permissions + +The following are the minimum required permissions for running DynamoDB as a source: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "allowDescribeTable", + "Effect": "Allow", + "Action": [ + "dynamodb:DescribeTable" + ], + "Resource": [ + "arn:aws:dynamodb:us-east-1:{account-id}:table/my-table" + ] + }, + { + "Sid": "allowRunExportJob", + "Effect": "Allow", + "Action": [ + "dynamodb:DescribeContinuousBackups", + "dynamodb:ExportTableToPointInTime" + ], + "Resource": [ + "arn:aws:dynamodb:us-east-1:{account-id}:table/my-table" + ] + }, + { + "Sid": "allowCheckExportjob", + "Effect": "Allow", + "Action": [ + "dynamodb:DescribeExport" + ], + "Resource": [ + "arn:aws:dynamodb:us-east-1:{account-id}:table/my-table/export/*" + ] + }, + { + "Sid": "allowReadFromStream", + "Effect": "Allow", + "Action": [ + "dynamodb:DescribeStream", + "dynamodb:GetRecords", + "dynamodb:GetShardIterator" + ], + "Resource": [ + "arn:aws:dynamodb:us-east-1:{account-id}:table/my-table/stream/*" + ] + }, + { + "Sid": "allowReadAndWriteToS3ForExport", + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:AbortMultipartUpload", + "s3:PutObject", + "s3:PutObjectAcl" + ], + "Resource": [ + "arn:aws:s3:::my-bucket/*" + ] + } + ] +} +``` + +When performing an export, the `"Sid": "allowReadFromStream"` section is not required. If only reading from DynamoDB streams, the +`"Sid": "allowReadAndWriteToS3ForExport"`, `"Sid": "allowCheckExportjob"`, and ` "Sid": "allowRunExportJob"` sections are not required. + +## Metrics +The `dynamodb` source includes the following metrics. +### Counters +* `exportJobSuccess`: The number of export jobs that have been submitted successfully. +* `exportJobFailure`: The number of export job submission attempts that have failed. +* `exportS3ObjectsTotal`: The total number of export data files found in S3. +* `exportS3ObjectsProcessed`: The total number of export data files that have been processed successfully from S3. +* `exportRecordsTotal`: The total number of records found in the export. 
+* `exportRecordsProcessed`: The total number of export records that have been processed successfully. +* `exportRecordsProcessingErrors`: The number of export record processing errors. +* `changeEventsProcessed`: The number of change events processed from DynamoDB streams. +* `changeEventsProcessingErrors`: The number of processing errors for change events from DynamoDB streams. +* `shardProgress`: The incremented shard progress when DynamoDB streams are being read correctly. This being`0` for any significant amount of time means there is a problem with the pipeline that has streams enabled. diff --git a/_data-prepper/pipelines/configuration/sources/otel-trace.md b/_data-prepper/pipelines/configuration/sources/otel-trace-source.md similarity index 89% rename from _data-prepper/pipelines/configuration/sources/otel-trace.md rename to _data-prepper/pipelines/configuration/sources/otel-trace-source.md index 4b17647768..137592bbe8 100644 --- a/_data-prepper/pipelines/configuration/sources/otel-trace.md +++ b/_data-prepper/pipelines/configuration/sources/otel-trace-source.md @@ -1,22 +1,22 @@ --- layout: default -title: otel_trace_source source +title: otel_trace_source parent: Sources grand_parent: Pipelines nav_order: 15 +redirect_from: + - /data-prepper/pipelines/configuration/sources/otel-trace/ --- -# otel_trace source +# otel_trace_source -## Overview - -The `otel_trace` source is a source for the OpenTelemetry Collector. The following table describes options you can use to configure the `otel_trace` source. +`otel_trace_source` is a source for the OpenTelemetry Collector. The following table describes options you can use to configure the `otel_trace_source` source. Option | Required | Type | Description :--- | :--- | :--- | :--- -port | No | Integer | The port that the `otel_trace` source runs on. Default value is `21890`. +port | No | Integer | The port that the `otel_trace_source` source runs on. Default value is `21890`. request_timeout | No | Integer | The request timeout, in milliseconds. Default value is `10000`. health_check_service | No | Boolean | Enables a gRPC health check service under `grpc.health.v1/Health/Check`. Default value is `false`. unauthenticated_health_check | No | Boolean | Determines whether or not authentication is required on the health check endpoint. Data Prepper ignores this option if no authentication is defined. Default value is `false`. @@ -35,6 +35,8 @@ authentication | No | Object | An authentication configuration. By default, an u ## Metrics +The 'otel_trace_source' source includes the following metrics. + ### Counters - `requestTimeouts`: Measures the total number of requests that time out. @@ -50,4 +52,4 @@ authentication | No | Object | An authentication configuration. By default, an u ### Distribution summaries -- `payloadSize`: Measures the incoming request payload size distribution in bytes. \ No newline at end of file +- `payloadSize`: Measures the incoming request payload size distribution in bytes. diff --git a/_data-prepper/pipelines/configuration/sources/s3.md b/_data-prepper/pipelines/configuration/sources/s3.md index 7dc31caade..7a3746bab6 100644 --- a/_data-prepper/pipelines/configuration/sources/s3.md +++ b/_data-prepper/pipelines/configuration/sources/s3.md @@ -8,7 +8,10 @@ nav_order: 20 # s3 source -`s3` is a source plugin that reads events from [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) objects. 
It requires an [Amazon Simple Queue Service (Amazon SQS)](https://aws.amazon.com/sqs/) queue that receives [S3 Event Notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html). After Amazon SQS is configured, the `s3` source receives messages from Amazon SQS. When the SQS message indicates that an S3 object was created, the `s3` source loads the S3 objects and then parses them using the configured [codec](#codec). You can also configure the `s3` source to use [Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) instead of Data Prepper to parse S3 objects. +`s3` is a source plugin that reads events from [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) objects. You can configure the source to either use an [Amazon Simple Queue Service (Amazon SQS)](https://aws.amazon.com/sqs/) queue or scan an S3 bucket: + +- To use Amazon SQS notifications, configure S3 event notifications on your S3 bucket. After Amazon SQS is configured, the `s3` source receives messages from Amazon SQS. When the SQS message indicates that an S3 object has been created, the `s3` source loads the S3 objects and then parses them using the configured [codec](#codec). +- To use an S3 bucket, configure the `s3` source to use Amazon S3 Select instead of Data Prepper to parse S3 objects. ## IAM permissions @@ -86,19 +89,23 @@ Option | Required | Type | Description :--- | :--- | :--- | :--- `notification_type` | Yes | String | Must be `sqs`. `notification_source` | No | String | Determines how notifications are received by SQS. Must be `s3` or `eventbridge`. `s3` represents notifications that are directly sent from Amazon S3 to Amazon SQS or fanout notifications from Amazon S3 to Amazon Simple Notification Service (Amazon SNS) to Amazon SQS. `eventbridge` represents notifications from [Amazon EventBridge](https://aws.amazon.com/eventbridge/) and [Amazon Security Lake](https://aws.amazon.com/security-lake/). Default is `s3`. -`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `automatic`. Default is `none`. +`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, `snappy`, or `automatic`. Default is `none`. `codec` | Yes | Codec | The [codec](#codec) to apply. `sqs` | Yes | SQS | The SQS configuration. See [sqs](#sqs) for more information. `aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information. `on_error` | No | String | Determines how to handle errors in Amazon SQS. Can be either `retain_messages` or `delete_messages`. `retain_messages` leaves the message in the Amazon SQS queue and tries to send the message again. This is recommended for dead-letter queues. `delete_messages` deletes failed messages. Default is `retain_messages`. -buffer_timeout | No | Duration | The amount of time allowed for writing events to the Data Prepper buffer before timeout occurs. Any events that the Amazon S3 source cannot write to the buffer during the set amount of time are discarded. Default is `10s`. +`buffer_timeout` | No | Duration | The amount of time allowed for writing events to the Data Prepper buffer before timeout occurs. Any events that the Amazon S3 source cannot write to the buffer during the specified amount of time are discarded. Default is `10s`. `records_to_accumulate` | No | Integer | The number of messages that accumulate before being written to the buffer. Default is `100`. 
`metadata_root_key` | No | String | The base key for adding S3 metadata to each event. The metadata includes the key and bucket for each S3 object. Default is `s3/`. +`default_bucket_owner` | No | String | The AWS account ID for the owner of an S3 bucket. For more information, see [Cross-account S3 access](#s3_bucket_ownership). +`bucket_owners` | No | Map | A map of bucket names that includes the IDs of the accounts that own the buckets. For more information, see [Cross-account S3 access](#s3_bucket_ownership). `disable_bucket_ownership_validation` | No | Boolean | When `true`, the S3 source does not attempt to validate that the bucket is owned by the expected account. The expected account is the same account that owns the Amazon SQS queue. Default is `false`. `acknowledgments` | No | Boolean | When `true`, enables `s3` sources to receive [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#end-to-end-acknowledgments) when events are received by OpenSearch sinks. `s3_select` | No | [s3_select](#s3_select) | The Amazon S3 Select configuration. `scan` | No | [scan](#scan) | The S3 scan configuration. `delete_s3_objects_on_read` | No | Boolean | When `true`, the S3 scan attempts to delete S3 objects after all events from the S3 object are successfully acknowledged by all sinks. `acknowledgments` should be enabled when deleting S3 objects. Default is `false`. +`workers` | No | Integer | Configures the number of worker threads that the source uses to read data from S3. Leaving this value at the default unless your S3 objects are less than 1MB. Performance may decrease for larger S3 objects. This setting only affects SQS-based sources. Default is `1`. + ## sqs @@ -112,7 +119,7 @@ Option | Required | Type | Description `visibility_timeout` | No | Duration | The visibility timeout to apply to messages read from the Amazon SQS queue. This should be set to the amount of time that Data Prepper may take to read all the S3 objects in a batch. Default is `30s`. `wait_time` | No | Duration | The amount of time to wait for long polling on the Amazon SQS API. Default is `20s`. `poll_delay` | No | Duration | A delay placed between the reading and processing of a batch of Amazon SQS messages and making a subsequent request. Default is `0s`. -`visibility_duplication_protection` | No | Boolean | If set to `true`, Data Prepper attempts to avoid duplicate processing by extending the visibility timeout of SQS messages. Until the data reaches the sink, Data Prepper will regularly call `ChangeMessageVisibility` to avoid reading the S3 object again. To use this feature, you need to grant permissions to `ChangeMessageVisibility` on the IAM role. Default is `false`. +`visibility_duplication_protection` | No | Boolean | If set to `true`, Data Prepper attempts to avoid duplicate processing by extending the visibility timeout of SQS messages. Until the data reaches the sink, Data Prepper will regularly call `ChangeMessageVisibility` to avoid rereading of the S3 object. To use this feature, you need to grant permissions to `sqs:ChangeMessageVisibility` on the IAM role. Default is `false`. `visibility_duplicate_protection_timeout` | No | Duration | Sets the maximum total length of time that a message will not be processed when using `visibility_duplication_protection`. Defaults to two hours. @@ -123,6 +130,7 @@ Option | Required | Type | Description `region` | No | String | The AWS Region to use for credentials. 
Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). `sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). `aws_sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. +`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the STS role. For more information, see the `ExternalID` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference. ## codec @@ -154,9 +162,6 @@ Option | Required | Type | Description `header` | No | String list | The header containing the column names used to parse CSV data. `detect_header` | No | Boolean | Whether the first line of the Amazon S3 object should be interpreted as a header. Default is `true`. - - - ## Using `s3_select` with the `s3` source When configuring `s3_select` to parse Amazon S3 objects, use the following options: @@ -198,16 +203,18 @@ Option | Required | Type | Description `start_time` | No | String | The time from which to start scanning objects modified after the given `start_time`. This should follow [ISO LocalDateTime](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE_TIME) format, for example, `023-01-23T10:00:00`. If `end_time` is configured along with `start_time`, all objects after `start_time` and before `end_time` will be processed. `start_time` and `range` cannot be used together. `end_time` | No | String | The time after which no objects will be scanned after the given `end_time`. This should follow [ISO LocalDateTime](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE_TIME) format, for example, `023-01-23T10:00:00`. If `start_time` is configured along with `end_time`, all objects after `start_time` and before `end_time` will be processed. `end_time` and `range` cannot be used together. `range` | No | String | The time range from which objects are scanned from all buckets. Supports ISO_8601 notation strings, such as `PT20.345S` or `PT15M`, and notation strings for seconds (`60s`) and milliseconds (`1600ms`). `start_time` and `end_time` cannot be used with `range`. Range `P12H` scans all the objects modified in the last 12 hours from the time pipeline started. -`buckets` | Yes | List | A list of [buckets](#bucket) to scan. +`buckets` | Yes | List | A list of [scan buckets](#scan-bucket) to scan. `scheduling` | No | List | The configuration for scheduling periodic scans on all buckets. `start_time`, `end_time` and `range` can not be used if scheduling is configured. -### bucket + +### scan bucket + Option | Required | Type | Description :--- | :--- |:-----| :--- `bucket` | Yes | Map | Provides options for each bucket. -You can configure the following options inside the [bucket](#bucket) setting. +You can configure the following options in the `bucket` setting map. Option | Required | Type | Description :--- | :--- | :--- | :--- @@ -244,13 +251,17 @@ The `s3` source includes the following metrics: * `s3ObjectsNotFound`: The number of S3 objects that the `s3` source failed to read due to an S3 "Not Found" error. These are also counted toward `s3ObjectsFailed`. 
* `s3ObjectsAccessDenied`: The number of S3 objects that the `s3` source failed to read due to an "Access Denied" or "Forbidden" error. These are also counted toward `s3ObjectsFailed`. * `s3ObjectsSucceeded`: The number of S3 objects that the `s3` source successfully read. +* `s3ObjectNoRecordsFound`: The number of S3 objects that resulted in 0 records being added to the buffer by the `s3` source. +* `s3ObjectsDeleted`: The number of S3 objects deleted by the `s3` source. +* `s3ObjectsDeleteFailed`: The number of S3 objects that the `s3` source failed to delete. +* `s3ObjectsEmpty`: The number of S3 objects that are considered empty because they have a size of `0`. These objects will be skipped by the `s3` source. * `sqsMessagesReceived`: The number of Amazon SQS messages received from the queue by the `s3` source. * `sqsMessagesDeleted`: The number of Amazon SQS messages deleted from the queue by the `s3` source. * `sqsMessagesFailed`: The number of Amazon SQS messages that the `s3` source failed to parse. -* `s3ObjectNoRecordsFound` -- The number of S3 objects that resulted in 0 records added to the buffer by the `s3` source. * `sqsMessagesDeleteFailed` -- The number of SQS messages that the `s3` source failed to delete from the SQS queue. -* `s3ObjectsDeleted` -- The number of S3 objects deleted by the `s3` source. -* `s3ObjectsDeleteFailed` -- The number of S3 objects that the `s3` source failed to delete. +* `sqsVisibilityTimeoutChangedCount`: The number of times that the `s3` source changed the visibility timeout for an SQS message. This includes multiple visibility timeout changes on the same message. +* `sqsVisibilityTimeoutChangeFailedCount`: The number of times that the `s3` source failed to change the visibility timeout for an SQS message. This includes multiple visibility timeout change failures on the same message. +* `acknowledgementSetCallbackCounter`: The number of times that the `s3` source received an acknowledgment from Data Prepper. ### Timers diff --git a/_data-prepper/pipelines/expression-syntax.md b/_data-prepper/pipelines/expression-syntax.md index 8257ab8978..be0be6f792 100644 --- a/_data-prepper/pipelines/expression-syntax.md +++ b/_data-prepper/pipelines/expression-syntax.md @@ -230,7 +230,7 @@ The `length()` function takes one argument of the JSON pointer type and returns ### `hasTags()` -The `hastags()` function takes one or more string type arguments and returns `true` if all the arguments passed are present in an event's tags. When an argument does not exist in the event's tags, the function returns `false`. For example, if you use the expression `hasTags("tag1")` and the event contains `tag1`, Data Prepper returns `true`. If you use the expression `hasTags("tag2")` but the event only contains a `tag1` tag, Data Prepper returns `false`. +The `hasTags()` function takes one or more string type arguments and returns `true` if all of the arguments passed are present in an event's tags. When an argument does not exist in the event's tags, the function returns `false`. For example, if you use the expression `hasTags("tag1")` and the event contains `tag1`, Data Prepper returns `true`. If you use the expression `hasTags("tag2")` but the event only contains `tag1`, Data Prepper returns `false`. ### `getMetadata()` @@ -245,3 +245,21 @@ The `contains()` function takes two string arguments and determines whether eith The `cidrContains()` function takes two or more arguments. The first argument is a JSON pointer, which represents the key to the IP address that is checked. 
It supports both IPv4 and IPv6 addresses. Every argument that comes after the key is a string type that represents CIDR blocks that are checked against. If the IP address in the first argument is in the range of any of the given CIDR blocks, the function returns `true`. If the IP address is not in the range of the CIDR blocks, the function returns `false`. For example, `cidrContains(/sourceIp,"192.0.2.0/24","10.0.1.0/16")` will return `true` if the `sourceIp` field indicated in the JSON pointer has a value of `192.0.2.5`. + +### `join()` + +The `join()` function joins elements of a list to form a string. The function takes a JSON pointer, which represents the key to a list or a map where values are of the list type, and joins the lists as strings using commas (`,`), the default delimiter between strings. + +If `{"source": [1, 2, 3]}` is the input data, as shown in the following example: + + +```json +{"source": {"key1": [1, 2, 3], "key2": ["a", "b", "c"]}} +``` + +Then `join(/source)` will return `"1,2,3"` in the following format: + +```json +{"key1": "1,2,3", "key2": "a,b,c"} +``` +You can also specify a delimiter other than the default inside the expression. For example, `join("-", /source)` joins each `source` field using a hyphen (`-`) as the delimiter. diff --git a/_im-plugin/reindex-data.md b/_im-plugin/reindex-data.md index 2e3288087a..a766589b84 100644 --- a/_im-plugin/reindex-data.md +++ b/_im-plugin/reindex-data.md @@ -91,6 +91,12 @@ Options | Valid values | Description | Required `socket_timeout` | Time Unit | The wait time for socket reads (default 30s). | No `connect_timeout` | Time Unit | The wait time for remote connection timeouts (default 30s). | No +The following table lists the retry policy cluster settings. + +Setting | Description | Default value +:--- | :--- +`reindex.remote.retry.initial_backoff` | The initial backoff time for retries. Subsequent retries will follow exponential backoff based on the initial backoff time. | 500 ms +`reindex.remote.retry.max_count` | The maximum number of retry attempts. | 15 ## Reindex a subset of documents diff --git a/_ingest-pipelines/processors/index-processors.md b/_ingest-pipelines/processors/index-processors.md index fb71e90d01..60fcac82e2 100644 --- a/_ingest-pipelines/processors/index-processors.md +++ b/_ingest-pipelines/processors/index-processors.md @@ -59,6 +59,7 @@ Processor type | Description `sort` | Sorts the elements of an array in ascending or descending order. `sparse_encoding` | Generates a sparse vector/token and weights from text fields for neural sparse search using sparse retrieval. `split` | Splits a field into an array using a separator character. +`text_chunking` | Splits long documents into smaller chunks. `text_embedding` | Generates vector embeddings from text fields for semantic search. `text_image_embedding` | Generates combined vector embeddings from text and image fields for multimodal neural search. `trim` | Removes leading and trailing white space from a string field. diff --git a/_ingest-pipelines/processors/text-chunking.md b/_ingest-pipelines/processors/text-chunking.md new file mode 100644 index 0000000000..e9ff55b210 --- /dev/null +++ b/_ingest-pipelines/processors/text-chunking.md @@ -0,0 +1,315 @@ +--- +layout: default +title: Text chunking +parent: Ingest processors +nav_order: 250 +--- + +# Text chunking processor + +The `text_chunking` processor splits a long document into shorter passages. 
The processor supports the following algorithms for text splitting: + +- [`fixed_token_length`](#fixed-token-length-algorithm): Splits text into passages of the specified size. +- [`delimiter`](#delimiter-algorithm): Splits text into passages on a delimiter. + +The following is the syntax for the `text_chunking` processor: + +```json +{ + "text_chunking": { + "field_map": { + "": "" + }, + "algorithm": { + "": "" + } + } +} +``` + +## Configuration parameters + +The following table lists the required and optional parameters for the `text_chunking` processor. + +| Parameter | Data type | Required/Optional | Description | +|:---|:---|:---|:---| +| `field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to the output field. | +| `field_map.` | String | Required | The name of the field from which to obtain text for generating chunked passages. | +| `field_map.` | String | Required | The name of the field in which to store the chunked results. | +| `algorithm` | Object | Required | Contains at most one key-value pair that specifies the chunking algorithm and parameters. | +| `algorithm.` | String | Optional | The name of the chunking algorithm. Valid values are [`fixed_token_length`](#fixed-token-length-algorithm) or [`delimiter`](#delimiter-algorithm). Default is `fixed_token_length`. | +| `algorithm.` | Object | Optional | The parameters for the chunking algorithm. By default, contains the default parameters of the `fixed_token_length` algorithm. | +| `description` | String | Optional | A brief description of the processor. | +| `tag` | String | Optional | An identifier tag for the processor. Useful when debugging in order to distinguish between processors of the same type. | + +### Fixed token length algorithm + +The following table lists the optional parameters for the `fixed_token_length` algorithm. + +| Parameter | Data type | Required/Optional | Description | +|:---|:---|:---|:---| +| `token_limit` | Integer | Optional | The token limit for chunking algorithms. Valid values are integers of at least `1`. Default is `384`. | +| `tokenizer` | String | Optional | The [word tokenizer]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/index/#word-tokenizers) name. Default is `standard`. | +| `overlap_rate` | String | Optional | The degree of overlap in the token algorithm. Valid values are floats between `0` and `0.5`, inclusive. Default is `0`. | +| `max_chunk_limit` | Integer | Optional | The chunk limit for chunking algorithms. Default is 100. To disable this parameter, set it to `-1`. | + +The default value of `token_limit` is `384` so that output passages don't exceed the token limit constraint of the downstream text embedding models. For [OpenSearch-supported pretrained models]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/#supported-pretrained-models), like `msmarco-distilbert-base-tas-b` and `opensearch-neural-sparse-encoding-v1`, the input token limit is `512`. The `standard` tokenizer tokenizes text into words. According to [OpenAI](https://platform.openai.com/docs/introduction), 1 token equals approximately 0.75 words of English text. The default token limit is calculated as 512 * 0.75 = 384. +{: .note} + +You can set the `overlap_rate` to a decimal percentage value in the 0--0.5 range, inclusive. Per [Amazon Bedrock](https://aws.amazon.com/blogs/aws/knowledge-bases-now-delivers-fully-managed-rag-experience-in-amazon-bedrock/), we recommend setting this parameter to a value of 0–0.2 to improve accuracy. 
+{: .note} + +The `max_chunk_limit` parameter limits the number of chunked passages. If the number of passages generated by the processor exceeds the limit, the algorithm will return an exception, prompting you to either increase or disable the limit. +{: .note} + +### Delimiter algorithm + +The following table lists the optional parameters for the `delimiter` algorithm. + +| Parameter | Data type | Required/Optional | Description | +|:---|:---|:---|:---| +| `delimiter` | String | Optional | A string delimiter used to split text. You can set the `delimiter` to any string, for example, `\n` (split text into paragraphs on a new line) or `.` (split text into sentences). Default is `\n\n` (split text into paragraphs on two new line characters). | +| `max_chunk_limit` | Integer | Optional | The chunk limit for chunking algorithms. Default is `100`. To disable this parameter, set it to `-1`. | + +The `max_chunk_limit` parameter limits the number of chunked passages. If the number of passages generated by the processor exceeds the limit, the algorithm will return an exception, prompting you to either increase or disable the limit. +{: .note} + +## Using the processor + +Follow these steps to use the processor in a pipeline. You can specify the chunking algorithm when creating the processor. If you don't provide an algorithm name, the chunking processor will use the default `fixed_token_length` algorithm along with all its default parameters. + +**Step 1: Create a pipeline** + +The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field: + +```json +PUT _ingest/pipeline/text-chunking-ingest-pipeline +{ + "description": "A text chunking ingest pipeline", + "processors": [ + { + "text_chunking": { + "algorithm": { + "fixed_token_length": { + "token_limit": 10, + "overlap_rate": 0.2, + "tokenizer": "standard" + } + }, + "field_map": { + "passage_text": "passage_chunk" + } + } + } + ] +} +``` +{% include copy-curl.html %} + +**Step 2 (Optional): Test the pipeline** + +It is recommended that you test your pipeline before ingesting documents. +{: .tip} + +To test the pipeline, run the following query: + +```json +POST _ingest/pipeline/text-chunking-ingest-pipeline/_simulate +{ + "docs": [ + { + "_index": "testindex", + "_id": "1", + "_source":{ + "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch." + } + } + ] +} +``` +{% include copy-curl.html %} + +#### Response + +The response confirms that, in addition to the `passage_text` field, the processor has generated chunking results in the `passage_chunk` field. The processor split the paragraph into 10-word chunks. Because of the `overlap` setting of 0.2, the last 2 words of a chunk are duplicated in the following chunk: + +```json +{ + "docs": [ + { + "doc": { + "_index": "testindex", + "_id": "1", + "_source": { + "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.", + "passage_chunk": [ + "This is an example document to be chunked. The document ", + "The document contains a single paragraph, two sentences and 24 ", + "and 24 tokens by standard tokenizer in OpenSearch." 
+ ] + }, + "_ingest": { + "timestamp": "2024-03-20T02:55:25.642366Z" + } + } + } + ] +} +``` + +Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of the [neural sparse search documentation]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). + +## Chaining text chunking and embedding processors + +You can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor in order to obtain embeddings for each chunked passage. + +**Prerequisites** + +Follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model. + +**Step 1: Create a pipeline** + +The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_embedding` field: + +```json +PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline +{ + "description": "A text chunking and embedding ingest pipeline", + "processors": [ + { + "text_chunking": { + "algorithm": { + "fixed_token_length": { + "token_limit": 10, + "overlap_rate": 0.2, + "tokenizer": "standard" + } + }, + "field_map": { + "passage_text": "passage_chunk" + } + } + }, + { + "text_embedding": { + "model_id": "LMLPWY4BROvhdbtgETaI", + "field_map": { + "passage_chunk": "passage_chunk_embedding" + } + } + } + ] +} +``` +{% include copy-curl.html %} + +**Step 2 (Optional): Test the pipeline** + +It is recommended that you test your pipeline before ingesting documents. +{: .tip} + +To test the pipeline, run the following query: + +```json +POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate +{ + "docs": [ + { + "_index": "testindex", + "_id": "1", + "_source":{ + "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch." + } + } + ] +} +``` +{% include copy-curl.html %} + +#### Response + +The response confirms that, in addition to the `passage_text` and `passage_chunk` fields, the processor has generated text embeddings for each of the three passages in the `passage_chunk_embedding` field. The embedding vectors are stored in the `knn` field for each chunk: + +```json +{ + "docs": [ + { + "doc": { + "_index": "testindex", + "_id": "1", + "_source": { + "passage_chunk_embedding": [ + { + "knn": [...] + }, + { + "knn": [...] + }, + { + "knn": [...] + } + ], + "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.", + "passage_chunk": [ + "This is an example document to be chunked. The document ", + "The document contains a single paragraph, two sentences and 24 ", + "and 24 tokens by standard tokenizer in OpenSearch." 
+ ] + }, + "_ingest": { + "timestamp": "2024-03-20T03:04:49.144054Z" + } + } + } + ] +} +``` + +Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of the [neural sparse search documentation]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). + +## Cascaded text chunking processors + +You can chain multiple chunking processors together. For example, to split documents into paragraphs, apply the `delimiter` algorithm and specify the parameter as `\n\n`. To prevent a paragraph from exceeding the token limit, append another chunking processor that uses the `fixed_token_length` algorithm. You can configure the ingest pipeline for this example as follows: + +```json +PUT _ingest/pipeline/text-chunking-cascade-ingest-pipeline +{ + "description": "A text chunking pipeline with cascaded algorithms", + "processors": [ + { + "text_chunking": { + "algorithm": { + "delimiter": { + "delimiter": "\n\n" + } + }, + "field_map": { + "passage_text": "passage_chunk1" + } + } + }, + { + "text_chunking": { + "algorithm": { + "fixed_token_length": { + "token_limit": 500, + "overlap_rate": 0.2, + "tokenizer": "standard" + } + }, + "field_map": { + "passage_chunk1": "passage_chunk2" + } + } + } + ] +} +``` +{% include copy-curl.html %} + +## Next steps + +- To learn more about semantic search, see [Semantic search]({{site.url}}{{site.baseurl}}/search-plugins/semantic-search/). +- To learn more about sparse search, see [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). +- To learn more about using models in OpenSearch, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model). +- For a comprehensive example, see [Neural search tutorial]({{site.url}}{{site.baseurl}}/search-plugins/neural-search-tutorial/). diff --git a/_install-and-configure/configuring-opensearch/index-settings.md b/_install-and-configure/configuring-opensearch/index-settings.md index 0f7e336cdd..25cd4b8810 100644 --- a/_install-and-configure/configuring-opensearch/index-settings.md +++ b/_install-and-configure/configuring-opensearch/index-settings.md @@ -100,6 +100,7 @@ OpenSearch supports the following static index-level index settings: - `index.merge_on_flush.policy` (default | merge-on-flush): This setting controls which merge policy should be used when `index.merge_on_flush.enabled` is enabled. Default is `default`. +- `index.check_pending_flush.enabled` (Boolean): This setting controls the Apache Lucene `checkPendingFlushOnUpdate` index writer setting, which specifies whether an indexing thread should check for pending flushes on an update in order to flush indexing buffers to disk. Default is `true`. ### Updating a static index setting @@ -184,9 +185,9 @@ OpenSearch supports the following dynamic index-level index settings: - `index.final_pipeline` (String): The final ingest node pipeline for the index. If the final pipeline is set and the pipeline does not exist, then index requests fail. The pipeline name `_none` specifies that the index does not have an ingest pipeline. 
-- `index.optimize_doc_id_lookup.fuzzy_set.enabled` (Boolean): This setting controls whether `fuzzy_set` should be enabled in order to optimize document ID lookups in index or search calls by using an additional data structure, in this case, the Bloom filter data structure. Enabling this setting improves performance for upsert and search operations that rely on document ID by creating a new data structure (Bloom filter). The Bloom filter allows for the handling of negative cases (that is, IDs being absent in the existing index) through faster off-heap lookups. Default is `false`. This setting can only be used if the feature flag `opensearch.experimental.optimize_doc_id_lookup.fuzzy_set.enabled` is set to `true`. +- `index.optimize_doc_id_lookup.fuzzy_set.enabled` (Boolean): This setting controls whether `fuzzy_set` should be enabled in order to optimize document ID lookups in index or search calls by using an additional data structure, in this case, the Bloom filter data structure. Enabling this setting improves performance for upsert and search operations that rely on document IDs by creating a new data structure (Bloom filter). The Bloom filter allows for the handling of negative cases (that is, IDs being absent in the existing index) through faster off-heap lookups. Note that creating a Bloom filter requires additional heap usage during indexing time. Default is `false`. -- `index.optimize_doc_id_lookup.fuzzy_set.false_positive_probability` (Double): Sets the false-positive probability for the underlying `fuzzy_set` (that is, the Bloom filter). A lower false-positive probability ensures higher throughput for `UPSERT` and `GET` operations. Allowed values range between `0.01` and `0.50`. Default is `0.20`. This setting can only be used if the feature flag `opensearch.experimental.optimize_doc_id_lookup.fuzzy_set.enabled` is set to `true`. +- `index.optimize_doc_id_lookup.fuzzy_set.false_positive_probability` (Double): Sets the false-positive probability for the underlying `fuzzy_set` (that is, the Bloom filter). A lower false-positive probability ensures higher throughput for upsert and get operations but results in increased storage and memory use. Allowed values range between `0.01` and `0.50`. Default is `0.20`. ### Updating a dynamic index setting diff --git a/_install-and-configure/install-dashboards/debian.md b/_install-and-configure/install-dashboards/debian.md index 4372049230..73aba46cd4 100644 --- a/_install-and-configure/install-dashboards/debian.md +++ b/_install-and-configure/install-dashboards/debian.md @@ -131,3 +131,44 @@ By default, OpenSearch Dashboards, like OpenSearch, binds to `localhost` when yo 1. From a web browser, navigate to OpenSearch Dashboards. The default port is 5601. 1. Log in with the default username `admin` and the default password `admin`. (For OpenSearch 2.12 and later, the password should be the custom admin password) 1. Visit [Getting started with OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/dashboards/index/) to learn more. + + +## Upgrade to a newer version + +OpenSearch Dashboards instances installed using `dpkg` or `apt-get` can be easily upgraded to a newer version. + +### Manual upgrade with DPKG + +Download the Debian package for the desired upgrade version directly from the [OpenSearch Project downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. 
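+For example, assuming the artifact URL follows the same pattern used for the OpenSearch Dashboards Debian distributions (confirm the exact URL on the downloads page), you can fetch the package with `wget`:
+
+```bash
+wget https://artifacts.opensearch.org/releases/bundle/opensearch-dashboards/{{site.opensearch_version}}/opensearch-dashboards-{{site.opensearch_version}}-linux-x64.deb
+```
+{% include copy.html %}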
+ +Navigate to the directory containing the distribution and run the following command: + +```bash +sudo dpkg -i opensearch-dashboards-{{site.opensearch_version}}-linux-x64.deb +``` +{% include copy.html %} + +### APT-GET + +To upgrade to the latest version of OpenSearch Dashboards using `apt-get`, run the following command: + +```bash +sudo apt-get upgrade opensearch-dashboards +``` +{% include copy.html %} + +You can also upgrade to a specific OpenSearch Dashboards version by providing the version number: + +```bash +sudo apt-get upgrade opensearch-dashboards= +``` +{% include copy.html %} + +### Automatically restart the service after a package upgrade (2.13.0+) + +To automatically restart OpenSearch Dashboards after a package upgrade, enable the `opensearch-dashboards.service` through `systemd`: + +```bash +sudo systemctl enable opensearch-dashboards.service +``` +{% include copy.html %} diff --git a/_install-and-configure/install-dashboards/rpm.md b/_install-and-configure/install-dashboards/rpm.md index d250c4c1f3..cc5974c91e 100644 --- a/_install-and-configure/install-dashboards/rpm.md +++ b/_install-and-configure/install-dashboards/rpm.md @@ -89,4 +89,41 @@ YUM, the primary package management tool for Red Hat-based operating systems, al 1. Once complete, you can run OpenSearch Dashboards. ```bash sudo systemctl start opensearch-dashboards - ``` \ No newline at end of file + ``` + +## Upgrade to a newer version + +OpenSearch Dashboards instances installed using RPM or YUM can be easily upgraded to a newer version. We recommend using YUM, but you can also choose RPM. + + +### Manual upgrade with RPM + +Download the RPM package for the desired upgrade version directly from the [OpenSearch Project downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. + +Navigate to the directory containing the distribution and run the following command: + +```bash +rpm -Uvh opensearch-dashboards-{{site.opensearch_version}}-linux-x64.rpm +``` +{% include copy.html %} + +### YUM + +To upgrade to the latest version of OpenSearch Dashboards using YUM, run the following command: + +```bash +sudo yum update opensearch-dashboards +``` +{% include copy.html %} + +You can also upgrade to a specific OpenSearch Dashboards version by providing the version number: + + ```bash + sudo yum update opensearch-dashboards- + ``` + {% include copy.html %} + +### Automatically restart the service after a package upgrade + +The OpenSearch Dashboards RPM package does not currently support automatically restarting the service after a package upgrade. + diff --git a/_install-and-configure/install-opensearch/debian.md b/_install-and-configure/install-opensearch/debian.md index 6f9167a12c..72ae05d87c 100644 --- a/_install-and-configure/install-opensearch/debian.md +++ b/_install-and-configure/install-opensearch/debian.md @@ -528,7 +528,7 @@ OpenSearch instances installed using `dpkg` or `apt-get` can be easily upgraded ### Manual upgrade with DPKG -Download the Debian package for the desired upgrade version directly from the [OpenSearch downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. +Download the Debian package for the desired upgrade version directly from the [OpenSearch Project downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. 
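+For example, assuming the same artifact URL pattern used for the OpenSearch Debian distributions (confirm the exact URL on the downloads page), you can fetch the package with `wget`:
+
+```bash
+wget https://artifacts.opensearch.org/releases/bundle/opensearch/{{site.opensearch_version}}/opensearch-{{site.opensearch_version}}-linux-x64.deb
+```
+{% include copy.html %}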
Navigate to the directory containing the distribution and run the following command: ```bash @@ -550,6 +550,15 @@ sudo apt-get upgrade opensearch= ``` {% include copy.html %} +### Automatically restart the service after a package upgrade (2.13.0+) + +To automatically restart OpenSearch after a package upgrade, enable the `opensearch.service` through `systemd`: + +```bash +sudo systemctl enable opensearch.service +``` +{% include copy.html %} + ## Related links - [OpenSearch configuration]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/) diff --git a/_install-and-configure/install-opensearch/rpm.md b/_install-and-configure/install-opensearch/rpm.md index ac3ff4e0e9..a22ea96d61 100644 --- a/_install-and-configure/install-opensearch/rpm.md +++ b/_install-and-configure/install-opensearch/rpm.md @@ -500,7 +500,7 @@ OpenSearch instances installed using RPM or YUM can be easily upgraded to a newe ### Manual upgrade with RPM -Download the RPM package for the desired upgrade version directly from the [OpenSearch downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. +Download the RPM package for the desired upgrade version directly from the [OpenSearch Project downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. Navigate to the directory containing the distribution and run the following command: ```bash @@ -512,7 +512,7 @@ rpm -Uvh opensearch-{{site.opensearch_version}}-linux-x64.rpm To upgrade to the latest version of OpenSearch using YUM: ```bash -sudo yum update +sudo yum update opensearch ``` {% include copy.html %} @@ -522,6 +522,10 @@ sudo yum update ``` {% include copy.html %} +### Automatically restart the service after a package upgrade + +The OpenSearch RPM package does not currently support automatically restarting the service after a package upgrade. + ## Related links - [OpenSearch configuration]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/) diff --git a/_install-and-configure/plugins.md b/_install-and-configure/plugins.md index b18257cf3e..6b0b28769e 100644 --- a/_install-and-configure/plugins.md +++ b/_install-and-configure/plugins.md @@ -247,8 +247,23 @@ bin/opensearch-plugin install --batch ## Available plugins -Major, minor, and patch plugin versions must match OpenSearch major, minor, and patch versions in order to be compatible. For example, plugins versions 2.3.0.x work only with OpenSearch 2.3.0. -{: .warning} +OpenSearch provides several bundled and additional plugins. + +### Plugin compatibility + +A plugin can explicitly specify compatibility with a specific OpenSearch version by listing that version in its `plugin-descriptor.properties` file. For example, a plugin with the following property is compatible only with OpenSearch 2.3.0: + +```properties +opensearch.version=2.3.0 +``` +Alternatively, a plugin can specify a range of compatible OpenSearch versions by setting the `dependencies` property in its `plugin-descriptor.properties` file using one of the following notations: +- `dependencies={ opensearch: "2.3.0" }`: The plugin is compatible only with OpenSearch version 2.3.0. +- `dependencies={ opensearch: "=2.3.0" }`: The plugin is compatible only with OpenSearch version 2.3.0. +- `dependencies={ opensearch: "~2.3.0" }`: The plugin is compatible with all versions starting from 2.3.0 up to the next minor version, in this example, 2.4.0 (exclusive). 
+- `dependencies={ opensearch: "^2.3.0" }`: The plugin is compatible with all versions starting from 2.3.0 up to the next major version, in this example, 3.0.0 (exclusive). + +You can specify only one of the `opensearch.version` or `dependencies` properties. +{: .note} ### Bundled plugins diff --git a/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md b/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md index bc2b7443de..68d979d6d6 100644 --- a/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md +++ b/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md @@ -7,12 +7,9 @@ nav_order: 10 --- # Agents and tools tutorial -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The following tutorial illustrates creating a flow agent for retrieval-augmented generation (RAG). A flow agent runs its configured tools sequentially, in the order specified. In this example, you'll create an agent with two tools: 1. `VectorDBTool`: The agent will use this tool to retrieve OpenSearch documents relevant to the user question. You'll ingest supplementary information into an OpenSearch index. To facilitate vector search, you'll deploy a text embedding model that translates text into vector embeddings. OpenSearch will translate the ingested documents into embeddings and store them in the index. When you provide a user question to the agent, the agent will construct a query from the question, run vector search on the OpenSearch index, and pass the relevant retrieved documents to the `MLModelTool`. @@ -264,7 +261,7 @@ To test the LLM, send the following predict request: POST /_plugins/_ml/models/NWR9YIsBUysqmzBdifVJ/_predict { "parameters": { - "prompt": "\n\nHuman:hello\n\nnAssistant:" + "prompt": "\n\nHuman:hello\n\nAssistant:" } } ``` @@ -354,4 +351,50 @@ Therefore, the population increase of Seattle from 2021 to 2023 is 58,000.""" } ] } -``` \ No newline at end of file +``` + +## Hidden agents +**Introduced 2.13** +{: .label .label-purple } + +To hide agent details from end users, including the cluster admin, you can register a _hidden_ agent. If an agent is hidden, non-superadmin users don't have permission to call any [Agent APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/index/) except for the [Execute API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/execute-agent/), on the agent. + +Only superadmin users can register a hidden agent. To register a hidden agent, you first need to authenticate with an [admin certificate]({{site.url}}{{site.baseurl}}/security/configuration/tls/#configuring-admin-certificates): + +```bash +curl -k --cert ./kirk.pem --key ./kirk-key.pem -XGET 'https://localhost:9200/.opendistro_security/_search' +``` + +All agents created by a superadmin user are automatically registered as hidden. 
To register a hidden agent, send a request to the `_register` endpoint: + +```bash +curl -k --cert ./kirk.pem --key ./kirk-key.pem -X POST 'https://localhost:9200/_plugins/_ml/agents/_register' -H 'Content-Type: application/json' -d ' +{ + "name": "Test_Agent_For_RAG", + "type": "flow", + "description": "this is a test agent", + "tools": [ + { + "name": "vector_tool", + "type": "VectorDBTool", + "parameters": { + "model_id": "zBRyYIsBls05QaITo5ex", + "index": "my_test_data", + "embedding_field": "embedding", + "source_field": [ + "text" + ], + "input": "${parameters.question}" + } + }, + { + "type": "MLModelTool", + "description": "A general tool to answer any question", + "parameters": { + "model_id": "NWR9YIsBUysqmzBdifVJ", + "prompt": "\n\nHuman:You are a professional data analyst. You will always answer question based on the given context first. If the answer is not directly shown in the context, you will analyze the data and find the answer. If you don't know the answer, just say don't know. \n\n Context:\n${parameters.vector_tool.output}\n\nHuman:${parameters.question}\n\nAssistant:" + } + } + ] +}' +``` diff --git a/_ml-commons-plugin/agents-tools/index.md b/_ml-commons-plugin/agents-tools/index.md index 016a077c62..ba88edef2f 100644 --- a/_ml-commons-plugin/agents-tools/index.md +++ b/_ml-commons-plugin/agents-tools/index.md @@ -7,12 +7,9 @@ nav_order: 27 --- # Agents and tools -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - You can automate machine learning (ML) tasks using agents and tools. An _agent_ orchestrates and runs ML models and tools. A _tool_ performs a set of specific tasks. Some examples of tools are the `VectorDBTool`, which supports vector search, and the `CATIndexTool`, which executes the `cat indices` operation. For a list of supported tools, see [Tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/tools/index/). ## Agents @@ -155,24 +152,6 @@ POST /_plugins/_ml/agents/_register It is important to provide thorough descriptions of the tools so that the LLM can decide in which situations to use those tools. {: .tip} -## Enabling the feature - -To enable agents and tools, configure the following setting: - -```yaml -plugins.ml_commons.agent_framework_enabled: true -``` -{% include copy.html %} - -For conversational agents, you also need to enable RAG for use in conversational search. To enable RAG, configure the following setting: - -```yaml -plugins.ml_commons.rag_pipeline_feature_enabled: true -``` -{% include copy.html %} - -For more information about ways to enable experimental features, see [Experimental feature flags]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). - ## Next steps - For a list of supported tools, see [Tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/tools/index/). 
diff --git a/_ml-commons-plugin/agents-tools/tools/agent-tool.md b/_ml-commons-plugin/agents-tools/tools/agent-tool.md index 272456d693..272af51e4d 100644 --- a/_ml-commons-plugin/agents-tools/tools/agent-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/agent-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Agent tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `AgentTool` runs any agent. ## Step 1: Set up an agent for AgentTool to run diff --git a/_ml-commons-plugin/agents-tools/tools/cat-index-tool.md b/_ml-commons-plugin/agents-tools/tools/cat-index-tool.md index 77b28ed527..50ccf28b9b 100644 --- a/_ml-commons-plugin/agents-tools/tools/cat-index-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/cat-index-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # CAT Index tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `CatIndexTool` retrieves index information for the OpenSearch cluster, similarly to the [CAT Indices API]({{site.url}}{{site.baseurl}}/api-reference/cat/cat-indices/). ## Step 1: Register a flow agent that will run the CatIndexTool diff --git a/_ml-commons-plugin/agents-tools/tools/index-mapping-tool.md b/_ml-commons-plugin/agents-tools/tools/index-mapping-tool.md index f27b0592a8..8649d2d74d 100644 --- a/_ml-commons-plugin/agents-tools/tools/index-mapping-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/index-mapping-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Index Mapping tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `IndexMappingTool` retrieves mapping and setting information for indexes in your cluster. ## Step 1: Register a flow agent that will run the IndexMappingTool diff --git a/_ml-commons-plugin/agents-tools/tools/index.md b/_ml-commons-plugin/agents-tools/tools/index.md index fe6d574d63..8db522006e 100644 --- a/_ml-commons-plugin/agents-tools/tools/index.md +++ b/_ml-commons-plugin/agents-tools/tools/index.md @@ -10,7 +10,7 @@ redirect_from: --- # Tools -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } A _tool_ performs a set of specific tasks. The following table lists all tools that OpenSearch supports. 
diff --git a/_ml-commons-plugin/agents-tools/tools/ml-model-tool.md b/_ml-commons-plugin/agents-tools/tools/ml-model-tool.md index c0f8aeab86..ceeda40528 100644 --- a/_ml-commons-plugin/agents-tools/tools/ml-model-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/ml-model-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # ML Model tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `MLModelTool` runs a machine learning (ML) model and returns inference results. ## Step 1: Create a connector for a model diff --git a/_ml-commons-plugin/agents-tools/tools/neural-sparse-tool.md b/_ml-commons-plugin/agents-tools/tools/neural-sparse-tool.md index bc1fd4845e..9fee4dcbd2 100644 --- a/_ml-commons-plugin/agents-tools/tools/neural-sparse-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/neural-sparse-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Neural Sparse Search tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `NeuralSparseSearchTool` performs sparse vector retrieval. For more information about neural sparse search, see [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). ## Step 1: Register and deploy a sparse encoding model diff --git a/_ml-commons-plugin/agents-tools/tools/ppl-tool.md b/_ml-commons-plugin/agents-tools/tools/ppl-tool.md index f153ca88f3..72d8ba30b5 100644 --- a/_ml-commons-plugin/agents-tools/tools/ppl-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/ppl-tool.md @@ -9,12 +9,9 @@ grand_parent: Agents and tools --- # PPL tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `PPLTool` translates natural language into a PPL query. The tool provides an `execute` flag to specify whether to run the query. If you set the flag to `true`, the `PPLTool` runs the query and returns the query and the results. ## Prerequisite diff --git a/_ml-commons-plugin/agents-tools/tools/rag-tool.md b/_ml-commons-plugin/agents-tools/tools/rag-tool.md index ae3ad1281a..1f6fafe49a 100644 --- a/_ml-commons-plugin/agents-tools/tools/rag-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/rag-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # RAG tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `RAGTool` performs retrieval-augmented generation (RAG). 
For more information about RAG, see [Conversational search]({{site.url}}{{site.baseurl}}/search-plugins/conversational-search/). RAG calls a large language model (LLM) and supplements its knowledge by providing relevant OpenSearch documents along with the user question. To retrieve relevant documents from an OpenSearch index, you'll need a text embedding model that facilitates vector search. diff --git a/_ml-commons-plugin/agents-tools/tools/search-alerts-tool.md b/_ml-commons-plugin/agents-tools/tools/search-alerts-tool.md index 387ef1cbab..76f9e4b4dc 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-alerts-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/search-alerts-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Alerts tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchAlertsTool` retrieves information about generated alerts. For more information about alerts, see [Alerting]({{site.url}}{{site.baseurl}}/observing-your-data/alerting/index/). ## Step 1: Register a flow agent that will run the SearchAlertsTool diff --git a/_ml-commons-plugin/agents-tools/tools/search-anomaly-detectors.md b/_ml-commons-plugin/agents-tools/tools/search-anomaly-detectors.md index de93a404a3..9f31dea057 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-anomaly-detectors.md +++ b/_ml-commons-plugin/agents-tools/tools/search-anomaly-detectors.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Anomaly Detectors tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchAnomalyDetectorsTool` retrieves information about anomaly detectors set up on your cluster. For more information about anomaly detectors, see [Anomaly detection]({{site.url}}{{site.baseurl}}/observing-your-data/ad/index/). ## Step 1: Register a flow agent that will run the SearchAnomalyDetectorsTool diff --git a/_ml-commons-plugin/agents-tools/tools/search-anomaly-results.md b/_ml-commons-plugin/agents-tools/tools/search-anomaly-results.md index bce27bba55..2f2728e32d 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-anomaly-results.md +++ b/_ml-commons-plugin/agents-tools/tools/search-anomaly-results.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Anomaly Results tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchAnomalyResultsTool` retrieves information about anomaly detector results. For more information about anomaly detectors, see [Anomaly detection]({{site.url}}{{site.baseurl}}/observing-your-data/ad/index/). 
## Step 1: Register a flow agent that will run the SearchAnomalyResultsTool diff --git a/_ml-commons-plugin/agents-tools/tools/search-index-tool.md b/_ml-commons-plugin/agents-tools/tools/search-index-tool.md index 86ecbfc609..b023522893 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-index-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/search-index-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Index tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchIndexTool` searches an index using a query written in query domain-specific language (DSL) and returns the query results. ## Step 1: Register a flow agent that will run the SearchIndexTool diff --git a/_ml-commons-plugin/agents-tools/tools/search-monitors-tool.md b/_ml-commons-plugin/agents-tools/tools/search-monitors-tool.md index 2b746d3453..77b51d4964 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-monitors-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/search-monitors-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Monitors tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchMonitorsTool` retrieves information about alerting monitors set up on your cluster. For more information about alerting monitors, see [Monitors]({{site.url}}{{site.baseurl}}/observing-your-data/alerting/monitors/). ## Step 1: Register a flow agent that will run the SearchMonitorsTool diff --git a/_ml-commons-plugin/agents-tools/tools/vector-db-tool.md b/_ml-commons-plugin/agents-tools/tools/vector-db-tool.md index d8b8083df3..9093541cbb 100644 --- a/_ml-commons-plugin/agents-tools/tools/vector-db-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/vector-db-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Vector DB tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `VectorDBTool` performs dense vector retrieval. For more information about OpenSearch vector database capabilities, see [neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/). ## Step 1: Register and deploy a sparse encoding model diff --git a/_ml-commons-plugin/agents-tools/tools/visualization-tool.md b/_ml-commons-plugin/agents-tools/tools/visualization-tool.md index 1407232555..98457932c2 100644 --- a/_ml-commons-plugin/agents-tools/tools/visualization-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/visualization-tool.md @@ -9,12 +9,9 @@ grand_parent: Agents and tools --- # Visualization tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. 
For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - Use the `VisualizationTool` to find visualizations relevant to a question. ## Step 1: Register a flow agent that will run the VisualizationTool diff --git a/_ml-commons-plugin/api/agent-apis/delete-agent.md b/_ml-commons-plugin/api/agent-apis/delete-agent.md index 0327c3bf04..ddde8fb19b 100644 --- a/_ml-commons-plugin/api/agent-apis/delete-agent.md +++ b/_ml-commons-plugin/api/agent-apis/delete-agent.md @@ -7,12 +7,9 @@ nav_order: 50 --- # Delete an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - You can use this API to delete an agent based on the `agent_id`. ## Path and HTTP methods diff --git a/_ml-commons-plugin/api/agent-apis/execute-agent.md b/_ml-commons-plugin/api/agent-apis/execute-agent.md index 8302ac265f..27d50bced0 100644 --- a/_ml-commons-plugin/api/agent-apis/execute-agent.md +++ b/_ml-commons-plugin/api/agent-apis/execute-agent.md @@ -7,12 +7,9 @@ nav_order: 20 --- # Execute an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - When an agent is executed, it runs the tools with which it is configured. ### Path and HTTP methods diff --git a/_ml-commons-plugin/api/agent-apis/get-agent.md b/_ml-commons-plugin/api/agent-apis/get-agent.md index be49a87502..6190406649 100644 --- a/_ml-commons-plugin/api/agent-apis/get-agent.md +++ b/_ml-commons-plugin/api/agent-apis/get-agent.md @@ -7,12 +7,9 @@ nav_order: 20 --- # Get an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - You can retrieve agent information using the `agent_id`. ## Path and HTTP methods diff --git a/_ml-commons-plugin/api/agent-apis/index.md b/_ml-commons-plugin/api/agent-apis/index.md index 4b6954a79f..72bf6082ce 100644 --- a/_ml-commons-plugin/api/agent-apis/index.md +++ b/_ml-commons-plugin/api/agent-apis/index.md @@ -9,12 +9,9 @@ redirect_from: /ml-commons-plugin/api/agent-apis/ --- # Agent APIs -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - You can automate machine learning (ML) tasks using agents and tools. An _agent_ orchestrates and runs ML models and tools. For more information, see [Agents and tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/index/). 
ML Commons supports the following agent-level APIs: diff --git a/_ml-commons-plugin/api/agent-apis/register-agent.md b/_ml-commons-plugin/api/agent-apis/register-agent.md index 75a63d40cf..820bb923f7 100644 --- a/_ml-commons-plugin/api/agent-apis/register-agent.md +++ b/_ml-commons-plugin/api/agent-apis/register-agent.md @@ -7,12 +7,9 @@ nav_order: 10 --- # Register an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - Use this API to register an agent. Agents may be of the following types: diff --git a/_ml-commons-plugin/api/agent-apis/search-agent.md b/_ml-commons-plugin/api/agent-apis/search-agent.md index c5df482ac2..3d950cde8f 100644 --- a/_ml-commons-plugin/api/agent-apis/search-agent.md +++ b/_ml-commons-plugin/api/agent-apis/search-agent.md @@ -7,12 +7,9 @@ nav_order: 30 --- # Search for an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - Use this command to search for agents you've already created. You can provide any OpenSearch search query in the request body. ## Path and HTTP methods diff --git a/_ml-commons-plugin/api/index.md b/_ml-commons-plugin/api/index.md index a41679f666..ec4cf12492 100644 --- a/_ml-commons-plugin/api/index.md +++ b/_ml-commons-plugin/api/index.md @@ -16,8 +16,11 @@ ML Commons supports the following APIs: - [Model APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/index/) - [Model group APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-group-apis/index/) - [Connector APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/connector-apis/index/) +- [Agent APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/index/) +- [Memory APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/memory-apis/index/) +- [Controller APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/controller-apis/index/) +- [Execute Algorithm API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/execute-algorithm/) - [Tasks APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/index/) - [Train and Predict APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/train-predict/index/) -- [Execute Algorithm API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/execute-algorithm/) - [Profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/profile/) - [Stats API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/stats/) diff --git a/_ml-commons-plugin/api/model-apis/deploy-model.md b/_ml-commons-plugin/api/model-apis/deploy-model.md index 52cf3f232e..2c6991ba22 100644 --- a/_ml-commons-plugin/api/model-apis/deploy-model.md +++ b/_ml-commons-plugin/api/model-apis/deploy-model.md @@ -8,7 +8,19 @@ nav_order: 20 # Deploy a model -The deploy model operation reads the model's chunks from the model index and then creates an instance of the model to cache into memory. This operation requires the `model_id`. 
+The deploy model operation reads the model's chunks from the model index and then creates an instance of the model to cache in memory. This operation requires the `model_id`. + +Starting with OpenSearch version 2.13, [externally hosted models]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index) are deployed automatically by default when you send a Predict API request for the first time. To disable automatic deployment for an externally hosted model, set `plugins.ml_commons.model_auto_deploy.enable` to `false`: + +```json +PUT _cluster/settings +{ + "persistent": { + "plugins.ml_commons.model_auto_deploy.enable": "false" + } +} +``` +{% include copy-curl.html %} For information about user access for this API, see [Model access control considerations]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/index/#model-access-control-considerations). diff --git a/_ml-commons-plugin/api/model-apis/register-model.md b/_ml-commons-plugin/api/model-apis/register-model.md index 880cbd68e5..dd157ed264 100644 --- a/_ml-commons-plugin/api/model-apis/register-model.md +++ b/_ml-commons-plugin/api/model-apis/register-model.md @@ -183,8 +183,9 @@ Field | Data type | Required/Optional | Description `description` | String | Optional| The model description. | `model_group_id` | String | Optional | The model group ID of the model group to register this model to. `is_enabled`| Boolean | Specifies whether the model is enabled. Disabling the model makes it unavailable for Predict API requests, regardless of the model's deployment status. Default is `true`. +`guardrails`| Object | Optional | The guardrails for the model input. For more information, see [Guardrails](#the-guardrails-parameter).| -#### Example request: Remote model with a standalone connector +#### Example request: Externally hosted with a standalone connector ```json POST /_plugins/_ml/models/_register @@ -198,7 +199,7 @@ POST /_plugins/_ml/models/_register ``` {% include copy-curl.html %} -#### Example request: Remote model with a connector specified as part of the model +#### Example request: Externally hosted with a connector specified as part of the model ```json POST /_plugins/_ml/models/_register @@ -248,6 +249,70 @@ OpenSearch responds with the `task_id` and task `status`. } ``` +### The `guardrails` parameter + +Guardrails are safety measures for large language models (LLMs). They provide a set of rules and boundaries that control how an LLM behaves and what kind of output it generates. + +To register an externally hosted model with guardrails, provide the `guardrails` parameter, which supports the following fields. All fields are optional. + +Field | Data type | Description +:--- | :--- | :--- +`type` | String | The guardrail type. Currently, only `local_regex` is supported. +`input_guardrail`| Object | The guardrail for the model input. | +`output_guardrail`| Object | The guardrail for the model output. | +`stop_words`| Object | The list of indexes containing stopwords used for the model input/output validation. If the model prompt/response contains a stopword contained in any of the indexes, the predict request on this model is rejected. | +`index_name`| Object | The name of the index storing the stopwords. | +`source_fields`| Object | The name of the field storing the stopwords. | +`regex`| Object | A regular expression used for input/output validation. If the model prompt/response matches the regular expression, the predict request on this model is rejected. 
| + +#### Example request: Externally hosted model with guardrails + +```json +POST /_plugins/_ml/models/_register +{ + "name": "openAI-gpt-3.5-turbo", + "function_name": "remote", + "model_group_id": "1jriBYsBq7EKuKzZX131", + "description": "test model", + "connector_id": "a1eMb4kBJ1eYAeTMAljY", + "guardrails": { + "type": "local_regex", + "input_guardrail": { + "stop_words": [ + { + "index_name": "stop_words_input", + "source_fields": ["title"] + } + ], + "regex": ["regex1", "regex2"] + }, + "output_guardrail": { + "stop_words": [ + { + "index_name": "stop_words_output", + "source_fields": ["title"] + } + ], + "regex": ["regex1", "regex2"] + } + } +} +``` +{% include copy-curl.html %} + +For a complete example, see [Guardrails]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/guardrails/). + +#### Example response + +OpenSearch responds with the `task_id` and task `status`: + +```json +{ + "task_id" : "ew8I44MBhyWuIwnfvDIH", + "status" : "CREATED" +} +``` + ## Check the status of model registration To see the status of your model registration and retrieve the model ID created for the new model version, pass the `task_id` as a path parameter to the Tasks API: diff --git a/_ml-commons-plugin/api/model-apis/update-model.md b/_ml-commons-plugin/api/model-apis/update-model.md index 380f422272..877d0b5c51 100644 --- a/_ml-commons-plugin/api/model-apis/update-model.md +++ b/_ml-commons-plugin/api/model-apis/update-model.md @@ -36,6 +36,7 @@ Field | Data type | Description `rate_limiter` | Object | Limits the number of times any user can call the Predict API on the model. For more information, see [Rate limiting inference calls]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#rate-limiting-inference-calls). `rate_limiter.limit` | Integer | The maximum number of times any user can call the Predict API on the model per `unit` of time. By default, there is no limit on the number of Predict API calls. Once you set a limit, you cannot reset it to no limit. As an alternative, you can specify a high limit value and a small time unit, for example, 1 request per nanosecond. `rate_limiter.unit` | String | The unit of time for the rate limiter. Valid values are `DAYS`, `HOURS`, `MICROSECONDS`, `MILLISECONDS`, `MINUTES`, `NANOSECONDS`, and `SECONDS`. +`guardrails`| Object | The guardrails for the model. 
#### Example request: Disabling a model @@ -62,6 +63,35 @@ PUT /_plugins/_ml/models/T_S-cY0BKCJ3ot9qr0aP ``` {% include copy-curl.html %} +#### Example request: Updating the guardrails + +```json +PUT /_plugins/_ml/models/MzcIJX8BA7mbufL6DOwl +{ + "guardrails": { + "input_guardrail": { + "stop_words": [ + { + "index_name": "updated_stop_words_input", + "source_fields": ["updated_title"] + } + ], + "regex": ["updated_regex1", "updated_regex2"] + }, + "output_guardrail": { + "stop_words": [ + { + "index_name": "updated_stop_words_output", + "source_fields": ["updated_title"] + } + ], + "regex": ["updated_regex1", "updated_regex2"] + } + } +} +``` +{% include copy-curl.html %} + #### Example response ```json @@ -78,4 +108,5 @@ PUT /_plugins/_ml/models/T_S-cY0BKCJ3ot9qr0aP "_seq_no": 48, "_primary_term": 4 } -``` \ No newline at end of file +``` + diff --git a/_ml-commons-plugin/cluster-settings.md b/_ml-commons-plugin/cluster-settings.md index 5bf1c13599..c473af81a1 100644 --- a/_ml-commons-plugin/cluster-settings.md +++ b/_ml-commons-plugin/cluster-settings.md @@ -239,6 +239,33 @@ plugins.ml_commons.native_memory_threshold: 90 - Default value: 90 - Value range: [0, 100] +## Set JVM heap memory threshold + +Sets a circuit breaker that checks JVM heap memory usage before running an ML task. If the heap usage exceeds the threshold, OpenSearch triggers a circuit breaker and throws an exception to maintain optimal performance. + +Values are based on the percentage of JVM heap memory available. When set to `0`, no ML tasks will run. When set to `100`, the circuit breaker closes and no threshold exists. + +### Setting + +``` +plugins.ml_commons.jvm_heap_memory_threshold: 85 +``` + +### Values + +- Default value: 85 +- Value range: [0, 100] + +## Exclude node names + +Use this setting to specify the names of nodes on which you don't want to run ML tasks. The value should be a valid node name or a comma-separated node name list. + +### Setting + +``` +plugins.ml_commons.exclude_nodes._name: node1, node2 +``` + ## Allow custom deployment plans When enabled, this setting grants users the ability to deploy models to specific ML nodes according to that user's permissions. @@ -254,6 +281,21 @@ plugins.ml_commons.allow_custom_deployment_plan: false - Default value: false - Valid values: `false`, `true` +## Enable auto deploy + +This setting is applicable when you send a prediction request for an externally hosted model that has not been deployed. When set to `true`, this setting automatically deploys the model to the cluster if the model has not been deployed already. + +### Setting + +``` +plugins.ml_commons.model_auto_deploy.enable: false +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` + ## Enable auto redeploy This setting automatically redeploys deployed or partially deployed models upon cluster failure. If all ML nodes inside a cluster crash, the model switches to the `DEPLOYED_FAILED` state, and the model must be deployed manually. @@ -326,10 +368,110 @@ plugins.ml_commons.connector_access_control_enabled: true ### Values -- Default value: false +- Default value: `false` - Valid values: `false`, `true` +## Enable a local model + +This setting allows a cluster admin to enable running local models on the cluster. When this setting is `false`, users will not be able to run register, deploy, or predict operations on any local model. 
+ +### Setting + +``` +plugins.ml_commons.local_model.enabled: true +``` +### Values + +- Default value: `true` +- Valid values: `false`, `true` +## Node roles that can run externally hosted models +This setting allows a cluster admin to control the types of nodes on which externally hosted models can run. + +### Setting + +``` +plugins.ml_commons.task_dispatcher.eligible_node_role.remote_model: ["ml"] +``` + +### Values + +- Default value: `["data", "ml"]`, which allows externally hosted models to run on data nodes and ML nodes. + + +## Node roles that can run local models + +This setting allows a cluster admin to control the types of nodes on which local models can run. The `plugins.ml_commons.only_run_on_ml_node` setting only allows the model to run on ML nodes. For a local model, if `plugins.ml_commons.only_run_on_ml_node` is set to `true`, then the model will always run on ML nodes. If `plugins.ml_commons.only_run_on_ml_node` is set to `false`, then the model will run on nodes defined in the `plugins.ml_commons.task_dispatcher.eligible_node_role.local_model` setting. + +### Setting +``` +plugins.ml_commons.task_dispatcher.eligible_node_role.local_model: ["ml"] +``` + +### Values + +- Default value: `["data", "ml"]` + +## Enable remote inference + +This setting allows a cluster admin to enable remote inference on the cluster. If this setting is `false`, users will not be able to run register, deploy, or predict operations on any externally hosted model or create a connector for remote inference. + +### Setting + +``` +plugins.ml_commons.remote_inference.enabled: true +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` + +## Enable agent framework + +When set to `true`, this setting enables the agent framework (including agents and tools) on the cluster and allows users to run register, execute, delete, get, and search operations on an agent. + +### Setting + +``` +plugins.ml_commons.agent_framework_enabled: true +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` + +## Enable memory + +When set to `true`, this setting enables conversational memory, which stores all messages from a conversation for conversational search. + +### Setting + +``` +plugins.ml_commons.memory_feature_enabled: true +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` + + +## Enable RAG pipeline + +When set to `true`, this setting enables the search processors for retrieval-augmented generation (RAG). RAG enhances query results by generating responses using relevant information from memory and previous conversations. + +### Setting + +``` +plugins.ml_commons.rag_pipeline_feature_enabled: true +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` diff --git a/_ml-commons-plugin/custom-local-models.md b/_ml-commons-plugin/custom-local-models.md index f96f784196..a265d8804a 100644 --- a/_ml-commons-plugin/custom-local-models.md +++ b/_ml-commons-plugin/custom-local-models.md @@ -7,7 +7,7 @@ nav_order: 120 --- # Custom local models -**Generally available 2.9** +**Introduced 2.9** {: .label .label-purple } To use a custom model locally, you can upload it to the OpenSearch cluster. @@ -20,12 +20,14 @@ As of OpenSearch 2.11, OpenSearch supports local sparse encoding models. As of OpenSearch 2.12, OpenSearch supports local cross-encoder models. +As of OpenSearch 2.13, OpenSearch supports local question answering models. + Running local models on the CentOS 7 operating system is not supported. 
Moreover, not all local models can run on all hardware and operating systems. {: .important} ## Preparing a model -For both text embedding and sparse encoding models, you must provide a tokenizer JSON file within the model zip file. +For all models, you must provide a tokenizer JSON file within the model zip file. For sparse encoding models, make sure your output format is `{"output":}` so that ML Commons can post-process the sparse vector. @@ -157,7 +159,7 @@ POST /_plugins/_ml/models/_register ``` {% include copy.html %} -For descriptions of Register API parameters, see [Register a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/). The `model_task_type` corresponds to the model type. For text embedding models, set this parameter to `TEXT_EMBEDDING`. For sparse encoding models, set this parameter to `SPARSE_ENCODING` or `SPARSE_TOKENIZE`. For cross-encoder models, set this parameter to `TEXT_SIMILARITY`. +For descriptions of Register API parameters, see [Register a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/). The `model_task_type` corresponds to the model type. For text embedding models, set this parameter to `TEXT_EMBEDDING`. For sparse encoding models, set this parameter to `SPARSE_ENCODING` or `SPARSE_TOKENIZE`. For cross-encoder models, set this parameter to `TEXT_SIMILARITY`. For question answering models, set this parameter to `QUESTION_ANSWERING`. OpenSearch returns the task ID of the register operation: @@ -321,3 +323,60 @@ The response contains the tokens and weights: ## Step 5: Use the model for search To learn how to use the model for vector search, see [Using an ML model for neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/#using-an-ml-model-for-neural-search). + +## Question answering models + +A question answering model extracts the answer to a question from a given context. ML Commons supports context in `text` format. + +To register a question answering model, send a request in the following format. Specify the `function_name` as `QUESTION_ANSWERING`: + +```json +POST /_plugins/_ml/models/_register +{ + "name": "question_answering", + "version": "1.0.0", + "function_name": "QUESTION_ANSWERING", + "description": "test model", + "model_format": "TORCH_SCRIPT", + "model_group_id": "lN4AP40BKolAMNtR4KJ5", + "model_content_hash_value": "e837c8fc05fd58a6e2e8383b319257f9c3859dfb3edc89b26badfaf8a4405ff6", + "model_config": { + "model_type": "bert", + "framework_type": "huggingface_transformers" + }, + "url": "https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/question_answering/question_answering_pt.zip?raw=true" +} +``` +{% include copy-curl.html %} + +Then send a request to deploy the model: + +```json +POST _plugins/_ml/models/<model_id>/_deploy +``` +{% include copy-curl.html %} + +To test a question answering model, send the following request. It requires a `question` and the relevant `context` from which the answer will be extracted: + +```json +POST /_plugins/_ml/_predict/question_answering/<model_id> +{ + "question": "Where do I live?", + "context": "My name is John. I live in New York" +} +``` +{% include copy-curl.html %} + +The response provides the answer based on the context: + +```json +{ + "inference_results": [ + { + "output": [ + { + "result": "New York" + } + ] + } + ] +} +``` \ No newline at end of file diff --git a/_ml-commons-plugin/ml-dashboard.md b/_ml-commons-plugin/ml-dashboard.md index 3195aff8de..20c4e636bb 100644 --- a/_ml-commons-plugin/ml-dashboard.md +++ b/_ml-commons-plugin/ml-dashboard.md @@ -7,7 +7,7 @@ redirect_from: --- # Managing ML models in OpenSearch Dashboards -**Generally available 2.9** +**Introduced 2.9** {: .label .label-purple } Administrators of machine learning (ML) clusters can use OpenSearch Dashboards to manage and check the status of ML models running inside a cluster. This can help ML developers provision nodes to ensure their models run efficiently. diff --git a/_ml-commons-plugin/opensearch-assistant.md b/_ml-commons-plugin/opensearch-assistant.md index 3a8e0c8703..0a058d73a0 100644 --- a/_ml-commons-plugin/opensearch-assistant.md +++ b/_ml-commons-plugin/opensearch-assistant.md @@ -7,12 +7,9 @@ nav_order: 28 --- # OpenSearch Assistant Toolkit -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [OpenSearch forum thread](https://forum.opensearch.org/t/feedback-opensearch-assistant/16741). -{: .warning} - The OpenSearch Assistant Toolkit helps you create AI-powered assistants for OpenSearch Dashboards. The toolkit includes the following elements: - [**Agents and tools**]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/index/): _Agents_ interface with a large language model (LLM) and execute high-level tasks, such as summarization or generating Piped Processing Language (PPL) queries from natural language. The agent's high-level tasks consist of low-level tasks called _tools_, which can be reused by multiple agents. @@ -36,8 +33,6 @@ To enable OpenSearch Assistant, perform the following steps: ``` {% include copy.html %} -For more information about ways to enable experimental features, see [Experimental feature flags]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). - ## Next steps - For more information about the OpenSearch Assistant UI, see [OpenSearch Assistant for OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/dashboards/dashboards-assistant/index/) \ No newline at end of file diff --git a/_ml-commons-plugin/pretrained-models.md b/_ml-commons-plugin/pretrained-models.md index c68f9c8bab..8847d36291 100644 --- a/_ml-commons-plugin/pretrained-models.md +++ b/_ml-commons-plugin/pretrained-models.md @@ -7,7 +7,7 @@ nav_order: 120 --- # OpenSearch-provided pretrained models -**Generally available 2.9** +**Introduced 2.9** {: .label .label-purple } OpenSearch provides a variety of open-source pretrained models that can assist with a range of machine learning (ML) search and analytics use cases. You can upload any supported model to the OpenSearch cluster and use it locally. diff --git a/_ml-commons-plugin/remote-models/blueprints.md b/_ml-commons-plugin/remote-models/blueprints.md index 57e0e4177b..5cac2f3d3b 100644 --- a/_ml-commons-plugin/remote-models/blueprints.md +++ b/_ml-commons-plugin/remote-models/blueprints.md @@ -55,32 +55,41 @@ As an ML developer, you can build connector blueprints for other platforms. 
Usin ## Configuration parameters -The following configuration parameters are **required** in order to build a connector blueprint. - -| Field | Data type | Description | -| :--- | :--- | :--- | -| `name` | String | The name of the connector. | -| `description` | String | A description of the connector. | -| `version` | Integer | The version of the connector. | -| `protocol` | String | The protocol for the connection. For AWS services such as Amazon SageMaker and Amazon Bedrock, use `aws_sigv4`. For all other services, use `http`. | -| `parameters` | JSON object | The default connector parameters, including `endpoint` and `model`. Any parameters indicated in this field can be overridden by parameters specified in a predict request. | -| `credential` | JSON object | Defines any credential variables required in order to connect to your chosen endpoint. ML Commons uses **AES/GCM/NoPadding** symmetric encryption to encrypt your credentials. When the connection to the cluster first starts, OpenSearch creates a random 32-byte encryption key that persists in OpenSearch's system index. Therefore, you do not need to manually set the encryption key. | -| `actions` | JSON array | Defines what actions can run within the connector. If you're an administrator creating a connection, add the [blueprint]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/) for your desired connection. | -| `backend_roles` | JSON array | A list of OpenSearch backend roles. For more information about setting up backend roles, see [Assigning backend roles to users]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#assigning-backend-roles-to-users). | -| `access_mode` | String | Sets the access mode for the model, either `public`, `restricted`, or `private`. Default is `private`. For more information about `access_mode`, see [Model groups]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#model-groups). | -| `add_all_backend_roles` | Boolean | When set to `true`, adds all `backend_roles` to the access list, which only a user with admin permissions can adjust. When set to `false`, non-admins can add `backend_roles`. | - -The `action` parameter supports the following options. - -| Field | Data type | Description | -| :--- | :--- | :--- | -| `action_type` | String | Required. Sets the ML Commons API operation to use upon connection. As of OpenSearch 2.9, only `predict` is supported. | -| `method` | String | Required. Defines the HTTP method for the API call. Supports `POST` and `GET`. | -| `url` | String | Required. Sets the connection endpoint at which the action occurs. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index#adding-trusted-endpoints). | -| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. | -| `request_body` | String | Required. Sets the parameters contained inside the request body of the action. The parameters must include `\"inputText\`, which specifies how users of the connector should construct the request payload for the `action_type`. | -| `pre_process_function` | String | Optional. A built-in or custom Painless script used to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:
- `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models
- `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models
- `connector.pre_process.default.embedding`, which you can use to preprocess documents in neural search requests so that they are in the format that ML Commons can process with the default preprocessor (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). | -| `post_process_function` | String | Optional. A built-in or custom Painless script used to post-process the model output data. OpenSearch provides the following built-in post-process functions that you can call directly:
- `connector.pre_process.cohere.embedding` for [Cohere text embedding models](https://docs.cohere.com/reference/embed)
- `connector.pre_process.openai.embedding` for [OpenAI text embedding models](https://platform.openai.com/docs/api-reference/embeddings)
- `connector.post_process.default.embedding`, which you can use to post-process documents in the model response so that they are in the format that neural search expects (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). | +| Field | Data type | Is required | Description | +|:------------------------|:------------|:------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `name` | String | Yes | The name of the connector. | +| `description` | String | Yes | A description of the connector. | +| `version` | Integer | Yes | The version of the connector. | +| `protocol` | String | Yes | The protocol for the connection. For AWS services such as Amazon SageMaker and Amazon Bedrock, use `aws_sigv4`. For all other services, use `http`. | +| `parameters` | JSON object | Yes | The default connector parameters, including `endpoint` and `model`. Any parameters indicated in this field can be overridden by parameters specified in a predict request. | +| `credential` | JSON object | Yes | Defines any credential variables required to connect to your chosen endpoint. ML Commons uses **AES/GCM/NoPadding** symmetric encryption to encrypt your credentials. When the connection to the cluster first starts, OpenSearch creates a random 32-byte encryption key that persists in OpenSearch's system index. Therefore, you do not need to manually set the encryption key. | +| `actions` | JSON array | Yes | Defines what actions can run within the connector. If you're an administrator creating a connection, add the [blueprint]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/) for your desired connection. | +| `backend_roles` | JSON array | Yes | A list of OpenSearch backend roles. For more information about setting up backend roles, see [Assigning backend roles to users]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#assigning-backend-roles-to-users). | +| `access_mode` | String | Yes | Sets the access mode for the model, either `public`, `restricted`, or `private`. Default is `private`. For more information about `access_mode`, see [Model groups]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#model-groups). | +| `add_all_backend_roles` | Boolean | Yes | When set to `true`, adds all `backend_roles` to the access list, which only a user with admin permissions can adjust. When set to `false`, non-admins can add `backend_roles`. | +| `client_config` | JSON object | No | The client configuration object, which provides settings that control the behavior of the client connections used by the connector. These settings allow you to manage connection limits and timeouts, ensuring efficient and reliable communication. | + + +The `actions` parameter supports the following options. 
+

| Field | Data type | Description |
|:------------------------|:------------|:----------------------------------------------------------------------------------------------------------------|
| `action_type` | String | Required. Sets the ML Commons API operation to use upon connection. As of OpenSearch 2.9, only `predict` is supported. |
| `method` | String | Required. Defines the HTTP method for the API call. Supports `POST` and `GET`. |
| `url` | String | Required. Sets the connection endpoint at which the action occurs. This must match the regular expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index#adding-trusted-endpoints). |
| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. |
| `request_body` | String | Required. Sets the parameters contained in the request body of the action. The parameters must include `\"inputText\"`, which specifies how users of the connector should construct the request payload for the `action_type`. |
| `pre_process_function` | String | Optional. A built-in or custom Painless script used to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:<br>
- `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models
- `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models
- `connector.pre_process.default.embedding`, which you can use to preprocess documents in neural search requests so that they are in the format that ML Commons can process with the default preprocessor (OpenSearch 2.11 or later). For more information, see [Built-in functions](#built-in-pre--and-post-processing-functions). | +| `post_process_function` | String | Optional. A built-in or custom Painless script used to post-process the model output data. OpenSearch provides the following built-in post-process functions that you can call directly:
- `connector.pre_process.cohere.embedding` for [Cohere text embedding models](https://docs.cohere.com/reference/embed)
- `connector.pre_process.openai.embedding` for [OpenAI text embedding models](https://platform.openai.com/docs/api-reference/embeddings)
- `connector.post_process.default.embedding`, which you can use to post-process documents in the model response so that they are in the format that neural search expects (OpenSearch 2.11 or later). For more information, see [Built-in functions](#built-in-pre--and-post-processing-functions). | + + +The `client_config` parameter supports the following options. + +| Field | Data type | Description | +|:---------------------|:----------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `max_connection` | Integer | The maximum number of concurrent connections that the client can establish with the server. | +| `connection_timeout` | Integer | The maximum amount of time (in seconds) that the client will wait while trying to establish a connection to the server. A timeout prevents the client from waiting indefinitely and allows it to recover from unreachable network endpoints. | +| `read_timeout` | Integer | The maximum amount of time (in seconds) that the client will wait for a response from the server after sending a request. Useful when the server is slow to respond or encounters issues while processing a request. | ## Built-in pre- and post-processing functions diff --git a/_ml-commons-plugin/remote-models/guardrails.md b/_ml-commons-plugin/remote-models/guardrails.md new file mode 100644 index 0000000000..ca34eb335c --- /dev/null +++ b/_ml-commons-plugin/remote-models/guardrails.md @@ -0,0 +1,298 @@ +--- +layout: default +title: Guardrails +has_children: false +has_toc: false +nav_order: 70 +parent: Connecting to externally hosted models +grand_parent: Integrating ML models +--- + +# Configuring model guardrails +**Introduced 2.13** +{: .label .label-purple } + +Guardrails can guide a large language model (LLM) toward desired behavior. They act as a filter, preventing the LLM from generating output that is harmful or violates ethical principles and facilitating safer use of AI. Guardrails also cause the LLM to produce more focused and relevant output. + +To configure guardrails for your LLM, you can provide a list of words to be prohibited in the input or output of the model. Alternatively, you can provide a regular expression against which the model input or output will be matched. + +## Prerequisites + +Before you start, make sure you have fulfilled the [prerequisites]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index/#prerequisites) for connecting to an externally hosted model. + +## Step 1: Create a guardrail index + +To start, create an index that will store the excluded words (_stopwords_). In the index settings, specify a `title` field, which will contain excluded words, and a `query` field of the [percolator]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/percolator/) type. 
The percolator query will be used to match the LLM input or output: + +```json +PUT /words0 +{ + "mappings": { + "properties": { + "title": { + "type": "text" + }, + "query": { + "type": "percolator" + } + } + } +} +``` +{% include copy-curl.html %} + +## Step 2: Index excluded words or phrases + +Next, index a query string query that will be used to match excluded words in the model input or output: + +```json +PUT /words0/_doc/1?refresh +{ + "query": { + "query_string": { + "query": "title: blacklist" + } + } +} +``` +{% include copy-curl.html %} + +```json +PUT /words0/_doc/2?refresh +{ + "query": { + "query_string": { + "query": "title: \"Master slave architecture\"" + } + } +} +``` +{% include copy-curl.html %} + +For more query string options, see [Query string query]({{site.url}}{{site.baseurl}}/query-dsl/full-text/query-string/). + +## Step 3: Register a model group + +To register a model group, send the following request: + +```json +POST /_plugins/_ml/model_groups/_register +{ + "name": "bedrock", + "description": "This is a public model group." +} +``` +{% include copy-curl.html %} + +The response contains the model group ID that you'll use to register a model to this model group: + +```json +{ + "model_group_id": "wlcnb4kBJ1eYAeTMHlV6", + "status": "CREATED" +} +``` + +To learn more about model groups, see [Model access control]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control/). + +## Step 4: Create a connector + +Now you can create a connector for the model. In this example, you'll create a connector to the Anthropic Claude model hosted on Amazon Bedrock: + +```json +POST /_plugins/_ml/connectors/_create +{ + "name": "BedRock test claude Connector", + "description": "The connector to BedRock service for claude model", + "version": 1, + "protocol": "aws_sigv4", + "parameters": { + "region": "us-east-1", + "service_name": "bedrock", + "anthropic_version": "bedrock-2023-05-31", + "endpoint": "bedrock.us-east-1.amazonaws.com", + "auth": "Sig_V4", + "content_type": "application/json", + "max_tokens_to_sample": 8000, + "temperature": 0.0001, + "response_filter": "$.completion" + }, + "credential": { + "access_key": "", + "secret_key": "" + }, + "actions": [ + { + "action_type": "predict", + "method": "POST", + "url": "https://bedrock-runtime.us-east-1.amazonaws.com/model/anthropic.claude-v2/invoke", + "headers": { + "content-type": "application/json", + "x-amz-content-sha256": "required" + }, + "request_body": "{\"prompt\":\"${parameters.prompt}\", \"max_tokens_to_sample\":${parameters.max_tokens_to_sample}, \"temperature\":${parameters.temperature}, \"anthropic_version\":\"${parameters.anthropic_version}\" }" + } + ] +} +``` +{% include copy-curl.html %} + +The response contains the connector ID for the newly created connector: + +```json +{ + "connector_id": "a1eMb4kBJ1eYAeTMAljY" +} +``` + +## Step 5: Register and deploy the model with guardrails + +To register an externally hosted model, provide the model group ID from step 3 and the connector ID from step 4 in the following request. 
To configure guardrails, include the `guardrails` object: + +```json +POST /_plugins/_ml/models/_register?deploy=true +{ + "name": "Bedrock Claude V2 model", + "function_name": "remote", + "model_group_id": "wlcnb4kBJ1eYAeTMHlV6", + "description": "test model", + "connector_id": "a1eMb4kBJ1eYAeTMAljY", + "guardrails": { + "type": "local_regex", + "input_guardrail": { + "stop_words": [ + { + "index_name": "words0", + "source_fields": [ + "title" + ] + } + ], + "regex": [ + ".*abort.*", + ".*kill.*" + ] + }, + "output_guardrail": { + "stop_words": [ + { + "index_name": "words0", + "source_fields": [ + "title" + ] + } + ], + "regex": [ + ".*abort.*", + ".*kill.*" + ] + } + } +} +``` +{% include copy-curl.html %} + +For more information, see [The `guardrails` parameter]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/#the-guardrails-parameter). + +OpenSearch returns the task ID of the register operation: + +```json +{ + "task_id": "cVeMb4kBJ1eYAeTMFFgj", + "status": "CREATED" +} +``` + +To check the status of the operation, provide the task ID to the [Tasks API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/): + +```bash +GET /_plugins/_ml/tasks/cVeMb4kBJ1eYAeTMFFgj +``` +{% include copy-curl.html %} + +When the operation is complete, the state changes to `COMPLETED`: + +```json +{ + "model_id": "cleMb4kBJ1eYAeTMFFg4", + "task_type": "DEPLOY_MODEL", + "function_name": "REMOTE", + "state": "COMPLETED", + "worker_node": [ + "n-72khvBTBi3bnIIR8FTTw" + ], + "create_time": 1689793851077, + "last_update_time": 1689793851101, + "is_async": true +} +``` + +## Step 6 (Optional): Test the model + +To demonstrate how guardrails are applied, first run the predict operation that does not contain any excluded words: + +```json +POST /_plugins/_ml/models/p94dYo4BrXGpZpgPp98E/_predict +{ + "parameters": { + "prompt": "\n\nHuman:this is a test\n\nnAssistant:" + } +} +``` +{% include copy-curl.html %} + +The response contains inference results: + +```json +{ + "inference_results": [ + { + "output": [ + { + "name": "response", + "dataAsMap": { + "response": " Thank you for the test, I appreciate you taking the time to interact with me. I'm an AI assistant created by Anthropic to be helpful, harmless, and honest." + } + } + ], + "status_code": 200 + } + ] +} +``` + +Then run the predict operation that contains excluded words: + +```json +POST /_plugins/_ml/models/p94dYo4BrXGpZpgPp98E/_predict +{ + "parameters": { + "prompt": "\n\nHuman:this is a test of Master slave architecture\n\nnAssistant:" + } +} +``` +{% include copy-curl.html %} + +The response contains an error message because guardrails were triggered: + +```json +{ + "error": { + "root_cause": [ + { + "type": "illegal_argument_exception", + "reason": "guardrails triggered for user input" + } + ], + "type": "illegal_argument_exception", + "reason": "guardrails triggered for user input" + }, + "status": 400 +} +``` + +Guardrails are also triggered when a prompt matches the supplied regular expression. + +## Next steps + +- For more information about configuring guardrails, see [The `guardrails` parameter]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/#the-guardrails-parameter). 
\ No newline at end of file diff --git a/_ml-commons-plugin/remote-models/index.md b/_ml-commons-plugin/remote-models/index.md index 0b9c6d03ed..0b92adaab6 100644 --- a/_ml-commons-plugin/remote-models/index.md +++ b/_ml-commons-plugin/remote-models/index.md @@ -205,7 +205,18 @@ Take note of the returned `model_id` because you’ll need it to deploy the mode ## Step 4: Deploy the model -To deploy the registered model, provide its model ID from step 3 in the following request: +Starting with OpenSearch version 2.13, externally hosted models are deployed automatically by default when you send a Predict API request for the first time. To disable automatic deployment for an externally hosted model, set `plugins.ml_commons.model_auto_deploy.enable` to `false`: +```json +PUT _cluster/settings +{ + "persistent": { + "plugins.ml_commons.model_auto_deploy.enable" : "false" + } +} +``` +{% include copy-curl.html %} + +To undeploy the model, use the [Undeploy API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/undeploy-model/). ```bash POST /_plugins/_ml/models/cleMb4kBJ1eYAeTMFFg4/_deploy @@ -317,3 +328,4 @@ To learn how to use the model for vector search, see [Using an ML model for neur - For more information about connector parameters, see [Connector blueprints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/). - For more information about managing ML models in OpenSearch, see [Using ML models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-serving-framework/). - For more information about interacting with ML models in OpenSearch, see [Managing ML models in OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-dashboard/) +For instructions on how to configure model guardrails, see [Guardrails]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/guardrails/). diff --git a/_ml-commons-plugin/using-ml-models.md b/_ml-commons-plugin/using-ml-models.md index 5c23e19ab6..db50626721 100644 --- a/_ml-commons-plugin/using-ml-models.md +++ b/_ml-commons-plugin/using-ml-models.md @@ -10,7 +10,7 @@ redirect_from: --- # Using ML models within OpenSearch -**Generally available 2.9** +**Introduced 2.9** {: .label .label-purple } To integrate machine learning (ML) models into your OpenSearch cluster, you can upload and serve them locally. Choose one of the following options: diff --git a/_monitoring-your-cluster/metrics/getting-started.md b/_monitoring-your-cluster/metrics/getting-started.md index 21edceda7b..659614a07c 100644 --- a/_monitoring-your-cluster/metrics/getting-started.md +++ b/_monitoring-your-cluster/metrics/getting-started.md @@ -1,8 +1,9 @@ --- layout: default -title: Metrics framework -parent: Trace Analytics -nav_order: 65 +title: Metrics framework +nav_order: 1 +has_children: false +has_toc: false redirect_from: - /monitoring-your-cluster/metrics/ --- @@ -95,3 +96,12 @@ The metrics framework feature supports various telemetry solutions through plugi 2. **Exporters:** Exporters are responsible for persisting the data. OpenTelemetry provides several out-of-the-box exporters. OpenSearch supports the following exporters: - `LoggingMetricExporter`: Exports metrics to a log file, generating a separate file in the logs directory `_otel_metrics.log`. Default is `telemetry.otel.metrics.exporter.class=io.opentelemetry.exporter.logging.LoggingMetricExporter`. - `OtlpGrpcMetricExporter`: Exports spans through gRPC. To use this exporter, you need to install the `otel-collector` on the node. 
By default, it writes to the http://localhost:4317/ endpoint. To use this exporter, set the following static setting: `telemetry.otel.metrics.exporter.class=io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter`. + +### Supported metric types + +The metrics framework feature supports the following metric types: + +1. **Counters:** Counters are continuous and synchronous meters used to track the frequency of events over time. Counters can only be incremented with positive values, making them ideal for measuring the number of monitoring occurrences such as errors, processed or received bytes, and total requests. +2. **UpDown counters:** UpDown counters can be incremented with positive values or decremented with negative values. UpDown counters are well suited for tracking metrics like open connections, active requests, and other fluctuating quantities. +3. **Histograms:** Histograms are valuable tools for visualizing the distribution of continuous data. Histograms offer insight into the central tendency, spread, skewness, and potential outliers that might exist in your metrics. Patterns such as normal distribution, skewed distribution, or bimodal distribution can be readily identified, making histograms ideal for analyzing latency metrics and assessing percentiles. +4. **Asynchronous Gauges:** Asynchronous gauges capture the current value at the moment a metric is read. These metrics are non-additive and are commonly used to measure CPU utilization on a per-minute basis, memory utilization, and other real-time values. diff --git a/_observing-your-data/event-analytics.md b/_observing-your-data/event-analytics.md index dd936b7d27..b8fe72964c 100644 --- a/_observing-your-data/event-analytics.md +++ b/_observing-your-data/event-analytics.md @@ -30,9 +30,6 @@ For more information about building PPL queries, see [Piped Processing Language] ### OpenSearch Dashboards Query Assistant -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [OpenSearch forum thread](https://forum.opensearch.org/t/feedback-opensearch-assistant/16741). -{: .warning} - Note that machine learning models are probabilistic and that some may perform better than others, so the OpenSearch Assistant may occasionally produce inaccurate information. We recommend evaluating outputs for accuracy as appropriate to your use case, including reviewing the output or combining it with other verification factors. {: .important} @@ -42,28 +39,23 @@ To simplify query building, the **OpenSearch Assistant** toolkit offers an assis #### Enabling Query Assistant -To enable **Query Assistant** in OpenSearch Dashboards, locate your copy of the `opensearch_dashboards.yml` file and set the following option: - -``` -observability.query_assist.enabled: true -observability.query_assist.ppl_agent_name: "PPL agent" -``` +By default, **Query Assistant** is enabled in OpenSearch Dashboards. 
To enable summarization of responses, locate your copy of the `opensearch_dashboards.yml` file and set the following option: -To enable summarization of responses, locate your copy of the `opensearch_dashboards.yml` file and set the following option: - -``` +```yaml observability.summarize.enabled: true observability.summarize.response_summary_agent_name: "Response summary agent" observability.summarize.error_summary_agent_name: "Error summary agent" ``` +To disable Query Assistant, add `observability.query_assist.enabled: false` to your `opensearch_dashboards.yml`. + #### Setting up Query Assistant To set up **Query Assistant**, follow the steps in the [Getting started guide](https://github.com/opensearch-project/dashboards-assistant/blob/main/GETTING_STARTED_GUIDE.md) on GitHub. This guide provides step-by-step setup instructions for **OpenSearch Assistant** and **Query Assistant**. To set up **Query Assistant** only, use the `query-assist-agent` template included in the guide. ## Saving a visualization -After Dashboards generates a visualization, save it if you want to revisit it or include it in an [operational panel]({{site.url}}{{site.baseurl}}/observing-your-data/operational-panels). To save a visualization, expand the **Save** dropdown menu in the upper-right corner, enter a name for the visualization, and then select the **Save** button. You can reopen saved visualizations on the event analytics page. +After Dashboards generates a visualization, save it if you want to revisit it or include it in an [operational panel]({{site.url}}{{site.baseurl}}/observing-your-data/operational-panels/). To save a visualization, expand the **Save** dropdown menu in the upper-right corner, enter a name for the visualization, and then select the **Save** button. You can reopen saved visualizations on the event analytics page. ## Creating event analytics visualizations and adding them to dashboards diff --git a/_query-dsl/minimum-should-match.md b/_query-dsl/minimum-should-match.md index 9ec65431b1..e2032b8911 100644 --- a/_query-dsl/minimum-should-match.md +++ b/_query-dsl/minimum-should-match.md @@ -26,7 +26,7 @@ GET /shakespeare/_search } ``` -In this example, the query has three optional clauses that are combined with an `OR`, so the document must match either `prince`, `king`, or `star`. +In this example, the query has three optional clauses that are combined with an `OR`, so the document must match either `prince` and `king`, or `prince` and `star`, or `king` and `star`. ## Valid values @@ -448,4 +448,4 @@ The results contain only four documents that match at least one of the optional ] } } -``` \ No newline at end of file +``` diff --git a/_search-plugins/caching/index.md b/_search-plugins/caching/index.md new file mode 100644 index 0000000000..4d0173fdc7 --- /dev/null +++ b/_search-plugins/caching/index.md @@ -0,0 +1,32 @@ +--- +layout: default +title: Caching +parent: Improving search performance +has_children: true +nav_order: 100 +--- + +# Caching + +OpenSearch relies heavily on different on-heap cache types to accelerate data retrieval, providing significant improvement in search latencies. However, cache size is limited by the amount of memory available on a node. If you are processing a larger dataset that can potentially be cached, the cache size limit causes a lot of cache evictions and misses. The increasing number of evictions impacts performance because OpenSearch needs to process the query again, causing high resource consumption. 
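To see whether evictions are actually occurring on your cluster, you can check per-node cache statistics using the Nodes Stats API. The following request is a minimal diagnostic sketch (the exact set of response fields may vary by version):

```json
GET /_nodes/stats/indices/request_cache,query_cache,fielddata
```
{% include copy-curl.html %}

Each node's response typically includes `memory_size_in_bytes`, `evictions`, and hit/miss counters for these caches; a steadily increasing `evictions` count is a sign that the cache is undersized for the workload.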
+ +Prior to version 2.13, OpenSearch supported the following on-heap cache types: + +- **Request cache**: Caches the local results on each shard. This allows frequently used (and potentially resource-heavy) search requests to return results almost instantly. +- **Query cache**: The shard-level query cache caches common data from similar queries. The query cache is more granular than the request cache and can cache data that is reused in different queries. +- **Field data cache**: The field data cache contains field data and global ordinals, which are both used to support aggregations on certain field types. + +## Additional cache stores +**Introduced 2.13** +{: .label .label-purple } + +This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/10024). +{: .warning} + +In addition to existing OpenSearch custom on-heap cache stores, cache plugins provide the following cache stores: + +- **Disk cache**: This cache stores the precomputed result of a query on disk. You can use a disk cache to cache much larger datasets, provided that the disk latencies are acceptable. +- **Tiered cache**: This is a multi-level cache, in which each tier has its own characteristics and performance levels. For example, a tiered cache can contain on-heap and disk tiers. By combining different tiers, you can achieve a balance between cache performance and size. To learn more, see [Tiered cache]({{site.url}}{{site.baseurl}}/search-plugins/caching/tiered-cache/). + +In OpenSearch 2.13, the request cache is integrated with cache plugins. You can use a tiered or disk cache as a request-level cache. +{: .note} \ No newline at end of file diff --git a/_search-plugins/caching/tiered-cache.md b/_search-plugins/caching/tiered-cache.md new file mode 100644 index 0000000000..3842ebe5a9 --- /dev/null +++ b/_search-plugins/caching/tiered-cache.md @@ -0,0 +1,82 @@ +--- +layout: default +title: Tiered cache +parent: Caching +grand_parent: Improving search performance +nav_order: 10 +--- + +# Tiered cache + +This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/10024). +{: .warning} + +A tiered cache is a multi-level cache, in which each tier has its own characteristics and performance levels. By combining different tiers, you can achieve a balance between cache performance and size. + +## Types of tiered caches + +OpenSearch 2.13 provides an implementation of _tiered spillover cache_. This implementation spills the evicted items from upper to lower tiers. The upper tier is smaller in size but offers better latency, like the on-heap tier. The lower tier is larger in size but is slower in terms of latency compared to the upper tier. A disk cache is an example of a lower tier. OpenSearch 2.13 offers on-heap and disk tiers. + +## Enabling a tiered cache + +To enable a tiered cache, configure the following setting: + +```yaml +opensearch.experimental.feature.pluggable.caching.enabled: true +``` +{% include copy.html %} + +For more information about ways to enable experimental features, see [Experimental feature flags]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). 
+ +## Installing required plugins + +A tiered cache provides a way to plug in any disk or on-heap tier implementation. You can install the plugins you intend to use in the tiered cache. As of OpenSearch 2.13, the available cache plugin is the `cache-ehcache` plugin. This plugin provides a disk cache implementation to use within a tiered cache as a disk tier. + +A tiered cache will fail to initialize if the `cache-ehcache` plugin is not installed or disk cache properties are not set. +{: .warning} + +## Tiered cache settings + +In OpenSearch 2.13, a request cache can use a tiered cache. To begin, configure the following settings in the `opensearch.yml` file. + +### Cache store name + +Set the cache store name to `tiered_spillover` to use the OpenSearch-provided tiered spillover cache implementation: + +```yaml +indices.request.cache.store.name: tiered_spillover: true +``` +{% include copy.html %} + +### Setting on-heap and disk store tiers + +The `opensearch_onheap` setting is the built-in on-heap cache available in OpenSearch. The `ehcache_disk` setting is the disk cache implementation from [Ehcache](https://www.ehcache.org/). This requires installing the `cache-ehcache` plugin: + +```yaml +indices.request.cache.tiered_spillover.onheap.store.name: opensearch_onheap +indices.request.cache.tiered_spillover.disk.store.name: ehcache_disk +``` +{% include copy.html %} + +For more information about installing non-bundled plugins, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins). + +### Configuring on-heap and disk stores + +The following table lists the cache store settings for the `opensearch_onheap` store. + +Setting | Default | Description +:--- | :--- | :--- +`indices.request.cache.opensearch_onheap.size` | 1% of the heap | The size of the on-heap cache. Optional. +`indices.request.cache.opensearch_onheap.expire` | `MAX_VALUE` (disabled) | Specify a time-to-live (TTL) for the cached results. Optional. + +The following table lists the disk cache store settings for the `ehcache_disk` store. + +Setting | Default | Description +:--- | :--- | :--- +`indices.request.cache.ehcache_disk.max_size_in_bytes` | `1073741824` (1 GB) | Defines the size of the disk cache. Optional. +`indices.request.cache.ehcache_disk.storage.path` | `""` | Defines the storage path for the disk cache. Required. +`indices.request.cache.ehcache_disk.expire_after_access` | `MAX_VALUE` (disabled) | Specify a time-to-live (TTL) for the cached results. Optional. +`indices.request.cache.ehcache_disk.alias` | `ehcacheDiskCache#INDICES_REQUEST_CACHE` (this is an example of request cache) | Specify an alias for the disk cache. Optional. +`indices.request.cache.ehcache_disk.segments` | `16` | Defines the number of segments the disk cache is separated into. Used for concurrency. Optional. +`indices.request.cache.ehcache_disk.concurrency` | `1` | Defines the number of distinct write queues created for the disk store, where a group of segments share a write queue. Optional. + diff --git a/_search-plugins/concurrent-segment-search.md b/_search-plugins/concurrent-segment-search.md index 58b8d9a8ce..0bb7657937 100644 --- a/_search-plugins/concurrent-segment-search.md +++ b/_search-plugins/concurrent-segment-search.md @@ -27,7 +27,7 @@ By default, concurrent segment search is disabled on the cluster. You can enable - Cluster level - Index level -The index-level setting takes priority over the cluster-level setting. 
Thus, if the cluster setting is enabled but the index setting is disabled, then concurrent segment search will be disabled for that index. +The index-level setting takes priority over the cluster-level setting. Thus, if the cluster setting is enabled but the index setting is disabled, then concurrent segment search will be disabled for that index. Because of this, the index-level setting is not evaluated unless it is explicitly set, regardless of the default value configured for the setting. You can retrieve the current value of the index-level setting by calling the [Index Settings API]({{site.url}}{{site.baseurl}}/api-reference/index-apis/get-settings/) and omitting the `?include_defaults` query parameter. {: .note} To enable concurrent segment search for all indexes in the cluster, set the following dynamic cluster setting: diff --git a/_search-plugins/hybrid-search.md b/_search-plugins/hybrid-search.md index ebd014b0de..b0fb4d5bef 100644 --- a/_search-plugins/hybrid-search.md +++ b/_search-plugins/hybrid-search.md @@ -146,7 +146,9 @@ PUT /_search/pipeline/nlp-search-pipeline To perform hybrid search on your index, use the [`hybrid` query]({{site.url}}{{site.baseurl}}/query-dsl/compound/hybrid/), which combines the results of keyword and semantic search. -The following example request combines two query clauses---a neural query and a `match` query. It specifies the search pipeline created in the previous step as a query parameter: +#### Example: Combining a neural query and a match query + +The following example request combines two query clauses---a `neural` query and a `match` query. It specifies the search pipeline created in the previous step as a query parameter: ```json GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline @@ -161,7 +163,7 @@ GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline "queries": [ { "match": { - "text": { + "passage_text": { "query": "Hi world" } } @@ -216,3 +218,355 @@ The response contains the matching document: } } ``` +{% include copy-curl.html %} + +#### Example: Combining a match query and a term query + +The following example request combines two query clauses---a `match` query and a `term` query. It specifies the search pipeline created in the previous step as a query parameter: + +```json +GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline +{ + "_source": { + "exclude": [ + "passage_embedding" + ] + }, + "query": { + "hybrid": { + "queries": [ + { + "match":{ + "passage_text": "hello" + } + }, + { + "term":{ + "passage_text":{ + "value":"planet" + } + } + } + ] + } + } +} +``` +{% include copy-curl.html %} + +The response contains the matching documents: + +```json +{ + "took": 11, + "timed_out": false, + "_shards": { + "total": 2, + "successful": 2, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 0.7, + "hits": [ + { + "_index": "my-nlp-index", + "_id": "2", + "_score": 0.7, + "_source": { + "id": "s2", + "passage_text": "Hi planet" + } + }, + { + "_index": "my-nlp-index", + "_id": "1", + "_score": 0.3, + "_source": { + "id": "s1", + "passage_text": "Hello world" + } + } + ] + } +} +``` +{% include copy-curl.html %} + +## Hybrid search with post-filtering +**Introduced 2.13** +{: .label .label-purple } + +You can perform post-filtering on hybrid search results by providing the `post_filter` parameter in your query. + +The `post_filter` clause is applied after the search results have been retrieved. 
Post-filtering is useful for applying additional filters to the search results without impacting the scoring or the order of the results. + +Post-filtering does not impact document relevance scores or aggregation results. +{: .note} + +#### Example: Post-filtering + +The following example request combines two query clauses---a `term` query and a `match` query. This is the same query as in the [preceding example](#example-combining-a-match-query-and-a-term-query), but it contains a `post_filter`: + +```json +GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline +{ + "query": { + "hybrid":{ + "queries":[ + { + "match":{ + "passage_text": "hello" + } + }, + { + "term":{ + "passage_text":{ + "value":"planet" + } + } + } + ] + } + + }, + "post_filter":{ + "match": { "passage_text": "world" } + } +} + +``` +{% include copy-curl.html %} + +Compare the results to the results without post-filtering in the [preceding example](#example-combining-a-match-query-and-a-term-query). Unlike the preceding example response, which contains two documents, the response in this example contains one document because the second document is filtered using post-filtering: + +```json +{ + "took": 18, + "timed_out": false, + "_shards": { + "total": 2, + "successful": 2, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 0.3, + "hits": [ + { + "_index": "my-nlp-index", + "_id": "1", + "_score": 0.3, + "_source": { + "id": "s1", + "passage_text": "Hello world" + } + } + ] + } +} +``` + + +## Combining hybrid search and aggregations +**Introduced 2.13** +{: .label .label-purple } + +You can enhance search results by combining a hybrid query clause with any aggregation that OpenSearch supports. Aggregations allow you to use OpenSearch as an analytics engine. For more information about aggregations, see [Aggregations]({{site.url}}{{site.baseurl}}/aggregations/). + +Most aggregations are performed on the subset of documents that is returned by a hybrid query. The only aggregation that operates on all documents is the [`global`]({{site.url}}{{site.baseurl}}/aggregations/bucket/global/) aggregation. + +To use aggregations with a hybrid query, first create an index. Aggregations are typically used on fields of special types, like `keyword` or `integer`. 
The following example creates an index with several such fields: + +```json +PUT /my-nlp-index +{ + "settings": { + "number_of_shards": 2 + }, + "mappings": { + "properties": { + "doc_index": { + "type": "integer" + }, + "doc_keyword": { + "type": "keyword" + }, + "category": { + "type": "keyword" + } + } + } +} +``` +{% include copy-curl.html %} + +The following request ingests six documents into your new index: + +```json +POST /_bulk +{ "index": { "_index": "my-nlp-index" } } +{ "category": "permission", "doc_keyword": "workable", "doc_index": 4976, "doc_price": 100} +{ "index": { "_index": "my-nlp-index" } } +{ "category": "sister", "doc_keyword": "angry", "doc_index": 2231, "doc_price": 200 } +{ "index": { "_index": "my-nlp-index" } } +{ "category": "hair", "doc_keyword": "likeable", "doc_price": 25 } +{ "index": { "_index": "my-nlp-index" } } +{ "category": "editor", "doc_index": 9871, "doc_price": 30 } +{ "index": { "_index": "my-nlp-index" } } +{ "category": "statement", "doc_keyword": "entire", "doc_index": 8242, "doc_price": 350 } +{ "index": { "_index": "my-nlp-index" } } +{ "category": "statement", "doc_keyword": "idea", "doc_index": 5212, "doc_price": 200 } +{ "index": { "_index": "index-test" } } +{ "category": "editor", "doc_keyword": "bubble", "doc_index": 1298, "doc_price": 130 } +{ "index": { "_index": "index-test" } } +{ "category": "editor", "doc_keyword": "bubble", "doc_index": 521, "doc_price": 75 } +``` +{% include copy-curl.html %} + +Now you can combine a hybrid query clause with a `min` aggregation: + +```json +GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline +{ + "query": { + "hybrid": { + "queries": [ + { + "term": { + "category": "permission" + } + }, + { + "bool": { + "should": [ + { + "term": { + "category": "editor" + } + }, + { + "term": { + "category": "statement" + } + } + ] + } + } + ] + } + }, + "aggs": { + "total_price": { + "sum": { + "field": "doc_price" + } + }, + "keywords": { + "terms": { + "field": "doc_keyword", + "size": 10 + } + } + } +} +``` +{% include copy-curl.html %} + +The response contains the matching documents and the aggregation results: + +```json +{ + "took": 9, + "timed_out": false, + "_shards": { + "total": 2, + "successful": 2, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 4, + "relation": "eq" + }, + "max_score": 0.5, + "hits": [ + { + "_index": "my-nlp-index", + "_id": "mHRPNY4BlN82W_Ar9UMY", + "_score": 0.5, + "_source": { + "doc_price": 100, + "doc_index": 4976, + "doc_keyword": "workable", + "category": "permission" + } + }, + { + "_index": "my-nlp-index", + "_id": "m3RPNY4BlN82W_Ar9UMY", + "_score": 0.5, + "_source": { + "doc_price": 30, + "doc_index": 9871, + "category": "editor" + } + }, + { + "_index": "my-nlp-index", + "_id": "nXRPNY4BlN82W_Ar9UMY", + "_score": 0.5, + "_source": { + "doc_price": 200, + "doc_index": 5212, + "doc_keyword": "idea", + "category": "statement" + } + }, + { + "_index": "my-nlp-index", + "_id": "nHRPNY4BlN82W_Ar9UMY", + "_score": 0.5, + "_source": { + "doc_price": 350, + "doc_index": 8242, + "doc_keyword": "entire", + "category": "statement" + } + } + ] + }, + "aggregations": { + "total_price": { + "value": 680 + }, + "doc_keywords": { + "doc_count_error_upper_bound": 0, + "sum_other_doc_count": 0, + "buckets": [ + { + "key": "entire", + "doc_count": 1 + }, + { + "key": "idea", + "doc_count": 1 + }, + { + "key": "workable", + "doc_count": 1 + } + ] + } + } +} +``` \ No newline at end of file diff --git a/_search-plugins/knn/approximate-knn.md 
b/_search-plugins/knn/approximate-knn.md index 99cb9e6767..16d1a7e686 100644 --- a/_search-plugins/knn/approximate-knn.md +++ b/_search-plugins/knn/approximate-knn.md @@ -303,3 +303,8 @@ The cosine similarity formula does not include the `1 -` prefix. However, becaus smaller scores with closer results, they return `1 - cosineSimilarity` for cosine similarity space---that's why `1 -` is included in the distance function. {: .note } + +With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of +such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests +containing the zero vector will be rejected and a corresponding exception will be thrown. +{: .note } \ No newline at end of file diff --git a/_search-plugins/knn/knn-score-script.md b/_search-plugins/knn/knn-score-script.md index 14027d6cc8..cc79e90850 100644 --- a/_search-plugins/knn/knn-score-script.md +++ b/_search-plugins/knn/knn-score-script.md @@ -328,3 +328,8 @@ A space corresponds to the function used to measure the distance between two poi Cosine similarity returns a number between -1 and 1, and because OpenSearch relevance scores can't be below 0, the k-NN plugin adds 1 to get the final score. + +With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...`]) as input. This is because the magnitude of +such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests +containing the zero vector will be rejected and a corresponding exception will be thrown. +{: .note } \ No newline at end of file diff --git a/_search-plugins/knn/painless-functions.md b/_search-plugins/knn/painless-functions.md index 2b28f753ef..1f27cc29a6 100644 --- a/_search-plugins/knn/painless-functions.md +++ b/_search-plugins/knn/painless-functions.md @@ -67,3 +67,8 @@ cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector fie ``` Because scores can only be positive, this script ranks documents with vector fields higher than those without. + +With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...`]) as input. This is because the magnitude of +such a vector is 0, which raises a `divide by 0` exception when computing the value. Requests +containing the zero vector will be rejected and a corresponding exception will be thrown. +{: .note } \ No newline at end of file diff --git a/_search-plugins/neural-sparse-search.md b/_search-plugins/neural-sparse-search.md index 31ae43991e..88d30e4391 100644 --- a/_search-plugins/neural-sparse-search.md +++ b/_search-plugins/neural-sparse-search.md @@ -55,6 +55,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse ``` {% include copy-curl.html %} +To split long text into passages, use the `text_chunking` ingest processor before the `sparse_encoding` processor. For more information, see [Chaining text chunking and embedding processors]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/#chaining-text-chunking-and-embedding-processors). + ## Step 2: Create an index for ingestion In order to use the text embedding processor defined in your pipeline, create a rank features index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as [`rank_features`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/rank/#rank-features). 
Similarly, the `passage_text` field should be mapped as `text`. @@ -237,3 +239,129 @@ The response contains the matching documents: } } ``` + +## Setting a default model on an index or field + +A [`neural_sparse`]({{site.url}}{{site.baseurl}}/query-dsl/specialized/neural-sparse/) query requires a model ID for generating sparse embeddings. To eliminate passing the model ID with each neural_sparse query request, you can set a default model on index-level or field-level. + +First, create a [search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/index/) with a [`neural_query_enricher`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/neural-query-enricher/) request processor. To set a default model for an index, provide the model ID in the `default_model_id` parameter. To set a default model for a specific field, provide the field name and the corresponding model ID in the `neural_field_default_id` map. If you provide both `default_model_id` and `neural_field_default_id`, `neural_field_default_id` takes precedence: + +```json +PUT /_search/pipeline/default_model_pipeline +{ + "request_processors": [ + { + "neural_query_enricher" : { + "default_model_id": "bQ1J8ooBpBj3wT4HVUsb", + "neural_field_default_id": { + "my_field_1": "uZj0qYoBMtvQlfhaYeud", + "my_field_2": "upj0qYoBMtvQlfhaZOuM" + } + } + } + ] +} +``` +{% include copy-curl.html %} + +Then set the default model for your index: + +```json +PUT /my-nlp-index/_settings +{ + "index.search.default_pipeline" : "default_model_pipeline" +} +``` +{% include copy-curl.html %} + +You can now omit the model ID when searching: + +```json +GET /my-nlp-index/_search +{ + "query": { + "neural_sparse": { + "passage_embedding": { + "query_text": "Hi world" + } + } + } +} +``` +{% include copy-curl.html %} + +The response contains both documents: + +```json +{ + "took" : 688, + "timed_out" : false, + "_shards" : { + "total" : 1, + "successful" : 1, + "skipped" : 0, + "failed" : 0 + }, + "hits" : { + "total" : { + "value" : 2, + "relation" : "eq" + }, + "max_score" : 30.0029, + "hits" : [ + { + "_index" : "my-nlp-index", + "_id" : "1", + "_score" : 30.0029, + "_source" : { + "passage_text" : "Hello world", + "passage_embedding" : { + "!" : 0.8708904, + "door" : 0.8587369, + "hi" : 2.3929274, + "worlds" : 2.7839446, + "yes" : 0.75845814, + "##world" : 2.5432441, + "born" : 0.2682308, + "nothing" : 0.8625516, + "goodbye" : 0.17146169, + "greeting" : 0.96817183, + "birth" : 1.2788506, + "come" : 0.1623208, + "global" : 0.4371151, + "it" : 0.42951578, + "life" : 1.5750692, + "thanks" : 0.26481047, + "world" : 4.7300377, + "tiny" : 0.5462298, + "earth" : 2.6555297, + "universe" : 2.0308156, + "worldwide" : 1.3903781, + "hello" : 6.696973, + "so" : 0.20279501, + "?" : 0.67785245 + }, + "id" : "s1" + } + }, + { + "_index" : "my-nlp-index", + "_id" : "2", + "_score" : 16.480486, + "_source" : { + "passage_text" : "Hi planet", + "passage_embedding" : { + "hi" : 4.338913, + "planets" : 2.7755864, + "planet" : 5.0969057, + "mars" : 1.7405145, + "earth" : 2.6087382, + "hello" : 3.3210192 + }, + "id" : "s2" + } + } + ] + } +} +``` \ No newline at end of file diff --git a/_search-plugins/search-pipelines/search-processors.md b/_search-plugins/search-pipelines/search-processors.md index 36b848e6eb..5e53cf5615 100644 --- a/_search-plugins/search-pipelines/search-processors.md +++ b/_search-plugins/search-pipelines/search-processors.md @@ -24,7 +24,7 @@ The following table lists all supported search request processors. 
Processor | Description | Earliest available version :--- | :--- | :--- [`filter_query`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/filter-query-processor/) | Adds a filtering query that is used to filter requests. | 2.8 -[`neural_query_enricher`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/neural-query-enricher/) | Sets a default model for neural search at the index or field level. | 2.11 +[`neural_query_enricher`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/neural-query-enricher/) | Sets a default model for neural search and neural sparse search at the index or field level. | 2.11(neural), 2.13(neural sparse) [`script`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/script-processor/) | Adds a script that is run on newly indexed documents. | 2.8 [`oversample`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/oversample-processor/) | Increases the search request `size` parameter, storing the original value in the pipeline state. | 2.12 diff --git a/_search-plugins/semantic-search.md b/_search-plugins/semantic-search.md index f4753bee1c..32bd18cd6c 100644 --- a/_search-plugins/semantic-search.md +++ b/_search-plugins/semantic-search.md @@ -48,6 +48,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline ``` {% include copy-curl.html %} +To split long text into passages, use the `text_chunking` ingest processor before the `text_embedding` processor. For more information, see [Chaining text chunking and embedding processors]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/#chaining-text-chunking-and-embedding-processors). + ## Step 2: Create an index for ingestion In order to use the text embedding processor defined in your pipeline, create a k-NN index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension. Similarly, the `passage_text` field should be mapped as `text`. diff --git a/_search-plugins/sql/ppl/index.md b/_search-plugins/sql/ppl/index.md index c39e3429e1..56ffebf555 100644 --- a/_search-plugins/sql/ppl/index.md +++ b/_search-plugins/sql/ppl/index.md @@ -12,6 +12,8 @@ redirect_from: - /search-plugins/ppl/index/ - /search-plugins/ppl/endpoint/ - /search-plugins/ppl/protocol/ + - /search-plugins/sql/ppl/index/ + - /observability-plugin/ppl/index/ --- # PPL diff --git a/_security-analytics/api-tools/alert-finding-api.md b/_security-analytics/api-tools/alert-finding-api.md index a22b601b08..f2631f2a50 100644 --- a/_security-analytics/api-tools/alert-finding-api.md +++ b/_security-analytics/api-tools/alert-finding-api.md @@ -149,13 +149,230 @@ You can specify the following parameters when getting findings. Parameter | Description :--- | :--- -`detector_id` | The ID of the detector used to fetch alerts. Optional when the `detectorType` is specified. Otherwise required. -`detectorType` | The type of detector used to fetch alerts. Optional when the `detector_id` is specified. Otherwise required. +`detector_id` | The ID of the detector used to fetch alerts. Optional. +`detectorType` | The type of detector used to fetch alerts. Optional. `sortOrder` | The order used to sort the list of findings. Possible values are `asc` or `desc`. Optional. `size` | An optional limit for the maximum number of results returned in the response. Optional. 
+`startIndex` | The pagination indicator. Optional. +`detectionType` | The detection rule type that dictates the retrieval type for the findings. When the detection type is `threat`, it fetches threat intelligence feeds. When the detection type is `rule`, findings are fetched based on the detector's rule. Optional. +`severity` | The severity of the detector rule used to fetch alerts. Severity can be `critical`, `high`, `medium`, or `low`. Optional. ### Example request +```json +GET /_plugins/_security_analytics/findings/_search +{ + "total_findings": 2, + "findings": [ + { + "detectorId": "b9ZN040Bjlggkcgx1d1W", + "id": "35efb736-c5d9-499d-b9b5-31f0a7d61251", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "QdZN040Bjlggkcgxdd3X", + "name": "QdZN040Bjlggkcgxdd3X", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "high", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + }, + { + "detectorId": "O9ZM040Bjlggkcgx6N1S", + "id": "a5022930-4503-4ca8-bf0a-320a2b1fb433", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "KtZM040Bjlggkcgxkd04", + "name": "KtZM040Bjlggkcgxkd04", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "critical", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + } + ] +} + +``` + +```json +GET /_plugins/_security_analytics/findings/_search?severity=high +{ + "total_findings": 1, + "findings": [ + { + "detectorId": "b9ZN040Bjlggkcgx1d1W", + "id": "35efb736-c5d9-499d-b9b5-31f0a7d61251", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "QdZN040Bjlggkcgxdd3X", + "name": "QdZN040Bjlggkcgxdd3X", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "high", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + } + ] +} + +``` + +```json +GET /_plugins/_security_analytics/findings/_search?detectionType=rule +{ + "total_findings": 2, + "findings": [ + { + "detectorId": "b9ZN040Bjlggkcgx1d1W", + "id": "35efb736-c5d9-499d-b9b5-31f0a7d61251", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "QdZN040Bjlggkcgxdd3X", + "name": "QdZN040Bjlggkcgxdd3X", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "high", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + }, + { + "detectorId": "O9ZM040Bjlggkcgx6N1S", + "id": "a5022930-4503-4ca8-bf0a-320a2b1fb433", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "KtZM040Bjlggkcgxkd04", + "name": "KtZM040Bjlggkcgxkd04", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "critical", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + } + ] +} + + +``` +```json +GET /_plugins/_security_analytics/findings/_search?detectionType=rule&severity=high +{ + "total_findings": 1, + "findings": [ + { + "detectorId": "b9ZN040Bjlggkcgx1d1W", + "id": 
"35efb736-c5d9-499d-b9b5-31f0a7d61251", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "QdZN040Bjlggkcgxdd3X", + "name": "QdZN040Bjlggkcgxdd3X", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "high", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + } + ] +} + +``` + ```json GET /_plugins/_security_analytics/findings/_search?*detectorType*= { diff --git a/_security/access-control/anonymous-authentication.md b/_security/access-control/anonymous-authentication.md index 429daafb9b..cb2f951546 100644 --- a/_security/access-control/anonymous-authentication.md +++ b/_security/access-control/anonymous-authentication.md @@ -30,6 +30,19 @@ The following table describes the `anonymous_auth_enabled` setting. For more inf If you disable anonymous authentication, you must provide at least one `authc` in order for the Security plugin to initialize successfully. {: .important } +## OpenSearch Dashboards configuration + +To enable anonymous authentication for OpenSearch Dashboards, you need to modify the `opensearch_dashboards.yml` file in the configuration directory of your OpenSearch Dashboards installation. + +Add the following setting to `opensearch_dashboards.yml`: + +```yml +opensearch_security.auth.anonymous_auth_enabled: true +``` + +Anonymous login for OpenSearch Dashboards requires anonymous authentication to be enabled on the OpenSearch cluster. +{: .important} + ## Defining anonymous authentication privileges When anonymous authentication is enabled, your defined HTTP authenticators still try to find user credentials inside your HTTP request. If credentials are found, the user is authenticated. If none are found, the user is authenticated as an `anonymous` user. diff --git a/_security/access-control/api.md b/_security/access-control/api.md index acbdb5e0be..8a464bdeb1 100644 --- a/_security/access-control/api.md +++ b/_security/access-control/api.md @@ -1297,6 +1297,91 @@ PATCH _plugins/_security/api/securityconfig } ``` +### Configuration upgrade check + +Introduced 2.13 +{: .label .label-purple } + +Checks the current configuration bundled with the host's Security plugin and compares it to the version of the OpenSearch Security plugin the user downloaded. Then, the API responds indicating whether or not an upgrade can be performed and what resources can be updated. + +With each new OpenSearch version, there are changes to the default security configuration. This endpoint helps cluster operators determine whether the cluster is missing defaults or has stale definitions of defaults. +{: .note} + +#### Request + +```json +GET _plugins/_security/api/_upgrade_check +``` +{% include copy-curl.html %} + +#### Example response + +```json +{ + "status" : "OK", + "upgradeAvailable" : true, + "upgradeActions" : { + "roles" : { + "add" : [ "flow_framework_full_access" ] + } + } +} +``` + +#### Response fields + +| Field | Data type | Description | +|:---------|:-----------|:------------------------------| +| `upgradeAvailable` | Boolean | Responds with `true` when an upgrade to the security configuration is available. | +| `upgradeActions` | Object list | A list of security objects that would be modified when upgrading the host's Security plugin. 
| + +### Configuration upgrade + +Introduced 2.13 +{: .label .label-purple } + +Adds and updates resources on a host's existing security configuration from the configuration bundled with the latest version of the Security plugin. + +These bundled configuration files can be found in the `/security/config` directory. Default configuration files are updated when OpenSearch is upgraded, whereas the cluster configuration is only updated by the cluster operators. This endpoint helps cluster operator upgrade missing defaults and stale default definitions. + + +#### Request + +```json +POST _plugins/_security/api/_upgrade_perform +{ + "configs" : [ "roles" ] +} +``` +{% include copy-curl.html %} + +#### Request fields + +| Field | Data type | Description | Required | +|:----------------|:-----------|:------------------------------------------------------------------------------------------------------------------|:---------| +| `configs` | Array | Specifies the configurations to be upgraded. This field can include any combination of the following configurations: `actiongroups`,`allowlist`, `audit`, `internalusers`, `nodesdn`, `roles`, `rolesmappings`, `tenants`.
+
+
+#### Example response
+
+```json
+{
+  "status" : "OK",
+  "upgrades" : {
+    "roles" : {
+      "add" : [ "flow_framework_full_access" ]
+    }
+  }
+}
+```
+
+#### Response fields
+
+| Field | Data type | Description |
+|:---------|:-----------|:------------------------------|
+| `upgrades` | Object | A container for the upgrade results, organized by configuration type, such as `roles`. Each changed configuration type is represented as a key in this object. |
+| `roles` | Object | Lists the role objects modified by the upgrade, grouped by the action taken, such as `add`. |
+
---

## Distinguished names
diff --git a/_security/access-control/document-level-security.md b/_security/access-control/document-level-security.md
index 3f2049a1e2..be5fe7e0da 100644
--- a/_security/access-control/document-level-security.md
+++ b/_security/access-control/document-level-security.md
@@ -10,30 +10,31 @@ redirect_from:

# Document-level security (DLS)

-Document-level security lets you restrict a role to a subset of documents in an index. The easiest way to get started with document- and field-level security is to open OpenSearch Dashboards and choose **Security**. Then choose **Roles**, create a new role, and review the **Index permissions** section.
-
-![Document- and field-level security screen in OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/images/security-dls.png)
-
-
-## Simple roles
-
-Document-level security uses the OpenSearch query DSL to define which documents a role grants access to. In OpenSearch Dashboards, choose an index pattern and provide a query in the **Document level security** section:
-
-```json
-{
-  "bool": {
-    "must": {
-      "match": {
-        "genres": "Comedy"
-      }
-    }
-  }
-}
-```
-
-This query specifies that for the role to have access to a document, its `genres` field must include `Comedy`.
-
-A typical request to the `_search` API includes `{ "query": { ... } }` around the query, but in this case, you only need to specify the query itself.
+Document-level security lets you restrict a role to a subset of documents in an index.
+For more information about OpenSearch users and roles, see the [Create roles](https://opensearch.org/docs/latest/security/access-control/users-roles/#create-roles) documentation.
+
+Use the following steps to get started with document-level and field-level security:
+1. Open OpenSearch Dashboards.
+2. Choose **Security** > **Roles**.
+3. Select **Create Role** and provide a name for the role.
+4. Review the **Index permissions** section and add any necessary [index permissions](https://opensearch.org/docs/latest/security/access-control/permissions/) for the role.
+5. Add document-level security by providing a domain-specific language (DSL) query in the `Document level security - optional` section. A typical request sent to the `_search` API includes `{ "query": { ... } }` around the query, but with document-level security in OpenSearch Dashboards, you only need to specify the query itself. For example, the following DSL query specifies that for the new role to have access to a document, the document's `genres` field must include `Comedy` (a variation that uses parameter substitution appears after these steps):
+
+    ```json
+    {
+      "bool": {
+        "must": {
+          "match": {
+            "genres": "Comedy"
+          }
+        }
+      }
+    }
+    ```
+
+    - ![Document- and field-level security screen in OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/images/security-dls.png)
+
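The `Comedy` query above matches on a static value. DLS queries also support parameter substitution, so a single query can scope results to the logged-in user. The sketch below assumes a hypothetical `owner` field that stores each document's owning username; `${user.name}` is replaced with the authenticated user's name when the query is evaluated:

```json
{
  "bool": {
    "must": {
      "match": {
        "owner": "${user.name}"
      }
    }
  }
}
```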
+## Updating roles by accessing the REST API

In the REST API, you provide the query as a string, so you must escape your quotes. This role allows a user to read any document in any index with the field `public` set to `true`:
diff --git a/_security/access-control/permissions.md b/_security/access-control/permissions.md
index 4f8df5e042..0b2d609c35 100644
--- a/_security/access-control/permissions.md
+++ b/_security/access-control/permissions.md
@@ -124,7 +124,7 @@ green open .kibana_3 XmTePICFRoSNf5O5uLgwRw 1 1 220 0 468.3kb 232.1kb

### Enabling system index permissions

-Users that have the permission [`restapi:admin/roles`]({{site.url}}{{site.baseurl}}/security/access-control/api/#access-control-for-the-api) are able to map system index permissions to all users in the same way they would for a cluster or index permission in the `roles.yml` file. However, to preserve some control over this permission, the `plugins.security.system_indices.permissions.enabled` setting allows you to enable or disable the system index permissions feature. This setting is disabled by default. To enable the system index permissions feature, set `plugins.security.system_indices.permissions.enabled` to `true`. For more information about this setting, see [Enabling user access to system indexes]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#enabling-user-access-to-system-indexes).
+Users that have the permission [`restapi:admin/roles`]({{site.url}}{{site.baseurl}}/security/access-control/api/#access-control-for-the-api) are able to map system index permissions to all users in the same way they would for a cluster or index permission in the `roles.yml` file. However, to preserve some control over this permission, the `plugins.security.system_indices.permission.enabled` setting allows you to enable or disable the system index permissions feature. This setting is disabled by default. To enable the system index permissions feature, set `plugins.security.system_indices.permission.enabled` to `true`. For more information about this setting, see [Enabling user access to system indexes]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#enabling-user-access-to-system-indexes).

Keep in mind that enabling this feature and mapping system index permissions to normal users gives those users access to indexes that may contain sensitive information and configurations essential to a cluster's health. We also recommend caution when mapping users to `restapi:admin/roles` because this permission gives a user not only the ability to assign the system index permission to another user but also the ability to self-assign access to any system index.
{: .warning }
diff --git a/_security/access-control/users-roles.md b/_security/access-control/users-roles.md
index 3b728029f8..ae7670bc29 100644
--- a/_security/access-control/users-roles.md
+++ b/_security/access-control/users-roles.md
@@ -14,6 +14,23 @@ The Security plugin includes an internal user database. Use this database in pla

Roles are the core way of controlling access to your cluster. Roles contain any combination of cluster-wide permissions, index-specific permissions, document- and field-level security, and tenants. Then you map users to these roles so that users gain those permissions.

+## Creating and editing OpenSearch roles
+
+You can create and edit OpenSearch roles by using one of the following methods.
+
+### Using the API
+
+You can send HTTP requests to OpenSearch-provided endpoints to update security roles, permissions, and associated settings. This method offers granular control and automation capabilities for managing roles.
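As a sketch of the API method, the following request uses the Create role API to define a role that grants read access to a set of indexes. The role name `hr_reader` and the index pattern `hr-*` are hypothetical placeholders; see the Create role API documentation for the full set of request fields:

```json
PUT _plugins/_security/api/roles/hr_reader
{
  "cluster_permissions": [ "cluster_composite_ops_ro" ],
  "index_permissions": [
    {
      "index_patterns": [ "hr-*" ],
      "allowed_actions": [ "read" ]
    }
  ]
}
```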
+
+### Using the UI (OpenSearch Dashboards)
+
+OpenSearch Dashboards provides a user-friendly interface for managing roles. Roles, permissions, and document-level security settings are configured in the Security section of OpenSearch Dashboards. When updating roles through the UI, OpenSearch Dashboards calls the API in the background to implement the changes.
+
+### Editing the `roles.yml` file
+
+If you want more granular control of your security configuration, you can edit roles and their associated permissions in the `roles.yml` file. This method provides direct access to the underlying configuration and can be version controlled for use in collaborative development environments.
+For more information about creating roles, see the [Create roles](https://opensearch.org/docs/latest/security/access-control/users-roles/#create-roles) documentation.
+
Unless you need to create new [reserved or hidden users]({{site.url}}{{site.baseurl}}/security/access-control/api/#reserved-and-hidden-resources), we **highly** recommend using OpenSearch Dashboards or the REST API to create new users, roles, and role mappings. The `.yml` files are for initial setup, not ongoing use.
{: .warning }

@@ -75,6 +92,24 @@ See [YAML files]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#roles

See [Create role]({{site.url}}{{site.baseurl}}/security/access-control/api/#create-role).

+## Edit roles
+
+You can edit roles using one of the following methods.
+
+### OpenSearch Dashboards
+
+1. Choose **Security** > **Roles**. In the **Create role** section, select **Explore existing roles**.
+1. Select the role you want to edit.
+1. Choose **edit role**. Make any necessary updates to the role.
+1. To save your changes, select **Update**.
+
+### roles.yml
+
+See [YAML files]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#rolesyml).
+
+### REST API
+
+See [Create role]({{site.url}}{{site.baseurl}}/security/access-control/api/#create-role).

## Map users to roles
diff --git a/_security/configuration/yaml.md b/_security/configuration/yaml.md
index 258866a7f8..af60238b42 100644
--- a/_security/configuration/yaml.md
+++ b/_security/configuration/yaml.md
@@ -139,12 +139,12 @@ plugins.security.cache.ttl_minutes: 60

### Enabling user access to system indexes

-Mapping a system index permission to a user allows that user to modify the system index specified in the permission's name (the one exception is the Security plugin's [system index]({{site.url}}{{site.baseurl}}/security/configuration/system-indices/)). The `plugins.security.system_indices.permissions.enabled` setting provides a way for administrators to make this permission available for or hidden from role mapping.
+Mapping a system index permission to a user allows that user to modify the system index specified in the permission's name (the one exception is the Security plugin's [system index]({{site.url}}{{site.baseurl}}/security/configuration/system-indices/)). The `plugins.security.system_indices.permission.enabled` setting provides a way for administrators to make this permission available for or hidden from role mapping.

When set to `true`, the feature is enabled and users with permission to modify roles can create roles that include permissions that grant access to system indexes:

```yml
-plugins.security.system_indices.permissions.enabled: true
+plugins.security.system_indices.permission.enabled: true
```

When set to `false`, the permission is disabled and only admins with an admin certificate can make changes to system indexes. By default, the permission is set to `false` in a new cluster.
diff --git a/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md b/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md
index 3eb40fe2ed..7cc533fe76 100644
--- a/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md
+++ b/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md
@@ -24,8 +24,12 @@ _Cluster state_ is an internal data structure that contains the metadata of the
The cluster state metadata is managed by the elected cluster manager node and is essential for the cluster to properly function. When the cluster loses the majority of the cluster manager nodes permanently, then the cluster may experience data loss because the latest cluster state metadata might not be present in the surviving cluster manager nodes. Persisting the state of all the cluster manager nodes in the cluster to remote-backed storage provides better durability. When the remote cluster state feature is enabled, the cluster metadata will be published to a remote repository configured in the cluster.

-Any time new cluster manager nodes are launched after disaster recovery, the nodes will automatically bootstrap using the latest metadata stored in the remote repository.
-After the metadata is restored automatically from the latest metadata stored, and if the data nodes are unchanged in the index data, the metadata lost will be automatically recovered. However, if the data nodes have been replaced, then you can restore the index data by invoking the `_remotestore/_restore` API as described in the [remote store documentation]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/index/).
+Any time new cluster manager nodes are launched after disaster recovery, the nodes will automatically bootstrap using the latest metadata stored in the remote repository. This provides metadata durability.
+
+You can enable remote cluster state independently of remote-backed data storage.
+{: .note}
+
+If you require data durability, you must enable remote-backed data storage as described in the [remote store documentation]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/index/).

## Configuring the remote cluster state

@@ -59,4 +63,3 @@ Setting | Default | Description
The remote cluster state functionality has the following limitations:

 - Unsafe bootstrap scripts cannot be run when the remote cluster state is enabled. When a majority of cluster-manager nodes are lost and the cluster goes down, the user needs to replace any remaining cluster manager nodes and reseed the nodes in order to bootstrap a new cluster.
-- The remote cluster state cannot be enabled without first configuring remote-backed storage.
diff --git a/images/dashboards/multidata-hide-localcluster.gif b/images/dashboards/multidata-hide-localcluster.gif
new file mode 100644
index 0000000000..b778063943
Binary files /dev/null and b/images/dashboards/multidata-hide-localcluster.gif differ
diff --git a/images/dashboards/multidata-hide-show-auth.gif b/images/dashboards/multidata-hide-show-auth.gif
new file mode 100644
index 0000000000..9f1f945c44
Binary files /dev/null and b/images/dashboards/multidata-hide-show-auth.gif differ
diff --git a/images/dashboards/vega-2.png b/images/dashboards/vega-2.png
new file mode 100644
index 0000000000..1faa3a6e67
Binary files /dev/null and b/images/dashboards/vega-2.png differ