diff --git a/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt b/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt index 0a14e17e7d..091f2d2534 100644 --- a/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt +++ b/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt @@ -26,6 +26,7 @@ Boolean Dev [Dd]iscoverability Distro +[Dd]ownvote(s|d)? [Dd]uplicative [Ee]gress [Ee]num @@ -122,6 +123,7 @@ stdout [Ss]ubvector [Ss]ubwords? [Ss]uperset +[Ss]yslog tebibyte [Tt]emplated [Tt]okenization @@ -138,6 +140,7 @@ tebibyte [Uu]nregister(s|ed|ing)? [Uu]pdatable [Uu]psert +[Uu]pvote(s|d)? [Ww]alkthrough [Ww]ebpage xy \ No newline at end of file diff --git a/.github/workflows/vale.yml b/.github/workflows/vale.yml index 2eee5d82fb..515d974133 100644 --- a/.github/workflows/vale.yml +++ b/.github/workflows/vale.yml @@ -20,4 +20,5 @@ jobs: reporter: github-pr-check filter_mode: added vale_flags: "--no-exit" - version: 2.28.0 \ No newline at end of file + version: 2.28.0 + continue-on-error: true diff --git a/TERMS.md b/TERMS.md index 8fc1ba0162..e12cc171ed 100644 --- a/TERMS.md +++ b/TERMS.md @@ -236,6 +236,8 @@ Do not use *disable* to refer to users. Always hyphenated. Don’t use _double click_. +**downvote** + **dropdown list** **due to** @@ -586,6 +588,10 @@ Use % in headlines, quotations, and tables or in technical copy. An agent and REST API that allows you to query numerous performance metrics for your cluster, including aggregations of those metrics, independent of the Java Virtual Machine (JVM). +**plaintext, plain text** + +Use *plaintext* only to refer to nonencrypted or decrypted text in content about encryption. Use *plain text* to refer to ASCII files. + **please** Avoid using except in quoted text. @@ -700,6 +706,8 @@ Never hyphenated. Use _startup_ as a noun (for example, “The following startup **Stochastic Gradient Descent (SGD)** +**syslog** + ## T **term frequency–inverse document frequency (TF–IDF)** @@ -746,6 +754,8 @@ A storage tier that you can use to store and analyze your data with Elasticsearc Hyphenate as adjectives. Use instead of *top left* and *top right*, unless the field name uses *top*. For example, "The upper-right corner." +**upvote** + **US** No periods, as specified in the Chicago Manual of Style. diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index ba09a7fa30..e6d9875736 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -13,52 +13,53 @@ Token filters receive the stream of tokens from the tokenizer and add, remove, o The following table lists all token filters that OpenSearch supports. Token filter | Underlying Lucene token filter| Description -`apostrophe` | [ApostropheFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token that contains an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following the apostrophe. -`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters. -`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. 
-`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. -`classic` | [ClassicFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/standard/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. -`common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. -`conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script. -`decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9). -`delimited_payload` | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters before the delimiter, and a payload consists of all characters after the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload. +`apostrophe` | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token that contains an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following the apostrophe. +`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters. +`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. +`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. +`classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. +`common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. +`conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script. +`decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9). +`delimited_payload` | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters before the delimiter, and a payload consists of all characters after the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload. [`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | [DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency. -`dictionary_decompounder` | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages. -`edge_ngram` | [EdgeNGramTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. -`elision` | [ElisionFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). -`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. 
-`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. -`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. +`dictionary_decompounder` | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages. +`edge_ngram` | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. +`elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). +`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. +`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. +`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. -`keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. -`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. 
-`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed. -`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword. -`kstem` | [KStemFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary. -`length` | [LengthFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens whose lengths are shorter or longer than the length range specified by `min` and `max`. -`limit` | [LimitTokenCountFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. A common use case is to limit the size of document field values based on token count. -`lowercase` | [LowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)). -`min_hash` | [MinHashFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. Performs the following operations on a token stream sequentially:
1. Hashes each token in the stream.
2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket.
3. Outputs the smallest hash from each bucket as a token stream. +`keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. +`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. +`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed. +`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword. +`kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary. +`kuromoji_completion` | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a `mode` parameter, which should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins). +`length` | [LengthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens whose lengths are shorter or longer than the length range specified by `min` and `max`. +`limit` | [LimitTokenCountFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. A common use case is to limit the size of document field values based on token count. +`lowercase` | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)). +`min_hash` | [MinHashFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. 
Performs the following operations on a token stream sequentially:
1. Hashes each token in the stream.
2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket.
3. Outputs the smallest hash from each bucket as a token stream. `multiplexer` | N/A | Emits multiple tokens at the same position. Runs each token through each of the specified filter lists separately and outputs the results as separate tokens. -`ngram` | [NGramTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) | Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`. -Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizer.html)
`german_normalization`: [GermanNormalizationFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html)
`hindi_normalization`: [HindiNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizer.html)
`indic_normalization`: [IndicNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizer.html)
`sorani_normalization`: [SoraniNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizer.html)
`persian_normalization`: [PersianNormalizer](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizer.html)
`scandinavian_normalization` : [ScandinavianNormalizationFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html)
`scandinavian_folding`: [ScandinavianFoldingFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html)
`serbian_normalization`: [SerbianNormalizationFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/sr/SerbianNormalizationFilter.html) | Normalizes the characters of one of the listed languages. +`ngram` | [NGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) | Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`. +Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ar/ArabicNormalizer.html)
`german_normalization`: [GermanNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html)
`hindi_normalization`: [HindiNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hi/HindiNormalizer.html)
`indic_normalization`: [IndicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/in/IndicNormalizer.html)
`sorani_normalization`: [SoraniNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ckb/SoraniNormalizer.html)
`persian_normalization`: [PersianNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/fa/PersianNormalizer.html)
`scandinavian_normalization` : [ScandinavianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html)
`scandinavian_folding`: [ScandinavianFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html)
`serbian_normalization`: [SerbianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/sr/SerbianNormalizationFilter.html) | Normalizes the characters of one of the listed languages. `pattern_capture` | N/A | Generates a token for every capture group in the provided regular expression. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). `pattern_replace` | N/A | Matches a pattern in the provided regular expression and replaces matching substrings. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). `phonetic` | N/A | Uses a phonetic encoder to emit a metaphone token for each token in the token stream. Requires installing the `analysis-phonetic` plugin. -`porter_stem` | [PorterStemFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language. +`porter_stem` | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language. `predicate_token_filter` | N/A | Removes tokens that don’t match the specified predicate script. Supports inline Painless scripts only. -`remove_duplicates` | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position. -`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`. -`shingle` | [ShingleFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`]. +`remove_duplicates` | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position. +`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`. +`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`]. `snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). 
You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`. `stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`. `stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed. `stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream. `synonym` | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file. `synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process. -`trim` | [TrimFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space from each token in a stream. -`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit. +`trim` | [TrimFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space from each token in a stream. +`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit. `unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream. -`uppercase` | [UpperCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to uppercase. -`word_delimiter` | [WordDelimiterFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. -`word_delimiter_graph` | [WordDelimiterGraphFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. Assigns multi-position tokens a `positionLength` attribute. 
+`uppercase` | [UpperCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to uppercase. +`word_delimiter` | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. +`word_delimiter_graph` | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. Assigns multi-position tokens a `positionLength` attribute. diff --git a/_api-reference/index-apis/force-merge.md b/_api-reference/index-apis/force-merge.md index 6ad2e7f23c..6c2a61bef3 100644 --- a/_api-reference/index-apis/force-merge.md +++ b/_api-reference/index-apis/force-merge.md @@ -72,6 +72,7 @@ The following table lists the available query parameters. All query parameters a | `ignore_unavailable` | Boolean | If `true`, OpenSearch ignores missing or closed indexes. If `false`, OpenSearch returns an error if the force merge operation encounters missing or closed indexes. Default is `false`. | | `max_num_segments` | Integer | The number of larger segments into which smaller segments are merged. Set this parameter to `1` to merge all segments into one segment. The default behavior is to perform the merge as necessary. | | `only_expunge_deletes` | Boolean | If `true`, the merge operation only expunges segments containing a certain percentage of deleted documents. The percentage is 10% by default and is configurable in the `index.merge.policy.expunge_deletes_allowed` setting. Prior to OpenSearch 2.12, `only_expunge_deletes` ignored the `index.merge.policy.max_merged_segment` setting. Starting with OpenSearch 2.12, using `only_expunge_deletes` does not produce segments larger than `index.merge.policy.max_merged_segment` (by default, 5 GB). For more information, see [Deleted documents](#deleted-documents). Default is `false`. | +| `primary_only` | Boolean | If set to `true`, then the merge operation is performed only on the primary shards of an index. This can be useful when you want to take a snapshot of the index after the merge is complete. Snapshots only copy segments from the primary shards. Merging the primary shards can reduce resource consumption. Default is `false`. | #### Example request: Force merge a specific index @@ -101,6 +102,13 @@ POST /.testindex-logs/_forcemerge?max_num_segments=1 ``` {% include copy-curl.html %} +#### Example request: Force merge primary shards + +```json +POST /.testindex-logs/_forcemerge?primary_only=true +``` +{% include copy-curl.html %} + #### Example response ```json diff --git a/_automating-configurations/api/create-workflow.md b/_automating-configurations/api/create-workflow.md index 9353054113..e99a421fb9 100644 --- a/_automating-configurations/api/create-workflow.md +++ b/_automating-configurations/api/create-workflow.md @@ -7,9 +7,6 @@ nav_order: 10 # Create or update a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). 
-{: .warning} - Creating a workflow adds the content of a workflow template to the flow framework system index. You can provide workflows in JSON format (by specifying `Content-Type: application/json`) or YAML format (by specifying `Content-Type: application/yaml`). By default, the workflow is validated to help identify invalid configurations, including: * Workflow steps requiring an OpenSearch plugin that is not installed. @@ -19,6 +16,8 @@ Creating a workflow adds the content of a workflow template to the flow framewor To obtain the validation template for workflow steps, call the [Get Workflow Steps API]({{site.url}}{{site.baseurl}}/automating-configurations/api/get-workflow-steps/). +You can include placeholder expressions in the value of workflow step fields. For example, you can specify a credential field in a template as `openAI_key: '${{ openai_key }}'`. The expression will be substituted with the user-provided value during provisioning, using the format `${{ }}`. You can pass the actual key as a parameter using the [Provision Workflow API]({{site.url}}{{site.baseurl}}/automating-configurations/api/provision-workflow/) or using this API with the `provision` parameter set to `true`. + Once a workflow is created, provide its `workflow_id` to other APIs. The `POST` method creates a new workflow. The `PUT` method updates an existing workflow. @@ -59,12 +58,13 @@ POST /_plugins/_flow_framework/workflow?validation=none ``` {% include copy-curl.html %} -The following table lists the available query parameters. All query parameters are optional. +The following table lists the available query parameters. All query parameters are optional. User-provided parameters are only allowed if the `provision` parameter is set to `true`. | Parameter | Data type | Description | | :--- | :--- | :--- | | `provision` | Boolean | Whether to provision the workflow as part of the request. Default is `false`. | | `validation` | String | Whether to validate the workflow. Valid values are `all` (validate the template) and `none` (do not validate the template). Default is `all`. | +| User-provided substitution expressions | String | Parameters matching substitution expressions in the template. Only allowed if `provision` is set to `true`. Optional. If `provision` is set to `false`, you can pass these parameters in the [Provision Workflow API query parameters]({{site.url}}{{site.baseurl}}/automating-configurations/api/provision-workflow/#query-parameters). | ## Request fields diff --git a/_automating-configurations/api/delete-workflow.md b/_automating-configurations/api/delete-workflow.md index c1cee296f8..db3a340cee 100644 --- a/_automating-configurations/api/delete-workflow.md +++ b/_automating-configurations/api/delete-workflow.md @@ -7,9 +7,6 @@ nav_order: 80 # Delete a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - When you no longer need a workflow template, you can delete it by calling the Delete Workflow API. Note that deleting a workflow only deletes the stored template but does not deprovision its resources. 
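For example, the following request deletes a stored workflow template. This is a minimal sketch that assumes the same path pattern and sample `workflow_id` used in the other workflow API examples in this documentation; if the workflow's resources are still provisioned, call the Deprovision Workflow API first.

```json
DELETE /_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50
```
{% include copy-curl.html %}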
diff --git a/_automating-configurations/api/deprovision-workflow.md b/_automating-configurations/api/deprovision-workflow.md index cdd85ef4e9..e9219536ce 100644 --- a/_automating-configurations/api/deprovision-workflow.md +++ b/_automating-configurations/api/deprovision-workflow.md @@ -7,9 +7,6 @@ nav_order: 70 # Deprovision a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - When you no longer need a workflow, you can deprovision its resources. Most workflow steps that create a resource have corresponding workflow steps to reverse that action. To retrieve all resources currently created for a workflow, call the [Get Workflow Status API]({{site.url}}{{site.baseurl}}/automating-configurations/api/get-workflow-status/). When you call the Deprovision Workflow API, resources included in the `resources_created` field of the Get Workflow Status API response will be removed using a workflow step corresponding to the one that provisioned them. The workflow executes the provisioning workflow steps in reverse order. If failures occur because of resource dependencies, such as preventing deletion of a registered model if it is still deployed, the workflow attempts retries. diff --git a/_automating-configurations/api/get-workflow-status.md b/_automating-configurations/api/get-workflow-status.md index 03870af174..280fb52195 100644 --- a/_automating-configurations/api/get-workflow-status.md +++ b/_automating-configurations/api/get-workflow-status.md @@ -7,9 +7,6 @@ nav_order: 40 # Get a workflow status -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - [Provisioning a workflow]({{site.url}}{{site.baseurl}}/automating-configurations/api/provision-workflow/) may take a significant amount of time, particularly when the action is associated with OpenSearch indexing operations. The Get Workflow State API permits monitoring of the provisioning deployment status until it is complete. ## Path and HTTP methods diff --git a/_automating-configurations/api/get-workflow-steps.md b/_automating-configurations/api/get-workflow-steps.md index b4859da776..38059ec80c 100644 --- a/_automating-configurations/api/get-workflow-steps.md +++ b/_automating-configurations/api/get-workflow-steps.md @@ -7,10 +7,7 @@ nav_order: 50 # Get workflow steps -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - -OpenSearch validates workflows by using the validation template that lists the required inputs, generated outputs, and required plugins for all steps. For example, for the `register_remote_model` step, the validation template appears as follows: +This API returns a list of workflow steps, including their required inputs, outputs, default timeout values, and required plugins. 
For example, for the `register_remote_model` step, the Get Workflow Steps API returns the following information: ```json { @@ -28,36 +25,52 @@ OpenSearch validates workflows by using the validation template that lists the r ] } } -``` - -The Get Workflow Steps API retrieves this file. +``` ## Path and HTTP methods ```json GET /_plugins/_flow_framework/workflow/_steps +GET /_plugins/_flow_framework/workflow/_step?workflow_step= ``` +## Query parameters + +The following table lists the available query parameters. All query parameters are optional. + +| Parameter | Data type | Description | +| :--- | :--- | :--- | +| `workflow_step` | String | The name of the step to retrieve. Specify multiple step names as a comma-separated list. For example, `create_connector,delete_model,deploy_model`. | + #### Example request +To fetch all workflow steps, use the following request: + ```json GET /_plugins/_flow_framework/workflow/_steps +``` +{% include copy-curl.html %} + +To fetch specific workflow steps, pass the step names to the request as a query parameter: + +```json +GET /_plugins/_flow_framework/workflow/_step?workflow_step=create_connector,delete_model,deploy_model ``` {% include copy-curl.html %} #### Example response -OpenSearch responds with the validation template containing the steps. The order of fields in the returned steps may not exactly match the original JSON but will function identically. +OpenSearch responds with the workflow steps. The order of fields in the returned steps may not exactly match the original JSON but will function identically. To retrieve the template in YAML format, specify `Content-Type: application/yaml` in the request header: ```bash -curl -XGET "http://localhost:9200/_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50" -H 'Content-Type: application/yaml' +curl -XGET "http://localhost:9200/_plugins/_flow_framework/workflow/_steps" -H 'Content-Type: application/yaml' ``` To retrieve the template in JSON format, specify `Content-Type: application/json` in the request header: ```bash -curl -XGET "http://localhost:9200/_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50" -H 'Content-Type: application/json' +curl -XGET "http://localhost:9200/_plugins/_flow_framework/workflow/_steps" -H 'Content-Type: application/json' ``` \ No newline at end of file diff --git a/_automating-configurations/api/get-workflow.md b/_automating-configurations/api/get-workflow.md index b49858ffd9..7b1d5987c4 100644 --- a/_automating-configurations/api/get-workflow.md +++ b/_automating-configurations/api/get-workflow.md @@ -7,9 +7,6 @@ nav_order: 20 # Get a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - The Get Workflow API retrieves the workflow template. ## Path and HTTP methods diff --git a/_automating-configurations/api/index.md b/_automating-configurations/api/index.md index 5fb050539b..716e19c41f 100644 --- a/_automating-configurations/api/index.md +++ b/_automating-configurations/api/index.md @@ -8,9 +8,6 @@ has_toc: false # Workflow APIs -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). 
-{: .warning} - OpenSearch supports the following workflow APIs: * [Create or update workflow]({{site.url}}{{site.baseurl}}/automating-configurations/api/create-workflow/) diff --git a/_automating-configurations/api/provision-workflow.md b/_automating-configurations/api/provision-workflow.md index 5d2b59364c..62c4954ee9 100644 --- a/_automating-configurations/api/provision-workflow.md +++ b/_automating-configurations/api/provision-workflow.md @@ -7,9 +7,6 @@ nav_order: 30 # Provision a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - Provisioning a workflow is a one-time setup process usually performed by a cluster administrator to create resources that will be used by end users. The `workflows` template field may contain multiple workflows. The workflow with the `provision` key can be executed with this API. This API is also executed when the [Create or Update Workflow API]({{site.url}}{{site.baseurl}}/automating-configurations/api/create-workflow/) is called with the `provision` parameter set to `true`. @@ -31,10 +28,39 @@ The following table lists the available path parameters. | :--- | :--- | :--- | | `workflow_id` | String | The ID of the workflow to be provisioned. Required. | -#### Example request +## Query parameters + +If you have included a substitution expression in the template, you may pass it as a query parameter or as a string value of a request body field. For example, if you specified a credential field in a template as `openAI_key: '${{ openai_key }}'`, then you can include the `openai_key` parameter as a query parameter or body field so it can be substituted during provisioning. For example, the following request provides a query parameter: + +```json +POST /_plugins/_flow_framework/workflow//_provision?= +``` + +| Parameter | Data type | Description | +| :--- | :--- | :--- | +| User-provided substitution expressions | String | Parameters matching substitution expressions in the template. Optional. | + +#### Example requests + +```json +POST /_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50/_provision +``` +{% include copy-curl.html %} + +The following request substitutes the expression `${{ openai_key }}` with the value "12345" using a query parameter: + +```json +POST /_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50/_provision?openai_key=12345 +``` +{% include copy-curl.html %} + +The following request substitutes the expression `${{ openai_key }}` with the value "12345" using the request body: ```json POST /_plugins/_flow_framework/workflow/8xL8bowB8y25Tqfenm50/_provision +{ + "openai_key" : "12345" +} ``` {% include copy-curl.html %} diff --git a/_automating-configurations/api/search-workflow-state.md b/_automating-configurations/api/search-workflow-state.md index 9e21f14392..1cacb3a32b 100644 --- a/_automating-configurations/api/search-workflow-state.md +++ b/_automating-configurations/api/search-workflow-state.md @@ -7,9 +7,6 @@ nav_order: 65 # Search for a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). 
-{: .warning} - You can search for resources created by workflows by matching a query to a field. The fields you can search correspond to those returned by the [Get Workflow Status API]({{site.url}}{{site.baseurl}}/automating-configurations/api/get-workflow-status/). ## Path and HTTP methods diff --git a/_automating-configurations/api/search-workflow.md b/_automating-configurations/api/search-workflow.md index 7eb8890f7e..b78de9e9d2 100644 --- a/_automating-configurations/api/search-workflow.md +++ b/_automating-configurations/api/search-workflow.md @@ -7,9 +7,6 @@ nav_order: 60 # Search for a workflow -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - You can retrieve created workflows with their `workflow_id` or search for workflows by using a query matching a field. You can use the `use_case` field to search for similar workflows. ## Path and HTTP methods diff --git a/_automating-configurations/index.md b/_automating-configurations/index.md index 2b9ffdcf34..a7462ad16a 100644 --- a/_automating-configurations/index.md +++ b/_automating-configurations/index.md @@ -11,9 +11,6 @@ redirect_from: /automating-configurations/ **Introduced 2.12** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - You can automate complex OpenSearch setup and preprocessing tasks by providing templates for common use cases. For example, automating machine learning (ML) setup tasks streamlines the use of OpenSearch ML offerings. In OpenSearch 2.12, configuration automation is limited to ML tasks. diff --git a/_automating-configurations/workflow-settings.md b/_automating-configurations/workflow-settings.md index f3138d0ddc..78762fdfbb 100644 --- a/_automating-configurations/workflow-settings.md +++ b/_automating-configurations/workflow-settings.md @@ -6,9 +6,6 @@ nav_order: 30 # Workflow settings -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - The following keys represent configurable workflow settings. |Setting |Data type |Default value |Description | diff --git a/_automating-configurations/workflow-steps.md b/_automating-configurations/workflow-steps.md index 8565ccc29b..99c1f57993 100644 --- a/_automating-configurations/workflow-steps.md +++ b/_automating-configurations/workflow-steps.md @@ -6,9 +6,6 @@ nav_order: 10 # Workflow steps -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - _Workflow steps_ form basic "building blocks" for process automation. Most steps directly correspond to OpenSearch or plugin API operations, such as CRUD operations on machine learning (ML) connectors, models, and agents. 
Some steps simplify the configuration by reusing the body expected by these APIs across multiple steps. For example, once you configure a _tool_, you can use it with multiple _agents_. ## Workflow step fields @@ -42,6 +39,9 @@ The following table lists the workflow step types. The `user_inputs` fields for |`register_agent` |[Register Agent API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/) |Registers an agent as part of the ML Commons Agent Framework. | |`delete_agent` |[Delete Agent API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/) |Deletes an agent. | |`create_tool` |No API | A special-case non-API step encapsulating the specification of a tool for an agent in the ML Commons Agent Framework. These will be listed as `previous_node_inputs` for the appropriate register agent step, with the value set to `tools`. | +|`create_index`|[Create Index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/) | Creates a new OpenSearch index. The inputs include `index_name`, which should be the name of the index to be created, and `configurations`, which contains the payload body of a regular REST request for creating an index. +|`create_ingest_pipeline`|[Create Ingest Pipeline]({{site.url}}{{site.baseurl}}/ingest-pipelines/create-ingest/) | Creates or updates an ingest pipeline. The inputs include `pipeline_id`, which should be the ID of the pipeline, and `configurations`, which contains the payload body of a regular REST request for creating an ingest pipeline. +|`create_search_pipeline`|[Create Search Pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/creating-search-pipeline/) | Creates or updates a search pipeline. The inputs include `pipeline_id`, which should be the ID of the pipeline, and `configurations`, which contains the payload body of a regular REST request for creating a search pipeline. ## Additional fields diff --git a/_automating-configurations/workflow-tutorial.md b/_automating-configurations/workflow-tutorial.md index 99d84501e2..0074ad4691 100644 --- a/_automating-configurations/workflow-tutorial.md +++ b/_automating-configurations/workflow-tutorial.md @@ -6,9 +6,6 @@ nav_order: 20 # Workflow tutorial -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/flow-framework/issues/475). -{: .warning} - You can automate the setup of common use cases, such as conversational chat, using a Chain-of-Thought (CoT) agent. An _agent_ orchestrates and runs ML models and tools. A _tool_ performs a set of specific tasks. This page presents a complete example of setting up a CoT agent. For more information about agents and tools, see [Agents and tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/) The setup requires the following sequence of API requests, with provisioned resources used in subsequent requests. The following list provides an overview of the steps required for this workflow. 
The step names correspond to the names in the template: diff --git a/_dashboards/csp/csp-dynamic-configuration.md b/_dashboards/csp/csp-dynamic-configuration.md new file mode 100644 index 0000000000..2101a83734 --- /dev/null +++ b/_dashboards/csp/csp-dynamic-configuration.md @@ -0,0 +1,50 @@ +--- +layout: default +title: Configuring Content Security Policy rules dynamically +nav_order: 110 +has_children: false +--- + +# Configuring Content Security Policy rules dynamically +Introduced 2.13 +{: .label .label-purple } + +Content Security Policy (CSP) is a security standard intended to prevent cross-site scripting (XSS), `clickjacking`, and other code injection attacks resulting from the execution of malicious content in the trusted webpage context. OpenSearch Dashboards supports configuring CSP rules in the `opensearch_dashboards.yml` file by using the `csp.rules` key. A change in the YAML file requires a server restart, which may interrupt service availability. You can, however, configure the CSP rules dynamically through the `applicationConfig` plugin without restarting the server. + +## Configuration + +The `applicationConfig` plugin provides read and write APIs that allow OpenSearch Dashboards users to manage dynamic configurations as key-value pairs in an index. The `cspHandler` plugin registers a pre-response handler to `HttpServiceSetup`, which gets CSP rules from the dependent `applicationConfig` plugin and then rewrites to the CSP header. Enable both plugins within your `opensearch_dashboards.yml` file to use this feature. The configuration is shown in the following example. Refer to the `cspHandler` plugin [README](https://github.com/opensearch-project/OpenSearch-Dashboards/blob/main/src/plugins/csp_handler/README.md) for configuration details. + +``` +application_config.enabled: true +csp_handler.enabled: true +``` + +## Enable site embedding for OpenSearch Dashboards + +To enable site embedding for OpenSearch Dashboards, update the CSP rules using CURL. When using CURL commands with single quotation marks inside the `data-raw` parameter, escape them with a backslash (`\`). For example, use `'\''` to represent `'`. The configuration is shown in the following example. Refer to the `applicationConfig` plugin [README](https://github.com/opensearch-project/OpenSearch-Dashboards/blob/main/src/plugins/application_config/README.md) for configuration details. + +``` +curl '{osd endpoint}/api/appconfig/csp.rules' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' -H 'osd-xsrf: osd-fetch' -H 'Sec-Fetch-Dest: empty' --data-raw '{"newValue":"script-src '\''unsafe-eval'\'' '\''self'\''; worker-src blob: '\''self'\''; style-src '\''unsafe-inline'\'' '\''self'\''; frame-ancestors '\''self'\'' {new site}"}' +``` + +## Delete CSP rules + +Use the following CURL command to delete CSP rules: + +``` +curl '{osd endpoint}/api/appconfig/csp.rules' -X DELETE -H 'osd-xsrf: osd-fetch' -H 'Sec-Fetch-Dest: empty' +``` + +## Get CSP rules + +Use the following CURL command to get CSP rules: + +``` +curl '{osd endpoint}/api/appconfig/csp.rules' + +``` + +## Precedence + +Dynamic configurations override YAML configurations, except for empty CSP rules. To prevent `clickjacking`, a `frame-ancestors: self` directive is automatically added to YAML-defined rules when necessary. 
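To confirm which rules OpenSearch Dashboards is currently serving after a dynamic update, you can inspect the `content-security-policy` response header. The following is a minimal sketch using standard curl options; it assumes the default home application path, so replace `{osd endpoint}` (and the path, if needed) with values for your deployment:

```
curl -sI '{osd endpoint}/app/home' | grep -i content-security-policy
```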
diff --git a/_dashboards/dashboards-assistant/index.md b/_dashboards/dashboards-assistant/index.md index dd62347c31..9313dd2e97 100644 --- a/_dashboards/dashboards-assistant/index.md +++ b/_dashboards/dashboards-assistant/index.md @@ -60,7 +60,7 @@ For information about configuring OpenSearch Assistant through the REST API, see ## Using OpenSearch Assistant in OpenSearch Dashboards -The following tutorials guide you through using OpenSearch Assistant in OpenSearch Dashboards. OpenSearch Assistant can be viewed full frame or in the right sidebar. The default is sidebar. To view full frame, select the frame icon {::nomarkdown}frame icon{:/} in the toolbar. +The following tutorials guide you through using OpenSearch Assistant in OpenSearch Dashboards. OpenSearch Assistant can be viewed in full frame or in the sidebar. The default view is in the right sidebar. To view the assistant in the left sidebar or in full frame, select the {::nomarkdown}frame icon{:/} icon in the toolbar and choose the preferred option. ### Start a conversation diff --git a/_dashboards/management/index-patterns.md b/_dashboards/management/index-patterns.md index 590a9675a2..37baa210e9 100644 --- a/_dashboards/management/index-patterns.md +++ b/_dashboards/management/index-patterns.md @@ -56,7 +56,7 @@ An example of step 1 is shown in the following image. Note that the index patter Once the index pattern has been created, you can view the mapping of the matching indexes. Within the table, you can see the list of fields, along with their data type and properties. An example is shown in the following image. -Index pattern table UI +Index pattern table UI ## Next steps diff --git a/_dashboards/management/multi-data-sources.md b/_dashboards/management/multi-data-sources.md index 0447348648..dd66101f80 100644 --- a/_dashboards/management/multi-data-sources.md +++ b/_dashboards/management/multi-data-sources.md @@ -3,7 +3,7 @@ layout: default title: Configuring and using multiple data sources parent: Data sources nav_order: 10 -redirect_from: +redirect_from: - /dashboards/discover/multi-data-sources/ --- @@ -11,23 +11,22 @@ redirect_from: You can ingest, process, and analyze data from multiple data sources in OpenSearch Dashboards. You configure the data sources in the **Dashboards Management** > **Data sources** app, as shown in the following image. - Dashboards Management Data sources main screen ## Getting started -The following tutorial guides you through configuring and using multiple data sources. +The following tutorial guides you through configuring and using multiple data sources. ### Step 1: Modify the YAML file settings To use multiple data sources, you must enable the `data_source.enabled` setting. It is disabled by default. To enable multiple data sources: 1. Open your local copy of the OpenSearch Dashboards configuration file, `opensearch_dashboards.yml`. If you don't have a copy, [`opensearch_dashboards.yml`](https://github.com/opensearch-project/OpenSearch-Dashboards/blob/main/config/opensearch_dashboards.yml) is available on GitHub. -2. Set `data_source.enabled:` to `true` and save the YAML file. +2. Set `data_source.enabled:` to `true` and save the YAML file. 3. Restart the OpenSearch Dashboards container. 4. Verify that the configuration settings were configured properly by connecting to OpenSearch Dashboards and viewing the **Dashboards Management** navigation menu. **Data sources** appears in the sidebar. You'll see a view similar to the following image. 
- Data sources in sidebar within Dashboards Management +Data sources in sidebar within Dashboards Management ### Step 2: Create a new data source connection @@ -36,16 +35,17 @@ A data source connection specifies the parameters needed to connect to a data so To create a new data source connection: 1. From the OpenSearch Dashboards main menu, select **Dashboards Management** > **Data sources** > **Create data source connection**. -2. Add the required information to each field to configure **Connection Details** and **Authentication Method**. - + +2. Add the required information to each field to configure the **Connection Details** and **Authentication Method**. + - Under **Connection Details**, enter a title and endpoint URL. For this tutorial, use the URL `http://localhost:5601/app/management/opensearch-dashboards/dataSources`. Entering a description is optional. - Under **Authentication Method**, select an authentication method from the dropdown list. Once an authentication method is selected, the applicable fields for that method appear. You can then enter the required details. The authentication method options are: - - **No authentication**: No authentication is used to connect to the data source. - - **Username & Password**: A basic username and password are used to connect to the data source. - - **AWS SigV4**: An AWS Signature Version 4 authenticating request is used to connect to the data source. AWS Signature Version 4 requires an access key and a secret key. - - For AWS Signature Version 4 authentication, first specify the **Region**. Next, select the OpenSearch service in the **Service Name** list. The options are **Amazon OpenSearch Service** and **Amazon OpenSearch Serverless**. Last, enter the **Access Key** and **Secret Key** for authorization. - + - **No authentication**: No authentication is used to connect to the data source. + - **Username & Password**: A basic username and password are used to connect to the data source. + - **AWS SigV4**: An AWS Signature Version 4 authenticating request is used to connect to the data source. AWS Signature Version 4 requires an access key and a secret key. + - For AWS Signature Version 4 authentication, first specify the **Region**. Next, select the OpenSearch service from the **Service Name** list. The options are **Amazon OpenSearch Service** and **Amazon OpenSearch Serverless**. Last, enter the **Access Key** and **Secret Key** for authorization. + For information about available AWS Regions for AWS accounts, see [Available Regions](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions). For more information about AWS Signature Version 4 authentication requests, see [Authenticating Requests (AWS Signature Version 4)](https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html). {: .note} @@ -58,12 +58,11 @@ To create a new data source connection: - To make changes to the data source connection, select a connection in the list on the **Data Sources** main page. The **Connection Details** window opens. - To make changes to **Connection Details**, edit one or both of the **Title** and **Description** fields and select **Save changes** in the lower-right corner of the screen. You can also cancel changes here. To change the **Authentication Method**, choose a different authentication method, enter your credentials (if applicable), and then select **Save changes** in the lower-right corner of the screen. The changes are saved. 
- + - When **Username & Password** is the selected authentication method, you can update the password by choosing **Update stored password** next to the **Password** field. In the pop-up window, enter a new password in the first field and then enter it again in the second field to confirm. Select **Update stored password** in the pop-up window. The new password is saved. Select **Test connection** to confirm that the connection is valid. - - When **AWS SigV4** is the selected authentication method, you can update the credentials by selecting **Update stored AWS credential**. In the pop-up window, enter a new access key in the first field and a new secret key in the second field. Select **Update stored AWS credential** in the pop-up window. The new credentials are saved. Select **Test connection** in the upper-right corner of the screen to confirm that the connection is valid. -5. Delete the data source connection by selecting the check box to the left of the title and then choosing **Delete 1 connection**. Selecting multiple check boxes for multiple connections is supported. Alternatively, select the trash can icon ({::nomarkdown}trash can icon{:/}). +5. Delete the data source connection by selecting the check box to the left of the title and then choosing **Delete 1 connection**. Selecting multiple check boxes for multiple connections is supported. Alternatively, select the {::nomarkdown}trash can icon{:/} icon. An example data source connection screen is shown in the following image. @@ -71,7 +70,7 @@ An example data source connection screen is shown in the following image. ### Selecting multiple data sources through the Dev Tools console -Alternatively, you can select multiple data sources through the [Dev Tools]({{site.url}}{{site.baseurl}}/dashboards/dev-tools/index-dev/) console. This option provides for working with a broader range of data and gaining deeper insight into your code and applications. +Alternatively, you can select multiple data sources through the [Dev Tools]({{site.url}}{{site.baseurl}}/dashboards/dev-tools/index-dev/) console. This option allows you to work with a broader range of data and gaining a deeper understanding of your code and applications. Watch the following 10-second video to see it in action. @@ -79,7 +78,7 @@ Watch the following 10-second video to see it in action. To select a data source through the Dev Tools console, follow these steps: -1. Locate your copy of `opensearch_dashboards.yml` and open it in the editor of your choice. +1. Locate your copy of `opensearch_dashboards.yml` and open it in the editor of your choice. 2. Set `data_source.enabled` to `true`. 3. Connect to OpenSearch Dashboards and select **Dev Tools** in the menu. 4. Enter the following query in the editor pane of the **Console** and then select the play button: @@ -93,19 +92,55 @@ To select a data source through the Dev Tools console, follow these steps: 6. Repeat the preceding steps for each data source you want to select. ### Upload saved objects to a dashboard from connected data sources -To upload saved objects from connected data sources to a dashboard with multiple data sources, export them as an NDJSON file from the data source's **Saved object management** page. Then upload the file to the dashboard's **Saved object management** page. This method can make it easier to transfer saved objects between dashboards. The following 20-second video shows this feature in action. 
+To upload saved objects from connected data sources to a dashboard with multiple data sources, export them as an NDJSON file from the data source's **Saved object management** page. Then upload the file to the dashboard's **Saved object management** page. This method can simplify the transfer of saved objects between dashboards. The following 20-second video shows this feature in action. Multiple data sources in Saved object management{: .img-fluid} +#### Import saved objects from a connected data source + Follow these steps to import saved objects from a connected data source: -1. Locate your `opensearch_dashboards.yml` file and open it in your preferred text editor. +1. Locate your `opensearch_dashboards.yml` file and open it in your preferred text editor. 2. Set `data_source.enabled` to `true`. 3. Connect to OpenSearch Dashboards and go to **Dashboards Management** > **Saved objects**. 4. Select **Import** > **Select file** and upload the file acquired from the connected data source. 5. Choose the appropriate **Data source** from the dropdown menu, set your **Conflict management** option, and then select the **Import** button. +### Show or hide authentication methods for multiple data sources +Introduced 2.13 +{: .label .label-purple } + +A feature flag in your `opensearch_dashboards.yml` file allows you to show or hide authentication methods within the `data_source` plugin. The following example setting, shown in a 10-second demo, hides the authentication method for `AWSSigV4`. + +```` +# Set enabled to false to hide the authentication method from multiple data source in OpenSearch Dashboards. +# If this setting is commented out, then all three options will be available in OpenSearch Dashboards. +# The default value will be considered as true. +data_source.authTypes: + NoAuthentication: + enabled: true + UsernamePassword: + enabled: true + AWSSigV4: + enabled: false +```` + +Multiple data sources hide and show authentication{: .img-fluid} + +### Hide the local cluster option for multiple data sources +Introduced 2.13 +{: .label .label-purple } + +A feature flag in your `opensearch_dashboards.yml` file allows you to hide the local cluster option within the `data_source` plugin. This option hides the local cluster from the data source dropdown menu and index creation page, which is ideal for environments with or without a local OpenSearch cluster. The following example setting, shown in a 20-second demo, hides the local cluster. + +```` +# hide local cluster in the data source dropdown and index pattern creation page. +data_source.hideLocalCluster: true +```` + +Multiple data sources hide local cluster{: .img-fluid} + ## Next steps Once you've configured your multiple data sources, you can start exploring that data. See the following resources to learn more: @@ -120,5 +155,5 @@ Once you've configured your multiple data sources, you can start exploring that This feature has some limitations: * The multiple data sources feature is supported for index-pattern-based visualizations only. -* The visualization types Time Series Visual Builder (TSVB), Vega and Vega-Lite, and timeline are not supported. -* External plugins, such as Gantt chart, and non-visualization plugins, such as the developer console, are not supported. +* The Time Series Visual Builder (TSVB) and timeline visualization types are not supported. +* External plugins, such as `gantt-chart`, and non-visualization plugins are not supported. 
diff --git a/_dashboards/visualize/vega.md b/_dashboards/visualize/vega.md new file mode 100644 index 0000000000..7764d583a6 --- /dev/null +++ b/_dashboards/visualize/vega.md @@ -0,0 +1,192 @@ +--- +layout: default +title: Using Vega +parent: Building data visualizations +nav_order: 45 +--- + +# Using Vega + +[Vega](https://vega.github.io/vega/) and [Vega-Lite](https://vega.github.io/vega-lite/) are open-source, declarative language visualization tools that you can use to create custom data visualizations with your OpenSearch data and [Vega Data](https://vega.github.io/vega/docs/data/). These tools are ideal for advanced users comfortable with writing OpenSearch queries directly. Enable the `vis_type_vega` plugin in your `opensearch_dashboards.yml` file to write your [Vega specifications](https://vega.github.io/vega/docs/specification/) in either JSON or [HJSON](https://hjson.github.io/) format or to specify one or more OpenSearch queries within your Vega specification. By default, the plugin is set to `true`. The configuration is shown in the following example. For configuration details, refer to the `vis_type_vega` [README](https://github.com/opensearch-project/OpenSearch-Dashboards/blob/main/src/plugins/vis_type_vega/README.md). + +``` +vis_type_vega.enabled: true +``` + +The following image shows a custom Vega map created in OpenSearch. + +Map created using Vega visualization in OpenSearch Dashboards + +## Querying from multiple data sources + +If you have configured [multiple data sources]({{site.url}}{{site.baseurl}}/dashboards/management/multi-data-sources/) in OpenSearch Dashboards, you can use Vega to query those data sources. Within your Vega specification, add the `data_source_name` field under the `url` property to target a specific data source by name. By default, queries use data from the local cluster. You can assign individual `data_source_name` values to each OpenSearch query within your Vega specification. This allows you to query multiple indexes across different data sources in a single visualization. 
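At a minimum, this requires adding a single field to the query's `url` object. The following fragment is a minimal sketch in which `my-index` and `My Cluster` are placeholder names; the complete specification that follows shows the same pattern in context:

```
data: [
  {
    name: table
    url: {
      // Query this index on the data source registered under "My Cluster"
      // instead of the local cluster
      index: my-index
      data_source_name: My Cluster
      body: {
        size: 0
      }
    }
  }
]
```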
+ +The following is an example Vega specification with `Demo US Cluster` as the specified `data_source_name`: + +``` +{ + $schema: https://vega.github.io/schema/vega/v5.json + config: { + kibana: {type: "map", latitude: 25, longitude: -70, zoom: 3} + } + data: [ + { + name: table + url: { + index: opensearch_dashboards_sample_data_flights + // This OpenSearchQuery will query from the Demo US Cluster datasource + data_source_name: Demo US Cluster + %context%: true + // Uncomment to enable time filtering + // %timefield%: timestamp + body: { + size: 0 + aggs: { + origins: { + terms: {field: "OriginAirportID", size: 10000} + aggs: { + originLocation: { + top_hits: { + size: 1 + _source: { + includes: ["OriginLocation", "Origin"] + } + } + } + distinations: { + terms: {field: "DestAirportID", size: 10000} + aggs: { + destLocation: { + top_hits: { + size: 1 + _source: { + includes: ["DestLocation"] + } + } + } + } + } + } + } + } + } + } + format: {property: "aggregations.origins.buckets"} + transform: [ + { + type: geopoint + projection: projection + fields: [ + originLocation.hits.hits[0]._source.OriginLocation.lon + originLocation.hits.hits[0]._source.OriginLocation.lat + ] + } + ] + } + { + name: selectedDatum + on: [ + {trigger: "!selected", remove: true} + {trigger: "selected", insert: "selected"} + ] + } + ] + signals: [ + { + name: selected + value: null + on: [ + {events: "@airport:mouseover", update: "datum"} + {events: "@airport:mouseout", update: "null"} + ] + } + ] + scales: [ + { + name: airportSize + type: linear + domain: {data: "table", field: "doc_count"} + range: [ + {signal: "zoom*zoom*0.2+1"} + {signal: "zoom*zoom*10+1"} + ] + } + ] + marks: [ + { + type: group + from: { + facet: { + name: facetedDatum + data: selectedDatum + field: distinations.buckets + } + } + data: [ + { + name: facetDatumElems + source: facetedDatum + transform: [ + { + type: geopoint + projection: projection + fields: [ + destLocation.hits.hits[0]._source.DestLocation.lon + destLocation.hits.hits[0]._source.DestLocation.lat + ] + } + {type: "formula", expr: "{x:parent.x, y:parent.y}", as: "source"} + {type: "formula", expr: "{x:datum.x, y:datum.y}", as: "target"} + {type: "linkpath", shape: "diagonal"} + ] + } + ] + scales: [ + { + name: lineThickness + type: log + clamp: true + range: [1, 8] + } + { + name: lineOpacity + type: log + clamp: true + range: [0.2, 0.8] + } + ] + marks: [ + { + from: {data: "facetDatumElems"} + type: path + interactive: false + encode: { + update: { + path: {field: "path"} + stroke: {value: "black"} + strokeWidth: {scale: "lineThickness", field: "doc_count"} + strokeOpacity: {scale: "lineOpacity", field: "doc_count"} + } + } + } + ] + } + { + name: airport + type: symbol + from: {data: "table"} + encode: { + update: { + size: {scale: "airportSize", field: "doc_count"} + xc: {signal: "datum.x"} + yc: {signal: "datum.y"} + tooltip: { + signal: "{title: datum.originLocation.hits.hits[0]._source.Origin + ' (' + datum.key + ')', connnections: length(datum.distinations.buckets), flights: datum.doc_count}" + } + } + } + } + ] +} +``` +{% include copy-curl.html %} diff --git a/_data-prepper/common-use-cases/trace-analytics.md b/_data-prepper/common-use-cases/trace-analytics.md index 1f6c3b7cc4..033830351a 100644 --- a/_data-prepper/common-use-cases/trace-analytics.md +++ b/_data-prepper/common-use-cases/trace-analytics.md @@ -38,9 +38,9 @@ The [OpenTelemetry source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/c There are three processors for the trace analytics feature: -* 
*otel_traces_raw* - The *otel_traces_raw* processor receives a collection of [span](https://github.com/opensearch-project/data-prepper/blob/fa65e9efb3f8d6a404a1ab1875f21ce85e5c5a6d/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/trace/Span.java) records from [*otel-trace-source*]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/otel-trace/), and performs stateful processing, extraction, and completion of trace-group-related fields. -* *otel_traces_group* - The *otel_traces_group* processor fills in the missing trace-group-related fields in the collection of [span](https://github.com/opensearch-project/data-prepper/blob/298e7931aa3b26130048ac3bde260e066857df54/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/trace/Span.java) records by looking up the OpenSearch backend. -* *service_map_stateful* – The *service_map_stateful* processor performs the required preprocessing for trace data and builds metadata to display the `service-map` dashboards. +* otel_traces_raw -- The *otel_traces_raw* processor receives a collection of [span](https://github.com/opensearch-project/data-prepper/blob/fa65e9efb3f8d6a404a1ab1875f21ce85e5c5a6d/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/trace/Span.java) records from [*otel-trace-source*]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/otel-trace-source/), and performs stateful processing, extraction, and completion of trace-group-related fields. +* otel_traces_group -- The *otel_traces_group* processor fills in the missing trace-group-related fields in the collection of [span](https://github.com/opensearch-project/data-prepper/blob/298e7931aa3b26130048ac3bde260e066857df54/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/trace/Span.java) records by looking up the OpenSearch backend. +* service_map_stateful -- The *service_map_stateful* processor performs the required preprocessing for trace data and builds metadata to display the `service-map` dashboards. ### OpenSearch sink @@ -49,8 +49,8 @@ OpenSearch provides a generic sink that writes data to OpenSearch as the destina The sink provides specific configurations for the trace analytics feature. These configurations allow the sink to use indexes and index templates specific to trace analytics. The following OpenSearch indexes are specific to trace analytics: -* *otel-v1-apm-span* – The *otel-v1-apm-span* index stores the output from the [otel_traces_raw]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/otel-trace-raw/) processor. -* *otel-v1-apm-service-map* – The *otel-v1-apm-service-map* index stores the output from the [service_map_stateful]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/service-map-stateful/) processor. +* otel-v1-apm-span –- The *otel-v1-apm-span* index stores the output from the [otel_traces_raw]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/otel-trace-raw/) processor. +* otel-v1-apm-service-map –- The *otel-v1-apm-service-map* index stores the output from the [service_map_stateful]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/service-map-stateful/) processor. ## Trace tuning @@ -374,4 +374,4 @@ Starting with Data Prepper version 1.4, trace processing uses Data Prepper's eve * `otel_traces_group` replaces `otel_traces_group_prepper` for event-based spans. In Data Prepper version 2.0, `otel_traces_source` will only output events. 
Data Prepper version 2.0 also removes `otel_traces_raw_prepper` and `otel_traces_group_prepper` entirely. To migrate to Data Prepper version 2.0, you can configure your trace pipeline using the event model. - \ No newline at end of file + diff --git a/_data-prepper/pipelines/configuration/sources/otel-trace.md b/_data-prepper/pipelines/configuration/sources/otel-trace-source.md similarity index 89% rename from _data-prepper/pipelines/configuration/sources/otel-trace.md rename to _data-prepper/pipelines/configuration/sources/otel-trace-source.md index 4b17647768..137592bbe8 100644 --- a/_data-prepper/pipelines/configuration/sources/otel-trace.md +++ b/_data-prepper/pipelines/configuration/sources/otel-trace-source.md @@ -1,22 +1,22 @@ --- layout: default -title: otel_trace_source source +title: otel_trace_source parent: Sources grand_parent: Pipelines nav_order: 15 +redirect_from: + - /data-prepper/pipelines/configuration/sources/otel-trace/ --- -# otel_trace source +# otel_trace_source -## Overview - -The `otel_trace` source is a source for the OpenTelemetry Collector. The following table describes options you can use to configure the `otel_trace` source. +`otel_trace_source` is a source for the OpenTelemetry Collector. The following table describes options you can use to configure the `otel_trace_source` source. Option | Required | Type | Description :--- | :--- | :--- | :--- -port | No | Integer | The port that the `otel_trace` source runs on. Default value is `21890`. +port | No | Integer | The port that the `otel_trace_source` source runs on. Default value is `21890`. request_timeout | No | Integer | The request timeout, in milliseconds. Default value is `10000`. health_check_service | No | Boolean | Enables a gRPC health check service under `grpc.health.v1/Health/Check`. Default value is `false`. unauthenticated_health_check | No | Boolean | Determines whether or not authentication is required on the health check endpoint. Data Prepper ignores this option if no authentication is defined. Default value is `false`. @@ -35,6 +35,8 @@ authentication | No | Object | An authentication configuration. By default, an u ## Metrics +The 'otel_trace_source' source includes the following metrics. + ### Counters - `requestTimeouts`: Measures the total number of requests that time out. @@ -50,4 +52,4 @@ authentication | No | Object | An authentication configuration. By default, an u ### Distribution summaries -- `payloadSize`: Measures the incoming request payload size distribution in bytes. \ No newline at end of file +- `payloadSize`: Measures the incoming request payload size distribution in bytes. diff --git a/_im-plugin/reindex-data.md b/_im-plugin/reindex-data.md index 2e3288087a..a766589b84 100644 --- a/_im-plugin/reindex-data.md +++ b/_im-plugin/reindex-data.md @@ -91,6 +91,12 @@ Options | Valid values | Description | Required `socket_timeout` | Time Unit | The wait time for socket reads (default 30s). | No `connect_timeout` | Time Unit | The wait time for remote connection timeouts (default 30s). | No +The following table lists the retry policy cluster settings. + +Setting | Description | Default value +:--- | :--- +`reindex.remote.retry.initial_backoff` | The initial backoff time for retries. Subsequent retries will follow exponential backoff based on the initial backoff time. | 500 ms +`reindex.remote.retry.max_count` | The maximum number of retry attempts. 
| 15 ## Reindex a subset of documents diff --git a/_ingest-pipelines/processors/index-processors.md b/_ingest-pipelines/processors/index-processors.md index fb71e90d01..60fcac82e2 100644 --- a/_ingest-pipelines/processors/index-processors.md +++ b/_ingest-pipelines/processors/index-processors.md @@ -59,6 +59,7 @@ Processor type | Description `sort` | Sorts the elements of an array in ascending or descending order. `sparse_encoding` | Generates a sparse vector/token and weights from text fields for neural sparse search using sparse retrieval. `split` | Splits a field into an array using a separator character. +`text_chunking` | Splits long documents into smaller chunks. `text_embedding` | Generates vector embeddings from text fields for semantic search. `text_image_embedding` | Generates combined vector embeddings from text and image fields for multimodal neural search. `trim` | Removes leading and trailing white space from a string field. diff --git a/_ingest-pipelines/processors/text-chunking.md b/_ingest-pipelines/processors/text-chunking.md new file mode 100644 index 0000000000..e9ff55b210 --- /dev/null +++ b/_ingest-pipelines/processors/text-chunking.md @@ -0,0 +1,315 @@ +--- +layout: default +title: Text chunking +parent: Ingest processors +nav_order: 250 +--- + +# Text chunking processor + +The `text_chunking` processor splits a long document into shorter passages. The processor supports the following algorithms for text splitting: + +- [`fixed_token_length`](#fixed-token-length-algorithm): Splits text into passages of the specified size. +- [`delimiter`](#delimiter-algorithm): Splits text into passages on a delimiter. + +The following is the syntax for the `text_chunking` processor: + +```json +{ + "text_chunking": { + "field_map": { + "": "" + }, + "algorithm": { + "": "" + } + } +} +``` + +## Configuration parameters + +The following table lists the required and optional parameters for the `text_chunking` processor. + +| Parameter | Data type | Required/Optional | Description | +|:---|:---|:---|:---| +| `field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to the output field. | +| `field_map.` | String | Required | The name of the field from which to obtain text for generating chunked passages. | +| `field_map.` | String | Required | The name of the field in which to store the chunked results. | +| `algorithm` | Object | Required | Contains at most one key-value pair that specifies the chunking algorithm and parameters. | +| `algorithm.` | String | Optional | The name of the chunking algorithm. Valid values are [`fixed_token_length`](#fixed-token-length-algorithm) or [`delimiter`](#delimiter-algorithm). Default is `fixed_token_length`. | +| `algorithm.` | Object | Optional | The parameters for the chunking algorithm. By default, contains the default parameters of the `fixed_token_length` algorithm. | +| `description` | String | Optional | A brief description of the processor. | +| `tag` | String | Optional | An identifier tag for the processor. Useful when debugging in order to distinguish between processors of the same type. | + +### Fixed token length algorithm + +The following table lists the optional parameters for the `fixed_token_length` algorithm. + +| Parameter | Data type | Required/Optional | Description | +|:---|:---|:---|:---| +| `token_limit` | Integer | Optional | The token limit for chunking algorithms. Valid values are integers of at least `1`. Default is `384`. 
| +| `tokenizer` | String | Optional | The [word tokenizer]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/index/#word-tokenizers) name. Default is `standard`. | +| `overlap_rate` | String | Optional | The degree of overlap in the token algorithm. Valid values are floats between `0` and `0.5`, inclusive. Default is `0`. | +| `max_chunk_limit` | Integer | Optional | The chunk limit for chunking algorithms. Default is 100. To disable this parameter, set it to `-1`. | + +The default value of `token_limit` is `384` so that output passages don't exceed the token limit constraint of the downstream text embedding models. For [OpenSearch-supported pretrained models]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/#supported-pretrained-models), like `msmarco-distilbert-base-tas-b` and `opensearch-neural-sparse-encoding-v1`, the input token limit is `512`. The `standard` tokenizer tokenizes text into words. According to [OpenAI](https://platform.openai.com/docs/introduction), 1 token equals approximately 0.75 words of English text. The default token limit is calculated as 512 * 0.75 = 384. +{: .note} + +You can set the `overlap_rate` to a decimal percentage value in the 0--0.5 range, inclusive. Per [Amazon Bedrock](https://aws.amazon.com/blogs/aws/knowledge-bases-now-delivers-fully-managed-rag-experience-in-amazon-bedrock/), we recommend setting this parameter to a value of 0–0.2 to improve accuracy. +{: .note} + +The `max_chunk_limit` parameter limits the number of chunked passages. If the number of passages generated by the processor exceeds the limit, the algorithm will return an exception, prompting you to either increase or disable the limit. +{: .note} + +### Delimiter algorithm + +The following table lists the optional parameters for the `delimiter` algorithm. + +| Parameter | Data type | Required/Optional | Description | +|:---|:---|:---|:---| +| `delimiter` | String | Optional | A string delimiter used to split text. You can set the `delimiter` to any string, for example, `\n` (split text into paragraphs on a new line) or `.` (split text into sentences). Default is `\n\n` (split text into paragraphs on two new line characters). | +| `max_chunk_limit` | Integer | Optional | The chunk limit for chunking algorithms. Default is `100`. To disable this parameter, set it to `-1`. | + +The `max_chunk_limit` parameter limits the number of chunked passages. If the number of passages generated by the processor exceeds the limit, the algorithm will return an exception, prompting you to either increase or disable the limit. +{: .note} + +## Using the processor + +Follow these steps to use the processor in a pipeline. You can specify the chunking algorithm when creating the processor. If you don't provide an algorithm name, the chunking processor will use the default `fixed_token_length` algorithm along with all its default parameters. 
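For example, a processor definition that instead splits text into sentences by using the `delimiter` algorithm is shown in the following sketch; `passage_text` and `passage_chunk` are the same example field names used throughout this section:

```json
{
  "text_chunking": {
    "algorithm": {
      "delimiter": {
        "delimiter": "."
      }
    },
    "field_map": {
      "passage_text": "passage_chunk"
    }
  }
}
```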
+ +**Step 1: Create a pipeline** + +The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field: + +```json +PUT _ingest/pipeline/text-chunking-ingest-pipeline +{ + "description": "A text chunking ingest pipeline", + "processors": [ + { + "text_chunking": { + "algorithm": { + "fixed_token_length": { + "token_limit": 10, + "overlap_rate": 0.2, + "tokenizer": "standard" + } + }, + "field_map": { + "passage_text": "passage_chunk" + } + } + } + ] +} +``` +{% include copy-curl.html %} + +**Step 2 (Optional): Test the pipeline** + +It is recommended that you test your pipeline before ingesting documents. +{: .tip} + +To test the pipeline, run the following query: + +```json +POST _ingest/pipeline/text-chunking-ingest-pipeline/_simulate +{ + "docs": [ + { + "_index": "testindex", + "_id": "1", + "_source":{ + "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch." + } + } + ] +} +``` +{% include copy-curl.html %} + +#### Response + +The response confirms that, in addition to the `passage_text` field, the processor has generated chunking results in the `passage_chunk` field. The processor split the paragraph into 10-word chunks. Because of the `overlap` setting of 0.2, the last 2 words of a chunk are duplicated in the following chunk: + +```json +{ + "docs": [ + { + "doc": { + "_index": "testindex", + "_id": "1", + "_source": { + "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.", + "passage_chunk": [ + "This is an example document to be chunked. The document ", + "The document contains a single paragraph, two sentences and 24 ", + "and 24 tokens by standard tokenizer in OpenSearch." + ] + }, + "_ingest": { + "timestamp": "2024-03-20T02:55:25.642366Z" + } + } + } + ] +} +``` + +Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of the [neural sparse search documentation]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). + +## Chaining text chunking and embedding processors + +You can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor in order to obtain embeddings for each chunked passage. + +**Prerequisites** + +Follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model. + +**Step 1: Create a pipeline** + +The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. 
The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_embedding` field: + +```json +PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline +{ + "description": "A text chunking and embedding ingest pipeline", + "processors": [ + { + "text_chunking": { + "algorithm": { + "fixed_token_length": { + "token_limit": 10, + "overlap_rate": 0.2, + "tokenizer": "standard" + } + }, + "field_map": { + "passage_text": "passage_chunk" + } + } + }, + { + "text_embedding": { + "model_id": "LMLPWY4BROvhdbtgETaI", + "field_map": { + "passage_chunk": "passage_chunk_embedding" + } + } + } + ] +} +``` +{% include copy-curl.html %} + +**Step 2 (Optional): Test the pipeline** + +It is recommended that you test your pipeline before ingesting documents. +{: .tip} + +To test the pipeline, run the following query: + +```json +POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate +{ + "docs": [ + { + "_index": "testindex", + "_id": "1", + "_source":{ + "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch." + } + } + ] +} +``` +{% include copy-curl.html %} + +#### Response + +The response confirms that, in addition to the `passage_text` and `passage_chunk` fields, the processor has generated text embeddings for each of the three passages in the `passage_chunk_embedding` field. The embedding vectors are stored in the `knn` field for each chunk: + +```json +{ + "docs": [ + { + "doc": { + "_index": "testindex", + "_id": "1", + "_source": { + "passage_chunk_embedding": [ + { + "knn": [...] + }, + { + "knn": [...] + }, + { + "knn": [...] + } + ], + "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.", + "passage_chunk": [ + "This is an example document to be chunked. The document ", + "The document contains a single paragraph, two sentences and 24 ", + "and 24 tokens by standard tokenizer in OpenSearch." + ] + }, + "_ingest": { + "timestamp": "2024-03-20T03:04:49.144054Z" + } + } + } + ] +} +``` + +Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of the [neural sparse search documentation]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). + +## Cascaded text chunking processors + +You can chain multiple chunking processors together. For example, to split documents into paragraphs, apply the `delimiter` algorithm and specify the parameter as `\n\n`. To prevent a paragraph from exceeding the token limit, append another chunking processor that uses the `fixed_token_length` algorithm. 
You can configure the ingest pipeline for this example as follows: + +```json +PUT _ingest/pipeline/text-chunking-cascade-ingest-pipeline +{ + "description": "A text chunking pipeline with cascaded algorithms", + "processors": [ + { + "text_chunking": { + "algorithm": { + "delimiter": { + "delimiter": "\n\n" + } + }, + "field_map": { + "passage_text": "passage_chunk1" + } + } + }, + { + "text_chunking": { + "algorithm": { + "fixed_token_length": { + "token_limit": 500, + "overlap_rate": 0.2, + "tokenizer": "standard" + } + }, + "field_map": { + "passage_chunk1": "passage_chunk2" + } + } + } + ] +} +``` +{% include copy-curl.html %} + +## Next steps + +- To learn more about semantic search, see [Semantic search]({{site.url}}{{site.baseurl}}/search-plugins/semantic-search/). +- To learn more about sparse search, see [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). +- To learn more about using models in OpenSearch, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model). +- For a comprehensive example, see [Neural search tutorial]({{site.url}}{{site.baseurl}}/search-plugins/neural-search-tutorial/). diff --git a/_install-and-configure/configuring-opensearch/index-settings.md b/_install-and-configure/configuring-opensearch/index-settings.md index 0f7e336cdd..25cd4b8810 100644 --- a/_install-and-configure/configuring-opensearch/index-settings.md +++ b/_install-and-configure/configuring-opensearch/index-settings.md @@ -100,6 +100,7 @@ OpenSearch supports the following static index-level index settings: - `index.merge_on_flush.policy` (default | merge-on-flush): This setting controls which merge policy should be used when `index.merge_on_flush.enabled` is enabled. Default is `default`. +- `index.check_pending_flush.enabled` (Boolean): This setting controls the Apache Lucene `checkPendingFlushOnUpdate` index writer setting, which specifies whether an indexing thread should check for pending flushes on an update in order to flush indexing buffers to disk. Default is `true`. ### Updating a static index setting @@ -184,9 +185,9 @@ OpenSearch supports the following dynamic index-level index settings: - `index.final_pipeline` (String): The final ingest node pipeline for the index. If the final pipeline is set and the pipeline does not exist, then index requests fail. The pipeline name `_none` specifies that the index does not have an ingest pipeline. -- `index.optimize_doc_id_lookup.fuzzy_set.enabled` (Boolean): This setting controls whether `fuzzy_set` should be enabled in order to optimize document ID lookups in index or search calls by using an additional data structure, in this case, the Bloom filter data structure. Enabling this setting improves performance for upsert and search operations that rely on document ID by creating a new data structure (Bloom filter). The Bloom filter allows for the handling of negative cases (that is, IDs being absent in the existing index) through faster off-heap lookups. Default is `false`. This setting can only be used if the feature flag `opensearch.experimental.optimize_doc_id_lookup.fuzzy_set.enabled` is set to `true`. +- `index.optimize_doc_id_lookup.fuzzy_set.enabled` (Boolean): This setting controls whether `fuzzy_set` should be enabled in order to optimize document ID lookups in index or search calls by using an additional data structure, in this case, the Bloom filter data structure. 
Enabling this setting improves performance for upsert and search operations that rely on document IDs by creating a new data structure (Bloom filter). The Bloom filter allows for the handling of negative cases (that is, IDs being absent in the existing index) through faster off-heap lookups. Note that creating a Bloom filter requires additional heap usage during indexing time. Default is `false`. -- `index.optimize_doc_id_lookup.fuzzy_set.false_positive_probability` (Double): Sets the false-positive probability for the underlying `fuzzy_set` (that is, the Bloom filter). A lower false-positive probability ensures higher throughput for `UPSERT` and `GET` operations. Allowed values range between `0.01` and `0.50`. Default is `0.20`. This setting can only be used if the feature flag `opensearch.experimental.optimize_doc_id_lookup.fuzzy_set.enabled` is set to `true`. +- `index.optimize_doc_id_lookup.fuzzy_set.false_positive_probability` (Double): Sets the false-positive probability for the underlying `fuzzy_set` (that is, the Bloom filter). A lower false-positive probability ensures higher throughput for upsert and get operations but results in increased storage and memory use. Allowed values range between `0.01` and `0.50`. Default is `0.20`. ### Updating a dynamic index setting diff --git a/_install-and-configure/install-dashboards/debian.md b/_install-and-configure/install-dashboards/debian.md index 4372049230..73aba46cd4 100644 --- a/_install-and-configure/install-dashboards/debian.md +++ b/_install-and-configure/install-dashboards/debian.md @@ -131,3 +131,44 @@ By default, OpenSearch Dashboards, like OpenSearch, binds to `localhost` when yo 1. From a web browser, navigate to OpenSearch Dashboards. The default port is 5601. 1. Log in with the default username `admin` and the default password `admin`. (For OpenSearch 2.12 and later, the password should be the custom admin password) 1. Visit [Getting started with OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/dashboards/index/) to learn more. + + +## Upgrade to a newer version + +OpenSearch Dashboards instances installed using `dpkg` or `apt-get` can be easily upgraded to a newer version. + +### Manual upgrade with DPKG + +Download the Debian package for the desired upgrade version directly from the [OpenSearch Project downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. 
+ +Navigate to the directory containing the distribution and run the following command: + +```bash +sudo dpkg -i opensearch-dashboards-{{site.opensearch_version}}-linux-x64.deb +``` +{% include copy.html %} + +### APT-GET + +To upgrade to the latest version of OpenSearch Dashboards using `apt-get`, run the following command: + +```bash +sudo apt-get upgrade opensearch-dashboards +``` +{% include copy.html %} + +You can also upgrade to a specific OpenSearch Dashboards version by providing the version number: + +```bash +sudo apt-get upgrade opensearch-dashboards= +``` +{% include copy.html %} + +### Automatically restart the service after a package upgrade (2.13.0+) + +To automatically restart OpenSearch Dashboards after a package upgrade, enable the `opensearch-dashboards.service` through `systemd`: + +```bash +sudo systemctl enable opensearch-dashboards.service +``` +{% include copy.html %} diff --git a/_install-and-configure/install-dashboards/rpm.md b/_install-and-configure/install-dashboards/rpm.md index d250c4c1f3..cc5974c91e 100644 --- a/_install-and-configure/install-dashboards/rpm.md +++ b/_install-and-configure/install-dashboards/rpm.md @@ -89,4 +89,41 @@ YUM, the primary package management tool for Red Hat-based operating systems, al 1. Once complete, you can run OpenSearch Dashboards. ```bash sudo systemctl start opensearch-dashboards - ``` \ No newline at end of file + ``` + +## Upgrade to a newer version + +OpenSearch Dashboards instances installed using RPM or YUM can be easily upgraded to a newer version. We recommend using YUM, but you can also choose RPM. + + +### Manual upgrade with RPM + +Download the RPM package for the desired upgrade version directly from the [OpenSearch Project downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. + +Navigate to the directory containing the distribution and run the following command: + +```bash +rpm -Uvh opensearch-dashboards-{{site.opensearch_version}}-linux-x64.rpm +``` +{% include copy.html %} + +### YUM + +To upgrade to the latest version of OpenSearch Dashboards using YUM, run the following command: + +```bash +sudo yum update opensearch-dashboards +``` +{% include copy.html %} + +You can also upgrade to a specific OpenSearch Dashboards version by providing the version number: + + ```bash + sudo yum update opensearch-dashboards- + ``` + {% include copy.html %} + +### Automatically restart the service after a package upgrade + +The OpenSearch Dashboards RPM package does not currently support automatically restarting the service after a package upgrade. + diff --git a/_install-and-configure/install-opensearch/debian.md b/_install-and-configure/install-opensearch/debian.md index 6f9167a12c..72ae05d87c 100644 --- a/_install-and-configure/install-opensearch/debian.md +++ b/_install-and-configure/install-opensearch/debian.md @@ -528,7 +528,7 @@ OpenSearch instances installed using `dpkg` or `apt-get` can be easily upgraded ### Manual upgrade with DPKG -Download the Debian package for the desired upgrade version directly from the [OpenSearch downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. +Download the Debian package for the desired upgrade version directly from the [OpenSearch Project downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. 
Navigate to the directory containing the distribution and run the following command: ```bash @@ -550,6 +550,15 @@ sudo apt-get upgrade opensearch= ``` {% include copy.html %} +### Automatically restart the service after a package upgrade (2.13.0+) + +To automatically restart OpenSearch after a package upgrade, enable the `opensearch.service` through `systemd`: + +```bash +sudo systemctl enable opensearch.service +``` +{% include copy.html %} + ## Related links - [OpenSearch configuration]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/) diff --git a/_install-and-configure/install-opensearch/rpm.md b/_install-and-configure/install-opensearch/rpm.md index ac3ff4e0e9..a22ea96d61 100644 --- a/_install-and-configure/install-opensearch/rpm.md +++ b/_install-and-configure/install-opensearch/rpm.md @@ -500,7 +500,7 @@ OpenSearch instances installed using RPM or YUM can be easily upgraded to a newe ### Manual upgrade with RPM -Download the RPM package for the desired upgrade version directly from the [OpenSearch downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. +Download the RPM package for the desired upgrade version directly from the [OpenSearch Project downloads page](https://opensearch.org/downloads.html){:target='\_blank'}. Navigate to the directory containing the distribution and run the following command: ```bash @@ -512,7 +512,7 @@ rpm -Uvh opensearch-{{site.opensearch_version}}-linux-x64.rpm To upgrade to the latest version of OpenSearch using YUM: ```bash -sudo yum update +sudo yum update opensearch ``` {% include copy.html %} @@ -522,6 +522,10 @@ sudo yum update ``` {% include copy.html %} +### Automatically restart the service after a package upgrade + +The OpenSearch RPM package does not currently support automatically restarting the service after a package upgrade. + ## Related links - [OpenSearch configuration]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/) diff --git a/_install-and-configure/plugins.md b/_install-and-configure/plugins.md index b18257cf3e..6b0b28769e 100644 --- a/_install-and-configure/plugins.md +++ b/_install-and-configure/plugins.md @@ -247,8 +247,23 @@ bin/opensearch-plugin install --batch ## Available plugins -Major, minor, and patch plugin versions must match OpenSearch major, minor, and patch versions in order to be compatible. For example, plugins versions 2.3.0.x work only with OpenSearch 2.3.0. -{: .warning} +OpenSearch provides several bundled and additional plugins. + +### Plugin compatibility + +A plugin can explicitly specify compatibility with a specific OpenSearch version by listing that version in its `plugin-descriptor.properties` file. For example, a plugin with the following property is compatible only with OpenSearch 2.3.0: + +```properties +opensearch.version=2.3.0 +``` +Alternatively, a plugin can specify a range of compatible OpenSearch versions by setting the `dependencies` property in its `plugin-descriptor.properties` file using one of the following notations: +- `dependencies={ opensearch: "2.3.0" }`: The plugin is compatible only with OpenSearch version 2.3.0. +- `dependencies={ opensearch: "=2.3.0" }`: The plugin is compatible only with OpenSearch version 2.3.0. +- `dependencies={ opensearch: "~2.3.0" }`: The plugin is compatible with all versions starting from 2.3.0 up to the next minor version, in this example, 2.4.0 (exclusive). 
+- `dependencies={ opensearch: "^2.3.0" }`: The plugin is compatible with all versions starting from 2.3.0 up to the next major version, in this example, 3.0.0 (exclusive). + +You can specify only one of the `opensearch.version` or `dependencies` properties. +{: .note} ### Bundled plugins diff --git a/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md b/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md index bc2b7443de..109cbf8836 100644 --- a/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md +++ b/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md @@ -264,7 +264,7 @@ To test the LLM, send the following predict request: POST /_plugins/_ml/models/NWR9YIsBUysqmzBdifVJ/_predict { "parameters": { - "prompt": "\n\nHuman:hello\n\nnAssistant:" + "prompt": "\n\nHuman:hello\n\nAssistant:" } } ``` @@ -354,4 +354,50 @@ Therefore, the population increase of Seattle from 2021 to 2023 is 58,000.""" } ] } -``` \ No newline at end of file +``` + +## Hidden agents +**Introduced 2.13** +{: .label .label-purple } + +To hide agent details from end users, including the cluster admin, you can register a _hidden_ agent. If an agent is hidden, non-superadmin users don't have permission to call any [Agent APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/index/) except for the [Execute API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/execute-agent/), on the agent. + +Only superadmin users can register a hidden agent. To register a hidden agent, you first need to authenticate with an [admin certificate]({{site.url}}{{site.baseurl}}/security/configuration/tls/#configuring-admin-certificates): + +```bash +curl -k --cert ./kirk.pem --key ./kirk-key.pem -XGET 'https://localhost:9200/.opendistro_security/_search' +``` + +All agents created by a superadmin user are automatically registered as hidden. To register a hidden agent, send a request to the `_register` endpoint: + +```bash +curl -k --cert ./kirk.pem --key ./kirk-key.pem -X POST 'https://localhost:9200/_plugins/_ml/models/_register' -H 'Content-Type: application/json' -d ' +{ + "name": "Test_Agent_For_RAG", + "type": "flow", + "description": "this is a test agent", + "tools": [ + { + "name": "vector_tool", + "type": "VectorDBTool", + "parameters": { + "model_id": "zBRyYIsBls05QaITo5ex", + "index": "my_test_data", + "embedding_field": "embedding", + "source_field": [ + "text" + ], + "input": "${parameters.question}" + } + }, + { + "type": "MLModelTool", + "description": "A general tool to answer any question", + "parameters": { + "model_id": "NWR9YIsBUysqmzBdifVJ", + "prompt": "\n\nHuman:You are a professional data analyst. You will always answer question based on the given context first. If the answer is not directly shown in the context, you will analyze the data and find the answer. If you don't know the answer, just say don't know. 
\n\n Context:\n${parameters.vector_tool.output}\n\nHuman:${parameters.question}\n\nAssistant:" + } + } + ] +}' +``` diff --git a/_ml-commons-plugin/api/index.md b/_ml-commons-plugin/api/index.md index a41679f666..ec4cf12492 100644 --- a/_ml-commons-plugin/api/index.md +++ b/_ml-commons-plugin/api/index.md @@ -16,8 +16,11 @@ ML Commons supports the following APIs: - [Model APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/index/) - [Model group APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-group-apis/index/) - [Connector APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/connector-apis/index/) +- [Agent APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/index/) +- [Memory APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/memory-apis/index/) +- [Controller APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/controller-apis/index/) +- [Execute Algorithm API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/execute-algorithm/) - [Tasks APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/index/) - [Train and Predict APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/train-predict/index/) -- [Execute Algorithm API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/execute-algorithm/) - [Profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/profile/) - [Stats API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/stats/) diff --git a/_ml-commons-plugin/api/model-apis/deploy-model.md b/_ml-commons-plugin/api/model-apis/deploy-model.md index 52cf3f232e..2c6991ba22 100644 --- a/_ml-commons-plugin/api/model-apis/deploy-model.md +++ b/_ml-commons-plugin/api/model-apis/deploy-model.md @@ -8,7 +8,19 @@ nav_order: 20 # Deploy a model -The deploy model operation reads the model's chunks from the model index and then creates an instance of the model to cache into memory. This operation requires the `model_id`. +The deploy model operation reads the model's chunks from the model index and then creates an instance of the model to cache in memory. This operation requires the `model_id`. + +Starting with OpenSearch version 2.13, [externally hosted models]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index) are deployed automatically by default when you send a Predict API request for the first time. To disable automatic deployment for an externally hosted model, set `plugins.ml_commons.model_auto_deploy.enable` to `false`: + +```json +PUT _cluster/settings +{ + "persistent": { + "plugins.ml_commons.model_auto_deploy.enable": "false" + } +} +``` +{% include copy-curl.html %} For information about user access for this API, see [Model access control considerations]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/index/#model-access-control-considerations). diff --git a/_ml-commons-plugin/cluster-settings.md b/_ml-commons-plugin/cluster-settings.md index 5bf1c13599..c473af81a1 100644 --- a/_ml-commons-plugin/cluster-settings.md +++ b/_ml-commons-plugin/cluster-settings.md @@ -239,6 +239,33 @@ plugins.ml_commons.native_memory_threshold: 90 - Default value: 90 - Value range: [0, 100] +## Set JVM heap memory threshold + +Sets a circuit breaker that checks JVM heap memory usage before running an ML task. If the heap usage exceeds the threshold, OpenSearch triggers a circuit breaker and throws an exception to maintain optimal performance. + +Values are based on the percentage of JVM heap memory available. When set to `0`, no ML tasks will run. 
When set to `100`, the circuit breaker closes and no threshold exists. + +### Setting + +``` +plugins.ml_commons.jvm_heap_memory_threshold: 85 +``` + +### Values + +- Default value: 85 +- Value range: [0, 100] + +## Exclude node names + +Use this setting to specify the names of nodes on which you don't want to run ML tasks. The value should be a valid node name or a comma-separated node name list. + +### Setting + +``` +plugins.ml_commons.exclude_nodes._name: node1, node2 +``` + ## Allow custom deployment plans When enabled, this setting grants users the ability to deploy models to specific ML nodes according to that user's permissions. @@ -254,6 +281,21 @@ plugins.ml_commons.allow_custom_deployment_plan: false - Default value: false - Valid values: `false`, `true` +## Enable auto deploy + +This setting is applicable when you send a prediction request for an externally hosted model that has not been deployed. When set to `true`, this setting automatically deploys the model to the cluster if the model has not been deployed already. + +### Setting + +``` +plugins.ml_commons.model_auto_deploy.enable: false +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` + ## Enable auto redeploy This setting automatically redeploys deployed or partially deployed models upon cluster failure. If all ML nodes inside a cluster crash, the model switches to the `DEPLOYED_FAILED` state, and the model must be deployed manually. @@ -326,10 +368,110 @@ plugins.ml_commons.connector_access_control_enabled: true ### Values -- Default value: false +- Default value: `false` - Valid values: `false`, `true` + +## Enable a local model + +This setting allows a cluster admin to enable running local models on the cluster. When this setting is `false`, users will not be able to run register, deploy, or predict operations on any local model. + +### Setting + +``` +plugins.ml_commons.local_model.enabled: true +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` + +## Node roles that can run externally hosted models + +This setting allows a cluster admin to control the types of nodes on which externally hosted models can run. + +### Setting + +``` +plugins.ml_commons.task_dispatcher.eligible_node_role.remote_model: ["ml"] +``` + +### Values + +- Default value: `["data", "ml"]`, which allows externally hosted models to run on data nodes and ML nodes. + + +## Node roles that can run local models + +This setting allows a cluster admin to control the types of nodes on which local models can run. The `plugins.ml_commons.only_run_on_ml_node` setting only allows the model to run on ML nodes. For a local model, if `plugins.ml_commons.only_run_on_ml_node` is set to `true`, then the model will always run on ML nodes. If `plugins.ml_commons.only_run_on_ml_node` is set to `false`, then the model will run on nodes defined in the `plugins.ml_commons.task_dispatcher.eligible_node_role.local_model` setting. + +### Setting + +``` +plugins.ml_commons.task_dispatcher.eligible_node_role.local_model: ["ml"] +``` + +### Values + +- Default value: `["data", "ml"]` + +## Enable remote inference + +This setting allows a cluster admin to enable remote inference on the cluster. If this setting is `false`, users will not be able to run register, deploy, or predict operations on any externally hosted model or create a connector for remote inference. 
+ +### Setting + +``` +plugins.ml_commons.remote_inference.enabled: true +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` + +## Enable agent framework + +When set to `true`, this setting enables the agent framework (including agents and tools) on the cluster and allows users to run register, execute, delete, get, and search operations on an agent. + +### Setting + +``` +plugins.ml_commons.agent_framework_enabled: true +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` + +## Enable memory + +When set to `true`, this setting enables conversational memory, which stores all messages from a conversation for conversational search. + +### Setting + +``` +plugins.ml_commons.memory_feature_enabled: true +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` + + +## Enable RAG pipeline + +When set to `true`, this setting enables the search processors for retrieval-augmented generation (RAG). RAG enhances query results by generating responses using relevant information from memory and previous conversations. + +### Setting + +``` +plugins.ml_commons.rag_pipeline_feature_enabled: true +``` + +### Values + +- Default value: `true` +- Valid values: `false`, `true` diff --git a/_ml-commons-plugin/custom-local-models.md b/_ml-commons-plugin/custom-local-models.md index f96f784196..ee44a0a529 100644 --- a/_ml-commons-plugin/custom-local-models.md +++ b/_ml-commons-plugin/custom-local-models.md @@ -20,12 +20,14 @@ As of OpenSearch 2.11, OpenSearch supports local sparse encoding models. As of OpenSearch 2.12, OpenSearch supports local cross-encoder models. +As of OpenSearch 2.13, OpenSearch supports local question answering models. + Running local models on the CentOS 7 operating system is not supported. Moreover, not all local models can run on all hardware and operating systems. {: .important} ## Preparing a model -For both text embedding and sparse encoding models, you must provide a tokenizer JSON file within the model zip file. +For all models, you must provide a tokenizer JSON file within the model zip file. For sparse encoding models, make sure your output format is `{"output":}` so that ML Commons can post-process the sparse vector. @@ -157,7 +159,7 @@ POST /_plugins/_ml/models/_register ``` {% include copy.html %} -For descriptions of Register API parameters, see [Register a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/). The `model_task_type` corresponds to the model type. For text embedding models, set this parameter to `TEXT_EMBEDDING`. For sparse encoding models, set this parameter to `SPARSE_ENCODING` or `SPARSE_TOKENIZE`. For cross-encoder models, set this parameter to `TEXT_SIMILARITY`. +For descriptions of Register API parameters, see [Register a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/). The `model_task_type` corresponds to the model type. For text embedding models, set this parameter to `TEXT_EMBEDDING`. For sparse encoding models, set this parameter to `SPARSE_ENCODING` or `SPARSE_TOKENIZE`. For cross-encoder models, set this parameter to `TEXT_SIMILARITY`. For question answering models, set this parameter to `QUESTION_ANSWERING`. 
OpenSearch returns the task ID of the register operation: @@ -321,3 +323,60 @@ The response contains the tokens and weights: ## Step 5: Use the model for search To learn how to use the model for vector search, see [Using an ML model for neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/#using-an-ml-model-for-neural-search). + +## Question answering models + +A question answering model extracts the answer to a question from a given context. ML Commons supports context in `text` format. + +To register a question answering model, send a request in the following format. Specify the `function_name` as `QUESTION_ANSWERING`: + +```json +POST /_plugins/_ml/models/_register +{ + "name": "question_answering", + "version": "1.0.0", + "function_name": "QUESTION_ANSWERING", + "description": "test model", + "model_format": "TORCH_SCRIPT", + "model_group_id": "lN4AP40BKolAMNtR4KJ5", + "model_content_hash_value": "e837c8fc05fd58a6e2e8383b319257f9c3859dfb3edc89b26badfaf8a4405ff6", + "model_config": { + "model_type": "bert", + "framework_type": "huggingface_transformers" + }, + "url": "https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/question_answering/question_answering_pt.zip?raw=true" +} +``` +{% include copy-curl.html %} + +Then send a request to deploy the model: + +```json +POST _plugins/_ml/models/<model_id>/_deploy +``` +{% include copy-curl.html %} + +To test a question answering model, send the following request. It requires a `question` and the relevant `context` from which the answer will be extracted: + +```json +POST /_plugins/_ml/_predict/question_answering/<model_id> +{ + "question": "Where do I live?", + "context": "My name is John. I live in New York" +} +``` +{% include copy-curl.html %} + +The response provides the answer based on the context: + +```json +{ + "inference_results": [ + { + "output": [ + { + "result": "New York" + } + ] + } + ] +} +``` \ No newline at end of file diff --git a/_ml-commons-plugin/remote-models/blueprints.md b/_ml-commons-plugin/remote-models/blueprints.md index 57e0e4177b..5cac2f3d3b 100644 --- a/_ml-commons-plugin/remote-models/blueprints.md +++ b/_ml-commons-plugin/remote-models/blueprints.md @@ -55,32 +55,41 @@ As an ML developer, you can build connector blueprints for other platforms. Usin ## Configuration parameters -The following configuration parameters are **required** in order to build a connector blueprint. - -| Field | Data type | Description | -| :--- | :--- | :--- | -| `name` | String | The name of the connector. | -| `description` | String | A description of the connector. | -| `version` | Integer | The version of the connector. | -| `protocol` | String | The protocol for the connection. For AWS services such as Amazon SageMaker and Amazon Bedrock, use `aws_sigv4`. For all other services, use `http`. | -| `parameters` | JSON object | The default connector parameters, including `endpoint` and `model`. Any parameters indicated in this field can be overridden by parameters specified in a predict request. | -| `credential` | JSON object | Defines any credential variables required in order to connect to your chosen endpoint. ML Commons uses **AES/GCM/NoPadding** symmetric encryption to encrypt your credentials. When the connection to the cluster first starts, OpenSearch creates a random 32-byte encryption key that persists in OpenSearch's system index. Therefore, you do not need to manually set the encryption key. 
| -| `actions` | JSON array | Defines what actions can run within the connector. If you're an administrator creating a connection, add the [blueprint]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/) for your desired connection. | -| `backend_roles` | JSON array | A list of OpenSearch backend roles. For more information about setting up backend roles, see [Assigning backend roles to users]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#assigning-backend-roles-to-users). | -| `access_mode` | String | Sets the access mode for the model, either `public`, `restricted`, or `private`. Default is `private`. For more information about `access_mode`, see [Model groups]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#model-groups). | -| `add_all_backend_roles` | Boolean | When set to `true`, adds all `backend_roles` to the access list, which only a user with admin permissions can adjust. When set to `false`, non-admins can add `backend_roles`. | - -The `action` parameter supports the following options. - -| Field | Data type | Description | -| :--- | :--- | :--- | -| `action_type` | String | Required. Sets the ML Commons API operation to use upon connection. As of OpenSearch 2.9, only `predict` is supported. | -| `method` | String | Required. Defines the HTTP method for the API call. Supports `POST` and `GET`. | -| `url` | String | Required. Sets the connection endpoint at which the action occurs. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index#adding-trusted-endpoints). | -| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. | -| `request_body` | String | Required. Sets the parameters contained inside the request body of the action. The parameters must include `\"inputText\`, which specifies how users of the connector should construct the request payload for the `action_type`. | -| `pre_process_function` | String | Optional. A built-in or custom Painless script used to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:
- `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models
- `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models
- `connector.pre_process.default.embedding`, which you can use to preprocess documents in neural search requests so that they are in the format that ML Commons can process with the default preprocessor (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). | -| `post_process_function` | String | Optional. A built-in or custom Painless script used to post-process the model output data. OpenSearch provides the following built-in post-process functions that you can call directly:
- `connector.pre_process.cohere.embedding` for [Cohere text embedding models](https://docs.cohere.com/reference/embed)
- `connector.pre_process.openai.embedding` for [OpenAI text embedding models](https://platform.openai.com/docs/api-reference/embeddings)
- `connector.post_process.default.embedding`, which you can use to post-process documents in the model response so that they are in the format that neural search expects (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). | +| Field | Data type | Is required | Description | +|:------------------------|:------------|:------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `name` | String | Yes | The name of the connector. | +| `description` | String | Yes | A description of the connector. | +| `version` | Integer | Yes | The version of the connector. | +| `protocol` | String | Yes | The protocol for the connection. For AWS services such as Amazon SageMaker and Amazon Bedrock, use `aws_sigv4`. For all other services, use `http`. | +| `parameters` | JSON object | Yes | The default connector parameters, including `endpoint` and `model`. Any parameters indicated in this field can be overridden by parameters specified in a predict request. | +| `credential` | JSON object | Yes | Defines any credential variables required to connect to your chosen endpoint. ML Commons uses **AES/GCM/NoPadding** symmetric encryption to encrypt your credentials. When the connection to the cluster first starts, OpenSearch creates a random 32-byte encryption key that persists in OpenSearch's system index. Therefore, you do not need to manually set the encryption key. | +| `actions` | JSON array | Yes | Defines what actions can run within the connector. If you're an administrator creating a connection, add the [blueprint]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/) for your desired connection. | +| `backend_roles` | JSON array | Yes | A list of OpenSearch backend roles. For more information about setting up backend roles, see [Assigning backend roles to users]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#assigning-backend-roles-to-users). | +| `access_mode` | String | Yes | Sets the access mode for the model, either `public`, `restricted`, or `private`. Default is `private`. For more information about `access_mode`, see [Model groups]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#model-groups). | +| `add_all_backend_roles` | Boolean | Yes | When set to `true`, adds all `backend_roles` to the access list, which only a user with admin permissions can adjust. When set to `false`, non-admins can add `backend_roles`. | +| `client_config` | JSON object | No | The client configuration object, which provides settings that control the behavior of the client connections used by the connector. These settings allow you to manage connection limits and timeouts, ensuring efficient and reliable communication. | + + +The `actions` parameter supports the following options. 
+ +| Field | Data type | Description | +|:------------------------|:------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `action_type` | String | Required. Sets the ML Commons API operation to use upon connection. As of OpenSearch 2.9, only `predict` is supported. | +| `method` | String | Required. Defines the HTTP method for the API call. Supports `POST` and `GET`. | +| `url` | String | Required. Sets the connection endpoint at which the action occurs. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index#adding-trusted-endpoints). | +| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. | +| `request_body` | String | Required. Sets the parameters contained in the request body of the action. The parameters must include `\"inputText\`, which specifies how users of the connector should construct the request payload for the `action_type`. | +| `pre_process_function` | String | Optional. A built-in or custom Painless script used to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:
- `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models
- `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models
- `connector.pre_process.default.embedding`, which you can use to preprocess documents in neural search requests so that they are in the format that ML Commons can process with the default preprocessor (OpenSearch 2.11 or later). For more information, see [Built-in functions](#built-in-pre--and-post-processing-functions). | +| `post_process_function` | String | Optional. A built-in or custom Painless script used to post-process the model output data. OpenSearch provides the following built-in post-process functions that you can call directly:
- `connector.pre_process.cohere.embedding` for [Cohere text embedding models](https://docs.cohere.com/reference/embed)
- `connector.pre_process.openai.embedding` for [OpenAI text embedding models](https://platform.openai.com/docs/api-reference/embeddings)
- `connector.post_process.default.embedding`, which you can use to post-process documents in the model response so that they are in the format that neural search expects (OpenSearch 2.11 or later). For more information, see [Built-in functions](#built-in-pre--and-post-processing-functions). | + + +The `client_config` parameter supports the following options. + +| Field | Data type | Description | +|:---------------------|:----------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `max_connection` | Integer | The maximum number of concurrent connections that the client can establish with the server. | +| `connection_timeout` | Integer | The maximum amount of time (in seconds) that the client will wait while trying to establish a connection to the server. A timeout prevents the client from waiting indefinitely and allows it to recover from unreachable network endpoints. | +| `read_timeout` | Integer | The maximum amount of time (in seconds) that the client will wait for a response from the server after sending a request. Useful when the server is slow to respond or encounters issues while processing a request. | ## Built-in pre- and post-processing functions diff --git a/_ml-commons-plugin/remote-models/index.md b/_ml-commons-plugin/remote-models/index.md index 0b9c6d03ed..657d7254be 100644 --- a/_ml-commons-plugin/remote-models/index.md +++ b/_ml-commons-plugin/remote-models/index.md @@ -205,7 +205,18 @@ Take note of the returned `model_id` because you’ll need it to deploy the mode ## Step 4: Deploy the model -To deploy the registered model, provide its model ID from step 3 in the following request: +Starting with OpenSearch version 2.13, externally hosted models are deployed automatically by default when you send a Predict API request for the first time. To disable automatic deployment for an externally hosted model, set `plugins.ml_commons.model_auto_deploy.enable` to `false`: +```json +PUT _cluster/settings +{ + "persistent": { + "plugins.ml_commons.model_auto_deploy.enable" : "false" + } +} +``` +{% include copy-curl.html %} + +To undeploy the model, use the [Undeploy API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/undeploy-model/). ```bash POST /_plugins/_ml/models/cleMb4kBJ1eYAeTMFFg4/_deploy diff --git a/_query-dsl/minimum-should-match.md b/_query-dsl/minimum-should-match.md index 9ec65431b1..e2032b8911 100644 --- a/_query-dsl/minimum-should-match.md +++ b/_query-dsl/minimum-should-match.md @@ -26,7 +26,7 @@ GET /shakespeare/_search } ``` -In this example, the query has three optional clauses that are combined with an `OR`, so the document must match either `prince`, `king`, or `star`. +In this example, the query has three optional clauses that are combined with an `OR`, so the document must match either `prince` and `king`, or `prince` and `star`, or `king` and `star`. 
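To make the two-of-three requirement concrete, the same three optional clauses can be written as explicit `should` clauses in a `bool` query. This is a minimal sketch: it assumes the example above sets `minimum_should_match` to `2` and that the Shakespeare sample data's `text_entry` field is being searched.

```json
GET /shakespeare/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "text_entry": "prince" } },
        { "match": { "text_entry": "king" } },
        { "match": { "text_entry": "star" } }
      ],
      "minimum_should_match": 2
    }
  }
}
```

A line containing only `king` matches one clause and is not returned, while a line containing both `king` and `star` matches two clauses and is returned.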
## Valid values @@ -448,4 +448,4 @@ The results contain only four documents that match at least one of the optional ] } } -``` \ No newline at end of file +``` diff --git a/_search-plugins/caching/index.md b/_search-plugins/caching/index.md new file mode 100644 index 0000000000..4d0173fdc7 --- /dev/null +++ b/_search-plugins/caching/index.md @@ -0,0 +1,32 @@ +--- +layout: default +title: Caching +parent: Improving search performance +has_children: true +nav_order: 100 +--- + +# Caching + +OpenSearch relies heavily on different on-heap cache types to accelerate data retrieval, providing significant improvement in search latencies. However, cache size is limited by the amount of memory available on a node. If you are processing a larger dataset that can potentially be cached, the cache size limit causes a lot of cache evictions and misses. The increasing number of evictions impacts performance because OpenSearch needs to process the query again, causing high resource consumption. + +Prior to version 2.13, OpenSearch supported the following on-heap cache types: + +- **Request cache**: Caches the local results on each shard. This allows frequently used (and potentially resource-heavy) search requests to return results almost instantly. +- **Query cache**: The shard-level query cache caches common data from similar queries. The query cache is more granular than the request cache and can cache data that is reused in different queries. +- **Field data cache**: The field data cache contains field data and global ordinals, which are both used to support aggregations on certain field types. + +## Additional cache stores +**Introduced 2.13** +{: .label .label-purple } + +This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/10024). +{: .warning} + +In addition to existing OpenSearch custom on-heap cache stores, cache plugins provide the following cache stores: + +- **Disk cache**: This cache stores the precomputed result of a query on disk. You can use a disk cache to cache much larger datasets, provided that the disk latencies are acceptable. +- **Tiered cache**: This is a multi-level cache, in which each tier has its own characteristics and performance levels. For example, a tiered cache can contain on-heap and disk tiers. By combining different tiers, you can achieve a balance between cache performance and size. To learn more, see [Tiered cache]({{site.url}}{{site.baseurl}}/search-plugins/caching/tiered-cache/). + +In OpenSearch 2.13, the request cache is integrated with cache plugins. You can use a tiered or disk cache as a request-level cache. +{: .note} \ No newline at end of file diff --git a/_search-plugins/caching/tiered-cache.md b/_search-plugins/caching/tiered-cache.md new file mode 100644 index 0000000000..3842ebe5a9 --- /dev/null +++ b/_search-plugins/caching/tiered-cache.md @@ -0,0 +1,82 @@ +--- +layout: default +title: Tiered cache +parent: Caching +grand_parent: Improving search performance +nav_order: 10 +--- + +# Tiered cache + +This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/10024). 
+{: .warning} + +A tiered cache is a multi-level cache, in which each tier has its own characteristics and performance levels. By combining different tiers, you can achieve a balance between cache performance and size. + +## Types of tiered caches + +OpenSearch 2.13 provides an implementation of a _tiered spillover cache_. This implementation spills evicted items from upper to lower tiers. The upper tier, such as an on-heap tier, is smaller but offers lower latency. The lower tier, such as a disk cache, is larger but has higher latency. OpenSearch 2.13 offers on-heap and disk tiers. + +## Enabling a tiered cache + +To enable a tiered cache, configure the following setting: + +```yaml +opensearch.experimental.feature.pluggable.caching.enabled: true +``` +{% include copy.html %} + +For more information about ways to enable experimental features, see [Experimental feature flags]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). + +## Installing required plugins + +A tiered cache provides a way to plug in any disk or on-heap tier implementation. You can install the plugins you intend to use in the tiered cache. As of OpenSearch 2.13, the available cache plugin is the `cache-ehcache` plugin. This plugin provides a disk cache implementation to use within a tiered cache as a disk tier. + +A tiered cache will fail to initialize if the `cache-ehcache` plugin is not installed or disk cache properties are not set. +{: .warning} + +## Tiered cache settings + +In OpenSearch 2.13, a request cache can use a tiered cache. To begin, configure the following settings in the `opensearch.yml` file. + +### Cache store name + +Set the cache store name to `tiered_spillover` to use the OpenSearch-provided tiered spillover cache implementation: + +```yaml +indices.request.cache.store.name: tiered_spillover +``` +{% include copy.html %} + +### Setting on-heap and disk store tiers + +The `opensearch_onheap` setting is the built-in on-heap cache available in OpenSearch. The `ehcache_disk` setting is the disk cache implementation from [Ehcache](https://www.ehcache.org/). This requires installing the `cache-ehcache` plugin: + +```yaml +indices.request.cache.tiered_spillover.onheap.store.name: opensearch_onheap +indices.request.cache.tiered_spillover.disk.store.name: ehcache_disk +``` +{% include copy.html %} + +For more information about installing non-bundled plugins, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins). + +### Configuring on-heap and disk stores + +The following table lists the cache store settings for the `opensearch_onheap` store. + +Setting | Default | Description +:--- | :--- | :--- +`indices.request.cache.opensearch_onheap.size` | 1% of the heap | The size of the on-heap cache. Optional. +`indices.request.cache.opensearch_onheap.expire` | `MAX_VALUE` (disabled) | Specify a time-to-live (TTL) for the cached results. Optional. + +The following table lists the disk cache store settings for the `ehcache_disk` store. + +Setting | Default | Description +:--- | :--- | :--- +`indices.request.cache.ehcache_disk.max_size_in_bytes` | `1073741824` (1 GB) | Defines the size of the disk cache. Optional. +`indices.request.cache.ehcache_disk.storage.path` | `""` | Defines the storage path for the disk cache. Required. 
+`indices.request.cache.ehcache_disk.expire_after_access` | `MAX_VALUE` (disabled) | Specify a time-to-live (TTL) for the cached results. Optional. +`indices.request.cache.ehcache_disk.alias` | `ehcacheDiskCache#INDICES_REQUEST_CACHE` (this is an example of request cache) | Specify an alias for the disk cache. Optional. +`indices.request.cache.ehcache_disk.segments` | `16` | Defines the number of segments the disk cache is separated into. Used for concurrency. Optional. +`indices.request.cache.ehcache_disk.concurrency` | `1` | Defines the number of distinct write queues created for the disk store, where a group of segments share a write queue. Optional. + diff --git a/_search-plugins/concurrent-segment-search.md b/_search-plugins/concurrent-segment-search.md index 58b8d9a8ce..0bb7657937 100644 --- a/_search-plugins/concurrent-segment-search.md +++ b/_search-plugins/concurrent-segment-search.md @@ -27,7 +27,7 @@ By default, concurrent segment search is disabled on the cluster. You can enable - Cluster level - Index level -The index-level setting takes priority over the cluster-level setting. Thus, if the cluster setting is enabled but the index setting is disabled, then concurrent segment search will be disabled for that index. +The index-level setting takes priority over the cluster-level setting. Thus, if the cluster setting is enabled but the index setting is disabled, then concurrent segment search will be disabled for that index. Because of this, the index-level setting is not evaluated unless it is explicitly set, regardless of the default value configured for the setting. You can retrieve the current value of the index-level setting by calling the [Index Settings API]({{site.url}}{{site.baseurl}}/api-reference/index-apis/get-settings/) and omitting the `?include_defaults` query parameter. {: .note} To enable concurrent segment search for all indexes in the cluster, set the following dynamic cluster setting: diff --git a/_search-plugins/hybrid-search.md b/_search-plugins/hybrid-search.md index ebd014b0de..b0fb4d5bef 100644 --- a/_search-plugins/hybrid-search.md +++ b/_search-plugins/hybrid-search.md @@ -146,7 +146,9 @@ PUT /_search/pipeline/nlp-search-pipeline To perform hybrid search on your index, use the [`hybrid` query]({{site.url}}{{site.baseurl}}/query-dsl/compound/hybrid/), which combines the results of keyword and semantic search. -The following example request combines two query clauses---a neural query and a `match` query. It specifies the search pipeline created in the previous step as a query parameter: +#### Example: Combining a neural query and a match query + +The following example request combines two query clauses---a `neural` query and a `match` query. It specifies the search pipeline created in the previous step as a query parameter: ```json GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline @@ -161,7 +163,7 @@ GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline "queries": [ { "match": { - "text": { + "passage_text": { "query": "Hi world" } } @@ -216,3 +218,355 @@ The response contains the matching document: } } ``` +{% include copy-curl.html %} + +#### Example: Combining a match query and a term query + +The following example request combines two query clauses---a `match` query and a `term` query. 
It specifies the search pipeline created in the previous step as a query parameter: + +```json +GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline +{ + "_source": { + "exclude": [ + "passage_embedding" + ] + }, + "query": { + "hybrid": { + "queries": [ + { + "match":{ + "passage_text": "hello" + } + }, + { + "term":{ + "passage_text":{ + "value":"planet" + } + } + } + ] + } + } +} +``` +{% include copy-curl.html %} + +The response contains the matching documents: + +```json +{ + "took": 11, + "timed_out": false, + "_shards": { + "total": 2, + "successful": 2, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 0.7, + "hits": [ + { + "_index": "my-nlp-index", + "_id": "2", + "_score": 0.7, + "_source": { + "id": "s2", + "passage_text": "Hi planet" + } + }, + { + "_index": "my-nlp-index", + "_id": "1", + "_score": 0.3, + "_source": { + "id": "s1", + "passage_text": "Hello world" + } + } + ] + } +} +``` +{% include copy-curl.html %} + +## Hybrid search with post-filtering +**Introduced 2.13** +{: .label .label-purple } + +You can perform post-filtering on hybrid search results by providing the `post_filter` parameter in your query. + +The `post_filter` clause is applied after the search results have been retrieved. Post-filtering is useful for applying additional filters to the search results without impacting the scoring or the order of the results. + +Post-filtering does not impact document relevance scores or aggregation results. +{: .note} + +#### Example: Post-filtering + +The following example request combines two query clauses---a `term` query and a `match` query. This is the same query as in the [preceding example](#example-combining-a-match-query-and-a-term-query), but it contains a `post_filter`: + +```json +GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline +{ + "query": { + "hybrid":{ + "queries":[ + { + "match":{ + "passage_text": "hello" + } + }, + { + "term":{ + "passage_text":{ + "value":"planet" + } + } + } + ] + } + + }, + "post_filter":{ + "match": { "passage_text": "world" } + } +} + +``` +{% include copy-curl.html %} + +Compare the results to the results without post-filtering in the [preceding example](#example-combining-a-match-query-and-a-term-query). Unlike the preceding example response, which contains two documents, the response in this example contains one document because the second document is filtered using post-filtering: + +```json +{ + "took": 18, + "timed_out": false, + "_shards": { + "total": 2, + "successful": 2, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 0.3, + "hits": [ + { + "_index": "my-nlp-index", + "_id": "1", + "_score": 0.3, + "_source": { + "id": "s1", + "passage_text": "Hello world" + } + } + ] + } +} +``` + + +## Combining hybrid search and aggregations +**Introduced 2.13** +{: .label .label-purple } + +You can enhance search results by combining a hybrid query clause with any aggregation that OpenSearch supports. Aggregations allow you to use OpenSearch as an analytics engine. For more information about aggregations, see [Aggregations]({{site.url}}{{site.baseurl}}/aggregations/). + +Most aggregations are performed on the subset of documents that is returned by a hybrid query. The only aggregation that operates on all documents is the [`global`]({{site.url}}{{site.baseurl}}/aggregations/bucket/global/) aggregation. + +To use aggregations with a hybrid query, first create an index. 
Aggregations are typically used on fields of special types, like `keyword` or `integer`. The following example creates an index with several such fields: + +```json +PUT /my-nlp-index +{ + "settings": { + "number_of_shards": 2 + }, + "mappings": { + "properties": { + "doc_index": { + "type": "integer" + }, + "doc_keyword": { + "type": "keyword" + }, + "category": { + "type": "keyword" + } + } + } +} +``` +{% include copy-curl.html %} + +The following request ingests six documents into your new index: + +```json +POST /_bulk +{ "index": { "_index": "my-nlp-index" } } +{ "category": "permission", "doc_keyword": "workable", "doc_index": 4976, "doc_price": 100} +{ "index": { "_index": "my-nlp-index" } } +{ "category": "sister", "doc_keyword": "angry", "doc_index": 2231, "doc_price": 200 } +{ "index": { "_index": "my-nlp-index" } } +{ "category": "hair", "doc_keyword": "likeable", "doc_price": 25 } +{ "index": { "_index": "my-nlp-index" } } +{ "category": "editor", "doc_index": 9871, "doc_price": 30 } +{ "index": { "_index": "my-nlp-index" } } +{ "category": "statement", "doc_keyword": "entire", "doc_index": 8242, "doc_price": 350 } +{ "index": { "_index": "my-nlp-index" } } +{ "category": "statement", "doc_keyword": "idea", "doc_index": 5212, "doc_price": 200 } +{ "index": { "_index": "index-test" } } +{ "category": "editor", "doc_keyword": "bubble", "doc_index": 1298, "doc_price": 130 } +{ "index": { "_index": "index-test" } } +{ "category": "editor", "doc_keyword": "bubble", "doc_index": 521, "doc_price": 75 } +``` +{% include copy-curl.html %} + +Now you can combine a hybrid query clause with a `min` aggregation: + +```json +GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline +{ + "query": { + "hybrid": { + "queries": [ + { + "term": { + "category": "permission" + } + }, + { + "bool": { + "should": [ + { + "term": { + "category": "editor" + } + }, + { + "term": { + "category": "statement" + } + } + ] + } + } + ] + } + }, + "aggs": { + "total_price": { + "sum": { + "field": "doc_price" + } + }, + "keywords": { + "terms": { + "field": "doc_keyword", + "size": 10 + } + } + } +} +``` +{% include copy-curl.html %} + +The response contains the matching documents and the aggregation results: + +```json +{ + "took": 9, + "timed_out": false, + "_shards": { + "total": 2, + "successful": 2, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 4, + "relation": "eq" + }, + "max_score": 0.5, + "hits": [ + { + "_index": "my-nlp-index", + "_id": "mHRPNY4BlN82W_Ar9UMY", + "_score": 0.5, + "_source": { + "doc_price": 100, + "doc_index": 4976, + "doc_keyword": "workable", + "category": "permission" + } + }, + { + "_index": "my-nlp-index", + "_id": "m3RPNY4BlN82W_Ar9UMY", + "_score": 0.5, + "_source": { + "doc_price": 30, + "doc_index": 9871, + "category": "editor" + } + }, + { + "_index": "my-nlp-index", + "_id": "nXRPNY4BlN82W_Ar9UMY", + "_score": 0.5, + "_source": { + "doc_price": 200, + "doc_index": 5212, + "doc_keyword": "idea", + "category": "statement" + } + }, + { + "_index": "my-nlp-index", + "_id": "nHRPNY4BlN82W_Ar9UMY", + "_score": 0.5, + "_source": { + "doc_price": 350, + "doc_index": 8242, + "doc_keyword": "entire", + "category": "statement" + } + } + ] + }, + "aggregations": { + "total_price": { + "value": 680 + }, + "doc_keywords": { + "doc_count_error_upper_bound": 0, + "sum_other_doc_count": 0, + "buckets": [ + { + "key": "entire", + "doc_count": 1 + }, + { + "key": "idea", + "doc_count": 1 + }, + { + "key": "workable", + "doc_count": 1 + } + ] + } + } +} +``` \ No 
newline at end of file diff --git a/_search-plugins/knn/approximate-knn.md b/_search-plugins/knn/approximate-knn.md index 74cf7e39f5..16d1a7e686 100644 --- a/_search-plugins/knn/approximate-knn.md +++ b/_search-plugins/knn/approximate-knn.md @@ -287,9 +287,15 @@ Not every method supports each of these spaces. Be sure to check out [the method nmslib and faiss:\[ score = {1 \over 1 + d } \]
Lucene:\[ score = {2 - d \over 2}\] - innerproduct (not supported for Lucene) - \[ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} · \mathbf{y}} = - \sum_{i=1}^n x_i y_i \] - \[ \text{If} d \ge 0, \] \[score = {1 \over 1 + d }\] \[\text{If} d < 0, score = −d + 1\] + innerproduct (supported for Lucene in OpenSearch version 2.13 and later) + \[ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} · \mathbf{y}} = - \sum_{i=1}^n x_i y_i \] +
Lucene: + \[ d(\mathbf{x}, \mathbf{y}) = {\mathbf{x} · \mathbf{y}} = \sum_{i=1}^n x_i y_i \] + + \[ \text{If} d \ge 0, \] \[score = {1 \over 1 + d }\] \[\text{If} d < 0, score = −d + 1\] +
Lucene: + \[ \text{If} d > 0, score = d + 1 \] \[\text{If} d \le 0\] \[score = {1 \over 1 + (-1 · d) }\] + @@ -297,3 +303,8 @@ The cosine similarity formula does not include the `1 -` prefix. However, becaus smaller scores with closer results, they return `1 - cosineSimilarity` for cosine similarity space---that's why `1 -` is included in the distance function. {: .note } + +With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of +such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests +containing the zero vector will be rejected and a corresponding exception will be thrown. +{: .note } \ No newline at end of file diff --git a/_search-plugins/knn/knn-index.md b/_search-plugins/knn/knn-index.md index 4a527f3bcb..1e0c2e84f5 100644 --- a/_search-plugins/knn/knn-index.md +++ b/_search-plugins/knn/knn-index.md @@ -17,7 +17,7 @@ Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `luce ## Method definitions -A method definition refers to the underlying configuration of the Approximate k-NN algorithm you want to use. Method definitions are used to either create a `knn_vector` field (when the method does not require training) or [create a model during training]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model) that can then be used to [create a `knn_vector` field]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model). +A method definition refers to the underlying configuration of the approximate k-NN algorithm you want to use. Method definitions are used to either create a `knn_vector` field (when the method does not require training) or [create a model during training]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model) that can then be used to [create a `knn_vector` field]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model). A method definition will always contain the name of the method, the space_type the method is built for, the engine (the library) to use, and a map of parameters. @@ -33,7 +33,7 @@ Mapping parameter | Required | Default | Updatable | Description Method name | Requires training | Supported spaces | Description :--- | :--- | :--- | :--- -`hnsw` | false | l2, innerproduct, cosinesimil, l1, linf | Hierarchical proximity graph approach to Approximate k-NN search. For more details on the algorithm, see this [abstract](https://arxiv.org/abs/1603.09320). +`hnsw` | false | l2, innerproduct, cosinesimil, l1, linf | Hierarchical proximity graph approach to approximate k-NN search. For more details on the algorithm, see this [abstract](https://arxiv.org/abs/1603.09320). #### HNSW parameters @@ -52,7 +52,7 @@ An index created in OpenSearch version 2.11 or earlier will still use the old `e Method name | Requires training | Supported spaces | Description :--- | :--- | :--- | :--- -`hnsw` | false | l2, innerproduct | Hierarchical proximity graph approach to Approximate k-NN search. +`hnsw` | false | l2, innerproduct | Hierarchical proximity graph approach to approximate k-NN search. `ivf` | true | l2, innerproduct | Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched. For hnsw, "innerproduct" is not available when PQ is used. 
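For reference, the following is a minimal sketch of a faiss `hnsw` method definition that uses the `innerproduct` space without a PQ encoder. The index name, field name, and dimension are illustrative only:

```json
PUT /my-knn-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 128,
        "method": {
          "name": "hnsw",
          "space_type": "innerproduct",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 128,
            "m": 16
          }
        }
      }
    }
  }
}
```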
@@ -90,8 +90,8 @@ Training data can be composed of either the same data that is going to be ingest ### Supported Lucene methods Method name | Requires training | Supported spaces | Description -:--- | :--- | :--- | :--- -`hnsw` | false | l2, cosinesimil | Hierarchical proximity graph approach to Approximate k-NN search. +:--- | :--- |:--------------------------------------------------------------------------------| :--- +`hnsw` | false | l2, cosinesimil, innerproduct (supported in OpenSearch 2.13 and later) | Hierarchical proximity graph approach to approximate k-NN search. #### HNSW parameters @@ -259,7 +259,7 @@ At the moment, several parameters defined in the settings are in the deprecation Setting | Default | Updatable | Description :--- | :--- | :--- | :--- -`index.knn` | false | false | Whether the index should build native library indexes for the `knn_vector` fields. If set to false, the `knn_vector` fields will be stored in doc values, but Approximate k-NN search functionality will be disabled. +`index.knn` | false | false | Whether the index should build native library indexes for the `knn_vector` fields. If set to false, the `knn_vector` fields will be stored in doc values, but approximate k-NN search functionality will be disabled. `index.knn.algo_param.ef_search` | 100 | true | The size of the dynamic list used during k-NN searches. Higher values result in more accurate but slower searches. Only available for NMSLIB. `index.knn.algo_param.ef_construction` | 100 | false | Deprecated in 1.0.0. Instead, use the [mapping parameters](https://opensearch.org/docs/latest/search-plugins/knn/knn-index/#method-definitions) to set this value. `index.knn.algo_param.m` | 16 | false | Deprecated in 1.0.0. Use the [mapping parameters](https://opensearch.org/docs/latest/search-plugins/knn/knn-index/#method-definitions) to set this value instead. diff --git a/_search-plugins/knn/knn-score-script.md b/_search-plugins/knn/knn-score-script.md index 602346803d..cc79e90850 100644 --- a/_search-plugins/knn/knn-score-script.md +++ b/_search-plugins/knn/knn-score-script.md @@ -313,9 +313,11 @@ A space corresponds to the function used to measure the distance between two poi \[ score = 2 - d \] - innerproduct (not supported for Lucene) - \[ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} · \mathbf{y}} = - \sum_{i=1}^n x_i y_i \] - \[ \text{If} d \ge 0, \] \[score = {1 \over 1 + d }\] \[\text{If} d < 0, score = −d + 1\] + innerproduct (supported for Lucene in OpenSearch version 2.13 and later) + \[ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} · \mathbf{y}} = - \sum_{i=1}^n x_i y_i \] + + \[ \text{If} d \ge 0, \] \[score = {1 \over 1 + d }\] \[\text{If} d < 0, score = −d + 1\] + hammingbit @@ -326,3 +328,8 @@ A space corresponds to the function used to measure the distance between two poi Cosine similarity returns a number between -1 and 1, and because OpenSearch relevance scores can't be below 0, the k-NN plugin adds 1 to get the final score. + +With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...`]) as input. This is because the magnitude of +such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests +containing the zero vector will be rejected and a corresponding exception will be thrown. 
+{: .note } \ No newline at end of file diff --git a/_search-plugins/knn/painless-functions.md b/_search-plugins/knn/painless-functions.md index 2b28f753ef..1f27cc29a6 100644 --- a/_search-plugins/knn/painless-functions.md +++ b/_search-plugins/knn/painless-functions.md @@ -67,3 +67,8 @@ cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector fie ``` Because scores can only be positive, this script ranks documents with vector fields higher than those without. + +With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...`]) as input. This is because the magnitude of +such a vector is 0, which raises a `divide by 0` exception when computing the value. Requests +containing the zero vector will be rejected and a corresponding exception will be thrown. +{: .note } \ No newline at end of file diff --git a/_search-plugins/neural-sparse-search.md b/_search-plugins/neural-sparse-search.md index 31ae43991e..88d30e4391 100644 --- a/_search-plugins/neural-sparse-search.md +++ b/_search-plugins/neural-sparse-search.md @@ -55,6 +55,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse ``` {% include copy-curl.html %} +To split long text into passages, use the `text_chunking` ingest processor before the `sparse_encoding` processor. For more information, see [Chaining text chunking and embedding processors]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/#chaining-text-chunking-and-embedding-processors). + ## Step 2: Create an index for ingestion In order to use the text embedding processor defined in your pipeline, create a rank features index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as [`rank_features`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/rank/#rank-features). Similarly, the `passage_text` field should be mapped as `text`. @@ -237,3 +239,129 @@ The response contains the matching documents: } } ``` + +## Setting a default model on an index or field + +A [`neural_sparse`]({{site.url}}{{site.baseurl}}/query-dsl/specialized/neural-sparse/) query requires a model ID for generating sparse embeddings. To eliminate passing the model ID with each neural_sparse query request, you can set a default model on index-level or field-level. + +First, create a [search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/index/) with a [`neural_query_enricher`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/neural-query-enricher/) request processor. To set a default model for an index, provide the model ID in the `default_model_id` parameter. To set a default model for a specific field, provide the field name and the corresponding model ID in the `neural_field_default_id` map. 
If you provide both `default_model_id` and `neural_field_default_id`, `neural_field_default_id` takes precedence: + +```json +PUT /_search/pipeline/default_model_pipeline +{ + "request_processors": [ + { + "neural_query_enricher" : { + "default_model_id": "bQ1J8ooBpBj3wT4HVUsb", + "neural_field_default_id": { + "my_field_1": "uZj0qYoBMtvQlfhaYeud", + "my_field_2": "upj0qYoBMtvQlfhaZOuM" + } + } + } + ] +} +``` +{% include copy-curl.html %} + +Then set the default model for your index: + +```json +PUT /my-nlp-index/_settings +{ + "index.search.default_pipeline" : "default_model_pipeline" +} +``` +{% include copy-curl.html %} + +You can now omit the model ID when searching: + +```json +GET /my-nlp-index/_search +{ + "query": { + "neural_sparse": { + "passage_embedding": { + "query_text": "Hi world" + } + } + } +} +``` +{% include copy-curl.html %} + +The response contains both documents: + +```json +{ + "took" : 688, + "timed_out" : false, + "_shards" : { + "total" : 1, + "successful" : 1, + "skipped" : 0, + "failed" : 0 + }, + "hits" : { + "total" : { + "value" : 2, + "relation" : "eq" + }, + "max_score" : 30.0029, + "hits" : [ + { + "_index" : "my-nlp-index", + "_id" : "1", + "_score" : 30.0029, + "_source" : { + "passage_text" : "Hello world", + "passage_embedding" : { + "!" : 0.8708904, + "door" : 0.8587369, + "hi" : 2.3929274, + "worlds" : 2.7839446, + "yes" : 0.75845814, + "##world" : 2.5432441, + "born" : 0.2682308, + "nothing" : 0.8625516, + "goodbye" : 0.17146169, + "greeting" : 0.96817183, + "birth" : 1.2788506, + "come" : 0.1623208, + "global" : 0.4371151, + "it" : 0.42951578, + "life" : 1.5750692, + "thanks" : 0.26481047, + "world" : 4.7300377, + "tiny" : 0.5462298, + "earth" : 2.6555297, + "universe" : 2.0308156, + "worldwide" : 1.3903781, + "hello" : 6.696973, + "so" : 0.20279501, + "?" : 0.67785245 + }, + "id" : "s1" + } + }, + { + "_index" : "my-nlp-index", + "_id" : "2", + "_score" : 16.480486, + "_source" : { + "passage_text" : "Hi planet", + "passage_embedding" : { + "hi" : 4.338913, + "planets" : 2.7755864, + "planet" : 5.0969057, + "mars" : 1.7405145, + "earth" : 2.6087382, + "hello" : 3.3210192 + }, + "id" : "s2" + } + } + ] + } +} +``` \ No newline at end of file diff --git a/_search-plugins/search-pipelines/search-processors.md b/_search-plugins/search-pipelines/search-processors.md index 36b848e6eb..5e53cf5615 100644 --- a/_search-plugins/search-pipelines/search-processors.md +++ b/_search-plugins/search-pipelines/search-processors.md @@ -24,7 +24,7 @@ The following table lists all supported search request processors. Processor | Description | Earliest available version :--- | :--- | :--- [`filter_query`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/filter-query-processor/) | Adds a filtering query that is used to filter requests. | 2.8 -[`neural_query_enricher`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/neural-query-enricher/) | Sets a default model for neural search at the index or field level. | 2.11 +[`neural_query_enricher`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/neural-query-enricher/) | Sets a default model for neural search and neural sparse search at the index or field level. | 2.11(neural), 2.13(neural sparse) [`script`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/script-processor/) | Adds a script that is run on newly indexed documents. 
| 2.8 [`oversample`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/oversample-processor/) | Increases the search request `size` parameter, storing the original value in the pipeline state. | 2.12 diff --git a/_search-plugins/semantic-search.md b/_search-plugins/semantic-search.md index f4753bee1c..32bd18cd6c 100644 --- a/_search-plugins/semantic-search.md +++ b/_search-plugins/semantic-search.md @@ -48,6 +48,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline ``` {% include copy-curl.html %} +To split long text into passages, use the `text_chunking` ingest processor before the `text_embedding` processor. For more information, see [Chaining text chunking and embedding processors]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/#chaining-text-chunking-and-embedding-processors). + ## Step 2: Create an index for ingestion In order to use the text embedding processor defined in your pipeline, create a k-NN index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension. Similarly, the `passage_text` field should be mapped as `text`. diff --git a/_search-plugins/sql/ppl/index.md b/_search-plugins/sql/ppl/index.md index c39e3429e1..56ffebf555 100644 --- a/_search-plugins/sql/ppl/index.md +++ b/_search-plugins/sql/ppl/index.md @@ -12,6 +12,8 @@ redirect_from: - /search-plugins/ppl/index/ - /search-plugins/ppl/endpoint/ - /search-plugins/ppl/protocol/ + - /search-plugins/sql/ppl/index/ + - /observability-plugin/ppl/index/ --- # PPL diff --git a/_security/access-control/anonymous-authentication.md b/_security/access-control/anonymous-authentication.md index 429daafb9b..cb2f951546 100644 --- a/_security/access-control/anonymous-authentication.md +++ b/_security/access-control/anonymous-authentication.md @@ -30,6 +30,19 @@ The following table describes the `anonymous_auth_enabled` setting. For more inf If you disable anonymous authentication, you must provide at least one `authc` in order for the Security plugin to initialize successfully. {: .important } +## OpenSearch Dashboards configuration + +To enable anonymous authentication for OpenSearch Dashboards, you need to modify the `opensearch_dashboards.yml` file in the configuration directory of your OpenSearch Dashboards installation. + +Add the following setting to `opensearch_dashboards.yml`: + +```yml +opensearch_security.auth.anonymous_auth_enabled: true +``` + +Anonymous login for OpenSearch Dashboards requires anonymous authentication to be enabled on the OpenSearch cluster. +{: .important} + ## Defining anonymous authentication privileges When anonymous authentication is enabled, your defined HTTP authenticators still try to find user credentials inside your HTTP request. If credentials are found, the user is authenticated. If none are found, the user is authenticated as an `anonymous` user. 
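For context, the cluster-side flag that the Dashboards setting above depends on is typically toggled in the Security plugin's `config.yml`. The following is a minimal sketch and assumes your existing `authc` configuration in that file stays in place:

```yml
---
_meta:
  type: "config"
  config_version: 2
config:
  dynamic:
    http:
      anonymous_auth_enabled: true
```

After changing `config.yml`, you typically apply the change with the `securityadmin.sh` tool so that the security index is updated.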
diff --git a/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md b/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md index 3eb40fe2ed..7cc533fe76 100644 --- a/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md +++ b/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md @@ -24,8 +24,12 @@ _Cluster state_ is an internal data structure that contains the metadata of the The cluster state metadata is managed by the elected cluster manager node and is essential for the cluster to properly function. When the cluster loses the majority of the cluster manager nodes permanently, then the cluster may experience data loss because the latest cluster state metadata might not be present in the surviving cluster manager nodes. Persisting the state of all the cluster manager nodes in the cluster to remote-backed storage provides better durability. When the remote cluster state feature is enabled, the cluster metadata will be published to a remote repository configured in the cluster. -Any time new cluster manager nodes are launched after disaster recovery, the nodes will automatically bootstrap using the latest metadata stored in the remote repository. -After the metadata is restored automatically from the latest metadata stored, and if the data nodes are unchanged in the index data, the metadata lost will be automatically recovered. However, if the data nodes have been replaced, then you can restore the index data by invoking the `_remotestore/_restore` API as described in the [remote store documentation]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/index/). +Any time new cluster manager nodes are launched after disaster recovery, the nodes will automatically bootstrap using the latest metadata stored in the remote repository. This provides metadata durability. + +You can enable remote cluster state independently of remote-backed data storage. +{: .note} + +If you require data durability, you must enable remote-backed data storage as described in the [remote store documentation]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/index/). ## Configuring the remote cluster state @@ -59,4 +63,3 @@ Setting | Default | Description The remote cluster state functionality has the following limitations: - Unsafe bootstrap scripts cannot be run when the remote cluster state is enabled. When a majority of cluster-manager nodes are lost and the cluster goes down, the user needs to replace any remaining cluster manager nodes and reseed the nodes in order to bootstrap a new cluster. -- The remote cluster state cannot be enabled without first configuring remote-backed storage. diff --git a/images/dashboards/multidata-hide-localcluster.gif b/images/dashboards/multidata-hide-localcluster.gif new file mode 100644 index 0000000000..b778063943 Binary files /dev/null and b/images/dashboards/multidata-hide-localcluster.gif differ diff --git a/images/dashboards/multidata-hide-show-auth.gif b/images/dashboards/multidata-hide-show-auth.gif new file mode 100644 index 0000000000..9f1f945c44 Binary files /dev/null and b/images/dashboards/multidata-hide-show-auth.gif differ diff --git a/images/dashboards/vega-2.png b/images/dashboards/vega-2.png new file mode 100644 index 0000000000..1faa3a6e67 Binary files /dev/null and b/images/dashboards/vega-2.png differ