New plugin - Annotated_text field type (#30364)
New plugin for the `annotated_text` field type. Largely a copy of the `text` field type, but adds the ability to include markdown-like syntax in the text. The `AnnotatedText` class parses text-plus-markup and converts it into plain text and AnnotationTokens. The annotation token values are injected unchanged alongside the regular text tokens to provide an additional indexed overlay useful in positional searches and highlighting. `annotated_text` fields do not support fielddata, as we want to phase this out. Also includes a new "annotated" highlighter type that retains annotations and merges in search hits as additional annotation markup.

Closes #29467
parent ab9c28a2b1
commit 2fa09f062e

@ -0,0 +1,328 @@
[[mapper-annotated-text]]
=== Mapper Annotated Text Plugin

experimental[]

The mapper-annotated-text plugin provides the ability to index text that is a
combination of free-text and special markup that is typically used to identify
items of interest such as people or organisations (see NER or Named Entity Recognition
tools).

The elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token
stream at the same position as the underlying text it annotates.

:plugin_name: mapper-annotated-text
include::install_remove.asciidoc[]

[[mapper-annotated-text-usage]]
==== Using the `annotated-text` field

The `annotated-text` field tokenizes text content as per the more common `text` field (see
"limitations" below) but also injects any marked-up annotation tokens directly into
the search index:

[source,js]
--------------------------
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "annotated_text"
        }
      }
    }
  }
}
--------------------------
// CONSOLE

Such a mapping would allow marked-up text, e.g. Wikipedia articles, to be indexed as both text
and structured tokens. The annotations use a markdown-like syntax using URL encoding of
one or more values separated by the `&` symbol.
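
For example, the following snippets (taken from the examples later in this page) show a
single-value and a multi-value annotation - note that values are URL encoded, so a space
can be written as `+` or `%20`:

    [Beck](Beck) announced a new tour
    [Jeff Beck](Jeff+Beck&Guitarist) plays a strat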

We can use the `_analyze` API to test how an example annotation would be stored as tokens
in the search index:

[source,js]
--------------------------
GET my_index/_analyze
{
  "field": "my_field",
  "text": "Investors in [Apple](Apple+Inc.) rejoiced."
}
--------------------------
// NOTCONSOLE

Response:

[source,js]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "investors",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Apple Inc.", <1>
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
      "position": 2
    },
    {
      "token": "apple",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "rejoiced",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
--------------------------------------------------
// NOTCONSOLE

<1> Note the whole annotation token `Apple Inc.` is placed, unchanged, as a single token in
the token stream and at the same position (position 2) as the text token (`apple`) it annotates.

We can now perform searches for annotations using regular `term` queries that don't tokenize
the provided search values. Annotations are a more precise way of matching, as can be seen
in this example where a search for `Beck` will not match `Jeff Beck`:

[source,js]
--------------------------
# Example documents
PUT my_index/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour" <1>
}

PUT my_index/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat" <2>
}

# Example search
GET my_index/_search
{
  "query": {
    "term": {
      "my_field": "Beck" <3>
    }
  }
}
--------------------------
// CONSOLE

<1> As well as tokenising the plain text into single words e.g. `beck`, here we
inject the single token value `Beck` at the same position as `beck` in the token stream.
<2> Note annotations can inject multiple tokens at the same position - here we inject both
the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables
broader positional queries e.g. finding mentions of a `Guitarist` near to `strat`.
<3> A benefit of searching with these carefully defined annotation tokens is that a query for
`Beck` will not match document 2, which contains the tokens `jeff`, `beck` and `Jeff Beck`.

WARNING: Any use of `=` signs in annotation values e.g. `[Prince](person=Prince)` will
cause the document to be rejected with a parse failure. In future we hope to have a use for
the equals signs so will actively reject documents that contain this today.

[[mapper-annotated-text-tips]]
==== Data modelling tips
===== Use structured and unstructured fields

Annotations are normally a way of weaving structured information into unstructured text for
higher-precision search.

`Entity resolution` is a form of document enrichment undertaken by specialist software or people
where references to entities in a document are disambiguated by attaching a canonical ID.
The ID is used to resolve any number of aliases or distinguish between people with the
same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved
entity IDs woven into text.

These IDs can be embedded as annotations in an annotated_text field but it often makes
sense to include them in dedicated structured fields to support discovery via aggregations:

[source,js]
--------------------------
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_unstructured_text_field": {
          "type": "annotated_text"
        },
        "my_twitter_handles": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
--------------------------
// CONSOLE

Applications would then typically provide content and discover it as follows:

[source,js]
--------------------------
# Example documents
PUT my_index/_doc/1
{
  "my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch",
  "my_twitter_handles": ["@kimchy"] <1>
}

GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "elasticsearch OR logstash OR kibana", <2>
      "default_field": "my_unstructured_text_field"
    }
  },
  "aggregations": {
    "top_people": {
      "significant_terms": { <3>
        "field": "my_twitter_handles.keyword"
      }
    }
  }
}
--------------------------
// CONSOLE

<1> Note the `my_twitter_handles` field contains a list of the annotation values
also used in the unstructured text. (Note the annotated_text syntax requires escaping.)
By repeating the annotation values in a structured field this application has ensured that
the tokens discovered in the structured field can be used for search and highlighting
in the unstructured field.
<2> In this example we search for documents that talk about components of the elastic stack.
<3> We use the `my_twitter_handles` field here to discover people who are significantly
associated with the elastic stack.
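
As an illustration of the escaping mentioned in the callout above, the URL encoded
annotation value in the markup decodes to the literal handle that is stored in the
structured field:

    [Shay](%40kimchy)   ->   injected annotation token: @kimchy
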
===== Avoiding over-matching annotations
By design, the regular text tokens and the annotation tokens co-exist in the same indexed
field but in rare cases this can lead to some over-matching.

The value of an annotation often denotes a _named entity_ (a person, place or company).
The tokens for these named entities are inserted untokenized, and differ from typical text
tokens because they are normally:

* Mixed case e.g. `Madonna`
* Multiple words e.g. `Jeff Beck`
* Can have punctuation or numbers e.g. `Apple Inc.` or `@kimchy`

This means, for the most part, a search for a named entity in the annotated text field will
not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result
you can drill down to highlight uses in the text without "over matching" on any text tokens
like the word `apple` in this context:

    the apple was very juicy

However, a problem arises if your named entity happens to be a single term and lower-case e.g. the
company `elastic`. In this case, a search on the annotated text field for the token `elastic`
may match a text document such as this:

    he fired an elastic band

To avoid such false matches, users should consider prefixing annotation values to ensure
they don't clash with text tokens e.g.

    [elastic](Company_elastic) released version 7.0 of the elastic stack today

[[mapper-annotated-text-highlighter]]
==== Using the `annotated` highlighter

The `annotated-text` plugin includes a custom highlighter designed to mark up search hits
in a way which is respectful of the original markup:

[source,js]
--------------------------
# Example documents
PUT my_index/_doc/1
{
  "my_field": "The cat sat on the [mat](sku3578)"
}

GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "cats"
    }
  },
  "highlight": {
    "fields": {
      "my_field": {
        "type": "annotated", <1>
        "require_field_match": false
      }
    }
  }
}
--------------------------
// CONSOLE
<1> The `annotated` highlighter type is designed for use with annotated_text fields.

The annotated highlighter is based on the `unified` highlighter and supports the same
settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
html-like markup such as `<em>cat</em>`, the annotated highlighter uses the same
markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term`
is the key and the matched search term is the value e.g.

    The [cat](_hit_term=cat) sat on the [mat](sku3578)

The annotated highlighter tries to be respectful of any existing markup in the original
text:

* If the search term matches exactly the location of an existing annotation then the
`_hit_term` key is merged into the url-like syntax used in the `(...)` part of the
existing annotation.
* However, if the search term overlaps the span of an existing annotation it would break
the markup formatting, so the original annotation is removed in favour of a new annotation
with just the search hit information in the results.
* Any non-overlapping annotations in the original text are preserved in highlighter
selections. A worked illustration of these rules follows below.
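
As a hypothetical illustration of these rules, assume an indexed value of
`The [cat](sku42) sat on the [mat](sku3578)` (where `sku42` is an invented annotation value):

    query for `cat`  ->  The [cat](sku42&_hit_term=cat) sat on the [mat](sku3578)
    query for `mat`  ->  The [cat](sku42) sat on the [mat](sku3578&_hit_term=mat)

A hit that only partially overlapped the `[cat](sku42)` span would instead cause that
annotation to be replaced by one carrying just the `_hit_term` information, while the
non-overlapping `[mat](sku3578)` annotation would still be preserved.
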
[[mapper-annotated-text-limitations]]
==== Limitations

The annotated_text field type supports the same mapping settings as the `text` field type
but with the following exceptions:

* No support for `fielddata` or `fielddata_frequency_filter`
* No support for `index_prefixes` or `index_phrases` indexing

@ -19,5 +19,13 @@ indexes the size in bytes of the original
The mapper-murmur3 plugin allows hashes to be computed at index-time and stored
in the index for later use with the `cardinality` aggregation.

<<mapper-annotated-text>>::

The annotated text plugin provides the ability to index text that is a
combination of free-text and special markup that is typically used to identify
items of interest such as people or organisations (see NER or Named Entity Recognition
tools).

include::mapper-size.asciidoc[]
include::mapper-murmur3.asciidoc[]
include::mapper-annotated-text.asciidoc[]

@ -28,6 +28,7 @@ U7321H6 discovery-gce {version} The Google Compute Engine (GCE) Discov
U7321H6 ingest-attachment {version} Ingest processor that uses Apache Tika to extract contents
U7321H6 ingest-geoip {version} Ingest processor that looks up geo data based on IP addresses using the Maxmind geo database
U7321H6 ingest-user-agent {version} Ingest processor that extracts information from a user agent
U7321H6 mapper-annotated-text {version} The Mapper Annotated_text plugin adds support for text fields with markup used to inject annotation tokens into the index.
U7321H6 mapper-murmur3 {version} The Mapper Murmur3 plugin allows to compute hashes of a field's values at index-time and to store them in the index.
U7321H6 mapper-size {version} The Mapper Size plugin allows documents to record their uncompressed size at index time.
U7321H6 store-smb {version} The Store SMB plugin adds support for SMB stores.

@ -35,6 +35,7 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>
`completion` to provide auto-complete suggestions
<<token-count>>:: `token_count` to count the number of tokens in a string
{plugins}/mapper-murmur3.html[`mapper-murmur3`]:: `murmur3` to compute hashes of values at index-time and store them in the index
{plugins}/mapper-annotated-text.html[`mapper-annotated-text`]:: `annotated-text` to index text containing special markup (typically used for identifying named entities)

<<percolator>>:: Accepts queries from the query-dsl

@ -0,0 +1,23 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

esplugin {
    description 'The Mapper Annotated_text plugin adds support for text fields with markup used to inject annotation tokens into the index.'
    classname 'org.elasticsearch.plugin.mapper.AnnotatedTextPlugin'
}

@ -0,0 +1,776 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.index.mapper.annotatedtext;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.search.NormsFieldExistsQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.elasticsearch.ElasticsearchParseException;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.support.XContentMapValues;
import org.elasticsearch.index.analysis.AnalyzerScope;
import org.elasticsearch.index.analysis.NamedAnalyzer;
import org.elasticsearch.index.mapper.FieldMapper;
import org.elasticsearch.index.mapper.FieldNamesFieldMapper;
import org.elasticsearch.index.mapper.MappedFieldType;
import org.elasticsearch.index.mapper.Mapper;
import org.elasticsearch.index.mapper.MapperParsingException;
import org.elasticsearch.index.mapper.ParseContext;
import org.elasticsearch.index.mapper.StringFieldType;
import org.elasticsearch.index.mapper.TextFieldMapper;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedText.AnnotationToken;
import org.elasticsearch.index.query.QueryShardContext;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static org.elasticsearch.index.mapper.TypeParsers.parseTextField;

/** A {@link FieldMapper} for full-text fields with annotation markup e.g.
 *
 * "New mayor is [John Smith](type=person&value=John%20Smith) "
 *
 * A special Analyzer wraps the default choice of analyzer in order
 * to strip the text field of annotation markup and inject the related
 * entity annotation tokens as supplementary tokens at the relevant points
 * in the token stream.
 * This code is largely a copy of TextFieldMapper which is less than ideal -
 * my attempts to subclass TextFieldMapper failed but we can revisit this.
 **/
public class AnnotatedTextFieldMapper extends FieldMapper {

    public static final String CONTENT_TYPE = "annotated_text";
    private static final int POSITION_INCREMENT_GAP_USE_ANALYZER = -1;

    public static class Defaults {
        public static final MappedFieldType FIELD_TYPE = new AnnotatedTextFieldType();
        static {
            FIELD_TYPE.freeze();
        }
    }

    public static class Builder extends FieldMapper.Builder<Builder, AnnotatedTextFieldMapper> {

        private int positionIncrementGap = POSITION_INCREMENT_GAP_USE_ANALYZER;

        public Builder(String name) {
            super(name, Defaults.FIELD_TYPE, Defaults.FIELD_TYPE);
            builder = this;
        }

        @Override
        public AnnotatedTextFieldType fieldType() {
            return (AnnotatedTextFieldType) super.fieldType();
        }

        public Builder positionIncrementGap(int positionIncrementGap) {
            if (positionIncrementGap < 0) {
                throw new MapperParsingException("[positions_increment_gap] must be positive, got " + positionIncrementGap);
            }
            this.positionIncrementGap = positionIncrementGap;
            return this;
        }

        @Override
        public Builder docValues(boolean docValues) {
            if (docValues) {
                throw new IllegalArgumentException("[" + CONTENT_TYPE + "] fields do not support doc values");
            }
            return super.docValues(docValues);
        }

        @Override
        public AnnotatedTextFieldMapper build(BuilderContext context) {
            if (fieldType().indexOptions() == IndexOptions.NONE) {
                throw new IllegalArgumentException("[" + CONTENT_TYPE + "] fields must be indexed");
            }
            if (positionIncrementGap != POSITION_INCREMENT_GAP_USE_ANALYZER) {
                if (fieldType.indexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) < 0) {
                    throw new IllegalArgumentException("Cannot set position_increment_gap on field ["
                        + name + "] without positions enabled");
                }
                fieldType.setIndexAnalyzer(new NamedAnalyzer(fieldType.indexAnalyzer(), positionIncrementGap));
                fieldType.setSearchAnalyzer(new NamedAnalyzer(fieldType.searchAnalyzer(), positionIncrementGap));
                fieldType.setSearchQuoteAnalyzer(new NamedAnalyzer(fieldType.searchQuoteAnalyzer(), positionIncrementGap));
            } else {
                // Using the analyzer's default BUT need to do the same thing AnalysisRegistry.processAnalyzerFactory
                // does to splice in new default of posIncGap=100 by wrapping the analyzer
                if (fieldType.indexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0) {
                    int overrideInc = TextFieldMapper.Defaults.POSITION_INCREMENT_GAP;
                    fieldType.setIndexAnalyzer(new NamedAnalyzer(fieldType.indexAnalyzer(), overrideInc));
                    fieldType.setSearchAnalyzer(new NamedAnalyzer(fieldType.searchAnalyzer(), overrideInc));
                    fieldType.setSearchQuoteAnalyzer(new NamedAnalyzer(fieldType.searchQuoteAnalyzer(), overrideInc));
                }
            }
            setupFieldType(context);
            return new AnnotatedTextFieldMapper(
                    name, fieldType(), defaultFieldType, positionIncrementGap,
                    context.indexSettings(), multiFieldsBuilder.build(this, context), copyTo);
        }
    }

    public static class TypeParser implements Mapper.TypeParser {
        @Override
        public Mapper.Builder<AnnotatedTextFieldMapper.Builder, AnnotatedTextFieldMapper> parse(
                String fieldName, Map<String, Object> node, ParserContext parserContext) throws MapperParsingException {
            AnnotatedTextFieldMapper.Builder builder = new AnnotatedTextFieldMapper.Builder(fieldName);

            builder.fieldType().setIndexAnalyzer(parserContext.getIndexAnalyzers().getDefaultIndexAnalyzer());
            builder.fieldType().setSearchAnalyzer(parserContext.getIndexAnalyzers().getDefaultSearchAnalyzer());
            builder.fieldType().setSearchQuoteAnalyzer(parserContext.getIndexAnalyzers().getDefaultSearchQuoteAnalyzer());
            parseTextField(builder, fieldName, node, parserContext);
            for (Iterator<Map.Entry<String, Object>> iterator = node.entrySet().iterator(); iterator.hasNext();) {
                Map.Entry<String, Object> entry = iterator.next();
                String propName = entry.getKey();
                Object propNode = entry.getValue();
                if (propName.equals("position_increment_gap")) {
                    int newPositionIncrementGap = XContentMapValues.nodeIntegerValue(propNode, -1);
                    builder.positionIncrementGap(newPositionIncrementGap);
                    iterator.remove();
                }
            }
            return builder;
        }
    }

    /**
     * Parses markdown-like syntax into plain text and AnnotationTokens with offsets for
     * annotations found in texts
     */
    public static final class AnnotatedText {
        public final String textPlusMarkup;
        public final String textMinusMarkup;
        List<AnnotationToken> annotations;

        // Format is markdown-like syntax for URLs eg:
        // "New mayor is [John Smith](type=person&value=John%20Smith) "
        static Pattern markdownPattern = Pattern.compile("\\[([^\\]\\[]*)\\]\\(([^\\)\\(]*)\\)");
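
        // For example, parsing "Investors in [Apple](Apple+Inc.) rejoiced." yields
        // textMinusMarkup "Investors in Apple rejoiced." plus one AnnotationToken with the
        // URL-decoded value "Apple Inc." spanning plain-text offsets 13-18. (Note the parse
        // method below deliberately rejects key=value pairs for now - see the docs WARNING.)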

        public static AnnotatedText parse(String textPlusMarkup) {
            List<AnnotationToken> annotations = new ArrayList<>();
            Matcher m = markdownPattern.matcher(textPlusMarkup);
            int lastPos = 0;
            StringBuilder sb = new StringBuilder();
            while (m.find()) {
                if (m.start() > lastPos) {
                    sb.append(textPlusMarkup.substring(lastPos, m.start()));
                }

                int startOffset = sb.length();
                int endOffset = sb.length() + m.group(1).length();
                sb.append(m.group(1));
                lastPos = m.end();

                String[] pairs = m.group(2).split("&");
                String value = null;
                for (String pair : pairs) {
                    String[] kv = pair.split("=");
                    try {
                        if (kv.length == 2) {
                            throw new ElasticsearchParseException("key=value pairs are not supported in annotations");
                        }
                        if (kv.length == 1) {
                            // Check "=" sign wasn't in the pair string
                            if (kv[0].length() == pair.length()) {
                                // untyped value
                                value = URLDecoder.decode(kv[0], "UTF-8");
                            }
                        }
                        if (value != null && value.length() > 0) {
                            annotations.add(new AnnotationToken(startOffset, endOffset, value));
                        }
                    } catch (UnsupportedEncodingException uee) {
                        throw new ElasticsearchParseException("Unsupported encoding parsing annotated text", uee);
                    }
                }
            }
            if (lastPos < textPlusMarkup.length()) {
                sb.append(textPlusMarkup.substring(lastPos));
            }
            return new AnnotatedText(sb.toString(), textPlusMarkup, annotations);
        }

        protected AnnotatedText(String textMinusMarkup, String textPlusMarkup, List<AnnotationToken> annotations) {
            this.textMinusMarkup = textMinusMarkup;
            this.textPlusMarkup = textPlusMarkup;
            this.annotations = annotations;
        }

        public static final class AnnotationToken {
            public final int offset;
            public final int endOffset;

            public final String value;

            public AnnotationToken(int offset, int endOffset, String value) {
                this.offset = offset;
                this.endOffset = endOffset;
                this.value = value;
            }

            @Override
            public String toString() {
                return value + " (" + offset + " - " + endOffset + ")";
            }

            public boolean intersects(int start, int end) {
                return (start <= offset && end >= offset) || (start <= endOffset && end >= endOffset)
                    || (start >= offset && end <= endOffset);
            }

            @Override
            public int hashCode() {
                final int prime = 31;
                int result = 1;
                result = prime * result + endOffset;
                result = prime * result + offset;
                result = prime * result + Objects.hashCode(value);
                return result;
            }

            @Override
            public boolean equals(Object obj) {
                if (this == obj)
                    return true;
                if (obj == null)
                    return false;
                if (getClass() != obj.getClass())
                    return false;
                AnnotationToken other = (AnnotationToken) obj;
                return Objects.equals(endOffset, other.endOffset) && Objects.equals(offset, other.offset)
                    && Objects.equals(value, other.value);
            }

        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            sb.append(textMinusMarkup);
            sb.append("\n");
            annotations.forEach(a -> {
                sb.append(a);
                sb.append("\n");
            });
            return sb.toString();
        }

        public int numAnnotations() {
            return annotations.size();
        }

        public AnnotationToken getAnnotation(int index) {
            return annotations.get(index);
        }
    }

    // A utility class for use with highlighters where the content being highlighted
    // needs plain text format for highlighting but marked-up format for token discovery.
    // The class takes markedup format field values and returns plain text versions.
    // When asked to tokenize plain-text versions by the highlighter it tokenizes the
    // original markup form in order to inject annotations.
    public static final class AnnotatedHighlighterAnalyzer extends AnalyzerWrapper {
        private Analyzer delegate;
        private AnnotatedText[] annotations;

        public AnnotatedHighlighterAnalyzer(Analyzer delegate) {
            super(delegate.getReuseStrategy());
            this.delegate = delegate;
        }

        public void init(String[] markedUpFieldValues) {
            this.annotations = new AnnotatedText[markedUpFieldValues.length];
            for (int i = 0; i < markedUpFieldValues.length; i++) {
                annotations[i] = AnnotatedText.parse(markedUpFieldValues[i]);
            }
        }

        public String[] getPlainTextValuesForHighlighter() {
            String[] result = new String[annotations.length];
            for (int i = 0; i < annotations.length; i++) {
                result[i] = annotations[i].textMinusMarkup;
            }
            return result;
        }

        public AnnotationToken[] getIntersectingAnnotations(int start, int end) {
            List<AnnotationToken> intersectingAnnotations = new ArrayList<>();
            int fieldValueOffset = 0;
            for (AnnotatedText fieldValueAnnotations : this.annotations) {
                // This is called from a highlighter where all of the field values are concatenated
                // so each annotation offset will need to be adjusted so that it takes into account
                // the previous values AND the MULTIVAL delimiter
                for (AnnotationToken token : fieldValueAnnotations.annotations) {
                    if (token.intersects(start - fieldValueOffset, end - fieldValueOffset)) {
                        intersectingAnnotations.add(new AnnotationToken(token.offset + fieldValueOffset,
                            token.endOffset + fieldValueOffset, token.value));
                    }
                }
                // add 1 for the fieldvalue separator character
                fieldValueOffset += fieldValueAnnotations.textMinusMarkup.length() + 1;
            }
            return intersectingAnnotations.toArray(new AnnotationToken[intersectingAnnotations.size()]);
        }

        @Override
        public Analyzer getWrappedAnalyzer(String fieldName) {
            return delegate;
        }

        @Override
        protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
            if (components instanceof AnnotatedHighlighterTokenStreamComponents) {
                // already wrapped.
                return components;
            }
            AnnotationsInjector injector = new AnnotationsInjector(components.getTokenStream());
            return new AnnotatedHighlighterTokenStreamComponents(components.getTokenizer(), injector, this.annotations);
        }
    }

    private static final class AnnotatedHighlighterTokenStreamComponents extends TokenStreamComponents {

        private AnnotationsInjector annotationsInjector;
        private AnnotatedText[] annotations;
        int readerNum = 0;

        AnnotatedHighlighterTokenStreamComponents(Tokenizer source, AnnotationsInjector annotationsFilter,
                                                  AnnotatedText[] annotations) {
            super(source, annotationsFilter);
            this.annotationsInjector = annotationsFilter;
            this.annotations = annotations;
        }

        @Override
        protected void setReader(Reader reader) {
            String plainText = readToString(reader);
            AnnotatedText at = this.annotations[readerNum++];
            assert at.textMinusMarkup.equals(plainText);
            // This code is reliant on the behaviour of highlighter logic - it
            // takes plain text multi-value fields and then calls the same analyzer
            // for each field value in turn. This class has cached the annotations
            // associated with each plain-text value and are arranged in the same order
            annotationsInjector.setAnnotations(at);
            super.setReader(new StringReader(at.textMinusMarkup));
        }

    }

    public static final class AnnotationAnalyzerWrapper extends AnalyzerWrapper {

        private final Analyzer delegate;

        public AnnotationAnalyzerWrapper(Analyzer delegate) {
            super(delegate.getReuseStrategy());
            this.delegate = delegate;
        }

        /**
         * Wraps {@link StandardAnalyzer}.
         */
        public AnnotationAnalyzerWrapper() {
            this(new StandardAnalyzer());
        }

        @Override
        public Analyzer getWrappedAnalyzer(String fieldName) {
            return delegate;
        }

        @Override
        protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
            if (components instanceof AnnotatedTokenStreamComponents) {
                // already wrapped.
                return components;
            }
            AnnotationsInjector injector = new AnnotationsInjector(components.getTokenStream());
            return new AnnotatedTokenStreamComponents(components.getTokenizer(), injector);
        }
    }

    // This Analyzer is not "wrappable" because of a limitation in Lucene https://issues.apache.org/jira/browse/LUCENE-8352
    private static final class AnnotatedTokenStreamComponents extends TokenStreamComponents {
        private AnnotationsInjector annotationsInjector;

        AnnotatedTokenStreamComponents(Tokenizer source, AnnotationsInjector annotationsInjector) {
            super(source, annotationsInjector);
            this.annotationsInjector = annotationsInjector;
        }

        @Override
        protected void setReader(Reader reader) {
            // Sneaky code to change the content downstream components will parse.
            // Replace the marked-up content Reader with a plain text Reader and prime the
            // annotations injector with the AnnotatedTokens that need to be injected
            // as plain-text parsing progresses.
            AnnotatedText annotations = AnnotatedText.parse(readToString(reader));
            annotationsInjector.setAnnotations(annotations);
            super.setReader(new StringReader(annotations.textMinusMarkup));
        }
    }

    static String readToString(Reader reader) {
        char[] arr = new char[8 * 1024];
        StringBuilder buffer = new StringBuilder();
        int numCharsRead;
        try {
            while ((numCharsRead = reader.read(arr, 0, arr.length)) != -1) {
                buffer.append(arr, 0, numCharsRead);
            }
            reader.close();
            return buffer.toString();
        } catch (IOException e) {
            throw new UncheckedIOException("IO Error reading field content", e);
        }
    }

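    // A sketch of the intended behaviour: for the plain text "Investors in Apple rejoiced."
    // with an annotation value "Apple Inc." spanning "Apple", this filter emits the token
    // "Apple Inc." (type "annotation") at the same position as the underlying text token
    // "apple", which is replayed with a zero position increment (see the _analyze example
    // in the plugin docs).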
    public static final class AnnotationsInjector extends TokenFilter {

        private AnnotatedText annotatedText;
        AnnotatedText.AnnotationToken nextAnnotationForInjection = null;
        private int currentAnnotationIndex = 0;
        List<State> pendingStates = new ArrayList<>();
        int pendingStatePos = 0;
        boolean inputExhausted = false;

        private final OffsetAttribute textOffsetAtt = addAttribute(OffsetAttribute.class);
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
        private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
        private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

        public AnnotationsInjector(TokenStream in) {
            super(in);
        }

        public void setAnnotations(AnnotatedText annotatedText) {
            this.annotatedText = annotatedText;
            currentAnnotationIndex = 0;
            if (annotatedText != null && annotatedText.numAnnotations() > 0) {
                nextAnnotationForInjection = annotatedText.getAnnotation(0);
            } else {
                nextAnnotationForInjection = null;
            }
        }

        @Override
        public void reset() throws IOException {
            pendingStates.clear();
            pendingStatePos = 0;
            inputExhausted = false;
            super.reset();
        }

        // Abstracts if we are pulling from some pre-cached buffer of
        // text tokens or directly from the wrapped TokenStream
        private boolean internalNextToken() throws IOException {
            if (pendingStatePos < pendingStates.size()) {
                restoreState(pendingStates.get(pendingStatePos));
                pendingStatePos++;
                if (pendingStatePos >= pendingStates.size()) {
                    pendingStatePos = 0;
                    pendingStates.clear();
                }
                return true;
            }
            if (inputExhausted) {
                return false;
            }
            return input.incrementToken();
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (internalNextToken()) {
                if (nextAnnotationForInjection != null) {
                    // If we are at the right point to inject an annotation....
                    if (textOffsetAtt.startOffset() >= nextAnnotationForInjection.offset) {
                        int firstSpannedTextPosInc = posAtt.getPositionIncrement();
                        int annotationPosLen = 1;

                        // Capture the text token's state for later replay - but
                        // with a zero pos increment so is same as annotation
                        // that is injected before it
                        posAtt.setPositionIncrement(0);
                        pendingStates.add(captureState());

                        while (textOffsetAtt.endOffset() <= nextAnnotationForInjection.endOffset) {
                            // Buffer up all the other tokens spanned by this annotation to determine length.
                            if (input.incrementToken()) {
                                if (textOffsetAtt.endOffset() <= nextAnnotationForInjection.endOffset
                                        && textOffsetAtt.startOffset() < nextAnnotationForInjection.endOffset) {
                                    annotationPosLen += posAtt.getPositionIncrement();
                                }
                                pendingStates.add(captureState());
                            } else {
                                inputExhausted = true;
                                break;
                            }
                        }
                        emitAnnotation(firstSpannedTextPosInc, annotationPosLen);
                        return true;
                    }
                }
                return true;
            } else {
                inputExhausted = true;
                return false;
            }
        }

        private void setType(AnnotationToken token) {
            // Default annotation type - in future AnnotationTokens may contain custom type info
            typeAtt.setType("annotation");
        }

        private void emitAnnotation(int firstSpannedTextPosInc, int annotationPosLen) throws IOException {
            // Set the annotation's attributes
            posLenAtt.setPositionLength(annotationPosLen);
            textOffsetAtt.setOffset(nextAnnotationForInjection.offset, nextAnnotationForInjection.endOffset);
            setType(nextAnnotationForInjection);

            // We may have multiple annotations at this location - stack them up
            final int annotationOffset = nextAnnotationForInjection.offset;
            final AnnotatedText.AnnotationToken firstAnnotationAtThisPos = nextAnnotationForInjection;
            while (nextAnnotationForInjection != null && nextAnnotationForInjection.offset == annotationOffset) {

                setType(nextAnnotationForInjection);
                termAtt.resizeBuffer(nextAnnotationForInjection.value.length());
                termAtt.copyBuffer(nextAnnotationForInjection.value.toCharArray(), 0, nextAnnotationForInjection.value.length());

                if (nextAnnotationForInjection == firstAnnotationAtThisPos) {
                    posAtt.setPositionIncrement(firstSpannedTextPosInc);
                    // Put at the head of the queue of tokens to be emitted
                    pendingStates.add(0, captureState());
                } else {
                    posAtt.setPositionIncrement(0);
                    // Put after the head of the queue of tokens to be emitted
                    pendingStates.add(1, captureState());
                }

                // Flag the injected annotation as null to prevent re-injection.
                currentAnnotationIndex++;
                if (currentAnnotationIndex < annotatedText.numAnnotations()) {
                    nextAnnotationForInjection = annotatedText.getAnnotation(currentAnnotationIndex);
                } else {
                    nextAnnotationForInjection = null;
                }
            }
            // Now pop the first of many potential buffered tokens:
            internalNextToken();
        }

    }

    public static final class AnnotatedTextFieldType extends StringFieldType {

        public AnnotatedTextFieldType() {
            setTokenized(true);
        }

        protected AnnotatedTextFieldType(AnnotatedTextFieldType ref) {
            super(ref);
        }

        @Override
        public void setIndexAnalyzer(NamedAnalyzer delegate) {
            if (delegate.analyzer() instanceof AnnotationAnalyzerWrapper) {
                // Already wrapped the Analyzer with an AnnotationAnalyzer
                super.setIndexAnalyzer(delegate);
            } else {
                // Wrap the analyzer with an AnnotationAnalyzer that will inject required annotations
                super.setIndexAnalyzer(new NamedAnalyzer(delegate.name(), AnalyzerScope.INDEX,
                    new AnnotationAnalyzerWrapper(delegate.analyzer())));
            }
        }

        public AnnotatedTextFieldType clone() {
            return new AnnotatedTextFieldType(this);
        }

        @Override
        public String typeName() {
            return CONTENT_TYPE;
        }

        @Override
        public Query existsQuery(QueryShardContext context) {
            if (omitNorms()) {
                return new TermQuery(new Term(FieldNamesFieldMapper.NAME, name()));
            } else {
                return new NormsFieldExistsQuery(name());
            }
        }

        @Override
        public Query phraseQuery(String field, TokenStream stream, int slop, boolean enablePosIncrements) throws IOException {
            PhraseQuery.Builder builder = new PhraseQuery.Builder();
            builder.setSlop(slop);

            TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
            PositionIncrementAttribute posIncrAtt = stream.getAttribute(PositionIncrementAttribute.class);
            int position = -1;

            stream.reset();
            while (stream.incrementToken()) {
                if (enablePosIncrements) {
                    position += posIncrAtt.getPositionIncrement();
                } else {
                    position += 1;
                }
                builder.add(new Term(field, termAtt.getBytesRef()), position);
            }

            return builder.build();
        }

        @Override
        public Query multiPhraseQuery(String field, TokenStream stream, int slop, boolean enablePositionIncrements) throws IOException {

            MultiPhraseQuery.Builder mpqb = new MultiPhraseQuery.Builder();
            mpqb.setSlop(slop);

            TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);

            PositionIncrementAttribute posIncrAtt = stream.getAttribute(PositionIncrementAttribute.class);
            int position = -1;

            List<Term> multiTerms = new ArrayList<>();
            stream.reset();
            while (stream.incrementToken()) {
                int positionIncrement = posIncrAtt.getPositionIncrement();

                if (positionIncrement > 0 && multiTerms.size() > 0) {
                    if (enablePositionIncrements) {
                        mpqb.add(multiTerms.toArray(new Term[0]), position);
                    } else {
                        mpqb.add(multiTerms.toArray(new Term[0]));
                    }
                    multiTerms.clear();
                }
                position += positionIncrement;
                multiTerms.add(new Term(field, termAtt.getBytesRef()));
            }

            if (enablePositionIncrements) {
                mpqb.add(multiTerms.toArray(new Term[0]), position);
            } else {
                mpqb.add(multiTerms.toArray(new Term[0]));
            }
            return mpqb.build();
        }
    }

    private int positionIncrementGap;

    protected AnnotatedTextFieldMapper(String simpleName, AnnotatedTextFieldType fieldType, MappedFieldType defaultFieldType,
                                       int positionIncrementGap,
                                       Settings indexSettings, MultiFields multiFields, CopyTo copyTo) {
        super(simpleName, fieldType, defaultFieldType, indexSettings, multiFields, copyTo);
        assert fieldType.tokenized();
        assert fieldType.hasDocValues() == false;
        this.positionIncrementGap = positionIncrementGap;
    }

    @Override
    protected AnnotatedTextFieldMapper clone() {
        return (AnnotatedTextFieldMapper) super.clone();
    }

    public int getPositionIncrementGap() {
        return this.positionIncrementGap;
    }

    @Override
    protected void parseCreateField(ParseContext context, List<IndexableField> fields) throws IOException {
        final String value;
        if (context.externalValueSet()) {
            value = context.externalValue().toString();
        } else {
            value = context.parser().textOrNull();
        }

        if (value == null) {
            return;
        }

        if (fieldType().indexOptions() != IndexOptions.NONE || fieldType().stored()) {
            Field field = new Field(fieldType().name(), value, fieldType());
            fields.add(field);
            if (fieldType().omitNorms()) {
                createFieldNamesField(context, fields);
            }
        }
    }

    @Override
    protected String contentType() {
        return CONTENT_TYPE;
    }

    @Override
    public AnnotatedTextFieldType fieldType() {
        return (AnnotatedTextFieldType) super.fieldType();
    }

    @Override
    protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, Params params) throws IOException {
        super.doXContentBody(builder, includeDefaults, params);
        doXContentAnalyzers(builder, includeDefaults);

        if (includeDefaults || positionIncrementGap != POSITION_INCREMENT_GAP_USE_ANALYZER) {
            builder.field("position_increment_gap", positionIncrementGap);
        }
    }
}

@ -0,0 +1,44 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.plugin.mapper;

import java.util.Collections;
import java.util.Map;

import org.elasticsearch.index.mapper.Mapper;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper;
import org.elasticsearch.plugins.MapperPlugin;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.plugins.SearchPlugin;
import org.elasticsearch.search.fetch.subphase.highlight.AnnotatedTextHighlighter;
import org.elasticsearch.search.fetch.subphase.highlight.Highlighter;

public class AnnotatedTextPlugin extends Plugin implements MapperPlugin, SearchPlugin {

    @Override
    public Map<String, Mapper.TypeParser> getMappers() {
        return Collections.singletonMap(AnnotatedTextFieldMapper.CONTENT_TYPE, new AnnotatedTextFieldMapper.TypeParser());
    }

    @Override
    public Map<String, Highlighter> getHighlighters() {
        return Collections.singletonMap(AnnotatedTextHighlighter.NAME, new AnnotatedTextHighlighter());
    }
}

@ -0,0 +1,201 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.search.fetch.subphase.highlight;

import org.apache.lucene.search.highlight.Encoder;
import org.apache.lucene.search.uhighlight.Passage;
import org.apache.lucene.search.uhighlight.PassageFormatter;
import org.apache.lucene.search.uhighlight.Snippet;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedHighlighterAnalyzer;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedText.AnnotationToken;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Custom passage formatter that:
 * 1) marks up search hits in markdown-like syntax for URLs ({@link Snippet})
 * 2) injects any annotations from the original text that don't conflict with search hit highlighting
 */
public class AnnotatedPassageFormatter extends PassageFormatter {

    public static final String SEARCH_HIT_TYPE = "_hit_term";
    private final Encoder encoder;
    private AnnotatedHighlighterAnalyzer annotatedHighlighterAnalyzer;

    public AnnotatedPassageFormatter(AnnotatedHighlighterAnalyzer annotatedHighlighterAnalyzer, Encoder encoder) {
        this.annotatedHighlighterAnalyzer = annotatedHighlighterAnalyzer;
        this.encoder = encoder;
    }

    static class MarkupPassage {
        List<Markup> markups = new ArrayList<>();
        int lastMarkupEnd = -1;

        public void addUnlessOverlapping(Markup newMarkup) {

            // Fast exit.
            if (newMarkup.start > lastMarkupEnd) {
                markups.add(newMarkup);
                lastMarkupEnd = newMarkup.end;
                return;
            }

            // Check to see if this new markup overlaps with any prior
            int index = 0;
            for (Markup existingMarkup : markups) {
                if (existingMarkup.samePosition(newMarkup)) {
                    existingMarkup.merge(newMarkup);
                    return;
                }
                if (existingMarkup.overlaps(newMarkup)) {
                    // existing markup wins - we throw away the new markup that would span this position
                    return;
                }
                // markup list is in start offset order so we can insert at this position then shift others right
                if (existingMarkup.isAfter(newMarkup)) {
                    markups.add(index, newMarkup);
                    return;
                }
                index++;
            }
            markups.add(newMarkup);
            lastMarkupEnd = newMarkup.end;
        }

    }

    static class Markup {
        int start;
        int end;
        String metadata;

        Markup(int start, int end, String metadata) {
            super();
            this.start = start;
            this.end = end;
            this.metadata = metadata;
        }

        boolean isAfter(Markup other) {
            return start > other.end;
        }

        void merge(Markup newMarkup) {
            // metadata is key1=value&key2=value&.... syntax used for urls
            assert samePosition(newMarkup);
            metadata += "&" + newMarkup.metadata;
        }

        boolean samePosition(Markup other) {
            return this.start == other.start && this.end == other.end;
        }

        boolean overlaps(Markup other) {
            return (start <= other.start && end >= other.start)
                || (start <= other.end && end >= other.end)
                || (start >= other.start && end <= other.end);
        }

        @Override
        public String toString() {
            return "Markup [start=" + start + ", end=" + end + ", metadata=" + metadata + "]";
        }

    }

// Merge original annotations and search hits into a single set of markups for each passage
|
||||||
|
static MarkupPassage mergeAnnotations(AnnotationToken [] annotations, Passage passage){
|
||||||
|
try {
|
||||||
|
MarkupPassage markupPassage = new MarkupPassage();
|
||||||
|
|
||||||
|
// Add search hits first - they take precedence over any other markup
|
||||||
|
for (int i = 0; i < passage.getNumMatches(); i++) {
|
||||||
|
int start = passage.getMatchStarts()[i];
|
||||||
|
int end = passage.getMatchEnds()[i];
|
||||||
|
String searchTerm = passage.getMatchTerms()[i].utf8ToString();
|
||||||
|
Markup markup = new Markup(start, end, SEARCH_HIT_TYPE+"="+URLEncoder.encode(searchTerm, StandardCharsets.UTF_8.name()));
|
||||||
|
markupPassage.addUnlessOverlapping(markup);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Now add original text's annotations - ignoring any that might conflict with the search hits markup.
|
||||||
|
for (AnnotationToken token: annotations) {
|
||||||
|
int start = token.offset;
|
||||||
|
int end = token.endOffset;
|
||||||
|
if(start >= passage.getStartOffset() && end<=passage.getEndOffset()) {
|
||||||
|
String escapedValue = URLEncoder.encode(token.value, StandardCharsets.UTF_8.name());
|
||||||
|
Markup markup = new Markup(start, end, escapedValue);
|
||||||
|
markupPassage.addUnlessOverlapping(markup);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return markupPassage;
|
||||||
|
|
||||||
|
} catch (UnsupportedEncodingException e) {
|
||||||
|
// We should always have UTF-8 support
|
||||||
|
throw new IllegalStateException(e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public Snippet[] format(Passage[] passages, String content) {
|
||||||
|
Snippet[] snippets = new Snippet[passages.length];
|
||||||
|
|
||||||
|
int pos;
|
||||||
|
int j = 0;
|
||||||
|
for (Passage passage : passages) {
|
||||||
|
AnnotationToken [] annotations = annotatedHighlighterAnalyzer.getIntersectingAnnotations(passage.getStartOffset(),
|
||||||
|
passage.getEndOffset());
|
||||||
|
MarkupPassage mergedMarkup = mergeAnnotations(annotations, passage);
|
||||||
|
|
||||||
|
StringBuilder sb = new StringBuilder();
|
||||||
|
pos = passage.getStartOffset();
|
||||||
|
for(Markup markup: mergedMarkup.markups) {
|
||||||
|
int start = markup.start;
|
||||||
|
int end = markup.end;
|
||||||
|
// its possible to have overlapping terms
|
||||||
|
if (start > pos) {
|
||||||
|
append(sb, content, pos, start);
|
||||||
|
}
|
||||||
|
if (end > pos) {
|
||||||
|
sb.append("[");
|
||||||
|
append(sb, content, Math.max(pos, start), end);
|
||||||
|
|
||||||
|
sb.append("](");
|
||||||
|
sb.append(markup.metadata);
|
||||||
|
sb.append(")");
|
||||||
|
pos = end;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
// its possible a "term" from the analyzer could span a sentence boundary.
|
||||||
|
append(sb, content, pos, Math.max(pos, passage.getEndOffset()));
|
||||||
|
//we remove the paragraph separator if present at the end of the snippet (we used it as separator between values)
|
||||||
|
if (sb.charAt(sb.length() - 1) == HighlightUtils.PARAGRAPH_SEPARATOR) {
|
||||||
|
sb.deleteCharAt(sb.length() - 1);
|
||||||
|
} else if (sb.charAt(sb.length() - 1) == HighlightUtils.NULL_SEPARATOR) {
|
||||||
|
sb.deleteCharAt(sb.length() - 1);
|
||||||
|
}
|
||||||
|
//and we trim the snippets too
|
||||||
|
snippets[j++] = new Snippet(sb.toString().trim(), passage.getScore(), passage.getNumMatches() > 0);
|
||||||
|
}
|
||||||
|
return snippets;
|
||||||
|
}
|
||||||
|
|
||||||
|
private void append(StringBuilder dest, String content, int start, int end) {
|
||||||
|
dest.append(encoder.encodeText(content.substring(start, end)));
|
||||||
|
}
|
||||||
|
}
|
|
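The precedence rules implemented by `addUnlessOverlapping` are easiest to see with concrete offsets. A minimal sketch (the nested `MarkupPassage`/`Markup` classes are package-private, so this would only compile inside the same package; the offsets are illustrative):

[source,java]
--------------------------
MarkupPassage passage = new MarkupPassage();

// Search hits are added first and take precedence.
passage.addUnlessOverlapping(new Markup(0, 6, "_hit_term=donald"));

// This original-text annotation ("Donald Trump", offsets 0-12) overlaps
// the hit, so it is silently dropped.
passage.addUnlessOverlapping(new Markup(0, 12, "Donald+Trump"));
assert passage.markups.size() == 1;

// A markup at exactly the same position would instead be merged, yielding
// metadata of the form "_hit_term=donald&Donald+Trump".
--------------------------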
org/elasticsearch/search/fetch/subphase/highlight/AnnotatedTextHighlighter.java
@@ -0,0 +1,64 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.search.fetch.subphase.highlight;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.highlight.Encoder;
import org.apache.lucene.search.uhighlight.PassageFormatter;
import org.elasticsearch.index.mapper.DocumentMapper;
import org.elasticsearch.index.mapper.MappedFieldType;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedHighlighterAnalyzer;
import org.elasticsearch.search.fetch.FetchSubPhase.HitContext;
import org.elasticsearch.search.fetch.subphase.highlight.SearchContextHighlight.Field;
import org.elasticsearch.search.internal.SearchContext;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class AnnotatedTextHighlighter extends UnifiedHighlighter {

    public static final String NAME = "annotated";

    AnnotatedHighlighterAnalyzer annotatedHighlighterAnalyzer = null;

    @Override
    protected Analyzer getAnalyzer(DocumentMapper docMapper, MappedFieldType type) {
        annotatedHighlighterAnalyzer = new AnnotatedHighlighterAnalyzer(super.getAnalyzer(docMapper, type));
        return annotatedHighlighterAnalyzer;
    }

    // Convert the marked-up values held on-disk to plain-text versions for highlighting
    @Override
    protected List<Object> loadFieldValues(MappedFieldType fieldType, Field field, SearchContext context, HitContext hitContext)
            throws IOException {
        List<Object> fieldValues = super.loadFieldValues(fieldType, field, context, hitContext);
        String[] fieldValuesAsString = fieldValues.toArray(new String[fieldValues.size()]);
        annotatedHighlighterAnalyzer.init(fieldValuesAsString);
        return Arrays.asList((Object[]) annotatedHighlighterAnalyzer.getPlainTextValuesForHighlighter());
    }

    @Override
    protected PassageFormatter getPassageFormatter(SearchContextHighlight.Field field, Encoder encoder) {
        return new AnnotatedPassageFormatter(annotatedHighlighterAnalyzer, encoder);
    }
}
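The `NAME` constant is what callers reference to opt in to this highlighter. A hedged sketch of a search request doing so via the transport client's `HighlightBuilder` (the `client` instance, index and field names are assumptions, not part of this commit):

[source,java]
--------------------------
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;

// Sketch only - assumes "my_field" is mapped as annotated_text.
SearchResponse response = client.prepareSearch("my_index")
    .setQuery(QueryBuilders.matchQuery("my_field", "donald"))
    .highlighter(new HighlightBuilder()
        // Selects AnnotatedTextHighlighter.NAME; hits come back as
        // [term](_hit_term=term) markup merged with the stored annotations.
        .field(new HighlightBuilder.Field("my_field").highlighterType("annotated")))
    .get();
--------------------------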
org/elasticsearch/index/mapper/annotatedtext/AnnotatedTextClientYamlTestSuiteIT.java
@@ -0,0 +1,39 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.index.mapper.annotatedtext;

import com.carrotsearch.randomizedtesting.annotations.Name;
import com.carrotsearch.randomizedtesting.annotations.ParametersFactory;

import org.elasticsearch.test.rest.yaml.ClientYamlTestCandidate;
import org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase;

public class AnnotatedTextClientYamlTestSuiteIT extends ESClientYamlSuiteTestCase {

    public AnnotatedTextClientYamlTestSuiteIT(@Name("yaml") ClientYamlTestCandidate testCandidate) {
        super(testCandidate);
    }

    @ParametersFactory
    public static Iterable<Object[]> parameters() throws Exception {
        return createParameters();
    }
}
org/elasticsearch/index/mapper/annotatedtext/AnnotatedTextFieldMapperTests.java
@@ -0,0 +1,681 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.index.mapper.annotatedtext;

import org.apache.lucene.index.DocValuesType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.IndexableFieldType;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.termvectors.TermVectorsRequest;
import org.elasticsearch.action.termvectors.TermVectorsResponse;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.compress.CompressedXContent;
import org.elasticsearch.common.lucene.uid.Versions;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.IndexService;
import org.elasticsearch.index.VersionType;
import org.elasticsearch.index.engine.Engine;
import org.elasticsearch.index.mapper.DocumentMapper;
import org.elasticsearch.index.mapper.DocumentMapperParser;
import org.elasticsearch.index.mapper.MapperParsingException;
import org.elasticsearch.index.mapper.MapperService.MergeReason;
import org.elasticsearch.index.mapper.ParsedDocument;
import org.elasticsearch.index.mapper.SourceToParse;
import org.elasticsearch.index.mapper.TextFieldMapper;
import org.elasticsearch.index.shard.IndexShard;
import org.elasticsearch.index.termvectors.TermVectorsService;
import org.elasticsearch.indices.IndicesService;
import org.elasticsearch.plugin.mapper.AnnotatedTextPlugin;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.test.ESSingleNodeTestCase;
import org.junit.Before;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import static org.hamcrest.Matchers.containsString;
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.notNullValue;

public class AnnotatedTextFieldMapperTests extends ESSingleNodeTestCase {

    IndexService indexService;
    DocumentMapperParser parser;

    @Before
    public void setup() {
        Settings settings = Settings.builder()
            .put("index.analysis.filter.mySynonyms.type", "synonym")
            .putList("index.analysis.filter.mySynonyms.synonyms", Collections.singletonList("car, auto"))
            .put("index.analysis.analyzer.synonym.tokenizer", "standard")
            .put("index.analysis.analyzer.synonym.filter", "mySynonyms")
            // Stop filter remains in server as it is part of lucene-core
            .put("index.analysis.analyzer.my_stop_analyzer.tokenizer", "standard")
            .put("index.analysis.analyzer.my_stop_analyzer.filter", "stop")
            .build();
        indexService = createIndex("test", settings);
        parser = indexService.mapperService().documentMapperParser();
    }

    @Override
    protected Collection<Class<? extends Plugin>> getPlugins() {
        List<Class<? extends Plugin>> classpathPlugins = new ArrayList<>();
        classpathPlugins.add(AnnotatedTextPlugin.class);
        return classpathPlugins;
    }

    protected String getFieldType() {
        return "annotated_text";
    }

    public void testAnnotationInjection() throws IOException {

        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties").startObject("field").field("type", getFieldType()).endObject().endObject()
            .endObject().endObject());

        DocumentMapper mapper = indexService.mapperService().merge("type",
            new CompressedXContent(mapping), MergeReason.MAPPING_UPDATE);

        // Use example of typed and untyped annotations
        String annotatedText = "He paid [Stormy Daniels](Stephanie+Clifford&Payee) hush money";
        SourceToParse sourceToParse = SourceToParse.source("test", "type", "1", BytesReference
            .bytes(XContentFactory.jsonBuilder()
                .startObject()
                .field("field", annotatedText)
                .endObject()),
            XContentType.JSON);
        ParsedDocument doc = mapper.parse(sourceToParse);

        IndexableField[] fields = doc.rootDoc().getFields("field");
        assertEquals(1, fields.length);

        assertEquals(annotatedText, fields[0].stringValue());

        IndexShard shard = indexService.getShard(0);
        shard.applyIndexOperationOnPrimary(Versions.MATCH_ANY, VersionType.INTERNAL,
            sourceToParse, IndexRequest.UNSET_AUTO_GENERATED_TIMESTAMP, false);
        shard.refresh("test");
        try (Engine.Searcher searcher = shard.acquireSearcher("test")) {
            LeafReader leaf = searcher.getDirectoryReader().leaves().get(0).reader();
            TermsEnum terms = leaf.terms("field").iterator();

            assertTrue(terms.seekExact(new BytesRef("stormy")));
            PostingsEnum postings = terms.postings(null, PostingsEnum.POSITIONS);
            assertEquals(0, postings.nextDoc());
            assertEquals(2, postings.nextPosition());

            assertTrue(terms.seekExact(new BytesRef("Stephanie Clifford")));
            postings = terms.postings(null, PostingsEnum.POSITIONS);
            assertEquals(0, postings.nextDoc());
            assertEquals(2, postings.nextPosition());

            assertTrue(terms.seekExact(new BytesRef("Payee")));
            postings = terms.postings(null, PostingsEnum.POSITIONS);
            assertEquals(0, postings.nextDoc());
            assertEquals(2, postings.nextPosition());

            assertTrue(terms.seekExact(new BytesRef("hush")));
            postings = terms.postings(null, PostingsEnum.POSITIONS);
            assertEquals(0, postings.nextDoc());
            assertEquals(4, postings.nextPosition());
        }
    }

    public void testToleranceForBadAnnotationMarkup() throws IOException {

        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties").startObject("field").field("type", getFieldType()).endObject().endObject()
            .endObject().endObject());

        DocumentMapper mapper = indexService.mapperService().merge("type",
            new CompressedXContent(mapping), MergeReason.MAPPING_UPDATE);

        String annotatedText = "foo [bar](MissingEndBracket baz";
        SourceToParse sourceToParse = SourceToParse.source("test", "type", "1", BytesReference
            .bytes(XContentFactory.jsonBuilder()
                .startObject()
                .field("field", annotatedText)
                .endObject()),
            XContentType.JSON);
        ParsedDocument doc = mapper.parse(sourceToParse);

        IndexableField[] fields = doc.rootDoc().getFields("field");
        assertEquals(1, fields.length);

        assertEquals(annotatedText, fields[0].stringValue());

        IndexShard shard = indexService.getShard(0);
        shard.applyIndexOperationOnPrimary(Versions.MATCH_ANY, VersionType.INTERNAL,
            sourceToParse, IndexRequest.UNSET_AUTO_GENERATED_TIMESTAMP, false);
        shard.refresh("test");
        try (Engine.Searcher searcher = shard.acquireSearcher("test")) {
            LeafReader leaf = searcher.getDirectoryReader().leaves().get(0).reader();
            TermsEnum terms = leaf.terms("field").iterator();

            assertTrue(terms.seekExact(new BytesRef("foo")));
            PostingsEnum postings = terms.postings(null, PostingsEnum.POSITIONS);
            assertEquals(0, postings.nextDoc());
            assertEquals(0, postings.nextPosition());

            assertTrue(terms.seekExact(new BytesRef("bar")));
            postings = terms.postings(null, PostingsEnum.POSITIONS);
            assertEquals(0, postings.nextDoc());
            assertEquals(1, postings.nextPosition());

            assertFalse(terms.seekExact(new BytesRef("MissingEndBracket")));
            // Bad markup means value is treated as plain text and fed through tokenisation
            assertTrue(terms.seekExact(new BytesRef("missingendbracket")));
        }
    }

    public void testAgainstTermVectorsAPI() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties").startObject("tvfield").field("type", getFieldType())
            .field("term_vector", "with_positions_offsets_payloads")
            .endObject().endObject()
            .endObject().endObject());
        indexService.mapperService().merge("type", new CompressedXContent(mapping), MergeReason.MAPPING_UPDATE);

        int max = between(3, 10);
        BulkRequestBuilder bulk = client().prepareBulk();
        for (int i = 0; i < max; i++) {
            bulk.add(client().prepareIndex("test", "type", Integer.toString(i))
                .setSource("tvfield", "the quick [brown](Color) fox jumped over the lazy dog"));
        }
        bulk.get();

        TermVectorsRequest request = new TermVectorsRequest("test", "type", "0").termStatistics(true);

        IndicesService indicesService = getInstanceFromNode(IndicesService.class);
        IndexService test = indicesService.indexService(resolveIndex("test"));
        IndexShard shard = test.getShardOrNull(0);
        assertThat(shard, notNullValue());
        TermVectorsResponse response = TermVectorsService.getTermVectors(shard, request);
        assertEquals(1, response.getFields().size());

        Terms terms = response.getFields().terms("tvfield");
        TermsEnum iterator = terms.iterator();
        BytesRef term;
        Set<String> foundTerms = new HashSet<>();
        while ((term = iterator.next()) != null) {
            foundTerms.add(term.utf8ToString());
        }
        // Check we have both text and annotation tokens
        assertTrue(foundTerms.contains("brown"));
        assertTrue(foundTerms.contains("Color"));
        assertTrue(foundTerms.contains("fox"));
    }

    // ===== Code below copied from TextFieldMapperTests ========

    public void testDefaults() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties").startObject("field").field("type", getFieldType()).endObject().endObject()
            .endObject().endObject());

        DocumentMapper mapper = parser.parse("type", new CompressedXContent(mapping));

        assertEquals(mapping, mapper.mappingSource().toString());

        ParsedDocument doc = mapper.parse(SourceToParse.source("test", "type", "1", BytesReference
            .bytes(XContentFactory.jsonBuilder()
                .startObject()
                .field("field", "1234")
                .endObject()),
            XContentType.JSON));

        IndexableField[] fields = doc.rootDoc().getFields("field");
        assertEquals(1, fields.length);

        assertEquals("1234", fields[0].stringValue());
        IndexableFieldType fieldType = fields[0].fieldType();
        assertThat(fieldType.omitNorms(), equalTo(false));
        assertTrue(fieldType.tokenized());
        assertFalse(fieldType.stored());
        assertThat(fieldType.indexOptions(), equalTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS));
        assertThat(fieldType.storeTermVectors(), equalTo(false));
        assertThat(fieldType.storeTermVectorOffsets(), equalTo(false));
        assertThat(fieldType.storeTermVectorPositions(), equalTo(false));
        assertThat(fieldType.storeTermVectorPayloads(), equalTo(false));
        assertEquals(DocValuesType.NONE, fieldType.docValuesType());
    }

    public void testEnableStore() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties").startObject("field").field("type", getFieldType()).field("store", true).endObject().endObject()
            .endObject().endObject());

        DocumentMapper mapper = parser.parse("type", new CompressedXContent(mapping));

        assertEquals(mapping, mapper.mappingSource().toString());

        ParsedDocument doc = mapper.parse(SourceToParse.source("test", "type", "1", BytesReference
            .bytes(XContentFactory.jsonBuilder()
                .startObject()
                .field("field", "1234")
                .endObject()),
            XContentType.JSON));

        IndexableField[] fields = doc.rootDoc().getFields("field");
        assertEquals(1, fields.length);
        assertTrue(fields[0].fieldType().stored());
    }

    public void testDisableNorms() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties").startObject("field")
            .field("type", getFieldType())
            .field("norms", false)
            .endObject().endObject()
            .endObject().endObject());

        DocumentMapper mapper = parser.parse("type", new CompressedXContent(mapping));

        assertEquals(mapping, mapper.mappingSource().toString());

        ParsedDocument doc = mapper.parse(SourceToParse.source("test", "type", "1", BytesReference
            .bytes(XContentFactory.jsonBuilder()
                .startObject()
                .field("field", "1234")
                .endObject()),
            XContentType.JSON));

        IndexableField[] fields = doc.rootDoc().getFields("field");
        assertEquals(1, fields.length);
        assertTrue(fields[0].fieldType().omitNorms());
    }

    public void testIndexOptions() throws IOException {
        Map<String, IndexOptions> supportedOptions = new HashMap<>();
        supportedOptions.put("docs", IndexOptions.DOCS);
        supportedOptions.put("freqs", IndexOptions.DOCS_AND_FREQS);
        supportedOptions.put("positions", IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        supportedOptions.put("offsets", IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

        XContentBuilder mappingBuilder = XContentFactory.jsonBuilder().startObject().startObject("type").startObject("properties");
        for (String option : supportedOptions.keySet()) {
            mappingBuilder.startObject(option).field("type", getFieldType()).field("index_options", option).endObject();
        }
        String mapping = Strings.toString(mappingBuilder.endObject().endObject().endObject());

        DocumentMapper mapper = parser.parse("type", new CompressedXContent(mapping));

        XContentBuilder jsonDoc = XContentFactory.jsonBuilder().startObject();
        for (String option : supportedOptions.keySet()) {
            jsonDoc.field(option, "1234");
        }
        ParsedDocument doc = mapper.parse(SourceToParse.source("test", "type", "1", BytesReference.bytes(jsonDoc.endObject()),
            XContentType.JSON));

        for (Map.Entry<String, IndexOptions> entry : supportedOptions.entrySet()) {
            String field = entry.getKey();
            IndexOptions options = entry.getValue();
            IndexableField[] fields = doc.rootDoc().getFields(field);
            assertEquals(1, fields.length);
            assertEquals(options, fields[0].fieldType().indexOptions());
        }
    }

    public void testDefaultPositionIncrementGap() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties").startObject("field").field("type", getFieldType()).endObject().endObject()
            .endObject().endObject());

        DocumentMapper mapper = indexService.mapperService().merge("type",
            new CompressedXContent(mapping), MergeReason.MAPPING_UPDATE);

        assertEquals(mapping, mapper.mappingSource().toString());

        SourceToParse sourceToParse = SourceToParse.source("test", "type", "1", BytesReference
            .bytes(XContentFactory.jsonBuilder()
                .startObject()
                .array("field", new String[] {"a", "b"})
                .endObject()),
            XContentType.JSON);
        ParsedDocument doc = mapper.parse(sourceToParse);

        IndexableField[] fields = doc.rootDoc().getFields("field");
        assertEquals(2, fields.length);

        assertEquals("a", fields[0].stringValue());
        assertEquals("b", fields[1].stringValue());

        IndexShard shard = indexService.getShard(0);
        shard.applyIndexOperationOnPrimary(Versions.MATCH_ANY, VersionType.INTERNAL,
            sourceToParse, IndexRequest.UNSET_AUTO_GENERATED_TIMESTAMP, false);
        shard.refresh("test");
        try (Engine.Searcher searcher = shard.acquireSearcher("test")) {
            LeafReader leaf = searcher.getDirectoryReader().leaves().get(0).reader();
            TermsEnum terms = leaf.terms("field").iterator();
            assertTrue(terms.seekExact(new BytesRef("b")));
            PostingsEnum postings = terms.postings(null, PostingsEnum.POSITIONS);
            assertEquals(0, postings.nextDoc());
            assertEquals(TextFieldMapper.Defaults.POSITION_INCREMENT_GAP + 1, postings.nextPosition());
        }
    }

    public void testPositionIncrementGap() throws IOException {
        final int positionIncrementGap = randomIntBetween(1, 1000);
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties").startObject("field")
            .field("type", getFieldType())
            .field("position_increment_gap", positionIncrementGap)
            .endObject().endObject()
            .endObject().endObject());

        DocumentMapper mapper = indexService.mapperService().merge("type",
            new CompressedXContent(mapping), MergeReason.MAPPING_UPDATE);

        assertEquals(mapping, mapper.mappingSource().toString());

        SourceToParse sourceToParse = SourceToParse.source("test", "type", "1", BytesReference
            .bytes(XContentFactory.jsonBuilder()
                .startObject()
                .array("field", new String[]{"a", "b"})
                .endObject()),
            XContentType.JSON);
        ParsedDocument doc = mapper.parse(sourceToParse);

        IndexableField[] fields = doc.rootDoc().getFields("field");
        assertEquals(2, fields.length);

        assertEquals("a", fields[0].stringValue());
        assertEquals("b", fields[1].stringValue());

        IndexShard shard = indexService.getShard(0);
        shard.applyIndexOperationOnPrimary(Versions.MATCH_ANY, VersionType.INTERNAL,
            sourceToParse, IndexRequest.UNSET_AUTO_GENERATED_TIMESTAMP, false);
        shard.refresh("test");
        try (Engine.Searcher searcher = shard.acquireSearcher("test")) {
            LeafReader leaf = searcher.getDirectoryReader().leaves().get(0).reader();
            TermsEnum terms = leaf.terms("field").iterator();
            assertTrue(terms.seekExact(new BytesRef("b")));
            PostingsEnum postings = terms.postings(null, PostingsEnum.POSITIONS);
            assertEquals(0, postings.nextDoc());
            assertEquals(positionIncrementGap + 1, postings.nextPosition());
        }
    }

    public void testSearchAnalyzerSerialization() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties")
            .startObject("field")
            .field("type", getFieldType())
            .field("analyzer", "standard")
            .field("search_analyzer", "keyword")
            .endObject()
            .endObject().endObject().endObject());

        DocumentMapper mapper = parser.parse("type", new CompressedXContent(mapping));
        assertEquals(mapping, mapper.mappingSource().toString());

        // special case: default index analyzer
        mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties")
            .startObject("field")
            .field("type", getFieldType())
            .field("analyzer", "default")
            .field("search_analyzer", "keyword")
            .endObject()
            .endObject().endObject().endObject());

        mapper = parser.parse("type", new CompressedXContent(mapping));
        assertEquals(mapping, mapper.mappingSource().toString());

        mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties")
            .startObject("field")
            .field("type", getFieldType())
            .field("analyzer", "keyword")
            .endObject()
            .endObject().endObject().endObject());

        mapper = parser.parse("type", new CompressedXContent(mapping));
        assertEquals(mapping, mapper.mappingSource().toString());

        // special case: default search analyzer
        mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties")
            .startObject("field")
            .field("type", getFieldType())
            .field("analyzer", "keyword")
            .field("search_analyzer", "default")
            .endObject()
            .endObject().endObject().endObject());

        mapper = parser.parse("type", new CompressedXContent(mapping));
        assertEquals(mapping, mapper.mappingSource().toString());

        mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties")
            .startObject("field")
            .field("type", getFieldType())
            .field("analyzer", "keyword")
            .endObject()
            .endObject().endObject().endObject());
        mapper = parser.parse("type", new CompressedXContent(mapping));

        XContentBuilder builder = XContentFactory.jsonBuilder();
        builder.startObject();
        mapper.toXContent(builder, new ToXContent.MapParams(Collections.singletonMap("include_defaults", "true")));
        builder.endObject();

        String mappingString = Strings.toString(builder);
        assertTrue(mappingString.contains("analyzer"));
        assertTrue(mappingString.contains("search_analyzer"));
        assertTrue(mappingString.contains("search_quote_analyzer"));
    }

    public void testSearchQuoteAnalyzerSerialization() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties")
            .startObject("field")
            .field("type", getFieldType())
            .field("analyzer", "standard")
            .field("search_analyzer", "standard")
            .field("search_quote_analyzer", "keyword")
            .endObject()
            .endObject().endObject().endObject());

        DocumentMapper mapper = parser.parse("type", new CompressedXContent(mapping));
        assertEquals(mapping, mapper.mappingSource().toString());

        // special case: default index/search analyzer
        mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties")
            .startObject("field")
            .field("type", getFieldType())
            .field("analyzer", "default")
            .field("search_analyzer", "default")
            .field("search_quote_analyzer", "keyword")
            .endObject()
            .endObject().endObject().endObject());

        mapper = parser.parse("type", new CompressedXContent(mapping));
        assertEquals(mapping, mapper.mappingSource().toString());
    }

    public void testTermVectors() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties")
            .startObject("field1")
            .field("type", getFieldType())
            .field("term_vector", "no")
            .endObject()
            .startObject("field2")
            .field("type", getFieldType())
            .field("term_vector", "yes")
            .endObject()
            .startObject("field3")
            .field("type", getFieldType())
            .field("term_vector", "with_offsets")
            .endObject()
            .startObject("field4")
            .field("type", getFieldType())
            .field("term_vector", "with_positions")
            .endObject()
            .startObject("field5")
            .field("type", getFieldType())
            .field("term_vector", "with_positions_offsets")
            .endObject()
            .startObject("field6")
            .field("type", getFieldType())
            .field("term_vector", "with_positions_offsets_payloads")
            .endObject()
            .endObject()
            .endObject().endObject());

        DocumentMapper defaultMapper = parser.parse("type", new CompressedXContent(mapping));

        ParsedDocument doc = defaultMapper.parse(SourceToParse.source("test", "type", "1", BytesReference
            .bytes(XContentFactory.jsonBuilder()
                .startObject()
                .field("field1", "1234")
                .field("field2", "1234")
                .field("field3", "1234")
                .field("field4", "1234")
                .field("field5", "1234")
                .field("field6", "1234")
                .endObject()),
            XContentType.JSON));

        assertThat(doc.rootDoc().getField("field1").fieldType().storeTermVectors(), equalTo(false));
        assertThat(doc.rootDoc().getField("field1").fieldType().storeTermVectorOffsets(), equalTo(false));
        assertThat(doc.rootDoc().getField("field1").fieldType().storeTermVectorPositions(), equalTo(false));
        assertThat(doc.rootDoc().getField("field1").fieldType().storeTermVectorPayloads(), equalTo(false));

        assertThat(doc.rootDoc().getField("field2").fieldType().storeTermVectors(), equalTo(true));
        assertThat(doc.rootDoc().getField("field2").fieldType().storeTermVectorOffsets(), equalTo(false));
        assertThat(doc.rootDoc().getField("field2").fieldType().storeTermVectorPositions(), equalTo(false));
        assertThat(doc.rootDoc().getField("field2").fieldType().storeTermVectorPayloads(), equalTo(false));

        assertThat(doc.rootDoc().getField("field3").fieldType().storeTermVectors(), equalTo(true));
        assertThat(doc.rootDoc().getField("field3").fieldType().storeTermVectorOffsets(), equalTo(true));
        assertThat(doc.rootDoc().getField("field3").fieldType().storeTermVectorPositions(), equalTo(false));
        assertThat(doc.rootDoc().getField("field3").fieldType().storeTermVectorPayloads(), equalTo(false));

        assertThat(doc.rootDoc().getField("field4").fieldType().storeTermVectors(), equalTo(true));
        assertThat(doc.rootDoc().getField("field4").fieldType().storeTermVectorOffsets(), equalTo(false));
        assertThat(doc.rootDoc().getField("field4").fieldType().storeTermVectorPositions(), equalTo(true));
        assertThat(doc.rootDoc().getField("field4").fieldType().storeTermVectorPayloads(), equalTo(false));

        assertThat(doc.rootDoc().getField("field5").fieldType().storeTermVectors(), equalTo(true));
        assertThat(doc.rootDoc().getField("field5").fieldType().storeTermVectorOffsets(), equalTo(true));
        assertThat(doc.rootDoc().getField("field5").fieldType().storeTermVectorPositions(), equalTo(true));
        assertThat(doc.rootDoc().getField("field5").fieldType().storeTermVectorPayloads(), equalTo(false));

        assertThat(doc.rootDoc().getField("field6").fieldType().storeTermVectors(), equalTo(true));
        assertThat(doc.rootDoc().getField("field6").fieldType().storeTermVectorOffsets(), equalTo(true));
        assertThat(doc.rootDoc().getField("field6").fieldType().storeTermVectorPositions(), equalTo(true));
        assertThat(doc.rootDoc().getField("field6").fieldType().storeTermVectorPayloads(), equalTo(true));
    }

    public void testNullConfigValuesFail() throws MapperParsingException, IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject()
            .startObject("type")
            .startObject("properties")
            .startObject("field")
            .field("type", getFieldType())
            .field("analyzer", (String) null)
            .endObject()
            .endObject()
            .endObject().endObject());

        Exception e = expectThrows(MapperParsingException.class, () -> parser.parse("type", new CompressedXContent(mapping)));
        assertEquals("[analyzer] must not have a [null] value", e.getMessage());
    }

    public void testNotIndexedField() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
            .startObject("properties").startObject("field")
            .field("type", getFieldType())
            .field("index", false)
            .endObject().endObject().endObject().endObject());

        IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
            () -> parser.parse("type", new CompressedXContent(mapping)));
        assertEquals("[annotated_text] fields must be indexed", e.getMessage());
    }

    public void testAnalyzedFieldPositionIncrementWithoutPositions() throws IOException {
        for (String indexOptions : Arrays.asList("docs", "freqs")) {
            String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject().startObject("type")
                .startObject("properties").startObject("field")
                .field("type", getFieldType())
                .field("index_options", indexOptions)
                .field("position_increment_gap", 10)
                .endObject().endObject().endObject().endObject());

            IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
                () -> parser.parse("type", new CompressedXContent(mapping)));
            assertEquals("Cannot set position_increment_gap on field [field] without positions enabled", e.getMessage());
        }
    }

    public void testEmptyName() throws IOException {
        String mapping = Strings.toString(XContentFactory.jsonBuilder().startObject()
            .startObject("type")
            .startObject("properties")
            .startObject("")
            .field("type", getFieldType())
            .endObject()
            .endObject()
            .endObject().endObject());

        // Empty name not allowed in index created after 5.0
        IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
            () -> parser.parse("type", new CompressedXContent(mapping))
        );
        assertThat(e.getMessage(), containsString("name cannot be empty string"));
    }

}
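For a quick sense of what testAnnotationInjection asserts: the markup parser separates plain text from annotation tokens, and both are indexed at shared positions. A small sketch using the `AnnotatedText.parse` API that the parsing tests below exercise (assertion values follow directly from the test expectations):

[source,java]
--------------------------
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedText;

// The markup is stripped from the text that gets analyzed...
AnnotatedText at = AnnotatedText.parse("He paid [Stormy Daniels](Stephanie+Clifford&Payee) hush money");
assert at.textMinusMarkup.equals("He paid Stormy Daniels hush money");
// ...while the two URL-encoded values in the bracket pair become tokens.
assert at.annotations.size() == 2;
// At index time both values are injected unanalyzed at the position of
// "stormy" (position 2), which is why the test finds "stormy",
// "Stephanie Clifford" and "Payee" all at the same position.
--------------------------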
org/elasticsearch/index/mapper/annotatedtext/AnnotatedTextParsingTests.java
@@ -0,0 +1,73 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.index.mapper.annotatedtext;

import org.elasticsearch.ElasticsearchParseException;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedText;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedText.AnnotationToken;
import org.elasticsearch.test.ESTestCase;

import java.util.List;

import static org.hamcrest.Matchers.equalTo;

public class AnnotatedTextParsingTests extends ESTestCase {

    private void checkParsing(String markup, String expectedPlainText, AnnotationToken... expectedTokens) {
        AnnotatedText at = AnnotatedText.parse(markup);
        assertEquals(expectedPlainText, at.textMinusMarkup);
        List<AnnotationToken> actualAnnotations = at.annotations;
        assertEquals(expectedTokens.length, actualAnnotations.size());
        for (int i = 0; i < expectedTokens.length; i++) {
            assertEquals(expectedTokens[i], actualAnnotations.get(i));
        }
    }

    public void testSingleValueMarkup() {
        checkParsing("foo [bar](Y)", "foo bar", new AnnotationToken(4, 7, "Y"));
    }

    public void testMultiValueMarkup() {
        checkParsing("foo [bar](Y&B)", "foo bar", new AnnotationToken(4, 7, "Y"),
            new AnnotationToken(4, 7, "B"));
    }

    public void testBlankTextAnnotation() {
        checkParsing("It sounded like this:[](theSoundOfOneHandClapping)", "It sounded like this:",
            new AnnotationToken(21, 21, "theSoundOfOneHandClapping"));
    }

    public void testMissingBracket() {
        checkParsing("[foo](MissingEndBracket bar",
            "[foo](MissingEndBracket bar", new AnnotationToken[0]);
    }

    public void testAnnotationWithType() {
        Exception expectedException = expectThrows(ElasticsearchParseException.class,
            () -> checkParsing("foo [bar](type=foo) baz", "foo bar baz", new AnnotationToken(4, 7, "noType")));
        assertThat(expectedException.getMessage(), equalTo("key=value pairs are not supported in annotations"));
    }

    public void testMissingValue() {
        checkParsing("[foo]() bar", "foo bar", new AnnotationToken[0]);
    }

}
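Tying the parser-level tolerance back to indexing: because `parse` returns the raw markup untouched when it cannot find a closing bracket, the mapper indexes that raw string as ordinary text, which is exactly what testToleranceForBadAnnotationMarkup observes. A small sketch of that contract:

[source,java]
--------------------------
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedText;

AnnotatedText at = AnnotatedText.parse("foo [bar](MissingEndBracket baz");
// No annotation tokens are extracted and the "plain text" is the original string...
assert at.annotations.isEmpty();
assert at.textMinusMarkup.equals("foo [bar](MissingEndBracket baz");
// ...so at index time the whole value is fed through normal tokenisation and
// "missingendbracket" becomes a searchable (lowercased) term.
--------------------------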
org/elasticsearch/search/highlight/AnnotatedTextHighlighterTests.java
@@ -0,0 +1,185 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.search.highlight;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.RandomIndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.DefaultEncoder;
import org.apache.lucene.search.uhighlight.CustomSeparatorBreakIterator;
import org.apache.lucene.search.uhighlight.CustomUnifiedHighlighter;
import org.apache.lucene.search.uhighlight.PassageFormatter;
import org.apache.lucene.search.uhighlight.Snippet;
import org.apache.lucene.search.uhighlight.SplittingBreakIterator;
import org.apache.lucene.store.Directory;
import org.elasticsearch.common.Strings;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedHighlighterAnalyzer;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotationAnalyzerWrapper;
import org.elasticsearch.search.fetch.subphase.highlight.AnnotatedPassageFormatter;
import org.elasticsearch.test.ESTestCase;

import java.net.URLEncoder;
import java.text.BreakIterator;
import java.util.Locale;

import static org.apache.lucene.search.uhighlight.CustomUnifiedHighlighter.MULTIVAL_SEP_CHAR;
import static org.hamcrest.CoreMatchers.equalTo;

public class AnnotatedTextHighlighterTests extends ESTestCase {

    private void assertHighlightOneDoc(String fieldName, String[] markedUpInputs,
            Query query, Locale locale, BreakIterator breakIterator,
            int noMatchSize, String[] expectedPassages) throws Exception {

        // Annotated fields wrap the usual analyzer with one that injects extra tokens
        Analyzer wrapperAnalyzer = new AnnotationAnalyzerWrapper(new StandardAnalyzer());
        AnnotatedHighlighterAnalyzer hiliteAnalyzer = new AnnotatedHighlighterAnalyzer(wrapperAnalyzer);
        hiliteAnalyzer.init(markedUpInputs);
        PassageFormatter passageFormatter = new AnnotatedPassageFormatter(hiliteAnalyzer, new DefaultEncoder());
        String[] plainTextForHighlighter = hiliteAnalyzer.getPlainTextValuesForHighlighter();

        Directory dir = newDirectory();
        IndexWriterConfig iwc = newIndexWriterConfig(wrapperAnalyzer);
        iwc.setMergePolicy(newTieredMergePolicy(random()));
        RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc);
        FieldType ft = new FieldType(TextField.TYPE_STORED);
        if (randomBoolean()) {
            ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        } else {
            ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        }
        ft.freeze();
        Document doc = new Document();
        for (String input : markedUpInputs) {
            Field field = new Field(fieldName, "", ft);
            field.setStringValue(input);
            doc.add(field);
        }
        iw.addDocument(doc);
        DirectoryReader reader = iw.getReader();
        IndexSearcher searcher = newSearcher(reader);
        iw.close();
        TopDocs topDocs = searcher.search(new MatchAllDocsQuery(), 1, Sort.INDEXORDER);
        assertThat(topDocs.totalHits.value, equalTo(1L));
        String rawValue = Strings.arrayToDelimitedString(plainTextForHighlighter, String.valueOf(MULTIVAL_SEP_CHAR));

        CustomUnifiedHighlighter highlighter = new CustomUnifiedHighlighter(searcher, hiliteAnalyzer, null,
            passageFormatter, locale,
            breakIterator, rawValue, noMatchSize);
        highlighter.setFieldMatcher((name) -> "text".equals(name));
        final Snippet[] snippets =
            highlighter.highlightField("text", query, topDocs.scoreDocs[0].doc, expectedPassages.length);
        assertEquals(expectedPassages.length, snippets.length);
        for (int i = 0; i < snippets.length; i++) {
            assertEquals(expectedPassages[i], snippets[i].getText());
        }
        reader.close();
        dir.close();
    }

    public void testAnnotatedTextStructuredMatch() throws Exception {
        // Check that a structured token eg a URL can be highlighted in a query
        // on marked-up content using an "annotated_text" type field.
        String url = "https://en.wikipedia.org/wiki/Key_Word_in_Context";
        String encodedUrl = URLEncoder.encode(url, "UTF-8");
        String annotatedWord = "[highlighting](" + encodedUrl + ")";
        String highlightedAnnotatedWord = "[highlighting](" + AnnotatedPassageFormatter.SEARCH_HIT_TYPE + "=" + encodedUrl + "&"
            + encodedUrl + ")";
        final String[] markedUpInputs = { "This is a test. Just a test1 " + annotatedWord + " from [annotated](bar) highlighter.",
            "This is the second " + annotatedWord + " value to perform highlighting on a longer text that gets scored lower." };

        String[] expectedPassages = {
            "This is a test. Just a test1 " + highlightedAnnotatedWord + " from [annotated](bar) highlighter.",
            "This is the second " + highlightedAnnotatedWord + " value to perform highlighting on a"
                + " longer text that gets scored lower." };
        Query query = new TermQuery(new Term("text", url));
        BreakIterator breakIterator = new CustomSeparatorBreakIterator(MULTIVAL_SEP_CHAR);
        assertHighlightOneDoc("text", markedUpInputs, query, Locale.ROOT, breakIterator, 0, expectedPassages);
    }

    public void testAnnotatedTextOverlapsWithUnstructuredSearchTerms() throws Exception {
        final String[] markedUpInputs = { "[Donald Trump](Donald+Trump) visited Singapore",
            "Donald duck is a [Disney](Disney+Inc) invention" };

        String[] expectedPassages = { "[Donald](_hit_term=donald) Trump visited Singapore",
            "[Donald](_hit_term=donald) duck is a [Disney](Disney+Inc) invention" };
        Query query = new TermQuery(new Term("text", "donald"));
        BreakIterator breakIterator = new CustomSeparatorBreakIterator(MULTIVAL_SEP_CHAR);
        assertHighlightOneDoc("text", markedUpInputs, query, Locale.ROOT, breakIterator, 0, expectedPassages);
    }

    public void testAnnotatedTextMultiFieldWithBreakIterator() throws Exception {
        final String[] markedUpInputs = { "[Donald Trump](Donald+Trump) visited Singapore. Kim shook hands with Donald",
            "Donald duck is a [Disney](Disney+Inc) invention" };
        String[] expectedPassages = { "[Donald](_hit_term=donald) Trump visited Singapore",
            "Kim shook hands with [Donald](_hit_term=donald)",
            "[Donald](_hit_term=donald) duck is a [Disney](Disney+Inc) invention" };
        Query query = new TermQuery(new Term("text", "donald"));
        BreakIterator breakIterator = new CustomSeparatorBreakIterator(MULTIVAL_SEP_CHAR);
        breakIterator = new SplittingBreakIterator(breakIterator, '.');
        assertHighlightOneDoc("text", markedUpInputs, query, Locale.ROOT, breakIterator, 0, expectedPassages);
    }

    public void testAnnotatedTextSingleFieldWithBreakIterator() throws Exception {
        final String[] markedUpInputs = { "[Donald Trump](Donald+Trump) visited Singapore. Kim shook hands with Donald" };
        String[] expectedPassages = { "[Donald](_hit_term=donald) Trump visited Singapore",
            "Kim shook hands with [Donald](_hit_term=donald)" };
        Query query = new TermQuery(new Term("text", "donald"));
        BreakIterator breakIterator = new CustomSeparatorBreakIterator(MULTIVAL_SEP_CHAR);
        breakIterator = new SplittingBreakIterator(breakIterator, '.');
        assertHighlightOneDoc("text", markedUpInputs, query, Locale.ROOT, breakIterator, 0, expectedPassages);
|
||||||
|
}
|
||||||
|
|
||||||
|
public void testAnnotatedTextSingleFieldWithPhraseQuery() throws Exception {
|
||||||
|
final String[] markedUpInputs = { "[Donald Trump](Donald+Trump) visited Singapore",
|
||||||
|
"Donald Jr was with Melania Trump"};
|
||||||
|
String[] expectedPassages = { "[Donald](_hit_term=donald) [Trump](_hit_term=trump) visited Singapore"};
|
||||||
|
Query query = new PhraseQuery("text", "donald", "trump");
|
||||||
|
BreakIterator breakIterator = new CustomSeparatorBreakIterator(MULTIVAL_SEP_CHAR);
|
||||||
|
assertHighlightOneDoc("text", markedUpInputs, query, Locale.ROOT, breakIterator, 0, expectedPassages);
|
||||||
|
}
|
||||||
|
|
||||||
|
public void testBadAnnotation() throws Exception {
|
||||||
|
final String[] markedUpInputs = { "Missing bracket for [Donald Trump](Donald+Trump visited Singapore"};
|
||||||
|
String[] expectedPassages = { "Missing bracket for [Donald Trump](Donald+Trump visited [Singapore](_hit_term=singapore)"};
|
||||||
|
Query query = new TermQuery(new Term("text", "singapore"));
|
||||||
|
BreakIterator breakIterator = new CustomSeparatorBreakIterator(MULTIVAL_SEP_CHAR);
|
||||||
|
assertHighlightOneDoc("text", markedUpInputs, query, Locale.ROOT, breakIterator, 0, expectedPassages);
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
|
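The tests above assemble the `[text](value1&value2)` markup by hand with `URLEncoder`. As a rough illustration of that convention, a hypothetical helper could centralize the encoding; the `AnnotationMarkup` class and its `annotate` method below are invented for illustration and are not part of this commit:

// Hypothetical helper, not part of this commit: builds the markdown-like
// annotation syntax used by annotated_text fields, URL-encoding each value
// and joining multiple values for one span with '&'.
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public final class AnnotationMarkup {

    private AnnotationMarkup() {}

    public static String annotate(String text, String... values) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder("[").append(text).append("](");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append('&'); // several annotation values may share one text span
            }
            sb.append(URLEncoder.encode(values[i], "UTF-8"));
        }
        return sb.append(')').toString();
    }
}

For example, `AnnotationMarkup.annotate("Donald Trump", "Donald Trump")` returns `[Donald Trump](Donald+Trump)`, the exact form the tests above index.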
@@ -0,0 +1,44 @@
# Integration tests for Mapper Annotated_text components
#

---
"annotated highlighter on annotated text":
  - skip:
      version: " - 6.99.99"
      reason: Annotated text type introduced in 7.0.0-alpha1

  - do:
      indices.create:
        index: annotated
        body:
          settings:
            number_of_shards: "1"
            number_of_replicas: "0"
          mappings:
            doc:
              properties:
                text:
                  type: annotated_text
                entityID:
                  type: keyword

  - do:
      index:
        index: annotated
        type: doc
        body:
          "text": "The [quick brown fox](entity_3789) is brown."
          "entityID": "entity_3789"
        refresh: true

  - do:
      search:
        body: { "query": { "term": { "entityID": "entity_3789" } }, "highlight": { "type": "annotated", "require_field_match": false, "fields": { "text": {} } } }

  - match: { hits.hits.0.highlight.text.0: "The [quick brown fox](_hit_term=entity_3789&entity_3789) is brown." }

  - do:
      search:
        body: { "query": { "term": { "text": "quick" } }, "highlight": { "type": "annotated", "require_field_match": false, "fields": { "text": {} } } }

  - match: { hits.hits.0.highlight.text.0: "The [quick](_hit_term=quick) brown fox is brown." }
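The two `match` assertions show the key behavior of the `annotated` highlighter type: rather than wrapping hits in HTML tags, it injects a `_hit_term=...` value into the annotation markup, joined to any pre-existing annotation values with `&`. As a client-side sketch of how the returned fragments can be unpacked (the `HighlightMarkupParser` class below is illustrative only, not part of this commit):

// Illustrative client-side sketch, not part of this commit: extracts
// annotation values (including injected _hit_term markers) from an
// "annotated" highlight fragment.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class HighlightMarkupParser {

    // Matches the markdown-like [text](value1&value2) annotation syntax.
    private static final Pattern ANNOTATION = Pattern.compile("\\[([^\\]]+)\\]\\(([^)]+)\\)");

    public static void main(String[] args) {
        String fragment = "The [quick brown fox](_hit_term=entity_3789&entity_3789) is brown.";
        Matcher m = ANNOTATION.matcher(fragment);
        while (m.find()) {
            System.out.println("annotated text: " + m.group(1));
            for (String value : m.group(2).split("&")) {
                System.out.println("  value: " + value); // e.g. _hit_term=entity_3789
            }
        }
    }
}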
@@ -266,6 +266,10 @@ fi
     install_and_check_plugin mapper murmur3
 }

+@test "[$GROUP] install annotated-text mapper plugin" {
+    install_and_check_plugin mapper annotated-text
+}
+
 @test "[$GROUP] check reindex module" {
     check_module reindex
 }

@@ -380,6 +384,10 @@ fi
     remove_plugin mapper-murmur3
 }

+@test "[$GROUP] remove annotated-text mapper plugin" {
+    remove_plugin mapper-annotated-text
+}
+
 @test "[$GROUP] remove size mapper plugin" {
     remove_plugin mapper-size
 }
@@ -18,10 +18,13 @@
  */
 package org.elasticsearch.search.fetch.subphase.highlight;

+import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.search.highlight.DefaultEncoder;
 import org.apache.lucene.search.highlight.Encoder;
 import org.apache.lucene.search.highlight.SimpleHTMLEncoder;
 import org.elasticsearch.index.fieldvisitor.CustomFieldsVisitor;
+import org.elasticsearch.index.mapper.DocumentMapper;
+import org.elasticsearch.index.mapper.KeywordFieldMapper;
 import org.elasticsearch.index.mapper.MappedFieldType;
 import org.elasticsearch.search.fetch.FetchSubPhase;
 import org.elasticsearch.search.internal.SearchContext;

@@ -70,8 +73,18 @@ public final class HighlightUtils {
         return textsToHighlight;
     }

-    static class Encoders {
-        static final Encoder DEFAULT = new DefaultEncoder();
-        static final Encoder HTML = new SimpleHTMLEncoder();
+    public static class Encoders {
+        public static final Encoder DEFAULT = new DefaultEncoder();
+        public static final Encoder HTML = new SimpleHTMLEncoder();
     }
+
+    static Analyzer getAnalyzer(DocumentMapper docMapper, MappedFieldType type) {
+        if (type instanceof KeywordFieldMapper.KeywordFieldType) {
+            KeywordFieldMapper.KeywordFieldType keywordFieldType = (KeywordFieldMapper.KeywordFieldType) type;
+            if (keywordFieldType.normalizer() != null) {
+                return keywordFieldType.normalizer();
+            }
+        }
+        return docMapper.mappers().indexAnalyzer();
+    }
 }
@@ -49,7 +49,6 @@ import java.util.List;
 import java.util.Map;

 import static org.elasticsearch.search.fetch.subphase.highlight.UnifiedHighlighter.convertFieldValue;
-import static org.elasticsearch.search.fetch.subphase.highlight.UnifiedHighlighter.getAnalyzer;

 public class PlainHighlighter implements Highlighter {
     private static final String CACHE_KEY = "highlight-plain";

@@ -102,7 +101,7 @@ public class PlainHighlighter implements Highlighter {
         int numberOfFragments = field.fieldOptions().numberOfFragments() == 0 ? 1 : field.fieldOptions().numberOfFragments();
         ArrayList<TextFragment> fragsList = new ArrayList<>();
         List<Object> textsToHighlight;
-        Analyzer analyzer = getAnalyzer(context.mapperService().documentMapper(hitContext.hit().getType()), fieldType);
+        Analyzer analyzer = HighlightUtils.getAnalyzer(context.mapperService().documentMapper(hitContext.hit().getType()), fieldType);
         final int maxAnalyzedOffset = context.indexShard().indexSettings().getHighlightMaxAnalyzedOffset();

         try {
@@ -26,6 +26,7 @@ import org.apache.lucene.search.uhighlight.BoundedBreakIteratorScanner;
 import org.apache.lucene.search.uhighlight.CustomPassageFormatter;
 import org.apache.lucene.search.uhighlight.CustomSeparatorBreakIterator;
 import org.apache.lucene.search.uhighlight.CustomUnifiedHighlighter;
+import org.apache.lucene.search.uhighlight.PassageFormatter;
 import org.apache.lucene.search.uhighlight.Snippet;
 import org.apache.lucene.search.uhighlight.UnifiedHighlighter.OffsetSource;
 import org.apache.lucene.util.BytesRef;

@@ -34,7 +35,6 @@ import org.elasticsearch.common.Strings;
 import org.elasticsearch.common.text.Text;
 import org.elasticsearch.index.IndexSettings;
 import org.elasticsearch.index.mapper.DocumentMapper;
-import org.elasticsearch.index.mapper.KeywordFieldMapper;
 import org.elasticsearch.index.mapper.MappedFieldType;
 import org.elasticsearch.search.fetch.FetchPhaseExecutionException;
 import org.elasticsearch.search.fetch.FetchSubPhase;

@@ -54,7 +54,7 @@ public class UnifiedHighlighter implements Highlighter {
     public boolean canHighlight(MappedFieldType fieldType) {
         return true;
     }

     @Override
     public HighlightField highlight(HighlighterContext highlighterContext) {
         MappedFieldType fieldType = highlighterContext.fieldType;

@@ -62,23 +62,18 @@ public class UnifiedHighlighter implements Highlighter {
         SearchContext context = highlighterContext.context;
         FetchSubPhase.HitContext hitContext = highlighterContext.hitContext;
         Encoder encoder = field.fieldOptions().encoder().equals("html") ? HighlightUtils.Encoders.HTML : HighlightUtils.Encoders.DEFAULT;
-        CustomPassageFormatter passageFormatter = new CustomPassageFormatter(field.fieldOptions().preTags()[0],
-            field.fieldOptions().postTags()[0], encoder);
         final int maxAnalyzedOffset = context.indexShard().indexSettings().getHighlightMaxAnalyzedOffset();

         List<Snippet> snippets = new ArrayList<>();
         int numberOfFragments;
         try {
-            final Analyzer analyzer =
-                getAnalyzer(context.mapperService().documentMapper(hitContext.hit().getType()), fieldType);
-            List<Object> fieldValues = HighlightUtils.loadFieldValues(field, fieldType, context, hitContext);
-            fieldValues = fieldValues.stream()
-                .map((s) -> convertFieldValue(fieldType, s))
-                .collect(Collectors.toList());
+            final Analyzer analyzer = getAnalyzer(context.mapperService().documentMapper(hitContext.hit().getType()), fieldType);
+            List<Object> fieldValues = loadFieldValues(fieldType, field, context, hitContext);
             if (fieldValues.size() == 0) {
                 return null;
             }
+            final PassageFormatter passageFormatter = getPassageFormatter(field, encoder);
             final IndexSearcher searcher = new IndexSearcher(hitContext.reader());
             final CustomUnifiedHighlighter highlighter;
             final String fieldValue = mergeFieldValues(fieldValues, MULTIVAL_SEP_CHAR);

@@ -145,7 +140,27 @@ public class UnifiedHighlighter implements Highlighter {
         return null;
     }

-    private BreakIterator getBreakIterator(SearchContextHighlight.Field field) {
+    protected PassageFormatter getPassageFormatter(SearchContextHighlight.Field field, Encoder encoder) {
+        CustomPassageFormatter passageFormatter = new CustomPassageFormatter(field.fieldOptions().preTags()[0],
+            field.fieldOptions().postTags()[0], encoder);
+        return passageFormatter;
+    }
+
+    protected Analyzer getAnalyzer(DocumentMapper docMapper, MappedFieldType type) {
+        return HighlightUtils.getAnalyzer(docMapper, type);
+    }
+
+    protected List<Object> loadFieldValues(MappedFieldType fieldType, SearchContextHighlight.Field field, SearchContext context,
+                                           FetchSubPhase.HitContext hitContext) throws IOException {
+        List<Object> fieldValues = HighlightUtils.loadFieldValues(field, fieldType, context, hitContext);
+        fieldValues = fieldValues.stream()
+            .map((s) -> convertFieldValue(fieldType, s))
+            .collect(Collectors.toList());
+        return fieldValues;
+    }
+
+    protected BreakIterator getBreakIterator(SearchContextHighlight.Field field) {
         final SearchContextHighlight.FieldOptions fieldOptions = field.fieldOptions();
         final Locale locale =
             fieldOptions.boundaryScannerLocale() != null ? fieldOptions.boundaryScannerLocale() :

@@ -168,7 +183,7 @@ public class UnifiedHighlighter implements Highlighter {
         }
     }

-    private static List<Snippet> filterSnippets(List<Snippet> snippets, int numberOfFragments) {
+    protected static List<Snippet> filterSnippets(List<Snippet> snippets, int numberOfFragments) {

         //We need to filter the snippets as due to no_match_size we could have
         //either highlighted snippets or non highlighted ones and we don't want to mix those up

@@ -203,17 +218,7 @@ public class UnifiedHighlighter implements Highlighter {
         return filteredSnippets;
     }

-    static Analyzer getAnalyzer(DocumentMapper docMapper, MappedFieldType type) {
-        if (type instanceof KeywordFieldMapper.KeywordFieldType) {
-            KeywordFieldMapper.KeywordFieldType keywordFieldType = (KeywordFieldMapper.KeywordFieldType) type;
-            if (keywordFieldType.normalizer() != null) {
-                return keywordFieldType.normalizer();
-            }
-        }
-        return docMapper.mappers().indexAnalyzer();
-    }
-
-    static String convertFieldValue(MappedFieldType type, Object value) {
+    protected static String convertFieldValue(MappedFieldType type, Object value) {
         if (value instanceof BytesRef) {
             return type.valueForDisplay(value).toString();
         } else {

@@ -221,14 +226,14 @@ public class UnifiedHighlighter implements Highlighter {
         }
     }

-    private static String mergeFieldValues(List<Object> fieldValues, char valuesSeparator) {
+    protected static String mergeFieldValues(List<Object> fieldValues, char valuesSeparator) {
         //postings highlighter accepts all values in a single string, as offsets etc. need to match with content
         //loaded from stored fields, we merge all values using a proper separator
         String rawValue = Strings.collectionToDelimitedString(fieldValues, String.valueOf(valuesSeparator));
         return rawValue.substring(0, Math.min(rawValue.length(), Integer.MAX_VALUE - 1));
     }

-    private OffsetSource getOffsetSource(MappedFieldType fieldType) {
+    protected OffsetSource getOffsetSource(MappedFieldType fieldType) {
         if (fieldType.indexOptions() == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) {
             return fieldType.storeTermVectors() ? OffsetSource.POSTINGS_WITH_TERM_VECTORS : OffsetSource.POSTINGS;
         }
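Taken together, the `UnifiedHighlighter` changes above convert its private helpers into protected extension points, so a subclass can reuse the highlighting pipeline while swapping individual pieces. A minimal sketch of that shape, reusing the annotated analyzer and formatter seen in the tests earlier in this commit (the `ExampleAnnotatedHighlighter` class itself is invented for illustration and is not the plugin's actual highlighter):

// Illustrative sketch, not part of this commit: a highlighter that reuses
// UnifiedHighlighter's machinery but swaps in annotation-aware components
// via the newly protected hooks.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.highlight.DefaultEncoder;
import org.apache.lucene.search.highlight.Encoder;
import org.apache.lucene.search.uhighlight.PassageFormatter;
import org.elasticsearch.index.mapper.DocumentMapper;
import org.elasticsearch.index.mapper.MappedFieldType;

public class ExampleAnnotatedHighlighter extends UnifiedHighlighter {

    private final AnnotatedHighlighterAnalyzer hiliteAnalyzer;

    ExampleAnnotatedHighlighter(AnnotatedHighlighterAnalyzer hiliteAnalyzer) {
        this.hiliteAnalyzer = hiliteAnalyzer;
    }

    @Override
    protected Analyzer getAnalyzer(DocumentMapper docMapper, MappedFieldType type) {
        // Analyze with the annotation-aware analyzer instead of the mapping's default.
        return hiliteAnalyzer;
    }

    @Override
    protected PassageFormatter getPassageFormatter(SearchContextHighlight.Field field, Encoder encoder) {
        // Emit annotation markup (merging in search hits) rather than HTML pre/post tags.
        return new AnnotatedPassageFormatter(hiliteAnalyzer, new DefaultEncoder());
    }
}

The plugin's real highlighter is not shown in this excerpt; the point of the sketch is only that `getAnalyzer`, `getPassageFormatter`, and `loadFieldValues` are now overridable seams.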