Hunspell token filter
Provides dictionary stemming based on a Hunspell dictionary that you supply. The hunspell
filter requires one or more language-specific Hunspell dictionaries to be configured.
This filter uses Lucene’s HunspellStemFilter.
If available, we recommend trying an algorithmic stemmer for your language before using the hunspell token filter. In practice, algorithmic stemmers typically outperform dictionary stemmers. See Dictionary stemmers.
Configure Hunspell dictionaries
Hunspell dictionaries are stored in, and detected from, a dedicated hunspell directory on the filesystem: <$ES_PATH_CONF>/hunspell. Each dictionary is expected to have its own directory, named after its associated language and locale (e.g., pt_BR, en_GB). This dictionary directory is expected to hold a single .aff file and one or more .dic files, all of which are picked up automatically. For example, the following directory layout defines the en_US dictionary:
- config
|-- hunspell
| |-- en_US
| | |-- en_US.dic
| | |-- en_US.aff
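The layout rule above (a single .aff file and one or more .dic files per locale directory) can be checked before a node starts. The following is a minimal sketch only: check_hunspell_dir is a hypothetical helper, not part of Elasticsearch or its clients, and the path passed to it is assumed to be one locale directory under <$ES_PATH_CONF>/hunspell.

```python
from pathlib import Path


def check_hunspell_dir(locale_dir):
    """Return a list of layout problems for one locale directory (empty if none).

    Hypothetical helper: it only mirrors the rule described in this section,
    namely a single .aff file and one or more .dic files.
    """
    path = Path(locale_dir)
    if not path.is_dir():
        return [f"{path} is not a directory"]

    # Collect affix and dictionary files by extension.
    aff_files = sorted(p.name for p in path.glob("*.aff"))
    dic_files = sorted(p.name for p in path.glob("*.dic"))

    problems = []
    if len(aff_files) != 1:
        problems.append(f"expected exactly one .aff file, found {len(aff_files)}")
    if not dic_files:
        problems.append("expected at least one .dic file, found none")
    return problems
```

An empty list means the directory matches the expected layout. The check does not validate the contents of the files.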
Each dictionary can be configured with one setting:
ignore_case
(Static, Boolean) If true, dictionary matching will be case insensitive. Defaults to false.
This setting can be configured globally in elasticsearch.yml using indices.analysis.hunspell.dictionary.ignore_case.
To configure the setting for a specific locale, use the indices.analysis.hunspell.dictionary.<locale>.ignore_case setting (e.g., for the en_US (American English) locale, the setting is indices.analysis.hunspell.dictionary.en_US.ignore_case).
You can also add a settings.yml file under the dictionary directory which holds these settings. This overrides any other ignore_case settings defined in elasticsearch.yml.
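The order in which these ignore_case sources apply can be sketched as a small lookup. This is an illustration, not Elasticsearch code: resolve_ignore_case is a hypothetical function, both arguments are assumed to be flat dictionaries of already-parsed values, and the rule that a locale-specific setting takes precedence over the global one is an assumption (the text above only states that settings.yml overrides elasticsearch.yml).

```python
PREFIX = "indices.analysis.hunspell.dictionary."


def resolve_ignore_case(locale, node_settings, dictionary_settings=None):
    """Return the effective ignore_case value for one locale.

    node_settings: flat dict standing in for elasticsearch.yml (assumption).
    dictionary_settings: dict standing in for the locale's settings.yml, if any.
    """
    # 1. settings.yml in the dictionary directory overrides elasticsearch.yml.
    if dictionary_settings and "ignore_case" in dictionary_settings:
        return bool(dictionary_settings["ignore_case"])

    # 2. Locale-specific setting (assumed to beat the global setting).
    locale_key = f"{PREFIX}{locale}.ignore_case"
    if locale_key in node_settings:
        return bool(node_settings[locale_key])

    # 3. Global setting, which defaults to false.
    return bool(node_settings.get(f"{PREFIX}ignore_case", False))
```

For example, with the global setting set to true and indices.analysis.hunspell.dictionary.en_US.ignore_case set to false, this sketch returns False for en_US and True for any other locale.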
Example
The following analyze API request uses the hunspell filter to stem the foxes jumping quickly to the fox jump quick.
The request specifies the en_US locale, meaning that the .aff and .dic files in the <$ES_PATH_CONF>/hunspell/en_US directory are used for the Hunspell dictionary.
resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        {
            "type": "hunspell",
            "locale": "en_US"
        }
    ],
    text="the foxes jumping quickly",
)
print(resp)
const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: [
    {
      type: "hunspell",
      locale: "en_US",
    },
  ],
  text: "the foxes jumping quickly",
});
console.log(response);
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hunspell",
      "locale": "en_US"
    }
  ],
  "text": "the foxes jumping quickly"
}
The filter produces the following tokens:
[ the, fox, jump, quick ]
Configurable parameters
dictionary
(Optional, string or array of strings) One or more .dic files (e.g., en_US.dic, my_custom.dic) to use for the Hunspell dictionary.
By default, the hunspell filter uses all .dic files in the <$ES_PATH_CONF>/hunspell/<locale> directory specified using the lang, language, or locale parameter.
dedup
(Optional, Boolean) If true, duplicate tokens are removed from the filter’s output. Defaults to true.
lang
(Required*, string) An alias for the locale parameter.
If this parameter is not specified, the language or locale parameter is required.
language
(Required*, string) An alias for the locale parameter.
If this parameter is not specified, the lang or locale parameter is required.
locale
(Required*, string) Locale directory used to specify the .aff and .dic files for a Hunspell dictionary. See Configure Hunspell dictionaries.
If this parameter is not specified, the lang or language parameter is required.
longest_only
(Optional, Boolean) If true, only the longest stemmed version of each token is included in the output. If false, all stemmed versions of the token are included. Defaults to false.
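As a rough illustration of how these parameters combine, the sketch below builds a filter definition as a plain Python dictionary. hunspell_filter is a hypothetical helper, not part of the Elasticsearch Python client. It always uses the locale key rather than the lang or language aliases, and it writes the documented defaults for dedup and longest_only explicitly.

```python
def hunspell_filter(locale, dictionary=None, dedup=True, longest_only=False):
    """Build a hunspell token filter definition (hypothetical helper).

    locale: locale directory name, such as "en_US" (required).
    dictionary: optional .dic file name or list of names; when omitted,
        the filter uses all .dic files in the locale directory.
    """
    if not locale:
        raise ValueError("a locale (or its lang/language alias) is required")

    definition = {
        "type": "hunspell",
        "locale": locale,
        "dedup": dedup,                # documented default: true
        "longest_only": longest_only,  # documented default: false
    }
    # Only restrict the dictionary files when the caller asks for it.
    if dictionary is not None:
        definition["dictionary"] = dictionary
    return definition
```

The returned dictionary has the same shape as the inline filter object in the analyze API example, so it could be passed in the filter list of that request or used as a custom filter definition in index settings.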
Customize and add to an analyzer
To customize the hunspell filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following create index API request uses a custom hunspell filter, my_en_US_dict_stemmer, to configure a new custom analyzer.
The my_en_US_dict_stemmer filter uses a locale of en_US, meaning that the .aff and .dic files in the <$ES_PATH_CONF>/hunspell/en_US directory are used. The filter also includes a dedup argument of false, meaning that duplicate tokens added from the dictionary are not removed from the filter’s output.
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "en": {
                    "tokenizer": "standard",
                    "filter": [
                        "my_en_US_dict_stemmer"
                    ]
                }
            },
            "filter": {
                "my_en_US_dict_stemmer": {
                    "type": "hunspell",
                    "locale": "en_US",
                    "dedup": False
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          en: {
            tokenizer: 'standard',
            filter: [
              'my_en_US_dict_stemmer'
            ]
          }
        },
        filter: {
          "my_en_US_dict_stemmer": {
            type: 'hunspell',
            locale: 'en_US',
            dedup: false
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        en: {
          tokenizer: "standard",
          filter: ["my_en_US_dict_stemmer"],
        },
      },
      filter: {
        my_en_US_dict_stemmer: {
          type: "hunspell",
          locale: "en_US",
          dedup: false,
        },
      },
    },
  },
});
console.log(response);
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
          "filter": [ "my_en_US_dict_stemmer" ]
        }
      },
      "filter": {
        "my_en_US_dict_stemmer": {
          "type": "hunspell",
          "locale": "en_US",
          "dedup": false
        }
      }
    }
  }
}
Settings
In addition to the ignore_case settings, you can configure the following global settings for the hunspell filter using elasticsearch.yml:
indices.analysis.hunspell.dictionary.lazy
(Static, Boolean) If true, the loading of Hunspell dictionaries is deferred until a dictionary is used. If false, the dictionary directory is checked for dictionaries when the node starts, and any dictionaries are automatically loaded. Defaults to false.