Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

python - ElasticSearch: EdgeNgrams and Numbers

Any ideas on how EdgeNgram treats numbers?

I'm running haystack with an ElasticSearch backend. I created an indexed field of type EdgeNgram. This field will contain a string that may contain words as well as numbers.

When I run a search against this field using a partial word, it works how it's supposed to. But if I put in a partial number, I'm not getting the result that I want.

Example:

I search for the indexed field "EdgeNgram 12323" by typing "edgen" and I'll get the index returned to me. If I search for that same index by typing "123" I get nothing.

Thoughts?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I found my way here trying to solve this same problem in Haystack + Elasticsearch. Following the hints from uboness and ComoWhat, I wrote an alternate Haystack engine that (I believe) makes EdgeNGram fields treat numeric strings like words. Others may benefit, so I thought I'd share it.

from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine, ElasticsearchSearchBackend

class CustomElasticsearchBackend(ElasticsearchSearchBackend):
    """
    The default ElasticsearchSearchBackend settings don't tokenize strings of digits the same way as words, so emplids
    get lost: the lowercase tokenizer is the culprit. Switching to the standard tokenizer and doing the case-
    insensitivity in the filter seems to do the job.
    """
    def __init__(self, connection_alias, **connection_options):
        # see http://stackoverflow.com/questions/13636419/elasticsearch-edgengrams-and-numbers
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['tokenizer'] = 'standard'
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['filter'].append('lowercase')
        super(CustomElasticsearchBackend, self).__init__(connection_alias, **connection_options)

class CustomElasticsearchSearchEngine(ElasticsearchSearchEngine):
    backend = CustomElasticsearchBackend

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
...