Uploaded image for project: 'Jira Data Center'
  1. Jira Data Center
  2. JRASERVER-17463

Better exact-text searching

    XMLWordPrintable

Details

    • 3
    • 1
    • We collect Jira feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

    Description

      NOTE: This suggestion is for JIRA Server. Using JIRA Cloud? See the corresponding suggestion.

      (See linked issues for failing use cases.)

      All of our indexing goes through the EnglishAnalyzer which passes the stream through several filters: lowercase, stop word removal, apostrophe removal, tokenizing on underscore and dash, and stemming. This works pretty well when you do a natural language kind of query but has numerous cases where it fails to return results that user experiences. What we need is some kind of secondary "exact text" index and use that to boost results.

      One way would be to create a new "exact query" field in the SummaryIndexer, EnvironmentIndexer, and DescriptionIndexer and a PerFieldAnalyzerWrapper to use a "dumber" Analyzer that uses the WhitespaceTokenizer instead of the StandardTokenizer. Then we'd need to fix querying to understand that this new field exists. We could either overload the tilda operator and make it smart enough to search across both fields and combine the results in a smart way. Or we could make the equals operator handle text (right now it throws an exception). We'd also want some kind of UI configuration to enable/disable this since it changes behaviour and not everyone is going to want the new behaviour. This would likely break the existing unit tests and func tests for indexing and JQL so we'd need to fix those. We'd probably want to writer new Analyzers for German, French, and Spanish as well instead of just limiting this new behaviour to our English users. Possibly Chinese/Japanese/Korean as well. We'd want better func test coverage on searching in non-English languages to try to make sure this change doesn't have adverse effects for non-English users.

      We'd also need to figure out how to handle text custom fields in all of this.

      Another approach would be something more along the lines of the SynonymAnalyzer described in "Lucene in Action": have a single Analyzer emit multiple tokens with the same virtual position. Under the covers the Analyzer would multiplex TokenStreams from a "normal" Analyzer and a "dumb" Analyzer. I'm not sure if sharing the Reader between Tokenizers like that works, if not we'd might have to make changes to Lucene core. Having both kinds of tokens in the same index field seems like it OUGHT to end up giving better results for these "exact searches". But it might have adverse effects on normal search ordering. This approach would probably have less impact on JQL but has a higher risk of us running into shortcomings in Lucene APIs.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              jpendleton Justus Pendleton (Inactive)
              Votes:
              50 Vote for this issue
              Watchers:
              27 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - 240h
                  240h
                  Remaining:
                  Remaining Estimate - 240h
                  240h
                  Logged:
                  Time Spent - Not Specified
                  Not Specified