Understanding Search
When a learner has their Percipio site language set to any language other than English (US) and searches in that language, the search results also include relevant English (US) content.
Search engines
Elasticsearch
Percipio uses Elasticsearch, an open-source search engine based on Apache Lucene. Elasticsearch is a widely used enterprise search engine and is used by sites such as Facebook, Netflix, GitHub, and Skillport 8i. It supports many features expected of modern search engines, such as type-ahead suggestions, word stemming/word forms, synonyms, and fuzzy matching (typo correction).
Google BERT
Percipio also uses Google BERT to find relevant content based on context or meaning as opposed to simple text matches. Google BERT uses natural language processing to generate numerical vectors, which encode the meaning of a section of text. We compute and store vectors for all content published in Percipio. These vectors can be compared to find the content that has similar meaning to the search query, even if the search query does not use the exact same words that are used to describe the content.
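The vector comparison described above can be sketched as follows. This is a minimal illustration, not Percipio's implementation: it assumes cosine similarity as the comparison measure (a common choice, but not confirmed here), and the titles and three-dimensional vectors are made up. Real BERT vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stored vectors, one per published asset.
content_vectors = {
    "Leading Remote Teams": [0.9, 0.1, 0.2],
    "Intro to Python": [0.1, 0.9, 0.3],
}

# Hypothetical vector for a query such as "managing distributed staff".
query_vector = [0.8, 0.2, 0.1]


def rank_by_meaning(query_vec, vectors):
    """Order asset titles from most to least similar to the query vector."""
    return sorted(
        vectors,
        key=lambda title: cosine_similarity(query_vec, vectors[title]),
        reverse=True,
    )


ranked = rank_by_meaning(query_vector, content_vectors)
print(ranked)
```

The query shares no words with either title, yet the first asset ranks higher because its vector points in nearly the same direction as the query vector.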
How search works
When you enter a search term, the search engine looks for matches in indexed asset metadata fields and content types. When a match is found, a relevance score is calculated for each asset. If your language is set to any language other than English (US), the results include English content alongside results in your selected language, and a Language filter option displays.
Keyword search
For a single-word search term, Percipio looks for simple word matches. For multiple words, it also searches for phrase matches and for assets that contain most of the terms. For example, if you search for the phrase project management, the search returns assets that match both words, not every asset that contains either project or management.
Quotes are not necessary for an “exact phrase match” since this is done automatically when a multiple word search term is entered.
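The multi-word behavior can be sketched as below. This is a simplified illustration: the function name and the `min_fraction` threshold are assumptions, since the actual cutoff for "most of the terms" is not stated.

```python
def match_kind(query, text, min_fraction=0.75):
    """Classify how a multi-word query matches a piece of text.

    min_fraction is a hypothetical threshold for "most of the terms".
    """
    q_terms = query.lower().split()
    t_terms = text.lower().split()
    if not q_terms:
        return "no match"

    # Phrase match: the query words appear next to each other, in order.
    n = len(q_terms)
    for i in range(len(t_terms) - n + 1):
        if t_terms[i:i + n] == q_terms:
            return "phrase"

    # Otherwise, require that most of the query words appear somewhere.
    present = sum(1 for term in q_terms if term in t_terms)
    if present / n >= min_fraction:
        return "most terms"
    return "no match"


print(match_kind("project management", "Project Management Basics"))
print(match_kind("project management", "Management of a Software Project"))
print(match_kind("project management", "Time Management"))
```

The third title contains only one of the two words, so it is not returned, which mirrors the project management example above.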
Content Type search
Search will recognize certain content types when they are included in the search query. For example, a search for leadership audiobooks will match on the term leadership and apply a boost to all matching audiobooks. Searching for just audiobooks will return a list of only audiobooks the learner is entitled to.
Recognized content types include:
- Audiobooks
- Aspire Journeys
- Live Events
- Testpreps
- Live Courses
- Practice lab
- Skill benchmark
- AI Simulator (CAISY™)
Searching on these terms returns all items of that content type.
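A minimal sketch of content type recognition follows. The function names and the returned dictionary shape are hypothetical. The sketch only shows the two cases described above: a content type alongside other terms becomes a boost, and a content type on its own becomes a filter.

```python
# Lowercased subset of the recognized content types listed above.
RECOGNIZED_TYPES = [
    "audiobooks",
    "aspire journeys",
    "live events",
    "testpreps",
    "live courses",
    "practice lab",
    "skill benchmark",
]


def split_query(query):
    """Separate a recognized content type from the rest of the query."""
    text = " ".join(query.lower().split())
    padded = f" {text} "
    for type_name in RECOGNIZED_TYPES:
        needle = f" {type_name} "
        if needle in padded:
            remaining = " ".join(padded.replace(needle, " ", 1).split())
            return remaining, type_name
    return text, None


def plan_search(query):
    """Decide whether a content type acts as a boost or as a filter."""
    terms, content_type = split_query(query)
    if content_type is None:
        return {"terms": terms, "boost_type": None, "only_type": None}
    if terms == "":
        # The query is only a content type: return just that type.
        return {"terms": "", "boost_type": None, "only_type": content_type}
    # Other terms are present: match on them and boost the content type.
    return {"terms": terms, "boost_type": content_type, "only_type": None}


print(plan_search("leadership audiobooks"))
print(plan_search("audiobooks"))
```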
Metadata fields used in search
The asset metadata fields used in a search include:
- Asset and channel titles
- Asset and channel descriptions (overview)
- Book author / Video speaker / Course instructor names
- Book ISBNs
- Publisher names
- Course, video and book IDs
- Certification exam names and numbers
- Technology and version
- Video transcripts and book full text
- Content source
- Job role family
- Skills
Type-ahead
Type-ahead displays suggested terms as you type in the search field. These suggestions are compiled from several sources:
- A curated list of common search terms
- Certification exam names and numbers
- Channel and asset titles
- Author and instructor names
Selecting a suggested term enters it into the search field and executes the search.
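As a rough sketch, suggestions can be thought of as a lookup against one combined list of terms. The example assumes simple prefix matching and uses made-up terms. How Percipio actually ranks or matches suggestions is not described here, and the fuzziness applied to type-ahead is not modeled.

```python
# Hypothetical terms gathered from the sources listed above.
SUGGESTION_TERMS = [
    "project management",                    # curated common search term
    "python",                                # curated common search term
    "Project Management Professional exam",  # certification exam name
    "Python Basics",                         # asset title
    "Peter Drucker",                         # author name
]


def suggest(prefix, terms, limit=5):
    """Return up to `limit` terms that start with what has been typed so far."""
    typed = prefix.lower()
    if not typed:
        return []
    return [term for term in terms if term.lower().startswith(typed)][:limit]


print(suggest("pyt", SUGGESTION_TERMS))
print(suggest("pro", SUGGESTION_TERMS))
```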
Word stemming / word forms
Word stemming looks for different forms of words, so that relevant results are not omitted. In addition to searching for exact word/phrase matches, the search engine reduces words to their common form. For instance, the words "programming," "programmer," "programmed," "programs," and "program" will all count as matches.
Exact matches are scored higher than stemmed matches.
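The idea can be illustrated with a toy stemmer. This is deliberately simplistic and is not the stemming algorithm Elasticsearch uses. The suffix list and the score values are assumptions chosen only to reproduce the "program" example and the rule that exact matches outrank stemmed matches.

```python
SUFFIXES = ("ing", "er", "ed", "s")  # toy list, checked in this order


def stem(word):
    """Reduce a word to a rough common form by stripping one suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "programm" -> "program".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def match_score(query_word, field_word):
    """Score an exact match above a stemmed match (values are hypothetical)."""
    if query_word.lower() == field_word.lower():
        return 2.0
    if stem(query_word) == stem(field_word):
        return 1.0
    return 0.0


for form in ["programming", "programmer", "programmed", "programs", "program"]:
    print(form, "->", stem(form))
```

A toy stemmer like this mishandles many ordinary words. Production stemmers use far more careful, language-specific rules.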
Synonyms
If the search term is not used in any asset metadata, a match may not be found. Therefore, the search engine uses synonyms to define equivalent or related terms.
For example, suppose you want to find content on the Internet of Things. You enter the search term iot, but some content may not use this acronym in its descriptive metadata. The search engine has a defined synonym that makes iot equivalent to the text “internet of things”, so a search for iot matches content that contains “internet of things”.
Additionally, terms such as "accessibility," "wcag," and "section 508" are considered related. A course on accessibility may not use the term "section 508" in the asset metadata, but a search for that term will return accessibility-related assets.
"Coaching" and "mentoring" are very closely related, but an asset might use only one of these terms in its metadata. Associating these related terms as synonyms helps the search find both.
Skillsoft periodically reviews common search term history and works with the curators to keep the list of synonyms updated.
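Synonym handling can be sketched as a lookup that expands the search term before matching. The groups below come from the examples in this section. The data structure and function names are hypothetical, and the sketch treats equivalent and related terms the same way, which may differ from how the real synonym list is weighted.

```python
# Synonym groups taken from the examples above.
SYNONYM_GROUPS = [
    {"iot", "internet of things"},
    {"accessibility", "wcag", "section 508"},
    {"coaching", "mentoring"},
]


def expand(term):
    """Return the term together with any synonyms defined for it."""
    normalized = term.lower().strip()
    for group in SYNONYM_GROUPS:
        if normalized in group:
            return set(group)
    return {normalized}


def matches(term, metadata_text):
    """True if the term or one of its synonyms appears in the metadata.

    Uses plain substring checks, which is enough for an illustration.
    """
    text = metadata_text.lower()
    return any(variant in text for variant in expand(term))


print(matches("iot", "An Introduction to the Internet of Things"))
print(matches("section 508", "Designing for Accessibility"))
```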
Fuzzy match and Slop
Both search terms and type-ahead suggestions use fuzziness to catch misspellings and typos. If the original search query returns no exact matches, the search engine applies fuzziness to the term and looks for close matches.
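The fallback from exact to fuzzy matching can be sketched with edit distance, the number of single-character insertions, deletions, or substitutions needed to turn one word into another. The vocabulary and the `max_edits` limit are assumptions for illustration. The actual fuzziness settings used by Percipio are not stated here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_terms(term, vocabulary, max_edits=1):
    """Prefer an exact match; fall back to close matches only if none exists."""
    term = term.lower()
    if term in vocabulary:
        return [term]
    return [word for word in vocabulary if edit_distance(term, word) <= max_edits]


vocabulary = ["management", "leadership", "python"]  # hypothetical indexed terms
print(find_terms("managment", vocabulary))
print(find_terms("python", vocabulary))
```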
Slop, an Elasticsearch phrase-matching setting, is used to find author and instructor names that are not an exact match. For example, with Slop, 'Peter Drucker' and 'Peter F. Drucker' are considered a match.
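The effect of slop on name matching can be approximated as below. This sketch only counts extra words sitting between the query words when they appear in order. Elasticsearch's own slop calculation also covers reordered words, so treat this as a simplified model with hypothetical function names.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match_with_slop(query, text, slop=1):
    """True if the query words occur in order with at most `slop` extra words between them."""
    q_tokens = tokenize(query)
    t_tokens = tokenize(text)
    if not q_tokens:
        return False
    for start, token in enumerate(t_tokens):
        if token != q_tokens[0]:
            continue
        position, extra = start, 0
        for term in q_tokens[1:]:
            try:
                found = t_tokens.index(term, position + 1)
            except ValueError:
                break  # a query word is missing after this starting point
            extra += found - position - 1
            position = found
        else:
            if extra <= slop:
                return True
    return False


print(phrase_match_with_slop("Peter Drucker", "Peter F. Drucker", slop=1))
print(phrase_match_with_slop("Peter Drucker", "Peter F. Drucker", slop=0))
```

With a slop of 0 the middle initial prevents a match. Allowing one extra word lets the two forms of the name match.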
Relevance score
The search engine calculates a relevance score for all matching assets. This relevance score is calculated based on a number of factors:
- The more matches, the higher the score
- The more words matched within a given field (phrase match), the higher the score
- Matches in a shorter field are weighted higher than matches in a longer field (for example, matches in titles are weighted higher than matches in a description, which are weighted higher than matches in full text)
Each match is scored, then adjusted, to fine-tune the relevance. Percipio boosts the asset relevance score based on the following factors, in descending order (highest boost to lowest):
- Phrase or word match in channel title
- Phrase or word match in custom content title
- Phrase or word match in video title
- Phrase or word match in course title
- Phrase or word match in certification exam
- Phrase or word match in custom content source
- Phrase match in book text
- Phrase or word match in book title
- Phrase or word match in instructor/author/presenter name
- Phrase or word match in content ID
- Phrase or word match in parent channel title
- Phrase or word match in publisher name
- Phrase or word match in child titles (video titles in a course)
- Word match in custom content description
- Word match in channel description (for channel relevance)
- Word match to book ISBN
- Word match in technology title or version
- Word match in book text
- Word match in video transcripts
The match scores are combined to give a final relevance score for each matched asset. The results display in descending relevance order.
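The scoring steps above can be sketched as a weighted sum over fields. The boost values are hypothetical: only the ordering of the factors is documented, not the numbers. The field names are also made up, and only a few of the listed fields are included. Dividing by the number of words in a field is one simple way to make matches in shorter fields count for more.

```python
# Hypothetical boosts that follow the documented order (highest first).
FIELD_BOOSTS = {
    "channel_title": 5.0,
    "course_title": 4.0,
    "book_title": 3.0,
    "description": 2.0,
    "video_transcript": 1.0,
}


def relevance(asset, term):
    """Combine per-field match scores into one relevance score."""
    term = term.lower()
    score = 0.0
    for field, boost in FIELD_BOOSTS.items():
        words = asset.get(field, "").lower().split()
        hits = words.count(term)
        if hits:
            # More hits raise the score; longer fields dilute each hit.
            score += boost * hits / len(words)
    return score


assets = [
    {"id": "b", "description": "A course that mentions python once among many other words"},
    {"id": "a", "course_title": "Python Basics", "description": "Learn Python from scratch"},
]

# Results display in descending relevance order.
results = sorted(assets, key=lambda asset: relevance(asset, "python"), reverse=True)
print([asset["id"] for asset in results])
```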
Age decay
A matching asset’s relevance score is reduced based on its age and on whether it is archived or has a scheduled retirement date.
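One way to picture this reduction is an exponential decay combined with fixed penalties. Every number and parameter name below is an assumption made for illustration. The actual decay formula and penalty sizes are not documented here.

```python
def apply_age_decay(score, age_years, archived=False, retiring=False,
                    half_life_years=3.0, archived_factor=0.5, retiring_factor=0.8):
    """Reduce a relevance score for older, archived, or soon-to-retire assets.

    half_life_years, archived_factor, and retiring_factor are hypothetical.
    """
    # Halve the score for every `half_life_years` of age.
    decayed = score * 0.5 ** (age_years / half_life_years)
    if archived:
        decayed *= archived_factor
    if retiring:
        decayed *= retiring_factor
    return decayed


print(apply_age_decay(10.0, age_years=0))
print(apply_age_decay(10.0, age_years=3))
print(apply_age_decay(10.0, age_years=3, archived=True))
```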
Advanced features
To keep search as simple to use as possible, advanced features such as “quoted strings” for exact phrase matches, Boolean operators, wildcards, and proximity indicators are not currently supported. However, Skillsoft monitors how learners use search and may add support for these features in the future.