Site search developers FAQ
If you have a question about OI Site Search, please take a look at our frequently asked questions below. If you can’t find your question, please contact us!
Can you handle large websites?
Yes. We can handle websites with millions of URLs.
How do you minimize downtime?
We minimize downtime by operating two large clusters of machines that serve the search index. About half the servers can be shut down without disrupting the service. If there is trouble in one data center, we can shut it down without any downtime.
Can you provide search for intranets or webpages behind paywalls or passwords?
Yes, provided you allow our crawler access to your site. You can secure the search page by configuring leases on search pages, secured by an HMAC-SHA1 shared secret. If you provide an expiration time, the user can no longer access the search once that time has passed. See the lease and HMAC configuration options for more information.
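The lease idea can be sketched in a few lines of Python. This is only an illustration of HMAC-SHA1 signing with an expiration time; the message layout (the expiration timestamp as text) and the function names are assumptions, not the actual lease format of the service.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_lease(secret: bytes, expires_at: int) -> str:
    """Return a hex HMAC-SHA1 signature over the expiration timestamp.

    Assumption: the signed message is just the Unix timestamp as ASCII text.
    """
    message = str(expires_at).encode("ascii")
    return hmac.new(secret, message, hashlib.sha1).hexdigest()


def lease_is_valid(secret: bytes, expires_at: int, signature: str,
                   now: Optional[int] = None) -> bool:
    """Accept the lease only if the signature matches and it has not expired."""
    current = int(time.time()) if now is None else now
    expected = sign_lease(secret, expires_at)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature) and current < expires_at


if __name__ == "__main__":
    shared_secret = b"example-shared-secret"  # hypothetical secret
    expiry = int(time.time()) + 3600          # lease valid for one hour
    token = sign_lease(shared_secret, expiry)
    print(expiry, token, lease_is_valid(shared_secret, expiry, token))
```

Because the expiration time is part of the signed message, a user cannot extend a lease without knowing the shared secret.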
Which languages do you detect?
Our crawlers can detect content in over 50 different languages with high precision. We can detect content in the following languages: Afrikaans, العربية, Български, বাংলা, Česky, Dansk, Deutsch, Ελληνικά, English, Español, Eesti, فارسی, Suomi, Français, Frysk, ગુજરાતી, עברית, हिन्दी, Hrvatski, Magyar, Bahasa Indonesia, Íslenska, Italiano, 日本語, ಕನ್ನಡ, 한국어, Lietuvių, Latviešu, Македонски, മലയാളം, मराठी, नेपाली, Nederlands, Norsk (bokmål), ਪੰਜਾਬੀ, Polski, Português, Română, Русский, Slovenčina, Slovenščina, Soomaaliga, Shqip, Svenska, Kiswahili, தமிழ், తెలుగు, ไทย, Tagalog, Türkçe, Українська, اردو, Tiếng Việt, 中文 and 國語. If your language is not listed and you need to have pages detected for that language, please contact us.
Do you support Microdata?
Yes, we can now partially extract relevant microdata items such as an image, the breadcrumb, a product’s price, user downloads, comments and more. With microdata support we can show your users useful information in the search results, such as the price of a web shop product, a review rating or event information. We recommend that websites implement the Schema.org vocabulary.
How often do you crawl my website?
We crawl important pages of your website every 30 to 120 minutes. Please contact us if you want us to crawl a specific URL more frequently.
Do you support sitemap.xml?
Do you support canonical URLs?
Yes, we fully support canonical URLs. You can either use the link element in the HTML head:
<link rel="canonical" href="http://www.oi.com/"/>
or specify the canonical URL in the HTTP header in case of media files such as PDF files:
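The header example itself is not shown here. As a hedged sketch, the standard way to express a canonical URL in an HTTP response is a Link header with rel="canonical" (RFC 8288 Web Linking); that the crawler expects exactly this form is an assumption. The helper name and the example URL below are hypothetical.

```python
def canonical_link_header(canonical_url: str) -> tuple:
    """Return a (name, value) pair for an HTTP Link header with rel="canonical"."""
    if "<" in canonical_url or ">" in canonical_url:
        # Angle brackets delimit the target URL in a Link header.
        raise ValueError("URL must not contain angle brackets")
    return ("Link", '<%s>; rel="canonical"' % canonical_url)


if __name__ == "__main__":
    name, value = canonical_link_header(
        "http://www.example.org/downloads/white-paper.pdf"  # hypothetical URL
    )
    print("%s: %s" % (name, value))
```

The resulting pair can be added to the response headers of whatever server or framework serves the PDF.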
How often do you revisit pages?
We first revisit a page four to six days after we discover it. If a page changes between visits, we reduce the interval and crawl it more frequently; if it hasn’t changed, we crawl it less frequently.
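An adaptive revisit schedule of this kind could look roughly like the sketch below. The starting interval, the halving and doubling factors, and the lower and upper bounds are all assumptions chosen for illustration; the actual crawler's rules are not documented here.

```python
from datetime import timedelta

INITIAL_INTERVAL = timedelta(days=5)   # assumed: midpoint of "four to six days"
MIN_INTERVAL = timedelta(minutes=30)   # assumed lower bound
MAX_INTERVAL = timedelta(days=30)      # assumed upper bound


def next_interval(current: timedelta, page_changed: bool) -> timedelta:
    """Shorten the interval after a change, lengthen it otherwise."""
    factor = 0.5 if page_changed else 2.0  # assumed adjustment factors
    proposed = current * factor
    # Keep the result within the assumed bounds.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


if __name__ == "__main__":
    interval = INITIAL_INTERVAL
    for changed in (True, True, False):
        interval = next_interval(interval, changed)
        print(changed, interval)
```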
Can you extract publication dates from content pages?
Yes. We make an effort to extract dates in most common formats and languages from your pages. This date is used to calculate the age of a document relative to other documents, and it can be displayed in the result snippet. Some very exotic formats are not supported. We also read dates from one of the properties of the creative work microdata schema, and the OpenGraph article:published_time and article:modified_time properties are supported as well. Make sure you use the correct date format as described; including correct timezone information is encouraged.
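As an illustration, the Open Graph protocol expects ISO 8601 date-times, so a page generator might emit the article:published_time tag as sketched below. The helper name and the example date are hypothetical; a timezone-aware datetime is required so the offset ends up in the output.

```python
from datetime import datetime, timedelta, timezone
from html import escape


def published_time_meta(published: datetime) -> str:
    """Build an OpenGraph article:published_time meta tag in ISO 8601 form."""
    if published.tzinfo is None:
        raise ValueError("use a timezone-aware datetime so the offset is included")
    return '<meta property="article:published_time" content="%s"/>' % escape(
        published.isoformat()
    )


if __name__ == "__main__":
    # Hypothetical publication moment at UTC+02:00.
    moment = datetime(2024, 5, 17, 9, 30, tzinfo=timezone(timedelta(hours=2)))
    print(published_time_meta(moment))
```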
How can I prevent some pages from being indexed?
You can use robots.txt to deny our crawler access to some pages or directories. We also obey robots meta tags in the HTML, with which you can tell us not to index a specific page (NOINDEX) or not to follow its links (NOFOLLOW). The NOFOLLOW and NOINDEX values can be combined. We recommend putting a NOINDEX robots meta tag on overview pages, directory listings, news archives and similar types of pages.
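Before publishing a robots.txt change, you can check locally which URLs it blocks. The sketch below uses Python's standard urllib.robotparser; the user agent name is a hypothetical placeholder (the crawler's real user agent is not given here), and the rules and URLs are made-up examples.

```python
from urllib.robotparser import RobotFileParser

CRAWLER_AGENT = "ExampleSearchBot"  # hypothetical user agent name

ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Disallow: /private/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = CRAWLER_AGENT) -> bool:
    """Return True if the given robots.txt rules allow the agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://www.example.org/archive/2019.html",
                "http://www.example.org/products/1.html"):
        print(url, is_crawlable(ROBOTS_TXT, url))
```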
How can I remove a page from the index?
This works the same as preventing a page from being indexed. If you disallow the page or directory in your robots.txt or via the robots meta tag, we will remove the page from the index the next time we visit it. If you want a page to be removed as soon as possible, you can contact us so we can schedule removal of the page. We’re working on an API with which you can submit a list of URLs that are scheduled to be crawled as soon as possible.
Which file types are supported?
We support web pages (HTML and XHTML), PDF documents, all kinds of office documents and spreadsheets (MS Office, Open Office, Libre Office), and EPUB documents. Support for multimedia such as audio, video and images is not planned.
Can you extract relevant images from articles?
Yes, we have partial support for image extraction. We can extract a page’s image from the og:image tag or from a microdata image item. We’re also working on improving fully automatic extraction of a relevant image from the main content. This feature is enabled by default. See the showImage API call for more information.
How do you find a title for web pages without a title element?
If, for some reason, a title tag is missing, we try to use an h1 or h2 element instead. If that doesn’t work either, we can use the file name or even the first few words of the content.
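The fallback order described above can be sketched as follows. This is a simplified illustration built on Python's standard html.parser, not the crawler's actual implementation; the class and function names and the eight-word limit are assumptions.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urlparse


class _Collector(HTMLParser):
    """Collect the first title, h1 and h2 text plus the words of the content."""

    def __init__(self):
        super().__init__()
        self.found = {}       # first non-empty text per tag of interest
        self.words = []       # words seen outside the title element
        self._current = None  # tag of interest we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current and self._current not in self.found:
            self.found[self._current] = text
        if self._current != "title":
            self.words.extend(text.split())


def choose_title(html: str, url: str, max_words: int = 8) -> str:
    """Pick a title: title tag, then h1, then h2, then file name, then first words."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    for tag in ("title", "h1", "h2"):
        if tag in collector.found:
            return collector.found[tag]
    file_name = basename(urlparse(url).path)
    if file_name:
        return file_name
    return " ".join(collector.words[:max_words])
```

For example, a page without title, h1 or h2 at http://www.example.org/docs/report.html would get the title report.html under these assumptions.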
Can you crawl AJAX websites?
Search engine questions
Which languages do you support?
Our search index can properly analyze and deal with a variety of languages. The following languages are supported: العربية, Български, Česky, Dansk, Deutsch, Ελληνικά, English, Español, فارسی, Suomi, Français, עברית, Magyar, Bahasa Indonesia, Italiano, 日本語, 한국어, Latviešu, Nederlands, Norsk (bokmål), Polski, Português, Română, Русский, Slovenčina, Svenska, ไทย, Türkçe, Українська, Tiếng Việt, 中文, and 國語. If your language is not listed and you need to have pages detected for that language, please contact us.
How do you deal with compound words?
We support compounds for most Danish, Dutch, German, Norwegian and Swedish words.
Can we use boolean operators?
Yes. Users can use the + (plus) and - (minus) signs to indicate that the following term must be present in, or absent from, the documents. The query the +quick -brown fox will return documents with quick, without brown and maybe with fox.
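To make the semantics concrete, here is a minimal sketch of how such a query could be split into required, excluded and optional terms. It illustrates the operator rules only and is not the search engine's actual query parser; the function names are hypothetical.

```python
def parse_query(query: str):
    """Split a query into (required, excluded, optional) term lists."""
    required, excluded, optional = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional


def matches(document: str, query: str) -> bool:
    """A document qualifies if it has every required term and no excluded term."""
    required, excluded, _optional = parse_query(query)
    words = set(document.lower().split())
    has_required = all(term.lower() in words for term in required)
    has_excluded = any(term.lower() in words for term in excluded)
    return has_required and not has_excluded


if __name__ == "__main__":
    print(parse_query("the +quick -brown fox"))
    print(matches("a quick red fox", "the +quick -brown fox"))
    print(matches("a quick brown fox", "the +quick -brown fox"))
```

Optional terms such as the and fox do not decide whether a document qualifies; in a real engine they would typically only influence ranking.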
Can queries be location aware?
No, not at this time, but we are working on a proper location-aware system. For now, we can only accurately detect which country your user is currently in.
How do you handle American and British English?
We don’t treat the spelling differences in any special way. Differences between the languages can be corrected as spelling errors depending on the spelling used on your website and the search terms used by your users. We have not yet planned American/British spelling normalisation.
Do you provide auto suggest or search term auto completion?
Yes! We provide a search suggestion autocomplete widget. This widget uses previous search requests for suggestions so it takes some time for it to populate with enough useful suggestions.
Do you provide related keywords?
Yes! We provide a related query widget. This widget shows related queries for the current search query.
Can a specific language be preferred over another in the results?
Yes. You can set a list of languages to be preferred in the search results or let your user’s current geographical location or browser preferences decide. See the openindex.lang.pref configuration option.
Can the search results be sorted in any way?
No. The search results are sorted by relevance only.
Can we show a different icon for results from a specific host or language?
Yes. See the openindex.result.icons configuration option. It allows a flexible configuration. If you want results in some language or from some host to have a different icon you can override the file type directives by adding new directives for your field on top of them. The example below will display a shopping cart icon for results from shop.example.org:
<"host:shop.example.org" : "http://www.example.org/img/icons/cart.png">
Can a specific file type be preferred over another in the results?
Yes. You can instruct the search engine to prefer a given file type. See the openindex.type.pref configuration option.
Can a specific category be preferred over another in the results?
Yes. You can instruct the search engine to prefer a given category. See the openindex.cat.pref configuration option.
Can I add a field from the HTML metadata?
Yes. A metadata tag with the name cat or oi-cat is added to the index and can be used for faceting and filtering. You can, for example, use it to define specific sections or categories of your website.
<meta name="oi-cat" content="section of your website"/>
Are fielded queries possible?
No, not at this time. Please contact us if you want to provide your users with the ability to execute fielded queries and do not want to rely on faceting to do the job.
Can I restrict the search to one or more filters?
Yes. You can preset a filter for all available faceting widgets. See, for example, the openindex.type.restrict configuration option. Users will still be able to select another value for the restricted field, so it’s usually best to hide the restricted widget.
Can you deduplicate search results?
Search results are deduplicated by default.
Do you support TLS/SSL?
Why are only 12 result pages at most returned?
We will only ever return a fixed number of results for a query, even if there are many more results. Results further than 12 pages from the start are not very relevant.
Can I change the number of results per page?
Yes. See the openindex.pager.rows configuration option. We still return at most 12 pages, regardless of the number of results per page.
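The effect of the 12-page limit on the number of reachable results is simple arithmetic, sketched below. The function names are hypothetical; only the limit of 12 pages comes from this FAQ.

```python
MAX_PAGES = 12  # page limit stated in this FAQ


def max_reachable_results(rows_per_page: int) -> int:
    """Upper bound on the results a user can page through."""
    return rows_per_page * MAX_PAGES


def page_count(total_hits: int, rows_per_page: int) -> int:
    """Number of result pages shown, capped at the page limit."""
    pages = -(-total_hits // rows_per_page)  # ceiling division
    return min(pages, MAX_PAGES)


if __name__ == "__main__":
    print(max_reachable_results(25))  # rows per page set to 25
    print(page_count(1000, 10))
    print(page_count(35, 10))
```

Raising the number of rows per page is therefore the only way to make more results reachable for a single query.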
Which search engine result states are possible?
There are four distinct states the engine can be in, not counting the initial state before any request is executed. The following four states are possible:
– normal result set and no spell checker suggestion
– result set but with spell checker suggestion
– no results but an available spell checker suggestion is automatically executed
– no results and no spell checker suggestion
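If you handle these states in your own code, an enumeration keeps the four cases explicit. The sketch below is only an illustration; the names are hypothetical and do not correspond to identifiers in the service's API.

```python
from enum import Enum


class ResultState(Enum):
    RESULTS = "normal result set and no spell checker suggestion"
    RESULTS_WITH_SUGGESTION = "result set but with spell checker suggestion"
    SUGGESTION_EXECUTED = "no results, suggestion automatically executed"
    EMPTY = "no results and no spell checker suggestion"


def classify(has_results: bool, has_suggestion: bool) -> ResultState:
    """Map the two observable facts onto one of the four states."""
    if has_results:
        if has_suggestion:
            return ResultState.RESULTS_WITH_SUGGESTION
        return ResultState.RESULTS
    if has_suggestion:
        return ResultState.SUGGESTION_EXECUTED
    return ResultState.EMPTY


if __name__ == "__main__":
    for results in (True, False):
        for suggestion in (True, False):
            print(results, suggestion, classify(results, suggestion).name)
```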
Can the safe search policy be tuned?
Yes. You can instruct the search engine which safe search policy to use. See the openindex.safe configuration option.
Can I show ads within the search results page?
Can I have a custom function called before or after a search?
Can we set up hooks on the events on the main input field?
Yes, using ARIA roles and states we signal information to screen readers so they know what to deal with. If you want to be compliant too, it is important to wrap all our widgets within an element having the
Can I kick it?
Yes you can.