Technical Information
Browser Recommendations
The main work of the search engine is done by splitting the corpus into batches based on the number of threads available to the client machine, then requesting groups of 100 files per thread as large JSON aggregates. Since many modern processors can handle this workload quickly, this often results in many requests being sent at the same time (~22,000 files ÷ 100 files per request ≈ 220 requests total, some simultaneous).
This kind of simultaneous work allows for extremely fast processing relative to the overall size of the job, but it also means that some browsers handle it better than others. In particular, Chromium-based browsers have a low internal limit on simultaneous requests and are QUITE BAD for this task. While Chromium-based browsers can be used to search the corpus, doing so is strongly discouraged because searches will be significantly slower. As such, I provide the following lists of browser recommendations and dispreferences.
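The batching strategy described above can be sketched roughly as follows. This is an illustrative sketch only: the endpoint path and the concurrency cap are assumptions, not the engine's actual values.

```javascript
// Split a list of corpus file IDs into batches of 100 and fetch each
// batch as one JSON aggregate, processing batches in waves to cap the
// number of simultaneous requests.
function makeBatches(items, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

async function fetchAllBatches(fileIds, concurrency = 8) {
  const batches = makeBatches(fileIds);
  const results = [];
  // Launch `concurrency` requests at a time, waiting for each wave.
  for (let i = 0; i < batches.length; i += concurrency) {
    const wave = batches.slice(i, i + concurrency).map((batch) =>
      fetch('/api/aggregate', { // hypothetical endpoint
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ files: batch }),
      }).then((res) => res.json())
    );
    results.push(...(await Promise.all(wave)));
  }
  return results;
}
```

With ~22,000 files and 100 files per request, this yields the ~220 total requests mentioned above.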
Recommended Browsers
- Firefox (validated)
- Safari (validated)
- Gnome Web (validated)
Dispreferred Browsers (Chromium-based)
- Chromium
- Google Chrome
- Brave
- Opera
- Microsoft Edge
- Vivaldi
and others
Danger
Do Not Use Internet Explorer. It is not validated, it does not support most modern features, it is not secure, and it will almost certainly break. You have been warned.
Search Speeds
In order to achieve the most efficient search experience possible, the search engine employs several optimization strategies to minimize the cost of both retrieving and searching the roughly 22,000 XML files that comprise the corpus. One of these strategies is caching at two levels:
- Caching site data with Cloudflare edge servers
- Caching XML data directly on the client machine
Tip
Cloudflare is a service that, among other things, provides a proxy layer for a website that can improve speeds by serving files from nearby datacenters rather than requiring the user to query the origin server directly.
In practice, this means that when a user requests data from the site, the request is first processed by Cloudflare's proxy layer, which produces either a HIT or a MISS. If the result is a MISS, indicating that the file is not stored in Cloudflare's local datacenter, a further request is made to the origin server to retrieve the file.
The returned file is then cached in two places: first in Cloudflare's local datacenter, and then, after processing, on the user's local device. This has the following important effects:
- Subsequent searches are substantially faster for a single user, thanks to local caching, provided that user does not refresh the page after the first search.
- Subsequent searches for all users served by that datacenter are substantially faster for as long as the files remain in the edge cache, thanks to Cloudflare's CDN (Content Delivery Network).
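The HIT/MISS behavior described above can be observed from the client by inspecting the `cf-cache-status` header that Cloudflare attaches to proxied responses. A minimal sketch, in which the corpus file URL is a hypothetical example:

```javascript
// Classify a response's Cloudflare edge-cache state from its headers.
// Cloudflare sets `cf-cache-status` to values such as HIT, MISS,
// EXPIRED, or DYNAMIC on responses that pass through its proxy.
function edgeCacheStatus(headers) {
  return headers.get('cf-cache-status') ?? 'UNKNOWN';
}

// Example: fetch one corpus file (hypothetical path) and log the state.
async function checkEdgeCache(url) {
  const res = await fetch(url);
  const status = edgeCacheStatus(res.headers);
  console.log(`${url}: edge cache ${status}`);
  return status;
}
```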
Warning
HOWEVER, the first search that a user runs, particularly if they are the first user served by a given datacenter to run a search, will be significantly slower. The fastest searches are those performed when both local and cloud caches are already populated. Testing shows that searches with neither cache in play may stretch as long as ~3 minutes, but once cloud caching is in play ~10s is more usual, and often under ~5s once local caching kicks in (given a moderately powerful desktop machine).
Device Considerations
This search engine runs primarily client-side to prevent server overload. This means that some devices will perform better than others, based primarily on CPU compute power (although of course network speed is also a factor). Testing has been done on a network with ~70Mbps download speeds, which is not exceptionally fast (home networks can often reach or exceed speeds of 1Gbps), using a powerful desktop, a moderately competent laptop, and a cell phone. All devices tested worked acceptably, with some small but tolerable hit to performance for weaker devices.
Kagi Translate
As most of the corpus was developed by a predominantly German-speaking team, most of the glosses and annotations are in German. To translate these into English, the SANH portal makes use of the Kagi Translate API, developed by the Kagi team.
Tip
Kagi is a team interested in creating more human-centered ways of interacting with the web, including developing an alternative to Google's search engine that does not contain ads or other promotional content. For our purposes, the most important piece is Kagi's Translate application, which serves as a more robust translation tool than Google Translate. More information on their project is available at help.kagi.com. We are not affiliated with them in any way, but our development team is very enthusiastic about their project.
Since the Kagi API is a third-party component, steps have been taken to ensure that a translation request made from the client machine does not have to travel the full route Client => Cloudflare => Origin => Kagi API and back again. For security reasons, the Kagi API credentials cannot be made directly accessible to the client. Instead, the application uses a Cloudflare Worker that receives and processes the translation fetch request from the client and handles both authentication and the actual request to the Kagi API. This cuts out the forwarding step to the origin server and makes better use of Cloudflare's distribution network.
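A minimal sketch of such a Worker follows. The upstream endpoint, the `Bearer` auth scheme, and the `KAGI_API_KEY` binding name are illustrative assumptions; the real Kagi API paths and authentication may differ.

```javascript
// Cloudflare Worker sketch: proxy translation requests to the Kagi API
// so the API key never reaches the client. All names are assumptions.
function buildUpstreamRequest(clientBody, apiKey, endpoint) {
  return {
    url: endpoint, // hypothetical Kagi translation endpoint
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`, // auth scheme assumed
        'Content-Type': 'application/json',
      },
      body: clientBody,
    },
  };
}

// In a real Worker this object would be the module's default export.
const worker = {
  async fetch(request, env) {
    if (request.method !== 'POST') {
      return new Response('Method not allowed', { status: 405 });
    }
    const body = await request.text();
    const { url, init } = buildUpstreamRequest(
      body,
      env.KAGI_API_KEY, // secret bound to the Worker, never sent to clients
      env.KAGI_ENDPOINT
    );
    // Forward to Kagi; the response body passes straight back to the client.
    return fetch(url, init);
  },
};
```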
A further speed optimization is the use of the Kagi API's streaming feature. This feature enables data chunking, meaning that responses to translation requests are not sent as one bulk payload but are delivered in small parts as they become available. These are processed on the client end, and the translation is loaded into the DOM as soon as it arrives, discarding irrelevant information and dramatically improving loading and display speeds for the user.
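On the client side, consuming such a chunked response might look like the following sketch; the callback and decoding details are illustrative, not the portal's actual code.

```javascript
// Consume a streamed translation response chunk by chunk, handing each
// decoded piece of text to a callback as soon as it arrives rather than
// waiting for the full body.
async function consumeStream(chunks, onText) {
  const decoder = new TextDecoder();
  let received = '';
  // `chunks` is any async iterable of Uint8Array chunks — for example a
  // Response.body stream (async-iterable in recent runtimes) or a mock.
  for await (const chunk of chunks) {
    const text = decoder.decode(chunk, { stream: true });
    received += text;
    onText(text); // e.g. append the partial translation to the DOM here
  }
  return received;
}
```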
Console Outputs
After any search, three pieces of information can be observed in the developer console:
- Whether the reCAPTCHA was passed with a normal or a somewhat suspect score
- The number of words analyzed in the search
- The length of time in milliseconds required to execute the search
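The timing figure above can be produced with the standard `performance.now()` API. A small sketch, where the search function and its word count are placeholders:

```javascript
// Time a search and log two of the values described above: the number
// of words analyzed and the elapsed time in milliseconds. `runSearch`
// is a placeholder for the actual search routine.
function timeSearch(runSearch) {
  const start = performance.now();
  const { wordsAnalyzed } = runSearch();
  const elapsedMs = performance.now() - start;
  console.log(`Words analyzed: ${wordsAnalyzed}`);
  console.log(`Search time: ${elapsedMs.toFixed(1)} ms`);
  return elapsedMs;
}
```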
403 Forbidden Errors
In some rare instances, you may be kicked to a 403 Forbidden page for suspicious activity. This is a result of the Google reCAPTCHA enabled on the page to prevent excessive bot interactions. Should you find this happening frequently, please get in contact with me and record the behavior that triggers the 403 error, and I will work to find a resolution. In most cases this will be isolated, but if it is occurring regularly it may be a sign that the threshold for identifying bot interaction is set too high.