Searching REST API documents with TYPO3's indexed_search - Christian Weiske

Instead of writing our own search, we managed to integrate REST API data into TYPO3's native indexed_search results. This gives us a mix of website content and REST data in one result list.


A TYPO3 v7.6 site at work consists of a normal page tree with content that is searchable with indexed_search.

A separate management interface is used by editors to administer some domain-specific data outside of TYPO3. That data is available via a REST API, which one of our TYPO3 extensions uses to display it on the website.

That externally managed data should now become searchable on the TYPO3 website.

Integration options


I pondered for a long time how to tackle this task. There were two approaches:

  1. Integrate the API data into indexed_search, so that it appears inside the normal search result list.
  2. Have separate searches for website content and API content. The search result list would have two tabs, one for each type. An indicator would show how many results were found for each type, and the user would have to switch between them.

The second option looked easier at first because it does not require digging into indexed_search. But after thinking it through, I found that I would be replicating all the basic features needed for a search: listing results, paging, and those tabs as well.

The customer would then also demand an overview page showing the first three results of each type, with a "view all" button.

In the end I decided to use option #1 because it would feel most integrated and would mean less code.

How indexed_search + crawler work together


First of all, I have to recommend Indexed Search & Crawler - The Missing Manual because it explains many things and helps with the basic setup.

URL list generation


You may create crawler configurations and indexed_search configurations in the TYPO3 page tree. Both are similar, yet different. How do they work together?

  1. The crawler scheduler task and the command line script both start crawler_lib::CLI_run().
  2. The cli_hooks are executed. indexed_search has registered its IS\CrawlerHook as one of them, and that hook is started (see the registration sketch after this list).
  3. All indexing configuration records are checked for their next execution time. If one of them needs to run, it is put into the crawler queue as a callback entry that runs IS\CrawlerHook again.
  4. The crawler queue is processed and calls IS\CrawlerHook::crawler_execute().
  5. IS\CrawlerHook::crawler_execute_type4() gets a URL list via crawler_lib::getUrlsForPageRow().
    1. Crawler configuration records are searched for in the rootline.
    2. URLs are generated from the configurations found (crawler_lib::compileUrls()).
    3. The URLs are queued with crawler_lib::urlListFromUrlArray().

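To make steps 2 and 4 more concrete, here is a rough sketch of the registration and callback pattern, modeled on what indexed_search itself does in TYPO3 7.6. The extension, namespace, class name and the queued parameters are made up for illustration; the EXTCONF keys and method signatures belong to the crawler extension and should be verified against the version you actually run.

    <?php
    // File: ext_localconf.php of a hypothetical extension "myext".
    // Register a class as a crawler cli_hook - the same mechanism
    // indexed_search uses for its IS\CrawlerHook.
    $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['cli_hooks'][]
        = \Vendor\Myext\Hook\ApiCrawlerHook::class;

    <?php
    // File: Classes/Hook/ApiCrawlerHook.php
    namespace Vendor\Myext\Hook;

    /**
     * Hypothetical crawler hook, mirroring the flow of
     * \TYPO3\CMS\IndexedSearch\Hook\CrawlerHook described above.
     */
    class ApiCrawlerHook
    {
        /**
         * Step 2: called once per crawler run via the cli_hooks registration.
         * Decides what needs indexing and enqueues callback entries for it.
         *
         * @param \tx_crawler_lib $pObj the calling crawler library object
         */
        public function crawler_init(&$pObj)
        {
            // addQueueEntry_callBack() is the crawler API that indexed_search's
            // CrawlerHook uses as well; parameter order as of the crawler
            // version for TYPO3 7.6 - verify against your installation.
            $pObj->addQueueEntry_callBack(
                0,                        // set id
                ['indexConfigUid' => 42], // parameters handed back to crawler_execute()
                self::class,              // class whose crawler_execute() is called
                0,                        // page id the entry is attached to
                $GLOBALS['EXEC_TIME']     // schedule: process as soon as possible
            );
        }

        /**
         * Step 4: called when the queued callback entry is processed.
         * This is the place where REST documents would be fetched and indexed.
         */
        public function crawler_execute($params, &$pObj)
        {
            // $params contains the array passed to addQueueEntry_callBack() above.
        }
    }

With that in place, the CLI_run() flow described above picks up the hook on the next crawler run.
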
Note that the crawler only processes entries that were already in the queue when it started. Queue items added during a crawl run are not processed in that run, but in a later one.

This means that it may take 6 or 7 crawler runs until the crawler gets to the page with your indexing and crawler configuration. During development it is better to use the backend module Info -> Site crawler to enqueue your custom URLs, or to have a minimal page tree with only one page :)

Crawler URLs


Crawler configuration records are URL generators.

Without special configuration, they return the URL for a page ID. Pretty dull.

The crawler manual shows that they can be used for more, and gives a
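As a hedged illustration of that (the extension parameter name is made up, and the exact bracket syntax should be checked against the crawler manual): the Configuration field of a crawler configuration record can contain GET parameters with value ranges, which crawler_lib::compileUrls() expands into one URL per value.

    Configuration field:   &tx_myext_pi1[itemUid]=[1-3]

    Generated URLs (for a page with uid 42):
      index.php?id=42&tx_myext_pi1[itemUid]=1
      index.php?id=42&tx_myext_pi1[itemUid]=2
      index.php?id=42&tx_myext_pi1[itemUid]=3
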


Truncated by Planet PHP, read more at the original (another 6331 bytes)
