
CirrusSearch: Add filter for exclusion of redirects or finding only them
Open, Low | Public | 13 Estimated Story Points | Feature

Description

This would at least be very useful for maintaining typo fixes in file titles, and maybe elsewhere as well (in a sense, the opposite of T171155, I guess).

Generally, when files are renamed (technically, moved to the new name), the old name is not deleted; a redirect is created in its place. So after a typo fix, a search for the misspelling still finds the old name. If one frequently searches for new files with the same mistake, the redirects left by the already fixed files get in the way.

Examples from the wild:

On the other hand there may be cases when it could be useful to search only for redirects.

Edit: The examples above are from Commons only, but this would be useful elsewhere, too. While searching for the possible impact of an issue in dewiki, it was necessary to find lemmata with parentheses containing the templates commons and commonscat, but the results show quite a lot of redirects with parentheses (very probably created by page moves), making it harder to assess the impact of the issue:

As @stjn noticed, the behaviour is even inconsistent. An insource search should find a redirect if it contains the search string, but it does not.
An example: There is a redirect Scil in dewiki. None of these searches finds it:

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title: cirrus: Add fields for first class redirects
Reference: repos/data-engineering/schemas-event-primary!62
Author: ebernhardson
Source Branch: work/ebernhardson/first-class-redirects
Dest Branch: master

Event Timeline


A valuable use case for title search would be looking at the number of titles matching certain patterns, to establish whether there's a consistent form in use for certain types of articles (cf. https://en.wikipedia.org/wiki/Wikipedia:CONSISTENCY). For example, comparing the parenthetical disambiguators "(video games)" vs. "(video gaming)". The inability to exclude redirects makes this impossible.

Also, FWIW, 2 of the 7 threads at https://www.mediawiki.org/wiki/API_talk:Search are complaints about the removal of srredirects, so it seems like this is not an uncommon pain point.

@Speravir, @stjn, @halfeatenscone—quick question for you about this feature. Do you have any thoughts on how it should be exposed? Would you want keywords like intitleonly: and inredirect: (exact names open for discussion) or a checkbox? If a checkbox, should the checkbox only exist on the "Advanced" tab to keep the UI simple for less sophisticated searchers?

It looks like T64680 took away a checkbox to enable redirects in results. I agree that general searchers would want redirect results, but I think it was overzealous because I can see why editors would want to exclude them in the examples above.

I think it could be a keyword, preferably one that allows people both to exclude and include redirects from the search (inredirect:true / inredirect:false then?). As to UI, it can probably be included to AdvancedSearch in some way (and not included in older UI?), but I don’t have a strong opinion about it.

I think it could be a keyword, preferably one that allows people both to exclude and include redirects from the search (inredirect:true / inredirect:false then?).

Are the :true and :false required? They probably would have to be, otherwise if true were the default, then searching for "false" with inredirect:false could be ambiguous, or at least not very transparent. If they are required, then you just have longer keywords with a colon in the middle. I also dislike introducing an (actual or apparent) new feature that only works on one keyword. Someone somewhere will want to search on incategory:false:Dogs or something similar, and will be disappointed or confused or get poor results—or all three.

I prefer keywords over checkboxes, too—though it isn't up to me, but it is why I'm interested in trying to devise a good keyword formalism.

As to UI, it can probably be included to AdvancedSearch in some way (and not included in older UI?), but I don’t have a strong opinion about it.

Ahh, to be clear I did actually mean the Advanced tab in the old UI. I tend to forget about the AdvancedSearch UI because I have it disabled. It's also maintained by a different team, so once we have the core functionality working (whether as a keyword or flag of some sort) they will have to update AdvancedSearch to make use of it. My first thought is that a checkbox makes sense for them, but I'm not sure. It is kind of awkward to have a three-way toggle (title, redirect, both). Maybe both and title-only are good enough for most users. But we are maybe getting ahead of ourselves worrying about AdvancedSearch.

In terms of the API, there used to be a boolean srredirects parameter, which would include redirects if true. Restoring that seems fine to me. Though it looks like the default value for boolean parameters is always false (https://www.mediawiki.org/w/api.php?action=help&modules=main#main/datatypes), so maybe it would be preferable to keep the current default behaviour, and add a boolean param with opposite semantics (srnoredirects? srsuppressredirects?).
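To make the API suggestion concrete, here is a minimal sketch of how a client might build such a request. The parameter name srnoredirects is only one of the names floated above and does not exist in the API; the helper build_search_url is likewise invented for illustration. The action=query, list=search and srsearch parameters are the existing ones.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.mediawiki.org/w/api.php"


def build_search_url(term: str, exclude_redirects: bool = False) -> str:
    """Build a list=search request URL.

    `srnoredirects` is HYPOTHETICAL: it is one of the names proposed in
    this thread, not an existing API parameter.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "format": "json",
    }
    if exclude_redirects:
        # Boolean API parameters count as true when present and false when
        # omitted, so the flag is only sent when exclusion is wanted.
        params["srnoredirects"] = "1"
    return API_ENDPOINT + "?" + urlencode(params)


print(build_search_url("intitle:postdam"))
print(build_search_url("intitle:postdam", exclude_redirects=True))
```

With opposite semantics, a client that sends nothing keeps today's behaviour, which is the point made about boolean defaults above.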

No strong opinion on the UI (checkbox vs. keyword, or both).

@halfeatenscone, thanks for the API point of view.

(I also have to say that while this is on our workboard, it is not very near the top at the moment. So, I appreciate the feedback, but it may be a while before anyone gets to work on it.)

Do you have any thoughts on how it should be exposed? Would you want keywords like intitleonly: and inredirect: (exact names open for discussion)

From my point of view (no dev, no code insight), inredirect: seems semantically not quite right. intitleonly: could be too easily confused with the existing intitle:. I would instead favour something like redirects: with the possible options show, i.e. true (the default behaviour), and hide, i.e. false, and, if technically possible and not too difficult to achieve, also only, meaning show only redirects and hide every other result.

Addendum: Another idea would be to create a keyword showredirects: (perhaps with a short form like a simple redirs:) without options. In the default case, adding showredirects: or +showredirects: would behave the same as not adding it at all, but with -showredirects: or !showredirects: redirects would be hidden.
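As a rough illustration of the three-option proposal, the sketch below pulls a redirects: keyword out of a query string. Both the keyword and its values show, hide and only are hypothetical (they are just the names suggested in this comment), and the function split_redirect_mode is made up for this example. It is not how CirrusSearch parses keywords.

```python
import re
from typing import Tuple

# HYPOTHETICAL keyword and values taken from the proposal above;
# nothing here exists in CirrusSearch.
_KEYWORD = re.compile(r"(?<!\S)redirects:(show|hide|only)(?!\S)")


def split_redirect_mode(query: str) -> Tuple[str, str]:
    """Return (mode, remaining query); the mode defaults to "show"."""
    mode = "show"
    match = _KEYWORD.search(query)
    if match:
        mode = match.group(1)
        query = query[:match.start()] + query[match.end():]
    # Collapse the whitespace left behind by the removed keyword.
    return mode, " ".join(query.split())


print(split_redirect_mode("intitle:postdam redirects:hide"))
print(split_redirect_mode("hide and seek"))
```

Because the value is mandatory, a plain search for a word such as "hide" or "false" stays an ordinary search term, which sidesteps the ambiguity raised earlier in the thread.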

or a checkbox? If a checkbox, should the checkbox only exist on the "Advanced" tab to keep the UI simple for less sophisticated searchers?

No checkbox for default search, but one for the advanced search extension (or for three options a selection form).

It looks like T64680 took away a checkbox to enable redirects in results.

Thanks. This reminds me that I actually wanted to add a ping @Deskana.

BTW, another thing that makes the current state of affairs feel confusing and inconsistent is that the prefix: operator doesn't apply to redirects, only the actual title.

With all respect, I'm beginning to think the change to remove srredirects and always include redirects was a step backward. It seems like the change was made in response to the request of one user (T171155), but I can count at least 6 users who have since reported problems owing to the change:

Adding an optional flag to exclude redirects would be fabulous, but I do believe that returning to the previous behaviour (defaulting to not matching redirects for intitle) would be even better in terms of utility/intuitiveness for the average user.

How hard would this be to add? It is comparable to searching for files at enwiki, where you see all local files plus all Commons files. To find only local files you write 'file: local:'. A similar solution would be perfect here, and very appreciated.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Jul 21 2023, 1:57 PM

Yes, please add this feature. I need lots of redirects in my wiki but they’ll flood and bury the actual articles.

The current way this functions is also really inconsistent. intitle: search includes redirects without any possibility to exclude them, and insource: search excludes redirects without any possibility to include them (there I would argue redirects should be part of results by default). It is really sad that this is still an issue.

I deal with these searches a *lot* when looking for Greek Letter Organizations as part of WikiProject:Fraternities and Sororities. The current use case is finding all redirects starting with a Greek letter and including Redirects with possibilities.

Something can be done to improve some of the use cases around redirects, but we would need to narrow in on things that can be done. The fundamental limitation here is that in the search data model redirects are not their own pages. The only metadata about redirects stored in the search index is the namespace and title, and that is attached to the page that is redirected to. Additionally search results are always at the granularity of the indexed documents. This means if two redirects to the same page match that can not be represented in the output as two matches. It will always be a match against the document that was redirected to, with a scoring bump for matching twice.
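The limitation can be pictured with a toy model. This is a minimal sketch, not the real index layout: the class PageDoc, its field names, and the sample titles are all invented for illustration. It only shows why two matching redirects collapse into a single hit on their target.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageDoc:
    """Simplified stand-in for one indexed page (field names are illustrative)."""
    title: str
    # Each redirect contributes only its title (and namespace) to its target.
    redirect_titles: List[str] = field(default_factory=list)


def search(index: List[PageDoc], term: str) -> List[str]:
    """Return the title of every document whose title or redirects contain term."""
    hits = []
    for doc in index:
        names = [doc.title] + doc.redirect_titles
        if any(term.lower() in name.lower() for name in names):
            # One hit per document, however many redirects matched.
            hits.append(doc.title)
    return hits


# Made-up sample data: two misspelled redirects pointing at one page.
index = [PageDoc("Potsdam", ["Postdam", "Postdam, Germany"])]
print(search(index, "postdam"))  # ['Potsdam']
```

Both redirects match, yet the only thing that can be returned is the target document, because the redirects have no documents of their own.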

First round of defining some user stories to guide potential implementation:

Editor-facing search (redirect mode)

  1. As a wiki editor, I want to search redirect pages by their own wikitext with withredirects: insource:/R from typo/, so that I can find and audit redirects created by a particular template or convention.
  2. As a wiki editor, I want withredirects: incategory:"Redirects from typos" to find redirect pages by their own categories, so that I can work through a maintenance category as a worklist.
  3. As a wiki editor, I want withredirects: intitle:/Obama/ to return the redirect page "Obama" as its own result, so that I can act on the redirect directly rather than only seeing its target.
  4. As a wiki editor, I want withredirects: intitle:Barack intitle:Obama to match only when both terms appear in the *same* redirect title, so that I stop getting false cross-matches between two different redirects on one target.
  5. As a wiki editor, I want the same false-cross-match fix to hold for the regex form withredirects: intitle:/Barack/ intitle:/Obama/, so that switching to regex does not re-introduce the bug.
  6. As a wiki editor, I want withredirects: to apply to my whole query when I put it at the front, so that I do not have to annotate every clause.
  7. As a wiki editor, I want a redirect document result to show its own title, namespace, and "redirects to X" target, so that I can see what it is and where it points.
  8. As a wiki editor, I want a primary result in redirect mode to *not* carry a "redirected from …" indicator, so that I am not shown redundant information when the causing redirect is already a result in its own right.
  9. As a wiki editor, I want every redirect indexed with no cap, so that insource:// over redirects is authoritative and not a truncated sample.
  10. As a wiki editor, I want orphan, broken, cross-namespace, Special:, and interwiki redirects to still appear in withredirects: searches, so that nothing falls through the cracks of an audit.
  11. As a wiki editor on a wiki where the feature is off, I want withredirects: to tell me it is not enabled and return no results, so that I am never silently handed a normal search I did not ask for.
  12. As a wiki editor, I want a bare full-text term in redirect mode to still reach a redirect's title and categories (via all), so that simple searches behave predictably, while wikitext stays reachable through insource:.
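The false cross-match named in stories 4 and 5 can be sketched in a few lines. This is an illustration under stated assumptions: the page names and redirect titles are made up, the target's own title is ignored for brevity, and default_mode and redirect_mode are invented helper names rather than anything in CirrusSearch.

```python
from typing import Dict, List

# Made-up data: redirect titles grouped by the page they point at.
REDIRECTS_BY_TARGET: Dict[str, List[str]] = {
    "Page A": ["Barack (name)", "Obama (surname)"],
    "Page B": ["Barack Obama II"],
}


def default_mode(terms: List[str]) -> List[str]:
    """Each term may be satisfied by a different redirect of the same target."""
    return [
        target
        for target, redirects in REDIRECTS_BY_TARGET.items()
        if all(any(term in r for r in redirects) for term in terms)
    ]


def redirect_mode(terms: List[str]) -> List[str]:
    """Every redirect is its own document, so all terms must hit one title."""
    return [
        r
        for redirects in REDIRECTS_BY_TARGET.values()
        for r in redirects
        if all(term in r for term in terms)
    ]


print(default_mode(["Barack", "Obama"]))   # ['Page A', 'Page B']
print(redirect_mode(["Barack", "Obama"]))  # ['Barack Obama II']
```

"Page A" is the false cross-match: no single redirect contains both terms, but the pooled redirect titles do. Treating each redirect as its own document removes that case.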

Ordinary search (default mode, must not regress)

  1. As an ordinary searcher, I want my normal searches to behave exactly as before, so that first-class redirect documents are invisible to me.
  2. As an ordinary searcher, I want "redirected from Obama" annotations to keep appearing on target results in default mode, so that I still learn which redirect matched.
  3. As an ordinary searcher, I want autocomplete/prefix and near-match to keep returning one result per page plus its redirects, so that suggestions are unchanged.
  4. As an ordinary searcher, I want did-you-mean and the completion suggester to behave byte-for-byte as before, so that redirect documents never pollute suggestions in this iteration.

Indexing & maintenance (the write side)

  1. As a search operator, I want editing a redirect page to produce its redirect document *and* refresh its target's redirect[] array, so that both representations stay current.
  2. As a search operator, I want a template/category change that touches a redirect (not just a direct edit) to refresh the redirect document's source_text/category/template/outgoing_link, so that these fields do not go stale.
  3. As a search operator, I want converting a redirect into an ordinary page to overwrite the redirect document with a primary document and clear its redirect_target, so that no stale target lingers.
  4. As a search operator, I want converting an ordinary page into a redirect to overwrite its primary document with a redirect document, so that the index reflects reality.
  5. As a search operator, I want a page move that leaves a redirect to index that redirect immediately from the hook, so that moved-from titles are not invisible to withredirects: until the next backfill (ADR-0006).
  6. As a search operator, I want deleting a redirect page to remove its document and drop it from the target's redirect[] array, so that deletions converge with no zombie docs.
  7. As a search operator, I want an edit-then-delete race to never leave a zombie redirect document, so that the index stays consistent under churn.
  8. As a search operator, I want forceSearchIndex to reproduce the runtime indexing outcome exactly — writing both the redirect document and the target's refreshed array — so that a reindex is never partial.
  9. As a search operator, I want the saneitizer to treat a redirect page as *expected* to have a page_type:"redirect" document under build:true, remediating a missing/stale one as pageNotInIndex, so that coverage self-heals without infinite churn.
  10. As a search operator, I want a single saneitizer dispatch chokepoint to route every redirect remediation through UpdateRedirectDocument rather than LinksUpdate, so that the saneitizer never re-flags the same redirect forever.
  11. As a search operator, I want saneitizer remediation to write to the one cluster being checked, so that multi-cluster fleets are not silently mis-targeted.
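A toy in-memory model may help to picture the write-side stories for edits, conversions and deletions. It is only a sketch: the field names page_type, redirect_target and redirect follow the stories above, while the value "page" for ordinary pages, the dict layout, and the functions save_page and delete_page are assumptions made for this example.

```python
from typing import Dict, Optional

# Toy index keyed by page title. Only page_type, redirect_target and
# redirect come from the stories; everything else is assumed.
Index = Dict[str, dict]


def _refresh_target(index: Index, target: str) -> None:
    """Rebuild a primary document's redirect[] array from the redirect docs."""
    doc = index.get(target)
    if doc is not None and doc["page_type"] == "page":
        doc["redirect"] = sorted(
            title for title, d in index.items()
            if d.get("redirect_target") == target
        )


def save_page(index: Index, title: str, redirect_to: Optional[str] = None) -> None:
    """Index an edit; redirect_to is None for an ordinary page."""
    old_target = index.get(title, {}).get("redirect_target")
    if redirect_to is None:
        # Overwrite with a primary document so no stale target lingers.
        index[title] = {"page_type": "page", "redirect_target": None, "redirect": []}
        _refresh_target(index, title)
    else:
        index[title] = {"page_type": "redirect", "redirect_target": redirect_to}
    # Both the previous and the new target need a fresh redirect[] array.
    for target in {old_target, redirect_to} - {None}:
        _refresh_target(index, target)


def delete_page(index: Index, title: str) -> None:
    """Remove a document and drop it from its target's redirect[] array."""
    doc = index.pop(title, None)
    if doc and doc.get("redirect_target"):
        _refresh_target(index, doc["redirect_target"])


index: Index = {}
save_page(index, "Potsdam")
save_page(index, "Postdam", redirect_to="Potsdam")
print(index["Potsdam"]["redirect"])  # ['Postdam']
delete_page(index, "Postdam")
print(index["Potsdam"]["redirect"])  # []
```

The sketch says nothing about races or multi-cluster writes; it only shows the two representations that each operation has to keep in step.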

WMF streaming updater (cross-repo)

  1. As a WMF search operator, I want the prop=cirrusbuilddoc API to return a redirect's document body under build:true, so that the streaming updater can fetch and index it.
  2. As a WMF search operator, I want the producer to emit REV_BASED_UPDATE (not PAGE_DELETE) for redirect-page edits, so that redirect documents are written rather than deleted.
  3. As a WMF search operator, I want the producer to stop dropping rerender events for redirects, so that template/category changes refresh redirect documents.
  4. As a WMF search operator, I want the Kafka stream schema to declare page_type and redirect_target, so that the Flink typed-Row round-trip does not silently strip them.
  5. As a WMF search operator, I want backfill of the tens of millions of existing redirects to ride the saneitizer loop paced over loopDuration, so that the MW API and indexing pipeline are never overwhelmed by a one-shot sweep.
  6. As a WMF search operator, I want the rollout ordered so no page_type-less document is ever written, so that nothing leaks into normal search during migration (ADR-0007).
  7. As a WMF search operator, I want redirect_target's equals noop handler and null-clearing to ride through the existing super_detect_noop machinery unchanged, so that production write semantics match the in-MW path with no consumer code change.

Operations & rollout

  1. As a search operator, I want page_type/redirect_target mappings and the must_not filter to ship *before* any redirect write, so that the first write is immediately excluded from normal search (ADR-0007).
  2. As a search operator, I want build and use to be independent flags, so that I can populate the index first and expose the keyword later.
  3. As a search operator, I want a use:true && build:false misconfiguration to warn and return zero results, so that a config mistake is noticeable to the affected users.
  4. As a search operator, I want to verify coverage by comparing count(page_type:"redirect") against SELECT COUNT(*) FROM page WHERE page_is_redirect=1, so that I know backfill is complete before flipping use:true.
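The interplay of the must_not filter with the build and use flags could look roughly like the sketch below. The filter is written as an Elasticsearch-style bool query expressed as Python dicts. The function wrap_query, the logger name, and the convention of returning None for "zero results" are assumptions for illustration, not the actual CirrusSearch implementation.

```python
import logging
from typing import Optional

log = logging.getLogger("redirect-rollout-sketch")

# Default-mode exclusion from the rollout stories: redirect documents
# must never match an ordinary search.
EXCLUDE_REDIRECT_DOCS = {"term": {"page_type": "redirect"}}


def wrap_query(user_query: dict, redirect_mode: bool,
               build: bool, use: bool) -> Optional[dict]:
    """Return the query to run, or None when zero results must be returned."""
    if not redirect_mode:
        # Ordinary search: always filter out first-class redirect documents.
        return {"bool": {"must": [user_query],
                         "must_not": [EXCLUDE_REDIRECT_DOCS]}}
    if not use:
        log.warning("withredirects: is not enabled on this wiki")
        return None
    if not build:
        # Misconfiguration: the keyword is exposed but nothing is indexed.
        log.warning("use is on but build is off; returning zero results")
        return None
    return user_query


query = {"match": {"title": "postdam"}}
print(wrap_query(query, redirect_mode=False, build=True, use=True))
```

Keeping the two flags separate is what allows the index to be populated first and the keyword exposed later, with the filter already in place before the first redirect document is written.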

Change #1297158 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add redirectScope concept to SearchContext

https://gerrit.wikimedia.org/r/1297158

Change #1297159 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add ability to build redirect documents

https://gerrit.wikimedia.org/r/1297159

Change #1297160 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Wire redirect documents into the update process

https://gerrit.wikimedia.org/r/1297160

First round of defining some user stories to guide potential implementation:
[…]

Impressive!

I was going to say that I miss the aspect which was the reason I opened this ticket: finding results without redirects. But from https://gerrit.wikimedia.org/r/1297158 it seems you are aware of this, and I just did not understand the intended scope thing.

That said, almost 8 years have passed since I opened this ticket, and something seems to have changed: the last examples, with a search for "Scil", seem to work now. But there is another inconsistency I do not get at all: the first two examples still work like 8 years ago, but the third, with a search for "postdam", does not! (Edit: See update.) And that’s most annoying. The search result still displays the true issues, and also still displays the pages where the issue has actually been fixed, but I do not get the part with the hint "Redirected from" (I get the German translation, though, on search queries where it works), the one embedded in a <span class="searchalttitle">…</span>. I remember having this non-display issue for other search queries, too. In fact, I was surprised that the first two search examples still display this part.

Update: Oh, if I change the sort criterion for the "postdam" search query (in fact to alphabetic), the redirect string is suddenly displayed, again.

Change #1297159 abandoned by Ebernhardson:

[mediawiki/extensions/CirrusSearch@master] Add ability to build redirect documents

Reason:

squashed into Ib9a9e10f16c

https://gerrit.wikimedia.org/r/1297159

Change #1298808 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] first-class-redirects: Wire up query side

https://gerrit.wikimedia.org/r/1298808

Change #1298877 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/Wikibase@master] Skip MoreLikeWikibaseTest

https://gerrit.wikimedia.org/r/1298877

Change #1298878 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/Wikibase@master] Restore MoreLikeWikibaseTest

https://gerrit.wikimedia.org/r/1298878

Change #1298895 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Skip EntitySearchElasticTest

https://gerrit.wikimedia.org/r/1298895

Change #1298896 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Re-enable EntitySearchElasticTest

https://gerrit.wikimedia.org/r/1298896

Change #1298897 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Update to align with ResultsType inheritance

https://gerrit.wikimedia.org/r/1298897

I was going to say that I miss the aspect which was the reason I opened this ticket: finding results without redirects. But from https://gerrit.wikimedia.org/r/1297158 it seems you are aware of this, and I just did not understand the intended scope thing.

It does and it doesn't. I suspect solving that portion will have to build upon what is already planned, but that should be easy once this first, and more difficult, step is done. The challenge with solving this class of problem has been that the search engine did not index redirects and thus could not return them as their own search results. The insource queries can't work because the wikitext of redirect pages is not in the index anywhere; the only thing the search engine knew was that page A has redirects [B, C, ...] pointing at it. This first step introduces redirect documents to the search index but only exposes them under the new 'redirect scope', where redirects and articles are all equal. This was decided on as the most flexible starting point for building the rest.

With a query mode that includes redirects as first-class search results it will be much easier to build filters such as returning only redirects, or only redirects that point at a specific page, or a specific namespace. Often the hardest part of building those filters, when the search index already supports them, is defining the name and semantics of the keywords. It's often unclear how much we should bake in though, as insource:/.../ should be able to answer many more questions than we can imagine. Are there any new semantics that would serve your use cases better than insource?

That said, almost 8 years have passed since I opened this ticket, and something seems to have changed: the last examples, with a search for "Scil", seem to work now. But there is another inconsistency I do not get at all: the first two examples still work like 8 years ago, but the third, with a search for "postdam", does not! (Edit: See update.) And that’s most annoying. The search result still displays the true issues, and also still displays the pages where the issue has actually been fixed, but I do not get the part with the hint "Redirected from" (I get the German translation, though, on search queries where it works), the one embedded in a <span class="searchalttitle">…</span>. I remember having this non-display issue for other search queries, too. In fact, I was surprised that the first two search examples still display this part.

Update: Oh, if I change the sort criterion for the "postdam" search query (in fact to alphabetic), the redirect string is suddenly displayed, again.

It looks like the highlighting variance is perhaps an issue with MediaSearch that we need to look into. The query showing the highlights when set to alphabetic sort, but not when set to relevance sort, suggests that the specialized media searching query is not exposing the search terms to the highlighting layer in the same way as the default cirrus query. I see the same general behavior with your other "avaition" example. This will always be awkward in the default mode, though: two words can match on two different redirects and the search index doesn't really know. I'm hoping that the redirect scope will better serve this use case by allowing searches against the redirect documents directly, avoiding this semantic overlap.

Change #1299603 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseMediaInfo@master] Skip MediaSearchQueryBuilderTest

https://gerrit.wikimedia.org/r/1299603

Change #1299604 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseMediaInfo@master] Re-enable MediaSearchQueryBuilderTest

https://gerrit.wikimedia.org/r/1299604

Change #1297157 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add page_type and redirect_target fields

https://gerrit.wikimedia.org/r/1297157

Change #1297157 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add page_type and redirect_target fields

https://gerrit.wikimedia.org/r/1297157

Change #1299603 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Skip MediaSearchQueryBuilderTest

https://gerrit.wikimedia.org/r/1299603

Change #1298877 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Skip MoreLikeWikibaseTest

https://gerrit.wikimedia.org/r/1298877

Change #1298895 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Skip EntitySearchElasticTest

https://gerrit.wikimedia.org/r/1298895

Change #1297158 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add redirectScope concept to SearchContext

https://gerrit.wikimedia.org/r/1297158

Change #1300939 had a related patch set uploaded (by Reedy; author: Reedy):

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] LexemeFullTextQueryBuilderTest: Mark test as skipped

https://gerrit.wikimedia.org/r/1300939

Change #1300940 had a related patch set uploaded (by Reedy; author: Reedy):

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Re-enable LexemeFullTextQueryBuilderTest

https://gerrit.wikimedia.org/r/1300940

Change #1300939 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] LexemeFullTextQueryBuilderTest: Mark test as skipped

https://gerrit.wikimedia.org/r/1300939

Change #1298896 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Re-enable EntitySearchElasticTest

https://gerrit.wikimedia.org/r/1298896

Change #1301427 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] ForceSearchIndex: build redirect documents during reindex

https://gerrit.wikimedia.org/r/1301427

Change #1301428 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] saneitizer: Handle first-class redirect documents

https://gerrit.wikimedia.org/r/1301428

Change #1299604 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Re-enable MediaSearchQueryBuilderTest

https://gerrit.wikimedia.org/r/1299604

Change #1298897 abandoned by Ebernhardson:

[mediawiki/extensions/WikibaseCirrusSearch@master] Update to align with ResultsType inheritance

Reason:

Went with a different solution which kept the interface unchanged

https://gerrit.wikimedia.org/r/1298897

Change #1297160 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Wire redirect documents into the update process

https://gerrit.wikimedia.org/r/1297160

Change #1301427 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] ForceSearchIndex: build redirect documents during reindex

https://gerrit.wikimedia.org/r/1301427