massviews hits 429 Too Many Requests despite making requests synchronously
Closed, ResolvedPublic

Description

Original report

Sometime recently (I want to say it's recent), https://tools.wmflabs.org/massviews sometimes gets 429 responses from the pageviews API. Each request is separated by 10ms, which should mean it would never exceed the 100 req/sec limit, as indicated at https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end.

Did something change recently? Or perhaps I'm doing something wrong?

For reference, here is the code I use to add rate-limiting: https://github.com/MusikAnimal/pageviews/blob/b90732a6e3329b3caaf89337237463c21dc5ec00/javascripts/shared/pv.js#L1369-L1402. fn here would be the promise to actually make the request to the API.
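As a rough illustration of the fixed-spacing approach described above (this is not the linked pv.js code), a minimal sketch might look like the following. The names `spacedRequests` and `maxRatePerSecond` are made up for this example, and `fns` is assumed to be an array of functions that each return a promise for one API request.

```javascript
// Hypothetical helper: start request i after i * intervalMs, regardless of
// whether earlier requests have finished (requests can overlap in flight).
function spacedRequests(fns, intervalMs = 10, setTimer = setTimeout) {
  return Promise.all(
    fns.map((fn, i) =>
      new Promise((resolve) => setTimer(resolve, i * intervalMs)).then(fn)
    )
  );
}

// Nominal client-side send rate implied by a fixed spacing.
function maxRatePerSecond(intervalMs) {
  return 1000 / intervalMs;
}

console.log(maxRatePerSecond(10)); // 100
```

Note that this only bounds the rate at which the client starts requests; it says nothing about how the server counts them.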

March 2026

In addition to the original report, we are now getting hit by https://www.mediawiki.org/wiki/Wikimedia_APIs/Rate_limits

So the goal now is to simply have the tool reliably make requests en masse. We are trying to avoid moving the API querying logic to the server for the time being.

Event Timeline

@MusikAnimal is this report coming from users getting 429? Are you getting the errors yourself?

fdans triaged this task as Medium priority. Apr 4 2019, 5:17 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Yes, it was reported at meta:Talk:Pageviews Analysis. The user only sees "Error occurred when querying Pageviews API - Unknown". This is not at all uncommon when you give Massviews a large set of pages, but when I've checked it in the past it was always the 404 gotcha, where they were all obscure pages that evidently hadn't been viewed since the pageviews API was introduced (so 404 in that case means 0 pageviews). When I investigated the aforementioned report, I saw that for some pages, the response was 429. If you read the bug report, they are saying that when they try a second time the pageviews will successfully be fetched for those pages, which makes sense when reading the gotcha for 429:

429 throttling
Client has made too many requests and it is being throttled. This will happen if the storage cannot keep up with the request ratio from a given IP. Throttling is enforced at the storage layer, meaning that if you request data we have in cache (cause other client has requested it earlier) there is no throttling. Throttling will be enabled late May 2016.

So on the second try, the API can serve from cache for most of the pages. It only has to pull from storage for the pages that got a 429 on the first run.

The problem here is that I am conforming to the 100 req/sec throttling, which in theory means we shouldn't get 429s in the first place.
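The status codes discussed in this comment could be told apart in client code roughly as follows. This is only a sketch: `classifyStatus` and its return labels are hypothetical, and the treatment of 404 as "zero pageviews" follows the gotcha described above.

```javascript
// Hypothetical classifier for Pageviews API responses:
// 404 is treated as zero pageviews, 429 as "throttled, try again later".
function classifyStatus(status) {
  if (status >= 200 && status < 300) return 'ok';
  if (status === 404) return 'zero-views';
  if (status === 429) return 'throttled';
  return 'error';
}
```

Separating 429 from other failures would let the tool show something more specific than "Unknown" and decide whether a retry is worthwhile.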

So on the second try, the API can serve from cache for most of the pages. It only has to pull from storage for the pages that got a 429 on the first run.

This is not a correct assumption: requests that received a 429 are not fetched from storage; rather, they are rejected before being processed.

A second run (likely) succeeds because some requests were processed on the first run (they did not get 429s) and are now cached, so the batch that hits storage is smaller for the same client-side request. Makes sense?

I am conforming to the 100 req/sec throttling

Throttling happens per IP. A user with two tabs open in your case can send 100 reqs per sec per tab, correct? If so, while it is good that some rate limiting code exists in the tool, it is easy to bypass.

So on the second try, the API can serve from cache for most of the pages. It only has to pull from storage for the pages that got a 429 on the first run.

This is not a correct assumption: requests that received a 429 are not fetched from storage; rather, they are rejected before being processed.

A second run (likely) succeeds because some requests were processed on the first run (they did not get 429s) and are now cached, so the batch that hits storage is smaller for the same client-side request. Makes sense?

Yes, that is what I meant. On the second try we only pull from storage for pages that received a 429 on the first run. The remaining pages are cached, as you say.

I am conforming to the 100 req/sec throttling

Throttling happens per IP. A user with two tabs open in your case can send 100 reqs per sec per tab, correct? If so, while it is good that some rate limiting code exists in the tool, it is easy to bypass.

I can't speak for the user who reported the error, but in my testing I was only using one tab. The example used for testing was https://tools.wmflabs.org/massviews/?platform=all-access&agent=user&source=category&target=https%3A%2F%2Fsv.wikipedia.org%2Fwiki%2FKategori%3ANaturreservat_i_Sverige&start=2018-10-01&end=2019-03-31&subjectpage=0&subcategories=1&sort=views&direction=1&view=list (check the network log in the developer console). You may or may not actually get 429s, if you don't I suppose you could wait however long it takes for the cache to expire and try again. There will probably be some 404s in there too (zero pageviews).

@Nuria @fdans Now I see "HyperSwitch request rate limit exceeded" (before it was 429s without a message), despite making no more than the maximum 100 req/sec. This starts happening only after so many thousand requests are made in succession. It seems like it's some sort of DDoS prevention, because every request returns 429, when at least some should go through if it was only enforcing the 100 req/sec limit. In the case of Massviews, we could be querying for up to 20,000 pages, which comes out to about 3.3 minutes straight of making requests in 10ms intervals.

The issue has been going on since sometime in early 2019, perhaps earlier. Before then Massviews was able to run without any errors (apart from 404s).

How can I avoid the 429s?

when at least some should go through if it was only enforcing the 100 req/sec limit.

Let's see: rate limiting is enforced per IP for public APIs. Once you go over the limit of what we think is sustainable, your IP will be throttled for a bit (limit enforcement does not stop as soon as you stop making connections, but a bit after), so there is no guarantee that 100 of your connections per sec are going to make it once you go above that limit. These (to be clear) are connections from the browser, correct?

This starts happening only after so many thousand requests are made in succession.

Right, the volume of requests hitting the server must be >100 reqs per sec. That can happen even if your client is sending a number of requests close to that limit but a bit below.

I think this is the right code, please see: https://github.com/wikimedia/limitation

Thanks. I'm thinking what I might do is, when I hit the first 429, make it pause for a bit before resuming making more requests. Or I could just increase the timeout between requests. Both seem like hacky, sub-par solutions; I can try to dig through wikimedia/limitation to see what the exact logic is, and try to go by it, to ensure my tool goes as fast as it can. Massviews is used for GLAM, outreach, etc., where by nature there will be thousands of pages to look up.

These (to be clear) are connections from the browser correct?

Yes, all from the browser.

I've been hitting this problem consistently with the Massviews tool (only using 1 tab). I wonder if slightly tweaking the 10ms pause would fix it. Maybe we could try changing it to 12ms and see if that makes the difference, as currently we're surfing right on the edge of the throttle.

Given that, for links like the one above (see a couple of comments up), this tool makes 5000 requests from 1 tab (see the network panel in Chrome), it is unlikely to work even if you "space" requests a bit more. To work best, the tool needs an entirely different API that is, say, category-based and not page-based.

In the absence of an API that is more tailored to your use case, you can manage the queue of requests. For example, you can send N requests (the browser will multiplex) and, when the first one gets a 429, stop, message the user in the UI, and continue some time after. That way the user gets data in stages.

you can send N requests (the browser will multiplex) and, when the first one gets a 429, stop, message the user in the UI, and continue some time after. That way the user gets data in stages

Yeah that's basically my idea; I'm going to implement a retry handler to make it pause continually after each 429, ensuring every page is accounted for. This is what we recently had to do for Popular Pages bot.
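A staged retry handler along these lines might be sketched as below. This is an assumption-laden illustration, not the Massviews or Popular Pages bot implementation: `fetchInStages`, its options, and the shape of `request(page)` (a function resolving to an object with a `status` field) are all hypothetical.

```javascript
// Sketch: send pages in batches; any page that gets a 429 is put back at the
// front of the queue, and the loop pauses before sending the next batch.
async function fetchInStages(pages, request, options = {}) {
  const { batchSize = 100, pauseMs = 1000, maxPauses = 20, sleep } = options;
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  const results = new Map();
  let queue = [...pages];
  let pauses = 0;

  while (queue.length > 0) {
    const batch = queue.slice(0, batchSize);
    const rest = queue.slice(batchSize);
    const responses = await Promise.all(batch.map((page) => request(page)));

    const throttled = [];
    batch.forEach((page, i) => {
      if (responses[i].status === 429) throttled.push(page);
      else results.set(page, responses[i]);
    });

    queue = throttled.concat(rest);
    if (throttled.length > 0) {
      pauses += 1;
      if (pauses > maxPauses) {
        throw new Error('still throttled after ' + maxPauses + ' pauses');
      }
      await wait(pauseMs); // a UI message could be shown here
    }
  }
  return results; // every page has a non-429 result
}
```

The `maxPauses` cap is there so the loop cannot spin forever if the server keeps rejecting everything; the right values for it and for `pauseMs` are not something this thread establishes.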

However I'd like to reiterate that this wasn't an issue some months ago. For however many years it's been, we were safe to query at 100 req/sec without worry. I still think there might be some issue with the API's throttling logic, because again we are abiding by the advertised rate limit but are still getting 429s.

because again we are abiding by the advertised rate limit but are still getting 429s.

Ok, maybe we need to look at this a bit more, but in any case the best way to approach this massive number of requests is in stages.

This can be tricky to diagnose because we don't really know what if any upstream changes are made to Hyperswitch. Do you have a more accurate idea about when you started seeing this? Is it when you made the task, beginning of April this year?

Probably April, or at least sometime in 2019... I'm not sure :( This situation has certainly grown worse in recent months.

I'll note that most of the 429s have an empty response body, with a Retry-After header of 1 second. More recently (late June / early July), the 429s *sometimes* have a JSON response with the error message "HyperSwitch request rate limit exceeded", and no Retry-After header. So it seems like there are two things at play here. I'm just fairly certain there weren't 429s at all for most of Massviews' life. When I developed it, I made sure it ran at 100 req/sec, and I only ever saw 404s (meaning zero pageviews).
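Because some 429s carry a Retry-After header and others do not, a client would need a fallback delay. A minimal sketch, assuming a `headers` object with a `get(name)` method (as on the Fetch API's `Headers`); the function name `retryDelayMs` and the 1-second default are illustrative choices, not something the API documents:

```javascript
// Returns how long to wait before retrying, in milliseconds.
// Retry-After may be a number of seconds or an HTTP date.
function retryDelayMs(headers, fallbackMs = 1000, now = Date.now()) {
  const raw = headers.get('Retry-After');
  if (raw === null || raw === undefined || raw.trim() === '') {
    return fallbackMs; // e.g. the JSON-bodied 429s with no header
  }
  const value = raw.trim();
  if (/^\d+$/.test(value)) {
    return Number(value) * 1000;
  }
  const date = Date.parse(value);
  if (Number.isNaN(date)) {
    return fallbackMs;
  }
  return Math.max(0, date - now);
}
```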

Hopefully this is helpful. If it means anything, the Popular Pages bot for instance (which also does mass querying) still goes impressively fast despite having to pause for the 429s. Massviews just doesn't have the same kind of retry handler implemented, which I'm going to add. When I do, I suspect it will be of satisfactory speed for the users. My point with this task is that those 429s weren't a problem (or as much of a problem) before with the current, long-standing Massviews implementation, and we're not exceeding 100 req/sec.

Thanks for looking into it!

So, looked into code history more carefully. There's literally one code change in AQS in 2019, and it doesn't touch pageviews handling at all. npm saw fit to update some of the repository references for kad, swagger-ui, and json-stable-stringify. I suppose we could look into those but that would be pretty crazy bad luck. I think the logical next place to look is the layer in front of AQS, the problem is 99% likely to be from there. Pinging @Pchelolo to see if this sounds familiar. Petr, basically we're seeing a lot more 429s since around April 2019, and we see two different kinds:

  • with a Retry-After header of 1 second
  • JSON response with the error message "HyperSwitch request rate limit exceeded"

Searching the hyperswitch repo shows this has been in place for 3 years. Any idea what changed around it? It's possible it's behaving as designed, just trying to understand exactly what's going on so we can maybe set expectations.

NOTE: I'm getting this weird deja-vu feeling like I bothered Petr about this before, sorry if I forgot something obvious.

@MusikAnimal is this still an issue? Nothing has happened on this ticket for 3 years (if you ignore the workboard/team shuffling).

Yes. The API seems to permit 100 req/sec, but only for like 10 seconds or something, then you get a flood of 429s. In my opinion that's a bug… but probably all pageviews clients should have retry logic, anyway.

In recent tests, it seems to start giving 429s after exactly 1000 requests. That's with each request separated by 10ms, which worked great some years ago until this task was filed.

It looks like the rate limiting policy might have changed? The new docs at https://doc.wikimedia.org/generated-data-platform/aqs/analytics-api/documentation/access-policy.html#rate-limits say we need to do full round-trip requests before making a new request. I was not aware of that! That should be easier to implement than adding re-try logic, though.

I will look into fixing this in the Pageviews suite of tools, but it will mean the mass-querying tools in particular such as Massviews will run considerably slower. Better than it not working at all, certainly.
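For illustration, "full round-trip before the next request" could look like the sketch below. `fetchSequentially` and the `onProgress` callback are hypothetical names, and `request(page)` is assumed to return a promise for one API response.

```javascript
// Sketch: never have more than one request in flight. Each request is
// awaited to completion before the next one is started.
async function fetchSequentially(pages, request, onProgress = () => {}) {
  const results = [];
  for (const page of pages) {
    results.push(await request(page));
    onProgress(results.length, pages.length); // e.g. update a progress bar
  }
  return results;
}
```

With this approach, total time is roughly the number of pages multiplied by the round-trip latency, which is why the mass-querying tools would run considerably slower.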

From preliminary tests on ~8-10K requests:

  • Making calls synchronously (full round-trip before making another request) doesn't seem to be enough. We will still eventually get 429s.
    • I've also added delays before subsequent calls (I tried up to 250ms). That seemingly works to a degree, but after a long wait we still get the 429s.
  • I've added in retry logic, and while some retries succeed, other times I get blocked by CORS (?!). The latter might be something to do with my local environment, not sure.
  • Wikimedia-side, things seem to be cached in batches of 1,000. After hitting 429s and retrying with a page refresh enough times, eventually the next batch of 1,000 will be cached, and it will work. Around ~10K requests used to take a few minutes at most; now, using the strategies mentioned above, it could take hours.

Is there a foolproof way to make things work for frontend, anonymous clients? From what I'm told, most users are willing to wait, if need be.

MusikAnimal raised the priority of this task from Medium to High. Mar 15 2026, 3:38 AM

Massviews was working on 7 December 2023. I am sure it was working in 2024.

It looks like the rate limiting policy might have changed?

We rolled out T417778: rest gateway: enforce rate limits (stage one), see https://www.mediawiki.org/wiki/Wikimedia_APIs/Rate_limits

  • I've added in retry logic, and while some retries succeed, other times I get blocked by CORS (?!). The latter might be something to do with my local environment, not sure.

maybe T418969: Rate limiting gateway should allow cross-origin requests from web browsers to read the HTTP 429 response?

You may be interested in T418957: Add client-side logging for non-MediaWiki action API errors (HTTP 429) as well. Hit me up on Slack if you have questions.

MusikAnimal renamed this task from 429 Too Many Requests hit despite throttling to 100 req/sec to 429 Too Many Requests hit despite making requests synchronously. Mar 16 2026, 6:49 PM
MusikAnimal updated the task description.
daniel renamed this task from 429 Too Many Requests hit despite making requests synchronously to massviews hits 429 Too Many Requests despite making requests synchronously. Mar 16 2026, 6:53 PM

@MusikAnimal how many requests does this tool need to make to provide a useful response to the user? I'm asking because with the current limits for anons (500 requests per hour), it may not be possible. Unfortunately, hitting that limit leaves many features (popups, VE) broken until it resets (at the next hour). We have been considering using per-minute limits instead, since they reset more quickly. But they would either be very easy to hit (10 req/minute is not much) or so high that the sustained load is more than what we want to allow for anons (that is, what we can accept from scrapers)...

@MusikAnimal how many requests does this tool need to make to provide a useful response to the user? I'm asking because with the current limits for anons (500 requests per hour), it may not be possible. Unfortunately, hitting that limit leaves many features (popups, VE) broken until it resets (at the next hour).

Wow, okay. Well, these are human requests with a User-Agent, so it's at least up to 1,000 an hour, but yeah… that won't suffice. Massviews will need to query up to 20,000, and I suspect these end users and researchers will want to use the tool over and over in succession for different categories, date ranges, etc.
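To put those numbers side by side, here is the back-of-the-envelope arithmetic as a small sketch. The figures (500 and 1,000 requests per hour, 20,000 pages) come from this thread; the function name is made up.

```javascript
// Number of hourly rate-limit windows needed to get through a query,
// assuming the full hourly allowance is available in every window.
function hourlyWindowsNeeded(totalRequests, limitPerHour) {
  return Math.ceil(totalRequests / limitPerHour);
}

console.log(hourlyWindowsNeeded(20000, 500));  // 40
console.log(hourlyWindowsNeeded(20000, 1000)); // 20
```

So even at 1,000 requests per hour, a single 20,000-page query would span about 20 hourly windows, before any repeat runs.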

I was going to try to make everything respect the Retry-After header when a 429 is hit, but it sounds like that will not be enough.

I guess we have to move the Pageviews API querying logic to the server. That means a crazy amount of work, or, as cheap as it is, we could simply let the server act as a proxy? So, we just wrap the same request that would be made clientside with our User-Agent and send it from WMCS instead of the client. Perhaps that's the easiest path forward? I'll play around with that, but it's quite difficult to implement on my local since the "server" is still the same IP :(

I could also make the server-side requests authenticated, using a bot account or something. Maybe I'll try that, at least for local testing.

And heck, for Massviews specifically, maybe it's not too much to ask for users to login? Then we can make authenticated requests clientside.

Can we roll back this 429 change? I know I only use it once a year or so, but I really need these reports.

… we could simply let the server act as a proxy? So, we just wrap the same request that would be made clientside with our User-Agent and send it from WMCS instead of the client.

That seemed to work great as a quick fix. Things are significantly slower: Before, we could query for ~10K pages in under a minute, now the same set of pages took roughly 20 minutes to process… but hey, it at least finished!

I will deploy this solution today, and continue to work on improving performance. Massviews could do the querying server-side en masse, instead of piecemeal for each page. That would speed things up just from avoiding the round-trip to the server for each request. This would however mean we lose the "progress bar", since the client won't be informed as each individual page is processed. That might be a deal-breaker from the UI perspective, but maybe there's a way to do both! Say, a polling endpoint to see how far along the server is, or even a data stream with the client using the Streams API. That would be quite nice as the post-processing could go ahead and start on the client, too.
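As one possible shape for the streaming idea, the sketch below reads a response body incrementally with the Streams API and hands each record to a callback as it arrives. It assumes, purely for illustration, that the server emits one JSON object per line (newline-delimited JSON); no such endpoint exists in this thread, and `createLineParser` and `readProgressStream` are hypothetical names.

```javascript
// Buffers text chunks and emits one parsed record per complete line.
function createLineParser(onRecord) {
  let buffer = '';
  return {
    push(chunk) {
      buffer += chunk;
      const lines = buffer.split('\n');
      buffer = lines.pop(); // keep the trailing partial line for later
      for (const line of lines) {
        if (line.trim() !== '') onRecord(JSON.parse(line));
      }
    },
    end() {
      if (buffer.trim() !== '') onRecord(JSON.parse(buffer));
      buffer = '';
    },
  };
}

// Reads a fetch() Response body chunk by chunk and feeds the parser.
async function readProgressStream(response, onRecord) {
  const parser = createLineParser(onRecord);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    parser.push(decoder.decode(value, { stream: true }));
  }
  parser.push(decoder.decode()); // flush any buffered bytes
  parser.end();
}
```

Each call to `onRecord` could both advance the progress bar and kick off client-side post-processing for that page.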

One thing at a time :) I will get Massviews back to a working state – however slow – sometime in the next 12 hours. The repo unfortunately is quite out of date so I might need to rebuild a new VM with updated PHP and whatnot.

And heck, for Massviews specifically, maybe it's not too much to ask for users to login? Then we can make authenticated requests clientside.

That would be the preferred solution, yes. It would fall into the same category as other high volume power-user tools (like e.g. GlobalWatchlist).

we could simply let the server act as a proxy? So, we just wrap the same request that would be made clientside with our User-Agent and send it from WMCS instead of the client.

That works until some scraper discovers that proxy and starts using it for their own purposes.

Thanks so much for the help!

And heck, for Massviews specifically, maybe it's not too much to ask for users to login? Then we can make authenticated requests clientside.

That would be the preferred solution, yes. It would fall into the same category as other high volume power-user tools (like e.g. GlobalWatchlist).

I'll look into this as a follow-up, and this solution can also be applied to all of the other mass-querying tools in the suite, like Langviews, Userviews, etc. There's a larger rewrite of sorts that should happen as part of this, so it will take me a bit to get it production-ready.

we could simply let the server act as a proxy? So, we just wrap the same request that would be made clientside with our User-Agent and send it from WMCS instead of the client.

That works until some scraper discovers that proxy and starts using it for their own purposes.

Indeed. For now I will add some rudimentary security measures to ensure we only process requests from the production tool.

MusikAnimal claimed this task.

For most users, things should be back to normal now. I've done as explained above and am using a cheap WMCS-hosted proxy for making the requests (with some security measures to ensure it isn't used outside the Pageviews tool).

Massviews is much slower, which I think will be vastly improved if we take the PHP proxy out of the equation. So I will be adding OAuth next, and then I think we can make signed requests clientside. I have filed T420295: Require login for mass-querying tools like Massviews for that.

Closing this as resolved!
