Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MBS-10232: Set global timeout to query remote servers #1102

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

yvanzo
Copy link
Contributor

@yvanzo yvanzo commented Jun 13, 2019

Addresses MBS-10232: Unresponsive search server freezes MBS

Wrap LWP::UserAgent used to query other servers (search, discourse, critiquebrainz, blog, and so on). Difference is the added global timeout related to the initial call to a request method rather than just connection inactivity time. Thus it should now timeout even when a server becomes unresponsive.

Add dependency to AnyEvent, AnyEvent::HTTP::LWP::UserAgent, EV and update cpanfile.snapshot.

@yvanzo yvanzo force-pushed the mbs-10232-timeout branch from aed194b to dacf988 Compare June 13, 2019 20:40
@yvanzo yvanzo requested a review from mwiencek June 13, 2019 20:41
@mwiencek
Copy link
Member

It sounds like this module won't work if you try to run MBS from behind a proxy. :\ Maybe it's a bit overbroad for this purpose anyway since it appears to be designed for protecting against attackers.

Would it be possible to just create MusicBrainz::LWP package that wraps get with an alarm as in

my $action = POSIX::SigAction->new(sub {
my $context = $c->model('MB')->context;
if (my $sth = $context->sql->sth) {
$sth->cancel;
}
MusicBrainz::Server::Exceptions::GenericTimeout->throw(
$c->req->method . ' ' . $c->req->uri .
" took more than $max_request_time seconds"
);
});
$action->safe(1);
POSIX::sigaction(SIGALRM, $action);
?

@yvanzo yvanzo force-pushed the mbs-10232-timeout branch 5 times, most recently from 6da854a to efd59b6 Compare June 14, 2019 18:15
@yvanzo
Copy link
Contributor Author

yvanzo commented Jun 14, 2019

@mwiencek, done, got send_request wrapped in MusicBrainz::LWP::TimelyUserAgent instead of get, as it would potentially harm requests going through a few redirects, whereas send_request is a primitive used by get and every other subroutines in LWP::UserAgent.

Copy link
Member

@mwiencek mwiencek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a search times out, it'd be nice to display a message saying it timed out rather than a "Server Error" page with the GenericTimeout->throw stack trace. Though perhaps this is just due to GenericTimeout not being caught (see below).


If both C<sending_timeout> and C<timeout> are positive integers,
when sending a request and receiving a response takes more than
C<sending_timeout> plus C<timeout> seconds, it forges a response
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought was it's not intuitive how to use these options, since e.g. (sending_timeout => 1, timeout => 3) and (sending_timeout => 3, timeout => 1) would both have the same global timeout, but different "no activity" timeouts. Since they're added, it's not clear what sending_timeout measures, i.e. what will happen after 1s in the first case or 3s in the second.

Might be less confusing to just have one option for the global timeout (perhaps repurposing timeout for it), since these may not work as one might expect. The LWP::UserAgent timeout doesn't seem to be something we particularly need to configure separately - it aborts the request "if no activity on the connection to the server is observed for timeout seconds," but we mainly just want to ensure the entire request returns within X seconds. So LWP::UserAgent timeout can just be set to the same value as the global timeout (or even timeout / 2 if we want to abort early).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to get around LWP::UserAgent’s limitation which may hang indefinitely before getting connected, that is exactly what happened when we experienced networking issues.

The sending_timeout just gives an extra second so as to allow to distinguish between timeout correctly handled by LWP::UserAgent and timeout saved by TimelyUserAgent. It doesn’t attempt to take processing time, redirect time, and so on, into account.

Turning it into a really global timeout (on returning the final response) would require:

  • either to wrap request subroutine which is recursive (not alarm friendly),
  • or to wrap get, post, and mirror subs and replace request calls with post calls in ::Controller::Discourse and ::Test::HTML5.

I’d still prefer the current solution, maybe just explain it better in POD.

Parameters specification will follow, once we agree on which issue we try to address.

P.S. Isn’t using alarm here going to interfere with alarm call in MusicBrainz::Server?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. Isn’t using alarm here going to interfere with alarm call in MusicBrainz::Server?

Yeah...good point. We should probably be using AnyEvent for this, though we'll have to pull it in as a dependency. I played around with it and it looks like there are workable solutions that way:

# cpanm AnyEvent AnyEvent::HTTP::LWP::UserAgent EV

BEGIN { $ENV{PERL_ANYEVENT_MODEL} = 'EV'; }

use AnyEvent;
use AnyEvent::HTTP::LWP::UserAgent;
use Data::Dumper;

my $ua = AnyEvent::HTTP::LWP::UserAgent->new;
my $cv = AE::cv;
$cv->begin;

my $response;
my $timeout = 0.5; # seconds
my $got_timeout = 0;

my $req_cv = $ua->get_async('https://musicbrainz.org/');

$req_cv->cb(sub {
    $response = shift->recv;
    $cv->end;
});

my $timer = AnyEvent->timer(after => $timeout, cb => sub {
    $req_cv->croak;
    $got_timeout = 1;
    $cv->croak;
});

$cv->recv;

undef $timer; # cancels the timer

if ($got_timeout) {
    print 'Got timeout';
} else {
    print Dumper($response);
}

This would probably have to be implemented as a utility function rather than an LWP subclass.

$response = $self->$orig(@args);
} catch {
my $err = $_;
if (ref($err) eq 'MusicBrainz::Server::Exceptions::GenericTimeout') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that will catch the GenericTimeout thrown above, since it's not in the same call stack. At least I'm not able to see it being caught when I put in some Time::HiRes::sleep calls to trigger the global timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It catches the GenericTimeout but there is a bug while forging/retrieving the response.
See commit 5d3dc58 and below log excerpt on GET /search?query=MBS-10232&type=release:

[debug] Sleeping until the alarm rings...
[debug] Throwing exception from signal
[debug] Catching error of type 'MusicBrainz::Server::Exceptions::GenericTimeout'...
[debug] Error message = 'GET http://search.musicbrainz.org/ws/2/release/?query=MBS-10232&offset=0&max=25&fmt=jsonnew&dismax=true&web=1 took more than 1 seconds (sending included)'
[debug] Forging HTTP::Response with code '503'...
[debug] Response code = 'HTTP::Request=HASH(0x5620f85638d0)'
[debug] Response message = '503'
[debug] Catching search error...
[debug] Search returned code = 'HTTP::Request=HASH(0x5620f85638d0)'
[debug] Search returned error = 'GET http://search.musicbrainz.org/ws/2/release/?query=MBS-10232&offset=0&max=25&fmt=jsonnew&dismax=true&web=1 took more than 1 seconds (sending included)'

Response code should be 503 and response message should be the same as error message.

What am I missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be due to https://github.com/metabrainz/musicbrainz-server/pull/1102/files/efd59b627bd8ed562d42d7d95663d39b8a2c3fcd#diff-045d15826466d1e497f53b359fbf1139R67

$self->SUPER::_new_response implicitly passes $self as the first argument, which _new_response doesn't expect. It can be replaced with LWP::UserAgent::_new_response(...).

Also looks like the GenericTimeout stack trace gets logged by plackup even though it's caught, which is odd and not ideal. :/

@yvanzo yvanzo force-pushed the mbs-10232-timeout branch 2 times, most recently from e819110 to 800d9e5 Compare September 12, 2019 13:34
@yvanzo yvanzo changed the title MBS-10232: Set timeout to connect to other servers MBS-10232: Set global timeout to query remote servers Sep 12, 2019
@yvanzo yvanzo force-pushed the mbs-10232-timeout branch 2 times, most recently from cef9b7e to 39b09d2 Compare September 12, 2019 21:24
Copy link
Member

@mwiencek mwiencek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not leaving any feedback here for a long time. I'm having trouble testing this at the moment: if I navigate to a page that makes a request, like the homepage or a search results listing, the AnyEvent timer doesn't fire at all on the first page load, but on subsequent reloads fires immediately with no delay. Should probably rebase this first so we can debug that more easily.

log_debug { $message };
$response = LWP::UserAgent::_new_response(
$request,
HTTP_SERVICE_UNAVAILABLE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

503 seems misleading since we don't actually know if the upstream server is overloaded or down for maintenance (could be any number of network issues en route). LWP::UserAgent handles this by just returning 500 with a Client-Warning header set to "Internal response". https://metacpan.org/pod/LWP::UserAgent#timeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ebd2a31.

$self->_lwp_user_agent(AnyEvent::HTTP::LWP::UserAgent->new(%$args));
}

sub _timely_call_method {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking but something like

Suggested change
sub _timely_call_method {
sub _call_lwp_with_timeout {

seems like a clearer name to me (timely implies something completes within a good time, but here it may not complete at all)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in f64a282.

Comment on lines +92 to +93
$uri //= $request->uri;
$request //= HTTP::Request->new($http_method, $uri);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm. Maybe we need die unless defined $uri || defined $request above 'cause these lines are mutually dependent on each other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be the same as die unless defined $arg which I don’t think is needed, mainly because this is checked by called subroutines already.

@@ -9,7 +9,7 @@ our @EXPORT_OK = qw(
sub get_chunked_with_retry {
my ($ua, $url) = @_;
my $response;
my $retries_remaining = int(25.0 / $ua->timeout);
my $retries_remaining = int(25.0 / $ua->inactivity_timeout);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it perform no search, since the default LWP timeout is 180 seconds and int(25.0 / 180) is 0.

I don't understand what the previous code is calculating, anyway. A fixed number of retries is probably fine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

25 seconds was set in 7c22997 as an arbitrary period of time. It is no longer needed with the new global timeout.

Retry at most 5 times instead with 61fd923.

Comment on lines +81 to +85
my $timer = AnyEvent->timer(after => $global_timeout, cb => sub {
$req_cv->croak;
$got_timeout = 1;
$cv->croak;
}) unless $global_timeout == 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason this is firing immediately for me. :\ I think it was working in the original script I suggested. Maybe the PR needs to be rebased first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just rebased on master (36fdb6e) but did not test it again yet.

@yvanzo yvanzo force-pushed the mbs-10232-timeout branch 2 times, most recently from 61fd923 to 7ef1cf0 Compare June 19, 2020 19:07
yvanzo added 2 commits June 20, 2020 13:17
Wraps LWP::UserAgent to abort waiting for a response if it timeouts
`global_timeout` seconds after initial call to a request method,
then to return a forged response with code 500 with custom message and
"Client-Warning" header set to "Internal response" (by LWP).

It is possible to set 'LWP::UserAgent::timeout' through
self-explanatory 'MusicBrainz::LWP::inactivity_timeout'.

It additionally autosets MusicBrainz Server User-Agent header.

This change affects requests to most external services (e.g. search,
mediawiki, discourse, critiquebrainz, blog, amazon, and so on).

Also add dependencies AnyEvent, AnyEvent::HTTP::LWP::UserAgent, EV,
and update cpanfile.snapshot.
25 was an arbritrary value. If timeout is not set, it defaults to 180,
then the GET request is not tried at all.

This patch sets a fixed number of 5 tries instead.
@yvanzo yvanzo force-pushed the mbs-10232-timeout branch from f80a38d to 36a6937 Compare June 20, 2020 11:18
@yvanzo
Copy link
Contributor Author

yvanzo commented Jun 20, 2020

The website seems to work fine locally, but replication. As shown by tests on CircleCI (fixed by building and uploading a new musicbrainz-tests image with added Perl modules), MusicBrainz::LWP is still missing proper implementation of mirror as AnyEvent::HTTP::LWP::UserAgent has no asynchronous equivalent to LWP::UserAgent’s mirror subroutine.

@mwiencek, if I continue in this direction, I will just reimplement mirror in an asynchronous fashion. What seems the most reasonable to you:

  1. Implement our own mirror_async.
  2. Give up calling mirror for replication in favor of get only.
  3. Get back to efd59b6 that simply wraps send_request and find another alternative to alarm.

@yvanzo yvanzo requested a review from mwiencek October 13, 2020 15:53
@mwiencek
Copy link
Member

I'm having the same problem as before: the timeout works initially after plackup first starts, but subsequent requests log [debug] get $url took more than 5 seconds very quickly, or much less than 5 seconds -- at least it's clear it doesn't wait 5 seconds.

How are you testing this? I wrote a small http proxy that responds after 30 seconds to simulate an unresponsive server. Tried increasing global_timeout to 10 or more, and it gives the same issue.

Get back to efd59b6 that simply wraps send_request and find another alternative to alarm.

If we could get the AnyEvent implementation working, that would be ideal. :) I'm just not sure there's any alternative to alarm that doesn't involve forking another process and using IPC. Though I'm also very much in favor of implementing the search pages in Node, where this sort of thing (asynchronous requests, global request timeouts) is completely trivial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants