MBS-10232: Set global timeout to query remote servers #1102
aed194b
to
dacf988
Compare
It sounds like this module won't work if you try to run MBS from behind a proxy. :\ Maybe it's a bit overbroad for this purpose anyway since it appears to be designed for protecting against attackers. Would it be possible to just create musicbrainz-server/lib/MusicBrainz/Server.pm Lines 402 to 413 in edb0f09
6da854a
to
efd59b6
Compare
@mwiencek, done, got |
If a search times out, it'd be nice to display a message saying it timed out rather than a "Server Error" page with the `GenericTimeout->throw` stack trace. Though perhaps this is just due to `GenericTimeout` not being caught (see below).
If both C<sending_timeout> and C<timeout> are positive integers,
when sending a request and receiving a response takes more than
C<sending_timeout> plus C<timeout> seconds, it forges a response
My first thought was it's not intuitive how to use these options, since e.g. `(sending_timeout => 1, timeout => 3)` and `(sending_timeout => 3, timeout => 1)` would both have the same global timeout, but different "no activity" timeouts. Since they're added, it's not clear what `sending_timeout` measures, i.e. what will happen after 1s in the first case or 3s in the second.
Might be less confusing to just have one option for the global timeout (perhaps repurposing `timeout` for it), since these may not work as one might expect. The `LWP::UserAgent` `timeout` doesn't seem to be something we particularly need to configure separately - it aborts the request "if no activity on the connection to the server is observed for timeout seconds," but we mainly just want to ensure the entire request returns within X seconds. So `LWP::UserAgent` `timeout` can just be set to the same value as the global `timeout` (or even `timeout / 2` if we want to abort early).
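To make the ambiguity concrete, here is a minimal sketch, assuming the overall deadline is simply the sum of the two options as the POD excerpt describes. The `global_timeout` helper and the option hashes are hypothetical names for illustration, not part of the patch.

```perl
use strict;
use warnings;

# Two hypothetical option sets, taken from the comment above.
my @configs = (
    { sending_timeout => 1, timeout => 3 },
    { sending_timeout => 3, timeout => 1 },
);

# Hypothetical helper: the overall deadline is the sum of both options.
sub global_timeout {
    my ($opts) = @_;
    return $opts->{sending_timeout} + $opts->{timeout};
}

for my $opts (@configs) {
    printf "sending_timeout=%d timeout=%d -> global=%d, inactivity=%d\n",
        $opts->{sending_timeout}, $opts->{timeout},
        global_timeout($opts), $opts->{timeout};
}
```

Both configurations produce the same overall deadline while their inactivity timeouts differ, which is the source of confusion raised in the review.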
The goal is to get around a limitation of `LWP::UserAgent`: it may hang indefinitely before getting connected, which is exactly what happened when we experienced networking issues.
The `sending_timeout` just gives an extra second, so as to distinguish between a timeout correctly handled by `LWP::UserAgent` and a timeout saved by `TimelyUserAgent`. It doesn't attempt to take processing time, redirect time, and so on, into account.
Turning it into a really global timeout (on returning the final response) would require:
- either to wrap the `request` subroutine, which is recursive (not alarm friendly),
- or to wrap the `get`, `post`, and `mirror` subs and replace `request` calls with `post` calls in `::Controller::Discourse` and `::Test::HTML5`.
I’d still prefer the current solution, maybe just explain it better in POD.
Parameters specification will follow, once we agree on which issue we try to address.
P.S. Isn’t using `alarm` here going to interfere with the `alarm` call in `MusicBrainz::Server`?
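The concern about nested `alarm` calls can be illustrated with core Perl alone. This is a minimal sketch, not the server's code: `run_with_inner_alarm` is a hypothetical helper standing in for a user agent that arms its own alarm while an outer, request-level alarm is already pending.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code under its own alarm-based timeout.
# It arms a new alarm and returns how many seconds were left on the
# alarm that was already pending (the one it silently replaced).
sub run_with_inner_alarm {
    my ($seconds, $code) = @_;
    my $outer_remaining = alarm($seconds);  # replaces any pending alarm
    $code->();
    alarm(0);                               # cancels the inner alarm only
    return $outer_remaining;
}

local $SIG{ALRM} = sub { die "timed out\n" };

alarm(10);                                  # outer, request-level timeout
my $displaced     = run_with_inner_alarm(2, sub { 1 });
my $still_pending = alarm(0);               # 0 means no alarm was pending

print "outer alarm had ${displaced}s left when it was replaced\n";
print "seconds left on outer alarm afterwards: $still_pending\n";
```

A process has only one alarm timer, so the inner call displaces the outer one. Unless the inner code re-arms the outer alarm with the remaining time, the outer timeout never fires.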
> P.S. Isn’t using alarm here going to interfere with alarm call in MusicBrainz::Server?

Yeah...good point. We should probably be using `AnyEvent` for this, though we'll have to pull it in as a dependency. I played around with it and it looks like there are workable solutions that way:
# cpanm AnyEvent AnyEvent::HTTP::LWP::UserAgent EV
BEGIN { $ENV{PERL_ANYEVENT_MODEL} = 'EV'; }
use AnyEvent;
use AnyEvent::HTTP::LWP::UserAgent;
use Data::Dumper;

my $ua = AnyEvent::HTTP::LWP::UserAgent->new;
my $cv = AE::cv;
$cv->begin;

my $response;
my $timeout = 0.5; # seconds
my $got_timeout = 0;

my $req_cv = $ua->get_async('https://musicbrainz.org/');
$req_cv->cb(sub {
    $response = shift->recv;
    $cv->end;
});

my $timer = AnyEvent->timer(after => $timeout, cb => sub {
    $req_cv->croak;
    $got_timeout = 1;
    $cv->croak;
});

$cv->recv;
undef $timer; # cancels the timer

if ($got_timeout) {
    print 'Got timeout';
} else {
    print Dumper($response);
}
This would probably have to be implemented as a utility function rather than an LWP subclass.
    $response = $self->$orig(@args);
} catch {
    my $err = $_;
    if (ref($err) eq 'MusicBrainz::Server::Exceptions::GenericTimeout') {
I don't think that will catch the `GenericTimeout` thrown above, since it's not in the same call stack. At least I'm not able to see it being caught when I put in some `Time::HiRes::sleep` calls to trigger the global timeout.
It catches the `GenericTimeout` but there is a bug while forging/retrieving the response.
See commit 5d3dc58 and the log excerpt below on `GET /search?query=MBS-10232&type=release`:
[debug] Sleeping until the alarm rings...
[debug] Throwing exception from signal
[debug] Catching error of type 'MusicBrainz::Server::Exceptions::GenericTimeout'...
[debug] Error message = 'GET http://search.musicbrainz.org/ws/2/release/?query=MBS-10232&offset=0&max=25&fmt=jsonnew&dismax=true&web=1 took more than 1 seconds (sending included)'
[debug] Forging HTTP::Response with code '503'...
[debug] Response code = 'HTTP::Request=HASH(0x5620f85638d0)'
[debug] Response message = '503'
[debug] Catching search error...
[debug] Search returned code = 'HTTP::Request=HASH(0x5620f85638d0)'
[debug] Search returned error = 'GET http://search.musicbrainz.org/ws/2/release/?query=MBS-10232&offset=0&max=25&fmt=jsonnew&dismax=true&web=1 took more than 1 seconds (sending included)'
Response code should be 503 and response message should be the same as error message.
What am I missing?
Seems to be due to https://github.com/metabrainz/musicbrainz-server/pull/1102/files/efd59b627bd8ed562d42d7d95663d39b8a2c3fcd#diff-045d15826466d1e497f53b359fbf1139R67
`$self->SUPER::_new_response` implicitly passes `$self` as the first argument, which `_new_response` doesn't expect. It can be replaced with `LWP::UserAgent::_new_response(...)`.
Also looks like the `GenericTimeout` stack trace gets logged by plackup even though it's caught, which is odd and not ideal. :/
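The argument shift described above can be reproduced without LWP. The sketch below uses hypothetical packages (`Hypothetical::Base`, `Hypothetical::Child`) whose `_new_response` is a plain function taking `($request, $code, $message)`. It only mirrors the calling convention and is not the library's implementation.

```perl
use strict;
use warnings;

package Hypothetical::Base;
# Written as a plain function: it does not expect an invocant
# as its first argument.
sub _new_response {
    my ($request, $code, $message) = @_;
    return { request => $request, code => $code, message => $message };
}

package Hypothetical::Child;
our @ISA = ('Hypothetical::Base');
sub new { return bless {}, shift }

sub forge_via_super {
    my ($self, $request, $code, $message) = @_;
    # Method-call syntax prepends $self, shifting every argument by one.
    return $self->SUPER::_new_response($request, $code, $message);
}

sub forge_via_function {
    my ($self, $request, $code, $message) = @_;
    # Fully qualified function call: arguments arrive as written.
    return Hypothetical::Base::_new_response($request, $code, $message);
}

package main;

my $ua      = Hypothetical::Child->new;
my $shifted = $ua->forge_via_super('REQ', 503, 'took too long');
my $correct = $ua->forge_via_function('REQ', 503, 'took too long');

print "via SUPER:    code=$shifted->{code} message=$shifted->{message}\n";
print "via function: code=$correct->{code} message=$correct->{message}\n";
```

With the method-call form, the request ends up in the code slot and `503` in the message slot, which matches the pattern in the debug log (code is an `HTTP::Request` object, message is `'503'`).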
e819110
to
800d9e5
Compare
cef9b7e
to
39b09d2
Compare
Sorry for not leaving any feedback here for a long time. I'm having trouble testing this at the moment: if I navigate to a page that makes a request, like the homepage or a search results listing, the AnyEvent timer doesn't fire at all on the first page load, but on subsequent reloads fires immediately with no delay. Should probably rebase this first so we can debug that more easily.
lib/MusicBrainz/LWP.pm
Outdated
log_debug { $message };
$response = LWP::UserAgent::_new_response(
    $request,
    HTTP_SERVICE_UNAVAILABLE,
503 seems misleading since we don't actually know if the upstream server is overloaded or down for maintenance (could be any number of network issues en route). `LWP::UserAgent` handles this by just returning 500 with a `Client-Warning` header set to "Internal response". https://metacpan.org/pod/LWP::UserAgent#timeout
Fixed in ebd2a31.
lib/MusicBrainz/LWP.pm
Outdated
    $self->_lwp_user_agent(AnyEvent::HTTP::LWP::UserAgent->new(%$args));
}

sub _timely_call_method {
Nitpicking but something like
- sub _timely_call_method {
+ sub _call_lwp_with_timeout {
seems like a clearer name to me (timely implies something completes within a good time, but here it may not complete at all)
Renamed in f64a282.
$uri //= $request->uri;
$request //= HTTP::Request->new($http_method, $uri);
Hrm. Maybe we need `die unless defined $uri || defined $request` above 'cause these lines are mutually dependent on each other.
That would be the same as `die unless defined $arg`, which I don’t think is needed, mainly because this is checked by called subroutines already.
@@ -9,7 +9,7 @@ our @EXPORT_OK = qw(
 sub get_chunked_with_retry {
     my ($ua, $url) = @_;
     my $response;
-    my $retries_remaining = int(25.0 / $ua->timeout);
+    my $retries_remaining = int(25.0 / $ua->inactivity_timeout);
This makes it perform no search, since the default LWP timeout is 180 seconds and `int(25.0 / 180)` is 0.
I don't understand what the previous code is calculating, anyway. A fixed number of retries is probably fine?
my $timer = AnyEvent->timer(after => $global_timeout, cb => sub {
    $req_cv->croak;
    $got_timeout = 1;
    $cv->croak;
}) unless $global_timeout == 0;
For some reason this is firing immediately for me. :\ I think it was working in the original script I suggested. Maybe the PR needs to be rebased first.
I just rebased on `master` (36fdb6e) but did not test it again yet.
61fd923
to
7ef1cf0
Compare
Wraps LWP::UserAgent to abort waiting for a response if it times out `global_timeout` seconds after the initial call to a request method, and then returns a forged response with code 500, a custom message, and the "Client-Warning" header set to "Internal response" (as LWP does). 'LWP::UserAgent::timeout' can be set through the self-explanatory 'MusicBrainz::LWP::inactivity_timeout'. It also automatically sets the MusicBrainz Server User-Agent header. This change affects requests to most external services (e.g. search, mediawiki, discourse, critiquebrainz, blog, amazon, and so on). Also adds dependencies AnyEvent, AnyEvent::HTTP::LWP::UserAgent, and EV, and updates cpanfile.snapshot.
25 was an arbitrary value. If the timeout is not set, it defaults to 180, so the GET request is not tried at all. This patch sets a fixed number of 5 tries instead.
f80a38d
to
36a6937
Compare
Reapply 3f84e91 to downgrade Template::Toolkit from 3.008 to 3.007.
The website seems to work fine locally, but replication. As shown by tests on CircleCI (fixed by building and uploading a new @mwiencek, if I continue in this direction, I will just reimplement
I'm having the same problem as before: the timeout works initially after plackup first starts, but subsequent requests log How are you testing this? I wrote a small http proxy that responds after 30 seconds to simulate an unresponsive server. Tried increasing
If we could get the AnyEvent implementation working, that would be ideal. :) I'm just not sure there's any alternative to |
Addresses MBS-10232: Unresponsive search server freezes MBS
Wrap `LWP::UserAgent` used to query other servers (search, discourse, critiquebrainz, blog, and so on). The difference is the added global timeout, which is measured from the initial call to a request method rather than just connection inactivity time. Thus it should now time out even when a server becomes unresponsive.
Add dependency to `AnyEvent`, `AnyEvent::HTTP::LWP::UserAgent`, `EV` and update `cpanfile.snapshot`.