Create helper scripts to help analyze what happened with cluster #228

mattsfuller · 2016-09-01T20:41:33Z

We need to create some scripts (maybe as part of presto-admin) that help identify issues with a Presto cluster.

Ideally we should be able to detect:

long GC pauses, based on the GC log if GC logging is enabled
JVM crashes

It would create a timeline of the events that happened in a given time period:
{code}
presto-admin show-events 24h
2015-01-01 00:00:000 Node 10.10.0.1 started
2015-01-01 01:00:000 Node 10.10.0.2 crashed (Out of memory error)
2015-01-01 02:00:000 Node 10.10.0.3 long STW GC pause (22.003 seconds)
{code}

We should be able to do this based on the GC and launcher logs.
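As a rough sketch of the GC part, something like the following could flag long pauses. It assumes the GC log contains JDK 8 style "Total time for which application threads were stopped" lines with date stamps (as produced by -XX:+PrintGCDateStamps together with -XX:+PrintGCApplicationStoppedTime); the function name `find_long_pauses` and the 10 second default threshold are made up for illustration and are not part of presto-admin.

```python
import re
from datetime import datetime

# Assumed line shape (JDK 8 with -XX:+PrintGCDateStamps and
# -XX:+PrintGCApplicationStoppedTime). Other JVMs or flags need another pattern.
STOPPED_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+[+-]\d{4}: .*"
    r"Total time for which application threads were stopped: "
    r"(?P<secs>\d+\.\d+) seconds"
)


def find_long_pauses(lines, node, threshold_secs=10.0):
    """Return (timestamp, message) tuples for pauses of at least threshold_secs.

    Hypothetical helper: `node` is only used to label the event.
    """
    events = []
    for line in lines:
        match = STOPPED_RE.match(line)
        if match is None:
            continue  # not a stopped-time line
        secs = float(match.group("secs"))
        if secs >= threshold_secs:
            when = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
            message = "Node %s long STW GC pause (%.3f seconds)" % (node, secs)
            events.append((when, message))
    return events
```

JVM crashes would need a similar scan over the launcher log, but the exact text to match depends on how the crash is reported there, so that part is left out.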

mattsfuller · 2016-09-01T20:41:37Z

This is an extension to the existing collect logs presto-admin command. Basically, it would look through the logs (and maybe also JMX stats) to produce a timeline of what's happening on the cluster.
This seems to me like something that would be a fun hackathon project, but not something that's essential to work on right now.
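
For the timeline itself, a minimal sketch could merge the per-node events and keep only those inside the requested window. Everything here is hypothetical: `parse_window`, `build_timeline`, and the `24h`/`30m`/`7d` window syntax are assumptions for illustration, and events are assumed to be (timestamp, message) tuples already extracted from each node's logs.

```python
from datetime import datetime, timedelta


def parse_window(spec):
    """Turn a spec such as '24h', '30m' or '7d' into a timedelta (assumed syntax)."""
    units = {"m": "minutes", "h": "hours", "d": "days"}
    return timedelta(**{units[spec[-1]]: int(spec[:-1])})


def build_timeline(events_per_node, window, now):
    """Merge per-node (timestamp, message) lists into one time-ordered listing."""
    cutoff = now - parse_window(window)
    merged = [
        event
        for events in events_per_node
        for event in events
        if cutoff <= event[0] <= now  # drop anything outside the window
    ]
    merged.sort(key=lambda event: event[0])
    return [
        "%s %s" % (when.strftime("%Y-%m-%d %H:%M:%S"), message)
        for when, message in merged
    ]
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside, keeps the function easy to test against collected logs from a past incident.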
