Junos traceroutes occasionally fail #179
Netmiko is a poor driver for long-running Junos commands. You can alter netmiko_delay_factor to make it work, but this is a global variable and will slow down every other command. We run solely Juniper in our network, so I wrote a PyEZ driver that uses RPCs instead of screen scraping. It works with all commands (including traceroute). I host it privately, but if you're interested, I can fork this repo and push it there.
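For reference, the RPC approach avoids the prompt-scraping problem entirely: the device returns structured XML, so there is no CLI output to time out on. The following is a hypothetical sketch of such a helper, not the actual private driver; it assumes the junos-eznc (PyEZ) package is installed and that the device accepts NETCONF sessions for the given user.

```python
def run_traceroute(host, target, user, passwd, timeout=300):
    """Run a traceroute on a Junos device via RPC instead of CLI scraping.

    Hypothetical sketch, not the driver referenced above. Assumes the
    junos-eznc (PyEZ) package and NETCONF access to the device.
    """
    # Imports are deferred so this sketch stays self-contained as a module.
    from jnpr.junos import Device
    from lxml import etree

    dev = Device(host=host, user=user, passwd=passwd)
    dev.open()
    dev.timeout = timeout  # per-RPC timeout; long traceroutes need headroom
    # PyEZ maps keyword args to RPC elements: no_resolve=True -> <no-resolve/>
    reply = dev.rpc.traceroute(host=target, no_resolve=True)
    dev.close()
    return etree.tostring(reply, pretty_print=True).decode()
```

Because the reply is XML, parsing per-hop results is a matter of walking the tree rather than regex-matching terminal output.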
@DavidReubenWhite I'd really appreciate that!
@ggidofalvy-tc Give this a try (add_pyez_driver branch). Note: I only use VRFs in our network, so I haven't tested the commands that don't use VRFs. They should work, though. Also, we don't use the communities command, so that's untested.
For what it's worth, I've added the capability to add custom Netmiko arguments on a per-device basis in the v2.0.0 branch, via b49b6ea. I don't particularly want to add another driver to hyperglass, especially one that only covers one vendor; at least, not without adding extras to the hyperglass package. In addition to this feature, I'm also happy to add any Juniper-specific Netmiko arguments as defaults (applied to every Juniper SSH session), if anyone has a magic recipe.
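As a sketch of that feature, a per-device Netmiko override in devices.yaml might look like the following. The driver_config field name is taken from later comments in this thread; the exact schema should be checked against the v2 docs, and all host values here are placeholders:

```yaml
devices:
  - name: edge-router-1        # placeholder device name
    address: 192.0.2.1         # placeholder management address
    platform: juniper
    driver: netmiko
    driver_config:             # per-device arguments passed through to Netmiko
      read_timeout: 30         # give slow traceroutes more headroom
```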
Sorry to issue-jack -- happened to stumble onto this:
And did a bit of poking around and saw a prior issue or two that I assume are related. I am 99% sure that the current develop branch of scrapli solves the scrapli+Junos issues that I saw. TL;DR: the scrapli prompt finding was breaking with Junos because XML sure can look a lot like an exec/normal prompt. Fixed here with a test so this doesn't regress 😁. Feel free to ignore, just saw it and figured I'd comment! Carl
Can anyone verify if this is still an issue on v2 and provide details if so?
Ping and traceroute do not work, failing with the following error; BGP route works.
hyperglass-1 | 2024-06-15 13:32:36.037 | DEBUG | hyperglass.models.directive:membership:111 - Checking target membership
@thatmattlove I can confirm that this seems to still be an issue as of v2.0.4:
[20240701 16:16:46] Validation passed {'query': Query(query_location='DEVICE', query_target=['8.8.8.8'], query_type='juniper-traceroute-custom')}
I've created a custom directive to test wait times, and noticed that adding no-resolve helps more traceroutes complete, likely due to the decreased response time on the Juniper. I tested different wait times, but didn't really see any improvements from changing them.
Testing this takes quite a bit of time, since hyperglass takes about ~2 minutes to start up on Docker, so I created a simple Python script using a Netmiko SSH connection to compare, and found similar issues (i.e. similar-length traceroutes timing out) when I set the read_timeout to anything less than 5, which is odd considering that Netmiko seems to have a default timeout of 10.
I attempted to use driver_config in the device YAML settings to add a read_timeout of 10, but got the error: juniper-traceroute-custom:
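The standalone comparison script described above might look something like this hypothetical reconstruction, which assumes the netmiko package (4.x, where send_command accepts read_timeout) and placeholder credentials:

```python
def compare_read_timeout(host, user, password, target, read_timeout):
    """Run a Junos traceroute over SSH with a given Netmiko read_timeout.

    Hypothetical reconstruction of the test script described above,
    assuming the netmiko package; host/credentials are placeholders.
    """
    # Deferred import keeps the sketch self-contained as a module.
    from netmiko import ConnectHandler

    conn = ConnectHandler(
        device_type="juniper_junos",
        host=host,
        username=user,
        password=password,
    )
    # In netmiko 4.x, read_timeout bounds how long send_command waits for
    # the prompt to return before raising a ReadTimeout exception.
    output = conn.send_command(f"traceroute {target} no-resolve",
                               read_timeout=read_timeout)
    conn.disconnect()
    return output
```

Varying read_timeout in a loop against the same long traceroute makes it easy to find the threshold at which the session gives up before the router finishes printing output.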
Another quick update on this: I tested setting request_timeout to 1200, and it still fails on the same traceroutes. I am currently testing on the longest traceroute I can find (~27 hops), though I have seen it fail on as few as 8 hops. It takes the exact same time to fail with a high request_timeout vs. the default 90, with the time between the connection being initiated and the Netmiko error being pretty much exactly 12 seconds each time.
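The constant ~12-second failure regardless of request_timeout is consistent with the SSH read loop giving up while the router is still probing unresponsive hops, since traceroute's worst-case runtime grows linearly with hop count. A rough back-of-the-envelope, assuming 3 probes per hop and a 5-second wait per probe (typical defaults; adjust for your platform):

```python
def worst_case_traceroute_seconds(hops, probes_per_hop=3, wait_per_probe=5):
    """Upper bound on traceroute runtime if every probe times out.

    Assumed defaults: 3 probes per hop, 5-second wait per probe.
    These are illustrative, not measured from the devices in this thread.
    """
    return hops * probes_per_hop * wait_per_probe

# A 27-hop path full of silent hops can take minutes, so any read
# timeout in the tens of seconds fires long before output completes.
print(worst_case_traceroute_seconds(27))  # 405
print(worst_case_traceroute_seconds(8))   # 120
```

This is why raising the application-level request_timeout alone cannot help: the lower-level read timeout is hit first.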
Hi, I am seeing the same issue on IOS XR as well as Juniper. I attempted to adjust the read_timeout in the driver .py file, without much luck.
FYI, I figured out that read_timeout does work fine, but it has to be edited inside the Docker container. You can refer to my issue.
Bug Description
Hey folks,
I've noticed this issue on our LG deployment: traceroutes fail quite often, with the nondescript 'Something went wrong.' error message appearing ca. 10-15 seconds after submitting the traceroute request. There's not really much in the logs, even with debugging turned on; please see below.
I made sure that this is not a connectivity/permissions issue: I tried connecting to the same router, as the hyperglass user, with the username and password copied directly out of the configuration file on the same host, and was successfully able to run the command constructed by hyperglass.
I've found that this behaviour occurs when trying to traceroute to destinations that might take a long time to run (read: lots of timeouts in the intermediate hops). I already tried increasing the request_timeout to 120, but it seems like the query fails much, much earlier than even the default 90s request timeout.
Expected behavior
The result of the traceroute is returned.
Steps to Reproduce
Local Configurations
I added request_timeout: 120 to hyperglass.yaml as an attempt to fix the problem. Additionally, in my devices.yaml, all Junos devices have netmiko set as their driver instead of scrapli, due to other issues impacting the scrapli driver. In an attempt to track down this issue, debug: true has also been added to hyperglass.yaml.
Logs
Possible Solution
Environment
Server
- hyperglass path: /opt/hyperglass/hyperglass
- Python version: 3.7.3
- Node.js version: 14.18.2
- Platform: Linux-5.6.0-0.bpo.2-cloud-amd64-x86_64-with-debian-10.11
- CPU: 2 cores, 2 threads, 1.999999GHz
Client