Junos traceroutes occasionally fail #179
Netmiko is a poor driver for long-running Junos commands. You can alter netmiko_delay_factor to make it work, but this is a global variable and will slow down every other command. We run solely Juniper in our network, so I wrote a PyEZ driver that uses RPCs instead of screen scraping. It works with all commands (including traceroute). I host it privately, but if you're interested, I can fork this repo and push it there.
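For reference, the RPC approach avoids the prompt-scraping problem entirely: the device returns structured XML, so there is no CLI output to time out on. The following is a hypothetical sketch of such a helper, not the actual private driver; it assumes the junos-eznc (PyEZ) package is installed and that the device accepts NETCONF sessions for the given user.

```python
def run_traceroute(host, target, user, passwd, timeout=300):
    """Run a traceroute on a Junos device via RPC instead of CLI scraping.

    Hypothetical sketch, not the driver referenced above. Assumes the
    junos-eznc (PyEZ) package and NETCONF access to the device.
    """
    # Imports are deferred so this sketch stays self-contained as a module.
    from jnpr.junos import Device
    from lxml import etree

    dev = Device(host=host, user=user, passwd=passwd)
    dev.open()
    dev.timeout = timeout  # per-RPC timeout; long traceroutes need headroom
    # PyEZ maps keyword args to RPC elements: no_resolve=True -> <no-resolve/>
    reply = dev.rpc.traceroute(host=target, no_resolve=True)
    dev.close()
    return etree.tostring(reply, pretty_print=True).decode()
```

Because the reply is XML, parsing per-hop results is a matter of walking the tree rather than regex-matching terminal output.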
@DavidReubenWhite I'd really appreciate that!
@ggidofalvy-tc Give this a try (add_pyez_driver branch). Note: I only use VRFs in our network, so I haven't tested the commands that don't use VRFs. They should work, though. Also, we don't use the communities command, so that's untested.
For what it's worth, I've added the capability to add custom Netmiko arguments on a per-device basis in the v2.0.0 branch, via b49b6ea. I don't particularly want to add another driver to hyperglass, especially one that only covers one vendor; at least, not without adding extras to the hyperglass package. In addition to this feature, I'm also happy to add any Juniper-specific Netmiko arguments as defaults (applied to every Juniper SSH session), if anyone has a magic recipe.
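As a sketch of that feature, a per-device Netmiko override in devices.yaml might look like the following. The driver_config field name is taken from later comments in this thread; the exact schema should be checked against the v2 docs, and all host values here are placeholders:

```yaml
devices:
  - name: edge-router-1        # placeholder device name
    address: 192.0.2.1         # placeholder management address
    platform: juniper
    driver: netmiko
    driver_config:             # per-device arguments passed through to Netmiko
      read_timeout: 30         # give slow traceroutes more headroom
```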
Sorry to issue-jack -- happened to stumble onto this:
And did a bit of poking around and saw a prior issue or two that I assume are related. I am 99% sure that the current develop branch of scrapli solves the scrapli+Junos issues that I saw. TL;DR: the scrapli prompt finding was breaking with Junos because XML sure can look a lot like an exec/normal prompt. Fixed here with a test so this doesn't regress 😁. Feel free to ignore, just saw it and figured I'd comment! Carl
Can anyone verify if this is still an issue on v2 and provide details if so?
Ping and traceroute do not work, failing with the following error; BGP route works.
hyperglass-1 | 2024-06-15 13:32:36.037 | DEBUG | hyperglass.models.directive:membership:111 - Checking target membership
@thatmattlove I can confirm that this seems to still be an issue as of v2.0.4:
[20240701 16:16:46] Validation passed {'query': Query(query_location='DEVICE', query_target=['8.8.8.8'], query_type='juniper-traceroute-custom')}
I've created a custom directive to test wait times, and noticed that adding no-resolve helps more traceroutes complete, likely due to the decreased response time on the Juniper. I tested different wait times, but didn't really see any improvements from changing them.
Testing this takes quite a bit of time, since hyperglass takes about ~2 minutes to start up on Docker, so I created a simple Python script using a Netmiko SSH connection to compare, and found similar issues (i.e. similar-length traceroutes timing out) when I set the read_timeout to anything less than 5, which is odd considering that Netmiko seems to have a default timeout of 10.
I attempted to use driver_config in the device YAML settings to add a read_timeout of 10, but got the error: juniper-traceroute-custom:
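The standalone comparison script described above might look something like this hypothetical reconstruction, which assumes the netmiko package (4.x, where send_command accepts read_timeout) and placeholder credentials:

```python
def compare_read_timeout(host, user, password, target, read_timeout):
    """Run a Junos traceroute over SSH with a given Netmiko read_timeout.

    Hypothetical reconstruction of the test script described above,
    assuming the netmiko package; host/credentials are placeholders.
    """
    # Deferred import keeps the sketch self-contained as a module.
    from netmiko import ConnectHandler

    conn = ConnectHandler(
        device_type="juniper_junos",
        host=host,
        username=user,
        password=password,
    )
    # In netmiko 4.x, read_timeout bounds how long send_command waits for
    # the prompt to return before raising a ReadTimeout exception.
    output = conn.send_command(f"traceroute {target} no-resolve",
                               read_timeout=read_timeout)
    conn.disconnect()
    return output
```

Varying read_timeout in a loop against the same long traceroute makes it easy to find the threshold at which the session gives up before the router finishes printing output.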
Another quick update on this: I tested setting request_timeout to 1200, and it still fails on the same traceroutes. I am currently testing on the longest traceroute I can find (~27 hops), though I have seen it fail on as few as 8 hops. It takes the exact same time to fail with a high request_timeout vs. the default 90, with the time between the connection being initiated and the Netmiko error being pretty much exactly 12 seconds each time.
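The constant ~12-second failure regardless of request_timeout is consistent with the SSH read loop giving up while the router is still probing unresponsive hops, since traceroute's worst-case runtime grows linearly with hop count. A rough back-of-the-envelope, assuming 3 probes per hop and a 5-second wait per probe (typical defaults; adjust for your platform):

```python
def worst_case_traceroute_seconds(hops, probes_per_hop=3, wait_per_probe=5):
    """Upper bound on traceroute runtime if every probe times out.

    Assumed defaults: 3 probes per hop, 5-second wait per probe.
    These are illustrative, not measured from the devices in this thread.
    """
    return hops * probes_per_hop * wait_per_probe

# A 27-hop path full of silent hops can take minutes, so any read
# timeout in the tens of seconds fires long before output completes.
print(worst_case_traceroute_seconds(27))  # 405
print(worst_case_traceroute_seconds(8))   # 120
```

This is why raising the application-level request_timeout alone cannot help: the lower-level read timeout is hit first.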
Hi, I am seeing the same issue on IOS XR as well as Juniper. I attempted to adjust the read_timeout in the driver .py file, without much luck.
FYI, I figured out that read_timeout does work fine, but it has to be edited inside the Docker container. You can refer to my issue.
Bug Description
Hey folks,
I've noticed this issue on our LG deployment: traceroutes fail quite often, with the nondescript 'Something went wrong.' error message appearing ca. 10-15 seconds after submitting the traceroute request. There's not really much in the logs, even with debugging turned on; please see below.
I made sure that this is not a connectivity/permissions issue: I tried connecting to the same router, as the hyperglass user, with the username and password copied directly out of the configuration file on the same host, and was successfully able to run the command constructed by hyperglass.
I've found that this behaviour occurs when trying to traceroute to destinations that might take a long time to run (read: lots of timeouts in the intermediate hops). I already tried increasing the request_timeout to 120, but it seems like the query fails much, much earlier than even the default 90s request timeout.
Expected behavior
The result of the traceroute is returned.
Steps to Reproduce
Local Configurations
I added request_timeout: 120 to hyperglass.yaml as an attempt to fix the problem. Additionally, in my devices.yaml, all Junos devices have netmiko set as their driver instead of scrapli, due to other issues impacting the scrapli driver. In an attempt to track down this issue, debug: true has also been added to hyperglass.yaml.
Logs
Possible Solution
Environment
Server
- hyperglass path: /opt/hyperglass/hyperglass
- Python version: 3.7.3
- Node.js version: 14.18.2
- Platform: Linux-5.6.0-0.bpo.2-cloud-amd64-x86_64-with-debian-10.11
- CPU: 2 cores, 2 threads, 1.999999GHz
Client