a very good suggestion (HTTrack needs update for modern times) #215

Open
pb5050 opened this issue Oct 9, 2021 · 1 comment
Comments

@pb5050

pb5050 commented Oct 9, 2021

@xroche

I don't know if you're still active on this project, but I think this suggestion would be a very nice update to HTTrack. I have searched the net and see countless people having issues logging into web pages. The idea is to reuse Edge's profile data: all cookies and saved forms are easily obtained this way. That means I could log into a web page in my browser, click "remember me" or "stay signed in", fire up HTTrack, go to the page, and already be logged in.

Adding an option in the menu to select the browser's data directory would cover this.

Here is a snippet from an npm script for Puppeteer or Playwright. I think this would be a very useful addition.

Here's a link explaining what it does and how it works:
playwright/docs API userDataDir

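const puppeteer = require("puppeteer");

// Launch Edge with an existing user data directory so saved logins and cookies are reused.
async function startBrowser() {
    let browser;
    try {
        console.log("Starting browser");
        browser = await puppeteer.launch({
            headless: true,
            ignoreHTTPSErrors: true,
            // The two options below are the relevant part of this suggestion.
            // Chromium-based browsers expect the parent "User Data" folder here;
            // the "Default" profile inside it is used unless another one is selected.
            userDataDir: "C:\\Users\\Janss\\AppData\\Local\\Microsoft\\Edge SxS\\User Data",
            executablePath: "C:\\Users\\Janss\\AppData\\Local\\Microsoft\\Edge SxS\\Application\\msedge.exe"
        });
    } catch (err) {
        console.log("Could not create a browser instance => : ", err);
    }
    return browser;
}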

Personally, I have given up. I can do almost anything with HTTrack except scrape behind a login page. I am trying to copy a vehicle manual for my truck from Chilton and have spent hours on many attempts. I just keep getting this error:

please return to your library's access page and re-authorize a new session. -

  1. I've tried the proxy form fill-out option
  2. The cookies.txt in the project folder; I also added one to the main directory
  3. Even a link that anyone can click to be logged into the page automatically
  4. Added a referrer URL
  5. Followed robots.txt, and also tried not following it
  6. Excluded all links containing "logout" or "quit"
  7. Tried to keep it within the /lh/Repair/Index directory
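
For the cookies.txt attempt in item 2, one thing worth ruling out is a malformed file. The sketch below turns cookie objects exported from a logged-in browser into Netscape-style cookies.txt lines. It assumes the input looks like what Puppeteer's `page.cookies()` returns (`name`, `value`, `domain`, `path`, `expires`, `secure`); the function names `toNetscapeCookieLine` and `toCookiesTxt` are made up for illustration. Whether HTTrack accepts the comment header and an expiry of 0 for session cookies is an assumption to verify, not something confirmed in this thread.

```javascript
// Convert browser cookie objects to Netscape cookies.txt lines.
// Assumed input shape (hypothetical, modeled on Puppeteer's page.cookies()):
//   { name, value, domain, path, expires, secure }
function toNetscapeCookieLine(cookie) {
    const domain = cookie.domain;
    // A leading dot means the cookie is also valid for subdomains.
    const includeSubdomains = domain.startsWith(".") ? "TRUE" : "FALSE";
    const path = cookie.path || "/";
    const secure = cookie.secure ? "TRUE" : "FALSE";
    // Session cookies have no real expiry (Puppeteer reports -1); write 0.
    const expires = cookie.expires > 0 ? Math.floor(cookie.expires) : 0;
    // Fields are tab-separated in this fixed order.
    return [domain, includeSubdomains, path, secure, expires, cookie.name, cookie.value].join("\t");
}

// Build the whole file body: a comment header followed by one line per cookie.
function toCookiesTxt(cookies) {
    const header = "# Netscape HTTP Cookie File";
    return [header, ...cookies.map(toNetscapeCookieLine)].join("\n") + "\n";
}
```

The resulting string could be written to the project's cookies.txt with `fs.writeFileSync`. Session-only cookies are deliberately kept rather than skipped, since a login session often depends on them.
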
@mitchcapper

@pb5050 your best bet would be to log in with a browser, extract those cookies, then run HTTrack through a proxy like Fiddler. Compare the request it makes with the one the browser makes (you can run the browser through the proxy as well). There are a few likely possibilities:

*) You are missing some header, or a cookie set in a specific way, or they use stricter referrer tracking. Make sure you are getting "session"-only cookies as well.
*) They use something like client-side storage rather than cookies for part of the session. Without executing JavaScript this becomes tricky. If that turns out to be the case, you could potentially use a custom workaround for that specific site.
*) They are doing some anti-bot detection through scripting, which is less likely but possible.

A proxy is the best place to start. If you find exactly what is different, you can potentially resolve it.
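
Once both requests are captured in the proxy, the comparison can be done by eye or with a small script. This is a rough sketch only: it assumes the headers of each request have been copied into plain objects of header name to value, and `diffRequests` and its helpers are hypothetical names, not part of HTTrack or Fiddler. It reports headers the browser sent that HTTrack did not, headers whose values differ, and cookies missing from HTTrack's request.

```javascript
// Lowercase header names and trim values so the comparison is case-insensitive.
function normalizeHeaders(headers) {
    const out = Object.create(null);
    for (const [name, value] of Object.entries(headers)) {
        out[name.toLowerCase()] = String(value).trim();
    }
    return out;
}

// Split a Cookie request header ("a=1; b=2") into a name -> value map.
function parseCookieHeader(value) {
    const cookies = Object.create(null);
    for (const part of (value || "").split(";")) {
        const idx = part.indexOf("=");
        if (idx === -1) continue; // skip empty or malformed fragments
        cookies[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
    }
    return cookies;
}

// Compare the browser's request headers against HTTrack's.
function diffRequests(browserHeaders, httrackHeaders) {
    const browser = normalizeHeaders(browserHeaders);
    const httrack = normalizeHeaders(httrackHeaders);
    const missingHeaders = [];
    const differentHeaders = [];
    for (const name of Object.keys(browser)) {
        if (name === "cookie") continue; // cookies are compared individually below
        if (!(name in httrack)) missingHeaders.push(name);
        else if (browser[name] !== httrack[name]) differentHeaders.push(name);
    }
    const browserCookies = parseCookieHeader(browser.cookie);
    const httrackCookies = parseCookieHeader(httrack.cookie);
    const missingCookies = Object.keys(browserCookies).filter((n) => !(n in httrackCookies));
    return { missingHeaders, differentHeaders, missingCookies };
}
```

A missing `referer` header or a session cookie that only shows up on the browser side would point at the first possibility above. If the headers and cookies match and the site still rejects the session, client-side storage or script-based detection becomes more likely.
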
