
Check valid robots.txt file #50

Open
lucasfcosta opened this issue Nov 8, 2015 · 8 comments

@lucasfcosta
Member

We should add a method to check if the content of robots.txt contains valid syntax.

@coderingo

I really need this feature and I will be happy to help :)

@vieiralucas
Contributor

Hello @coderingo
Feel free to send a PR addressing this issue.
If you have any questions about the code base, you can ask us here in this issue or open a PR with a [WIP] tag and ask there.

Thank you so much for being willing to contribute

😄

@coderingo

Hi @vieiralucas

Thanks for the quick response. I'm ready to start working on it. Could you give me a hint about where to start? I read the standards specified by Google and they are very forgiving regarding syntax validation.

Kind Regards :)

@vieiralucas
Contributor

vieiralucas commented Feb 8, 2017

Hello,

So I read this page from Google and I agree that they are very forgiving regarding syntax validation.

Only valid records will be considered; all other content will be ignored.

This basically means that if a line is invalid it is just ignored.

IMHO robotto should just ignore invalid lines when parsing (it already does that), but this new method could return an object pointing out these lines as invalid.

So, I would create a method; I'm not sure what to call it, maybe isValid?
This method should return an object like:

// when valid
{
  valid: true,
  invalidLines: []
}

// when invalid
{
  valid: false,
  invalidLines: [
    { number: 2, content: '....' }
  ]
}

Another thing that might be a good idea is to just add an invalidLines field to the return of parse; the user could then check its length to determine whether a robots.txt file contains any invalid lines.
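
A rough sketch of how a consumer could use that, assuming parse simply gains an invalidLines array alongside whatever it already returns (the entry fields follow the example above):

const rules = robotto.parse(robotsTxtContent);

// invalidLines would be [] for a fully valid file
if (rules.invalidLines.length > 0) {
  console.log('robots.txt contains invalid lines:');
  rules.invalidLines.forEach((line) => {
    console.log(`  line ${line.number}: ${line.content}`);
  });
}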

@lucasfcosta, @coderingo What do you think?

I personally like this last idea

@coderingo

Hi @vieiralucas

I agree with your last idea to just add an invalidLines key to the object returned by parse. I'd like to suggest that, along with the number and content of the line with invalid syntax, each entry should also contain an error key with an error code or name so that errors can be identified by other modules or end users.
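
For example, each entry could look roughly like this (the error name is hypothetical, just to illustrate the idea):

{
  number: 2,
  content: 'Disalow: /search',
  error: 'UNKNOWN_DIRECTIVE'
}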

On another note, parse gave me an error on an almost-valid file. I'd like to share the contents of the file:

    User-agent: *
Disallow: /search
Disallow: /ebooks?q=
Disallow: /ebooks?output=
Disallow: /ebooks?pg=
Disallow: /ebooks?jscmd=
Disallow: /ebooks?buy=
Disallow: /ebooks?zoom=
Disallow: /intl/*/about/views/
User-agent: Twitterbot
Allow: /imgres
User-agent: facebookexternalhit
Allow: /imgres

Sitemap: http://www.gstatic.com/profiles-sitemap.xml
Sitemap: https://www.google.com/sitemap.xml

It's a modified part of https://google.com/robots.txt. Notice the four spaces at the start of User-agent:; that's what's giving me the following error on parse:

TypeError: Cannot read property 'disallow' of undefined
    at lines.forEach.line (node_modules\robotto\lib\robotto.js:78:47)
    at Array.forEach (native)
    at Object.robotto.parse (node_modules\robotto\lib\robotto.js:41:11)

I haven't debugged it further but I think it may be because of how the array of lines is created from the content.
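
A minimal way to reproduce it, assuming robotto.parse takes the raw robots.txt string (which is what the stack trace above suggests):

const robotto = require('robotto');

// The four leading spaces before User-agent are what trip the parser.
const content = '    User-agent: *\nDisallow: /search\n';

robotto.parse(content);
// => TypeError: Cannot read property 'disallow' of undefined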

Thanks for your precious time and dedication to the issues

@lucasfcosta
Member Author

I think that we can be a little more forgiving about parsing rules.

We wrote this code a while ago so we were kind of strict in our implementation and didn't consider many edge cases.

I have already implemented the same package in Go (gobotto) and it has parsing rules that are more compliant with Google's standards. I think that implementation might help you when thinking about how to tackle parsing.

I also like the idea of pointing out what is wrong on a per-line basis, since one invalid line won't invalidate the others. Basically, we should start treating each line as an atomic unit that can be parsed on its own.

Whenever you send us your pull request we will be very happy to review it and point you in the right direction if we disagree with anything, but please feel free to do it the way you think is best.

Thank you very much for contributing!

@coderingo

Hi Lucas,

I am wrapping my head around gobotto. It's my first interaction with Go 😅. I will try to work on it. Seems interesting.

As for the undefined lastUserAgent in the above code, we can eliminate it by trimming every line. Google forgives spaces at the start and at the end. We can check for the whitespace and log it as an error/warning, but the parser should not break on it. I think the following code at lines 58 & 59 is what's causing the error:

let userAgentIndex = line.toLowerCase().indexOf('user-agent:');
if (userAgentIndex === 0) {
Due to the whitespace, userAgentIndex comes back as 4 instead of 0, so the check fails and lastUserAgent is never set. Trimming the line removes the error.
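
A minimal sketch of the fix (rawLine and the value extraction are illustrative, not robotto's exact code):

// Trimming each line before the prefix check makes
// '    User-agent: *' behave like 'User-agent: *'.
const line = rawLine.trim();

let userAgentIndex = line.toLowerCase().indexOf('user-agent:');
if (userAgentIndex === 0) {
  // With the whitespace gone the prefix check passes,
  // so lastUserAgent gets set as expected.
  lastUserAgent = line.slice('user-agent:'.length).trim();
}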

@lucasfcosta
Member Author

@coderingo You are totally right!

Ah, and sorry for showing you Go code without asking if you knew Go first. Let me give you a more detailed explanation of the algorithm I used there (a rough JavaScript sketch follows the list below):

  1. Split robots.txt contents whenever we find a line break (\n)
  2. Trim whitespace on both the left and the right (I hadn't thought about this before, well noticed, buddy)
  3. Lowercase the whole line and store it in another variable
  4. Check the line's prefix to determine its semantic value
    4.1 If the line starts with #, we take everything after # and append it to our array of comments
    4.2 If the line starts with user-agent, we take everything after the first : and save it as the current user agent, so any rules after that are put under this agent's namespace
    4.3 If the line starts with allow, we take everything after the first : and add this rule to the map of allowed routes for the current user-agent
    4.4 If the line starts with disallow, we take everything after the first : and add this rule to the map of disallowed routes for the current user-agent
    4.5 Here we must also check for sitemaps and other valid syntax that doesn't make any difference to the crawler's permissions
  5. If we couldn't match the line against any of these cases, it is considered invalid and added to our array of invalid lines
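
A rough JavaScript sketch of those steps (the function name and result shape are illustrative, not robotto's actual implementation):

function parseRobots(content) {
  const result = { comments: [], userAgents: {}, sitemaps: [], invalidLines: [] };
  let currentAgent = null;

  content.split('\n').forEach((rawLine, index) => {             // step 1
    const line = rawLine.trim();                                // step 2
    const lower = line.toLowerCase();                           // step 3
    const afterColon = () => line.slice(line.indexOf(':') + 1).trim();

    if (line === '') {
      return; // blank lines: an edge case not covered in the list above
    }

    if (lower.startsWith('#')) {                                // step 4.1
      result.comments.push(line.slice(1).trim());
    } else if (lower.startsWith('user-agent:')) {               // step 4.2
      currentAgent = afterColon();
      result.userAgents[currentAgent] = result.userAgents[currentAgent] || { allow: [], disallow: [] };
    } else if (lower.startsWith('allow:') && currentAgent) {    // step 4.3
      result.userAgents[currentAgent].allow.push(afterColon());
    } else if (lower.startsWith('disallow:') && currentAgent) { // step 4.4
      result.userAgents[currentAgent].disallow.push(afterColon());
    } else if (lower.startsWith('sitemap:')) {                  // step 4.5
      result.sitemaps.push(afterColon());
    } else {                                                    // step 5
      result.invalidLines.push({ number: index + 1, content: rawLine });
    }
  });

  return result;
}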

I think that's pretty much it. There may be some more edge cases I haven't thought about, so if you can think of any others please let me know.

Thanks again for your help!
