Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added parsing for saltstack interpreted files #86

Closed
wants to merge 9 commits into from
14 changes: 13 additions & 1 deletion identify/extensions.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,9 +52,10 @@
'gif': {'binary', 'image', 'gif'},
'go': {'text', 'go'},
'gotmpl': {'text', 'gotmpl'},
'gpg': {'text', 'gnupg'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'gpg': {'text', 'gnupg'},
'gpg': {'binary', 'gnupg'},

There are 101 000 results on Github, and spot-checking the results, as well as Linguist, MIME-DB and FileInfo indicates no conflicts. However, this GnuPG mailing list response, this SO answer and the FileInfo listing, as well as an examining a sample of .gpg files on Github, indicates that this extension is designated for binary-format GnuPG data, with asc being the ASCII-armored text equivalent for both PGP and GnuPG (with 635 000 Github hits, but only 470 000 of them keys, the rest being identified as AGS scripts and ASCIIDoc text).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch, I'll update when I get back

'gpx': {'text', 'gpx', 'xml'},
'graphql': {'text', 'graphql'},
'gradle': {'text', 'groovy'},
'graphql': {'text', 'graphql'},
'groovy': {'text', 'groovy'},
'gyb': {'text', 'gyb'},
'gyp': {'text', 'gyp', 'python'},
Expand Down Expand Up @@ -100,6 +101,7 @@
'lr': {'text', 'lektor'},
'lua': {'text', 'lua'},
'm': {'text', 'c', 'objective-c'},
'mako': {'text', 'mako'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no authority whatsoever on the matter, but for reference, this change was previously rejected in PR #137 , for reasons of mako refusing to specify a standard file extension. However, for reference, over 170 000 files on Github have the .mako extension (while 655 have the .mao extension), of which over 160 000 are identified as Mako by Linguist and a spot check of the remaining 10 000 suggests they too are likely all or almost all Mako. A check of mime-db indicates no other registered file types using this extension, and Mako template is the only type registered on FileInfo for .mako.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right on, I threw a couple of these in only because I didn;t see them in there already and thought why not, but if there are issues I'll just pull mako no problem there.

'manifest': {'text', 'manifest'},
'map': {'text', 'map'},
'markdown': {'text', 'markdown'},
Expand Down Expand Up @@ -179,6 +181,7 @@
'tgz': {'binary', 'gzip'},
'thrift': {'text', 'thrift'},
'tiff': {'binary', 'image', 'tiff'},
'tmpl': {'text', 'cheetah'},
Copy link
Contributor

@CAM-Gerlach CAM-Gerlach Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This association seems rather questionable. While it is documented that tmpl is considered the default file extension for Cheetah files, the likely ambiguity seems much too high to be acceptable. While there are no conflicts registered in Linguist or MIME-DB, there are two (of uncertain provenience) in FileInfo, and it is apparently used for Go templates.

Much worse, of the 2 300 000 Github file results, many/most don't seem to use the Cheetah $ syntax (that I'm not really familiar with beyond a quick skim of the docs), and most appear to be Javascript templates, XML templates, GoHTML templates, or other files. Furthermore, it seems rather improbable that more than a small fraction are Cheetah given, for instance, the Mustache template engine has 14.4k stars, an unambiguously standard extension (.mustache) and 995 000 files, while the Github repos I can find associated with Cheetah have only 131, 63, and 41 stars respectively, 2 OoM less despite having many more files associated with the extension.

Copy link
Author

@raddessi raddessi Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I did not find those on a quick search, good call again. I'll remove this one.

'toml': {'text', 'toml'},
'ts': {'text', 'ts'},
'tsx': {'text', 'tsx'},
Expand Down Expand Up @@ -223,6 +226,15 @@
EXTENSIONS_NEED_BINARY_CHECK = {
'plist': {'plist'},
}
# This should contain a map of file extensions to a map of interpreter names to
# their own file extensions
EXTENSIONS_NEED_SHEBANG_CHECK = {
'sls': {
'pydsl': 'py',
'pyobjects': 'py',
'cheetah': 'tmpl',
},
}

NAMES = {
'.babelrc': EXTENSIONS['json'] | {'babelrc'},
Expand Down
38 changes: 38 additions & 0 deletions identify/identify.py
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,8 @@ def tags_from_path(path):
if len(shebang) > 0:
tags.update(tags_from_interpreter(shebang[0]))

tags.update(tags_from_extension_specific_shebang(path))

# some extensions can be both binary and text
# see EXTENSIONS_NEED_BINARY_CHECK
if not {TEXT, BINARY} & tags:
Expand All @@ -73,6 +75,42 @@ def tags_from_path(path):
return tags


def tags_from_extension_specific_shebang(path):
"""Match tags from an extension that we need to read the shebang from."""
_, filename = os.path.split(path)
_, ext = os.path.splitext(filename)
ret = set()
if ext.lstrip('.') not in extensions.EXTENSIONS_NEED_SHEBANG_CHECK:
return ret

interpreter_to_extension_map = extensions.EXTENSIONS_NEED_SHEBANG_CHECK[
ext.lstrip('.')
]

with open(path, 'rb') as f:
shebang = parse_shebang(f)

if ext == '.sls':
if shebang:
# try to match tags for the file extension of the first interpreter
try:
first_interpreter = shebang[0].split('|')[0]
ret.update(
extensions.EXTENSIONS[
interpreter_to_extension_map.get(
first_interpreter, first_interpreter,
)
],
)
except (IndexError, KeyError):
pass
else:
# the default interpreter is jinja
ret.update(extensions.EXTENSIONS['jinja'])

return ret
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the lines coverage is complaining about missing tests. I'm stumped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried the tests locally to verify that this block is actually executing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I verified 4 locations get hit



def tags_from_filename(filename):
_, filename = os.path.split(filename)
_, ext = os.path.splitext(filename)
Expand Down
49 changes: 49 additions & 0 deletions tests/identify_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -118,6 +118,55 @@ def test_tags_from_path_plist_text(tmpdir):
}


def test_tags_from_extension_specific_shebang_executable_file(tmpdir):
x = tmpdir.join('test.sls')
x.write('')
make_executable(x.strpath)
assert identify.tags_from_extension_specific_shebang(x.strpath) == {
'jinja',
'text',
}


@pytest.mark.parametrize(
('interpreter', 'expected'),
(
('cheetah', {'text', 'cheetah'}),
('dson', set()),
('genshi', set()),
('gpg', {'text', 'gnupg'}),
('jinja', {'text', 'jinja'}),
('jinja|py', {'text', 'jinja'}),
('jinja|yaml', {'text', 'jinja'}),
('jinja|yaml|gpg', {'text', 'jinja'}),
('mako', {'text', 'mako'}),
('py', {'text', 'python'}),
('pydsl', {'text', 'python'}),
('pyobjects', {'text', 'python'}),
('wempy', set()),
('yaml', {'text', 'yaml'}),
('yamlex', set()),
('yaml|gpg', {'text', 'yaml'}),
),
)
@pytest.mark.parametrize(
('shebang_prefix',),
(
('#!',),
('#! ',),
),
)
def test_tags_from_extension_specific_shebang(
tmpdir,
shebang_prefix,
interpreter,
expected,
):
x = tmpdir.join('test.sls')
x.write(shebang_prefix + interpreter)
assert identify.tags_from_extension_specific_shebang(x.strpath) == expected


@pytest.mark.parametrize(
('filename', 'expected'),
(
Expand Down