sqlean is a Python CLI to clean your SQL queries. It aspires to be the black for SQL queries: an uncompromising query formatter that imposes a strict, strongly-opinionated formatting standard, with an easy workflow to achieve that standard.
Early releases of sqlean will support BigQuery dialect SQL queries with or without dbt syntax. If there's sufficient interest, other SQL dialects can be added.
WARNING: this is very early in sqlean's life and it should not be used in production.
Make sure you're in a virtual environment and install sqlean with pip:
pip install sqlean
The sqlean CLI has one argument and several options. The one argument in the
sqlean CLI is the TARGET
which can be a file or a directory. If the TARGET
is a directory, sqlean will be applied to all SQL files within the directory and
all sub-directories.
When sqlean is run without any options, SQL files will be changed in place so that the files comply with sqlean's formatting rules. Here's an example of a successful run (with no options):
sqlean is thoroughly tested to ensure that original queries have the same
abstract syntax tree (AST) as the formatted query. In plain English, that means
that sqlean should never break a query. But sqlean is very new so if you don't
fully trust sqlean yet, you can use the dry-run option (-d
/--dry-run
)
to display the diff that would happen if sqlean is run for real on the file :
Because sqlean needs to build an AST, if sqlean doesn't understand an SQL query
then it is unparsable. In this case, the example query is unparsable because
there's a problem with the query, and the verbose option (-v
/--verbose
)
points to the problem:
Brief documentation on sqlean commands can be accessed with:
sqlean --help
Once sqlean is production ready, it is ideally used in your continuous integration (CI) workflow so that a pull request that adds or changes SQL queries can only be merged if sqlean runs without errors (i.e. with the final message "All files passed").
When you first introduce sqlean into your workflow, many of your SQL files can be automatically fixed. In an example production query-base of over 1,600 files, nearly 60% of queries could be automatically fixed.
The other SQL files are unparsable. sqlean will need further development to treat these files and you can help by reporting the issue so that the problem can be addressed.
In the mean time though, the unparsable files can be marked so that sqlean knows
to ignore them in CI runs. You can do this automatically by running sqlean with
the ignore option (-i
/--write-ignore
), or just by inserting this comment as
the first line of the file that you want ignored:
# sqlean ignore
If some of the issues are addressed by a future version of sqlean, you can
upgrade sqlean and automatically remove the ignore indicators with the force
(-f
/--force
) option.
CLI options are for options that can change from one run of sqlean
to another.
Project level configuration will not change from one run to another, and must be
set in pyproject.toml
.
One exception is the TARGET
directory or file that can be both a project level
setting and also change from one run to another, so it can appear in both
pyproject.toml
or as a CLI argument.
In pyproject.toml
, sqlean options go in a [tool.sqlean]
section. The options
are:
includes
: a single path or list of paths to include when sqlean is run. This will be over-ridden by theTARGET
argument in the CLI when it is supplied. The default value isNone
, in which caseTARGET
must be used.indent_size
: an integer that indicates the number of spaces in the indent. The default value is 2.whisper
: (NOT YET IMPLEMENTED) a boolean which should be set totrue
to make all SQL reserved words and function names all lower case. The default value isfalse
, which makes all SQL reserved words and function names all-caps.max_line_length
: (NOT YET IMPLEMENTED) an integer value which adds additional line breaks if a line is longer than this value. The default value is-1
to indicate that no maximum will be applied.dialect
: (NOT YET IMPLEMENTED) a string to indicate the SQL dialect used in the project. The default value is"BIGQUERY"
In the last few years, the usage of SQL in big data processing has steadily increased with the emergence of computational engines like Google Cloud BigQuery, Amazon Redshift, Snowflake and others with SQL interfaces.
Leveraging on the growth of SQL, dbt has gained wide-spread adoption as a tool to orchestrate the running of SQL queries. While analytics and feature generation pipelines were previously written primarily in programming languages like Python or Scala, they are now mostly dbt code: a combination of SQL and jinja templating.
The tooling around dbt code does not have the maturity as tooling around older languages like Python. We see code formatting and styling as a critical part of tooling around a language: consistently formatted code makes it easier to collaborate within a team by making it faster and easier to read other people's code.
Black has emerged as the most popular code formatter for Python for a couple reasons:
- There are few options so there is little to argue about when setting it up for the first time.
- If two versions of a piece of code have the same AST, they will have the same black formatted output (modulo minor differences in line spacing).
The strong consistency in formatting means that sometimes the output looks a bit strange or at least different than what you're used to. But over time our brains adjust and it becomes faster to read black formatted code.
sqlean aspires to play the same rule with SQL/dbt queries. By using sqlean, you'll give up control on exactly how your queries are formatted and styled, but you'll gain strong consistency in the formatting. No more nitpicking in code reviews and no more getting used to other team members' query styles.
These are very early, experimental days in the development of sqlean so these numbers will change. On a production code base of 1,615 queries, 58.7% could be parsed and formatted and the other 42.3% could not be parsed. It takes 8.8 seconds for sqlean to attempt to parse and fix these queries. The time increases as more queries can be parsed, so it is expected to take roughly 15 seconds when most or all of the queries can be parsed.
There are other SQL formatting/linting tools out there and some may be more suitable for you.
SQLFluff is a dialect-flexible and configurable SQL linter. Designed with ELT applications in mind, SQLFluff also works with Jinja templating and dbt. SQLFluff will auto-fix most linting errors, allowing you to focus your time on what matters.
SQLFluff is the main alternative to sqlean with wide community adoption and active contributions (4K+ Github stars and 150+ contributors). SQLFluff has a wide array of rules that can be customized. Time to auto-format queries can vary widely. Running SQLFluff on the production code base of 1,615 queries took too long, so a subset of 617 queries was used for comparison:
sqlean | SQLFluff | |
---|---|---|
Time to auto-format (s) | 3.8 | 1295.2 |
# files auto-formatted | 388 | 228 |
Version | 0.0.3 | 0.13.2 |
While SQLFluff's performance depends on the starting queries, on this sample
sqlean was able to run nearly 350 times faster than SQLFluff and was able to
auto-format 70% more files than SQLFluff. SQLFluff was run with the default
rules using the command: sqlfluff fix -f --dialect bigquery
.
The large difference in run-times is likely due to the different approaches taken to format queries: SQLFluff rules are implemented as crawlers, each of which need to be run on the parsed structure. On the other hand, sqlean formatting are applied all in two passes of the parsed structure: one pass to determine indent levels and another pass to output the formatted query.
Language: Python
ZetaSQL defines a language (grammar, types, data model, and semantics) as well as a parser and analyzer.
It is a Google project (but not official) that is used for BigQuery, Cloud Spanner and Dataflow. An improved formatter is in progress, and a fork provides a formatter with preserved comments.
Language: C++
SQLLanguageServer provides auto-completion for editors, parsing and linting and has a VS Code extension. It has support for MySQL, PostgreSQL and SQLite3.
Language: TypeScript
From dbt Labs, unfortunately this is in an early stage and hasn't had new commits in two years.
Language: Python
sqlparse is a non-validating SQL parser for Python. It provides support for parsing, splitting and formatting SQL statements.
Language: Python
SQLint is a simple command-line linter which reads your SQL files and reports any syntax errors or warnings it finds.
Language: Ruby
Whenever a formatting choice needs to be made, the guiding question should be: if black formatted SQL queries, what choice would black make? By following this principle, sqlean formatted code will at least look familiar to the analytics engineers who use Python and black.
This is a non-exhaustive list of formatting changes that sqlean makes:
-
SQL reserved words like SELECT, AS, FROM, JOIN, etc and function names are written all-caps. A future
--whisper
option will force these to be all lower-case. -
Select lists are one item per line, with no exceptions. Even if there are only two or three items in the list and writing them on a single line would make a query more compact, there are two main reasons to write them on separate lines:
- If more items are added to the list later, more effort on editting is needed and the git diff will be harder to read.
- Not making an exception for short lists makes select lists more visually/spatially consistent and therefore faster to read.
-
An explicit "AS" will be added to all alias statements for select items or tables. From The Zen of Python:
Explicit is better than implicit.
-
SQL syntax allows both double and single quotes. sqlean processes quoted strings with black so that double quotes are usually used (with exceptions for strings that contain double-quotes).
-
Character spacing: a single space after commas, no space between function names and left parentheses, etc.
-
BigQuery now allows trailing commas in select lists, which obviates the need for leading commas (which were convenient, but aesthetically ugly). sqlean will insert a trailing comma in every select list. The primary reason is to reduce the size of future diffs. If no trailing comma is present:
SELECT field_1, field_2, field_3 FROM table
then when you add another item, two lines need to be changed and there will be three lines in the diff:
field_2, - field_3 + field_3, + field_4 FROM
If instead there is a trailing comma:
SELECT field_1, field_2, field_3, FROM table
then adding an item results in only a single line in the diff:
field_3, + field_4, FROM
- Reduce unparsable queries
- VSCode extension
- Styling improvements, e.g. dealing with long lines
- Support for other SQL dialects.
- SQL hygiene: e.g. warnings for select items in a join query that don't have a table reference.
sqlean
will need a community effort to be capable of parsing all valid SQL
queries.
If the query is correctly written but it is unparsable by sqlean, then sqlean's grammar needs to be extended to accommodate the valid query. You can help by reporting the issue and providing a minimal query that demonstrates the parse error.
The primary PRs for non core-developers are adding to sqlean's grammar and changing its styling. We've setup the testing to make it easier to add to the grammar so that more and more elements of valid SQL can be parsed.
Snapshot testing that is somewhat similar to snapshot testing in Jest is used to make sure that:
- when the grammar is changed, parsing changes in expected ways
- when formatting changes, the format in all preceding snapshot tests change in expected ways
Snapshot files are contained in sqlean/tests/snapshots
directory.
Each snapshot file is divided into 3 parts, with each part separated by a line
with three dashes:
---
- The first part is minimal working example (MWE) for the SQL query. This is the only part that needs to be human-written.
- The second part is the MWE formatted by sqlean, for easy comparison with the raw query written in the first part. This part will be auto-generated.
- The third part is the AST as represented by a parse tree. This makes is easier to understand what the grammar is doing to parse the query, and how the formatting should work. This part will be auto-generated.
The snapshots are located in the sqlean/tests/snapshots
directory, under
different sub-directories. The sub-directories are grouped by grammar elements.
Within each sub-directory, the files are prefixed by a three digit integer to
preserve ordering that is natural to the grammar rather than alphabetical
ordering.
The snapshot files have extension .snapshot
. If you set up your editor to
recognise the .snapshot
file as an R
file, you'll get decent syntax
highlighting of the parse tree.
If you run sqlean
on a file that contains a valid SQL/dbt query but
sqlean
marks it as "unparsable", this indicates that there is an element of
your file that is not in sqlean
's grammar. You can contribute to the grammar
so that the element and the file can be parsed and formatted.
-
If a query of yours cannot be parsed by sqlean, run sqlean on the single file with the verbose option to identify what is causing the problem:
sqlean -v <your_file_name>.sql
-
Once the unparsable element has been identified, write a minimal working example (MWE) of an SQL query that contains the element, and which sqlean cannot parse.
-
Identify the sub-directory in the
sqlean/tests/snapshots
directory where the element should go, or if there needs to be a new sub-directory. -
Within the sub-directories, create a new file named according to the conventions and put your MWE in the file.
-
Run the snapshot test for this file:
make snapshot L=tests/snapshots/<sub_dir>/<new_file>
This will fail with an error printed in the snapshot file which can guide you on modifying the grammar so that the file can be parsed.
If there are a number of different new files, you can use the
M
(match) argument instead of theL
(location) argument to match a string within the file names. For example,make snapshot M=dbt
will run the snapshot tests on all snapshot files that contain "dbt" in the file name.
-
The grammar for sqlean is defined in
.lark
files insqlean/sqlean/grammar
. sqlean uses Lark to parse queries once the grammar is defined. You can read the Lark docs or just short cut to the Lark cheat sheet. -
As you modify the grammar, you can check the output of the parser by running the
make snapshot L/M=...
command and checking what is printed in the second or third section of the snapshot file. Errors will appear in the second and third section if there is a problem with the grammar. -
You're done modifying the grammar once the parse tree appears in the third section of the snapshot file and the tree is as you want it.
Once you're happy with the parse tree, you can move on to formatting the output.
-
If the parse tree is output in the third section, there may be a styling related error in the second section. The error will indicate a node that is missing, which you can add in
sqlean/sqlean/sql_styler.py
. -
Similar to the grammar, as you modify the formatting, you can check the output of the formatter by running the
make snapshot L/M=...
command and checking what is printed in the second section of the snapshot file. Errors will appear in the second section if there is a problem with the formatter. -
Once you're happy with the formatting, you should run the tests on all existing snapshot files with:
make snapshot
If your parsing or formatting changed other files, it will show up here. These file changes will show up in git so it will be easy to see whether the change was intentional or not.
A few dependencies are critical to sqlean. We are eternally grateful for these amazing projects:
- Lark is used to parse queries into a tree that can then be styled. sqlean's parsing is lightning fast because of lark.
- Black is used to help style some bits in queries, and of course is a role model for sqlean.
- Rich makes the output from sqlean easier for humans to read.
- Typer made the sqlean CLI fast to develop.