How to add support for a new language
This document explains how to add support for a new programming language to Semgrep using tree-sitter. Most languages in Semgrep are parsed with tree-sitter, though you may also need to update the menhir parser.
Repositories involved directly:
- semgrep: the semgrep command line program.
- ocaml-tree-sitter-semgrep: language-specific setup, generates C/OCaml parsers for semgrep.
- A new repository semgrep-LANG for the language you're adding: this is a C or OCaml parser generated from ocaml-tree-sitter-semgrep by a Semgrep administrator.
- semgrep-interfaces
Placeholder values
This document uses the placeholder LANG to indicate that you should substitute the name of your language as the value in the given context. For example, if your language is Ruby, and the document's instructions read:
Create a new file TEST_LANG_LANG.txt where LANG is in lowercase.
The name of your file should be TEST_LANG_ruby.txt.
Create a file Pretty_print.EXTENSION with the filename extension of your language:
The name of your file should be Pretty_print.rb.
semgrep repository overview
Several GitHub repositories are involved in porting a language.
Here is the file hierarchy of the semgrep repository:
/languages
├── bash
...
└── swift
    ├── generic
    └── tree-sitter
        └── semgrep-swift # generated tree-sitter parsers
When you're done with the work in ocaml-tree-sitter-semgrep, you'll need a new repository semgrep-LANG to host the generated parser code.
Ask someone from the Semgrep team to create one for you. They should use the template semgrep-lang-template when creating the repository.
The instructions for adding a language start in ocaml-tree-sitter-semgrep, as indicated below. Be careful that you are always in the correct repository!
Set up ocaml-tree-sitter-semgrep
As a model, you can use the existing setup for ruby or javascript. The most complicated setup is for typescript and tsx.
Expedited setup
If you're lucky, the language you want to add can be added with the script add-simple-lang:
cd lang
./add-simple-lang --help
Follow the instructions from --help.
This often works with languages that define a single dialect using a grammar.js file at the root of the project. If this simplified approach fails, use the Manual setup instructions below to understand what's going on or to set things up manually.
Manual setup
From the ocaml-tree-sitter-semgrep repository, do the following:
- Create a lang/LANG folder.
- Make a test/ok directory. Inside the directory, create a simple hello-world program for the language you are porting. Name the program hello-world.EXTENSION.
- Now make a file called extensions.txt and input all the language extensions (.rb, .kt, etc) for your language in the file.
- Create a file called fyi.list with all the information files, such as semgrep-grammars/src/tree-sitter-LANG/LICENSE, semgrep-grammars/src/tree-sitter-LANG/grammar.js, semgrep-grammars/src/semgrep-LANG/grammar.js, etc. to bundle with the final OCaml/C project.
- Link the Makefile.common to a Makefile in the directory with:
  ln -s ../Makefile.common Makefile
- Create a test corpus. You can do this by:
  - Running most-starred-for-language to gather projects on which to run parsing stats. Run with the following command:
    ./scripts/most-starred-for-language LANG YOUR_USERNAME API_KEY
  - Using GitHub advanced search to find the most starred or most forked repositories.
- Copy the generated projects.txt file into the lang/LANG directory.
- Add in extra projects and extra input sets as you see necessary.
Here's the file hierarchy for Ruby:
lang/ruby                # language name of the form [a-z][a-z0-9]*
├── extensions.txt       # standard name. Required for stats.
├── fyi.list             # list of informational files to copy. Recommended.
├── Makefile -> ../Makefile.common
├── projects.txt         # standard name. Required for stats.
└── test                 # sample input files
    ├── ok               # contains input files supported by the current grammar
    │   ├── comment.rb
    │   ├── ex1.rb
    │   ├── ex2.rb
    │   ├── hello.rb
    │   └── poly.rb
    └── xfail            # contains input files that are expected to fail
        └── rating.rb
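The naming rule noted in the tree above ([a-z][a-z0-9]*) can be checked before you create the folder. The following is a minimal sketch; the helper name isValidLangName is hypothetical and is not part of ocaml-tree-sitter-semgrep.

```javascript
// Pattern copied from the comment in the file hierarchy: [a-z][a-z0-9]*
const LANG_NAME = /^[a-z][a-z0-9]*$/;

// Hypothetical helper, for illustration only.
function isValidLangName(name) {
  return LANG_NAME.test(name);
}

console.log(isValidLangName("ruby"));    // true
console.log(isValidLangName("csharp"));  // true
console.log(isValidLangName("c-sharp")); // false: hyphens are not allowed
console.log(isValidLangName("Ruby"));    // false: uppercase is not allowed
```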
To test a language in ocaml-tree-sitter-semgrep, you must build the ocaml-tree-sitter-semgrep OCaml code generator, run it to produce a parser, then run some tests for the parser. Full instructions for this are given in updating-a-grammar under "Testing". The short instructions are:
- For the first time, build everything with ./scripts/rebuild-everything.
- Subsequently, work from the lang/LANG folder and run make and make test.
The fyi.list file
The fyi.list file was created to specify informational files that should accompany the generated files. These files are typically:
- the source grammar, most often a single grammar.js file.
- the licensing conditions usually specified in a LICENSE file.
Example:
# Comments are allowed on their own line.
# Blank lines are ok.
# Each path is relative to ocaml-tree-sitter-semgrep/lang
semgrep-grammars/src/tree-sitter-ruby/LICENSE
semgrep-grammars/src/tree-sitter-ruby/grammar.js
semgrep-grammars/src/semgrep-ruby/grammar.js
The files listed in fyi.list end up in a fyi folder in tree-sitter-lang. For example, see ruby/fyi.
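To make the file format concrete, here is a minimal sketch of how a fyi.list file could be read, based only on the rules stated in the example (comments on their own line, blank lines allowed, one path per line). The function parseFyiList is hypothetical; the actual tooling in ocaml-tree-sitter-semgrep may process the file differently.

```javascript
// Hypothetical helper: returns the paths listed in the text of a fyi.list file.
function parseFyiList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Skip blank lines and lines that are entirely a comment.
    .filter((line) => line !== "" && !line.startsWith("#"));
}

const example = [
  "# Comments are allowed on their own line.",
  "",
  "semgrep-grammars/src/tree-sitter-ruby/LICENSE",
  "semgrep-grammars/src/tree-sitter-ruby/grammar.js",
  "semgrep-grammars/src/semgrep-ruby/grammar.js",
].join("\n");

// Prints the three paths, without the comment or the blank line.
console.log(parseFyiList(example));
```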
Extend the original grammar with semgrep syntax
This is best done after everything else is set up. Some constructs
such as semgrep metavariables ($FOO) may already be valid constructs
in the language, in which case there's nothing to do. Some support for
the semgrep ellipsis ... usually needs to be added as well.
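JavaScript is one language where a metavariable such as $FOO is already a valid identifier, so its grammar accepts it without changes. The sketch below shows this. The regular expression is an assumption about the shape of metavariable names (a dollar sign followed by uppercase letters, digits, or underscores) and is not taken from this document.

```javascript
// In JavaScript, "$" is allowed in identifiers, so $FOO parses as a plain variable name.
const $FOO = 1;
console.log($FOO + 1); // 2

// Assumed shape of a Semgrep metavariable name; illustration only.
const METAVARIABLE = /^\$[A-Z_][A-Z_0-9]*$/;
console.log(METAVARIABLE.test("$FOO")); // true
console.log(METAVARIABLE.test("$foo")); // false
```

In languages where "$" cannot appear in an identifier, the grammar needs an explicit extension instead.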
You'll need to learn how to create tree-sitter grammars.
- Work from semgrep-grammars/src/semgrep-LANG and use make and make test to build and test.
- Add new test cases to test/corpus/semgrep.text.
- Edit grammar.js.
- Refer to the original grammar in semgrep-grammars/src/tree-sitter-LANG to determine which rules to extend.
For an example of how to extend a language, you can:
- Look at what was done for the semgrep extensions of other languages in their respective semgrep-* folders.
- Look at how tree-sitter-typescript extends the JavaScript grammar. This is the file common/define-grammar.js in the tree-sitter-typescript repository.
Avoiding parsing conflicts is the trickiest part. Asking for help is encouraged.
💡 A note on the JavaScript syntax that's heavily used to define and extend grammars:
When possible, the development team prefers shorthand notation for anonymous functions made of a single expression:
(x) => x
which is the same as
(x) => { return x; }
which is itself the same as
function(x) { return x; }
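The three forms can be compared directly. The sketch below also shows a common pitfall that is not mentioned above: with curly braces, omitting return makes the function yield undefined. All names are illustrative.

```javascript
const shorthand = (x) => x;
const withBraces = (x) => { return x; };
const classic = function (x) { return x; };

console.log(shorthand(42), withBraces(42), classic(42)); // 42 42 42

// Pitfall: curly braces without "return" produce undefined.
const forgotReturn = (x) => { x; };
console.log(forgotReturn(42)); // undefined
```

For grammar rules, which do not rely on `this`, the three forms are interchangeable.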
When extending any rule with an alternate choice such as $.ellipsis,
the simplest way is:
expression: ($, previous) => choice(previous, $.ellipsis),
However, if the previous rule is known to be a choice(), you can avoid
one level of nesting and append to the original list of choices, which
is done as follows:
expression: ($, previous) => choice(...previous.members, $.ellipsis),
Whether to use one or the other is a matter of taste.
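The difference between the two forms is only in the shape of the resulting rule. The sketch below uses a stand-in choice() function, assumed to build an object with a members array as tree-sitter's own helper does, and represents symbols as plain strings. It is an illustration and does not load tree-sitter.

```javascript
// Stand-in for tree-sitter's choice(); assumed shape: { type: "CHOICE", members: [...] }.
function choice(...members) {
  return { type: "CHOICE", members };
}

// Pretend this is the rule inherited from the original grammar.
const previous = choice("identifier", "number");

// Form 1: wraps the previous choice, giving one extra level of nesting.
const nested = choice(previous, "ellipsis");

// Form 2: appends to the original list of alternatives.
const flat = choice(...previous.members, "ellipsis");

console.log(JSON.stringify(nested, null, 2));
console.log(JSON.stringify(flat, null, 2));
```

Both rules accept the same inputs. The flat form only works when previous really is a choice(), since other rule types have no members list to spread.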
Finally, on rare occasions where the rule body is more than a single expression, you'll have to use the curly brace and return syntax:
expression: ($, previous) => {
  if (semgrep_ext)
    return choice(...previous.members, $.ellipsis);
  else
    return previous;
},
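The conditional form above can be exercised outside of tree-sitter with stand-ins. In this sketch, choice() is a simplified replacement for tree-sitter's helper, $ is a plain object standing in for the rule-reference object, and makeExpressionRule is a hypothetical wrapper that supplies the semgrep_ext flag, since the snippet above does not say where that flag comes from.

```javascript
// Simplified stand-in for tree-sitter's choice().
function choice(...members) {
  return { type: "CHOICE", members };
}

// Hypothetical wrapper that fixes the value of semgrep_ext for one rule.
function makeExpressionRule(semgrep_ext) {
  return ($, previous) => {
    if (semgrep_ext)
      return choice(...previous.members, $.ellipsis);
    else
      return previous;
  };
}

const $ = { ellipsis: "ellipsis" };               // stand-in for rule references
const original = choice("identifier", "number");  // stand-in for the inherited rule

const extended = makeExpressionRule(true)($, original);
const untouched = makeExpressionRule(false)($, original);

console.log(extended.members.length);  // 3
console.log(untouched === original);   // true
```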
Parsing statistics
From a language's folder such as lang/csharp, two targets are
available to exercise the generated parser:
- make test: runs on test/ok and test/xfail
- make stat: downloads the code specified in projects.txt and parses the files whose extension matches those in extensions.txt, reporting parsing success in the form of a CSV file.
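The file selection done by make stat can be pictured with the sketch below. It assumes that extensions.txt holds one extension per line, each with its leading dot (for example .rb). The function selectFiles and the file names are hypothetical; the real implementation may differ.

```javascript
// Hypothetical helper: keeps the file names whose extension is listed in extensions.txt.
// Assumption: one extension per line, written with its leading dot.
function selectFiles(extensionsTxt, fileNames) {
  const extensions = extensionsTxt
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "");
  return fileNames.filter((name) =>
    extensions.some((ext) => name.endsWith(ext))
  );
}

const selected = selectFiles(".rb\n.rake\n", ["app.rb", "README.md", "build.rake"]);
console.log(selected); // [ 'app.rb', 'build.rake' ]
```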
To gather a good test corpus, you can use GitHub Search or the script provided in scripts/most-starred-for-language.py. For GitHub searches, filter by programming language and use a constraint to select large projects, such as "> 100 forks". Collect the repository URLs and put them into projects.txt.
Publish generated parsers
After you have pushed your ocaml-tree-sitter-semgrep changes to the main branch, do the following:
- Check that the original grammar.js, src/scanner.c/.cc (if applicable) look clean and have minimal external dependencies.
- In ocaml-tree-sitter/lang/Makefile, add language under 'SUPPORTED_LANGUAGES' and 'STAT_LANGUAGES'.
- In ocaml-tree-sitter/lang directory, run ./release LANG --dry-run. If this looks good, please ask someone from the Semgrep team to publish the code using ./release LANG.
Troubleshooting
Various errors can occur along the way.
Compilation errors in C or C++ are usually due to a missing source
file scanner.c or scanner.cc, or a grammar with a name that
doesn't match the name inside the scanner file. JavaScript files may
also be missing, in particular in the case of grammars that extend
existing grammars such as C++ for C or TypeScript for
JavaScript. Check for require() calls in grammar.js and learn how
this NodeJS primitive resolves paths.
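As a reminder of the key rule: a relative require() specifier is resolved against the directory of the file that contains the call, not against the current working directory. The sketch below approximates that first step with the path module. The helper name and the directory layout are hypothetical, and the helper does not reproduce the later steps Node.js performs, such as trying file extensions or looking in node_modules.

```javascript
const path = require("path");

// Approximates the first step of resolving a relative require() specifier:
// the specifier is joined to the directory of the calling file.
function resolveRelativeRequire(callingFile, specifier) {
  return path.posix.resolve(path.posix.dirname(callingFile), specifier);
}

// Hypothetical layout, for illustration only.
const resolved = resolveRelativeRequire(
  "/work/tree-sitter-typescript/typescript/grammar.js",
  "../common/define-grammar"
);
console.log(resolved); // /work/tree-sitter-typescript/common/define-grammar
```

If the resolved location does not exist in your checkout, the grammar that extends another one will fail to load, which is the missing-file situation described above.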
There may also be errors when generating or compiling OCaml code. These are likely bugs in ocaml-tree-sitter-semgrep and they should be reported or fixed right away.
Here are some known types of parsing errors:
- A syntax error. The input program is in the wrong syntax or uses a recent feature that's not supported yet: make test or directly the parse_LANG program will show the tree produced by tree-sitter with one or more ERROR nodes.
- A "reparsing" error. It's an error generated after the first successful parsing pass by the tree-sitter parser, during the reparsing pass by the OCaml code performed by the generated Parse.ml file. The error message should tell you something like "cannot interpret tree-sitter's output", with details on what code failed to match what pattern. This is most likely a bug in ocaml-tree-sitter-semgrep.
- A segmentation fault. This could be due to a bug in the OCaml/tree-sitter C bindings and should be fixed. A simple test case that reproduces the problem would be nice. See https://github.com/semgrep/ocaml-tree-sitter-semgrep/issues/65
Parsing errors that are due to an incomplete or incorrect grammar should be recorded, and eventually reported or fixed in the upstream project.
We keep failing test cases in a fail/ folder, preferably in the form of the minimal program suitable for a bug report, with a comment describing what was expected and what's going on.
Update the semgrep repository
Now that you have added your new language LANG to tree-sitter, do the following:
- Update generate.py in the semgrep-interfaces repository with your new language.
- In the semgrep repository, go to /src/parsing/Check_pattern.ml, and add LANG to lang_has_no_dollar_ids. If the grammar has no dollar identifiers, add LANG above 'true'. Otherwise, add it above 'false'.
- In /src/printing/Pretty_print_AST.ml, add LANG to the appropriate functions:
  - print_bool
  - if_stmt
  - while_stmt
  - do_while
  - for_stmt
  - def_stmt
  - return
  - break
  - continue
  - literal
- In /src/parsing/tests/Test_parsing.ml, add LANG to dump_tree_sitter_cst_lang.
- Inspect the other languages in /languages as a reference for what code to add. Create a new folder for your language.
- Add the semgrep-LANG repository as a submodule under /languages/LANG/tree-sitter/ (git submodule add ...).
- Create a file /languages/LANG/tree-sitter/Parse_LANG_tree_sitter.ml by copying the generated template Boilerplate.ml that you'll find in the semgrep-LANG submodule. Add basic functionality to define the function parse and import the module Parse_tree_sitter_helpers. Look at other languages to get a better idea of how to define the parse file function. This file should contain something similar to:
  module H = Parse_tree_sitter_helpers
  let parse file =
    H.wrap_parser
      (fun () ->
         Parallel.backtrace_when_exn := false;
         Parallel.invoke Tree_sitter_X.Parse.file file ()
      )
- Create the missing dune files wherever you have OCaml source files (.ml, .mli) by imitating what was done for other languages.
- Write a basic test case for your language in tests/LANG/hello-world.EXT. This can just be a hello-world function.
- Try to build the project using the usual commands (make or make dev).
- Test that the command semgrep-core/bin/semgrep-core -dump_tree_sitter_cst test/LANG/hello-world prints out a CST for your language.
At this point, you're ready to start writing the translator from the CST produced by the tree-sitter parser for LANG into the generic AST used by Semgrep, accommodating all the languages in a single AST type. It's recommended but not required to first translate the CST into a language-specific AST before translating it into the generic AST in a second step.
Legal concerns
Be thankful to the authors of the original code, keep license notices clearly visible, and make it easy to get back to the original projects:
- Make sure to preserve the LICENSE files. These should be listed in the fyi.list file.
- For sample input in test/, consider Public Domain ("The Unlicense") files or write your own, for simplicity. GitHub Search allows you to filter projects by license and by programming language.
See also
How to upgrade the grammar for a language