
ParseNN

An experimentation framework for applying deep learning to source code. ParseNN provides functionality to build training data based on GitHub sources using a generic ANTLR-based parser.

Overview

The main functionality ParseNN provides at the moment is creating input/output pairs of data based on source code from Git repositories. In general, you use ParseNN by providing a list of Git URLs and selecting a particular extraction format.

The following examples illustrate some of the available formats. The first line shows the original source code (with leading and trailing whitespace stripped). Each following pair of lines shows one of the available representations, first in its word-based, human-readable form and then as its numerical embedding.

  • $dst/tok/source: public static String s = "foo bar";
  • $dst/tok/source-chars: p u b l i c ＀ s t a t i c ＀ S t r i n g ＀ s ＀ = ＀ " f o o ＀ b a r " ;
  • $dst/tok/source-chars.ints: 17 15 18 9 6 13 4 7 5 14 5 6 13 4 20 5 11 6 8 16 4 7 4 23 4 10 36 12 12 4 18 14 11 10 19
  • $dst/ast/tokens: public static String s = "foo＀bar" ;
  • $dst/ast/tokens.ints: 6 10 5 8 7 24 4
  • $dst/ast/nodeContext-depth: ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclaratorId│10 VariableDeclarator│9 Literal│13 FieldDeclaration│7
  • $dst/ast/nodeContext-depth.ints: 4 4 28 21 25 22 31

For many formats, the unassigned Unicode code point U+FF00 is used as a "magic" character that replaces whitespace, so that words can be separated by plain spaces in the resulting files.
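The whitespace replacement described above can be sketched in a few lines of Scala. This is an illustration only, not ParseNN's actual code: the object and method names (`WhitespaceEscape`, `escape`, `toChars`, `unescape`) are hypothetical, and the sketch assumes every whitespace character is mapped to U+FF00.

```scala
object WhitespaceEscape {
  // U+FF00, the "magic" placeholder for whitespace.
  val Magic: Char = '\uFF00'

  // Replace every whitespace character inside a word with the placeholder.
  def escape(word: String): String =
    word.map(c => if (c.isWhitespace) Magic else c)

  // Character-level form: strip the line, escape it, separate characters by spaces.
  def toChars(line: String): String =
    escape(line.trim).mkString(" ")

  // Inverse of escape, assuming the original whitespace was a plain space
  // (tabs and other whitespace cannot be told apart after escaping).
  def unescape(word: String): String =
    word.replace(Magic, ' ')
}
```

With this sketch, `WhitespaceEscape.toChars("a b")` yields the three "words" `a`, U+FF00 and `b`, separated by spaces, which mirrors the layout of the source-chars file shown above.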

New extraction formats can be added through a simple interface. See the section "Extending" below.

Installation

Clone the repository. The only requirements are sbt, Scala, and JDK 8.

Using ParseNN-included data writers

Use ParseNN's CLI to apply any existing extractors. For usage information, run:

sbt "run-main ch.uzh.ifi.seal.parsenn.ParseNN --help"

For a demo run that will apply all formatters to two projects and write the resulting data to /tmp/results, run:

sbt "run-main ch.uzh.ifi.seal.parsenn.ParseNN repos/demo-projects.txt java all /tmp/results"

In general, the tool will create several files corresponding to the provided source code as input and the selected output formats as output. As such, for all files within a subdirectory, each line in one file corresponds exactly to the same line in any other file. For any format, three files are created:

  • foo - plain text words (spaces escaped, separated by spaces)
  • foo.ints - embedded words
  • foo.vocab - vocabulary used for the embedding (line numbers minus 1 correspond to indices)

In addition, a file named source is created that contains the corresponding original character sequence taken from the source code.
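The relationship between the three files of a format can be sketched as a plain index lookup. The following is a minimal sketch under the stated rule that line n of the vocabulary file holds the word with index n - 1; `VocabLookup`, `decode` and `encode` are hypothetical names, and the vocabulary is passed in as already-read lines rather than loaded from disk.

```scala
object VocabLookup {
  // Line n (1-based) of foo.vocab holds the word with index n - 1,
  // so the lines can be used directly as a 0-based lookup table.
  def decode(vocabLines: IndexedSeq[String], intsLine: String): Seq[String] =
    intsLine.trim.split("\\s+").toSeq.filter(_.nonEmpty).map(i => vocabLines(i.toInt))

  // Reverse direction: map each space-separated word of a line to its index.
  // Throws NoSuchElementException for words missing from the vocabulary.
  def encode(vocabLines: IndexedSeq[String], wordsLine: String): Seq[Int] = {
    val index = vocabLines.zipWithIndex.toMap
    wordsLine.trim.split(" ").toSeq.filter(_.nonEmpty).map(index)
  }
}
```

Because the files of one subdirectory are line-aligned, decoding line k of foo.ints with foo.vocab should reproduce line k of foo.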

Supported languages:

  • Java
  • Go

Additional languages can be supported by dropping grammars into src/main/antlr4/... and creating the corresponding parser scaffolding. See JavaAntlr.scala for a simple example using ANTLRv4.

Source files

ParseNN always creates the following three files for any extraction, representing the original source (without an embedding) as well as the source characters and their embedding:

  • $dst/tok/source: public static String s = "foo bar";
  • $dst/tok/source-chars: p u b l i c ＀ s t a t i c ＀ S t r i n g ＀ s ＀ = ＀ " f o o ＀ b a r " ;
  • $dst/tok/source-chars.ints: 17 15 18 9 6 13 4 7 5 14 5 6 13 4 20 5 11 6 8 16 4 7 4 23 4 10 36 12 12 4 18 14 11 10 19

Supported lexer (tok/) output formats, contained in TokenFormatter:

These formats are based on the lexer, i.e. they apply transformations to a sequence of tokens.

  • all: Produce output using all formats
  • endings01: Input as characters, output denoting endings of tokens, e.g.:

    • 0 0 0 0 0 1 ＀ 0 0 0 0 0 1 ＀ 0 0 0 0 0 1 ＀ 1 ＀ 1 ＀ 0 0 0 0 0 0 0 0 1 1
    • 4 4 4 4 4 5 6 4 4 4 4 4 5 6 4 4 4 4 4 5 6 5 6 5 6 4 4 4 4 4 4 4 4 5 5
  • takeDropWhile: Series of parsing instructions as "take while"/"drop while", e.g.:

    • T:1:c T:1:c T:1:g T:1:s T:1:= T:2:" T:1:;
    • 4 4 11 5 10 6 7
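The endings01 format listed above can be reproduced with a short sketch: every character of a token is labelled 0 except the last, which is labelled 1, and whitespace between tokens becomes the U+FF00 placeholder. This is an illustration inferred from the example output, not the TokenFormatter implementation. `Endings01Sketch` and its token-list input are hypothetical (the real formatter obtains its tokens from the ANTLR lexer), and the sketch assumes that only whitespace appears between tokens.

```scala
object Endings01Sketch {
  // U+FF00 placeholder for whitespace between tokens.
  val Magic = "\uFF00"

  // `tokens` are the lexer tokens of `line`, in order of appearance.
  def endings(line: String, tokens: Seq[String]): Seq[String] = {
    val out = Seq.newBuilder[String]
    var pos = 0
    for (tok <- tokens) {
      require(tok.nonEmpty, "empty token")
      val start = line.indexOf(tok, pos)
      require(start >= 0, s"token '$tok' not found after offset $pos")
      out ++= Seq.fill(start - pos)(Magic)  // gap between tokens
      out ++= Seq.fill(tok.length - 1)("0") // all but the last character
      out += "1"                            // last character ends the token
      pos = start + tok.length
    }
    out.result()
  }
}
```

Note that whitespace inside a token, such as the space in the string literal "foo bar", is labelled 0 like any other non-final character.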

Supported parser (ast/) output formats, contained in AstNodeFormatter:

These formats are based on the AST-representation extracted by the parser. Formatters act on a list of tokens or on a tree-representation directly.

  • all: Produce output using all formats
  • tokens: The tokens from a given line, linearized (depth-first)

    • public static String s = "foo＀bar" ;
    • 6 10 25 26 12 5 28 21 20 15 9
  • nodeContext-depth (2 features): The token type (context) and depth in the AST

    • ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclaratorId│10 VariableDeclarator│9 Literal│13 FieldDeclaration│7
    • 4 4 28 21 25 22 31
  • anonIdentifiers: Same as tokens but with variable identifiers replaced by a placeholder

    • public static String _VAR = "foo＀bar" ;
    • 7 11 6 4 8 22 5
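The word-based forms of the three AST formats above can be sketched over a flattened list of leaves. This is a guess at the shape of the transformation based only on the example output: the `Leaf` case class is hypothetical (ParseNN works on ANTLR parse trees), treating `VariableDeclaratorId` as the marker for variable identifiers is an assumption, and the separator is assumed to be U+2502.

```scala
object AstFormatSketch {
  // Hypothetical flattened AST leaf: token text, enclosing node type, tree depth.
  final case class Leaf(text: String, context: String, depth: Int)

  private val Magic = '\uFF00' // whitespace placeholder
  private val Sep = "\u2502"   // assumed separator between the two features

  private def escape(s: String): String =
    s.map(c => if (c.isWhitespace) Magic else c)

  // tokens: the token texts with whitespace escaped.
  def tokens(leaves: Seq[Leaf]): Seq[String] =
    leaves.map(l => escape(l.text))

  // nodeContext-depth: node type and depth joined into one word.
  def nodeContextDepth(leaves: Seq[Leaf]): Seq[String] =
    leaves.map(l => s"${l.context}$Sep${l.depth}")

  // anonIdentifiers: like tokens, but variable identifiers become _VAR.
  def anonIdentifiers(leaves: Seq[Leaf]): Seq[String] =
    leaves.map(l => if (l.context == "VariableDeclaratorId") "_VAR" else escape(l.text))

  // The example line, with contexts and depths copied from the output above.
  val example: Seq[Leaf] = Seq(
    Leaf("public", "ClassOrInterfaceModifier", 7),
    Leaf("static", "ClassOrInterfaceModifier", 7),
    Leaf("String", "ClassOrInterfaceType", 9),
    Leaf("s", "VariableDeclaratorId", 10),
    Leaf("=", "VariableDeclarator", 9),
    Leaf("\"foo bar\"", "Literal", 13),
    Leaf(";", "FieldDeclaration", 7)
  )
}
```

Joining the result of each method with spaces gives one line of the corresponding word file for the example statement.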

Extending

To make a new output format for either the tokenization or parsing step, implement an object extending the TokenFormatter or AstNodeFormatter interface. Both specify mkWords and mkInts methods, which return a Seq[String] and a Seq[Int] respectively, representing the words or embedded words for a given line (provided as a String). The mkInts method usually doesn't need to be overridden; by default, it automatically maintains a vocabulary to create matching embedded sequences for the words produced by mkWords.
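The division of labour between mkWords and mkInts can be sketched with a stand-in trait. `FormatterSketch` below is hypothetical and only mirrors the description above; the real TokenFormatter and AstNodeFormatter interfaces may have different signatures and additional members, and this sketch numbers words from 0 in order of first appearance, which may not match ParseNN's index assignment.

```scala
import scala.collection.mutable

// Hypothetical stand-in for ParseNN's formatter interfaces.
trait FormatterSketch {
  // Words for one line of source code.
  def mkWords(line: String): Seq[String]

  // Vocabulary in insertion order: word -> index.
  private val vocab = mutable.LinkedHashMap.empty[String, Int]

  // Default embedding: give each previously unseen word the next free index.
  def mkInts(line: String): Seq[Int] =
    mkWords(line).map(w => vocab.getOrElseUpdate(w, vocab.size))

  // Words in index order, i.e. the would-be lines of the .vocab file.
  def vocabulary: Seq[String] = vocab.keys.toSeq
}

// Example format: one word per character, U for upper case,
// the U+FF00 placeholder for whitespace, L for anything else.
object UpperCaseFlags extends FormatterSketch {
  def mkWords(line: String): Seq[String] =
    line.trim.toSeq.map { c =>
      if (c.isUpper) "U" else if (c.isWhitespace) "\uFF00" else "L"
    }
}
```

Only mkWords is implemented by the concrete object; the inherited mkInts keeps the integer sequence consistent with the words across all lines it has seen.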

Crawling GitHub

A helper script can retrieve git URLs using the GitHub API, for example:

./src/main/bash/crawl-github.bash 10 'language:java' /tmp/projects.txt

will retrieve 10 pages (of 100 entries each) for the language:java query and store the results in /tmp/projects.txt. Results are sorted by stars in descending order.
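The requests behind such a crawl can be sketched as follows. The contents of crawl-github.bash are not shown here, so this is an assumption about what it does rather than a port: the sketch only builds request URLs for the GitHub REST search endpoint for repositories, using its `q`, `sort`, `order`, `per_page` and `page` parameters, and performs no network access. `GithubSearchUrls` and `pageUrls` are hypothetical names.

```scala
import java.net.URLEncoder

object GithubSearchUrls {
  // One URL per result page, 100 entries each, sorted by stars (descending).
  def pageUrls(pages: Int, query: String): Seq[String] = {
    val q = URLEncoder.encode(query, "UTF-8") // e.g. language:java -> language%3Ajava
    (1 to pages).map { p =>
      s"https://api.github.com/search/repositories?q=$q&sort=stars&order=desc&per_page=100&page=$p"
    }
  }
}
```

Calling `GithubSearchUrls.pageUrls(10, "language:java")` gives the ten page URLs corresponding to the command above; extracting the clone URLs from the JSON responses is left out of the sketch.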

License

Copyright 2017 Carol V. Alexandru

Licensed under the Apache License, Version 2.0 (the "License"); You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.