Source Code Analysis and Natural Language Laboratory

Welcome to the SCANL GitHub Organization!

Here, you will find tools and datasets related to the research done by SCANL.

What is SCANL?

SCANL stands for the Source Code and Natural Language Laboratory. We are a diverse team of scientists dedicated to studying the latent connection between source code behavior and the natural language elements used to describe that behavior. Feel free to visit https://www.scanl.org/ to learn more about who is part of the lab and to find out more about our goals and research motivations.

What is in this repository?

We have tools, datasets, and learning/educational resources. We will briefly describe each below, but refer to their individual repositories for more information.

Name: Description
Identifier Name Structure Catalogue: A catalogue of identifier name structures found in code and their significance to program behavior. This catalogue also covers various perspectives on how research literature characterizes identifier name meaning and behavior.
Ensemble Tagger: A part-of-speech tagger designed to work on the specialized phrase structure of identifiers (e.g., variable names).
IDEAL: An identifier name appraisal and recommendation tool.
srcML Identifier Getter Tool: A tool for collecting samples of identifier names from software systems using srcML. It can help you take statistically sound samples for research on identifier names.
Project Sunshine: An implementation of the linguistic anti-patterns, soon to be merged with IDEAL to create a framework for identifier name appraisal and recommendation.
Datasets: The current home of the abbreviation study data set, the grammar patterns data set, and the ensemble tagger train/test data set.

We also host some (potentially modified) tools that other researchers made. SWUM has been modified by us to act primarily as a part-of-speech tagger. POSSE has likewise been modified so that we can use it more easily as a part-of-speech tagger. The Ensemble Tagger (mentioned above) uses it. The other public repositories are student projects, sample code, or miscellaneous tools that we use internally (i.e., not explicitly meant to be easy for others to use).

If you have trouble with any of our tools or datasets, please open an issue! In addition, if you like what we do, leave a star on the project. Stars help us know where to focus our maintenance efforts and what kind of content people want to see most!

Pinned

  1. identifier_name_structure_catalogue (Public)

  2. datasets (Public)

    All datasets curated by SCANL lab

    C++

  3. ensemble_tagger (Public)

    Python

  4. ProjectSunshine (Public)

    Python
