GitHub* Issue Classification Utilizing the end-to-end System Stacks

BY Daniela Plascencia, Cord Meados, Rahul Unnikrish... ON Dec 09, 2019

Introduction

This article describes how to classify GitHub issues using the end-to-end system stacks from Intel. In this scenario, we auto-classify and tag issues using the Deep Learning Reference Stack for deep learning workloads and the Data Analytics Reference Stack for data processing.

Background

In many open source projects, interaction between developers takes place in GitHub repositories. These repositories receive a significant number of contributions in the form of issue reports and pull requests, which in many cases are manually reviewed and approved by the administrators and maintainers of the project. Although not difficult, this task takes time and effort. The scope of this article is GitHub issues, but the content can be adapted for pull requests and for other applications where content analysis and classification are required.

As developers are faced with the challenge of semantically understanding and classifying GitHub issues for projects in production, we propose an automated way to fetch and process data, analyze the issue content and label it using a pipelined process that runs the end-to-end system stacks from Intel.

Software

  • Deep Learning Reference Stack v0.4.0
    • TensorFlow* 1.14
  • Data Analytics Reference Stack v0.1.0
    • Apache Spark*
  • Kubernetes* v1.15
  • Kubeflow* v0.6.1
    • TFJobs

Overview

The GitHub Issue Classification solution can be viewed as a pipeline with different stages using the end-to-end system stacks on all of them.

Before you continue, please make sure you are familiar with the GitHub Issue Classification repository as it is referenced in the following sections.

Use case architecture

The following diagram shows the design of the use case. It comprises three main stages (preprocessing, training, and model serving), each broken down into the smaller tasks required for the whole process. A description of each stage and task is provided below.

Figure 1. GitHub Issue Classification example architecture diagram

Data ingestion and processing

Software/Scripts used in this section:

  • Data Analytics Reference Stack
  • scripts/get-data.sh
  • scripts/proc-data.scala

The first stage in the pipeline shows how data is ingested and processed so it can be consumed by the model in the training stage. For this particular use case, the data coming in is a big multiline JSON file containing actual GitHub issues with various attributes like submitter, issue description, issue label, and date, to name a few. A script that pulls all issues from a repo using GitHub’s API is available for you to use in this step.

The output of the script mentioned above is as follows:

$ get-data.sh
  {
      "url": "https://api.github.com/repos/clearlinux/distribution/issues/1515",
      "repository_url": "https://api.github.com/repos/clearlinux/distribution",
      ...
      "labels": [
          {
              ...
              "name": "package-request",
              ...
          }
      ],
      ...
      "body": "With the official release of Brave 1.0, can you please add a package for this browser\r\n\r\nMore info: https://brave.com/"
  }

This is a trimmed example of what the data looks like. Each issue has several attributes that the model does not need; this is where data preprocessing comes into the picture. Since this is a classification example, we want to match labels with the content of the issue description, so we only need attributes such as the issue labels and the body of the issue (or issue description).

For all data processing, the Data Analytics Reference Stack was used. The proc-data.scala script handles JSON multiline files, formats each attribute in a table, and selects those that are really needed: id, labels, and body. Once the attributes are selected, data is filtered to pick up the top labels because there could be labels that are used just once or twice and might not be useful for the model. The final output is a filtered list of issues with description, ID and label.
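The selection and filtering steps described above can be sketched in plain Python. This is only an illustration of the logic, not the proc-data.scala script itself (which runs on Apache Spark); the function names, the sample issues, and the `top_n` parameter are assumptions made for the example.

```python
from collections import Counter


def select_fields(issue):
    """Keep only the attributes the model needs: id, label names, and body."""
    return {
        "id": issue["id"],
        "labels": [label["name"] for label in issue.get("labels", [])],
        "body": issue.get("body") or "",
    }


def keep_top_labels(issues, top_n):
    """Drop labels outside the top_n most frequent, then drop unlabeled issues."""
    counts = Counter(name for issue in issues for name in issue["labels"])
    top = {name for name, _ in counts.most_common(top_n)}
    filtered = []
    for issue in issues:
        labels = [name for name in issue["labels"] if name in top]
        if labels:
            filtered.append({**issue, "labels": labels})
    return filtered


# Hypothetical sample data shaped like the GitHub API output shown earlier.
raw_issues = [
    {"id": 1, "labels": [{"name": "bug"}], "body": "Crash on boot", "user": "a"},
    {"id": 2, "labels": [{"name": "bug"}, {"name": "rare"}], "body": "Kernel panic"},
    {"id": 3, "labels": [{"name": "package-request"}], "body": "Please add Brave"},
    {"id": 4, "labels": [{"name": "package-request"}], "body": "Please add htop"},
    {"id": 5, "labels": [], "body": "No label here"},
]

selected = [select_fields(issue) for issue in raw_issues]
top_issues = keep_top_labels(selected, top_n=2)
for issue in top_issues:
    print(issue["id"], issue["labels"], issue["body"])
```

In this sample, the label used only once ("rare") is dropped, and the issue without any remaining label is filtered out.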

Model training

Software/Scripts used in this section:

  • Deep Learning Reference Stack
  • python/train.py

This section describes how the Deep Learning Reference Stack was used to train a model using data from the previous stage. This is a multi-class, multi-label classification problem, which means each input sample can belong to one or more classes that are not mutually exclusive. To classify issues correctly, this example first extracts features from the text in the body; that is, it transforms raw text into numerical features, because machine learning algorithms expect fixed-size numerical feature vectors rather than raw data. The output labels are converted to vectors using a multi-label binarizer.
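To make the label conversion concrete, here is a minimal standard-library sketch of what a multi-label binarizer does. It is a stand-in written for illustration (the references list sklearn.preprocessing.MultiLabelBinarizer for this purpose); the function names and sample labels are assumptions.

```python
def fit_classes(label_sets):
    """Collect every distinct label and fix a stable column order."""
    return sorted({label for labels in label_sets for label in labels})


def binarize(label_sets, classes):
    """Turn each list of labels into a 0/1 vector with one column per class."""
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows


# Hypothetical labels for three issues; the second issue carries two labels.
issue_labels = [["bug"], ["bug", "security"], ["package-request"]]
classes = fit_classes(issue_labels)
targets = binarize(issue_labels, classes)

print(classes)
for row in targets:
    print(row)
```

Because an issue can carry several labels, a row may contain more than one 1, which is what distinguishes this encoding from one-hot encoding.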

To extract features from the text, the first step is to convert the raw data into a matrix of TF-IDF features. A TF-IDF score measures how important a word or phrase is to a document, based on how many times it appears in that document (term frequency, or TF) and how common or rare it is across a set of documents (inverse document frequency, or IDF). The TF can be seen as a count from 0 to n, whilst the IDF can be seen as a scale from 0 to 1 where the more common the word, the closer its value is to 0. In this example, words are weighted as they appear in issue descriptions: the highest scores represent the most relevant words, which can be considered keywords.
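As a worked illustration of the idea, the sketch below computes a small TF-IDF matrix by hand. It assumes the plain formula tf × ln(N / df) with whitespace tokenization; this is one common textbook variant, not necessarily the one used in train.py, and libraries typically apply smoothing and normalization on top of it. The sample documents are made up, and the scores are not normalized to a 0 to 1 range.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tf_idf(documents):
    """Return (vocabulary, matrix) where matrix[d][w] = tf * ln(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vocabulary = sorted(doc_freq)
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocabulary}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency within this document
        matrix.append([counts[word] * idf[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical issue descriptions.
docs = [
    "please add package brave",
    "please add package htop",
    "kernel panic please help",
]
vocabulary, matrix = tf_idf(docs)

for word, score in zip(vocabulary, matrix[0]):
    print(f"{word:8s} {score:.3f}")
```

In the first document, "please" appears in every description and therefore scores 0, while "brave" appears in only one and receives the highest weight, which matches the keyword intuition described above.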

After this text feature extraction step, a fully connected neural network is used to model the multi-class, multi-label problem. We use the sigmoid function as the final layer’s activation: each output unit then models the probability of one class independently, as a Bernoulli (single-trial binomial) distribution, so several labels can be predicted for the same issue. The model is compiled and trained using the vectorized input and binarized labels. Once training is done, the model is used to classify issues described in English into one or more of the labels.
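The effect of a sigmoid output layer can be shown with a few lines of plain Python. This is a sketch under assumptions: the class names, the raw output values (logits), and the 0.5 decision threshold are all hypothetical and are not taken from train.py.

```python
import math


def sigmoid(x):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, classes, threshold=0.5):
    """Score each class independently and keep those above the threshold."""
    probabilities = [sigmoid(z) for z in logits]
    return [c for c, p in zip(classes, probabilities) if p >= threshold]


# Hypothetical raw outputs of the final layer for one issue.
classes = ["bug", "package-request", "security"]
logits = [2.0, -3.0, 0.5]

probabilities = [sigmoid(z) for z in logits]
predicted = predict_labels(logits, classes)

print([round(p, 3) for p in probabilities])
print(predicted)
```

Unlike a softmax layer, the probabilities here do not have to sum to 1, so more than one label can pass the threshold for the same issue.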

Network Architecture

There are two options to choose from: a fully connected network (FCN) with four hidden layers, and a network based on a stateless Gated Recurrent Unit (GRU) layer with 64 units, which can be better for short sequences. The advantage of using a recurrent model is that the temporal structure of the sequence is taken into consideration, as the hidden state from the previous time step is used along with the present input for prediction. For a detailed exposition on RNNs, please see [Ref 1]. The FCN is chosen by default. In our tests, the FCN ran faster and gave relatively accurate results.
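To illustrate how a recurrent model combines the previous hidden state with the present input, here is a toy GRU step with scalar weights. It is a sketch for intuition only: a real GRU layer with 64 units uses weight matrices learned during training, the parameter values below are made up, and the gate convention shown (the update gate keeps part of the previous state) is one of the two common formulations.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU time step on scalars: mix the previous state with a candidate."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state sees the input and the (partially reset) previous state.
    h_candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    return z * h_prev + (1.0 - z) * h_candidate


def run_sequence(xs, p, h0=0.0):
    """Feed a sequence one element at a time, carrying the hidden state."""
    h = h0
    for x in xs:
        h = gru_step(x, h, p)
    return h


# Hypothetical parameters; in practice these are learned.
params = {
    "wz": 0.5, "uz": 0.5, "bz": 0.0,
    "wr": 0.5, "ur": 0.5, "br": 0.0,
    "wh": 1.0, "uh": 1.0, "bh": 0.0,
}

h_ab = run_sequence([1.0, -1.0], params)
h_ba = run_sequence([-1.0, 1.0], params)
print(h_ab, h_ba)
```

The two sequences contain the same values in a different order and end in different hidden states, which is the order sensitivity that a bag-of-words FCN input does not capture.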

Figure 2. Fully Connected Network Architecture

 

Figure 3. Recurrent Network Architecture

Model Serving

Software/Scripts used in this section:

  • Deep Learning Reference Stack
  • python/infer.py
  • python/rest.py

The last stage in the workflow runs live inference via a RESTful API that communicates with a local Flask server. The pre-trained model is loaded into the container running the API, and the Flask server sends it the actual GitHub issue descriptions. Next, rest.py handles these calls and passes the data to infer.py, where the pre-trained model is loaded and inference is done. The last step in the stage returns the appropriate labels based on the user’s input.
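The request flow described above (API call, inference, labels returned) can be sketched without any web framework. Everything in this example is hypothetical: the JSON field names, the handler function, and the keyword-based stand-in for the trained model are assumptions for illustration and do not reproduce rest.py or infer.py.

```python
import json


def handle_request(raw_body, classify):
    """Parse a JSON request, run inference, and return (status, JSON body)."""
    payload = json.loads(raw_body)
    description = payload.get("issue", "")
    if not description.strip():
        return 400, json.dumps({"error": "empty issue description"})
    labels = classify(description)
    return 200, json.dumps({"labels": labels})


def keyword_classifier(description):
    """Hypothetical stand-in for the pre-trained model loaded at startup."""
    text = description.lower()
    labels = []
    if "package" in text:
        labels.append("package-request")
    if "crash" in text:
        labels.append("bug")
    return labels


request_body = json.dumps({"issue": "Please add a package for Brave"})
status, body = handle_request(request_body, keyword_classifier)
print(status, body)

bad_status, bad_body = handle_request(json.dumps({"issue": "  "}), keyword_classifier)
print(bad_status, bad_body)
```

Passing the classifier in as an argument keeps the request handling separate from the model, mirroring the split between the API layer and the inference code.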

Orchestration layer

The GitHub Issue Classification example can be run either in containers managed directly by the user or with a container orchestrator like Kubernetes. One of the advantages of Kubernetes is that you can leverage the many available projects for deploying different kinds of workloads. Take Kubeflow, for instance, a project dedicated to making machine learning deployments simple, portable, and scalable without requiring a new service.

Training for this use case is done using TensorFlow, so a Kubeflow TensorFlow training job (TFJob) was used for this stage of the pipeline. A TFJob is a Kubernetes custom resource used to run TensorFlow training jobs, and it can run them in a distributed fashion in a multi-node environment. Deploying a TFJob on Kubernetes is as easy as specifying the container image to use and the workload to run inside the TFJob YAML file. The rest is left to Kubeflow’s tf-operator, which takes care of the business logic behind backend operations such as scaling.

Figure 4. Snippet of a TFJob using Deep Learning Reference Stack for running CNN benchmarks
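As a rough illustration of what such a manifest contains, the sketch below builds a minimal TFJob definition as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The job name, the command path, and the apiVersion are assumptions (the apiVersion in particular depends on the Kubeflow release), and volumes are omitted; the manifest shipped in the repository at k8s/manifests/tf_job_github.yaml is the authoritative version.

```python
import json


def build_tfjob(name, image, command, workers=1, namespace="kubeflow"):
    """Return a minimal TFJob manifest with a single Worker replica spec."""
    return {
        "apiVersion": "kubeflow.org/v1",  # assumption: varies by Kubeflow release
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    # tf-operator looks for a container named "tensorflow"
                                    "name": "tensorflow",
                                    "image": image,
                                    "command": command,
                                }
                            ]
                        }
                    },
                }
            }
        },
    }


manifest = build_tfjob(
    name="github-issue-training",  # hypothetical job name
    image="<your-registry>/stacks-dlrs-mkl-gic:v0.4.1",
    command=["python", "/workdir/training/train.py"],  # hypothetical path
)
print(json.dumps(manifest, indent=2))
```

Raising `workers` above 1 is how a distributed run would be requested in this sketch; the operator then creates one pod per replica.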

An extra layer of orchestration can be added to this use case. As mentioned before, Kubeflow helps make machine learning (ML) deployments on Kubernetes simple and portable. Another component that helps developers achieve this is Kubeflow Pipelines, a platform for building and deploying ML workflows. A pipeline component is self-contained code, packaged as a Docker* image, that performs one step in a pipeline. Each step can be viewed as a set of instructions running a different workload; steps can complement each other, but can also be independent. For instance, if applied in the GitHub Issue Classification use case, you can use a pipeline with a data preprocessing stage, followed by a training stage and a model serving stage.

When combined with the System Stacks, Kubeflow Pipelines provides end-to-end automation of the process, letting data scientists, machine learning engineers, and DevOps engineers focus on their areas of expertise by abstracting away the many configurations and settings it might take to integrate and deploy a machine learning workflow with several components.

For more information and examples using Kubeflow Pipelines and System Stacks, refer to the Kubeflow Pipelines example on GitHub (temporary link): https://github.intel.com/verticals/usecases/tree/master/kubeflow/pipelines/pytorch-mnist

Running the GitHub issue classification example

This section contains a set of instructions to run the GitHub issue classification example using the end-to-end System Stacks. First run the data preprocessing tasks, move on to the training stage, and finally serve your model and use it for inference when an API call is received on the web UI.

Please note there are two ways of running the example, one is running container images using Docker and the other is adding an orchestration layer with Kubernetes. If you are not using Kubernetes/Kubeflow, you can skip those steps.

Prerequisites

  • Docker
  • Git
  • Flask
  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512) capable machine
  • Proxy setup for Docker (if required)
  • Shared network between Docker containers and host PC
  • (Optional) Kubernetes 1.16 and Kubeflow 0.6.0 deployment - For Kubernetes/Kubeflow only

Initial setup

  1. Create a local clone of the system stacks use cases repository:
$ git clone https://github.com/intel/stacks-usecase
  2. Change directory into github-issue-classification:
$ cd stacks-usecase/github-issue-classification

For Kubernetes/Kubeflow only:

Create a persistent volume (PV):

# Fetch all data
$ /bin/bash scripts/get-data.sh

# Load a PV on the kubeflow namespace
$ kubectl apply -f k8s/storage/

# Create needed directories
$ kubectl exec -ti tfevent-pod -n kubeflow -- /bin/bash

# Once inside the PV container
root@tfevent-pod:/# mkdir -p /workdir/data/raw
root@tfevent-pod:/# mkdir -p /workdir/training
root@tfevent-pod:/# exit

# Copy all needed files into the PV
$ kubectl cp python/train.py kubeflow/tfevent-pod:/workdir/training
$ kubectl cp data/raw/all_issues.json kubeflow/tfevent-pod:/workdir/data/raw
$ kubectl cp scripts/proc-data.scala kubeflow/tfevent-pod:/workdir/data

Data preprocess

  1. Run a Data Analytics Reference Stack image on a Docker container:
$ docker run -it --ulimit nofile=1000000:1000000 -v ${PWD}:/workdir clearlinux/stacks-dars-mkl bash
  2. Prepare the Apache Spark environment:
$ cd /workdir
$ mkdir /data
$ mkdir /data/raw
  3. Get the data:
$ /bin/bash /workdir/scripts/get-data.sh
  4. Process the data:
$ spark-shell
scala> :load -v scripts/proc-data.scala

For Kubernetes/Kubeflow only:

Deploy Data Analytics Reference Stack container for pre-processing:

$ kubectl apply -f k8s/manifests/process-data.yaml

Model training

  1. Make sure your working directory is the github-issue-classification directory:
$ pwd
stacks-usecase/github-issue-classification
  2. Run a Deep Learning Reference Stack image on a Docker container:
$ docker run -it -v ${PWD}:/workdir clearlinux/stacks-dlrs-mkl
  3. Navigate to the docker directory and install the requirements:
$ cd /workdir/docker
$ pip install -r requirements_train.txt
  4. Run the training script:
$ mkdir /workdir/models
$ cd /workdir/python
$ python train.py

For Kubernetes/Kubeflow only:

  1. Build a Docker image to be consumed by the TFJob. Note the image should be available at a registry your Kubernetes deployment can access.
$ cd docker/
$ docker build -f Dockerfile.train -t <your-registry>/stacks-dlrs-mkl-gic:v0.4.1 .
$ docker push <your-registry>/stacks-dlrs-mkl-gic:v0.4.1
  2. Edit the TFJob manifest at k8s/manifests/tf_job_github.yaml to match your registry in all fields that apply:
image: <your-registry>/stacks-dlrs-mkl-gic:v0.4.1
  3. Create a TFJob for training the model:
$ cd k8s/
$ kubectl create -f manifests/tf_job_github.yaml

Serving the model

To run inference, we set up a special Dockerfile based on our image. The Dockerfile creates a RESTful API that communicates with a local Flask server to run live inference.

From your local system, navigate to the github-issue-classification folder, where the Dockerfile is stored inside the docker directory, and run:

$ cd stacks-usecase/github-issue-classification/docker
$ make
$ docker run -p 5059:5059 -it github_issue_classifier:latest

The Docker container starts with port 5059 published on the host; leave it running while you use the web UI.

For Kubernetes/Kubeflow only:

  1. Build a Docker image to be consumed by the inference pod. Note the image should be available at a registry that your Kubernetes deployment can access.
$ cd docker
$ docker build -f Dockerfile -t <your-registry>/stacks-dlrs-mkl-inference:v0.4.1 ..
$ docker push <your-registry>/stacks-dlrs-mkl-inference:v0.4.1
  2. Edit the inference pod manifest at k8s/manifests/inference-pod.yaml to match your registry in all fields that apply:
image: <your-registry>/stacks-dlrs-mkl-inference:v0.4.1
  3. Create an inference pod:
$ cd k8s
$ kubectl create -f manifests/inference-pod.yaml

Run the web UI

Create a Flask server on your local system with the following command:

$ cd stacks-usecase/github-issue-classification/website
$ flask run

You’re all set up! Open any web browser and navigate to localhost:5000 to see an interactive example. Copy or type any issue into the top left box, and click submit. The Flask server calls the REST API, which processes your input and returns the appropriate labels.

Summary

The System Stacks enable and integrate different components into highly performant stacks that are ready to be used, as shown in this example with the Deep Learning Reference Stack and the Data Analytics Reference Stack. After trying the GitHub issue classification example and seeing how easy it is to run machine learning and data processing workloads, we would like your feedback. Your contributions are also welcome.

For reference and troubleshooting, please refer to the use case repository.

Submit your comments at the following email: stacks@lists.01.org

Report issues with this use case at: https://github.com/intel/stacks-usecase

Report issues related to the Docker images at: https://github.com/intel/stacks

References

  1. Understanding LSTM Networks
  2. sklearn.preprocessing.MultiLabelBinarizer
  3. Deep Learning Reference Stack
  4. Data Analytics Reference Stack
  5. Overview of Kubeflow Pipelines
  6. MNIST Kubeflow Pipelines on Deep Learning Reference Stack