Question Relatedness on Stack Overflow: The Task, Dataset, and Corpus-inspired Models

AAAI-2019
The Thirty-Third AAAI Conference on Artificial Intelligence


INTRODUCTION

A dataset for predicting the relatedness of more than 300K knowledge unit pairs on Stack Overflow. In addition to natural language text, the dataset contains programming language content (i.e., code snippets) and tags, whose usefulness for relatedness prediction can be evaluated in future research.

A question together with all of its answers is referred to as a knowledge unit (KU). There are four classes of relatedness between two knowledge units: duplicate, direct, indirect, and isolated.
They are defined as follows:

Link Type (Class) | Description
Duplicate | Two knowledge units discuss the same question in different ways, and can be answered by the same answer.
Direct | One knowledge unit can help solve the problem in the other knowledge unit, for example, by explaining certain concepts, providing examples, or covering a sub-step for solving a complex problem.
Indirect | One knowledge unit provides related information, but it does not directly answer the question in the other knowledge unit.
Isolated | The two knowledge units are not semantically related.
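
For classification experiments, the four classes can be represented as a small enumeration. The following is a minimal sketch: the `Relatedness` and `parse_class` names and the integer codes are illustrative, and it assumes the labels are spelled like the class names listed above (case and surrounding whitespace are ignored).

```python
from enum import Enum


class Relatedness(Enum):
    """The four link types, listed from most to least related (codes are arbitrary)."""
    DUPLICATE = 0
    DIRECT = 1
    INDIRECT = 2
    ISOLATED = 3


def parse_class(label: str) -> Relatedness:
    """Map a raw label such as 'Duplicate' or ' direct ' to an enum member."""
    # Assumption: labels differ from the member names only in case/whitespace.
    return Relatedness[label.strip().upper()]
```

An unknown label raises `KeyError`, which makes unexpected values in the data visible early.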

STRUCTURE

Each pair stores the same attributes for both knowledge units. An entry such as "2/13 | q1/2_Id" means attribute 2 is q1_Id (for KU1) and attribute 13 is q2_Id (for KU2).

Attr. Id | Attr. Name | Attr. Description
1 | Id | Id of the KU pair (<KU1, KU2>)
2/13 | q1/2_Id | Id of KU's question on SO
3/14 | q1/2_Title | KU's title
4/15 | q1/2_Body | Text of KU's body (excluding code snippets)
5/16 | q1/2_BodyCode | Code snippets in KU's body
6/17 | q1/2_AcceptedAnswerId | Id of KU's accepted answer on SO
7/18 | q1/2_AcceptedAnswerBody | Text of KU's accepted answer (excluding code snippets)
8/19 | q1/2_AcceptedAnswerCode | Code snippets in KU's accepted answer
9/20 | q1/2_AnswersIdList | Ids of KU's answers on SO
10/21 | q1/2_AnswersBody | Text of KU's answers (excluding code snippets)
11/22 | q1/2_AnswersCode | Code snippets in KU's answers
12/23 | q1/2_Tags | Tags of KU
24 | Class | Relationship (i.e., duplicate, direct, indirect, or isolated)
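
The attribute layout above can be expanded into a flat list of 24 column names, which is handy for checking a loaded file against the documented schema. This is a sketch under one assumption: that "q1/2_" in the table expands to the literal prefixes `q1_` and `q2_`. The names `KU_FIELDS` and `expected_columns` are illustrative.

```python
# Per-KU attribute suffixes, in table order (attributes 2-12 for KU1, 13-23 for KU2).
KU_FIELDS = [
    "Id", "Title", "Body", "BodyCode",
    "AcceptedAnswerId", "AcceptedAnswerBody", "AcceptedAnswerCode",
    "AnswersIdList", "AnswersBody", "AnswersCode", "Tags",
]


def expected_columns():
    """Build the 24 attribute names: pair Id, 11 fields per KU, then Class."""
    columns = ["Id"]
    for prefix in ("q1", "q2"):  # assumed expansion of "q1/2"
        columns.extend(f"{prefix}_{field}" for field in KU_FIELDS)
    columns.append("Class")
    return columns


COLUMNS = expected_columns()
```

Comparing `COLUMNS` with the header of the downloaded file is a quick way to confirm whether the assumed prefixes match the actual data.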

DOWNLOAD

Format: CSV
Load script:

    import pandas as pd

    file_path = "file.csv"  # placeholder: path to the downloaded CSV file
    df = pd.read_csv(file_path, sep=",", header=0, lineterminator="\n")
    for index, row in df.iterrows():
        print(row)

Source: [DOWNLOAD[448MB]]

Format: MySQL
Load script:

    mysql -u username -p database_name < file.sql

Source: [DOWNLOAD[3GB]]
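
Because the CSV file is several hundred megabytes, it can be convenient to stream it row by row instead of loading everything at once. The sketch below counts how many pairs fall into each class using only the standard library. It assumes the file is UTF-8 encoded and has a header row containing a `Class` column; the function name `count_classes` is illustrative.

```python
import csv
import sys
from collections import Counter


def count_classes(path, label_column="Class"):
    """Stream the CSV row by row and count how often each label occurs."""
    # Bodies and code snippets can be long, so raise the per-field size limit.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    counts = Counter()
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            counts[row[label_column]] += 1
    return counts
```

Calling `count_classes("file.csv")` (with the placeholder replaced by the real path) returns a `Counter` keyed by the label strings found in the file.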

STATISTIC

Scope | Indicator | Size
Whole KU | # of distinct KUs | 160,161
Whole KU | # of KU pairs (all four types) | 347,372
Title | avg. # of words in a title | 8.52
Body | avg. # of words in a body (excluding code snippets) | 97.02
Body | # of distinct KUs whose body has at least one code snippet | 117,139 (73%)
Body | avg. # of code snippets in one body | 1.46
Body | avg. # of words in a single code snippet in one body | 118.46
Answers | # of distinct answers | 318,491
Answers | avg. # of answers within a single KU | 1.99
Answers | # of distinct KUs containing at least one answer | 140,122 (87%)
Answers | # of distinct KUs containing an accepted answer | 90,672 (57%)
Answers | # of distinct KUs whose answers have at least one code snippet | 96,707 (60%)
Answers | avg. # of words in an answer (excluding code snippets) | 68.39
Answers | avg. # of code snippets within one answer | 0.60
Answers | avg. # of words in a single code snippet | 81.98
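
The percentages and the average number of answers in the table are consistent with the raw counts, taking the 160,161 distinct KUs as the denominator. The short check below reproduces them; the constant and function names are illustrative, and the counts are copied from the table.

```python
# Counts copied from the statistics table.
TOTAL_KUS = 160_161
TOTAL_ANSWERS = 318_491
KUS_WITH_BODY_CODE = 117_139
KUS_WITH_ANSWER = 140_122
KUS_WITH_ACCEPTED_ANSWER = 90_672
KUS_WITH_ANSWER_CODE = 96_707


def share(count, total=TOTAL_KUS):
    """Return count/total as a percentage rounded to the nearest integer."""
    return round(100 * count / total)


avg_answers_per_ku = round(TOTAL_ANSWERS / TOTAL_KUS, 2)

print(avg_answers_per_ku)               # 1.99
print(share(KUS_WITH_BODY_CODE))        # 73
print(share(KUS_WITH_ANSWER))           # 87
print(share(KUS_WITH_ACCEPTED_ANSWER))  # 57
print(share(KUS_WITH_ANSWER_CODE))      # 60
```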