A dataset for predicting relatedness between more than 300K pairs of knowledge units on Stack Overflow. In addition to natural language text, the dataset contains programming language content (i.e., code snippets) and tags, whose usefulness for predicting relatedness can be evaluated in future research.
A question together with all of its answers is referred to as a knowledge unit (KU).
There are four classes of relatedness between two knowledge units: duplicate, direct, indirect, and isolated.
Definitions are as follows:

| Attr. Id | Attr. Name | Attr. Description |
|---|---|---|
| 1 | Id | KU pair (< KU1, KU2 >) Id |
| 2/13 | q1/2_Id | Id of the KU's question on SO |
| 3/14 | q1/2_Title | The KU's title |
| 4/15 | q1/2_Body | The text of the KU's body (excluding code snippets) |
| 5/16 | q1/2_BodyCode | Code snippets in the KU's body |
| 6/17 | q1/2_AcceptedAnswerId | Id of the KU's accepted answer on SO |
| 7/18 | q1/2_AcceptedAnswerBody | The text of the KU's accepted answer (excluding code snippets) |
| 8/19 | q1/2_AcceptedAnswerCode | Code snippets in the KU's accepted answer |
| 9/20 | q1/2_AnswersIdList | Ids of the KU's answers on SO |
| 10/21 | q1/2_AnswersBody | The text of the KU's answers (excluding code snippets) |
| 11/22 | q1/2_AnswersCode | Code snippets in the KU's answers |
| 12/23 | q1/2_Tags | Tags of the KU |
| 24 | Class | Relationship (i.e., duplicate, direct, indirect, or isolated) |
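
The attribute layout above can be sketched in Python as follows. This is only an illustration: it assumes that each `q1/2_<Name>` entry expands to two columns named `q1_<Name>` and `q2_<Name>`, which should be checked against the actual file header. The helper names `expected_columns` and `split_pair` are hypothetical and not part of the dataset.

```python
# Sketch only: column names are ASSUMED by expanding "q1/2_<Name>" into
# "q1_<Name>" and "q2_<Name>"; verify them against the real file header.

# Per-KU attribute suffixes, in the order of the attribute table
# (attribute ids 2-12 for the first KU, 13-23 for the second KU).
KU_FIELDS = [
    "Id", "Title", "Body", "BodyCode",
    "AcceptedAnswerId", "AcceptedAnswerBody", "AcceptedAnswerCode",
    "AnswersIdList", "AnswersBody", "AnswersCode", "Tags",
]

# The four relatedness classes named in the description.
RELATEDNESS_CLASSES = ("duplicate", "direct", "indirect", "isolated")


def expected_columns():
    """Return the 24 assumed column names in attribute-id order."""
    columns = ["Id"]  # attribute 1: id of the KU pair
    for prefix in ("q1", "q2"):
        columns.extend(f"{prefix}_{field}" for field in KU_FIELDS)
    columns.append("Class")  # attribute 24: relatedness label
    return columns


def split_pair(record):
    """Split one pair record (a dict keyed by column name) into two KUs and a label."""
    ku1 = {field: record.get(f"q1_{field}") for field in KU_FIELDS}
    ku2 = {field: record.get(f"q2_{field}") for field in KU_FIELDS}
    label = record.get("Class")
    if label not in RELATEDNESS_CLASSES:
        raise ValueError(f"unexpected relatedness class: {label!r}")
    return ku1, ku2, label
```

Keeping the two KUs as separate dictionaries makes it straightforward to feed text, code snippets, and tags to a model as separate inputs.
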

| Scope | Indicator | Size |
|---|---|---|
| Whole KU | # of distinct KUs | 160,161 |
| | # of KU pairs (all four types) | 347,372 |
| Title | avg. # of words in a title | 8.52 |
| Body | avg. # of words in a body (excluding code snippets) | 97.02 |
| | # of distinct KUs whose body has at least one code snippet | 117,139 (73%) |
| | avg. # of code snippets in one body | 1.46 |
| | avg. # of words in a single code snippet in one body | 118.46 |
| Answers | # of distinct answers | 318,491 |
| | avg. # of answers within a single KU | 1.99 |
| | # of distinct KUs that contain at least one answer | 140,122 (87%) |
| | # of distinct KUs that contain an accepted answer | 90,672 (57%) |
| | # of distinct KUs whose answers have at least one code snippet | 96,707 (60%) |
| | avg. # of words in an answer (excluding code snippets) | 68.39 |
| | avg. # of code snippets within one answer | 0.60 |
| | avg. # of words in a single code snippet | 81.98 |
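
As a quick consistency check, the derived figures in the statistics can be recomputed from the raw counts. The sketch below assumes that each percentage is taken relative to the number of distinct KUs and that the average number of answers is the number of distinct answers divided by the number of distinct KUs; the reported values appear consistent with that reading, but the source does not state it explicitly.

```python
# Raw counts copied from the statistics above.
DISTINCT_KUS = 160_161
DISTINCT_ANSWERS = 318_491

# KU counts together with the percentage reported next to each of them.
KU_COUNTS = {
    "body has at least one code snippet": (117_139, 73),
    "contains at least one answer": (140_122, 87),
    "contains an accepted answer": (90_672, 57),
    "answers have at least one code snippet": (96_707, 60),
}


def percent_of_kus(count):
    """Share of distinct KUs, rounded to a whole percent (assumed denominator)."""
    return round(100 * count / DISTINCT_KUS)


# Assumed definition: total distinct answers divided by distinct KUs.
avg_answers_per_ku = round(DISTINCT_ANSWERS / DISTINCT_KUS, 2)
print(f"avg. answers per KU: {avg_answers_per_ku}")

for name, (count, reported) in KU_COUNTS.items():
    computed = percent_of_kus(count)
    status = "matches" if computed == reported else "differs from"
    print(f"{name}: {computed}% ({status} reported {reported}%)")
```
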