This dataset, provided by CiteSeerX, contains 2,118,122 academic research papers. Each paper has two associated files:
All papers are stored in hierarchical directories according to their unique document IDs (XML and TXT files are stored separately). We packed the dataset into several compressed archives: a single tar.gz file containing all TXT documents (43.97 GB) and 4 ZIP files containing all XML documents (12.07 GB). These files can be downloaded via the following link: Download the complete CiteSeerX dataset.
We also built a full-text search engine for the CiteSeerX dataset; click here to jump to our search engine. We use Lucene to index the first 400 terms of each full text (so that potential titles in the front part of the full text can be searched), together with the CiteSeerX-extracted title, authors, and document ID. For each search result, we also provide a link to download the original full-text file.
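The "first 400 terms" idea can be illustrated with a few lines of Python (a sketch only; the actual index is built with Lucene, and the function name here is ours):

```python
def first_n_terms(text, n=400):
    """Return the first n whitespace-separated terms of a full text.

    Mirrors the idea of indexing only the front part of each document,
    where the title is most likely to appear.
    """
    return text.split()[:n]

# Hypothetical full-text fragment for illustration.
sample = "Deep Learning for Citation Matching . Abstract We study ..."
terms = first_n_terms(sample, n=5)
# terms == ["Deep", "Learning", "for", "Citation", "Matching"]
```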
The most important metadata CiteSeerX provides is the citation string. In each paper's XML file there are multiple "citation" tags, and each "citation" tag records one reference (one cited paper). The following figure shows one "citation" tag:
In each "citation" tag, the "raw" tag contains the originally extracted citation string, i.e., all the information the citing paper listed in its references. The other tags hold the classified metadata (title, authors, etc.) of the cited paper, parsed from the "raw" citation string. Based on our observation, the accuracy of this classified metadata is not good, as the example citation tag in the figure shows. However, the raw citation strings are well extracted, since it is much easier to separate individual citation strings in the references section of the original paper.
Note: the "paperid" tag refers to the unique ID of the paper that this XML file belongs to, not to the paper cited in this "citation" tag.
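A "citation" tag can be parsed with Python's standard library. The snippet below is a sketch over a hypothetical minimal XML fragment that follows the tag names described above ("citation", "raw", "paperid"); real CiteSeerX files contain more fields:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal fragment; IDs and strings are invented.
xml_doc = """
<document>
  <citation>
    <paperid>10.1.1.1.1111</paperid>
    <raw>J. Doe. An example paper. In Proc. EX, 2001.</raw>
    <title>An example paper</title>
  </citation>
  <citation>
    <paperid>10.1.1.1.1111</paperid>
    <raw>A. Roe. Another paper. Journal X, 2002.</raw>
    <title>Another paper</title>
  </citation>
</document>
"""

def extract_raw_citations(xml_text):
    """Collect the raw citation string from every <citation> tag."""
    root = ET.fromstring(xml_text)
    return [c.findtext("raw") for c in root.iter("citation")]

raws = extract_raw_citations(xml_doc)
# raws[0] == "J. Doe. An example paper. In Proc. EX, 2001."
```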
We build the citation graph based on the title, author, and citation string metadata provided by the CiteSeerX XML files. However, based on our observation, a large fraction (around 40%) of the CiteSeerX title metadata is inaccurate. Papers with incorrect title metadata need to be excluded from the citation graph. The following procedure illustrates how we construct the CiteSeerX citation graph.
If a paper has N citation strings, then this paper will have N different [DOI <--> ci] elements in the Lucene index. The source code for citation index construction can be downloaded here.
Wrong title training set: download here. Correct title training set: download here.
The source code for citation graph construction can be downloaded here. As a result, our CiteSeerX citation graph has 10,595,956 edges, and 1,286,659 out of 2,118,122 (60.75%) papers are connected in the graph (i.e., have at least one in-edge or out-edge). The citation graph file can be downloaded here (71 MB). In the file, each line is a citation relation: two paper IDs (CiteSeerX DOIs) separated by "\t". The table below lists the top in-degree papers:
|Title||In-degree||Out-degree||CiteSeerX DOI|
|Maximum likelihood from incomplete data via the EM algorithm||7665||0||10.1.1.133.4884|
|Knowledge discovery in databases||5804||21||10.1.1.54.9108... (3 duplicates)|
|A tutorial on hidden markov models and selected applications in speech recognition||4589||2||10.1.1.131.2084|
|Chord: A scalable peer-to-peer lookup service for internet applications||4044||21||10.1.1.135.7635... (13 duplicates)|
|Graph-Based Algorithms for Boolean Function Manipulation||4030||1||10.1.1.35.8734... (5 duplicates)|
|Congestion Avoidance and Control||3717||4||10.1.1.88.1484... (2 duplicates)|
|Optimization by simulated annealing||3624||0||10.1.1.123.7607|
|Induction of Decision Trees||3577||4||10.1.1.220.1843... (4 duplicates)|
|Fast Algorithms for Mining Association Rules||3398||20||10.1.1.40.7506... (4 duplicates)|
|The Anatomy of a Large-Scale Hypertextual Web Search Engine||3361||13||10.1.1.117.3693... (5 duplicates)|
|A Scalable Content-Addressable Network||3352||13||10.1.1.42.3243... (3 duplicates)|
|Object-oriented Software Construction||3262||0||10.1.1.105.3673|
|New Directions in Cryptography||3242||1||10.1.1.37.9720|
|Snakes: Active contour models||3027||1||10.1.1.124.5318|
|A method for obtaining digital signatures and public-key cryptosystems||2994||2||10.1.1.86.2023... (3 duplicates)|
|Authoritative Sources in a Hyperlinked Environment||2868||39||10.1.1.62.751... (5 duplicates)|
|Building a Large Annotated Corpus of English: The Penn Treebank||2797||6||10.1.1.115.4365|
|Ad-hoc on-demand distance vector routing||2781||13||10.1.1.96.6641... (2 duplicates)|
|MPI: A message passing interface||2775||0||10.1.1.52.5877|
|Distinctive image features from scale-invariant keypoints||2760||34||10.1.1.14.4931... (4 duplicates)|
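In-degree statistics like the table above can be recomputed from the downloadable citation graph file with a short script. This is a sketch assuming each line is "citing-DOI \t cited-DOI", as described above; the tiny in-memory graph stands in for the real 10,595,956-line file:

```python
from collections import Counter
import io

def top_indegree(edge_lines, k=3):
    """Count in-degree per paper from 'citing\tcited' lines and
    return the k most-cited paper IDs with their counts."""
    indeg = Counter()
    for line in edge_lines:
        line = line.strip()
        if not line:
            continue
        citing, cited = line.split("\t")
        indeg[cited] += 1
    return indeg.most_common(k)

# Tiny hypothetical graph: A cites B; C cites B and A.
edges = io.StringIO("A\tB\nC\tB\nC\tA\n")
print(top_indegree(edges, k=2))  # [('B', 2), ('A', 1)]
```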
DBLP is a computer science bibliography. It originated as a metadata database for scientific literature and has existed since at least the 1980s. Today, DBLP lists millions of computer science research papers published in all major computer science journals and conferences. The DBLP dataset is available for download on their official website: it is one huge XML file containing the metadata of 2,797,143 academic research papers. You can click here to download the newest DBLP dataset.
Or click here to download the DBLP titles we extracted (200 MB, containing 2,797,143 titles).
|Main Paper Types||Metadata|
|InProceedings (Conference Papers) (1,553,584 documents)||title, author, pages, year, booktitle, crossref, url|
|Article (Journal Papers)(1,200,411 documents)||title, author, pages, year, journal, volume, number, url|
|Book (11,289 documents)||title, author, pages, year, publisher, series, volume, ISBN, url|
|PhDThesis (6,956 documents)||title, author, year, school, pages, ISBN, url|
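Because dblp.xml is one huge file, it is typically processed with a streaming parser rather than loaded whole. The sketch below uses Python's `iterparse` on a tiny stand-in document covering two of the record types listed above (note: the real dblp.xml declares character entities in its DTD, so a DTD-aware setup is needed in practice):

```python
import xml.etree.ElementTree as ET
import io

# Tiny stand-in for dblp.xml; keys, names, and titles are invented.
dblp_xml = """<dblp>
  <inproceedings key="conf/ex/Doe07">
    <author>J. Doe</author>
    <title>An Example Paper</title>
    <year>2007</year>
  </inproceedings>
  <article key="journals/ex/Roe08">
    <author>A. Roe</author>
    <title>Another Example</title>
    <year>2008</year>
  </article>
</dblp>"""

def extract_titles(source, kinds=("inproceedings", "article", "book", "phdthesis")):
    """Stream-parse a DBLP-style XML file, yielding (type, title) pairs.

    iterparse avoids loading the multi-gigabyte file into memory;
    elem.clear() releases each record after it has been read.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in kinds:
            yield elem.tag, elem.findtext("title")
            elem.clear()

titles = list(extract_titles(io.StringIO(dblp_xml)))
# [('inproceedings', 'An Example Paper'), ('article', 'Another Example')]
```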
The following table compares the CiteSeerX and DBLP datasets.
|Aspect||DBLP||CiteSeerX|
|Paper Areas||Computer Science||All Areas|
|Data Collection||Manually Collected||Crawled|
|Data Richness||Major metadata (title, author, venue, pages, year, etc.)||Major metadata, full paper text|
|Metadata Correctness||High (manually maintained)||Low (Automatically parsed)|
|Citation||No||Yes (provide extracted citation strings)|
ArnetMiner provides the DBLP citation network extracted from DBLP, ACM, and other sources. As of now, their newest version (V7) contains 2,244,021 DBLP papers and 4,354,534 citation relationships; each paper is labeled with a unique index ID. However, in their graph, only 781,108 (36.39%) nodes are connected. The graph can be downloaded via this page. The file includes not only the citation information but also the metadata of each paper. The format of each paper is:
We also extracted the citation relations from the Aminer-provided metadata and stored them in a file (download). Each line in the file is an edge: (citing paper ID \t cited paper ID).
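A connectivity figure like the one above can be recomputed from such an edge file. The sketch below counts how many nodes appear as an endpoint of at least one edge (the IDs in the example are hypothetical):

```python
import io

def connected_fraction(edge_lines, total_nodes):
    """Return (connected_count, fraction), where a node counts as
    connected if it appears as either endpoint of some edge."""
    connected = set()
    for line in edge_lines:
        line = line.strip()
        if not line:
            continue
        citing, cited = line.split("\t")
        connected.add(citing)
        connected.add(cited)
    return len(connected), len(connected) / total_nodes

# Hypothetical 4-node graph in which node 4 has no edges.
edges = io.StringIO("1\t2\n3\t2\n")
count, frac = connected_fraction(edges, total_nodes=4)
# count == 3, frac == 0.75
```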
The intersection of CiteSeerX and DBLP can be treated as CS papers; the remaining CiteSeerX papers can be treated as non-CS papers.
We computed the intersection of CiteSeerX and DBLP and stored the merged dataset in a single TXT file (Download here (58.1 MB)). Each line represents a paper; in total, 504,040 DBLP records and 649,296 CiteSeerX papers are matched. Among these matches, 333,312 have a one-to-one unique mapping between DBLP and CiteSeerX. We also searched our merged dataset in the Aminer dataset: for every paper in our merged dataset that can be found in the Aminer dataset, we add the paper's Aminer ID. 464,386 (92.13%) of the DBLP records in our merged dataset are found in Aminer. The following table shows an example of a matched paper:
|Paper Type (from DBLP)||Title (from DBLP)||Authors (from DBLP)||Year (from DBLP)||Venue ID (from DBLP)||Venue Name (from DBLP)||Incremental Search Flag||Matched CiteSeerX paper ID||Index ID (from Aminer)|
|inproceedings||unifying data and domain knowledge using virtual views||lim,wang,wang||2007||conf/vldb/2007||vldb||1, 0||10.1.1.101.6030||644171|
Our algorithm searches for each DBLP paper's title in the full text of the CiteSeerX papers. If any CiteSeerX paper's full text contains the DBLP title as a substring, then, based on the number of matches, we execute an incremental search to refine the match results.
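The core matching step can be sketched as a normalized substring test (illustrative only: the normalization shown is a simple stand-in for the tokenization described below, and the corpus and IDs are hypothetical):

```python
import re

def normalize(text):
    """Lowercase and keep only alphanumeric tokens (a simple stand-in
    for the full normalization pipeline)."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def title_matches(dblp_title, citeseerx_fulltexts):
    """Return IDs of CiteSeerX papers whose normalized full text
    contains the normalized DBLP title as a substring."""
    needle = normalize(dblp_title)
    return [doc_id for doc_id, text in citeseerx_fulltexts.items()
            if needle in normalize(text)]

# Hypothetical mini-corpus keyed by invented CiteSeerX DOIs.
corpus = {
    "10.1.1.0.0001": "Unifying Data and Domain Knowledge Using Virtual Views. Abstract ...",
    "10.1.1.0.0002": "A completely different paper about networks.",
}
hits = title_matches("unifying data and domain knowledge using virtual views", corpus)
# hits == ["10.1.1.0.0001"]
```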
Indexed fields (for each paper) include:
Our index-building Python code can be downloaded here. The size of the indexed file is 9.78 GB. We use the NLTK Regular Expression Tokenizer to normalize the full text: remove all characters that are not English letters or digits (e.g. "-", "?", "/", ",") and lowercase all English words. The regular expression we used is:
Note: we use the first N terms instead of the first N lines because some full texts are split into a large number of lines (a single line in the original PDF file can even be split into multiple lines in the TXT file, though such cases are rare). We think it is better to extract the first N terms to ensure an adequate amount of text.
We have already found the node intersection between DBLP and CiteSeerX (504,040 DBLP nodes and 649,296 CiteSeerX nodes). By comparing the two citation graphs, we can further find the edge intersection between these two datasets.
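Given the one-to-one ID mapping from the merged dataset, the edge intersection can be sketched as follows (a minimal illustration; the IDs and the mapping in the example are hypothetical):

```python
def edge_intersection(dblp_edges, csx_edges, dblp_to_csx):
    """Map DBLP citation edges into CiteSeerX ID space via the
    one-to-one merge mapping, then intersect with CiteSeerX edges.

    Edges whose endpoints were not matched in the merge are dropped.
    """
    mapped = {
        (dblp_to_csx[u], dblp_to_csx[v])
        for u, v in dblp_edges
        if u in dblp_to_csx and v in dblp_to_csx
    }
    return mapped & set(csx_edges)

# Hypothetical mapping and edge lists for illustration only.
mapping = {"d1": "10.1.1.0.0001", "d2": "10.1.1.0.0002"}
dblp_e = [("d1", "d2"), ("d1", "d3")]  # d3 was not matched
csx_e = [("10.1.1.0.0001", "10.1.1.0.0002"),
         ("10.1.1.0.0009", "10.1.1.0.0001")]
common = edge_intersection(dblp_e, csx_e, mapping)
# common == {("10.1.1.0.0001", "10.1.1.0.0002")}
```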