The computational problem targeted by CDC is to automatically generate complete citations for general queries over evolving data sources represented by diverse data models. The aim of this research program is to design the first well-founded model for citing data, and to develop efficient algorithms and a solid citation system. This research program is timely because the paradigm shift towards data-intensive science is happening now, and scientific communication must adapt as quickly as possible to the new ways in which science progresses. It is ambitious because it shapes a new field in computer science and tackles, with a uniform approach, a range of computational issues, query languages and data models that have never been treated with a shared vision before. The broader impact of this research will be on scientists and data centers that curate, elaborate and publish data, on government agencies that direct research investments, and on research performance measures (e.g., the h-index), which will be based on data as well as on text-based contributions.
A. Alawini, S. B. Davidson, G. Silvello, V. Tannen, Y. Wu (2018). CDC: Data Citation: A New Provenance Challenge. Bulletin of the Technical Committee on Data Engineering, 41(1): 27–38. IEEE Computer Society. ISSN 1053-1238.
M. Agosti, N. Ferro, and G. Silvello (2018). Digital Libraries: From Digital Resources to Challenges in Scientific Data Sharing and Re-Use. A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, vol. 31, pp. 27–41. Springer Berlin-Heidelberg, ISBN 978-3-319-61892-0.
O. Alonso and G. Silvello (2018). DESIRES: Design of Experimental Search & Information Retrieval Systems. Proceedings of the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems, CEUR Workshop Proceedings 2167. Bertinoro, Italy, August 28-31, 2018.
G. Silvello, R. Bucco, G. Busato, G. Fornari, A. Langeli, A. Purpura, G. Rocco, A. Tezza, and M. Agosti (2018). Statistical Stemmers: A Reproducibility Study. 40th European Conference on Information Retrieval (ECIR 2018), Lecture Notes in Computer Science (LNCS) 10772, pp. 385–397. Springer. Best paper award.
Y. Wu, A. Alawini, S. B. Davidson and G. Silvello (2018). Data Citation: Giving Credit Where Credit is Due. SIGMOD Conference 2018, pages 99–114, ACM.
N. Ferro, G. Silvello, E. Bruelink, B. Doubrov, A. Fresa, M. Geber, K. Jadeglans, B. Justrell, B. Lemmens, J. Martinez, V. Munnoz, S. Oliveras, C. Prandoni, D. Rice, S. Rohde-Enslin, X. Terres, E. Verbruggen, B. Yousefi and C. Wilson (2018). Evaluation of Conformance Checkers for Long-Term Preservation of Multimedia Documents. 2018 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2018, pages 145–154, IEEE Computer Society.
M. Agosti, G. M. Di Nunzio, N. Ferro and G. Silvello (2018). Thirty Years of Digital Libraries Research at the University of Padua: The Systems Side. 14th Italian Research Conference on Digital Libraries (IRCDL 2018), Communications in Computer and Information Science (CCIS) 806, pp. 30–41. Springer, Heidelberg, Germany.
A. Purpura (2018). Non-negative Matrix Factorization for Topic Modeling. Design of Experimental Search & Information REtrieval Systems (DESIRES 2018), CEUR Workshop Proceedings 2167.
D. Dosso (2018). Keyword Search on RDF Graphs. Design of Experimental Search & Information REtrieval Systems (DESIRES 2018), CEUR Workshop Proceedings 2167.
S. Marchesin (2018). Implicit-Explicit Representations for Case-Based Retrieval. Design of Experimental Search & Information REtrieval Systems (DESIRES 2018), CEUR Workshop Proceedings 2167.
M. Agosti, E. Fabris and G. Silvello (2019). On Synergies between Information Retrieval and Digital Libraries. 15th Italian Research Conference on Digital Libraries (IRCDL 2019), Communications in Computer and Information Science book series (CCIS, volume 988), Springer, Heidelberg, Germany.
D. Dosso, G. Setti and G. Silvello (2019). Learning to Cite: Transfer Learning for Digital Archives. 15th Italian Research Conference on Digital Libraries (IRCDL 2019), Communications in Computer and Information Science book series (CCIS, volume 988), Springer, Heidelberg, Germany.
D. Dosso and G. Silvello (2019). A Scalable Virtual Document-Based Keyword Search System for RDF Datasets. 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), ACM Press.
D. Dosso (2019). Keyword Search on RDF Datasets. Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Lecture Notes in Computer Science 11438, pp. 332-336, Springer. Best doctoral consortium paper award.
M. Agosti, G. M. Di Nunzio, S. Marchesin and G. Silvello (2019). Medical Retrieval using Structured Information Extracted from Knowledge Bases. 27th Italian Symposium on Advanced Database Systems (SEBD 2019).
You can browse the software at
http://ims-svn.dei.unipd.it/repos/datacitation/
Username: guest - Password: guest
You can check it out using Subversion
$ svn checkout --username guest --password guest http://ims-svn.dei.unipd.it/repos/datacitation/ datacitation
Documentation
The JavaDoc is available at the URL:
http://www.dei.unipd.it/~silvello/datacitation/learningtocite
We built the experimental collection from the Library of Congress digital finding aids collection, encoded in the EAD format, which is publicly available at the following URL: http://findingaids.loc.gov/.
To build the training and validation set, we selected 25 EAD files at random and, from each of these files, randomly extracted 4 citable units, obtaining a set of 100 XPaths that identify an equal number of distinct citable units. For each citable unit (i.e., XML element), we manually created a human-readable citation to be used to train the citation system and a machine-readable citation to build the ground truth used for validation.
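As an illustration of how citable units can be addressed by XPath, the following minimal sketch lists one absolute, positional XPath per XML element of a document. It is not the code used to build the collection: the function name element_xpaths and the tiny SAMPLE fragment are hypothetical, the fragment only imitates the shape of an EAD file, and XML namespaces are deliberately not handled.

```python
import xml.etree.ElementTree as ET


def element_xpaths(xml_text):
    """Return one absolute, positional XPath per element, in document order.

    Hypothetical helper written for this illustration only.
    """
    root = ET.fromstring(xml_text)
    paths = []

    def walk(element, path):
        paths.append(path)
        seen = {}  # tag -> number of same-tag siblings visited so far
        for child in element:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return paths


# Hand-written fragment that only imitates the nesting of an EAD finding aid.
SAMPLE = (
    "<ead><archdesc><dsc>"
    "<c01><did><unittitle>Letters</unittitle></did></c01>"
    "<c01><did><unittitle>Diaries</unittitle></did></c01>"
    "</dsc></archdesc></ead>"
)

paths = element_xpaths(SAMPLE)
for p in paths:
    print(p)
```

Each printed path identifies exactly one element, so any of them could serve as the identifier of a candidate citable unit.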
The test set was built by following a similar procedure: from the whole EAD collection minus the 25 files selected for the training and validation set, we randomly selected 50 EAD files and, from each one, a single citable unit at random. Then, we manually created a ground-truth machine-readable citation for each of these randomly sampled citable units. We created the ground-truth citations by following the guidelines provided by the Purdue University archives, which follow the Modern Language Association (MLA) citation style.
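The two-stage sampling described above (files first, then citable units within each file, with the test files drawn from outside the training and validation files) can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for the collection: the function sample_units, the synthetic collection of 200 files, the file names and the fixed seed are all hypothetical.

```python
import random


def sample_units(units_by_file, n_files, units_per_file, rng, exclude=()):
    """Pick n_files files at random (skipping `exclude`), then
    units_per_file citable units at random from each chosen file."""
    candidates = sorted(f for f in units_by_file if f not in exclude)
    chosen_files = rng.sample(candidates, n_files)
    return {f: rng.sample(units_by_file[f], units_per_file) for f in chosen_files}


# Synthetic stand-in for the collection: 200 hypothetical files, 10 XPaths each.
collection = {
    f"file_{i:03d}.xml": [f"/ead[1]/archdesc[1]/dsc[1]/c01[{j}]" for j in range(1, 11)]
    for i in range(200)
}

rng = random.Random(42)  # fixed seed only so that the sketch is repeatable

# Training and validation set: 25 files, 4 citable units from each.
train_val = sample_units(collection, n_files=25, units_per_file=4, rng=rng)

# Test set: 50 files drawn from the remaining files, 1 citable unit from each.
test = sample_units(collection, n_files=50, units_per_file=1, rng=rng,
                    exclude=train_val.keys())

n_train_val = sum(len(units) for units in train_val.values())
n_test = sum(len(units) for units in test.values())
print(n_train_val, n_test)  # prints: 100 50
```

Excluding the training and validation files before drawing the test files keeps the two sets disjoint at the file level, which matches the procedure described in the text.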
You can browse the test collection at
http://ims-svn.dei.unipd.it/repos/datacitation_collections/
Username: guest - Password: guest
You can check it out using Subversion
$ svn checkout --username guest --password guest http://ims-svn.dei.unipd.it/repos/datacitation_collections/ datacitation_collections
The JavaDoc is available at the URL:
http://www.dei.unipd.it/~silvello/datacitation/rulebasedsystem
This is the first system that enables the automatic generation of citations for arbitrary queries against a given scientific dataset. Our system builds on modified code from an implementation of the CoreCover algorithm for query rewriting using views.
A system for the automatic creation of citation text snippets and landing pages for nanopublications is available here: