RT Report
T1 A Comparative Analysis of Open and Commercial Bibliographic Infrastructures: Scale, Metadata Standardization, and Implications for Bibliometric Evaluation
A1 De Moya Anegón, Félix
A1 Sánchez Jiménez, Rodrigo
A1 Halevi, Gali
A1 Guerrero Bote, Vicente P.
A1 Guerrero Castillo, Pablo
A1 Rivadeneyra, Federico
AB This report evaluates the structural viability of open bibliographic infrastructures for research assessment purposes, with a particular focus on how leading open databases compare with Scopus in terms of coverage, metadata quality, transparency, interoperability, and suitability for research evaluation workflows. While recent policy frameworks such as the Coalition for Advancing Research Assessment (CoARA) and the Barcelona Declaration mandate a transition toward open research data, an empirical analysis reveals a critical bottleneck: a structural trade-off between scale and metadata standardization. Platforms such as OpenAIRE, which aggregates more than 150 million records, and open bibliographic platforms including OpenAlex and The Lens, each with over 200 million records, significantly surpass the publication volume covered by commercial curated databases, most notably Scopus, across the analyzed 1996–2024 period. However, this aggregation model prioritizes recall over structural consistency, which can lead to metadata gaps that compromise direct bibliometric application. The massive ingestion capabilities of open platforms are counterbalanced by substantial limitations in key metadata fields. Affiliation data are absent in more than 55% of records, severely constraining the feasibility of institutional evaluations, and key identifiers such as ISSNs and DOIs exhibit significantly lower levels of completeness than in Scopus. Document type classification also frequently lacks editorial rigor, relying heavily on algorithmic labeling that does not consistently standardize the categorization of scholarly outputs. Furthermore, the analysis of citation flows reveals a markedly asymmetric dynamic: the expansive long tail of open databases functions primarily as a reference feeder that reinforces the impact indicators of the already established commercial core, rather than substantially redistributing measured impact across the broader scholarly corpus.
In this way, the additional literature that open sources seek to incorporate ultimately serves to strengthen the prominence of the publications already represented in commercial databases. This finding points to a structural paradox in open scholarly infrastructures and raises important questions that warrant further reflection and investigation. Geographic and editorial analyses reveal persistent asymmetries. Within the Global South, representation trajectories diverge: while regions such as Africa and Latin America have improved their visibility, significant coverage gaps, reaching up to 25%, remain in Asia and the Middle East. Additionally, deficits persist in specialized humanities monographs and complex publication structures like conference proceedings. Consequently, the theoretical advantage of the open "long tail" cannot currently be leveraged to offset these geographic and editorial biases, as its source-level metadata remains structurally incomplete or absent. This operational friction stems from a fundamentally bifurcated data reality. Within the core literature that overlaps with Scopus, open infrastructures achieve high metadata completeness in fields essential for research evaluation. However, the extended literature outside this overlapping core suffers from profound structural deficiencies, including empty essential fields, duplication, and incomplete source data. The corpus derived from Scopus's editorial processes exhibits a structural consistency without a direct equivalent in open platforms. While all databases utilize normalization methods, open infrastructures depend intensively on algorithmic procedures, which are notably prominent in OpenAlex. Conversely, Scopus integrates automated processes with author and institutional feedback to refine data disambiguation.
Although the data indicate that Scopus captures a higher number of affiliations per document, this study does not include an empirical comparison of the effectiveness of their respective disambiguation systems. Open platforms, in turn, face significant structural trade-offs: The Lens struggles with global metadata standardization, reporting the lowest global rates of ISSN and DOI presence and a 71.67% deficit in capturing conference proceedings. OpenAlex relies heavily on unstructured source data, with 41.5% of its records that have a source lacking an ISSN, and faces potential analytical bias due to algorithmic over-labeling of documents as "articles". Finally, OpenAIRE presents important technical anomalies, including over one million duplicated DOIs and the highest rate of unclassified documents (23.1%) within the curated core, resulting in the lowest overall citation impact ratio of the group. Despite the structural limitations observed in their extended corpora, open bibliographic infrastructures present advantages when applied to targeted use cases. The Lens, with over 215 million records, integrates scholarly outputs with patent data, making it highly effective for mapping technology transfer while maintaining a 96.1% citable document density within its core overlap. OpenAlex demonstrates the highest absolute alignment with commercial standards by capturing 63.8 million Scopus-indexed records and the highest citation density in that core among the three open databases. Finally, OpenAIRE offers the highest coverage of persistent identifiers (73.2% for DOIs and 59.7% for ISSNs) and the lowest rate of missing institutional affiliations (40.55%) among the open platforms. The high structural availability of open data must not be uniformly equated with evaluative viability. Uncritical adoption of the full open dataset in its raw state risks introducing new, systemic biases into the global science policy landscape, imposing significant methodological compromises.
Nevertheless, these infrastructures have evolved considerably. While direct aggregation currently complicates standard institutional evaluation, these platforms can deliver highly functional solutions for specialized bibliometric analyses, provided that institutions commit to investing in rigorous data normalization and disambiguation processes. Consequently, the transition toward open research assessment requires a technical shift from mere data accessibility to active data validation.
PB Ediciones Profesionales de la Información
SN 978-84-125757-8-1
YR 2026
FD 2026-05-25
LK https://hdl.handle.net/20.500.14352/137610
UL https://hdl.handle.net/20.500.14352/137610
LA eng
DS Docta Complutense
RD 27 jul 2026