Adding value to Linked Open Data using a multidimensional model approach based on the RDF Data Cube Vocabulary

Most organisations using Open Data currently focus on data processing and analysis. However, although Open Data may be available online, these data are generally of poor quality, thus discouraging others from contributing to and reusing them. This paper describes an approach with which to publish statistical data from public repositories by using Semantic Web standards published by the W3C, such as RDF and SPARQL, in order to facilitate the analysis of multidimensional models. We have defined a framework based on the entire lifecycle of data publication, including a novel step of Linked Open Data assessment and the use of external repositories as a knowledge base for data enrichment. As a result, users are able to interact with the data generated according to the RDF Data Cube vocabulary, which makes it possible for general users to avoid the complexity of SPARQL when analysing data. The use case was applied to the Barcelona Open Data platform and revealed the benefits of our approach, such as helping in the decision-making process.


Introduction
The technological advances made in the last few decades have enhanced and connected the entire globe. In June 2018, more than 4 billion users worldwide were connected to the Internet, approximately 55% of the world's population [1]. This scenario has generated a huge volume and variety of data. The remainder of this paper presents the related work, our framework and its application to the Barcelona Open Data platform, and finally, Section 5 presents our conclusions and future work.

Related work
In this section, we provide an overview of the concepts and previous research related to Open Data, along with the current state of Open Data reuse.

Open Data
Open Data are a key resource for social innovation and economic growth [21], and have a tremendous commercial value [22]. Providing access to data concerning public services by means of Open Data will open up a new scenario in which governments will be able to collaborate with citizens as regards, for example, the evaluation of public services. Moreover, both traditional businesses and new entrepreneurs are using Open Data not only to better understand potential markets, but also to build new data-driven products.
Many cities in the world are currently producing huge amounts of data. A discussion regarding Open Data utilisation in five smart cities (Barcelona, Chicago, Manchester, Amsterdam, and Helsinki) was presented by Ojo [23]. Dong [24] provides a detailed explanation of the datasets for each Canadian city, including the different data catalogues and their detailed characteristics.
Both highlight the significance of Open Data and its resulting innovations in these cities.
Cities normally group their datasets into categories based on government activities, such as security, culture and leisure, environment, transport and city facilities [25].
The format of an open dataset refers to how the data are structured and published for humans and machines. Choosing the right format enhances their management and reuse. However, a satisfactory response to users' needs could be provided by combining common formats (e.g. text files) with others that are more advanced and not so widespread [26].
In order to make data available, publishers generally organise a central catalogue in which to list the datasets. It is also possible to consider alternative means, such as the use of an Application Programming Interface (API), which allows programmers to identify and select data by using a custom set of criteria, rather than downloading the entire dataset. Other initiatives propose a conceptual model that is able to originate an effective, scalable e-Government ontology [27]. In addition, the interlinking of Schema.org with other vocabularies has been analysed in order to enhance the enrichment process of a dataset [28].
In addition, since cities are complex systems producing huge amounts of data, there are still research challenges concerning the study of advanced techniques of visualisation and services with which to enable data exploration. In this context, smart city ontologies, such as the Smart City Ontology (SCO), provide a powerful tool for the semantics-enabled exploration of urban data [29,30].
Free and open knowledge bases such as Wikidata (https://www.wikidata.org) have, meanwhile, been growing in popularity, thus promoting the publication and reuse of Open Data. Wikidata takes an innovative approach by providing an online workflow with which to propose the creation of new properties; these proposals are discussed in a participatory manner and, if there is sufficient support and a consensus is reached, the property is eventually created by an administrator.
Today, most organisations using Open Data focus on data processing and analysis, and some of the services provided are, for instance, the transformation of raw data into actionable insights. However, much work must still be done if the full potential of reusing Open Data is to be exploited [31].

Barriers to Open Data reuse
According to the report published by the European Commission concerning the reuse of Open Data [32], both external and internal barriers remain which hinder re-users from standardising or automating the collection and processing of Open Data. The report concludes with a series of recommendations for both the public and private sectors.
Ruijer [13] suggested that the interaction among governments, industries and universities could overcome the barriers that prevent governments from implementing new technologies and smart processes owing to their tight budgets and human resource constraints.
In [14], data users cited that the lack of basic guidelines for the use and enrichment of the available data has a negative impact on the level of reuse. They suggested the creation of a basic reuse kit including a guideline that would help them to download, connect, enrich and display released data, which could help re-users to understand how the city's open datasets could be used in a meaningful way.
Link [12] identified factors that could reduce the impact of Open Data, such as collection by automated tools, which poses challenges as regards guaranteeing privacy, ensuring data quality and analysing the data. When it comes to considering which dataset to use, data quality is a crucial aspect. A number of initiatives have been undertaken in order to specify and evaluate the quality of linked data [33,34]. The evaluation of LOD includes several aspects, such as consistency, accuracy and completeness.

Methodologies for publishing Linked Open Data
As more Open Data are published on the Web, best practices and guidelines are also evolving. In [35], a Linked Data life-cycle workflow architecture is proposed based on four components: (1) Acquisition, (2) Ontology Learning Method, (3) RDF Store and (4) Analysis System. In [36], the limitations and drawbacks of current frameworks are identified and a methodology for publishing LOD with the use of cloud computing is proposed.
The W3C Government Linked Data Working Group proposes a guide to aid in the access to and reuse of Open Government Data [37]. In addition to this guide, several publications propose life-cycle models that share common activities, such as specifying, modelling and publishing data in standard open Web formats. Hyland [38] provides a lifecycle consisting of the following activities: (1) Identify, (2) Model, (3) Name, (4) Describe, (5) Convert, (6) Publish, and (7) Maintain. In Villazón [39], the authors propose a preliminary set of methodological guidelines to assist in the generation, publication and exploitation of Linked Government Data. Their life cycle consists of the following activities: (1) Specify, (2) Model, (3) Generate, (4) Publish, and (5) Exploit. They capture the tasks that are required in a traditional information management workflow, but draw different boundaries between these tasks [37].
We can conclude this section by emphasising that we identified many common features and functionalities among the compared frameworks. However, some key features were omitted or not used, such as the use of external repositories as a knowledge base for data enrichment (for instance, Wikidata or GeoNames) and the inclusion of a step with which to perform the assessment of LOD.

Linked data and multidimensional datasets
In the context of LOD, multidimensional models are the combination of different datasets, which enables the application of evaluation techniques by means of statistics and indicators [40,41]. Several governments, including those of Scotland, the UK and Japan [46], provide their statistical data as LOD based on the RDF Data Cube vocabulary. In [47], the publication of official pension statistics as LOD based on the RDF Data Cube vocabulary illustrates how the data are reused in applications and how they contribute to statistical indicators in combination with other LOD. In addition, AirBase, the European air quality dataset maintained by the European Environment Agency, represents air pollution information as an RDF data cube, which has been linked to the YAGO and DBpedia knowledge bases [48]. With regard to visualisation, and in contrast to approaches that depend on the availability of a server-side part, CubeViz.js [49] is a client-side only application that allows connections to be made to a SPARQL endpoint or a file dataset. CubeViz.js is based on the RDF Data Cube vocabulary and is able to process the Data Cubes provided by a self-maintained SPARQL endpoint, along with Data Cubes that are published as Turtle or JSON files. Moreover, [50] proposes four methods for linked data viewing, identifying potential use cases of a dataset, such as an overview of queries and different tools that allow data to be visualised.
However, some challenges remain as regards the creation of cubes as linked data and the approaches used to address them, notably the difficulty of integrating different sources and the development of generic software tools [51].

Findings and contributions of our proposal
After reviewing the previous work, we identified many common features among the frameworks oriented towards publishing and exploiting linked data.
However, some key features were omitted or not used. The main challenges and open issues in this area are presented below:
• The inclusion of a step to perform the assessment of LOD.
• The use of external repositories as knowledge base for data enrichment.
• The improvement of data exploitation and visualization.
• The analysis of the different data sources to automate their integration.
Below, we summarize the main contributions presented in this paper:
• The proposal of a generic framework to enhance the enrichment and publication of Open Data by means of multidimensional models and LOD.
• The definition of a novel step of LOD assessment using different criteria concerning data quality.

• The enrichment of the original dataset by using links to external repositories (such as Wikidata and GeoNames).
• The dataset exploitation providing: (a) dashboards that allow non-expert users to interact with data generated according to the RDF Data Cube vocabulary (using existing tools such as CubeViz), and (b) a public SPARQL endpoint for expert users.
• The evaluation of the framework by means of a case study applied to Barcelona's official Open Data platform, illustrating how the transformation process can aid in decision-making.

The framework for publishing Linked Open Data
In the following subsections, we describe each step of our framework, which is based on the life cycle of Villazón [39] and includes the main methodological guidelines oriented towards publishing and exploiting linked data. Our approach enhances the original process of Villazón by including an additional step of LOD assessment, based on the methodology proposed by [34] and adapted to the specificities of data cube repositories. Furthermore, the original dataset is enriched by using connections to external repositories. In addition, in order to facilitate the exploitation of the repository, dashboards and a public SPARQL endpoint have been made available. The proposed framework is shown in Figure 1.

Data source specification
The format of a dataset refers to how data are structured and published for humans and machines. Choosing the right format enhances management and reuse. While the most common format used by organisations to publish data is CSV (Comma-Separated Values), which is simple to understand, highly reusable and machine-readable, more advanced approaches use XML, RDF and JSON, thus providing a higher level of information in terms of semantics [26]. However, in some cases, statistics are more understandable and readable when using XLS as a format, although its macros and formulas may be hard to handle.
In addition, it is not possible to guarantee the homogeneity of Open Data across institutions owing to the variety of data formats, vocabularies and external repositories. Common problems appear, such as textual errors, typos, abbreviations, multiple languages, a lack of information and the disambiguation of locations [52]. The pre-processing step, therefore, generally includes a set of parsers (e.g. implemented in Java, Python or using Extraction, Transform and Load tools [53]) in order to normalise the information contained in the source data.
Our approach is based on the development of Extraction, Transform and Load (ETL) processes designed by means of Pentaho Data Integration (Kettle), a data integration platform that allows access to, and the preparation, combination and analysis of, unstructured data, in order to normalise the data obtained from heterogeneous data sources.
It is important to note that although data sources may differ across institutions, our approach is generic in order to facilitate its application to any domain.
This process requires the identification of common points in the data sources in order to join them. Once the original data sources are treated as a whole, several additional tasks are required, such as cleaning and normalising the data.
As a result of this semi-automatic process, a single file with the integrated information is returned, which is finally used to create the RDF.

RDF data modelling
Multidimensional data models may have different relational representations, including the star schema and the snowflake schema [54], both of which use dimension tables to describe the data aggregated in a fact table. The most common is the star schema, whose main feature is that its dimension tables are denormalised and connected directly to the fact table.

Data generation
This step includes the transformation of the source data into a machine-readable language, i.e. RDF, thereby providing interoperability and links to other datasets. The transformation may be carried out in batch or in a graphically aided manner.

For example, Jena (https://jena.apache.org/documentation/rdf/index.html) is a Java API that can be used to create and manipulate RDF graphs, and provides classes with which to represent graphs, resources, properties and literals.
Moreover, OpenRefine (https://github.com/OpenRefine/OpenRefine) is a standalone open-source desktop application for data cleanup and transformation into other formats. OpenRefine allows us to automatically transform raw data into a machine-readable language, thus enabling graphical mapping from a project onto an RDF skeleton and its subsequent export in RDF format. The RDF schema alignment skeleton specifies how the RDF data will be generated from the source data. The cells in each record of the data will be placed in nodes within the skeleton.

Datasets become more useful and reusable when they are closely interlinked with other collections. These links are described by means of the owl:sameAs relationship and they contribute to the rich connectivity promoted by LOD. The interlinking process normally takes place in two steps: (i) an automatic procedure extracts the information from the data source, parses textual information and finds the candidate links to external resources, and (ii) a further manual refinement is carried out by data curators in order to validate the external links.
However, in some cases the automatic procedure can be particularly difficult, and data curators are assisted by tools. For instance, the Mix'n'match tool permits users to match Wikidata entries with a list of topics from external repositories in a fast and simple manner.
Many repositories can currently be used to enrich a dataset, depending on the context. More and more systems rely on gazetteers in order to link natural language texts to geographical locations, with GeoNames being arguably the most commonly used gazetteer at present [55]. With regard to knowledge graphs, DBpedia and, more recently, Wikidata have become very popular within the community. In general, they provide an API with which to consume the data, which can be easily adopted. A sketch of the kind of links produced by this step is shown below.
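As an illustration of the output of this interlinking step, the following Turtle sketch links a neighbourhood resource to its counterparts in Wikidata and GeoNames by means of owl:sameAs; the base URI and the external identifiers are placeholders introduced for the example, not the actual values used in our dataset.

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Hypothetical neighbourhood resource enriched with links to external repositories;
# the base URI and the external identifiers below are placeholders.
<http://example.org/resource/neighbourhood/la-barceloneta>
    rdfs:label "la Barceloneta"@es ;
    owl:sameAs <http://www.wikidata.org/entity/Qxxxxx> ,   # placeholder Wikidata item
               <https://sws.geonames.org/xxxxxxx/> .       # placeholder GeoNames feature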
Our approach is based on OpenRefine, since it is a powerful tool as regards working with heterogeneous data, transforming it into a uniform vocabulary and enriching it with external repositories.

Data publishing
The rapid development of the Semantic Web has promoted an increase in the amount of RDF data on the Web. As a result, a set of techniques with which to store RDF data has been proposed, and the efficient storage of RDF data has already been discussed in the literature [56,57]. There are several systems in which to store RDF data (commonly known as triple stores) that support data storage mechanisms, inference, update options, scalability, SPARQL endpoints and distribution, among others. In addition, the Vocabulary of Interlinked Datasets (VoID) [59] provides terms and properties with which to describe the metadata related to RDF datasets.
By providing licensing information, users are made aware of the conditions and terms of use. In general, this information is specified in RDF by means of relations such as dcterms:license and dcterms:rights, either in the dataset itself or in a separate VoID file.
Since directly publishing the final dataset reduces complex maintenance tasks, our approach proposes the publication of the RDF as a file that can be accessed by third parties, including metadata such as licensing information, and described by means of VoID.
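As a minimal sketch of this publication step (assuming placeholder URIs and values, and using the licence relation mentioned above), a VoID description of the dataset could look as follows in Turtle:

@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Hypothetical VoID description of the published dataset; all URIs and values are placeholders
<http://example.org/void#dataset>
    a void:Dataset ;
    dcterms:title       "Example statistical dataset published as an RDF data cube" ;
    dcterms:license     <https://creativecommons.org/licenses/by/4.0/> ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:dataDump       <http://example.org/dumps/dataset.nt> ;
    void:vocabulary     <http://purl.org/linked-data/cube#> .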

LOD assessment
The LOD assessment step evaluates the published dataset according to a set of criteria concerning data quality [60]. A data-quality criterion is a function with values in the range 0-1 which scores a particular feature, such as availability or timeliness frequency. A data-quality dimension comprises one or more criteria, which are grouped into categories as shown in Table 1.

Data exploitation
This step covers the exploitation of the dataset resulting from the transformation process. In order to exploit it to its full potential, it is necessary to provide dashboards that enable users with limited knowledge and a lack of Information Technology (IT) skills to interact with the dataset. CubeViz.js generates a faceted browsing widget that can be used to interactively filter the observations that are to be visualised in charts. In addition, a public SPARQL endpoint can be enabled in order to facilitate access to and reuse of the dataset; a sketch of the kind of query it supports is shown below.
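For expert users, the same kind of filtering that the dashboard offers can be expressed directly against the SPARQL endpoint. The following query is a minimal sketch; the dimension properties (ex:refDistrict, ex:refPeriod) and the district URI are placeholders, whereas qb: and sdmx-measure: are the standard RDF Data Cube and SDMX namespaces.

PREFIX qb:           <http://purl.org/linked-data/cube#>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX ex:           <http://example.org/property/>   # placeholder dimension properties

# Retrieve the observations of a cube for a given district, ordered by period
SELECT ?obs ?period ?value
WHERE {
  ?obs a qb:Observation ;
       qb:dataSet            ?dataset ;
       ex:refDistrict        <http://example.org/resource/district/example-district> ;  # placeholder
       ex:refPeriod          ?period ;
       sdmx-measure:obsValue ?value .
}
ORDER BY ?period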

A real case scenario
The relevance of this scenario is highlighted by the State of European Cities Report and the priorities for EU regional and urban development. Our approach has been evaluated by using data from the Barcelona Open Data platform, and specifically the state of critical cleaning spots in the city of Barcelona. The details of each step of the publication process are described below.

Data source specification
This section presents the specification of the data sources according to the guidelines. As a result of this step, a CSV file is obtained which can be processed automatically. This is a regular text file used for the storage of tabular data, in which the fields are separated, in this case, by a comma. This is a semi-automatic step that requires a prior analysis and the identification of common points that allow the integration of the data sources. The CSV file is used in the next step, RDF Data Modelling.
In the case of the government data sources, we followed two paths:
• We reused data already opened up and published by the Barcelona Open Data platform.
• We identified datasets that share common join points (i.e. district, geographical location, etc.), thus allowing further analysis. These include the geographical location, the neighbourhood, the visits generated by the critical point, and the reason why it is a critical point.
In order to analyse whether population and income influenced the critical points, we combined this information with urban environment and administrative boundary files. The territorial income distribution and population data for the city of Barcelona were extracted from a PDF file, considered to be of poor quality since it is not suitable for automatic processing by a computer.
Once the source files are prepared, they can be read and processed sequentially in order to obtain the data about the critical spots. Several additional tasks are required, such as cleaning and normalising the data, since the data in the different source files are not consistent with each other; for example, the same data may use different field names, or the same field may contain information regarding various attributes, making it necessary to process the text in order to extract the data separately (e.g. the text value '3. la Barceloneta' contains both the code and the name of the neighbourhood). Figure 2 shows a graphical representation of the transformation process, which has three entry points corresponding to three heterogeneous data sources in terms of format and content. Finally, the data sources are combined into a single output file that will later be used to generate the RDF.

RDF data modelling
In Figure 3, we present our approach as a snowflake schema, including the fact table, which stores the aggregated data (critical cleaning spots, number of visits, income per capita, population and state) created from the datasets, surrounded by the dimension tables that describe these data. The RDF data cube model obtained as a result of this step is based on the RDF Data Cube vocabulary, in which each resource is identified by a URI in order to benefit from the value of LOD. The prefixes listed in Table 3 indicate the namespaces used in the dataset. Following the design issues for the publication of LOD [4], our approach is characterised by the following structure:
• the dataset is identified by {base URI}/dataset. A resource representing the entire dataset is created and typed as qb:DataSet, and is then linked to the corresponding data structure definition via the qb:structure property.
• the data structure definition of the dataset, which includes components such as dimensions, attributes and measures, is identified by {base URI}/dsd and typed as qb:DataStructureDefinition.
• finally, each observation is typed as qb:Observation and identified by a URI which contains the date followed by an auto-increment number. A minimal sketch of this structure is shown below.
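To make this structure concrete, the following Turtle sketch shows, in abbreviated form, how the dataset, its data structure definition and a single observation could be expressed. The base URI, the component property names and the example values are placeholders; the actual namespaces used in our dataset are those listed in Table 3.

@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .            # placeholder base URI
@prefix exp: <http://example.org/property/> .   # placeholder component properties

# The dataset ({base URI}/dataset), linked to its data structure definition
ex:dataset a qb:DataSet ;
    qb:structure ex:dsd .

# The data structure definition ({base URI}/dsd) with abbreviated components
ex:dsd a qb:DataStructureDefinition ;
    qb:component [ qb:dimension exp:refNeighbourhood ] ,
                 [ qb:dimension exp:refPeriod ] ,
                 [ qb:measure   exp:numberOfVisits ] .

# One observation, identified by the date followed by an auto-increment number
<http://example.org/observation/2018-06-01-0001> a qb:Observation ;
    qb:dataSet           ex:dataset ;
    exp:refNeighbourhood <http://example.org/neighbourhood/la-barceloneta> ;
    exp:refPeriod        "2018-06"^^xsd:gYearMonth ;
    exp:numberOfVisits   12 .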

Data generation
This case study uses OpenRefine to transform the source data into the RDF Data Cube vocabulary. The mappings are used to create the dataset structure, along with the observations and the components, using the appropriate vocabularies.

Data publishing
This step includes the publication of the dataset following the LOD principles. Our approach reuses the original Creative Commons Attribution 4.0 licence of the Government data sources. The dataset is stored on an RDF4J server, on which a public SPARQL endpoint has been enabled.
The dataset was described by means of the VoID vocabulary, which helps data producers to publish metadata in a human- and machine-readable format. In addition, DataHub was used as a platform on which to publish the dataset.

LOD assessment
The assessment results are summarised in Table 4. Next, we explain in detail the criteria related to our dataset.

Trustworthiness
• Trustworthiness on dataset level. The dataset is published by means of an automatic conversion to LOD. The score of 0.25 is defined in [34] as that assigned to data extracted from structured data sources.

• Trustworthiness on statement level. Vocabularies have not been included to describe the origin of the data, therefore, this criterion value is 0.
• Using unknown and empty values. No identifiers have been used to capture the unknown and empty values, thus the value here is 0.

• Consistency of schema restrictions during the insertion of new statements. The score obtained is 0, since the user interface does not check schema restrictions during the insertion of new statements.
• Consistency of statements with respect to class constraints. The owl:disjointWith property has been used in order to check the class constraints; a sketch of such a check is shown below.
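A check of this kind can be expressed as a simple SPARQL query that looks for resources typed with two classes declared as disjoint; the sketch below assumes nothing beyond the OWL vocabulary and returns the offending resources, if any.

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Find resources that violate an owl:disjointWith constraint
SELECT DISTINCT ?resource ?class1 ?class2
WHERE {
  ?class1   owl:disjointWith ?class2 .
  ?resource a ?class1 , ?class2 .
}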

Relevancy
• Creating a ranking of statements. The dataset does not support the ranking of statements.

Completeness
In order to evaluate this dimension, it is also necessary to define the set of classes and properties listed in Table 5, which is based on the RDF Data Cube and DBpedia vocabularies.
• Specification of the validity period of statements. No properties, such as Wikidata end time (P582), were found to specify the validity of statements.
• Specification of the modification date of statements. No information concerning modification dates, such as dcterms:modified or schema:dateModified, was found.

• Description of resources. The rate of entities described with the property rdfs:label has been computed and found to be low.
• Labels in multiple languages. The string value of a property can be encoded in multiple languages by adding language tags such as @es, @en, etc. The dataset declares the language of the rdfs:label and rdfs:comment properties, in which only references to English were found (a query sketch for this criterion is given after this list).
• Understandable RDF serialization. Alternative encodings that are more understandable for humans than RDF/XML include N-Triples, N3 and Turtle.
• Availability of a public SPARQL endpoint. The dataset is stored in an RDF4J server and the SPARQL endpoint is located at http://data.cervantesvirtual.com/rdf4j-server/repositories/rdfdatacube.
• Provisioning of an RDF export. The dataset is available as a data dump based on N-Triples.
• Support of content negotiation. The consistency between the RDF serialization format requested (RDF/XML, N3, Turtle, and N-Triples) and that which was returned was checked. As a result, Turtle was not supported.
• Linking HTML sites to RDF serializations. This criterion has not been considered for our dataset, since there is no HTML website in which the items can be browsed.
• Provisioning of repository metadata. The repository can be described using Vocabulary of Interlinked Datasets (VoID) [59]. The dataset includes a VoID file with the title, description, creators and vocabularies used.
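The "Labels in multiple languages" criterion mentioned above can be checked with a query that groups the rdfs:label values by their language tag; this is a sketch that assumes only the RDFS vocabulary.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Count how many labels exist per language tag (an empty tag denotes an untagged literal)
SELECT ?lang (COUNT(?label) AS ?labels)
WHERE {
  ?resource rdfs:label ?label .
  BIND (LANG(?label) AS ?lang)
}
GROUP BY ?lang
ORDER BY DESC(?labels)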

Interlinking
• Interlinking via owl:sameAs. This score is obtained as the rate of instances that have at least one owl:sameAs triple pointing to an external resource (a query sketch is given below).
• Validity of external URIs. The number of timeouts and HTTP errors was computed when accessing a random sample of 100 URIs defined with the owl:sameAs relation.
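The interlinking rate from the first criterion can be approximated with a query along the following lines; it is a sketch that compares the number of subjects carrying at least one owl:sameAs link with the total number of subjects, without distinguishing internal from external targets.

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Approximate rate of instances having at least one owl:sameAs link
SELECT (?withLinks / ?total AS ?rate)
WHERE {
  { SELECT (COUNT(DISTINCT ?s) AS ?total)     WHERE { ?s ?p ?o . } }
  { SELECT (COUNT(DISTINCT ?s) AS ?withLinks) WHERE { ?s owl:sameAs ?ext . } }
}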

Data exploitation
In order to exploit the full potential of the dataset, and to avoid technical requirements such as the use of SPARQL, our approach uses CubeViz.js to generate faceted views that allow dimensions, such as Time (months) in Figure 5, to be filtered. We also concluded that the number of critical cleaning spots does not change significantly over the year. However, we can confirm that there is a relationship between the number of critical cleaning spots and the low per capita income in areas such as El Raval and El Poble Sec (see Figure 8). Linking to other data sources makes it possible to add further information with which to enhance the final dataset. Wikidata provides a full set of properties that can be exploited, such as administrative subdivisions, dimensions, images and geographic proximity. Listing 5 shows an example of a SPARQL federated query, which is employed to execute queries distributed over different SPARQL endpoints by means of the SERVICE keyword. This query was executed from our SPARQL endpoint and merges data from Wikidata (such as geographic coordinates, the area occupied by a region and additional external identifiers) as an example of how expert users are able to exploit our dataset.
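Listing 5 itself is not reproduced here; the following is a minimal sketch of such a federated query, assuming that neighbourhood resources carry owl:sameAs links to Wikidata items. The local resource pattern is a placeholder, while wdt:P625 (coordinate location) and wdt:P2046 (area) are actual Wikidata properties and https://query.wikidata.org/sparql is the public Wikidata endpoint.

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Enrich local neighbourhoods with coordinates and area retrieved from Wikidata
SELECT ?neighbourhood ?wikidataItem ?coordinates ?area
WHERE {
  ?neighbourhood owl:sameAs ?wikidataItem .
  FILTER (STRSTARTS(STR(?wikidataItem), "http://www.wikidata.org/entity/"))
  SERVICE <https://query.wikidata.org/sparql> {
    ?wikidataItem wdt:P625  ?coordinates ;   # coordinate location
                  wdt:P2046 ?area .          # area of the region
  }
}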

Conclusions and future work
The publication of Open Data has attracted the interest of the research community to a great extent. Open Data may be available online, but are generally of poor quality, thus discouraging others from contributing to and reusing those data. In this paper, we have defined a framework suitable for publishing and exploiting linked data, including a new step of LOD assessment.

The framework uses Semantic Web standards published by the W3C, such as RDF and SPARQL, and focuses mainly on providing and facilitating the analysis of multidimensional models. The main motivation for our research is to increase the value of Open Data and make it useful by enriching the data and assessing their quality before exploitation.

The proposed framework consists of six steps: (1) data source specification (different data sources are integrated through an ETL process in order to normalise the data obtained from heterogeneous sources); (2) RDF data modelling (the multidimensional model is defined according to the RDF Data Cube vocabulary); (3) data generation (the RDF data are then generated by means of OpenRefine); (4) data publishing (the RDF data are enriched and stored in our repository); (5) LOD assessment (a methodology with which to evaluate the final dataset is provided); and (6) data exploitation (dashboards and a public SPARQL endpoint are provided with which to interact with the dataset).

The main contributions of our framework are the following:
• The enrichment of the original dataset by using links to external repositories. In our experimentation, we established links to Wikidata and GeoNames. In addition, data enrichment allows the inclusion of new indicators by means of federated SPARQL queries.
• The dataset exploitation using: (1) dashboards that allow non-expert users to interact with data generated according to the RDF Data Cube vocabulary; (2) a public SPARQL endpoint for expert users.
The approach has been evaluated by using the data from the Barcelona Open Data platform, and specifically, the state of critical cleaning spots in the city of Barcelona.
We foresee several opportunities to improve our work, such as including the data from other cities and adding multiple Linked Data repositories. We also plan to improve our dataset by using more vocabularies, in addition to evaluating new methods for the visualisation and exploitation of Government Linked Data.
Finally, the results of the data quality process will be taken into account in order to identify features to improve the publication of the dataset such as the addition of provenance and ranking information.