TOURISTS’ DIGITAL FOOTPRINT IN CITIES: COMPARING BIG DATA SOURCES MARÍA HENAR SALAS-OLMEDO Departamento de Geografía Humana. Universidad Complutense de Madrid C/Profesor Aranguren, s/n. 28040 Madrid, Spain msalas01@ucm.es BORJA MOYA-GÓMEZ Departamento de Geografía Humana. Universidad Complutense de Madrid. C/Profesor Aranguren, s/n. 28040 Madrid (España) bmoyagomez@ucm.es JUAN CARLOS GARCÍA-PALOMARES * Departamento de Geografía Humana. Universidad Complutense de Madrid C/Profesor Aranguren, s/n. 28040 Madrid, Spain jcgarcia@ucm.es JAVIER GUTIÉRREZ Departamento de Geografía Humana. Universidad Complutense de Madrid C/Profesor Aranguren, s/n. 28040 Madrid, Spain javiergutierrez@ghis.ucm.es *Corresponding author. mailto:msalas01%40ucm.es mailto:bmoyagomez@ucm.es mailto:jcgarcia%40ucm.es mailto:javiergutierrez%40ghis.ucm.es Highlights  The paper analyses the digital footprint of urban tourists through Big Data.  Panoramio, Foursquare and Twitter reflect different tourism activities.  The methods used are density maps, OLS, spatial self-correlation and cluster analysis.  The results show different tourist spaces: multifunction (several activities) vs specialising. 1 TOURISTS’ DIGITAL FOOTPRINT IN CITIES: COMPARING BIG DATA SOURCES Abstract.- There is little knowledge available on the spatial behaviour of urban tourists, and yet tourists generate an enormous quantity of data when they visit cities. These data sources can be used to track their presence through their activities. The aim of this paper is to analyse the digital footprint of urban tourists through Big Data. Unlike other papers that use a single data source, this article examines three sources of data to reflect different tourism activities in cities: Panoramio (sightseeing), Foursquare (consumption), and Twitter (being connected- accommodation). The results show that the data from the three activities are partly spatially redundant and partly complementary, and allow the characterisation of multifunction tourist spaces and spaces specialising in one or various activities. The main conclusion is that it is not sufficient to use one data source to analyse the presence of tourists in cities; several must be used in a complementary manner. Key words.- Urban tourism, Big data, Photo-sharing services, Social networks, Spatial analysis, GIS http://ees.elsevier.com/jtma/viewRCResults.aspx?pdf=1&docID=18028&rev=1&fileID=353869&msid={ABE64E9D-2D79-4F23-B021-4E06D49F77CE} 2 1. INTRODUCTION The tourism product of large cities has enormous capacity and is highly diversified (Jansen- Verbeke, 1986). Tourists make very selective use of the city. They reduce uncertainty in their exploration of an area by visiting sites perceived to give the greatest reward for effort (Cooper, 1981). In fact, it is impossible for tourists to consume the entire urban tourism product on an average 2 to 3-day visit to such a city (Mazanec, 1997), so they must choose which of the attractions they wish to visit, and which to skip. The result is the creation of typical tourism- product consumption patterns based on the preferences and limitations of different tourist types (Shoval & Raveh, 2004). Studies analysing spatial patterns of tourist mobility in cities show that they tend to be concentrated in specific areas of city centers (Shoval & Raveh, 2004; Hayllar & Griffin, 2009), where they find the main tourist attractions (historical buildings and parks, museums, theatres, concert halls, etc.), along with leisure and shopping facilities and accommodation services for tourists (Pearce, 1987). Not surprisingly, most tourists seek hotels or apartments that are within walking distance of major attractions in the city (Arbel & Pizam, 1977) and spend a large share of their time budget in the immediate vicinity of the hotel (Shoval et al., 2011). As a result, city centers are profoundly transformed by the pressure of tourism. They become more and more oriented towards meeting the needs of tourists (hotels and apartments, souvenir shops, restaurants, etc.), and less toward residents’ needs, who tend to abandon central locations. These processes are known as tourismification (Jansen-Verbeke, 1998) and tourism gentrification (Gotham, 2005). Traditionally the spatial behaviour of tourists in cities has been studied through surveys (for example, Cooper, 1981). However, surveys do not provide great spatial and temporal data on the spatial behaviour of tourists. Over the past nine years, the rapid advancement and availability of small, inexpensive and reliable tracking devices that draw on GPS technology is assisting researchers in developing new methods of spatial research (Edwards & Griffin, 2013). GPS allows for the precise and continuous tracking of individuals and provides spatially rich data, making it possible to accurately track the paths tourists take and to provide greater understanding of their socio-spatial behaviour (Asakura & Iryo, 2007). Not surprisingly, several studies have used GPS tracks in recent years to analyse the spatial behaviour of tourists (for example, Shoval et al., 2011). However, most of these studies use small samples, since the collaboration of tourists is necessary to obtain their tracks. Big Data offers new opportunities in tourism research by providing high spatial and temporal data that make it possible to analyse the spatiotemporal patterns of a large numbers of tourists. Big Data is a new concept that has become widely popularised in recent years to describe the production of massive quantities of data. Big Data covers a range of very different areas: Internet searches, bank card transactions, records of mobile phone activity, social networks, data on water and electricity consumption, meteorological data, images recorded with video cameras and many more. The main characteristics of these new data sources include particularly the following three Vs: volume, in terabytes or petabytes of data; velocity, created in or at near real time; and variety, taken from a wide variety of sources, either structured (data that can be stored in the form of tables), semistructured (htlm files) or unstructured (texts, photographs, videos) (McAfee et al., 2012; Kitchin, 2013; Sagiroglu & Simanc, 2014). Big Data supplies a large quantity of information to complement the traditional sources. Tourists leave a digital “footprint” in most of their activities, and these new data sources now make it possible to analyse tourists' behaviour in the cities they visit. They take vast numbers 3 of photographs and upload them to photo-sharing services, they make payments with bank cards, they talk and send messages via their mobile phones, they are active on social networks, and so on. All this activity produces an enormous quantity of digital data (Big Data) which can be analysed to study behaviour patterns (Shoval & Isaacson, 2007; Asakura & Iryo, 2007; Girardin et al., 2008a and b). Much of these data is geolocated, so tourists' activity can be analysed spatially. However, there are very few papers that apply Big Data to examine the spatial distribution of tourists in cities, probably due to the novelty of these information sources and the fact that some are difficult to access. Photo-sharing services provide very useful information for identifying the presence of tourists when they go sightseeing in cities. Although there are several photo-sharing communities that allow the geolocation of photos (such as Flickr or Instagram), Panoramio is probably the most useful service for measuring tourist hotspots, as this website shows photographs taken of places or landscapes when sightseeing, which are then posted online once they have been georeferenced. The records of geolocated photographs can be used not only to identify sightseeing spots (García-Palomares et al., 2015), but also to analyse the spatial and temporal patterns of tourist flows in cities (Girardin et al., 2008b). However, tourists not only visit the most photographed spaces. They also go shopping, go to restaurants and stay in hotels, and they leave their digital footprint in all these establishments when they pay with a bank card or check-in their location on social networks. These digital footprints of tourists offer information which is largely complementary to the data from photo-sharing social networks. The most photographed areas often have very little offer of accommodation and shopping. In the case of business tourists, their hotel and the spaces they frequent tend to be close to business sectors, and not necessarily in the most photographed areas. During their stay in the city, tourists also log onto the Internet to confirm details of their visit, check their e-mail, engage in the social networks, and so on. This activity also leaves a digital footprint in many of the places they visit. Tourists often use the facilities in hotels, hostels, restaurants and certain open spaces to connect to Internet through free WiFi networks, so their activity on social networks may particularly reflect this type of spaces. The main aim of this paper is to compare three geolocated data sources to identify the presence of tourists in cities in terms of their different activities: geolocated photographs from the Panoramio platform area used as a proxy for sightseeing, Foursquare check-ins as a proxy for consumption, and interaction on the social network Twitter as a proxy for being connected- accommodation. The study area is the city of Madrid, one of the European cities with the highest volume of tourists. This paper contributes to the literature on Big Data and tourism activity from a threefold perspective: 1) Three different data sources are compared to obtain the most comprehensive analysis of locations of tourists. 2) The data for tourist activity (photos, check-ins, tweets,) are not analysed directly as in previous works, but the tourists themselves are the unit of analysis. The data are processed to allow the number of tourists to be counted in each place in the city according to each data source, thus avoiding problems of multiple counting, and making the results comparable. 3) The information from the different data sources is integrated through cluster analysis and spatial autocorrelation analysis to characterise the areas of tourist concentration according to the type of activity. The remainder of this paper is structured as follows. Section 2 summarises the existing literature on the use of photo-sharing services, Foursquare check-ins and Twitter in urban studies, with a particular focus on tourism. Sections 3 and 4 describe the data and the 4 methodology respectively. Section 5 describes and discusses the results, while Section 6 presents the main conclusions and suggests further directions for research. 2. RELATED LITERATURE Photo-sharing services Sightseeing is one of the main tourist activities in cities, and leaves its digital footprint on social networks for photo-sharing such as Instagram, Flickr and Panoramio. All three offer the possibility of geolocating photographs, but Panoramio (http://www.panoramio.com) particularly allowed the georeferencing of the photos, as it focused on images of places or landscapes shared by its users, which can be seen on the Panoramio website (until November 4, 2016) or through Google Earth and Google Maps. In fact, Panoramio was a Google service, and had over 120 million geolocated photographs (2015 data according to Panoramio data API). Photo-sharing services have been used for several purposes in the field of tourism, including identifying social events such as festivals, demonstrations, sporting events and so on (Sun and Fan, 2014), estimating tourist numbers (Koerbitz et al., 2013), identifying the presence of tourists (Girardin et al., 2008a, Kisilevich et al., 2013; Straumann et al., 2014, Gutiérrez et al, 2017) and the most common trajectories followed by tourists (Girardin et al., 2008b), proposing or assessing tourist routes (Kurashima et al., 2013), suggesting tourist trips (Lu et al., 2010) and planning trips lasting several days and places to visit (Li, 2013). These data sources make it possible to identify areas with a concentration of tourists in cities (sightseeing spots) through spatial statistical techniques (García-Palomares et al., 2015). Photographs taken by tourists can be differentiated from those taken by residents based on the time period in which the same user takes the photographs. The results obtained in the work of García-Palomares et al. (2015) show clearly differentiated spatial distributions for tourists (more concentrated in sightseeing spots) and local residents (more widely dispersed throughout the city) in eight European cities. Foursquare check Tourists not only leave their digital footprint when they go sightseeing in the city. They also leave a digital trail when they engage in consumption-related activities (shopping, restaurants and so on) and pay with bank cards (Sobolevsky et al., 2014a, 2014b and 2015). However, bank card transactions are a data source that has been very little used in tourist studies, due to the difficulty of accessing these databases. An alternative data source for analysing consumption activities is Foursquare social network. This service enables a user to inform their friends his/her location by generating check-ins, rate and review venues they visit, and read other users’ reviews. Each check-in contains information that reveals who (which user) spends time where (at what location), when (what time of day, what day of week), and doing what (according to the kind of venue) (Çelikten et al., 2016). Foursquare data have been used for analysing geographic distribution of venues across the cities (Çelikten et al., 2016), discovering functional urban areas (Vaca et al., 2015), analysing movement pattern and the popularity of areas in cities (Silva et al., 2013), studying traffic conditions (Ribeiro et al, 2014), and performing trade area analysis (Qu et al., 2013). However, there are few studies using Foursquare data in the field of tourism. One exception is Ferreira et al. (2014), who analyze the spatio-temporal characteristics of the behaviour of tourists and 5 residents in a set of cities, identifying the most visited places and the temporal distributions of both groups of visitors. Other exception is the paper by Serrano-Estrada et al. (2016), who map the most relevant venues in terms of number of visitors within a tourist city. Twitter Most of the studies done with mass data from social networks have used Twitter (Murthy, 2013), not only because this platform has global coverage, but also because its data (the tweets) are available free on the Internet the instant they are produced –that is, in real-time. Each geolocated tweet leaves the digital footprint of the place and the time it was sent. If the data are processed according to their user identifier, they provide an approximation of the places visited by each user at different times of the day and days of the week –that is, their spatial and temporal profile. Activity on social networks can be used as a proxy to analyse the changing population densities in the city throughout the day (Ciuccarelli et al., 2014) and the population's mobility patterns (Wu et al., 2014). Twitter’s daily use profiles serve to classify the space according to the type of dominant activity, whether business, leisure/weekend, nightlife and residential (Frias-Martinez et al, 2012). Geolocated tweets have also been used to analyse the degree of social mixing in the use of space by tracking the movement of social groups in highly segregated cities such as Río de Janeiro (Netto et al., 2015) and Louisville (Shelton et al., 2015). Unlike the information supplied by official sources, which offer data relating to place of residence, these studies apply indicators of multiculturalness and mixing to examine the use of space throughout the day. For example, there are studies on linguistic diversity in cities and regions based on the language used in tweets as an indicator of cultural diversity (Mocanu et al., 2013). Elsewhere, Takhteyev et al. (2012) studied the role of distance, languages and borders in configuring the networks of tweets, and concluded that the relations established over long distances are similar to those observed in air transport flows. Papers on tourism using Twitter data are particularly scant. The very few works that use geolocated tweets in the field of tourism tend to focus on comparing visitor’s spatial behaviour between cities on the national or global scale (Bassolas et al., 2016; Hawelka et al., 2014; Sobolevsky et al., 2015), but do not address the spatial patterns within the city. 3. STUDY AREA: MADRID The area selected for study was the city of Madrid. Madrid is a historic city that has a large concentration of tourists. Delimitation of the study area was based on the administrative reference unit: the municipality of Madrid. This is a relatively compact area of 604,3 km² around the traditional city. With 3,160,000 inhabitants in 2016, the population density of the municipality of Madrid is high, with more than 5,200 inhabitants per km². According to the Statistics National Institute, Madrid received 9 million tourists in 2016. Its popularity rose considerably in the last decade. In 2007, the number of tourists totalled 7.3 million, so there was an increase of 24% in the last ten years. International visitors increased even more, from 3.4 million in 2007 to 4.6 in 2017 (an increase of 35%). This huge influx of visitors has an enormous economic and social impact on the city, generating more than 7.600 millions of euros in 2016. Figure 1 shows the main points of tourist attraction in the city of Madrid. 6 Figure 1: The city of Madrid 4. DESCRIBING AND PRE-PROCESSING THE DATA Panoramio The records of 307,062 geolocated photographs uploaded to Panoramio between 2006 and 2014 covering the municipality of Madrid were downloaded from the Panoramio API. This dataset contains information about the geographic coordinates, the ID of the owner of the photograph, a URL link to the photograph, and the date on which it was uploaded (day, month and year). 7 This downloading generated “.csv” files. The geographical coordinates of each photograph were used to create a layer for each location using ArcGIS 10.4. This data source does not include a field to record the user’s nationality. However, it was possible to “identify” each user as a tourist or a resident using the user’s ID and the date of his/her photograph. Using MongoDB, if a user had taken pictures in Madrid over a period exceeding one week per year, the photographs were attributed to residents; if the period was less than one week per year, then they were attributed to tourists 1 . This methodology is similar to that used by Fischer for his Geotaggers' World Atlas, and García Palomares et al. (2015). A total of 27,573 photographs were assigned to tourists as a result of this process (Figure 2a). Twitter and Foursquare One of the most widely used social networks is Twitter, a platform that allows users to send messages with a maximum of 140 characters –known as tweets– which by default are public. This service has 500 million users all over the world and generates around 65 million tweets a day. Some of these (around 3% until April 2015 and currently around 1%) are geolocated; that is, messages in which the sender’s location is known from their geographic coordinates. Given the vast amount of messages sent on the social networks, this 3% percentage represents an enormous quantity of tweets. Our study was conducted using a dataset of geolocated tweets sent from Madrid between 2012 and 2014. The dataset contains information on the geographic coordinates, the owner’s ID, the language, the date on which it was uploaded (minute, hour, day, month and year), any messages included, etc. This file underwent a similar treatment to the Panoramio file. First, tourists were identified using MongoDB to extract users that had tweeted for a week or less per year. Then, all twees from those users within the municipality of Madrid were used to generate a layer of tweets with ArcGIS using the coordinates of the logs. A total of 234,159 tweets were identified as sent by tourists visiting Madrid. Foursquare is a community of 50 million users allowing tourists to check in to the establishments they visit and providing personalized recommendations of the places to go (restaurants, nightlife spots, shops and other places of interest) in the surrounding area. These check-ins are private by default, but they become publicly accessible for example, when users opt to share their check-ins publicly via Twitter (Çelikten et al., 2016). Thus, it was possible to extract Foursquare data by selecting Foursquare check-ins in our Twitter dataset. From a total of 234,159 tweets assigned to tourists, 20,076 were Foursquare check-ins (Figure 2b) and the rest (214,083) were considered ordinary tweets (Figure 2c). 1 We are aware that the one-week limit to differentiate between tourists and residents can lead to two types of errors: omission errors (some tourists can stay for more than one week in the city and be considered residents), and commission errors (some residents may Tweet or take photos for less than one week, and therefore being considered tourists). The probability associated with the first type of error seems to be very low, since the average length of stay of tourists in Madrid is 2.1 nights (Madrid City Council). The magnitude of the second type of error is difficult to estimate; however, the comparison between the identified spatial clusters and a map of the city (Open Street Map) indicates that our results are consistent, since the spatial clusters coincide with areas in the city that are known to attract the highest concentration of tourists. This consistency was observed in previous papers analysing the footprint of tourists in cities, such as García-Palomares et al. (2015). 8 Figure 2: a) Photographs taken by tourists. b) Foursquare check-in by tourists. c) Tweets sent by tourists. Figure 2a references: 1) Royal Palace; 2) Puerta del Sol square; 3) Plaza de Cibeles square; 4) Plaza Mayor square; 5) Las Ventas bullring; 6) Temple of Debod; 7) Real Madrid football stadium; 8) Atlético de Madrid football stadium; 9) Atocha – Reina Sofía Museum; 10) Cuatro Torres; 11) Torres Kio Figure 2c references: 1) Historic centre; 2) Paseo de la Castellana axis; 3) Salamanca district; 4) Madrid Río axisBackground map: OpenStreetMap. 9 Foursquare data and tweets exhibit very different temporal distribution patterns (Figure 3). Foursquare check-ins are more equally distributed along the day (check-ins in restaurants, shops and other places), while tweets tend to be more concentrated between 18 and 21 hours (what suggests that the use of Twitter is predominantly related to the location of accommodations). Figure 3: Temporal distribution of Foursquare and Twitter data of tourists in Madrid 5. METHODOLOGY The following methodology was used to analyse the spatial distribution of tourists in Madrid: a) Number of tourists per census tract.- The tourists' photos, Foursquare check-ins and tweets were located according to census tracts, counting the number of single tourists in each census tract and for each data source from the user ID. The row data was converted to single tourists using joint spatial (census tracts) aggregation to obtain the number of single users in each census tract. b) Tourist density by census tract.- The number of tourists in each census tract depends on the actual concentration of tourists in the census tract and its size. Larger census tracts tend to register a greater number of tourists. To mitigate this problem (Modifiable Areal Unit Problem -MAUP) (Openshaw and Taylor, 1981), the tourist density per census tract was obtained for each data source. In contrast with regular units (grids), using census tract we obtained relative homogeneous spatial zones from the point of view of land use. Census tract also allow identify some references areas like parks, football stadiums, big shopping centers and others. c) Rescaling of the data.- A very different number of tourists was detected through the three data sources. The density data of visitors by census tract were rescaled to a scale of 0 to 1,000 (by means of a linear transaction) in order to eliminate the effect of different ranges in the data sources and to make density data comparable. d) Density maps and descriptive statistics.- Rescaled data aggregated by census tract were used to obtain density maps and descriptive statistics. Density maps provided an initial visual 10 overview of the density distribution of tourists in Madrid according to the three datasets. Descriptive statistics were useful to compare the degree of concentration of visitors according to the data sources used. e) OLS analysis.- Results obtained for the three data sources were compared using bivariant Ordinary Least Squares (OLS). The coefficient of determination reveals the common part of variation between each pair of data sources, and the differences between each pair of sources can be analysed using the maps of standardised residuals. f) Cluster analysis.- Cluster analysis was used to integrate the information from the three data sources and characterise the census tracts according to the tourist activities performed in them. The K-means clustering algorithm was used. This method looks for a solution where all the features within each group are as similar as possible, and all the groups themselves are as different as possible. g) Spatial autocorrelation analysis.- Unlike previous analyses, spatial autocorrelation techniques do not consider each location in an isolated way, but in relation to the locations in its environment (Anselin, 1995). Global Moran's I statistic and Anselin Local Moran's I (LISA statistic) were calculated separately for the three data sources, using the IDW (inverse distance weighted) method with a 500 m radius. The LISA analysis identified High-High clusters of tourists (a high value surrounded primarily by high values). The results were combined in order to determine the specialisation of each census tract, which can be classified as areas of sightseeing (Panoramio High-High clusters), tourist consumption (Foursquare High-High clusters), Internet consumption (Twitter High-High clusters) or a combination of two or three types (for example, consumption and sightseeing ). A census tract classified as HH cluster according to the three data sources means that in a radius of 500 m there is high potential for the three types of activities. All these calculations and maps were made using ArcGIS 10.4 software. 6. RESULTS Tourist density maps and descriptive statistics Figure 3 shows tourist density maps according to census tracts, with rescaled data and at the same intervals in the three maps. Panoramio (Figure 4a) clearly shows a high spatial concentration of visitors in the historic centre and along the north-south axis of the city (Paseo de La Castellana). The areas of greatest density denote the sightseeing spots in the city, for example Plaza de Cibeles, Puerta de Alcalá, Puerta del Sol, Plaza Mayor, the Royal Palace, the Temple of Debod, Plaza de España, Reina Sofía Museum, Atocha Station, Gran Vía street, the Real Madrid and Atletico de Madrid stadiums, the Las Ventas bullring, Torres Kio, Cuatro Torres, among others. Some census tracts have a high density of photos, not because they themselves contain a sightseeing spot, but because they offer a good vantage point for taking photographs of a sightseeing spot located in an adjacent census tract. The prohibition against taking photographs inside some monuments explains the relatively low density of photos in the census tracts containing them (for example, the Prado Museum and the Royal Palace). The digital footprint of tourists in consumption activities (Figure 4b) is particularly dense in the historic centre (for example, the shopping area of Gran Vía street) and, to a lesser extent, in surrounding areas (for example, the area of luxury shops known as the Golden Mile in Barrio 11 de Salamanca). It is also sporadically explained by the presence of some shopping centres (such as the AZCA centre) or singular places (Real Madrid Stadium museum and shopping area). Finally, tourist density identified according to Twitter (Figure 4c) is more disperse. It is particularly high in the historic centre, and it tends to spread throughout surrounding census tracts, and along the Paseo de la Castellana axis. Most of Madrid's tourist accommodation offer (hotels, hostels, tourist apartments...) is located in these areas (see, for example, booking.com or http://insideairbnb.com/madrid/). The descriptive statistics of tourist density according to the data sources (Table 1) show that a much higher number of tourists have been detected from Twitter than from any of the other two sources, thus justifying the rescaling of the three variables. The coefficient of variation of Foursquare data reveals that, as suggested by the maps, consumption activities exhibits the highest spatial concentration. Table 1: Descriptive statistics Tourist density/Ha according to census tracts Panoramio Foursquare Twitter Count: 2415 2415 2415 Minimum: 0 0 0 Maximum: 3.83 8.25 13.86 Sum: 312.95 424.53 2491.02 Mean: 0.13 0.176 1.03 Standard Deviation: 0.33 0.550 1.31 Variation coefficient 251.0 312.80 127.1 Tourist density/Ha according to census tracts: rescaled data Panoramio Foursquare Twitter Count: 2415 2415 2415 Minimum: 0 0 0 Maximum: 1000 1000 1000 Sum: 81625.85 51420.18 179479.38 Mean: 33.80 21.29 74.32 Standard Deviation: 84.86 66.60 94.51 Variation coefficient 251.0 312.80 127.1 12 Figure 4: Tourist density according to: a) Panoramio, b) Foursquare, c) Twitter. The references to the numbers in Figure 3a can be seen in Figure 1. Comparison between data sources: OLS analysis To determine the degree of association between the three distributions, the coefficient of determination was calculated between each pair of data sources (Table 2). The standardised residuals of the bivariate regressions reveal where the data sources present the greatest differences (Figure 4). The correlation analysis points to a medium positive correlation between the tourist densities provided by the Twitter-Foursquare data sources. The correlation between the number of tourists provided by the Panoramio-Twitter and Panoramio-Foursquare data sources is low, indicating a greater complementarity. The main discrepancies between these three data sources can be analysed through the residuals between the variables in the regression analysis: - Panoramio has more tourists than expected according to Foursquare and Twitter (positive residuals in Figures 5a and 5b) in the historic centre and in the main sightseeing spots in the city (football stadiums, the bullring, Retiro Park, Torres Kio, Cuatro Torres, Temple of Debod), but less than expected (negative residuals) around the city centre. - Foursquare shows more tourists than expected according to Twitter (Figure 5c) in the historic centre, the Salamanca district, in the AZCA shopping and business centre, the large shopping centre of La Vaguada, Real Madrid Stadium, and Atocha station, but less tourists than expected in less central areas. 13 Table 2: Coefficients of determination (adjusted r2) of tourist distribution according to data sources (OLS) Panoramio Foursquare Twitter Panoramio 1 Foursquare 0.27** 1 Twitter 0.31** 0.52** 1 ** Significant to 0.01 Figure 5: Standardised residuals of bivariate regressions: a) Foursquare-Panoramio; b) Twitter- Panoramio; c) Twitter- Foursquare. The references to the numbers in Figure 4a can be seen in Figure 1. Types of spaces according to tourist activities: cluster analysis Cluster analysis enables the census tracts to be typified according to tourist density per activity (sightseeing, consumption or Internet activity). The cluster was calculated using the K-means algorithm and using rescaled density values. Table 3 shows the means and standard deviations that characterise each group and the parallel box plot graph summarises both the groups and the variables within them (Figure 6). Figure 7 shows the spatial distribution of the 6 groups established: ‐ Groups of census tracts with a predominance of tourists related to sightseeing: groups 6 (coffee) and 2 (red). The differences are based on the intensity of the number of tourists. Group 6 contains tracts with a very high number of tourists identified in Panoramio, and high in Twitter and Foursquare users, and corresponds to spaces such as the bullring, Puerta de Alcalá arch, Glorieta de Atocha and Plaza de España square. Group 2 also contains tracts linked to sightseeing, but with a lower number of tourists: 14 the parks of the Temple of Debod and Retiro, the Royal Palace and the surrounding area, the Atlético de Madrid stadium, Atocha station and singular buildings such as Torres Kio and Cuatro Torres. ‐ Groups of census tracts with a predominance of tourists linked to consumption: groups 5 (violet) and 3 (green). Group 5 contains the most commercial tracts in the historic centre (Gran Vía, Puerta del Sol square), which also have a very high number of tourists according to Twitter and Panoramio. Group 3 comprises tracts with a high tourist density related to consumer activities and Twitter, but much less in relation to Panoramio. These correspond to the historic centre (except the most commercial area), with a decrease in specialisation in Twitter towards the periphery, but also commercial areas like Golden Mile-Salamanca district, AZCA and retail spaces in the centre. ‐ Groups of census tracts with a predominance of tourists related to Internet activities: Group 1 (blue), with a lower number of tourists, occupies the limits of the historic centre. ‐ Census tracts with a low presence of tourists: group 4 (yellow). These are the non- tourist spaces in the city and generally correspond to peripheral census tracts. Table 3: Means and standard deviations of the groups of tracts. Tourists according to Panoramio Tourists according to Foursquare Tourists according to Twitter Groups Number of tracts Mean Sd Mean Sd Mean Sd 1 331 23.9 30.0 33.2 35.1 161.3 60.1 2 151 148.5 56.0 21.4 27.7 110.5 66.1 3 76 141.5 82.9 182.6 72.5 282.6 102.1 4 1795 8.2 17.5 3.82 9.2 36.1 25.2 5 26 343.0 217.3 498.7 165.1 599.9 178.1 6 36 466.6 168.2 96.9 78.8 208.9 119.6 Total 2415 33.8 84.8 21.3 66.6 74.3 94.5 15 Figure 6: Standardised means for the groups of tracts (parallel box plot graph). 16 Figure 7: Groups of tracts according to tourist activities Analysis of spatial autocorrelation Spatial autocorrelation allows the data from each census tract to be analysed not in isolation (as in previous sub-section) but in relation to the data in the census tracts in its environment. Using the IDW (inverse distance weighted) procedure with a distance threshold of 500 m, Global Moran’s Index shows a positive spatial autocorrelation in all cases (positive Moran I values), lower for Panoramio than for Twitter and Foursquare (Table 4). Anselin Local Moran’s I statistic reveals the spatial cluster distribution (significant at the 0.01 level) (Figure 8). HH census tracts (high values in a variable surrounded by high values in that same variable) tend to form a single cluster in the case of Foursquare (historic centre and Salamanca district) (Figure 8b), but several clusters in geolocated photographs (historic centre, Real Madrid stadium, Torres Kio-Cuatro Torres) (Figure 8a). 17 Table 4: Global Moran’s I statistics (distance threshold = 500m) Panoramio Foursquare Twitter Global Moran’s Index 0.571 0.692 0.741 z-score 37.88 46.18 49.02 p-value 0.0000 0.0000 0.0000 Figure 8: Results of the LISA analysis (distance threshold = 500m). a) Panoramio, b) Foursquare, c) Twitter. The references to the numbers in Figure 7a can be seen in Figure 1. The results of the previous univariate analysis have been cross-referenced to incorporate the information from the three data sources. Figure 9 shows a classification of the census tracts considering the HH clusters of the three data sources jointly. When a census tract forms part of a HH cluster in the three sources, it indicates that in a radius of 500 m there is a high density of opportunities for sightseeing, consumption and connecting to the internet. This cross- referencing therefore gives a more complete multivariant LISA vision than would be obtained by using bivariant LISA analysis. The results show census tracts that form part of the HH cluster in the three data sources (historic centre), or in only one of them –for example, in areas specialised in sightseeing (such as Las Ventas bullring), in consumption (Salamanca district) or in both (AZCA shopping and business centre). In general, the resulting map shows how tourist specialisation tends to increase radiating outwards from the historic centre towards the periphery: the census tracts in the centre have a mixed character (multifunctional), and are surrounded by others which generally have two activities, while the most peripheral tend to specialise in one activity. 18 Figure 9: Typology combining the HH cluster of the three data sources 7. DISCUSSION The work tracks the digital footprint of urban tourists using several data sources and taking the city of Madrid as the study area. Unlike the paper by García-Palomares et al. (2015), comparing different cities using the same data source (Panoramio), we perform a comparison between several data sources within the same city. Three data sources are used to identify the places where tourists carry out their activities: geolocated photographs from the Panoramio platform (sightseeing), Foursquare check-ins (consumption), and interaction on the social network Twitter (being connected-accommodation). This article not only counts the density of digital footprints (for example, number of photographs or tweets), but also the density of tourists. This mitigates problems of possible bias (compulsive photographers, Foursquare users or tweeters) and allows the three data sources to be compared. Previous literature (Girardin et al., 2008a, García-Palomares, 2015) identified visitors by dividing the study period into one-month blocks. If a user had taken pictures over a period exceeding one week per year, the photographs were attributed to residents; if the period was less than one week per year, then they were attributed to tourists. We have reduced this time limit to one week, more in line with the typical length of the stay in urban tourism, usually 2 or 3 days (Mazanec, 1997). In fact, the average length of stay of tourists in Madrid is 2.1 nights (Madrid City Council). The same time limit is considered in the three data sources, in order to establish comparisons between them. 19 The data sources and methodological approach used in this paper contribute to a better understanding of tourism activities in cities, not yet sufficiently developed (Pearce, 2001; Shoval & Raveh, 2004). These results are consistent with previous studies (for example, Shoval & Raveh, 2004; Ashworth & Page, 2011) showing that tourists visiting a city are selective and make use of only a very small portion of all that the city has on offer. The findings reveal that it is not sufficient to use a single data source to understand the spatial distribution of tourists in cities, as tourists engage in different activities in different spaces. Not surprisingly, the three data sources show a high tourist density in some areas of the historic centre, where there is a significant concentration of monuments, shops, hotels, restaurants and more. Spatial clusters identified in this paper confirm this trend, which is consistent with previous literature (Pearce, 1987; Hayllar & Griffin, 2009). These central areas can be considered as multifunctional tourist areas, since they exhibit a high concentration of tourist attractions, facilities and accommodations. However, tourists’ digital footprint also extends throughout other areas of the city (football stadiums, bullring, commercial districts, large parks), which is consistent with previous literature (for example, Shoval & Ravel, 2004). Contrary to the multifunctional character of the most central tourist areas, the spatial clusters identified outside the historical centre tend to be more specialised from the tourist point of view, so that tourist specialisation tends to increase, radiating outwards from the historic centre towards the periphery. Spatial clusters identified by using these three data sources exhibit some similarities, but also clear differences. The areas of tourist concentration identified using Panoramio data coincide with the main sightseeing points of the city. These results are in line with those obtained in previous studies with data from social networks of geolocalized photographs (eg, Girardin et al., 2008; Straumann et al., 2014, García-Palomares et al., 2015). An exception is the Prado Museum, the most visited museum in the city, which presents a low concentration of tourists according to Panoramio, since it is prohibited to take pictures inside. As in Barcelona (Gutiérrez et al., 2017), the high univariate Global Moran Index value obtained in Madrid shows that this activity is very concentrated spatially. As expected, the use of Twitter seems to be related to the location of accommodations, since Twitter data cover areas with a high selection of accommodation and are temporally concentrated between 18 and 22 hours. The current high telephone fees for international calls force tourists to use Twitter, taking advantage of the WiFi networks in their accommodations. As expected, Twitter data tend to be concentrated within the historic centre and its surroundings. These results are consistent with previous work on location of hotels (Arbel & Pizam, 1977) and P2P accommodations (Gutiérrez et al., 2017), suggesting that proximity to tourist attractions is an important factor in the tourist's decision to choose accommodation. After the suppression of international telephone fees between European Union countries in 2017, this variable could lose its interest as a proxy for this type of tourist activity. Finally, Foursquare check-ins are particularly dense in areas specialised in consumption (restaurants and shopping), not only in the historic centre of the city, but also in luxury shopping areas and shopping centre, suggesting that this social network can be a good proxy for locating tourists in activities related to consumption. 8. CONCLUSIONS The goal of this paper is to track the activities of tourists by comparing different Big Data sources: Panoramio, Foursquare and Twitter. These data sources provide massive information 20 on spatial and temporal patterns of tourists in cities. We perform different types of analyses with our datasets. The bivariant correlation analysis reveals little similarity between the distributions of tourist density on Panoramio and Twitter –on the one hand–, and Panoramio and Foursquare –on the other–, but a certain similarity between the spatial patterns of tourists according to Foursquare and Twitter. In a second step, cluster analysis was applied to classify the city’s tourist areas based on the intensity of the digital footprint left by tourists as being more oriented to sightseeing, consumption or Internet connection. Finally, the spatial autocorrelation analysis shows tourist activity by considering each census sector, not in an isolated way, but in relation to the census tracts in its surroundings. The global analysis confirms that Panoramio reveals the lowest spatial autocorrelation (with several spatial clusters). Areas allowing a wide range of different activities can be differentiated from those with a more specialised nature by combining the results of the three data sources. The degree of specialisation tends to increase from the historic centre radiating outwards towards the periphery. As in other papers that use Big Data, there is also an underlying problem of bias. Most tourists do not upload their photos to photo-sharing communities like Panoramio, and some do not even take photographs. Photographs do not always properly reflect all the monuments in the city, due to the prohibition against taking photographs in some monuments, and particularly in museums. Many tourists do not use social networks like Twitter or Foursquare and only a small proportion of Twitter users send geolocated tweets. The source bias is unquestionably difficult to identify and correct. In this paper the bias has been mitigated by working with the density of tourists rather than with the density of their footprints (photographs, Foursquare check-ins or tweets), and avoids counting the same tourist several times –this is especially important in the case of compulsive users. In addition, by comparing different data sources we consider different tourist activities, and this partly offsets the bias caused by working with only one source. Knowledge of the spatial distribution of urban tourists is extremely important for public policies. Thus for example in spaces with a high prevalence of tourists, the local authorities may envisage actions to improve the experience of the tourists, such as creating pedestrian- only streets or widening pavements, extending public spaces with free WiFi, locating new tourist information points, among others. The spatial distribution of the photographs taken by tourists shows there are spaces with a high tourist potential but which are still under exploited (for example, the Madrid Río axis). The results may also be important for pinpointing new opportunities for business for the private sector, for example by identifying areas with economies for locating retail tourism where there are still opportunities for expansion. Research in tourism using social network data is a promising field. However, the use of these data sources in research raises privacy and ethical issues (Moreno et al., 2013). Researchers should not only preserve the anonymity of users, but should also avoid using information from social networks inappropriately, for example, revealing uncomfortable information about social groups (Zwitter, 2014). Acknowledgments.- We gratefully acknowledge the financial support of the European Union's Seventh Framework Programme (INSIGHT project), the Madrid regional government (SOCIALBIGDATA-CM S2015/HUM-3427), and the Ministerio de Economía y Competitividad of Spain and Universidad Complutense Madrid for the funding provided for a post-doctoral 21 fellowship (FPDI-2013-17001). The authors would also like to thank Luca Piovano from CEDINT (Polytechnic University of Madrid) for help in downloading the Panoramio data. REFERENCES Anselin, L. (1995). Local Indicators of Spatial Association — LISA. Geographical Analysis 27 (2): 93–115. doi:10.1111/j.1538-4632.1995.tb00338.x. Arbel, A., & Pizam, A. (1977). Some determinants of urban hotel location: The tourists' inclinations. Journal of Travel Research, 15(3), 18e22. Asakura, Y., & Iryo, T. (2007). Analysis of tourist behaviour based on the tracking data collected using a mobile communication instrument. Transportation Research Part A: Policy and Practice 41, 684–690. Bassolas, A., Lenormand, M., Gonçalves, B., Tugores, A., & Ramasco, J.J. (2016). Touristic site attractiveness seen through Twitter. EPJ Data Science 5. doi:10.1140/epjds/s13688-016-0073-5 Çelikten, E., Falher, G. L., & Mathioudakis, M. (2016). Extracting Patterns of Urban Activity from Geotagged Social Data. arXiv preprint arXiv:1604.04649. Cooper, C. P. (1981). Spatial and temporal patterns of tourist behaviour. Regional Studies, 15(5), 359-371. Edwards, D., & Griffin, T. (2013). Understanding tourists’ spatial behaviour: GPS tracking as an aid to sustainable destination management. Journal of Sustainable Tourism, 21(4), 580-595. Ferreira, A. P. G., Silva, T. H., & Loureiro, A. A. F. (2014). You are your check-in: Understanding the behavior of tourists and residents using data from foursquare. In Proceedings of the 20th Brazilian Symposium on Multimedia and the Web (pp. 103-110). ACM. Frias-Martinez, V., Soto, V., Hohwald, H., & Frias-Martinez, E. (2012). Characterizing urban landscapes using geolocated tweets. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Confernece on Social Computing (SocialCom) (pp. 239-248). IEEE. García-Palomares, J.C., Gutiérrez, J., & Mínguez, C. (2015): Identification of tourist hot spots based on social networks: a comparative analysis of European metropolises using photo- sharing services and GIS. Applied Geography, 63, 408–417. Girardin, F., Calabrese, F., Fiore, F.D., Ratti, C., & Blat, J. (2008a). Digital Footprinting: Uncovering Tourists with User-Generated Content. IEEE Pervasive Computing 7, 36–43. doi:10.1109/MPRV.2008.71 Girardin, F., Fiore, F. D., Ratti, C., & Blat, J. (2008b). Leveraging explicitly disclosed location information to understand tourist dynamics: a case study. Journal of Location Based Services, 2(1), 41-56. Gotham, K. (2005). Tourism gentrification: The case of New Orleans' Vieux carre (French quarter). Urban Studies, 42(7), 1099e1121. Gutiérrez, J., García-Palomares, J. C., Romanillos, G., & Salas-Olmedo, M. H. (2017). The eruption of Airbnb in tourist cities: Comparing spatial patterns of hotels and peer-to-peer accommodation in Barcelona. Tourism Management, 62, 278-291. Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2014). Geo- 22 located Twitter as proxy for global mobility patterns. Cartography and Geographic Information Science 41, 260–271. doi:10.1080/15230406.2014.890072 Hayllar, B., & Griffin, T. (2009). Urban tourist precincts as sites of play. In G. Maciocco, and S. Serreli, (Eds.), Enhancing the city, new perspectives for tourism and leisure. Urban and Landscape Perspectives, 6(1), 65-81. Jansen-Verbeke, M. (1986). Inner-city tourism: resources, tourists and promoters. Annals of Tourism Research, 13(1), 79-100. Jansen-Verbeke, M. (1998). Tourismification of historical cities. Annals of Tourism Research, 25(3), 739-742. Kisilevich, S., Keim, D., Andrienko, N., & Andrienko, G. (2013). Towards acquisition of semantics of places and events by multi-perspective analysis of geotagged photo collections. In A. Moore, & Drecki (Eds.), Geospatial Visualisation, Lecture Notes in Geoinformation and Cartography. Berlin Heidelberg: Springer-Verlag. Kitchin, R. (2013). Big Data and human geography Opportunities, challenges and risks. Dialogues in human geography, 3(3), 262-267. Koerbitz, W., Önder, I., & Hubmann-Haidvogel, A. C. (2013). Identifying Tourist Dispersion in Austria by Digital Footprints (pp. 495-506). Springer Berlin Heidelberg. Kurashima, T., Iwata, T., Irie, G., &Fujimura, K. (2013). Travel route recommendation using geotagged photos. Knowledge and information systems, 37(1), 37–60. Lamsfus, C., Martín, D., Alzua-Sorzabal, A., & Torres-Manzanera, E. (2015). Smart Tourism Destinations: An Extended Conception of Smart Cities Focusing on Human Mobility, in: Tussyadiah, I., & Inversini, A. (Eds.), Information and Communication Technologies in Tourism 2015. Proceedings of the International Conference in Lugano, Switzerland, February 3 - 6, 2015. Springer International Publishing, Cham, pp. 363–375. doi:10.1007/978-3-319-14343- 9_27 Li, X. (2013). Multi-day and multi-stay travel planning using geo-tagged photos. In Proceedings of the Second ACM SIGSPATIAL International Workshop on Crowdsourced and Volunteered Geographic Information (pp. 1-8). ACM. Lu, X., Wang, C., Yang, J. M., Pang, Y.,& Zhang, L. (2010). Photo2trip: generating travel routes from geo-tagged photos for trip planning. In Proceedings of the international conference on Multimedia (pp. 143–152). ACM. Mahadevan, P., Prajish Kuman, P., & Vimeesh, M. (2013). Role of social networking in tourism marketing- a case study of tavel industry in Kerala. International Journal of Multidisciplinary Educational Research 2, 107–117. Mazanec, J. A. (1997). Segmenting city tourists into vacation styles. International city tourism: Analysis and strategy, 114-128. McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big data. The management revolution. Harvard Bus Rev, 90(10), 61-67. Mocanu, D., Baronchelli, A., Perra, N., Gonçalves, B., Zhang, Q., & Vespignani, A. (2013). The twitter of babel: Mapping world languages through microblogging platforms. PloS one, 8(4), e61981. 23 Moreno, M. A., Goniu, N., Moreno, P. S., & Diekema, D. (2013). Ethics of social media research: common concerns and practical considerations. Cyberpsychology, Behavior, and Social Networking, 16(9), 708-713. Murthy, D. (2013). Twitter: Social communication in the Twitter age. John Wiley & Sons. Netto, V. M., Pinheiro, M., Meirelles, J. V., & Leite, H. (2015): Digital footprints in the cityscape. International Conference on Social Networks, Athens, USA. Openshaw, S., &Taylor, P.J. (1981), The modifiable areal unit problem. In: Quantitative geography: a British View, (eds) N. Wrigley and R.J. Bennett, (Routledge and Kegan Paul: London), pp 60-70. Pearce, D. G. (1987). Tourism today: A geographical analysis. Burnt Mill: Longman. Pearce, D. G. (2001). An integrative framework for urban tourism research. Annals of Tourism Research, 28(4), 926-946. Qu, Y., & Zhang, J. (2013). Trade area analysis using user generated mobile location data. In Proceedings of the 22nd international conference on World Wide Web (pp. 1053-1064). ACM. Ribeiro, A. I. J. T., Silva, T. H., Duarte-Figueiredo, F., & Loureiro, A. A. (2014). Studying traffic conditions by analyzing foursquare and instagram data. In Proceedings of the 11th ACM symposium on Performance evaluation of wireless ad hoc, sensor, & ubiquitous networks (pp. 17-24). ACM. Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. In Collaboration Technologies and Systems (CTS), 2013 International Conference on (pp. 42-47). IEEE. Serrano-Estrada, L., Martí, P., & Nolasco-Cirugeda, A. (2016): Reading the social preferences of tourist destinations through social media data. In Back to the Sense of the City: International Monograph Book (pp. 1065-1075). Centre de Política de Sòl i Valoracions. Shelton, T., Poorthuis, A., & Zook, M. (2015). Social media and the city: Rethinking urban socio- spatial inequality using user-generated geographic information. Landscape and Urban Planning, 142, 198-211. Shoval, N., & Raveh, A. (2004). Categorization of tourist attractions and the modelling of tourist cities: Based on the co-plot method of multivariate analysis. Tourism Management, 25(6), 741e750. Shoval, N., & Isaacson, M. (2007). Tracking tourists in the digital age. Annals of Tourism Research 34, 141–159. doi:10.1016/j.annals.2006.07.007 Shoval, N., McKercher, B., Ng, E., & Birenboim, A. (2011). Hotel location and tourist activity in cities. Annals of Tourism Research, 38(4), 1594-1612. Silva, T. H., Vaz de Melo, P. O., Almeida, J. M., Salles, J., & Loureiro, A. A. (2013). A comparison of foursquare and instagram to the study of city dynamics and urban social behavior. In Proceedings of the 2nd ACM SIGKDD international workshop on urban computing (p. 4). ACM. Sobolevsky S, Sitko I, Grauwin S, Des Combes RT, Hawelka B, et al. (2014a) Mining Urban Performance: Scale-Independent Classification of Cities based on Individual Economic Transactions. In: Big Data Science and Computing, 2014 ASE International Conference on, May 27–31, Stanford University. p. 10. 24 Sobolevsky S, Sitko I, Tachet des Combes R, Hawelka B, Murillo Arias J, et al. (2014b) Money on the Move: Big Data of Bank Card Transactions as the New Proxy for Human Mobility Patterns and Regional Delineation. The Case of Residents and Foreign Visitors in Spain. In: Big Data (BigData Congress), 2014 IEEE International Congress on, Jun 27-Jul 2, Anchorage, AK. pp.136– 143. doi:10.1109/BigData.Congress.2014.28. Sobolevsky S, Bojic I, Belyi A, Sitko I, Hawelka B, et al. (2015) Scaling of city attractiveness for foreign visitors through big data of human economical and social media activity. arXiv:150406003. Straumann, R. K., Çöltekin, A., & Andrienko, G. (2014). Towards (Re) constructing narratives from georeferenced photographs through visual analytics. The Cartographic Journal, 51(2), 152-165. Sun, Y., & Fan, H. (2014). Event Identification from Georeferenced Images. In Connecting a Digital Europe through Location and Place. (pp. 73-88). Springer International Publishing. Takhteyev, Y., Gruzd, A., & Wellman, B. (2012). Geography of Twitter networks. Social networks, 34(1), 73-81. Vaca, C., Quercia, D., Bonchi, F., & Fraternali, P. (2015, May). Taxonomy-based discovery and annotation of functional areas in the city. In Ninth international AAAI conference on web and social media. AAAI Publications. http://ai2-s2- pdfs.s3.amazonaws.com/218f/70bd5459cb0388247eaeae5a4f5d38173d7f.pdf Wu, L., Zhi, Y., Sui, Z., & Liu, Y. (2014). Intra-urban human mobility and activity transition: evidence from social media check-in data. PloS one, 9(5), e97010. Zwitter, A. (2014). Big data ethics. Big Data & Society, 1(2), 2053951714559253. http://ai2-s2-pdfs.s3.amazonaws.com/218f/70bd5459cb0388247eaeae5a4f5d38173d7f.pdf http://ai2-s2-pdfs.s3.amazonaws.com/218f/70bd5459cb0388247eaeae5a4f5d38173d7f.pdf Dr. María Henar Salas-Olmedo is a geographer and post-doc researcher at Complutense University Madrid with an international background acquired in EU funded research projects and some research stays in Europe and Latin America. Her research interests are the analysis of accessibility, mobility and land use relationships in a variety of scales and topics to help designing better urban and regional policies. Her research methodologies include spatial analysis with the use Geographic Information Systems (GIS) combining geolocated regular and big data. She has co-authored several papers in top scientific journals especially in the areas of geography, economics and transportation, and she regularly attends international conferences. She also teaches in graduate and postgraduate programs and organizes seminars on GIS and transport issues. Borja Moya-Gómez is a researcher at Research Group tGIS (Transport, Infrastructure and Territory). He is a PhD candidate in Geography at Universidad Complutense de Madrid (UCM). He graduated as Civil Engineer, major in Transport and Urban Services, and M.Sc. Logistics, Transport and Mobility at Universitat Politècnica de Catalunya – BarcelonaTech (UPC). His research interests are the traffic congestion, transport, mobility, accessibility, geographic information systems (GIS), and big data on urban and social studies. He has published 4 papers in scientific journals and contributed in several technical reports. He also teaches in graduate and postgraduates programs. Juan Carlos Garcia-Palomares is Associate Professor of Human Geography at the University Complutense of Madrid. He obtained his PhD in Human Geography with Honours. His research activity focuses on urban studies, big data and Geographic Information Systems (GIS). He has published 47 papers in scientific journals, has contributed in more than 25 books, book chapters and research monographs, and he has participated in a number of national and international conferences. He has developed an extensive teaching activity, focusing on the application of GIS, urban geography, transportation and mobility. Javier Gutiérrez Puebla is Full Professor of Human Geography at the Universidad Complutense de Madrid (UCM), director of the Research Group tGIS (Transport, Infrastructure and Space) and head of the Department of Human Geography. He specializes in GIS (Geographic Information Systems), mobility and urban studies. He has written more than 100 scientific publications, with more than 3,000 citations in Google Scholar. He has directed 10 doctoral theses and numerous research projects. http://ees.elsevier.com/jtma/download.aspx?id=353856&guid=1c624b12-cd68-452e-8f32-2f141c7431dc&scheme=1 http://ees.elsevier.com/jtma/download.aspx?id=353860&guid=7ffdc918-4fd9-4524-a7ab-099376e096e4&scheme=1 http://ees.elsevier.com/jtma/download.aspx?id=353861&guid=b94e6627-4ecc-4ae0-9564-0d367e7a43d9&scheme=1 http://ees.elsevier.com/jtma/download.aspx?id=353862&guid=18c056e4-de9c-4f50-87d5-c8a503668c21&scheme=1