At Cmotions we love to get inspired by working on projects that are just a little different from our day-to-day work. That is why we have The Analytics Lab, where we organize our so-called Project Fridays. This specific Project Friday was initiated by a biologist's question about wolf presence. What started with a small question and some enthusiastic colleagues ended up as a whole bunch of Python code, an understandable XGBoost model to predict wolf presence in an area, a scientific article (please let us know if you are interested in reading it) and a series of articles to share our approach (see links below). This article is part of that series, in which we show how we gather and prepare geographical data to use in a predictive model, visualize the predictions on a map and understand/unbox our model using SHAP.
What do you need to know to start working with geographical data?
Have you ever wondered how Google Maps calculates the distance between two places? Or how the government keeps track of where utility lines such as sewers, gas pipes and electricity cables are located? Both are examples of using a Geographical Information System (better known as GIS) with geographical data. Geographical data is data related to a specific place or area on earth, for example your address or a pair of coordinates. If you want to start working with geographical data, the most important concept is GIS. A GIS is an information system with which spatial data or information about geographical objects can be saved, managed, edited, analyzed, integrated and presented.[1]
GIS has been around for a while and has been developed further in recent years. But how did it actually start? In the 1960s, Canadian geographer Roger Tomlinson came up with the idea of using a computer to aggregate information about natural resources and create an overview by province. With this, the first Geographic Information System was born. In 1985, GIS was used for the first time in the Netherlands.[2]
Project information on a map
Today there are numerous ways to use GIS and therefore to work with geographical data. Before you choose how you want to work with it, it is important to understand some concepts. A good place to start is with map projections and the associated coordinate reference systems (CRS): how do we make sure the round earth can be shown on a flat, two-dimensional map? To represent the Earth on a map with reasonable accuracy, cartographers have developed map projections. Map projections try to represent the round world in 2D with as few errors as possible. Each projection deals with this in a different way and has its own advantages and disadvantages. For example, one projection preserves the shape of countries but does not display their correct size, while another shows sizes more accurately but distorts the shapes. If you want to see the real size and shape of the world, you will always have to look at a 3D map or globe. If you want to learn more about the consequences of different projections, we recommend the video Why all world maps are wrong.
[3] https://medium.com/nightingale/understanding-map-projections-8b23ecbd2a2f
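To make the trade-off between shape and size concrete, here is a minimal sketch of the spherical Mercator projection, which preserves shapes but inflates areas the further you move from the equator. The function names are our own and the sphere radius is an assumption (the WGS 84 semi-major axis); it illustrates the idea and is not a replacement for a projection library.

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # assumed sphere radius (WGS 84 semi-major axis)


def mercator_xy(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to spherical Mercator x/y in meters."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    x = EARTH_RADIUS_M * lon
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


def mercator_area_exaggeration(lat_deg):
    """Factor by which Mercator inflates areas at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


for lat in (0, 30, 60, 75):
    factor = mercator_area_exaggeration(lat)
    print(f"latitude {lat:>2}: areas are drawn {factor:.1f}x too large")
```

At the equator the factor is 1, while at 60 degrees latitude areas are drawn about four times too large, which is why countries far from the equator look bigger on a Mercator map than they really are.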
A coordinate reference system is a framework that defines how a point on the round earth translates to the same point on a two-dimensional map. There are two types: geographic CRS and projected CRS. A geographic CRS defines where the data is located on the earth’s surface, and a projected CRS tells the data how to draw on a flat surface, such as a paper map or a computer screen.[4] A geographic CRS is based on longitude and latitude, two angles that describe where on the round Earth you are. Longitude is the angle between the Prime Meridian (at Greenwich) and a point on Earth; it is commonly written as a positive number east of the Prime Meridian and a negative number west of it. Latitude is the angle between the equator and a point; it is positive on the Northern Hemisphere and negative on the Southern Hemisphere. A projected CRS defines the location on a two-dimensional map instead of on the round world. Here, x- and y-coordinates are used, and the distance between neighboring x-coordinates and between neighboring y-coordinates is the same everywhere on the map.[5]
[4] https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/
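The practical difference between the two types of CRS shows up as soon as you measure something. In a geographic CRS a step of one degree of longitude does not cover the same distance everywhere, whereas the units of a projected CRS are constant. The sketch below illustrates this with the haversine formula on a spherical Earth; the function name and the mean radius are our own assumptions, and a sphere is only an approximation of the real Earth.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius; the sphere is an approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The same step of one degree of longitude, taken at different latitudes.
for lat in (0, 52, 80):
    step_km = haversine_km(0, lat, 1, lat)
    print(f"1 degree of longitude at latitude {lat}: {step_km:.1f} km")
```

The step shrinks as you move towards the poles, so distances and areas are usually calculated after converting the data to a projected CRS.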
Different types of layers
Two other important concepts are raster and vector layers. GIS files are constructed from different map layers, and these layers can be built up in two ways. Just as there is a difference between raster images and vector images, there are raster layers and vector layers; the type defines how a layer is constructed. Raster layers consist of a collection of pixels. Vector layers, on the other hand, consist of a collection of objects. These objects can be points, lines or polygons. Points consist of X and Y coordinates, usually longitude (x) and latitude (y). Line objects are vectors that connect points, and polygons are areas on the map. Sometimes multiple areas are represented as one object; these are called multipolygons. Vector layers are commonly used when working with geographical data.
[6] https://medium.com/analytics-vidhya/raster-vs-vector-spatial-data-types-11325b83852d
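The difference between the two layer types can be sketched in a few lines of plain Python. Below, a vector polygon is stored as a list of corner points and then converted into a raster, a grid of pixels, by testing the center of every pixel. The coordinates and function names are hypothetical and only meant as an illustration; GIS software does this far more efficiently.

```python
# A vector polygon: an ordered ring of (x, y) corner points (hypothetical coordinates).
triangle = [(1.0, 1.0), (9.0, 1.0), (5.0, 7.0)]


def point_in_polygon(x, y, ring):
    """Ray-casting test: is the point (x, y) inside the polygon described by ring?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        crosses = (y1 > y) != (y2 > y)  # edge spans the horizontal line through y
        if crosses and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def rasterize(ring, width, height):
    """Turn the polygon into a grid of 0/1 pixels by sampling each pixel center."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, ring) else 0 for col in range(width)]
        for row in range(height)
    ]


raster = rasterize(triangle, width=10, height=8)
for row in reversed(raster):  # print the top row first so that y points up
    print("".join("#" if pixel else "." for pixel in row))
```

The vector version needs only three coordinate pairs and stays sharp at every zoom level, while the raster version needs a value for each of the 80 pixels and becomes blocky when you zoom in.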
Saving your map
If you are working with data and you want to save your file, it is important to know which file formats exist for geodata. Just as text can be saved as .txt or .docx and spreadsheets as .xlsx or .csv, geodata has its own file formats. The most common format for vector data is the Shapefile. A Shapefile is not a single file but a collection of files that share the same name and directory and differ only in their extension. To open a Shapefile you need at least a .shp file (Shapefile), a .shx file (Shapefile index file) and a .dbf file (Shapefile data file). Other files, such as a .prj file (Shapefile projection file), can be included for extra information.[7] Another, relatively new, format is GeoPackage. This format stores vector features, tables and rasterized tiles in a SQLite database.
Both Shapefiles and GeoPackage files can be exported from one GIS and imported into another. If you keep working in the same GIS and in the same directory, it is also possible to save your work as a GIS project. In that case, it is important that the data you have loaded into your project stays in the same directory, since the project does not store the data itself, only a reference to it.
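Because a Shapefile is spread over several files and a GeoPackage is a SQLite database, both can be inspected with the Python standard library alone. The sketch below checks whether the required Shapefile parts are present and lists the layers registered in a GeoPackage. The function names are hypothetical; the gpkg_contents table is the content overview that the GeoPackage standard prescribes, and the extension check assumes lowercase file extensions.

```python
import sqlite3
from pathlib import Path

REQUIRED_SHAPEFILE_PARTS = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the required extensions for which no file exists next to shp_path."""
    base = Path(shp_path)
    return [ext for ext in REQUIRED_SHAPEFILE_PARTS if not base.with_suffix(ext).exists()]


def list_geopackage_layers(gpkg_path):
    """Return (table_name, data_type) for every layer registered in a GeoPackage."""
    connection = sqlite3.connect(gpkg_path)
    try:
        query = "SELECT table_name, data_type FROM gpkg_contents"
        return connection.execute(query).fetchall()
    finally:
        connection.close()
```

For example, calling missing_shapefile_parts on a .shp file that was copied without its companion files returns the extensions that still need to be copied along, and an empty list means the Shapefile is complete enough to open.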
Create a map yourself
Now you know the basic concepts of working with geographical data. The next step is to decide which software you want to use. In general, there are two options:
- Via specifically developed GIS software
- Via common programming languages
Two well-known dedicated GIS applications are ArcGIS and QGIS. ArcGIS is paid software that you use under a license. QGIS, on the other hand, is open-source software, which means that it’s free for the user. QGIS is an official project of the Open Source Geospatial Foundation, a non-profit organization that aims to make the use of geodata accessible to everyone. Both programs are similar in use and have similar functionality and capabilities.
In addition to GIS software, it is nowadays also possible to work with geodata in Python or R. Several geospatial packages are available for this. A well-known one for Python is GeoPandas. The goal of GeoPandas is to make working with geographic data in Python easier. It combines the capabilities of pandas and shapely.[8] In GeoPandas, data is stored in GeoDataFrames. These GeoDataFrames are similar to pandas DataFrames, but an important difference is that a GeoDataFrame always contains a geometry column, which stores the corresponding geographic data for each row.
Working with QGIS is different from working with GeoPandas in Python, but the possibilities of both options are mostly the same. Both are able to load different file formats and easily plot the geographical data for you. However, visualization is more convenient in QGIS, since the default plot in GeoPandas is a static image that you cannot zoom. In QGIS you can easily zoom and add a standard map layer (such as a layer from Google Maps) to put your data into a broader perspective. Furthermore, a lot of analyses are available in both, for instance determining the distance between two places or creating a buffer around your polygons. These analyses with geographical data are discussed in more detail in this blog.
In short, you can work with geographical data in a GIS. For this, you can use dedicated software like QGIS or you can use the GeoPandas package in Python. There are different map projections and associated coordinate reference systems to translate the three-dimensional world to a two-dimensional map. Furthermore, GIS files are constructed from different map layers, which can be vector layers or raster layers, and these files can be saved as a Shapefile or as a GeoPackage.
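As a minimal sketch of what a GeoDataFrame looks like, the example below turns a regular pandas DataFrame with longitude and latitude columns into a GeoDataFrame. The city coordinates are approximate sample values that we added for illustration, and EPSG:4326 (WGS 84) is assumed as the geographic CRS.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical sample data: approximate city coordinates in longitude/latitude.
cities = pd.DataFrame(
    {
        "city": ["Amsterdam", "Utrecht", "Groningen"],
        "lon": [4.9041, 5.1214, 6.5665],
        "lat": [52.3676, 52.0907, 53.2194],
    }
)

gdf = gpd.GeoDataFrame(
    cities,
    geometry=gpd.points_from_xy(cities["lon"], cities["lat"]),  # x = lon, y = lat
    crs="EPSG:4326",  # assumed CRS: WGS 84, a geographic CRS
)

print(type(gdf).__name__)
print(gdf[["city", "geometry"]])
```

All the usual pandas operations, such as filtering and grouping, still work on this object; the geometry column is what adds the spatial methods.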
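The two analyses mentioned above, a distance and a buffer, could look roughly like this in GeoPandas. This is a sketch under assumptions of our own: the coordinates are approximate sample values, and we reproject to EPSG:28992 (the Dutch RD New grid, in meters) because distances and buffers are easier to interpret in a projected CRS than in degrees. For data outside the Netherlands you would pick a different projected CRS.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical sample points, given as (longitude, latitude) in WGS 84.
points = gpd.GeoSeries(
    [Point(4.9041, 52.3676), Point(5.1214, 52.0907)],
    index=["Amsterdam", "Utrecht"],
    crs="EPSG:4326",
)

# Reproject to a projected CRS in meters before measuring (assumed: RD New).
points_rd = points.to_crs("EPSG:28992")

distance_m = points_rd["Amsterdam"].distance(points_rd["Utrecht"])
buffers = points_rd.buffer(10_000)  # circles with a radius of 10 km

print(f"Distance: {distance_m / 1000:.1f} km")
print(f"Buffer area around Utrecht: {buffers['Utrecht'].area / 1e6:.0f} km2")
```

The buffer is a polygon that approximates a circle, so its area is slightly smaller than that of a perfect circle with the same radius.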
So, now you know everything you need to know to start working with geographical data.
This article is part of our series about working with geographical data. The entire series is listed here:
- Getting value out of geodata with AI: getting started
- Getting value out of geodata with AI: convert locations to their lat and lon
- Getting value out of geodata with AI: data preparation
- Getting value out of geodata with AI: train the model
- Getting value out of geodata with AI: explainability using SHAP
- Getting value out of geodata with AI: visualize the model predictions
If you want to use the notebooks that form the basis of the articles in this series, check out our Cmotions GitLab page.
Sources
[1] https://nl.wikipedia.org/wiki/Geografisch_informatiesysteem
[2] https://www.esri.nl/nl-nl/over-ons/wat-is-gis/geschiedenis
[3] https://medium.com/nightingale/understanding-map-projections-8b23ecbd2a2f
[4] https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/
[5] https://desktop.arcgis.com/en/arcmap/10.3/guide-books/map-projections/about-projected-coordinate-systems.htm
[6] https://medium.com/analytics-vidhya/raster-vs-vector-spatial-data-types-11325b83852d
[7] https://www.e-education.psu.edu/geog585/node/691
[8] https://geopandas.org/en/stable/