Session 4: Geometric operation and Data classificationยถ

Written by Men Vuthy, 2021


Overview:ยถ

In this session, we are going to learn about operating geometric functions (i.e. intersection, symetrical difference, union, and difference) and how to classify the data based on specific criteria and map classifiers. These operations are very common in GIS; however, here we will do it Python programming language. It is important to know how it can be done since it will help you deal with large amount of data or big-data.

After this session, you will understand how to:

  • conduct overlay analysis such as clipping or intersection, union, difference, etc

  • conduct a loop operation of overlay analysis

  • aggregate the geometries

  • classify data features based on standard classification methods

  • create custom classifier for data feature classification

Downloading dataset

The data used in this lesson is retrieved from the book titled โ€œHands-On Data Visualizationโ€. The data shows the location of bicycle theft in London reported by the Metropolitan Police Service. Visualize the interactive map of bicycle theft here. The data source is given as a list of coordinates (lat and lon) of location where the bicycles were stolen in a csv file format. Thus, to conduct geometric operation, we need to convert these coordinates into a GeoDataFrame with an appropriate coordinate reference system.

bicycle-theft

1. Geometry operationยถ

a. Spatial overlay analysisยถ

Geometric operation such as interection, union or difference is usually known as overlay analysis. The overlay analysis is one of the spatial GIS operations which aims at creating a new layer based on the overlaying shapes of two geometries and on the attribute information. In other words, it allows you to identify the relationships between two polygon features that share all or part of the same area.

Typical overlay operations are:

overlay

Source: Vector Spatial Analysis - QGIS

Letโ€™s get started to work on practial examples.

From now, we will learn how to conduct overlay operations on our data of bicycle theft in London. The question is โ€œHow many and where the bicycle theft in the southern area of Thames river in London?โ€

[1]:
# import necessary module
import geopandas as gpd
import matplotlib.pyplot as plt
import shapely.speedups
%matplotlib inline

# Let's enable speedups to make queries faster
shapely.speedups.enable()

Letโ€™s first read the border data and see how it looks:

[2]:
# Set file paths
border_fp = "data/greater_london_const_region.shp"
south_river_fp = "data/south-area.shp"

# Read files
border = gpd.read_file(border_fp)
south_river = gpd.read_file(south_river_fp)

# Plot the layers
ax = border.plot(facecolor='#28B463', edgecolor='white')
south_river.plot(ax=ax, facecolor='None', edgecolor='blue')

plt.show()
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_10_0.png

The green area is London city which we want to use for conducting overlay analysis, while the blue polygon is the boundary of the southern area of Thames river.

We will clip the green area within the blue polygon by using a function called overlay() from geopandas. This function takes few inputs such as:

  1. geometries data (i.e. the data we want to clip or make difference)

  2. boundary geometries (i.e. area for clipping or making difference)

  3. type of overlay operation: how (i.e. how=intersection, how=difference, how=union and how=symmetric_difference)

Before conducting overlay operation, it is better to check whether or not the coordinate systems of each data are consistent. It will result in an error if their crs are not the same.

[3]:
# Ensure that the CRS matches, if not raise an AssertionError
assert border.crs == south_river.crs, "CRS are not the same between layers!"

Now, it can be sure that both data have the same crs. Thus, we can start the overlay analysis.

Intersection

[4]:
intersection = gpd.overlay(border, south_river, how='intersection')
[5]:
intersection.head()
[5]:
NAME AREA_CODE DESCRIPTIO FILE_NAME NUMBER NUMBER0 POLYGON_ID UNIT_ID CODE HECTARES AREA TYPE_CODE DESCRIPT0 TYPE_COD0 DESCRIPT1 id Name geometry
0 Havering and Redbridge GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 1 1 112074 41452 E32000009 17089.120 215.025 VA CIVIL VOTING AREA None None 1 South Thames River MULTIPOLYGON (((552057.328 178434.366, 552047....
1 Croydon and Sutton GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 2 2 112056 41443 E32000005 13033.641 0.000 VA CIVIL VOTING AREA None None 1 South Thames River POLYGON ((531409.308 171042.621, 531439.333 17...
2 Bexley and Bromley GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 3 3 112066 41448 E32000002 21444.820 373.731 VA CIVIL VOTING AREA None None 1 South Thames River POLYGON ((549944.025 181074.392, 549992.231 18...
3 South West GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 4 4 112076 41453 E32000013 15259.573 198.064 VA CIVIL VOTING AREA None None 1 South Thames River MULTIPOLYGON (((522034.141 178039.708, 522044....
4 Lambeth and Southwark GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 8 8 112062 41446 E32000010 5716.068 147.597 VA CIVIL VOTING AREA None None 1 South Thames River POLYGON ((535280.948 177955.306, 535280.821 17...
  • Letโ€™s plot our data and see what we have:

[6]:
intersection.plot(facecolor='#28B463', edgecolor='white');
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_19_0.png

Difference

[7]:
difference = gpd.overlay(border, south_river, how='difference')
[8]:
difference.head()
[8]:
NAME AREA_CODE DESCRIPTIO FILE_NAME NUMBER NUMBER0 POLYGON_ID UNIT_ID CODE HECTARES AREA TYPE_CODE DESCRIPT0 TYPE_COD0 DESCRIPT1 geometry
0 Havering and Redbridge GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 1 1 112074 41452 E32000009 17089.120 215.025 VA CIVIL VOTING AREA None None POLYGON ((551329.848 179223.210, 551325.201 17...
2 Bexley and Bromley GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 3 3 112066 41448 E32000002 21444.820 373.731 VA CIVIL VOTING AREA None None MULTIPOLYGON (((553318.724 178598.531, 553382....
3 South West GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 4 4 112076 41453 E32000013 15259.573 198.064 VA CIVIL VOTING AREA None None MULTIPOLYGON (((523299.645 177410.580, 523300....
4 Ealing and Hillingdon GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 5 5 112054 41442 E32000006 17124.643 0.000 VA CIVIL VOTING AREA None None POLYGON ((504288.900 193543.098, 504292.998 19...
5 Brent and Harrow GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 6 6 112060 41445 E32000003 9370.064 0.000 VA CIVIL VOTING AREA None None POLYGON ((517376.114 194089.315, 517378.181 19...
  • Letโ€™s plot our data and see what we have:

[9]:
difference.plot(facecolor='#28B463', edgecolor='white');
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_24_0.png

Union

[10]:
union = gpd.overlay(intersection, difference, how='union')
[11]:
union.head()
[11]:
NAME_1 AREA_CODE_1 DESCRIPTIO_1 FILE_NAME_1 NUMBER_1 NUMBER0_1 POLYGON_ID_1 UNIT_ID_1 CODE_1 HECTARES_1 ... POLYGON_ID_2 UNIT_ID_2 CODE_2 HECTARES_2 AREA_2 TYPE_CODE_2 DESCRIPT0_2 TYPE_COD0_2 DESCRIPT1_2 geometry
0 Havering and Redbridge GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 1.0 1.0 112074.0 41452.0 E32000009 17089.120 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN MULTIPOLYGON (((552057.328 178434.366, 552047....
1 Croydon and Sutton GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 2.0 2.0 112056.0 41443.0 E32000005 13033.641 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN POLYGON ((531409.308 171042.621, 531439.333 17...
2 Bexley and Bromley GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 3.0 3.0 112066.0 41448.0 E32000002 21444.820 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN POLYGON ((549944.025 181074.392, 549992.231 18...
3 South West GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 4.0 4.0 112076.0 41453.0 E32000013 15259.573 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN MULTIPOLYGON (((522034.141 178039.708, 522044....
4 Lambeth and Southwark GL Assembly Const LAC Greater London Authority Assembly Constituency GREATER_LONDON_AUTHORITY 8.0 8.0 112062.0 41446.0 E32000010 5716.068 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN POLYGON ((535280.948 177955.306, 535280.821 17...

5 rows ร— 33 columns

  • Letโ€™s plot our data and see what we have:

[12]:
union.plot(facecolor='#28B463', edgecolor='white');
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_29_0.png

As you can see, that was how we make intersection, difference and union in overlay analysis in Python.

However, it is not yet done to answer our question above. We are looking for the number and location of bicycle theft in the South part of London based on the data from the Metropolitan Police Service. The data is data/bicycle-theft.csv, and it has to be convert to a GeoDataFrame for overlay analysis.

Practical exercise: Bicycle Theft in London

  • Letโ€™s first read the data into Python using pandas.read_csv() to read .csv -file.

[13]:
# import necessary module
import pandas as pd

# Set file paths
bicycle_fp = "data/bicycle-theft.csv"

# Read files
bicycle = pd.read_csv(bicycle_fp)
bicycle.head()
[13]:
latitude longitude
0 51.552200 0.153621
1 51.545110 0.165546
2 51.536326 0.076951
3 51.535612 0.079154
4 51.537440 0.080059
  • Create a point geometry of bicycle theft by using Point() from Shapely and gpd.GeoDataFrame()

[14]:
# Since we will create Points, so we will use Point function from Shapely
from shapely.geometry import Point

# Create an empty list
point_geo = []

# A loop for creating point geometry
for i,j in zip(bicycle.longitude, bicycle.latitude):

    # make point geometry using Point function
    point = Point(i,j)

    # locate all the result in the empty list created above
    point_geo.append(point)
  • Now we have a list of point geometry, but it is still a list not a geodataframe. To be able to conduct analysis, we have to convert a list to a GeoDataFrame which will contain a proper coordinate system and so forth. Because the latitude and longitude are in decimal degree, the crs should be epsg:4326 which is the world coordinate system - WGS 84.

  • You can also determine the crs from latitude and longitude by importing them to Google Earth Pro and save it as a kmz file. After that, by import this kmz file to QGIS, you can clearly see its original crs.

[15]:
from pyproj import CRS

# Create a geopandas dataframe
wgs84 = CRS.from_epsg(4326)
geo_bicycle = gpd.GeoDataFrame({'geometry':point_geo}, crs=wgs84)
geo_bicycle.head()
[15]:
geometry
0 POINT (0.15362 51.55220)
1 POINT (0.16555 51.54511)
2 POINT (0.07695 51.53633)
3 POINT (0.07915 51.53561)
4 POINT (0.08006 51.53744)
[16]:
geo_bicycle.plot()
[16]:
<AxesSubplot:>
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_37_1.png
[17]:
# Check if the crs of geo_bicycle is consistent with others'.
assert geo_bicycle.crs == border.crs, "CRS are not the same between layers!"
C:\Users\a9418\AppData\Roaming\Python\Python37\site-packages\pyproj\crs\crs.py:68: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  return _prepare_from_string(" ".join(pjargs))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-17-eb3361acb26c> in <module>
      1 # Check if the crs of geo_bicycle is consistent with others'.
----> 2 assert geo_bicycle.crs == border.crs, "CRS are not the same between layers!"

AssertionError: CRS are not the same between layers!
[18]:
# Create plot to view the geometry location when the crs are different
fig, ax = plt.subplots(figsize = (5,5))
border.plot(ax = ax, facecolor='#ADB6B5', edgecolor='white')
geo_bicycle.plot(ax = ax, markersize = 0.2, color='blue')
plt.show();
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_39_0.png
  • Since the crs are not consistent with each other, letโ€™s change its original crs to fit with othersโ€™ crs.

[19]:
# Re-project new crs
geo_bicycle = geo_bicycle.to_crs(border.crs)
geo_bicycle.crs
C:\Users\a9418\AppData\Roaming\Python\Python37\site-packages\pyproj\crs\crs.py:68: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  return _prepare_from_string(" ".join(pjargs))
[19]:
{'init': 'epsg:27700'}
[20]:
# Check if the crs of geo_bicycle is consistent with others'.
assert geo_bicycle.crs == border.crs, "CRS are not the same between layers!"
[21]:
# Create plot to view the geometry location when the crs are the same
fig, ax = plt.subplots(figsize = (5,5))
border.plot(ax = ax, facecolor='#ADB6B5', edgecolor='white')
geo_bicycle.plot(ax = ax, markersize = 0.2, color='blue')
plt.show();
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_43_0.png
  • Great! Everything is okay, right now! Letโ€™s see the location of bicycle theft in a map.

[22]:
# Create a figure with one subplot
fig, ax = plt.subplots(figsize=(5,7))

# Plot the layers
border.plot(ax=ax, facecolor='#ADB6B5', edgecolor='white')
geo_bicycle.plot(ax=ax, markersize = 0.2, color='blue')
south_river.plot(ax=ax, facecolor='None', edgecolor='red', linewidth = 1.5)

plt.show();
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_45_0.png
[23]:
geo_bicycle.to_file('result/resut.shp')
  • As we can see, there are a lot of bicycle theft in the South part of London. Letโ€™s use overlay() to extract the location points inside the polygon based on intersection method.

[24]:
# Clip the points inside the red polygon
bicycle_south = gpd.overlay(border, south_river, how='intersection')
  • Letโ€™s see the number of bicycle theft in the South of London

[25]:
print('The number of bicycle theft in southern London is:', len(bicycle_south),'bikes.')
The number of bicycle theft in southern London is: 9 bikes.
  • Letโ€™s plot 3490 location in a map

[26]:
# Create a figure with one subplot
fig, ax = plt.subplots(figsize=(5,7))

# Plot the layers
border.plot(ax=ax, facecolor='#ADB6B5', edgecolor='white')
bicycle_south.plot(ax=ax, markersize = 0.2, color='blue')
south_river.plot(ax=ax, facecolor='None', edgecolor='red', linewidth = 1.5)

plt.show();
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_52_0.png
  • Itโ€™s time to save the result to a GeoJSON file that is commonly used file format nowadays for storing spatial data.

[27]:
# set output path
outfp = 'result/bicycle_south.shp'

# export to GeoJSON
bicycle_south.to_file(outfp)

b. Data aggregationยถ

Data aggregation is very uselful when you want to combine or merge geometries together based on its similarity; for example, grouping the same type of fishes, building, region and other similar categories. As in our case, we are going to categorize the region of New York into East, West, Central and Long Island by zone in the column NYSP_ZONE.

For doing the aggregation we will use a function called dissolve() that takes as input from the column that will be used for conducting the aggregation:

  • Letโ€™s first read the data

[28]:
# Set filepath
fp = "data/NY_counties_shoreline.geojson"

# Read file using gpd.read_file()
ny_state = gpd.read_file(fp)

ny_state
[28]:
NAME ABBREV GNIS_ID FIPS_CODE SWIS NYSP_ZONE POP1990 POP2000 POP2010 POP2020 DOS_LL DOSLL_DATE NYC CALC_SQ_MI DATEMOD Shape_Leng Shape_Area geometry
0 Albany ALBA 974099 36001 010000 East 292594 294565 304204 0 None None N 532.791779 2017-11-10 166077.834242 1.379924e+09 MULTIPOLYGON (((608219.061 4737612.320, 608193...
1 Allegany ALLE 974100 36003 020000 West 50470 49927 48946 0 None None N 1035.209131 2019-04-26 210499.339279 2.681179e+09 MULTIPOLYGON (((250193.017 4712087.456, 250192...
2 Bronx BRON 974101 36005 600000 Long Island 1203789 1332650 1385108 0 None None Y 42.313958 2019-10-04 110631.250768 1.095926e+08 MULTIPOLYGON (((593060.323 4516716.000, 593002...
3 Broome BROO 974102 36007 030000 Central 212160 200536 200600 0 None None N 715.287465 2019-04-26 227933.332021 1.852586e+09 MULTIPOLYGON (((465447.405 4671514.322, 465436...
4 Cattaraugus CATT 974103 36009 040000 West 84234 83955 80317 0 None None N 1324.309219 2019-04-26 276084.505560 3.429945e+09 MULTIPOLYGON (((215534.406 4715146.842, 215536...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
57 Wyoming WYOM 974158 36121 560000 West 42507 43424 42155 0 None None N 596.822633 2018-09-20 176096.925232 1.545764e+09 MULTIPOLYGON (((217004.051 4751905.154, 217012...
58 Yates YATE 974159 36123 570000 Central 22810 24621 25348 0 None None N 375.771539 2018-02-08 141535.638713 9.732438e+08 MULTIPOLYGON (((338694.813 4736525.498, 338999...
59 Madison MADI 974125 36053 250000 Central 69120 69441 73442 0 None None N 660.767520 2018-07-18 256920.383650 1.711380e+09 MULTIPOLYGON (((428112.869 4778465.622, 428112...
60 Oneida ONEI 974131 36065 300000 Central 250836 235469 234878 0 None None N 1257.212143 2018-03-09 307349.034135 3.256165e+09 MULTIPOLYGON (((482676.562 4747513.488, 482378...
61 Jefferson JEFF 974121 36045 220000 Central 110943 111738 116229 0 None None N 1286.010807 2018-11-06 722815.761071 3.330753e+09 MULTIPOLYGON (((390245.995 4859325.132, 390310...

62 rows ร— 18 columns

[29]:
# Plot geometry
ny_state.plot();
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_59_0.png

If we look at column NYSP_ZONE, each geometry belongs to different zone. Sometimes, there are so many categories in a long list. Then it will be hard to observe every row. Therefore, in order to quickly check how many categories and what they are in each columns, we can just simply use function called .unique().

  • Now, letโ€™s check the categories in column NYSP_ZONE

[30]:
# check categories in column
print(ny_state['NYSP_ZONE'].unique())
['East' 'West' 'Long Island' 'Central']
[31]:
# dissolve polygon based on zone
polygon = ny_state.dissolve(by='NYSP_ZONE')
polygon
[31]:
geometry NAME ABBREV GNIS_ID FIPS_CODE SWIS POP1990 POP2000 POP2010 POP2020 DOS_LL DOSLL_DATE NYC CALC_SQ_MI DATEMOD Shape_Leng Shape_Area
NYSP_ZONE
Central MULTIPOLYGON (((387917.657 4859852.977, 387818... Broome BROO 974102 36007 030000 212160 200536 200600 0 None None N 715.287465 2019-04-26 227933.332021 1.852586e+09
East MULTIPOLYGON (((610809.635 4533652.000, 610804... Albany ALBA 974099 36001 010000 292594 294565 304204 0 None None N 532.791779 2017-11-10 166077.834242 1.379924e+09
Long Island MULTIPOLYGON (((580478.885 4490937.000, 580407... Bronx BRON 974101 36005 600000 1203789 1332650 1385108 0 None None Y 42.313958 2019-10-04 110631.250768 1.095926e+08
West MULTIPOLYGON (((181132.417 4760968.000, 181499... Allegany ALLE 974100 36003 020000 50470 49927 48946 0 None None N 1035.209131 2019-04-26 210499.339279 2.681179e+09
[32]:
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.lines import Line2D

# Create a figure with one subplot
fig, ax = plt.subplots(figsize=(5,5))
color = ['#528413', '#2088B2', '#D86E3F', '#B52735']

# Plot boundary
polygon['geometry'].plot(ax=ax, color= color)

# Add legend
plt.legend(loc=2, handles=[Patch(color='#B52735', label='West'),
                            Patch(color='#528413', label='Central'),
                            Patch(color='#2088B2', label='East'),
                            Patch(color='#D86E3F', label='Long Island')])

plt.title('New York State')
plt.tight_layout()
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_64_0.png

As you can see, we have merged 62 geometries into 4 geometries based on regions.


2. Data classificationยถ

Data classification is a common method in GIS to add symbology to the data based on specific criteria or map classifiers. The well-known map classifiers in GIS include Equal Count (Quantile), Equal Interval, Logarithmic Scale, etc. Each classifier is described in Data classification methods.

In Python, we will use Pysal-module which is an extensive library including various functions and tools for spatial data analysis. This module includes various classifiers for data visualization. The classifiers in Pysal-module are as follows:

  • Box_Plot

  • Equal_Interval

  • HeadTail_Breaks

  • Jenks_Caspall

  • Jenks_Caspall_Forced

  • Jenks_Caspall_Sampled

  • Max_P_Classifier

  • Maximum_Breaks

  • Natural_Breaks

  • Quantiles

  • Percentiles

  • Std_Mean

  • User_Defined

Check further details in Source code for mapclassify.classifiers

In this section, we use the data of US COVID-19 Daily Cases by counties from Harvard Dataverse to classify data based on ready-made classifiers in Pysal-module and self-made classifier to visualize the data features.

a. Pysal classifierยถ

  • Letโ€™s start first by importing the data

[33]:
# import necessary module
import geopandas as gpd

# input filepath
fp = 'data/Covid19_confirmed_cases_for_US_counties.shp'

# read data
dataframe = gpd.read_file(fp)

# check the number of counties
print('There are',len(dataframe),'counties.')

# check the head row in dataframe
dataframe.head()
There are 3108 counties.
[33]:
County Nam State POP10 HHD10 covid geometry
0 Autauga County AL 54571.0 20221.0 6879.0 POLYGON ((-86.90310 32.54063, -86.90312 32.540...
1 Baldwin County AL 182265.0 73180.0 20847.0 POLYGON ((-87.99069 30.55550, -87.99051 30.560...
2 Barbour County AL 27457.0 9820.0 2296.0 POLYGON ((-85.42982 32.04599, -85.42986 32.046...
3 Bibb County AL 22915.0 7953.0 2584.0 POLYGON ((-87.31227 33.08623, -87.31218 33.087...
4 Blount County AL 57322.0 21578.0 6571.0 POLYGON ((-86.74919 33.99760, -86.74903 33.997...
[34]:
# plot covid confirmed case
dataframe.plot(column="covid", legend=True)
[34]:
<AxesSubplot:>
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_73_1.png

As you can see from the dataframe and the map, there are more than 3000 counties with the information about covid confirmed cases, population and households, while the data is plotted by default without specific classification. Hence, by using ready-made classifiers from Pysal-module, we will be able to create classes for plotting.

  • Plot covid confirmed case with different classifier

[35]:
# import necessary module
import mapclassify
[36]:
# Standard Deviation classifier
dataframe.plot(column="covid", scheme="Std_Mean", k=5, cmap = 'Blues', legend=True);

# Table of classification
mapclassify.StdMean(y=dataframe['covid'])
[36]:
StdMean

        Interval           Count
--------------------------------
(      -inf,  -61551.10] |     0
( -61551.10,  -25738.23] |     0
( -25738.23,   45887.53] |  2971
(  45887.53,   81700.40] |    69
(  81700.40, 1190603.00] |    68
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_77_1.png
[37]:
# Equal interval classifier
dataframe.plot(column="covid", scheme="Equal_Interval", k=5, cmap = 'Blues', legend=True);

# Table of classification
mapclassify.EqualInterval(y=dataframe['covid'], k=5)
[37]:
EqualInterval

        Interval           Count
--------------------------------
[      0.00,  238120.60] |  3094
( 238120.60,  476241.20] |    10
( 476241.20,  714361.80] |     3
( 714361.80,  952482.40] |     0
( 952482.40, 1190603.00] |     1
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_78_1.png
[38]:
# Equal count (Quantiles) classifier
dataframe.plot(column="covid", scheme="Quantiles", k=5, cmap = 'Blues', legend=True);

# Table of classification
mapclassify.Quantiles(y=dataframe['covid'], k=5)
[38]:
Quantiles

        Interval           Count
--------------------------------
[      0.00,     815.00] |   623
(    815.00,    1771.60] |   620
(   1771.60,    3557.20] |   622
(   3557.20,    8737.20] |   621
(   8737.20, 1190603.00] |   622
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_79_1.png
[39]:
# Natural break classifier
dataframe.plot(column="covid", scheme="Natural_Breaks", k=5, cmap = 'Blues', legend=True);

# Table of classification
mapclassify.NaturalBreaks(y=dataframe['covid'], k=5)
[39]:
NaturalBreaks

        Interval           Count
--------------------------------
[      0.00,   37041.00] |  2929
(  37041.00,  150249.00] |   160
( 150249.00,  297315.00] |    14
( 297315.00,  534430.00] |     4
( 534430.00, 1190603.00] |     1
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_80_1.png
[40]:
# Box plot classifier
dataframe.plot(column="covid", scheme="Box_Plot", k=5, cmap = 'Blues', legend=True);

# Table of classification
mapclassify.BoxPlot(y=dataframe['covid'])
[40]:
BoxPlot

        Interval           Count
--------------------------------
(      -inf,   -7016.00] |     0
(  -7016.00,    1044.25] |   777
(   1044.25,    2501.50] |   777
(   2501.50,    6417.75] |   777
(   6417.75,   14478.00] |   356
(  14478.00, 1190603.00] |   421
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_81_1.png
  • Plot population, household and covid with Quantiles classifier

[41]:
import matplotlib.pyplot as plt
%matplotlib inline

# Initialize subplots
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, nrows=1, figsize=(20, 10), sharey=True)

# Equal count (Quantiles) classifier
dataframe.plot(ax=ax1, column="POP10", scheme="Quantiles", k=5, cmap = 'Reds', legend=True);
dataframe.plot(ax=ax2, column="HHD10", scheme="Quantiles", k=5, cmap = 'PiYG', legend=True);
dataframe.plot(ax=ax3, column="covid", scheme="Quantiles", k=5, cmap = 'Blues', legend=True);

# Add titles
ax1.set_title("Population")
ax2.set_title("Household")
ax3.set_title("Covid19")

# Remove axis
ax1.axis('off')
ax2.axis('off')
ax3.axis('off')

plt.tight_layout()
plt.show();
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_83_0.png

Assign classes to dataframe (Additional)

By using make() function we can take the number of desired classes from classifier as input parameter.

[42]:
# Create a Natural Breaks classifier
classifier = mapclassify.Quantiles.make(k=5)

# Rename the column so that we know that it was classified with natural breaks
dataframe['qt_covid'] = dataframe[['covid']].apply(classifier)

# Check the original values and classification
dataframe[['covid', 'qt_covid']].head()
[42]:
covid qt_covid
0 6879.0 3
1 20847.0 4
2 2296.0 2
3 2584.0 2
4 6571.0 3

b. Custom classiferยถ

The purpose of this section is to classify the data feature based on criteria which can be whatever, such as:

1. the households are less than 7500

AND

2. confirmed covid cases are less than 5000
------------------------------------------------------

IF TRUE: ==> The areas are less dangerous.
IF NOT TRUE: ==> The areas are very dangerous.
  • Letโ€™s create a fuction called custom_classifier that takes two above criteria into account:

[43]:
def custom_classifier(row, src_col1, src_col2, threshold1, threshold2, output_col):
   # If the value in src_col1 is LOWER than the threshold1 value AND the value in src_col2 is LOWER than the threshold2 value, give value 1, otherwise give 0

   if row[src_col1] < threshold1 and row[src_col2] < threshold2:
       row[output_col] = 1

   # If area of input geometry is higher than the threshold value update with value 1
   else:
       row[output_col] = 0

   # Return the updated row
   return row
  • Apply the function we created to the dataframe.

[44]:
# Create column for the classification results
dataframe["dangerous_area"] = None

# Use the function
result = dataframe.apply(custom_classifier, src_col1='HHD10', src_col2='covid', threshold1=7500, threshold2=5000, output_col="dangerous_area", axis=1)

# See the first rows
result
[44]:
County Nam State POP10 HHD10 covid geometry qt_covid dangerous_area
0 Autauga County AL 54571.0 20221.0 6879.0 POLYGON ((-86.90310 32.54063, -86.90312 32.540... 3 0
1 Baldwin County AL 182265.0 73180.0 20847.0 POLYGON ((-87.99069 30.55550, -87.99051 30.560... 4 0
2 Barbour County AL 27457.0 9820.0 2296.0 POLYGON ((-85.42982 32.04599, -85.42986 32.046... 2 0
3 Bibb County AL 22915.0 7953.0 2584.0 POLYGON ((-87.31227 33.08623, -87.31218 33.087... 2 0
4 Blount County AL 57322.0 21578.0 6571.0 POLYGON ((-86.74919 33.99760, -86.74903 33.997... 3 0
... ... ... ... ... ... ... ... ...
3103 Sublette County WY 10247.0 3906.0 765.0 POLYGON ((-109.49667 42.26333, -109.49213 42.2... 0 1
3104 Sweetwater County WY 43806.0 16475.0 4159.0 POLYGON ((-111.04910 44.13237, -111.04911 44.1... 3 0
3105 Teton County WY 21294.0 8973.0 3719.0 POLYGON ((-110.60681 40.99626, -110.60690 40.9... 3 0
3106 Uinta County WY 21118.0 7668.0 2158.0 POLYGON ((-107.63350 43.64635, -107.63353 43.6... 2 0
3107 Washakie County WY 8533.0 3492.0 898.0 POLYGON ((-105.07914 43.81301, -105.07914 43.8... 1 1

3108 rows ร— 8 columns

[45]:
# Get value counts
result['dangerous_area'].value_counts()
[45]:
0    1857
1    1251
Name: dangerous_area, dtype: int64

There seems to be 1857 less degerous areas and 1251 very dangerous area from our classification.

  • Letโ€™s see where those area are located.

[46]:
# Plot
result.plot(column="dangerous_area", cmap = 'brg', linewidth=0);

# Add legend
plt.legend(loc='best', handles=[Patch(color='#25EB1C', label='less dangerous'),
                            Patch(color='blue', label='very dangerous')])

# Use tight layour
plt.tight_layout()
../../../_images/Content_Lesson_geo-python-course_04.Geometric-operation-and-data-reclassification_96_0.png