pandas uses a series of labels for both rows and columns so that we can refer to specific values in a table, like a spreadsheet's row number and column name, or a feature class' ObjectID and field name. These labels are a fundamental part of pandas.
No more trying to remember what the "i-th element of the j-th row" refers to (and heaven help you if you're in a nested cursor).
Calling geoprocessing tools in arcpy requires setting up input and output layers, either on disk or in memory. The code is constantly jumping back and forth from python to the underlying geoprocessing libraries.
pandas data structures are native python structures kept in memory and are easily modified. No more creating a new feature layer just to change field names.
In addition, most non-spatial table operations are highly optimized to run against a collection of data at the same time rather than operating element-by-element.
As a consequence of not having to deal with feature layer management and verbose geoprocessing tool calls, pandas code can be much shorter and more concise. Once you're familiar with pandas syntax and programming patterns, you can create brief, expressive statements that perform several operations all in one go.
And again, no more nested cursors. Seriously.
As one of the main go-to python libraries for data science (if not the main one), pandas has a vast ecosystem.
You can easily pull in data from spreadsheets, databases, web-based sources like (well-formatted) json and xml, and cloud-based tables like Google BigQuery.
Once you've got your data, there's a vast body of tutorials, examples, and production code to build off of (or just shamelessly steal). Other libraries have been written to extend pandas or accept data from pandas data structures, like geopandas and the ArcGIS API for Python's spatially-enabled dataframes.
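For instance, pulling a table into a dataframe is usually a one-liner regardless of the source. A quick sketch (the file names here are hypothetical placeholders):

#: Hypothetical files, just to show the pattern
csv_df = pd.read_csv('parcels.csv')
excel_df = pd.read_excel('addresses.xlsx', sheet_name=0)
json_df = pd.read_json('service_requests.json')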
import numpy as np
import pandas as pd
import arcgis
from arcgis import GeoAccessor, GeoSeriesAccessor
from pathlib import Path
A scalar is just a single value.
#: Most python variables can be considered scalars
foo = 5
bar = 'Midway'
baz = True
A vector is a collection of scalars or other vectors.
#: python lists and tuples
spam = [1, 2, 3]
eggs = ('foo', 'bar', 'baz')
ham = [foo, 1.7, 'zen']
[spam, eggs, ham] #: a 2-dimensional vector
[[1, 2, 3], ('foo', 'bar', 'baz'), [5, 1.7, 'zen']]
A series is a one-dimensional collection of scalars. Normally all its elements share the same data type, but mixed types are allowed.
#: Build a series from a python list
pd.Series(['a', 'b', 'c'])
0    a
1    b
2    c
dtype: object
#: Series have an index, which allows you to reference individual elements with arbitrary labels
pd.Series([1, 2, 3], index=['foo', 'bar', 'baz'])
foo    1
bar    2
baz    3
dtype: int64
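With those labels in place, you can pull values back out of the series by label rather than by position (a small sketch reusing the series above):

labeled_series = pd.Series([1, 2, 3], index=['foo', 'bar', 'baz'])
labeled_series['bar']            #: a single label returns the scalar 2
labeled_series[['foo', 'baz']]   #: a list of labels returns a smaller series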
A dataframe is a two-dimensional collection of vectors with labels for both rows and columns. Basically, a spreadsheet in code.
#: Build a dataframe from a dictionary, where each key is a column name and each value is a list of values for that column
pd.DataFrame({
'foo': [1, 2, 3],
'bar': ['a', 'b', 'c'],
'baz': [True, 1.7, 'zen']
})
foo | bar | baz | |
---|---|---|---|
0 | 1 | a | True |
1 | 2 | b | 1.7 |
2 | 3 | c | zen |
Under the hood, a dataframe is stored in memory as a collection of vectors (numpy arrays), one for each column. Thus, a lot of pandas operations occur on a column-by-column basis, like adding two columns together and storing the result in a third:
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [4, 5, 6]
})
df['c'] = df['a'] + df['b']
df
a | b | c | |
---|---|---|---|
0 | 1 | 4 | 5 |
1 | 2 | 5 | 7 |
2 | 3 | 6 | 9 |
The key to understanding pandas operations is to think in terms of operations that are applied to every element in a vector, rather than extracting each element and passing it to the operation one by one.
#: Standard python: we're responsible for looping through a vector and calling a function on each element:
for element in [1, 2, 3]:
print(np.sqrt(element))
1.0
1.4142135623730951
1.7320508075688772
#: Vectorized operation: pass a whole vector to a vectorized function:
np.sqrt([1, 2, 3])
array([1. , 1.41421356, 1.73205081])
#: pandas operation: a method on a series or dataframe
pd.Series([1, 2, 3]).transform(np.sqrt)
0    1.000000
1    1.414214
2    1.732051
dtype: float64
for loops in python are computationally expensive and require extra resources to set up the iteration. In addition, the function has to be called multiple times, requiring even more work behind the scenes for each call.
In contrast, vectorized operations are optimized to perform the same operation on multiple pieces of data. In addition to avoiding the overhead from iteration and multiple function calls, the processor has special logic and routines for parallelizing many operations. However, to use these it needs to know the operation and the data type ahead of time, which it generally can't with python for loops.
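If you want to see the difference yourself, a rough timing sketch like the one below works; the array size is arbitrary and the exact numbers will vary by machine.

import time

values = np.arange(1, 1_000_001, dtype=np.float64)

start = time.perf_counter()
looped = [np.sqrt(value) for value in values]  #: one function call per element
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized = np.sqrt(values)  #: a single call for the whole array
vectorized_seconds = time.perf_counter() - start

print(f'loop: {loop_seconds:.3f}s, vectorized: {vectorized_seconds:.3f}s')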
counties_df = pd.DataFrame.spatial.from_featureclass(r'data/county_boundaries.gdb/Counties')
counties_df.set_index('FIPS_STR', inplace=True) #: Replace the default index with one of our columns
counties_df.head()
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... |
49013 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13.0 | Central | 19596 | 20161 | {7F6252B7-137F-4F7E-A19A-03C702111509} | 4 | {"rings": [[[-12273167.1888, 4987644.314599998... |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... |
A dataframe has rows and columns, and each of these has a collection of labels informally called an index. The row labels are considered the main DataFrame index, while the column labels are just called columns.
counties_df.index
Index(['49005', '49013', '49011', '49027', '49051', '49003', '49057', '49023', '49039', '49053', '49019', '49033', '49007', '49009', '49001', '49041', '49017', '49045', '49043', '49031', '49047', '49021', '49015', '49055', '49037', '49029', '49025', '49035', '49049'], dtype='object', name='FIPS_STR')
counties_df.columns
Index(['OBJECTID', 'COUNTYNBR', 'ENTITYNBR', 'ENTITYYR', 'NAME', 'FIPS', 'STATEPLANE', 'POP_LASTCENSUS', 'POP_CURRESTIMATE', 'GlobalID', 'COLOR4', 'SHAPE'], dtype='object')
Each row and column in a dataframe can be extracted as an individual series.
#: Column- series name is column name, series index is the main dataframe index
counties_df.loc[:, 'NAME'].head()
FIPS_STR
49005       CACHE
49013    DUCHESNE
49011       DAVIS
49027     MILLARD
49051     WASATCH
Name: NAME, dtype: object
#: Row- series name is the row index label, series index is the column set
counties_df.loc['49005', :]
OBJECTID                                                 1
COUNTYNBR                                               03
ENTITYNBR                                     2010031010.0
ENTITYYR                                            2010.0
NAME                                                 CACHE
FIPS                                                   5.0
STATEPLANE                                           North
POP_LASTCENSUS                                      133154
POP_CURRESTIMATE                                    140173
GlobalID            {AD3015BE-B3C9-4316-B8DC-03AFBB56B443}
COLOR4                                                   2
SHAPE               {'rings': [[[-12485167.954, 5160638.807099998]...
Name: 49005, dtype: object
Every column has a data type, just like fields in a feature class.
Most data types come from the numpy
library (which, at least currently, provides a lot of the backend data structures for python).
# .info() gives us an overview of the dataframe, including the column types
counties_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 29 entries, 49005 to 49049
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   OBJECTID          29 non-null     int64
 1   COUNTYNBR         29 non-null     object
 2   ENTITYNBR         29 non-null     float64
 3   ENTITYYR          29 non-null     float64
 4   NAME              29 non-null     object
 5   FIPS              29 non-null     float64
 6   STATEPLANE        29 non-null     object
 7   POP_LASTCENSUS    29 non-null     int64
 8   POP_CURRESTIMATE  29 non-null     int64
 9   GlobalID          29 non-null     object
 10  COLOR4            29 non-null     int64
 11  SHAPE             29 non-null     geometry
dtypes: float64(3), geometry(1), int64(4), object(4)
memory usage: 4.0+ KB
Strings are a special case. By default, pandas uses the object dtype for columns that contain text. However, an object column can also contain multiple types.
mixed_df = pd.DataFrame({
'ints': [1, 2, 3],
'mixed': ['zero', 1, 2.1]
})
mixed_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   ints    3 non-null      int64
 1   mixed   3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
#: Numeric Nones convert to np.nan, strings stay as None
none_df = pd.DataFrame({
'foo': [1, 2, None],
'bar': ['four', None, 'six']
})
print(none_df.info())
none_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   foo     2 non-null      float64
 1   bar     2 non-null      object
dtypes: float64(1), object(1)
memory usage: 176.0+ bytes
None
foo | bar | |
---|---|---|
0 | 1.0 | four |
1 | 2.0 | None |
2 | NaN | six |
#: Convert to nullable types; note capitalized Int64
converted_df = none_df.convert_dtypes()
print(converted_df.info())
converted_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   foo     2 non-null      Int64
 1   bar     2 non-null      string
dtypes: Int64(1), string(1)
memory usage: 179.0 bytes
None
foo | bar | |
---|---|---|
0 | 1 | four |
1 | 2 | <NA> |
2 | <NA> | six |
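You can also ask for a nullable dtype up front with .astype() instead of converting after the fact (a brief sketch; note the capitalized 'Int64' and the dedicated 'string' dtype):

explicit_df = pd.DataFrame({
    'foo': [1, 2, None],
    'bar': ['four', None, 'six']
})
explicit_df['foo'] = explicit_df['foo'].astype('Int64')   #: nullable integer instead of float64
explicit_df['bar'] = explicit_df['bar'].astype('string')  #: pandas string dtype instead of object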
df[]: selecting columns and rows
The [] operator on DataFrames is overloaded and will do different things depending on what you pass to it:
1. A single string returns that column as a series.
2. A list of strings returns those columns as a new dataframe.
3. A slice returns the matching rows.
4. A sequence of booleans passed to [] selects specific rows. The length of the sequence must match the number of rows in the dataframe.
#: 1: single string gives a single column as a series
counties_df['NAME'].head()
FIPS_STR
49005       CACHE
49013    DUCHESNE
49011       DAVIS
49027     MILLARD
49051     WASATCH
Name: NAME, dtype: object
#: 2: list of strings returns multiple columns as a dataframe
counties_df[['NAME', 'POP_LASTCENSUS']].head()
NAME | POP_LASTCENSUS | |
---|---|---|
FIPS_STR | ||
49005 | CACHE | 133154 |
49013 | DUCHESNE | 19596 |
49011 | DAVIS | 362679 |
49027 | MILLARD | 12975 |
49051 | WASATCH | 34788 |
#: 3: Slicing returns rows by position; like plain python, the end is excluded (label-based slices with .loc, by contrast, include both ends)
counties_df[3:5]
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... |
head_df = counties_df.head().copy()
head_df
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... |
49013 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13.0 | Central | 19596 | 20161 | {7F6252B7-137F-4F7E-A19A-03C702111509} | 4 | {"rings": [[[-12273167.1888, 4987644.314599998... |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... |
head_df[[True, False, True, True, False]] #: Note the list is the same length as our dataframe index
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... |
Let's filter our dataframe down to counties with population greater than 100,000
pop_series = head_df['POP_LASTCENSUS'].copy()
pop_series
FIPS_STR
49005    133154
49013     19596
49011    362679
49027     12975
49051     34788
Name: POP_LASTCENSUS, dtype: int64
#: Performing a comparison on a series returns a new series with the result of each comparison
pop_gt_100k = pop_series > 100000
pop_gt_100k
FIPS_STR
49005     True
49013    False
49011     True
49027    False
49051    False
Name: POP_LASTCENSUS, dtype: bool
#: Pass our new boolean series as a boolean indexer
head_df[pop_gt_100k]
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... |
#: All the previous steps, just in one line of code
head_df[head_df['POP_LASTCENSUS'] > 100000]
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... |
#: Use .isin() to filter based on membership in a sequence
head_df[head_df['COLOR4'].isin([2, 3])]
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... |
head_df['COLOR4'].isin([2, 3])
FIPS_STR
49005     True
49013    False
49011     True
49027     True
49051     True
Name: COLOR4, dtype: bool
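Boolean series can also be combined with & (and), | (or), and ~ (not) to build more complex filters; just wrap each comparison in parentheses (a short sketch using the same columns):

#: Counties over 100,000 people that use the North state plane zone
head_df[(head_df['POP_LASTCENSUS'] > 100000) & (head_df['STATEPLANE'] == 'North')]

#: Counties that are small or that aren't in the Central zone
head_df[(head_df['POP_LASTCENSUS'] < 20000) | ~(head_df['STATEPLANE'] == 'Central')]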
.loc and .iloc: Selecting by Label/Index
.loc: Label-based
#: Single value: index label
counties_df.loc['49051']
OBJECTID                                                 5
COUNTYNBR                                               26
ENTITYNBR                                     2010261010.0
ENTITYYR                                            2010.0
NAME                                               WASATCH
FIPS                                                  51.0
STATEPLANE                                         Central
POP_LASTCENSUS                                       34788
POP_CURRESTIMATE                                     36619
GlobalID            {3D0C5C1E-2650-458E-B322-2B86AA473441}
COLOR4                                                   2
SHAPE               {'rings': [[[-12400515.3909, 4966751.283200003...
Name: 49051, dtype: object
#: Two values: index label, column label
counties_df.loc['49051', 'NAME']
'WASATCH'
#: Two values with everything slice: column as series
counties_df.loc[:, 'NAME'].head()
FIPS_STR
49005       CACHE
49013    DUCHESNE
49011       DAVIS
49027     MILLARD
49051     WASATCH
Name: NAME, dtype: object
#: everything slice and list of labels: columns as DataFrame
counties_df.loc[:, ['NAME', 'POP_LASTCENSUS']].head()
NAME | POP_LASTCENSUS | |
---|---|---|
FIPS_STR | ||
49005 | CACHE | 133154 |
49013 | DUCHESNE | 19596 |
49011 | DAVIS | 362679 |
49027 | MILLARD | 12975 |
49051 | WASATCH | 34788 |
#: everything slice and list of labels reversed: rows as DataFrame
counties_df.loc[['49001', '49003'], :]
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49001 | 15 | 01 | 2.010011e+09 | 2010.0 | BEAVER | 1.0 | South | 7072 | 7327 | {04ADA80E-57CF-40BB-8C5C-6FDED3D82E74} | 4 | {"rings": [[[-12525157.6426, 4660669.5066], [-... |
49003 | 6 | 02 | 2.010021e+09 | 2010.0 | BOX ELDER | 3.0 | North | 57666 | 61498 | {82F842F7-FCE7-402B-9144-34F50FCCA4E5} | 4 | {"rings": [[[-12631966.5192, 5159977.284500003... |
.iloc: Position-based
#: Get the first row:
counties_df.iloc[0]
OBJECTID                                                 1
COUNTYNBR                                               03
ENTITYNBR                                     2010031010.0
ENTITYYR                                            2010.0
NAME                                                 CACHE
FIPS                                                   5.0
STATEPLANE                                           North
POP_LASTCENSUS                                      133154
POP_CURRESTIMATE                                    140173
GlobalID            {AD3015BE-B3C9-4316-B8DC-03AFBB56B443}
COLOR4                                                   2
SHAPE               {'rings': [[[-12485167.954, 5160638.807099998]...
Name: 49005, dtype: object
#: Use slicing to get the first column for the first five rows:
counties_df.iloc[:5, 0]
FIPS_STR
49005    1
49013    2
49011    3
49027    4
49051    5
Name: OBJECTID, dtype: int64
#: investigate the last row
counties_df.iloc[-1]
OBJECTID                                                29
COUNTYNBR                                               25
ENTITYNBR                                     2010251010.0
ENTITYYR                                            2010.0
NAME                                                  UTAH
FIPS                                                  49.0
STATEPLANE                                         Central
POP_LASTCENSUS                                      659399
POP_CURRESTIMATE                                    702434
GlobalID            {8DF99710-DCB1-4C52-8EAD-E9555C83618F}
COLOR4                                                   3
SHAPE               {'rings': [[[-12422592.7433, 4950159.090400003...
Name: 49049, dtype: object
SettingWithCopyWarning
#: Create a copy so we don't mess with our original
test_df = counties_df.copy()
#: "get the rows that have a Central state plane and set the foo column to 3"
test_df[test_df['STATEPLANE'] == 'Central']['foo'] = 3 #: The [] calls are chained- do the first, then do the second
test_df.head()
<ipython-input-143-e4d3b8d8f772>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df[test_df['STATEPLANE'] == 'Central']['foo'] = 3
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... |
49013 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13.0 | Central | 19596 | 20161 | {7F6252B7-137F-4F7E-A19A-03C702111509} | 4 | {"rings": [[[-12273167.1888, 4987644.314599998... |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... |
#: Fix one: use .loc[] to perform the row and column indexing in one call
test_df.loc[test_df['STATEPLANE'] == 'Central', 'foo'] = 3
test_df.head()
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | foo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | |||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... | NaN |
49013 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13.0 | Central | 19596 | 20161 | {7F6252B7-137F-4F7E-A19A-03C702111509} | 4 | {"rings": [[[-12273167.1888, 4987644.314599998... | 3.0 |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... | NaN |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... | 3.0 |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... | 3.0 |
#: Fix two: create an explicit copy to break up the chain
central_df = test_df[test_df['STATEPLANE'] == 'Central'].copy()
central_df['foo'] = 3
central_df.head()
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | foo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | |||||||||||||
49013 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13.0 | Central | 19596 | 20161 | {7F6252B7-137F-4F7E-A19A-03C702111509} | 4 | {"rings": [[[-12273167.1888, 4987644.314599998... | 3 |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... | 3 |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... | 3 |
49023 | 8 | 12 | 2.010121e+09 | 2010.0 | JUAB | 23.0 | Central | 11786 | 12567 | {FE19C8AA-C706-4A81-9EB1-3C608CA317C3} | 4 | {"rings": [[[-12487614.2694, 4867585.2205], [-... | 3 |
49039 | 9 | 20 | 2.010201e+09 | 2010.0 | SANPETE | 39.0 | Central | 28437 | 29724 | {02C0C074-657F-44AE-A886-44ADB97263BD} | 2 | {"rings": [[[-12423642.5466, 4838788.997599997... | 3 |
Pandas makes working with columns really easy. Whether renaming, re-ordering, or re-calculating, it's usually just a single line of code.
Let's take our counties dataset and calculate the population density, creating a new dataframe with just the relevant columns.
#: Create a copy to avoid altering the original
density_df = counties_df.copy()
#: Create a new column by assigning the results of a calculation against another column
# density_df['sq_km'] = density_df['SHAPE_Area'] / 1000000 #: shapely
density_df['sq_km'] = density_df['SHAPE'].apply(lambda x: x.area / 1000000) #: arcpy
density_df['density'] = density_df['POP_LASTCENSUS'] / density_df['sq_km']
density_df.head()
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | sq_km | density | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... | 5456.779633 | 24.401572 |
49013 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13.0 | Central | 19596 | 20161 | {7F6252B7-137F-4F7E-A19A-03C702111509} | 4 | {"rings": [[[-12273167.1888, 4987644.314599998... | 14489.452124 | 1.352432 |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... | 2889.969718 | 125.495779 |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... | 29421.994253 | 0.440997 |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... | 5396.133031 | 6.446839 |
#: Change dtype of FIPS column
density_df['FIPS'] = density_df['FIPS'].astype(int)
density_df.head()
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | sq_km | density | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... | 5456.779633 | 24.401572 |
49013 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13 | Central | 19596 | 20161 | {7F6252B7-137F-4F7E-A19A-03C702111509} | 4 | {"rings": [[[-12273167.1888, 4987644.314599998... | 14489.452124 | 1.352432 |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... | 2889.969718 | 125.495779 |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... | 29421.994253 | 0.440997 |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... | 5396.133031 | 6.446839 |
#: Rename columns with .rename() and a dictionary
density_df.rename(columns={'density': 'population_per_sq_km', 'POP_LASTCENSUS': 'pop_2020'}, inplace=True)
density_df.head()
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | pop_2020 | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | sq_km | population_per_sq_km | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||||
49005 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... | 5456.779633 | 24.401572 |
49013 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13 | Central | 19596 | 20161 | {7F6252B7-137F-4F7E-A19A-03C702111509} | 4 | {"rings": [[[-12273167.1888, 4987644.314599998... | 14489.452124 | 1.352432 |
49011 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... | 2889.969718 | 125.495779 |
49027 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... | 29421.994253 | 0.440997 |
49051 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... | 5396.133031 | 6.446839 |
#: Subset down to just the desired columns; reindex doesn't have inplace option
density_df = density_df.reindex(columns=['population_per_sq_km', 'NAME', 'ENTITYYR', 'STATEPLANE'])
density_df.head()
population_per_sq_km | NAME | ENTITYYR | STATEPLANE | |
---|---|---|---|---|
FIPS_STR | ||||
49005 | 24.401572 | CACHE | 2010.0 | North |
49013 | 1.352432 | DUCHESNE | 2010.0 | Central |
49011 | 125.495779 | DAVIS | 2010.0 | North |
49027 | 0.440997 | MILLARD | 2010.0 | Central |
49051 | 6.446839 | WASATCH | 2010.0 | Central |
#: Another way to delete individual columns
del density_df['STATEPLANE']
density_df.head()
population_per_sq_km | NAME | ENTITYYR | |
---|---|---|---|
FIPS_STR | |||
49005 | 24.401572 | CACHE | 2010.0 |
49013 | 1.352432 | DUCHESNE | 2010.0 |
49011 | 125.495779 | DAVIS | 2010.0 |
49027 | 0.440997 | MILLARD | 2010.0 |
49051 | 6.446839 | WASATCH | 2010.0 |
#: Or .drop, which returns a new dataframe
density_df.drop('NAME', axis='columns').head()
population_per_sq_km | ENTITYYR | |
---|---|---|
FIPS_STR | ||
49005 | 24.401572 | 2010.0 |
49013 | 1.352432 | 2010.0 |
49011 | 125.495779 | 2010.0 |
49027 | 0.440997 | 2010.0 |
49051 | 6.446839 | 2010.0 |
#: Update ENTITYYR
density_df['ENTITYYR'] = 2020
density_df.head()
population_per_sq_km | NAME | ENTITYYR | |
---|---|---|---|
FIPS_STR | |||
49005 | 24.401572 | CACHE | 2020 |
49013 | 1.352432 | DUCHESNE | 2020 |
49011 | 125.495779 | DAVIS | 2020 |
49027 | 0.440997 | MILLARD | 2020 |
49051 | 6.446839 | WASATCH | 2020 |
density_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 29 entries, 49005 to 49049
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   population_per_sq_km  29 non-null     float64
 1   NAME                  29 non-null     object
 2   ENTITYYR              29 non-null     int64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.0+ KB
#: Use .str to access string methods of a series
density_df['NAME'] = density_df['NAME'].str.title() + ' County'
density_df.head()
population_per_sq_km | NAME | ENTITYYR | |
---|---|---|---|
FIPS_STR | |||
49005 | 24.401572 | Cache County | 2020 |
49013 | 1.352432 | Duchesne County | 2020 |
49011 | 125.495779 | Davis County | 2020 |
49027 | 0.440997 | Millard County | 2020 |
49051 | 6.446839 | Wasatch County | 2020 |
#: Chain multiple .str calls together to get the first result from a split operation
density_df['NAME'].str.split().str[0].head()
FIPS_STR
49005       Cache
49013    Duchesne
49011       Davis
49027     Millard
49051     Wasatch
Name: NAME, dtype: object
We have a CSV of school names and addresses that we want to geocode using the UGRC API Client.
The address field contains the entire address as a single string, but the API Client requires separate street and city/zip fields.
Some addresses use a newline character \n in between the street address and the city/state/zip address, while others just use a comma.
We need to pull out the street address from both types, and then grab the zip code as well.
#: Read a csv in, rename the columns
in_df = pd.read_csv('data/schools.csv').rename(columns={'School name ': 'school', 'School address ': 'address'})
in_df.head()
school | address | |
---|---|---|
0 | Academy Park Elementary School | 4580 W Westpoint Drive (4575 S)\nWest Valley, ... |
1 | Arcadia Elementary | 3461 W 4850 S, Salt Lake City, UT 84129 |
2 | Beehive Elementary School | 5655 South 5220 West\nKearns, UT 84118-7500 |
3 | Bennion Jr. High | 6055 S 2700 W, Salt Lake City, UT 84129 |
4 | Bonneville Junior High | 5330 Gurene Dr, Holladay, UT 84117 |
#: First, work on addresses that use newlines
newline_df = in_df[in_df['address'].str.contains(r'\n')].copy()
print(len(newline_df))
#: Split on newline, then split on "(" to remove alternative street names, then strip whitespace
newline_df['street_addr'] = newline_df['address'].str.split(r'\n').str[0].str.split(r'(').str[0].str.strip()
newline_df.head()
49
school | address | street_addr | |
---|---|---|---|
0 | Academy Park Elementary School | 4580 W Westpoint Drive (4575 S)\nWest Valley, ... | 4580 W Westpoint Drive |
2 | Beehive Elementary School | 5655 South 5220 West\nKearns, UT 84118-7500 | 5655 South 5220 West |
6 | Churchill Junior High | 3450 E Oakview Drive (4275 S)\nSalt Lake City,... | 3450 E Oakview Drive |
8 | Cottonwood Elementary School | 5205 S Holladay Boulevard (2600 E)\nHolladay, ... | 5205 S Holladay Boulevard |
10 | Crestview Elementary School | 2100 E Lincoln Lane (4350 S)\nHolladay, UT 841... | 2100 E Lincoln Lane |
#: Now operate on all the addresses that don't have a newline
#: The ~ is pandas' negating operator (similar to !). We wrap the whole expression to be negated in ().
comma_df = in_df[~(in_df['address'].str.contains(r'\n'))].copy()
print(len(comma_df))
#: Just split on comma, taking the first piece
comma_df['street_addr'] = comma_df['address'].str.split(',').str[0]
comma_df.head()
35
school | address | street_addr | |
---|---|---|---|
1 | Arcadia Elementary | 3461 W 4850 S, Salt Lake City, UT 84129 | 3461 W 4850 S |
3 | Bennion Jr. High | 6055 S 2700 W, Salt Lake City, UT 84129 | 6055 S 2700 W |
4 | Bonneville Junior High | 5330 Gurene Dr, Holladay, UT 84117 | 5330 Gurene Dr |
5 | Calvin S. Smith Elementary | 2150 W 6200 S, Taylorsville, UT 84129 | 2150 W 6200 S |
7 | Copper Hills Elementary School | 7635 W Washington Rd, Magna, UT 84044 | 7635 W Washington Rd |
#: Combine them back together with pd.concat (more on this later)
recombined_df = pd.concat([newline_df, comma_df])
#: The zip code is always the text after the last space
recombined_df['zip'] = recombined_df['address'].str.split(' ').str[-1]
recombined_df.sort_index().head()
school | address | street_addr | zip | |
---|---|---|---|---|
0 | Academy Park Elementary School | 4580 W Westpoint Drive (4575 S)\nWest Valley, ... | 4580 W Westpoint Drive | 84120-5920 |
1 | Arcadia Elementary | 3461 W 4850 S, Salt Lake City, UT 84129 | 3461 W 4850 S | 84129 |
2 | Beehive Elementary School | 5655 South 5220 West\nKearns, UT 84118-7500 | 5655 South 5220 West | 84118-7500 |
3 | Bennion Jr. High | 6055 S 2700 W, Salt Lake City, UT 84129 | 6055 S 2700 W | 84129 |
4 | Bonneville Junior High | 5330 Gurene Dr, Holladay, UT 84117 | 5330 Gurene Dr | 84117 |
#: Write to csv
recombined_df.to_csv('data/combined.csv')
.apply()
#: Get the values of each row as a named tuple- "fastest" iteration if you absolutely have to iterate
#: Find the FIPS value of all counties with population over 200,000
for row in counties_df.itertuples():
if row.POP_LASTCENSUS > 200000:
print(row.FIPS)
11.0
57.0
35.0
49.0
Iterating over the rows of a dataframe is like using a set of pliers to drive in a nail. It can be done, but it's slow and everyone will tell you to use a hammer instead.
Instead, change your thought process. Think about how your output could be expressed as a function of other columns within the dataframe. Using pandas' built-in vectorized functions is much faster and ultimately more readable.
#: use filtering and lists
list(counties_df[counties_df['POP_LASTCENSUS'] > 200000]['FIPS'])
[11.0, 57.0, 35.0, 49.0]
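The same idea can also be written with .loc and the series' own .tolist() method (an equivalent one-liner, shown here just for comparison):

counties_df.loc[counties_df['POP_LASTCENSUS'] > 200000, 'FIPS'].tolist()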
.apply Instead
The .apply() method can be used to perform an arbitrary operation against data in a DataFrame. This is a shift in thinking: instead of extracting the data from the dataframe to pass to another function, you pass the function to the dataframe. This is much faster than iterating over the rows to get individual elements.
.apply sends a series of data to the specified function and combines the resulting data. If called directly on a series, it just sends that data. If called on a DataFrame, it sends either each column (a series of the values from every row) or each row (a series of the values from every column).
The function passed via .apply can either aggregate the data (create a new output that is a function of the inputs) or transform the data (create a new element for each input element).
#: Get some numeric data to work on
county_pop_df = counties_df[['POP_LASTCENSUS', 'POP_CURRESTIMATE']]
county_pop_df.head()
POP_LASTCENSUS | POP_CURRESTIMATE | |
---|---|---|
FIPS_STR | ||
49005 | 133154 | 140173 |
49013 | 19596 | 20161 |
49011 | 362679 | 369948 |
49027 | 12975 | 13330 |
49051 | 34788 | 36619 |
An aggregation function takes a set of data and computes a single value for each set. The output will have one dimension less than the input. Applying one to a dataframe will result in a series, like taking the average of the values:
county_pop_df.apply(np.mean) #: default is axis='rows', which applies the function to every row in a column
POP_LASTCENSUS      112814.344828
POP_CURRESTIMATE    116579.310345
dtype: float64
county_pop_df.apply(np.mean, axis='columns').head() #: change to pass columns (applied along the columns)
FIPS_STR
49005    136663.5
49013     19878.5
49011    366313.5
49027     13152.5
49051     35703.5
dtype: float64
Note the differences with axis='columns' in aggregating functions. This parameter controls the contents of the series that is passed to the function.
The default (axis='rows') sends a series containing all the row values in a column to the function, repeating for however many columns there are. Thus, the function is applied along the rows.
Using axis='columns' instead sends a series containing all the column values in a row to the function, repeating for however many rows are in the dataframe. Thus, the function is applied along the columns.
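A toy dataframe makes the two directions concrete (a minimal sketch; sum plays the same role np.mean does above):

toy_df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
toy_df.apply(np.sum)                  #: down the rows: a=6, b=60 (one value per column)
toy_df.apply(np.sum, axis='columns')  #: across the columns: 11, 22, 33 (one value per row)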
A transformation function returns an output for every input and thus has the same dimensions as the input, such as taking the square root of all the values in the dataframe:
county_pop_df.apply(np.sqrt).head()
POP_LASTCENSUS | POP_CURRESTIMATE | |
---|---|---|
FIPS_STR | ||
49005 | 364.902727 | 374.396848 |
49013 | 139.985714 | 141.989436 |
49011 | 602.228362 | 608.233508 |
49027 | 113.907857 | 115.455619 |
49051 | 186.515415 | 191.360916 |
lambda Functions for Arbitrary Operations
lambda functions are small, one-line functions that don't use the normal def function_name(args): syntax.
They are useful for creating simple bits of code you can use with .apply without having to declare a normal function elsewhere in your code.
lambdas are callable objects meant to be passed to another function immediately after creation, instead of the normal behavior of assigning them a name for later reference.
#: Calculate the square kilometers of a geometry
def get_sq_km(geometry):
return geometry.area / 1000000
counties_df['SHAPE'].apply(get_sq_km).head()
FIPS_STR
49005     5456.779633
49013    14489.452124
49011     2889.969718
49027    29421.994253
49051     5396.133031
Name: SHAPE, dtype: float64
#: Access the `.area` property of each geometry in the SHAPE column by applying a lambda function instead
counties_df['SHAPE'].apply(lambda x: x.area / 1000000).head()
FIPS_STR
49005     5456.779633
49013    14489.452124
49011     2889.969718
49027    29421.994253
49051     5396.133031
Name: SHAPE, dtype: float64
lambda syntax
lambda functions are defined with the statement lambda var_name: <operations on var_name>.
var_name is a name you choose to refer to the input; x is used by convention, but you can choose another name that is more applicable to your problem.
The body of the statement, everything after the :, is what you want to do with the input.
Rather than explicitly using a return statement, it implicitly returns whatever the operation creates.
#: Create a custom aggregation function for each row that references the column names
county_pop_df.apply(lambda row: (row['POP_LASTCENSUS'] + row['POP_CURRESTIMATE'])/2, axis='columns').head()
FIPS_STR
49005    136663.5
49013     19878.5
49011    366313.5
49027     13152.5
49051     35703.5
dtype: float64
.groupby splits a dataframe by the values of a column, applies an operation to each chunk's sub-frame, and then combines the results into a data structure based on the type of operation performed.
#: Split by the different state plane projections, compute the mean of the population column, and recombine into a series
counties_df.groupby('STATEPLANE')['POP_LASTCENSUS'].mean()
STATEPLANE
Central    163228.076923
North      109227.375000
South       34479.000000
Name: POP_LASTCENSUS, dtype: float64
Each groupby chunk is its own DataFrame, and any operation that can be done on a DataFrame can be done to the chunk. .groupby() returns a groupby object that handles the iteration over the DataFrames, and it also gives you access to the individual groups' DataFrames.
#: Get a groupby object and list the groups
grouped = counties_df.groupby('STATEPLANE')
grouped.groups
{'Central': ['49013', '49027', '49051', '49023', '49039', '49019', '49007', '49041', '49045', '49047', '49015', '49035', '49049'], 'North': ['49005', '49011', '49003', '49057', '49033', '49009', '49043', '49029'], 'South': ['49053', '49001', '49017', '49031', '49021', '49055', '49037', '49025']}
#: Access an individual group's dataframe
grouped.get_group('South').head()
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS_STR | ||||||||||||
49053 | 10 | 27 | 2.010271e+09 | 2010.0 | WASHINGTON | 53.0 | South | 180279 | 197680 | {83A50666-0CF4-475E-8766-4B034A67F5CB} | 1 | {"rings": [[[-12631933.3297, 4524079.526000001... |
49001 | 15 | 01 | 2.010011e+09 | 2010.0 | BEAVER | 1.0 | South | 7072 | 7327 | {04ADA80E-57CF-40BB-8C5C-6FDED3D82E74} | 4 | {"rings": [[[-12525157.6426, 4660669.5066], [-... |
49017 | 17 | 09 | 2.010091e+09 | 2010.0 | GARFIELD | 17.0 | South | 5083 | 5281 | {BC231D40-6F41-45B1-B85D-8182B762CB64} | 1 | {"rings": [[[-12497349.647300001, 4600620.4720... |
49031 | 20 | 16 | 2.010161e+09 | 2010.0 | PIUTE | 31.0 | South | 1438 | 1487 | {D6B3A246-ECFB-42BF-96AA-970553380419} | 2 | {"rings": [[[-12504544.7647, 4651843.050300002... |
49021 | 22 | 11 | 2.010111e+09 | 2010.0 | IRON | 21.0 | South | 57289 | 62429 | {2ACA2EB9-31B5-4D6A-A858-A0F6B16C64E2} | 3 | {"rings": [[[-12619832.6598, 4600452.2751], [-... |
.groupby and .apply
Because .groupby creates dataframes and iterates an operation on each one, we can use .apply to perform any arbitrary function on each dataframe.
The function passed by .apply operates on the rows or columns of each chunk's sub-frame just like it would when you use .apply on a normal dataframe, and the results from each group are combined back together.
If you use a transformation function that returns a value for each input value, .apply thus returns a dataframe. The groupby combine step then concats all the dataframes together into a new dataframe with the same index as the original.
This can be useful if you want to compare a value to the group's average, or apply a different correction value to each group.
#: calculate the percent contribution of each county's population to the group's total
plane_pop_df = counties_df[['STATEPLANE', 'POP_LASTCENSUS', 'POP_CURRESTIMATE']]
plane_pop_df.groupby('STATEPLANE').apply(lambda x: x/x.sum()).head()
POP_LASTCENSUS | POP_CURRESTIMATE | |
---|---|---|
FIPS_STR | ||
49005 | 0.152382 | 0.155628 |
49013 | 0.009235 | 0.009245 |
49011 | 0.415050 | 0.410738 |
49027 | 0.006115 | 0.006113 |
49051 | 0.016394 | 0.016793 |
If you use an aggregation function that returns a series for each group, the combine step concats these series into a new dataframe.
This can be useful for running the same operation on multiple columns in each group, like a descriptive statistic.
#: Get the average for each column by group. The apply acts on the two series (columns) of each group's dataframe and returns
#: a series per group; these series then become the rows of our new dataframe, with their labels as the columns
plane_pop_df.groupby('STATEPLANE')[['POP_LASTCENSUS', 'POP_CURRESTIMATE']].apply(np.mean)
POP_LASTCENSUS | POP_CURRESTIMATE | |
---|---|---|
STATEPLANE | ||
Central | 163228.076923 | 167744.230769 |
North | 109227.375000 | 112586.250000 |
South | 34479.000000 | 37429.375000 |
Finally, if you use an aggregation function that returns a single value for each group, they are combined into a series.
A common use case is to get the total value for each group, like summing populations.
plane_pop_df.groupby('STATEPLANE')['POP_LASTCENSUS'].sum()
STATEPLANE
Central    2121965
North       873819
South       275832
Name: POP_LASTCENSUS, dtype: int64
While these different recombinations may seem a little trivial, it's important to understand them when you pass more complicated functions.
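For common statistics you can also skip .apply entirely; .agg accepts one or more function names and handles the recombination for you (a short sketch on the same grouped data):

#: One row per group, one column per statistic
plane_pop_df.groupby('STATEPLANE')['POP_LASTCENSUS'].agg(['mean', 'sum', 'count'])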
groupby Example: Broadband Data
The FCC has released new broadband availability data based on individual Broadband Serviceable Locations (BSLs). While the BSL locations themselves are protected by license, we can download the available service info from broadbandmap.gov and analyze it.
The data are available for download by technology type, and there can be multiple records per location id within each technology type—one per provider that serves that location.
We'll take a folder of the downloaded CSVs, load and combine them into a single dataframe, classify the speeds into the FCC's three service levels (served, underserved, and unserved), and use .groupby to apply a classification function to determine which locations are served based on a subset of technologies.
#: Build a list of CSVs within a directory using Path's .glob() method
csv_dir = Path('data/fcc/')
csvs = list(csv_dir.glob('*.csv'))
#: Build a list of dataframes by reading in each one and adding a column with the technology name from the filename
dataframes = []
for csv in csvs:
tech = csv.name.split('_')[2]
tech_df = pd.read_csv(csv)
tech_df['technology_name'] = tech
dataframes.append(tech_df)
#: Combine all dataframes into a single dataframe
all_df = pd.concat(dataframes)
all_df['technology_name'].value_counts()
GSO-Satellite                2922195
Licensed-Fixed-Wireless      1651885
Unlicensed-Fixed-Wireless    1537818
NGSO-Satellite                974146
Cable                         819488
Copper                        658352
Fiber-to-the-Premises         569896
Name: technology_name, dtype: int64
#: np.select uses a list of boolean arrays or series to determine which choice should be returned
#: at each appropriate index.
conditions = [
(all_df['max_advertised_download_speed'] >= 100) & (all_df['max_advertised_upload_speed'] >= 20), # 1
((all_df['max_advertised_download_speed'] >= 100) & ((all_df['max_advertised_upload_speed'] < 20) & (all_df['max_advertised_upload_speed'] >=3)))
| ((all_df['max_advertised_upload_speed'] >= 3) & ((all_df['max_advertised_download_speed'] < 100) & (all_df['max_advertised_download_speed'] >= 20))),
(all_df['max_advertised_download_speed'] < 25) | (all_df['max_advertised_upload_speed'] < 3),
]
choices = ['above 100/20', 'between 100/20 and 25/3', 'under 25/3']
all_df['classification'] = np.select(conditions, choices, default='n/a')
all_df.groupby('location_id').get_group(1010272409)
provider_id | frn | brand_name | location_id | block_fips | h3index_hex8 | technology_code | max_advertised_download_speed | max_advertised_upload_speed | low_latency | business_residential_code | technology_name | classification | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 131310 | 22516330 | TDS Telecom | 1010272409 | 490211105022030 | 882991ca17fffff | 40 | 1000 | 20 | 1 | X | Cable | above 100/20 |
8557 | 130228 | 18626853 | CenturyLink | 1010272409 | 490211105022030 | 882991ca17fffff | 10 | 0 | 0 | 1 | X | Copper | under 25/3 |
229718 | 131219 | 1607175 | SC BROADBAND | 1010272409 | 490211105022030 | 882991ca17fffff | 50 | 1000 | 1000 | 1 | B | Fiber-to-the-Premises | above 100/20 |
13 | 130627 | 12369286 | HughesNet | 1010272409 | 490211105022030 | 882991ca17fffff | 60 | 25 | 3 | 0 | X | GSO-Satellite | between 100/20 and 25/3 |
974185 | 290111 | 4963088 | Viasat, Inc. | 1010272409 | 490211105022030 | 882991ca17fffff | 60 | 10 | 1 | 0 | B | GSO-Satellite | under 25/3 |
974186 | 290111 | 4963088 | Viasat, Inc. | 1010272409 | 490211105022030 | 882991ca17fffff | 60 | 10 | 1 | 0 | R | GSO-Satellite | under 25/3 |
108 | 130403 | 6945950 | T-Mobile US | 1010272409 | 490211105022030 | 882991ca17fffff | 71 | 0 | 0 | 1 | B | Licensed-Fixed-Wireless | under 25/3 |
1298375 | 170054 | 31777865 | InfoWest | 1010272409 | 490211105022030 | 882991ca17fffff | 71 | 100 | 20 | 1 | X | Licensed-Fixed-Wireless | above 100/20 |
13 | 430076 | 26043968 | Starlink | 1010272409 | 490211105022030 | 882991ca17fffff | 61 | 350 | 40 | 1 | X | NGSO-Satellite | above 100/20 |
141850 | 170054 | 31777865 | InfoWest | 1010272409 | 490211105022030 | 882991ca17fffff | 70 | 100 | 20 | 1 | X | Unlicensed-Fixed-Wireless | above 100/20 |
def get_location_id_status(location_df):
if (location_df['classification'] == 'above 100/20').any():
return 'served'
if (location_df['classification'] == 'between 100/20 and 25/3').any():
return 'underserved'
if (location_df['classification'] == 'under 25/3').any():
return 'unserved'
#: Subset to the desired techs
reliable_techs = ['Cable', 'Copper', 'Fiber-to-the-Premises', 'Licensed-Fixed-Wireless']
reliable_techs_df = all_df[all_df['technology_name'].isin(reliable_techs)]
#: Groupby individual locations (location_id) and apply our classification function
reliable_service_df = reliable_techs_df.groupby('location_id').apply(get_location_id_status)
reliable_service_df.head()
location_id
1010272362      served
1010272363      served
1010272364    unserved
1010272365      served
1010272370    unserved
dtype: object
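A natural follow-up is to count how many locations land in each status; the counts will depend entirely on which CSVs you downloaded (a one-line sketch):

#: How many locations fall into each service category?
reliable_service_df.value_counts()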
pd.concat: Adding rows or columns
pd.concat is mainly useful when an operation creates another dataframe with the same column labels (i.e., adding rows with the same schema) or the same index labels (i.e., creating new columns for existing data).
#: Create a new dataframe of bike routes
bike_routes_df = pd.DataFrame({
'name': ['Main Street Trail', 'Benches', 'Beltway'],
'type': ['sidewalk', 'paved', 'paved']
})
bike_routes_df
name | type | |
---|---|---|
0 | Main Street Trail | sidewalk |
1 | Benches | paved |
2 | Beltway | paved |
#: Add another row
new_trail_df = pd.DataFrame({
'name': ['Provo Express'],
'type': ['paved']
})
combined_df = pd.concat([bike_routes_df, new_trail_df])
combined_df
name | type | |
---|---|---|
0 | Main Street Trail | sidewalk |
1 | Benches | paved |
2 | Beltway | paved |
0 | Provo Express | paved |
#: Add a pair of new columns, which are added according to the index
new_columns_df = pd.DataFrame({
'status': ['open', 'open', 'open', 'planned'],
'condition': ['good', 'poor', 'failed', None]
})
print(combined_df.index)
print(new_columns_df.index)
new_combined_df = pd.concat([combined_df, new_columns_df], axis='columns') #: Note axis=1 to append columns instead of rows
Int64Index([0, 1, 2, 0], dtype='int64')
RangeIndex(start=0, stop=4, step=1)
--------------------------------------------------------------------------- InvalidIndexError Traceback (most recent call last) <ipython-input-73-9b53eda5ce76> in <module> 6 print(combined_df.index) 7 print(new_columns_df.index) ----> 8 new_combined_df = pd.concat([combined_df, new_columns_df], axis='columns') #: Note axis=1 to append columns instead of rows ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs) 309 stacklevel=stacklevel, 310 ) --> 311 return func(*args, **kwargs) 312 313 return wrapper ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy) 305 ) 306 --> 307 return op.get_result() 308 309 ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\reshape\concat.py in get_result(self) 526 obj_labels = obj.axes[1 - ax] 527 if not new_labels.equals(obj_labels): --> 528 indexers[ax] = obj_labels.get_indexer(new_labels) 529 530 mgrs_indexers.append((obj._mgr, indexers)) ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\indexes\base.py in get_indexer(self, target, method, limit, tolerance) 3440 3441 if not self._index_as_unique: -> 3442 raise InvalidIndexError(self._requires_unique_msg) 3443 3444 if not self._should_compare(target) and not is_interval_dtype(self.dtype): InvalidIndexError: Reindexing only valid with uniquely valued Index objects
combined_df.reset_index(inplace=True)
print(combined_df.index)
print(new_columns_df.index)
new_combined_df = pd.concat([combined_df, new_columns_df], axis='columns')
new_combined_df
RangeIndex(start=0, stop=4, step=1)
RangeIndex(start=0, stop=4, step=1)
index | name | type | status | condition | |
---|---|---|---|---|---|
0 | 0 | Main Street Trail | sidewalk | open | good |
1 | 1 | Benches | paved | open | poor |
2 | 2 | Beltway | paved | open | failed |
3 | 0 | Provo Express | paved | planned | None |
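If you don't need to preserve the original row labels, pd.concat can hand out a fresh index for you and skip the .reset_index() step for row-wise concatenation (a small sketch with the same frames):

#: ignore_index=True renumbers the rows 0..n-1 as they're concatenated
pd.concat([bike_routes_df, new_trail_df], ignore_index=True)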
.merge allows you to do SQL-style joins on two different dataframes based on a common key.
This provides much more flexibility than pd.concat on the columns used for the keys and allows you to specify the join type (inner, outer, etc).
#: first, let's drop the index column from the previous .reset_index() call
new_combined_df.drop(columns=['index'], inplace=True)
new_combined_df
name | type | status | condition | |
---|---|---|---|---|
0 | Main Street Trail | sidewalk | open | good |
1 | Benches | paved | open | poor |
2 | Beltway | paved | open | failed |
3 | Provo Express | paved | planned | None |
#: Build our new dataframe of surface types and descriptions
surface_description_df = pd.DataFrame({
'surface_type': ['sidewalk', 'paved', 'gravel'],
'surface_description': ['A shared-use path usually consisting of concrete four to eight feet wide', 'An asphalt-paved shared-use path at least 10 feet wide', 'A gravel-based natural-surface path'],
})
surface_description_df
surface_type | surface_description | |
---|---|---|
0 | sidewalk | A shared-use path usually consisting of concre... |
1 | paved | An asphalt-paved shared-use path at least 10 f... |
2 | gravel | A gravel-based natural-surface path |
#: Inner merge: only rows whose key is in both dataframes
new_combined_df.merge(surface_description_df, left_on='type', right_on='surface_type', how='inner')
name | type | status | condition | surface_type | surface_description | |
---|---|---|---|---|---|---|
0 | Main Street Trail | sidewalk | open | good | sidewalk | A shared-use path usually consisting of concre... |
1 | Benches | paved | open | poor | paved | An asphalt-paved shared-use path at least 10 f... |
2 | Beltway | paved | open | failed | paved | An asphalt-paved shared-use path at least 10 f... |
3 | Provo Express | paved | planned | None | paved | An asphalt-paved shared-use path at least 10 f... |
#: Outer: Keep all rows, no matter if the key is missing in one
new_combined_df.merge(surface_description_df, left_on='type', right_on='surface_type', how='outer', indicator=True)
name | type | status | condition | surface_type | surface_description | _merge | |
---|---|---|---|---|---|---|---|
0 | Main Street Trail | sidewalk | open | good | sidewalk | A shared-use path usually consisting of concre... | both |
1 | Benches | paved | open | poor | paved | An asphalt-paved shared-use path at least 10 f... | both |
2 | Beltway | paved | open | failed | paved | An asphalt-paved shared-use path at least 10 f... | both |
3 | Provo Express | paved | planned | None | paved | An asphalt-paved shared-use path at least 10 f... | both |
4 | NaN | NaN | NaN | NaN | gravel | A gravel-based natural-surface path | right_only |
Question: How many people are there per supermarket in each county?
#: Load in the data
places_df = pd.DataFrame.spatial.from_featureclass(r'data/open_source_places.gdb/OpenSourcePlaces')
new_counties_df = pd.DataFrame.spatial.from_featureclass(r'data/county_boundaries.gdb/Counties')
#: Gives us the frequency of all the unique values in a series
places_df['category'].value_counts()
building                5479
restaurant              2352
christian               1913
park                    1870
fast_food               1681
                        ...
greengrocer                3
jewish                     2
embassy                    2
christian_protestant       1
hindu                      1
Name: category, Length: 105, dtype: int64
#: Returns all the unique values in a series and then sorts them
sorted(places_df['category'].unique())
['airport', 'archaeological', 'arts_centre', 'attraction', 'bakery', 'bank', 'bar', 'beauty_shop', 'beverages', 'bicycle_rental', 'bicycle_shop', 'bookshop', 'buddhist', 'building', 'butcher', 'cafe', 'camp_site', 'car_dealership', 'car_rental', 'car_wash', 'caravan_site', 'chemist', 'christian', 'christian_anglican', 'christian_catholic', 'christian_lutheran', 'christian_methodist', 'christian_protestant', 'cinema', 'clothes', 'college', 'community_centre', 'computer_shop', 'convenience', 'courthouse', 'dentist', 'department_store', 'doctors', 'doityourself', 'embassy', 'fast_food', 'fire_station', 'florist', 'furniture_shop', 'garden_centre', 'general', 'gift_shop', 'golf_course', 'graveyard', 'greengrocer', 'guesthouse', 'hairdresser', 'helipad', 'hindu', 'hospital', 'hostel', 'hotel', 'jeweller', 'jewish', 'kindergarten', 'laundry', 'library', 'mall', 'market_place', 'memorial', 'mobile_phone_shop', 'monument', 'motel', 'museum', 'muslim', 'nightclub', 'nursing_home', 'optician', 'outdoor_shop', 'park', 'pharmacy', 'picnic_site', 'playground', 'police', 'post_office', 'pub', 'railway_station', 'restaurant', 'ruins', 'school', 'shelter', 'shoe_shop', 'sports_centre', 'sports_shop', 'stadium', 'stationery', 'supermarket', 'swimming_pool', 'theatre', 'tourist_info', 'tower', 'town_hall', 'toy_shop', 'travel_agent', 'university', 'vending_any', 'veterinary', 'video_shop', 'viewpoint', 'zoo']
#: Filter down to just supermarkets, make a copy
supermarkets_df = places_df[places_df['category'] == 'supermarket'].copy()
supermarkets_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 273 entries, 68 to 18846 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OBJECTID 273 non-null int64 1 addr_dist 179 non-null float64 2 osm_id 273 non-null object 3 category 273 non-null object 4 name 273 non-null object 5 county 273 non-null object 6 city 273 non-null object 7 zip 273 non-null object 8 block_id 273 non-null object 9 ugrc_addr 273 non-null object 10 disclaimer 273 non-null object 11 lon 273 non-null float64 12 lat 273 non-null float64 13 amenity 273 non-null object 14 cuisine 273 non-null object 15 tourism 273 non-null object 16 shop 273 non-null object 17 website 273 non-null object 18 phone 273 non-null object 19 open_hours 273 non-null object 20 osm_addr 273 non-null object 21 SHAPE 273 non-null geometry dtypes: float64(3), geometry(1), int64(1), object(17) memory usage: 49.1+ KB
supermarkets_df.head()
OBJECTID | addr_dist | osm_id | category | name | county | city | zip | block_id | ugrc_addr | ... | lat | amenity | cuisine | tourism | shop | website | phone | open_hours | osm_addr | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
68 | 69 | NaN | 306835249 | supermarket | Winegar's | WEBER | OGDEN | 84067 | 490572105133005 | None | ... | 41.177380 | None | None | None | supermarket | None | None | None | None | {"x": -112.06271881999999, "y": 41.17738674000... |
87 | 88 | 22.674714 | 307362738 | supermarket | Kents | WEBER | OGDEN | 84067 | 490572105091021 | 3535 W 5500 S | ... | 41.162968 | None | None | None | supermarket | None | None | None | None | {"x": -112.06581891999997, "y": 41.16297414000... |
111 | 112 | 11.190251 | 308444242 | supermarket | Natural Foods | WEBER | OGDEN | 84405 | 490572105122014 | 1050 W RIVERDALE RD | ... | 41.177574 | None | None | None | supermarket | None | None | None | None | {"x": -112.00384751999997, "y": 41.17757994000... |
121 | 122 | NaN | 308973217 | supermarket | Kent's | DAVIS | CLEARFIELD | 84015 | 490111257013011 | None | ... | 41.104546 | None | None | None | supermarket | None | None | None | None | {"x": -112.01850231899999, "y": 41.10455224000... |
161 | 162 | NaN | 355819779 | supermarket | Winegar's Grocery | DAVIS | CLEARFIELD | 84015 | 490111255011027 | None | ... | 41.119408 | None | None | None | supermarket | None | None | None | None | {"x": -112.04676911899998, "y": 41.11941383900... |
5 rows × 22 columns
#: Try the join
supermarkets_df.spatial.join(new_counties_df)
--------------------------------------------------------------------------- Exception Traceback (most recent call last) <ipython-input-84-e1c3f1afe7cb> in <module> 1 #: Try the join ----> 2 supermarkets_df.spatial.join(new_counties_df) ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\arcgis\features\geo\_accessor.py in join(self, right_df, how, op, left_tag, right_tag) 1552 ) 1553 if self.sr != right_df.spatial.sr: -> 1554 raise Exception("Difference Spatial References, aborting operation") 1555 index_left = "index_{}".format(left_tag) 1556 index_right = "index_{}".format(right_tag) Exception: Difference Spatial References, aborting operation
print(supermarkets_df.spatial.sr)
print(new_counties_df.spatial.sr)
{'wkid': 4326, 'latestWkid': 4326} {'wkid': 102100, 'latestWkid': 3857}
#: Reproject the supermarkets to Web Mercator
supermarkets_df.spatial.project(3857)
True
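Since .project() modifies the dataframe in place and only returns True, it's worth a quick check (my own addition, not in the original) that the spatial references now match before retrying the join:
#: Confirm both dataframes now share the same spatial reference
print(supermarkets_df.spatial.sr)
print(new_counties_df.spatial.sr)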
supermarkets_df.spatial.join(new_counties_df, how='inner', op='within')
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-87-a0d7db3a0141> in <module> ----> 1 supermarkets_df.spatial.join(new_counties_df, how='inner', op='within') ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\arcgis\features\geo\_accessor.py in join(self, right_df, how, op, left_tag, right_tag) 1616 check_predicates( 1617 left_df[self.name].apply(lambda x: x)[l_idx], -> 1618 right_df[right_df.spatial._name][r_idx], 1619 ), 1620 ] ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\series.py in __getitem__(self, key) 964 return self._get_values(key) 965 --> 966 return self._get_with(key) 967 968 def _get_with(self, key): ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\series.py in _get_with(self, key) 999 # (i.e. self.iloc) or label-based (i.e. self.loc) 1000 if not self.index._should_fallback_to_positional(): -> 1001 return self.loc[key] 1002 else: 1003 return self.iloc[key] ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key) 929 930 maybe_callable = com.apply_if_callable(key, self.obj) --> 931 return self._getitem_axis(maybe_callable, axis=axis) 932 933 def _is_scalar_access(self, key: tuple): ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis) 1151 raise ValueError("Cannot index with multidimensional key") 1152 -> 1153 return self._getitem_iterable(key, axis=axis) 1154 1155 # nested tuple slicing ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis) 1091 1092 # A collection of keys -> 1093 keyarr, indexer = self._get_listlike_indexer(key, axis) 1094 return self.obj._reindex_with_indexers( 1095 {axis: [keyarr, indexer]}, copy=True, allow_dups=True ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis) 1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr) 1313 -> 1314 self._validate_read_indexer(keyarr, indexer, axis) 1315 1316 if needs_i8_conversion(ax.dtype) or isinstance( ~\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcpy\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis) 1375 1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique()) -> 1377 raise KeyError(f"{not_found} not in index") 1378 1379 KeyError: '[14081, 8355, 13348, 8259, 13419, 8848, 11120, 882, 18196, 8181, 8183, 8634, 15324, 1159, 967, 970, 1162, 990, 4480, 10754, 10758, 1547, 11033, 3099, 14329, 11048, 11689, 3376, 15412, 15546, 11579, 11462, 8134, 17763, 10986, 2286, 9071, 12923, 8062, 16385, 12866, 18821, 14764, 18316, 11475, 950, 13147, 3039, 1166, 4881, 9368, 1701, 4774, 1194, 12337, 9272, 5949, 11714, 9539, 1093, 7120, 4817, 11730, 982, 4835, 11877, 998, 11753, 1014, 18706, 17705, 5394, 13557, 12906, 5491, 5717, 5625, 5792, 1186, 5859, 13758, 12843, 13868, 1071, 1075, 6707, 1238, 3225, 2814, 1247, 3467, 1276, 1292, 5086, 3694, 3090, 18020, 13718, 1731, 16745, 6042, 16747, 3324, 1866, 17693, 3314, 15374, 1024, 13601, 5826, 2787, 16123, 9498, 3071, 4243, 6164, 6168, 13730, 16039, 15296, 12867, 12241, 13820, 13809, 13822, 874, 12288, 7621, 2055, 2572, 2573, 14353, 8212, 8224, 8226, 11300, 3110, 8232, 8233, 9257, 11309, 15920, 11314, 
16947, 17460, 2104, 2556, 8790, 1113, 12892, 16991, 15457, 13924, 1129, 15468, 12402, 11379, 12420, 11406, 14481, 2195, 9883, 1183, 2213, 12455, 7336, 5805, 9906, 10936, 11453, 18627, 12995, 17101, 5332, 8416, 12521, 13034, 17131, 16622, 9457, 8947, 2803, 1790, 8970, 5901, 3854, 5905, 6422, 11548, 12576, 9507, 9003, 2347, 5935, 5936, 13104, 2353, 7988, 2358, 1849, 13120, 13633, 3909, 3913, 5450, 13135, 3920, 9044, 9046, 8546, 12643, 12647, 13168, 6004, 9081, 14714, 15227, 3455, 14722, 6041, 12187, 18846, 16292, 4520, 4533, 16823, 4538, 4550, 7119, 8150, 16343, 12783, 8176, 5105, 11256, 4604, 17405, 11911, 10775, 10780, 6519, 12065, 5284, 13989, 12202, 11954, 11317, 11957, 11448, 11321, 11964, 5948, 3266, 7105, 14659, 16069, 1735, 11978, 15179, 10828, 3024, 15185, 8403, 13658, 1376, 9959, 7024, 10231] not in index'
#: The KeyError comes from the filtered dataframe keeping its original, non-sequential index labels (68, 87, ...)
#: Reset the index and call the spatial join again using method chaining
supermarkets_df.reset_index().spatial.join(new_counties_df, how='inner', op='within').head()
level_0 | OBJECTID_left | addr_dist | osm_id | category | name | county | city | zip | block_id | ... | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | FIPS_STR | COLOR4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 69 | NaN | 306835249 | supermarket | Winegar's | WEBER | OGDEN | 84067 | 490572105133005 | ... | 2.010291e+09 | 2010.0 | WEBER | 57.0 | North | 262223 | 269561 | {1757A80B-1895-4975-81DA-3AD2EBBA64C9} | 49057 | 1 |
1 | 1 | 88 | 22.674714 | 307362738 | supermarket | Kents | WEBER | OGDEN | 84067 | 490572105091021 | ... | 2.010291e+09 | 2010.0 | WEBER | 57.0 | North | 262223 | 269561 | {1757A80B-1895-4975-81DA-3AD2EBBA64C9} | 49057 | 1 |
2 | 2 | 112 | 11.190251 | 308444242 | supermarket | Natural Foods | WEBER | OGDEN | 84405 | 490572105122014 | ... | 2.010291e+09 | 2010.0 | WEBER | 57.0 | North | 262223 | 269561 | {1757A80B-1895-4975-81DA-3AD2EBBA64C9} | 49057 | 1 |
3 | 10 | 983 | 22.493216 | 490548656 | supermarket | Wangsgards | WEBER | OGDEN | 84404 | 490572003013001 | ... | 2.010291e+09 | 2010.0 | WEBER | 57.0 | North | 262223 | 269561 | {1757A80B-1895-4975-81DA-3AD2EBBA64C9} | 49057 | 1 |
4 | 12 | 999 | NaN | 509040436 | supermarket | Smith's Marketplace | WEBER | OGDEN | 84414 | 490572102041016 | ... | 2.010291e+09 | 2010.0 | WEBER | 57.0 | North | 262223 | 269561 | {1757A80B-1895-4975-81DA-3AD2EBBA64C9} | 49057 | 1 |
5 rows × 36 columns
#: Now let's save the join and only get the name and population columns from the counties
new_supermarkets_df = supermarkets_df.reset_index(drop=True).spatial.join(new_counties_df[['NAME', 'POP_LASTCENSUS', 'SHAPE']], how='inner', op='within')
new_supermarkets_df.head()
OBJECTID | addr_dist | osm_id | category | name | county | city | zip | block_id | ugrc_addr | ... | tourism | shop | website | phone | open_hours | osm_addr | SHAPE | index_right | NAME | POP_LASTCENSUS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 69 | NaN | 306835249 | supermarket | Winegar's | WEBER | OGDEN | 84067 | 490572105133005 | None | ... | None | supermarket | None | None | None | None | {"x": -12474764.795952193, "y": 5038541.480961... | 6 | WEBER | 262223 |
1 | 88 | 22.674714 | 307362738 | supermarket | Kents | WEBER | OGDEN | 84067 | 490572105091021 | 3535 W 5500 S | ... | None | supermarket | None | None | None | None | {"x": -12475109.8975056, "y": 5036410.11323035... | 6 | WEBER | 262223 |
2 | 112 | 11.190251 | 308444242 | supermarket | Natural Foods | WEBER | OGDEN | 84405 | 490572105122014 | 1050 W RIVERDALE RD | ... | None | supermarket | None | None | None | None | {"x": -12468211.272813853, "y": 5038570.054997... | 6 | WEBER | 262223 |
3 | 983 | 22.493216 | 490548656 | supermarket | Wangsgards | WEBER | OGDEN | 84404 | 490572003013001 | 145 HARRISVILLE RD | ... | None | supermarket | None | None | None | None | {"x": -12464702.716234982, "y": 5050771.313416... | 6 | WEBER | 262223 |
4 | 999 | NaN | 509040436 | supermarket | Smith's Marketplace | WEBER | OGDEN | 84414 | 490572102041016 | None | ... | None | supermarket | None | None | Mo-Su 06:00-24:00 | 2434 North 400 East North Ogden 84414 | {"x": -12464039.419049088, "y": 5057129.544228... | 6 | WEBER | 262223 |
5 rows × 25 columns
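An inner join with op='within' silently drops any supermarkets that don't fall inside a county polygon, so a quick row-count comparison (my own sanity check, not in the original) confirms nothing unexpected disappeared:
#: Compare row counts before and after the spatial join
print(len(supermarkets_df), len(new_supermarkets_df))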
Now let's use our join results to get the number of supermarkets per county and the total county population.
#: Use groupby to get the total count of rows in each group ('category' is an arbitrary choice; any fully-populated column gives the same count)
new_supermarkets_df.groupby('NAME')['category'].count().head()
NAME BEAVER 1 BOX ELDER 3 CACHE 10 CARBON 2 DAVIS 18 Name: category, dtype: int64
#: And just get the first population value in each group (they're all the same per group)
new_supermarkets_df.groupby('NAME')['POP_LASTCENSUS'].first().head()
NAME BEAVER 7072 BOX ELDER 57666 CACHE 133154 CARBON 20412 DAVIS 362679 Name: POP_LASTCENSUS, dtype: int64
#: Concat our groupby outputs into a new DataFrame
answer_df = pd.concat([new_supermarkets_df.groupby('NAME')['category'].count(), new_supermarkets_df.groupby('NAME')['POP_LASTCENSUS'].first()], axis=1)
answer_df.head()
category | POP_LASTCENSUS | |
---|---|---|
NAME | ||
BEAVER | 1 | 7072 |
BOX ELDER | 3 | 57666 |
CACHE | 10 | 133154 |
CARBON | 2 | 20412 |
DAVIS | 18 | 362679 |
#: Calculate our metric and clean up the column names
answer_df['people_per_supermarket'] = answer_df['POP_LASTCENSUS'] / answer_df['category']
answer_df.rename(columns={'category': 'supermarkets', 'POP_LASTCENSUS': 'pop_last_census'}, inplace=True)
answer_df.head()
supermarkets | pop_last_census | people_per_supermarket | |
---|---|---|---|
NAME | |||
BEAVER | 1 | 7072 | 7072.000000 |
BOX ELDER | 3 | 57666 | 19222.000000 |
CACHE | 10 | 133154 | 13315.400000 |
CARBON | 2 | 20412 | 10206.000000 |
DAVIS | 18 | 362679 | 20148.833333 |
#: Use chaining and line continuation to do it in one statement
new_answer_df = (pd.concat([new_supermarkets_df.groupby('NAME')['category'].count(), new_supermarkets_df.groupby('NAME')['POP_LASTCENSUS'].first()], axis=1)
.assign(people_per_supermarket=lambda x: x['POP_LASTCENSUS'] / x['category'])
.rename(columns={'category': 'supermarkets', 'POP_LASTCENSUS': 'pop_last_census'}))
new_answer_df.head()
supermarkets | pop_last_census | people_per_supermarket | |
---|---|---|---|
NAME | |||
BEAVER | 1 | 7072 | 7072.000000 |
BOX ELDER | 3 | 57666 | 19222.000000 |
CACHE | 10 | 133154 | 13315.400000 |
CARBON | 2 | 20412 | 10206.000000 |
DAVIS | 18 | 362679 | 20148.833333 |
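If you'd rather not group twice, the same result can be built with a single groupby and named aggregation (available in recent pandas versions). This is an alternative sketch, not the original notebook's approach:
#: Alternative sketch: one groupby with named aggregation instead of two groupbys and a concat
agg_answer_df = (new_supermarkets_df
                 .groupby('NAME')
                 .agg(supermarkets=('category', 'count'), pop_last_census=('POP_LASTCENSUS', 'first'))
                 .assign(people_per_supermarket=lambda df: df['pop_last_census'] / df['supermarkets']))
agg_answer_df.head()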
#: Now let's join our new data back to the geometries using just name and shape columns
merged_df = counties_df[['NAME', 'SHAPE']].merge(answer_df, left_on='NAME', right_on='NAME')
merged_df.sort_values(by='people_per_supermarket').head()
NAME | SHAPE | supermarkets | pop_last_census | people_per_supermarket | |
---|---|---|---|---|---|
15 | GARFIELD | {'rings': [[[-12497349.647300001, 4600620.4720... | 4 | 5083 | 1270.75 |
21 | WAYNE | {'rings': [[[-12435226.4528, 4651746.087300003... | 1 | 2486 | 2486.00 |
11 | RICH | {'rings': [[[-12361664.9952, 5161235.351800002... | 1 | 2510 | 2510.00 |
10 | GRAND | {'rings': [[[-12139959.1677, 4793360.2117], [-... | 3 | 9669 | 3223.00 |
17 | SUMMIT | {'rings': [[[-12245108.298, 5011966.523199998]... | 12 | 42357 | 3529.75 |
merged_df.spatial.plot()
We've already used some aspects of Esri's spatially-enabled dataframes, which are an extension to the normal pandas DataFrame namespace provided by the ArcGIS API for Python.
from arcgis.features import GeoAccessor, GeoSeriesAccessor
These imports add the .spatial attribute to pd.DataFrame, which provides access to a bunch of spatial methods and attributes. We've already used pd.DataFrame.spatial.from_featureclass(), .project(), .join(), and .sr.
The documentation for all the methods and attributes exposed through .spatial can be found in the GeoAccessor class of the arcgis.features module in the ArcGIS API for Python docs.
Because the ArcGIS API for Python does not require ArcGIS Pro/Enterprise, it can use two different geometry engines for spatial data types and operations.
If arcpy is available in your python environment via ArcGIS Pro/Enterprise, it uses arcpy's underlying geometry operations (just without all the feature layer nonsense).
If arcpy is not available, it uses the shapely open-source library for geometry operations. The geometry objects will look a little different, and you won't be able to write to File GDBs (though you can still read from them).
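If you're not sure which engine a given environment will end up with, a plain import check tells you; this is just a generic Python sketch, not an official API call:
#: Check whether arcpy is importable in the current environment
try:
    import arcpy  #: only imported to test availability
    print("arcpy found; the API will use its geometry engine")
except ImportError:
    print("arcpy not found; the API will fall back to shapely")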
#: From a Feature Class
pd.DataFrame.spatial.from_featureclass(r'data/county_boundaries.gdb/Counties').head()
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | FIPS_STR | COLOR4 | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | {AD3015BE-B3C9-4316-B8DC-03AFBB56B443} | 49005 | 2 | {"rings": [[[-12485167.954, 5160638.807099998]... |
1 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13.0 | Central | 19596 | 20161 | {7F6252B7-137F-4F7E-A19A-03C702111509} | 49013 | 4 | {"rings": [[[-12273167.1888, 4987644.314599998... |
2 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | {214E29A1-CC25-4867-9148-2817DB623088} | 49011 | 3 | {"rings": [[[-12458980.0088, 5032817.971900001... |
3 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | {BC91C742-7508-4118-8469-297719E423CB} | 49027 | 3 | {"rings": [[[-12494338.1205, 4801287.949100003... |
4 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | {3D0C5C1E-2650-458E-B322-2B86AA473441} | 49051 | 2 | {"rings": [[[-12400515.3909, 4966751.283200003... |
#: From a hosted feature layer
feature_layer = arcgis.features.FeatureLayer('https://services1.arcgis.com/99lidPhWCzftIe9K/arcgis/rest/services/UtahCountyBoundaries/FeatureServer/0')
pd.DataFrame.spatial.from_layer(feature_layer).head()
OBJECTID | COUNTYNBR | ENTITYNBR | ENTITYYR | NAME | FIPS | STATEPLANE | POP_LASTCENSUS | POP_CURRESTIMATE | GlobalID | FIPS_STR | COLOR4 | Shape__Area | Shape__Length | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 03 | 2.010031e+09 | 2010.0 | CACHE | 5.0 | North | 133154 | 140173 | ad3015be-b3c9-4316-b8dc-03afbb56b443 | 49005 | 2 | 5.456780e+09 | 388684.773484 | {"rings": [[[-12485167.954, 5160638.8071], [-1... |
1 | 2 | 07 | 2.010071e+09 | 2010.0 | DUCHESNE | 13.0 | Central | 19596 | 20161 | 7f6252b7-137f-4f7e-a19a-03c702111509 | 49013 | 4 | 1.448945e+10 | 517976.270842 | {"rings": [[[-12273167.1888, 4987644.3146], [-... |
2 | 3 | 06 | 2.010061e+09 | 2010.0 | DAVIS | 11.0 | North | 362679 | 369948 | 214e29a1-cc25-4867-9148-2817db623088 | 49011 | 3 | 2.889970e+09 | 279016.597702 | {"rings": [[[-12458980.0088, 5032817.9719], [-... |
3 | 4 | 14 | 2.010141e+09 | 2010.0 | MILLARD | 27.0 | Central | 12975 | 13330 | bc91c742-7508-4118-8469-297719e423cb | 49027 | 3 | 2.942199e+10 | 742424.559718 | {"rings": [[[-12494338.1205, 4801287.9491], [-... |
4 | 5 | 26 | 2.010261e+09 | 2010.0 | WASATCH | 51.0 | Central | 34788 | 36619 | 3d0c5c1e-2650-458e-b322-2b86aa473441 | 49051 | 2 | 5.396133e+09 | 436612.959291 | {"rings": [[[-12400515.3909, 4966751.2832], [-... |
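When you only need part of a large hosted layer, you can (if I recall the API correctly) query the layer first and convert the resulting FeatureSet with its .sdf property instead of pulling everything down:
#: Hypothetical example: query a subset of the layer, then convert the FeatureSet to a dataframe
cache_fs = feature_layer.query(where="NAME = 'CACHE'", out_fields='NAME,POP_LASTCENSUS')
cache_fs.sdf.head()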
#: From a dataframe containing lat/longs
input_df = pd.DataFrame({
'ugic': ['Midway', 'Vernal'],
'latitude': [40.52528, 40.45389],
'longitude': [-111.48883, -109.52327]
})
pd.DataFrame.spatial.from_xy(input_df, x_column='longitude', y_column='latitude').head()
ugic | latitude | longitude | SHAPE | |
---|---|---|---|---|
0 | Midway | 40.52528 | -111.48883 | {"spatialReference": {"wkid": 4326}, "x": -111... |
1 | Vernal | 40.45389 | -109.52327 | {"spatialReference": {"wkid": 4326}, "x": -109... |
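from_xy assumes WGS84 (wkid 4326) by default; if your coordinates are in another system, I believe you can pass an sr argument. A hypothetical sketch with rough Web Mercator coordinates for Midway:
#: Hypothetical example: coordinates already in Web Mercator, so tell from_xy the spatial reference
web_mercator_df = pd.DataFrame({
    'ugic': ['Midway'],
    'x': [-12410881.0],  #: approximate values for illustration only
    'y': [4943100.0]
})
pd.DataFrame.spatial.from_xy(web_mercator_df, x_column='x', y_column='y', sr=3857)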
SEDFs can be exported to several different forms using the df.spatial.to_* methods:
- to_featureset/to_feature_collection: arcgis.features.FeatureSet or .FeatureCollection objects.
- to_featureclass: Write to a feature class within a GDB or shapefile (depending on file extension and whether arcpy is present).
- to_featurelayer: Create or overwrite a hosted feature layer in an arcgis.gis.GIS (AGOL or Portal organization).
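As a rough sketch of the most common case (the output path below is just a placeholder, not from the original):
#: Hypothetical example: write the counties-with-supermarket-stats result out to a file geodatabase
merged_df.spatial.to_featureclass(location=r'data/outputs.gdb/supermarkets_per_county')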
Sometimes, we perform a .spatial operation only to get a weird error that seems to be related to geometries. We can use .spatial.validate() to make sure the Spatially-Enabled DataFrame is, well, spatially-enabled.
#: Use .validate to check if all the spatial bits are working
counties_df.spatial.validate()
True
What if this returns False? There are a couple of common fixes.
#: First, make sure it's pointing to the right geometry column.
#: Don't try to keep multiple geometry columns in one DataFrame.
counties_df.spatial.set_geometry('SHAPE')
counties_df.spatial.name #: The name of the geometry column
'SHAPE'
#: If using shapely, projecting often misnames the .sr property, so set it manually
counties_df.spatial.sr = {'wkid': 3857}
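After either fix, it's worth re-running the checks (my own follow-up, not in the original):
#: Confirm the spatial reference stuck and the dataframe validates again
print(counties_df.spatial.sr)
counties_df.spatial.validate()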