Package palletjack
A library for extracting tabular/spatial data from many different sources, transforming it to meet ArcGIS Online hosted feature service requirements, and loading it into existing feature services.
Overview
palletjack
is a library of modules for
updating data in ArcGIS Online (AGOL) based on information from external data. It handles the
repetitive parts of the extract, transform, and load stages of the ETL process, allowing the user to
focus on the custom transform steps unique to every project.
As a library, palletjack is not meant to be used by itself. Rather, its classes and methods should be used by other apps written for specific use cases. These other apps are referred to as the "client" apps, and internally we often refer to them as "skids".
pandas dataframes are the main unifying data structure between the different steps. The client loads data from an external source into a dataframe and then modifies the dataframe according to their business needs. Once the dataframe is ready to go, it is used to update the hosted feature service in AGOL.
Organization
The individual modules within palletjack each handle their own step of the ETL process. Each module contains classes for accomplishing its task organized by source, operation, or destination. There may be multiple, similar methods in a class depending on exactly how you want to perform a given step—you probably won't use all the available classes and methods in every application. The publicly-exposed methods usually call several private methods to keep functions small and testable.
Classes in palletjack.extract
handle the extract stage, pulling data in from external sources. Tabular data (csvs, Google Sheets)
are loaded into dataframes, while non-tabular data for attachments are just downloaded to the
specified locations. Spatial data are loaded into spatially-enable dataframes. You'll instantiate
the desired class with the basic connection info and then call the appropriate method on the
resulting object to extract the data.
There are a handful of classes in
palletjack.transform
with
methods for cleaning and preparing your dataframes for upload to AGOL. You may also need to modify
your data to fit your specific business needs: calculating fields, renaming fields, performing
quality checks, etc. Some classes only have static methods can be called directly without needing to
instantiate the class.
Once your dataframe is looking pretty, the
palletjack.load
module will help you
update a hosted feature service with your new data. The FeatureServiceUpdater
class
represents a single feature layer and its supporting information, allowing you to call the
individual methods for the necessary parts of the update process (add, remove, update, or completely
truncate and load).
While many parts of the classes' functionality are hidden in private methods, commonly-used code is
exposed publicly in the
palletjack.utils
module. You will
probably not need any of the methods provided, but they may be useful for other projects. This is
palletjack's junk drawer.
Data Considerations
Under the hood, palletjack uses the arcgis.features.FeatureLayer.append()
method to
upload data. To eliminate the dependency on arcpy
(and thus ArcGIS Pro/Enterprise), it
uses GeoPandas and pyogrio to save data to a geodatabase, uploads the geodatabase to AGOL, calls
.append()
using the geodatabase as the source, and then deletes the geodatabase item
from AGOL. palletjack tests for all the known gotchas and raises an error if the data needs extra
work before uploading. In addition, AGOL imposes its own set of constraints.
Field Names
The column names in your dataframes should match the field names in AGOL one-to-one, with the
exception of AGOL's auto-generated fields (shape length/area, editor tracking, etc). You can use
DataCleaning.rename_dataframe_columns_for_agol()
to handle the majority of field name formatting, but you'll want to double check the results.
Field Types
The upload process is very particular about data types and missing data.
FieldChecker.check_live_and_new_field_types_match()
contains a mapping of dataframe dtypes to Esri field types. They generally follow what you would
expect. However, because pandas (currently) handles missing data by with np.nan
by
default, you may have integer data assigned a float dtype. In addition, some sources render missing
data as an empty string, creating an object dtype. Finally, datetimes must be in UTC and stored in
the non-timezone-aware datetime64[ns]
dtype.
DataCleaning
has methods to help convert your data to these dtypes. In addition, it's a good practice to use
pandas' nullable dtypes via pd.DataFrame.convert_dtypes()
(see also the section on nullable ints).
OBJECTID and Join Keys
If you want to update existing data without truncating and loading, you will need a join key between
the incoming new data and the existing AGOL data. Do not use OBJECTID for this field; it may change
at any time. Instead, use your own custom field that you have complete control over. You will
perform the join manually in the transform step with pandas by loading the live AGOL data into a
dataframe, joining the new data into the live data, and then passing the resulting dataframe to
FeatureServiceUpdater.update_features()
.
This method uses the live data's OBJECTID to apply the edits to the proper rows.
Error handling
The client is responsible for handling errors and warnings that arise during the process. palletjack will raise its own errors when something occurs that would keep the process from continuing or when one of its internal data checks fails. It will also captured and chain errors from the underlying libraries to include additional, context-specific messages for the user. It will raise warnings when something happens that the client should be aware of but will not keep the process from completing.
Logging
palletjack takes full advantage of python's built-in logging
library to perform its feedback and reporting. It employs a hierarchical structure, creating module-level
loggers for each module and then each class in a module creates their own child logger. This allows
rapid identification of where log events are occurring.
Accessing palletjack logs
The client can get a reference to the palletjack logger and add their handlers, formatters, etc to it alongside its own logger:
myapp_logger = logging.getLogger('my_app')
myapp_logger.setLevel(logging.INFO)
palletjack_logger = logging.getLogger('palletjack')
palletjack_logger.setLevel(logging.INFO)
#: set up handlers and formatters
#: ...
myapp_logger.addHandler(log_handler)
palletjack_logger.addHandler(log_handler)
Log levels
logging.DEBUG
includes verbose debug info that should allow you to manually (possibly programmatically) undo or redo any operation.logging.INFO
includes standard runtime progress reports and result information.logging.WARNING
includes negative results from checks or other situations that the user should be aware of.
Updating from v3 to v4
palletjack v4's biggest breaking change requires you to now instantiate a
FeatureServiceUpdater
class yourself before calling the appropriate methods. v3 used
class methods to handle instantiation for you, but we've broken this out into the more traditional
pattern to store the additional information that's common to all the steps.
v4 also completely does away with JSON for uploads and storage. Because of this, you no longer need to worry about projecting to WGS84 or (relatively sane) dataset sizes. In addition, truncate and load now uses a simple boolean flag to save the existing data, and saves it as a file gdb instead of a JSON file.
Updating from v2 to v3
palletjack v3 has several changes from the previous version that users will need to consider when updating existing clients. Version 3 is designed to align with each step in the ETL process and to better follow the single-responsibility principle.
Namespace Changes
The largest change is that the namespace has been refactored to match the ETL steps.
loaders.py
has been changed to extract.py
and updaters.py
has
been changed to load.py
. This eliminates the confusion created by "loaders" being used
in the ETL "extract" stage.
As a corollary to this, clients now import each module rather than palletjack exposing the classes
directly. The recommended import is
from palletjack import palletjack.extract, palletjack.transform, palletjack.load, palletjack.utils
(omitting unused modules as necessary).
Version 3 also introduces the use of class methods to take care of object instantiation for the
client. These are used the most in
FeatureServiceUpdater
,
where the client just calls the relevant methods.
One Step at a Time
As part of the refactor, each method generally only tries to do one thing: extract one piece of data, clean one common data error, or perform one load operation.
Previous versions focused on being able to call a single method to do everything. This quickly got unwieldly and led to a lot of complexity trying to match the complexity of real-world data.
Version 3 load methods expect the client to have already cleaned the data and made it ready for
uploading. Many of the cleaning steps that were coupled to the load methods in version 2 have been
refactored to the
palletjack.transform
module.
.append instead of .edit_features
Under the hood, version 3 has completely replaced FeatureLayer.edit_features()
with
FeatureLayer.append()
based on the recommendation in the ArcGIS API for Python docs.
This has a couple ramifications for the client. First, in order to avoid the arcpy dependency, all
data are converted to geojson for upload. This requires the client to project the dataframes to
WGS84/wkid 4326 prior to updating the feature service. Secondly, the client must separate out add,
update, and delete operations into individual method calls.
Expand source code
"""A library for extracting tabular/spatial data from many different sources, transforming it to meet ArcGIS Online hosted feature service requirements, and loading it into existing feature services.
.. include:: ../../docs/README.md
"""
import locale
from . import extract, load, transform, utils
from .errors import IntFieldAsFloatError, TimezoneAwareDatetimeError
#: If the locale is not set explicitly, set it to the system default for text to number conversions
if not locale.getlocale(locale.LC_NUMERIC)[0]:
locale.setlocale(locale.LC_NUMERIC, locale.getlocale())
Sub-modules
palletjack.errors
-
Errors specific to palletjack
palletjack.extract
-
Extract tabular/spatial data from various sources into a pandas dataframe …
palletjack.load
-
Modify existing ArcGIS Online content (mostly hosted feature services). Contains classes for updating hosted feature service data, modifying the …
palletjack.transform
-
Transform pandas dataframes in preparation for loading to AGOL.
palletjack.utils
-
Utility classes and methods that are used internally throughout palletjack. Many are exposed publicly in case they are useful elsewhere in a client's …
palletjack.version
-
A single source of truth for the version in a programmatically-accessible variable. This must only include a single line: version = '0.0.0'