Package palletjack

A library for extracting tabular/spatial data from many different sources, transforming it to meet ArcGIS Online hosted feature service requirements, and loading it into existing feature services.

Overview

palletjack is a library of modules for updating data in ArcGIS Online (AGOL) based on information from external data. It handles the repetitive parts of the extract, transform, and load stages of the ETL process, allowing the user to focus on the custom transform steps unique to every project.

As a library, palletjack is not meant to be used by itself. Rather, its classes and methods should be used by other apps written for specific use cases. These other apps are referred to as the "client" apps, and internally we often refer to them as "skids".

pandas dataframes are the main unifying data structure between the different steps. The client loads data from an external source into a dataframe and then modifies the dataframe according to their business needs. Once the dataframe is ready to go, it is used to update the hosted feature service in AGOL.

Organization

The individual modules within palletjack each handle their own step of the ETL process. Each module contains classes for accomplishing its task organized by source, operation, or destination. There may be multiple, similar methods in a class depending on exactly how you want to perform a given step—you probably won't use all the available classes and methods in every application. The publicly-exposed methods usually call several private methods to keep functions small and testable.

Classes in palletjack.extract handle the extract stage, pulling data in from external sources. Tabular data (CSVs, Google Sheets) are loaded into dataframes, while non-tabular data for attachments are just downloaded to the specified locations. Spatial data are loaded into spatially-enabled dataframes. You'll instantiate the desired class with the basic connection info and then call the appropriate method on the resulting object to extract the data.
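As a minimal sketch of that instantiate-then-call pattern (the loader class and method names below are hypothetical placeholders, not part of the actual API; check palletjack.extract for the classes that match your source):

from palletjack import extract

#: Hypothetical loader name and source used only to illustrate the pattern
loader = extract.SomeTabularLoader('https://example.com/data.csv')

#: The method call performs the actual extract and returns a dataframe
dataframe = loader.load_data_into_dataframe()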

There are a handful of classes in palletjack.transform with methods for cleaning and preparing your dataframes for upload to AGOL. You may also need to modify your data to fit your specific business needs: calculating fields, renaming fields, performing quality checks, etc. Some classes only have static methods that can be called directly without needing to instantiate the class.

Once your dataframe is looking pretty, the palletjack.load module will help you update a hosted feature service with your new data. The FeatureServiceUpdater class represents a single feature layer and its supporting information, allowing you to call the individual methods for the necessary parts of the update process (add, remove, update, or completely truncate and load).
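For example, an update could look roughly like this (the constructor arguments shown are assumptions for illustration; check the FeatureServiceUpdater documentation for its exact signature):

from arcgis.gis import GIS

from palletjack import load

#: The constructor arguments are assumed for illustration; verify them against the docs
gis = GIS('https://www.arcgis.com', 'username', 'password')
updater = load.FeatureServiceUpdater(gis, 'feature_service_itemid')

#: cleaned_dataframe is the dataframe you prepared in the transform step;
#: call the add, remove, or truncate-and-load methods instead as needed
updater.update_features(cleaned_dataframe)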

While many parts of the classes' functionality are hidden in private methods, commonly-used code is exposed publicly in the palletjack.utils module. You will probably not need any of the methods provided, but they may be useful for other projects. This is palletjack's junk drawer.

Data Considerations

Under the hood, palletjack uses the arcgis.features.FeatureLayer.append() method to upload data. To eliminate the dependency on arcpy (and thus ArcGIS Pro/Enterprise), it uses GeoPandas and pyogrio to save data to a geodatabase, uploads the geodatabase to AGOL, calls .append() using the geodatabase as the source, and then deletes the geodatabase item from AGOL. palletjack tests for all the known gotchas and raises an error if the data needs extra work before uploading. In addition, AGOL imposes its own set of constraints.

Field Names

The column names in your dataframes should match the field names in AGOL one-to-one, with the exception of AGOL's auto-generated fields (shape length/area, editor tracking, etc). You can use DataCleaning.rename_dataframe_columns_for_agol() to handle the majority of field name formatting, but you'll want to double check the results.
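A hedged sketch of that workflow (the exact signature and return value of rename_dataframe_columns_for_agol may differ from what's shown; treat this as illustrative):

from palletjack import transform

#: Illustrative call; verify the actual signature and return value in palletjack.transform
renamed = transform.DataCleaning.rename_dataframe_columns_for_agol(df)

#: Inspect the result and your dataframe's columns to make sure they match the AGOL field names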

Field Types

The upload process is very particular about data types and missing data. FieldChecker.check_live_and_new_field_types_match() contains a mapping of dataframe dtypes to Esri field types. They generally follow what you would expect. However, because pandas (currently) handles missing data with np.nan by default, you may have integer data assigned a float dtype. In addition, some sources render missing data as an empty string, creating an object dtype. Finally, datetimes must be in UTC and stored in the non-timezone-aware datetime64[ns] dtype.

DataCleaning has methods to help convert your data to these dtypes. In addition, it's a good practice to use pandas' nullable dtypes via pd.DataFrame.convert_dtypes() (see also the section on nullable ints).
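As an illustration of the kinds of conversions involved, using plain pandas rather than anything palletjack-specific:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'count': [1.0, np.nan, 3.0],                    #: integer data forced to float64 by the missing value
    'updated': ['2024-01-15', None, '2024-02-01'],  #: dates read in as strings/objects
})

#: convert_dtypes() moves the integral floats to the nullable Int64 dtype
df = df.convert_dtypes()

#: Parse the date strings as UTC, then drop the timezone info to get naive datetime64[ns]
df['updated'] = pd.to_datetime(df['updated'], utc=True).dt.tz_localize(None)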

OBJECTID and Join Keys

If you want to update existing data without truncating and loading, you will need a join key between the incoming new data and the existing AGOL data. Do not use OBJECTID for this field; it may change at any time. Instead, use your own custom field that you have complete control over. You will perform the join manually in the transform step with pandas by loading the live AGOL data into a dataframe, joining the new data into the live data, and then passing the resulting dataframe to FeatureServiceUpdater.update_features(). This method uses the live data's OBJECTID to apply the edits to the proper rows.
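A rough sketch of that join, where updater is a FeatureServiceUpdater instantiated as in the earlier load sketch (the query call is one way to pull the live data; the site_id field is a made-up example of a custom join key):

from arcgis.features import FeatureLayer

#: One way to get the live data, including OBJECTID and your join key, into a dataframe
feature_layer = FeatureLayer(feature_service_layer_url)
live_df = feature_layer.query(out_fields='OBJECTID,site_id', return_geometry=False).sdf

#: Join the new values onto the live rows using your own key
merged_df = live_df.merge(new_df, on='site_id', how='inner')

#: The live OBJECTID carried through the merge tells update_features which rows to edit
updater.update_features(merged_df)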

Error handling

The client is responsible for handling errors and warnings that arise during the process. palletjack will raise its own errors when something occurs that would keep the process from continuing or when one of its internal data checks fails. It will also capture and chain errors from the underlying libraries to include additional, context-specific messages for the user. It will raise warnings when something happens that the client should be aware of but will not keep the process from completing.

Logging

palletjack takes full advantage of python's built-in logging library to perform its feedback and reporting. It employs a hierarchical structure: each module has a module-level logger, and each class in a module creates its own child logger. This allows rapid identification of where log events are occurring.

Accessing palletjack logs

The client can get a reference to the palletjack logger and register handlers, formatters, etc on it alongside its own logger:

import logging

myapp_logger = logging.getLogger('my_app')
myapp_logger.setLevel(logging.INFO)
palletjack_logger = logging.getLogger('palletjack')
palletjack_logger.setLevel(logging.INFO)

#: set up handlers and formatters (a simple console handler shown here)
log_handler = logging.StreamHandler()
log_handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s: %(message)s'))

myapp_logger.addHandler(log_handler)
palletjack_logger.addHandler(log_handler)

Log levels

  • logging.DEBUG includes verbose debug info that should allow you to manually (possibly programmatically) undo or redo any operation.
  • logging.INFO includes standard runtime progress reports and result information.
  • logging.WARNING includes negative results from checks or other situations that the user should be aware of.

Updating from v3 to v4

palletjack v4's biggest breaking change requires you to now instantiate a FeatureServiceUpdater class yourself before calling the appropriate methods. v3 used class methods to handle instantiation for you, but we've broken this out into the more traditional pattern to store the additional information that's common to all the steps.

v4 also completely does away with JSON for uploads and storage. Because of this, you no longer need to worry about projecting to WGS84 or (relatively sane) dataset sizes. In addition, truncate and load now uses a simple boolean flag to save the existing data, and saves it as a file gdb instead of a JSON file.

Updating from v2 to v3

palletjack v3 has several changes from the previous version that users will need to consider when updating existing clients. Version 3 is designed to align with each step in the ETL process and to better follow the single-responsibility principle.

Namespace Changes

The largest change is that the namespace has been refactored to match the ETL steps. loaders.py has been changed to extract.py and updaters.py has been changed to load.py. This eliminates the confusion created by "loaders" being used in the ETL "extract" stage.

As a corollary to this, clients now import each module rather than palletjack exposing the classes directly. The recommended import is from palletjack import extract, transform, load, utils (omitting unused modules as necessary).

Version 3 also introduces the use of class methods to take care of object instantiation for the client. These are used the most in FeatureServiceUpdater, where the client just calls the relevant methods.

One Step at a Time

As part of the refactor, each method generally only tries to do one thing: extract one piece of data, clean one common data error, or perform one load operation.

Previous versions focused on being able to call a single method to do everything. This quickly got unwieldy and led to a lot of internal complexity trying to handle the messiness of real-world data.

Version 3 load methods expect the client to have already cleaned the data and made it ready for uploading. Many of the cleaning steps that were coupled to the load methods in version 2 have been refactored to the palletjack.transform module.

.append instead of .edit_features

Under the hood, version 3 has completely replaced FeatureLayer.edit_features() with FeatureLayer.append() based on the recommendation in the ArcGIS API for Python docs. This has a couple of ramifications for the client. First, in order to avoid the arcpy dependency, all data are converted to geojson for upload. This requires the client to project the dataframes to WGS84/wkid 4326 prior to updating the feature service. Secondly, the client must separate out add, update, and delete operations into individual method calls.

"""A library for extracting tabular/spatial data from many different sources, transforming it to meet ArcGIS Online hosted feature service requirements, and loading it into existing feature services.

.. include:: ../../docs/README.md
"""
import locale

from . import extract, load, transform, utils
from .errors import IntFieldAsFloatError, TimezoneAwareDatetimeError

#: If the locale is not set explicitly, set it to the system default for text to number conversions
if not locale.getlocale(locale.LC_NUMERIC)[0]:
    locale.setlocale(locale.LC_NUMERIC, locale.getlocale())

Sub-modules

palletjack.errors

Errors specific to palletjack

palletjack.extract

Extract tabular/spatial data from various sources into a pandas dataframe …

palletjack.load

Modify existing ArcGIS Online content (mostly hosted feature services). Contains classes for updating hosted feature service data, modifying the …

palletjack.transform

Transform pandas dataframes in preparation for loading to AGOL.

palletjack.utils

Utility classes and methods that are used internally throughout palletjack. Many are exposed publicly in case they are useful elsewhere in a client's …

palletjack.version

A single source of truth for the version in a programmatically-accessible variable. This must only include a single line: version = '0.0.0'