Releases: ipums/hlink
v3.5.5
What's Changed
- Support a variable number of columns in the array feature selection transform by @riley-harper in #135
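As an illustration of the new flexibility, a feature selection using the array transform might now list more than two input columns. This is a hypothetical sketch: the key names (`input_columns`, `output_column`) and column names below are assumptions, not verified against the hlink docs.

```toml
[[feature_selections]]
# Hypothetical: combine several name variants into one array column.
# Key names here are assumed, not confirmed against the hlink docs.
transform = "array"
input_columns = ["namefrst_std", "namefrst_orig", "namefrst_alt"]
output_column = "namefrst_array"
```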
Full Changelog: v3.5.4...v3.5.5
v3.5.4
What's Changed
- Document column_mappings transform concat_two_cols by @riley-harper in #126. These new docs are here: https://hlink.docs.ipums.org/column_mappings.html#concat-two-cols.
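The linked docs have the authoritative syntax; as a rough sketch, a `concat_two_cols` mapping might look like the following. The column names are invented, and `column_to_append` is a guess at the option name.

```toml
[[column_mappings]]
column_name = "namefrst"
alias = "namefull"
transforms = [
  # "column_to_append" is an assumed key name -- see the linked docs.
  { type = "concat_two_cols", column_to_append = "namelast" }
]
```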
- Document column mapping overrides by @riley-harper in #129. These let you read two columns with different names from the two input files into a single hlink column. Check out the documentation at https://hlink.docs.ipums.org/column_mappings.html#advanced-usage and the sections that follow.
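For example, if one input file calls a variable `yob` and the other calls it `birth_year`, an override could map both into one hlink column. This sketch assumes the semantics described in the linked docs; the column names are invented.

```toml
[[column_mappings]]
column_name = "birth_year"    # name of the column in datasource B
override_column_a = "yob"     # datasource A uses a different name
```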
- Fix a bug with the `override_column_X` attributes in conf_validations.py by @riley-harper in #131. Previously, config validation raised spurious errors because it didn't take `override_column_a` and `override_column_b` into account.
Full Changelog: v3.5.3...v3.5.4
v3.5.3
Highlights
In this release we start supporting Python 3.12 and remove the ceiling on most of our dependency versions to support this. We also fix a bug with one-hot encoding and add an additional check to config validation that looks for duplicate output columns for some config sections.
What's Changed
- Refactor to use colorama in a simpler way by @jrbalch543 in #115. User-facing functionality should be unchanged.
- Add checks for duplicated comparison features, feature selection, and column mappings by @jrbalch543 in #113. This will cause a validation error when there are duplicated aliases or output columns for these sections.
- Clean up a couple of core modules by @jrbalch543 in #117. These changes are internal refactoring and don't affect functionality.
- Upgrade dependencies by pinning them more loosely and support Python 3.12 by @riley-harper in #119. This removes the upper limit on almost all of our dependencies so that users can more freely pick versions for themselves. Our tests run on the most recent available versions for each particular version of Python. Unpinning these dependencies also let us easily add support for Python 3.12, which we now run in CI/CD.
- Update the docs to include Python 3.12 by @riley-harper in #120
- Revert to `handleInvalid = "keep"` for OneHotEncoder by @riley-harper in #121. This fixes a bug that we introduced in the last release. Although it's not common, our training data sometimes doesn't cover every category present in the matching data. We would rather silently continue and ignore these cases by giving them a coefficient of 0 than error out on them.
- Put the config file name in the script prompt by @riley-harper in #123. This is a small quality-of-life feature that makes it easier to remember which config file you're running during long hlink runs.
Full Changelog: v3.5.2...v3.5.3
v3.5.2
What's Changed
- Fixed zipping issue in Training step 3 by @jrbalch543 in #104
- Fix a bug in Training step 3 for categorical features by @jrbalch543 and @riley-harper in #107. Each categorical feature was getting a single coefficient when each category should get its own coefficient instead.
- Error out on invalid categories in training data instead of creating a new category for them by @riley-harper in #109. This bug fix reduces the number of categories created by hlink by 1. The last category represented missing or invalid data, but these categories were pretty much always unused because hlink creates exhaustive categories whenever possible. Users can still manually mark missing data by creating their own category for it, but hlink will not do this by default anymore. This should help prevent silent errors and confusion with missing data.
- Fix a bug where categorical features created by interaction caused Training step 3 to crash by @riley-harper in #111
- Tweak the format of Training step 3's output by @riley-harper in #112. There are now 3 columns: `feature_name`, `category`, and `coefficient_or_importance`. Feature names aren't suffixed with the category value anymore.
Full Changelog: v3.5.1...v3.5.2
v3.5.1
What's Changed
- Implement a new Training step that replaces Model Exploration step 3 by @jrbalch543 and @riley-harper in #101. This new step replaces the broken "get feature importances" step in Model Exploration, which is now removed. Training step 3 saves model feature importances or coefficients when `training.feature_importances` is set to true in the config file.
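In config-file terms, enabling the new step looks like this. The `feature_importances` setting is named in the release note above; any other keys a real `[training]` section needs are omitted here.

```toml
[training]
# Save model feature importances or coefficients during Training step 3.
feature_importances = true
```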
New Contributors
- @jrbalch543 made their first contribution in #102! 🎉
Full Changelog: v3.5.0...v3.5.1
v3.5.0
What's Changed
- Make the CI Dockerfile more flexible and maintainable by @riley-harper in #92. This allowed us to support Python 3.11 and also cleared up some questions about which versions of Java are supported by hlink and pyspark.
- Support Python 3.11 by @riley-harper in #94. This required upgrading Spark from 3.3 to 3.5. We are now also less strict about the versions of numpy and pandas used.
- Fix 2 small command-line bugs by @riley-harper in #96. One was a typo in some documentation, and the other was a bug where the autocomplete cache was not reloaded consistently. It is now reloaded after each command.
- Deprecate the `interaction_transformer` module by @riley-harper in #97. This is a backport from when we were on Spark 2. Users of hlink should use Spark's `pyspark.ml.feature.Interaction` class instead. The `interaction_transformer` module will be removed in the future.
- Add a new `multi_jaro_winkler_search` comparison feature by @riley-harper in #99. This is a complex comparison feature that supports conditional Jaro-Winkler comparisons between lists of columns with similar names. You can read more in the documentation at https://hlink.docs.ipums.org/comparison_types.html#multi-jaro-winkler-search.
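Since the config schema for this feature is involved, here is only a rough, hypothetical sketch; every key name below is an assumption, so rely on the linked documentation page for the real options.

```toml
[[comparison_features]]
# Hypothetical: search across namefrst_related1..namefrst_related5,
# only counting a match where the corresponding relate{n} columns agree.
# All key names here are assumed, not verified against the docs.
alias = "related_name_match"
comparison_type = "multi_jaro_winkler_search"
num_cols = 5
jw_col_template = "namefrst_related{n}"
jw_threshold = 0.9
equal_and_not_null_templates = ["relate{n}"]
```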
Full Changelog: v3.4.0...v3.5.0
v3.4.0
What's Changed
New Features and Improvements
- Add a new `convert_ints_to_longs` configuration setting by @riley-harper in #87. This configuration setting is especially helpful for reading from CSV files, which don't contain an explicit schema. Documentation for `convert_ints_to_longs` can be found at https://hlink.docs.ipums.org/config.html#data-sources.
- Drop the comment column in the hlink script's `desc` command by @riley-harper in #88. This column was always full of `null`s and was just cluttering up the screen.
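A sketch of how the new setting might be used, assuming it is set per data source; the file path is invented.

```toml
[datasource_a]
file = "data/census_1900.csv"
# Read integer columns as longs, since CSV files carry no schema.
convert_ints_to_longs = true
```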
Documentation Updates
- Add more information to Link Tasks docs page by @riley-harper in #86. See the new and improved page at https://hlink.docs.ipums.org/link_tasks.html!
Developer-Facing Changes
- Pin the Docker image to Debian bullseye by @riley-harper in #84
- Bump the version to 3.4.0 by @riley-harper in #89
Full Changelog: v3.3.1...v3.4.0
v3.3.1
What's Changed
Bug Fixes
- Fix categorical variable bug by @anpumn in #82. This fixes issue #81, which caused comparison features to be marked as categorical even when the user set `categorical = false` in the configuration file.
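To make the fixed behavior concrete, a feature like the one below should never be treated as categorical. The alias, column, and comparison type are illustrative, not taken from a real config.

```toml
[[comparison_features]]
alias = "namelast_jw"
column_name = "namelast"
comparison_type = "jaro_winkler"
# Before the fix in #82, this flag could be ignored.
categorical = false
```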
Documentation Updates
- Update column_mapping_transforms docs page by @riley-harper in #77
- Update docs for present_both_years and neither_are_null by @riley-harper in #79
Developer-Facing Changes
- Don't reload modules for the reload command by @riley-harper in #78. This removes some old developer-facing functionality for hot-reloading hlink modules. Now the `reload` command in the hlink script just reloads the config file.
Full Changelog: v3.3.0...v3.3.1
v3.3.0
Overview
This release contains several new features like separate log files for each run, logging user input, and a loosening of production dependency requirements. It also contains an important bug fix for Jaro-Winkler scores on blank names and many other smaller enhancements.
Changes
- Started writing to a unique log file for each hlink script run. The name of the log file is `{config_name}-{session_id}.log`, where `session_id` is a UUID uniquely generated for the particular run of the script.
- Started logging user input in the main loop. This helps give more context to errors and other logging information.
- Loosened production dependency requirements so that they are not pinned to particular patch versions which may quickly become out of date. Adjusted some development dependency requirements.
- Fixed a bug where the Scala `jw` user-defined function returned a similarity of 1.0 for two empty strings. It now returns 0.0.
- Added syntax highlighting to the TOML example config file in the README (thanks @bollwyvl).
- Documented some previously undocumented comparison types: `not_zero_and_not_equals`, `present_and_matching_categorical`, `caution_comp_3_012`, `caution_comp_4_012`, `sql_condition`, and `present_and_equal_categorical_in_universe`.
- Updated documentation for a few more comparison types: `caution_comp_3`, `caution_comp_4`, and `not_zero_and_not_equals`.
- Updated the Introduction and Installation documentation pages to make them more reader-friendly and helpful.
- Updated the tutorial in examples/tutorial and added some small datasets so that it can be run for real. It can now be run with the commands

  ```
  $ cd examples/tutorial
  $ python tutorial.py
  ```
- Updated and added type hints for the following classes and modules: `Table`, `LinkRun`, `LinkTask`, `LinkStep`, `linking.util`, and `configs.load_config`.
Developer-Facing Changes
- Updated developer instructions for generating the Sphinx docs, adding some more context and tips.
- Renamed some private functions and methods to use a single leading underscore instead of two leading underscores. This should complete the transition from two leading underscores to one leading underscore.
- Allowed the Dockerfile to pull the most recent patch version of Python 3.10 for CI instead of pinning to a particular patch version.
- Moved from setup.py and setup.cfg to pyproject.toml for specifying package metadata. Added and tweaked package metadata for installation and PyPI.
- Started using the `build` package for creating an sdist and wheel. Added a step to the CI to run `python -m build` to generate the sdist and wheel.
- Moved the declaration of `pytest_plugins` to a top-level conftest.py file to allow for running tests with just the command `pytest`. Updated CI and the docs from `pytest hlink/tests` to just `pytest`.
v3.2.7
Overview
This release of hlink contains some bug fixes and maintenance items, along with some tuning of hlink for large datasets. It modifies the `hlink.spark.session.SparkConnection` class to allow easier adjustment of the `spark.driver.memory` configuration setting, and it upgrades hlink from Spark 3.2 to 3.3.
Changes
- Upgraded from Spark 3.2 to 3.3.0. This required only a few internal changes to hlink.
- Fixed a bug where `feature_selections` was always required in the config file. Now it defaults to `[]` as intended.
- Fixed a bug where an error message in `conf_validations` wasn't formatted correctly.
- Added a check to `conf_validations` to confirm that both data sources contain the id column specified in the config file.
- Improved the project README.
- Capped the number of Spark partitions requested at 10,000 to prevent hlink from requesting too many partitions with very large datasets.
- Added driver memory options to `SparkConnection`.
Notes
- Added developer documentation on how to push hlink to PyPI.
- Cleaned up some old files and did some reorganization. Did some work to organize some test files that were in a confusing place.