
Releases: ipums/hlink

v3.5.5

31 May 17:30
bd69a9e

What's Changed

  • Support a variable number of columns in the array feature selection transform by @riley-harper in #135

Full Changelog: v3.5.4...v3.5.5

v3.5.4

20 Feb 20:20
da9db20

What's Changed

Full Changelog: v3.5.3...v3.5.4

v3.5.3

02 Nov 15:58
c0f0619

Highlights

This release adds support for Python 3.12 and removes the ceiling on most of our dependency versions to make that possible. It also fixes a bug with one-hot encoding and adds a config validation check that looks for duplicate output columns in some config sections.

What's Changed

  • Refactor to use colorama in a simpler way by @jrbalch543 in #115. User-facing functionality should be unchanged.
  • Add checks for duplicated comparison features, feature selection, and column mappings by @jrbalch543 in #113. This will cause a validation error when there are duplicated aliases or output columns for these sections.
  • Clean up a couple of core modules by @jrbalch543 in #117. These changes are internal refactoring and don't affect functionality.
  • Upgrade dependencies by pinning them more loosely and support Python 3.12 by @riley-harper in #119. This removes the upper limit on almost all of our dependencies so that users can more freely pick versions for themselves. Our tests run on the most recent available versions for each particular version of Python. Unpinning these dependencies also let us add support for Python 3.12, which we now run in CI/CD.
  • Update the docs to include Python 3.12 by @riley-harper in #120
  • Revert to handleInvalid = "keep" for OneHotEncoder by @riley-harper in #121. This fixes a bug we introduced in the last release. Although it's not common, our training data sometimes doesn't cover every category present in the matching data. We would rather silently continue and ignore these cases by giving them a coefficient of 0 than error out on them.
  • Put the config file name in the script prompt by @riley-harper in #123. This is a small quality-of-life feature that makes it easier to remember which config file you're running during long hlink runs.
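The duplicate-output-column validation from #113 can be sketched in plain Python. The function name, error type, and config-dict shape below are illustrative assumptions, not hlink's actual implementation; only the idea (duplicated aliases or output columns in a section fail validation) comes from the release notes.

```python
def check_duplicate_output_columns(sections: dict[str, list[dict]]) -> None:
    """Raise ValueError if any config section defines the same output
    column twice -- a sketch of the validation check added in #113."""
    for section_name, entries in sections.items():
        seen = set()
        for entry in entries:
            # Output column name: an explicit alias if present, else the column itself.
            output = entry.get("alias", entry.get("column_name"))
            if output in seen:
                raise ValueError(
                    f"duplicate output column {output!r} in [{section_name}]"
                )
            seen.add(output)
```

A config with two mappings producing the same output column would now fail validation instead of silently overwriting one with the other.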

Full Changelog: v3.5.2...v3.5.3

v3.5.2

26 Oct 15:21
d51c254

What's Changed

  • Fixed zipping issue in Training step 3 by @jrbalch543 in #104
  • Fix a bug in Training step 3 for categorical features by @jrbalch543 and @riley-harper in #107. Each categorical feature was getting a single coefficient when each category should get its own coefficient instead.
  • Error out on invalid categories in training data instead of creating a new category for them by @riley-harper in #109. This bug fix reduces the number of categories created by hlink by 1. The last category represented missing or invalid data, but these categories were pretty much always unused because hlink creates exhaustive categories whenever possible. Users can still manually mark missing data by creating their own category for it, but hlink will not do this by default anymore. This should help prevent silent errors and confusion with missing data.
  • Fix a bug where categorical features created by interaction caused Training step 3 to crash by @riley-harper in #111
  • Tweak the format of Training step 3's output by @riley-harper in #112. There are now 3 columns: feature_name, category, and coefficient_or_importance. Feature names aren't suffixed with the category value anymore.
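The fix in #107 and the output format from #112 go together: each category of a categorical feature gets its own coefficient, and Training step 3 reports them as (feature_name, category, coefficient_or_importance) rows. A minimal sketch of that pairing, with made-up feature names and values:

```python
def feature_importance_rows(feature_name, categories, coefficients):
    """Pair each category of a categorical feature with its own coefficient,
    yielding (feature_name, category, coefficient_or_importance) rows in the
    shape of Training step 3's output. One row per category -- the bug fixed
    in #107 was collapsing these into a single coefficient per feature."""
    if len(categories) != len(coefficients):
        raise ValueError("need one coefficient per category")
    return [
        (feature_name, category, coef)
        for category, coef in zip(categories, coefficients)
    ]
```

Note that the feature name is no longer suffixed with the category value; the category lives in its own column.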

Full Changelog: v3.5.1...v3.5.2

v3.5.1

23 Oct 20:10
6711c54

What's Changed

  • Implement a new Training step that replaces Model Exploration step 3 by @jrbalch543 and @riley-harper in #101. This new step replaces the broken "get feature importances" step in Model Exploration, which has now been removed. Training step 3 saves model feature importances or coefficients when training.feature_importances is set to true in the config file.
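The `training.feature_importances` key is named above; a minimal config fragment enabling it might look like the following (all other `[training]` keys omitted for brevity):

```toml
[training]
# Save model feature importances or coefficients during Training step 3.
feature_importances = true
```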

Full Changelog: v3.5.0...v3.5.1

v3.5.0

16 Oct 21:00
00b395e

What's Changed

  • Make the CI Dockerfile more flexible and maintainable by @riley-harper in #92. This allowed us to support Python 3.11 and also cleared up some questions about which versions of Java are supported by hlink and pyspark.
  • Support Python 3.11 by @riley-harper in #94. This required upgrading Spark from 3.3 to 3.5. We are now also less strict about the versions of numpy and pandas used.
  • Fix 2 small command-line bugs by @riley-harper in #96. One was a typo in some documentation, and the other was a bug where the autocomplete cache was not reloaded consistently. It is now reloaded after each command.
  • Deprecate the interaction_transformer module by @riley-harper in #97. This is a backport from when we were on Spark 2. Users of hlink should use Spark's pyspark.ml.feature.Interaction class instead. The interaction_transformer module will be removed in the future.
  • Add a new multi_jaro_winkler_search comparison feature by @riley-harper in #99. This is a complex comparison feature that supports conditional Jaro-Winkler comparisons between lists of columns with similar names. You can read more in the documentation at https://hlink.docs.ipums.org/comparison_types.html#multi-jaro-winkler-search.
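multi_jaro_winkler_search builds on the Jaro-Winkler string similarity. As a reference point, here is a minimal pure-Python version of the metric — a sketch, not hlink's actual Scala jw user-defined function. It returns 0.0 for two empty strings, matching the behavior fixed in v3.3.0.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of characters matching within a sliding
    window, penalized by transpositions."""
    if not s1 or not s2:
        return 0.0  # two empty strings score 0.0, not 1.0 (see v3.3.0 fix)
    window = max(len(s1), len(s2)) // 2 - 1
    matched2 = [False] * len(s2)
    m1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched2[j] = True
                m1.append(c)
                break
    if not m1:
        return 0.0
    m2 = [s2[j] for j, used in enumerate(matched2) if used]
    m = len(m1)
    transpositions = sum(a != b for a, b in zip(m1, m2)) / 2
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 chars."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

The classic test pair "MARTHA"/"MARHTA" scores about 0.961: six matches, one transposition, and a three-character shared prefix.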

Full Changelog: v3.4.0...v3.5.0

v3.4.0

09 Aug 21:18
8bec713

What's Changed

New Features and Improvements

  • Add a new convert_ints_to_longs configuration setting by @riley-harper in #87. This configuration setting is especially helpful for reading from CSV files, which don't contain an explicit schema. Documentation for convert_ints_to_longs can be found at https://hlink.docs.ipums.org/config.html#data-sources.
  • Drop the comment column in the hlink script's desc command by @riley-harper in #88. This column was always full of nulls and was just cluttering up the screen.
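As an illustration of the new setting, it might appear in a data source section like this — the alias and file values are made up, and the linked documentation is authoritative on the exact layout:

```toml
[datasource_a]
alias = "a"
file = "people_a.csv"
# CSV files carry no explicit schema, so integer columns may be inferred
# with a type that is too narrow; this converts them to longs.
convert_ints_to_longs = true
```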

Full Changelog: v3.3.1...v3.4.0

v3.3.1

02 Jun 16:44
2392f3d

What's Changed

Bug Fixes

  • Fix categorical variable bug by @anpumn in #82. This fixes issue #81, which caused comparison features to be marked as categorical even when the user set categorical = false in the configuration file.

Developer-Facing Changes

  • Don't reload modules for the reload command by @riley-harper in #78. This removes some old developer-facing functionality for hot-reloading hlink modules. Now the reload command in the hlink script just reloads the config file.

New Contributors

  • @anpumn made their first contribution in #82! 🎉

Full Changelog: v3.3.0...v3.3.1

v3.3.0

13 Dec 18:38
254d358

Overview

This release contains several new features, including separate log files for each run and logging of user input, and it loosens production dependency requirements. It also contains an important bug fix for Jaro-Winkler scores on blank names and many other smaller enhancements.

Changes

  • Started writing to a unique log file for each hlink script run. The name of the log file is "{config_name}-{session_id}.log", where session_id is a UUID uniquely generated for the particular run of the script.
  • Started logging user input in the main loop. This helps give more context to errors and other logging information.
  • Loosened production dependency requirements so that they are not pinned to particular patch versions which may quickly become out of date. Adjusted some development dependency requirements.
  • Fixed a bug where the Scala jw user-defined function returned a similarity of 1.0 for two empty strings. It now returns 0.0.
  • Added syntax highlighting to the TOML example config file in the README (thanks @bollwyvl).
  • Documented some previously undocumented comparison types: not_zero_and_not_equals, present_and_matching_categorical, caution_comp_3_012, caution_comp_4_012, sql_condition, present_and_equal_categorical_in_universe.
  • Updated documentation for a few more comparison types: caution_comp_3, caution_comp_4, not_zero_and_not_equals.
  • Updated the Introduction and Installation documentation pages to make them more reader friendly and helpful.
  • Updated the tutorial in examples/tutorial and added some small datasets so that it can be run for real. It can now be run with the commands
$ cd examples/tutorial
$ python tutorial.py
  • Updated and added type hints for the following classes and modules: Table, LinkRun, LinkTask, LinkStep, linking.util, configs.load_config.
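The per-run log file naming described above can be sketched as follows; `make_log_name` is an illustrative helper name, not hlink's actual function.

```python
import uuid

def make_log_name(config_name: str) -> str:
    """Build a unique per-run log file name, "{config_name}-{session_id}.log",
    where session_id is a UUID generated for this particular run."""
    session_id = uuid.uuid4()
    return f"{config_name}-{session_id}.log"
```

Because the session id is a fresh UUID, two runs against the same config file get distinct log files instead of appending to one shared log.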

Developer-Facing Changes

  • Updated developer instructions for generating the Sphinx docs, adding some more context and tips.
  • Renamed some private functions and methods to use a single leading underscore instead of two leading underscores. This should complete the transition from two leading underscores to one leading underscore.
  • Allowed the Dockerfile to pull the most recent patch version of Python 3.10 for CI instead of pinning to a particular patch version.
  • Moved from setup.py and setup.cfg to pyproject.toml for specifying package metadata. Added and tweaked package metadata for installation and PyPI.
  • Started using the build package for creating an sdist and wheel. Added a step to the CI to run python -m build to generate the sdist and wheel.
  • Moved the declaration of pytest_plugins to a top-level conftest.py file to allow for running tests with just the command pytest. Updated CI and the docs from pytest hlink/tests to just pytest.

v3.2.7

14 Sep 13:57
ab206ff

Overview

This release of hlink contains some bug fixes and maintenance items, along with some tuning of hlink for large datasets. It modifies the hlink.spark.session.SparkConnection class to allow easier adjustment of the spark.driver.memory configuration setting, and it upgrades hlink from Spark 3.2 to 3.3.

Changes

  • Upgraded from Spark 3.2 to 3.3.0. This required only a few internal changes to hlink.
  • Fixed a bug where feature_selections was always required in the config file. Now it defaults to [] as intended.
  • Fixed a bug where an error message in conf_validations wasn't formatted correctly.
  • Added a check to conf_validations to confirm that both data sources contain the id column specified in the config file.
  • Improved the project README.
  • Capped the number of Spark partitions requested at 10,000 to prevent hlink from requesting too many partitions with very large datasets.
  • Added driver memory options to SparkConnection.
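The partition cap can be sketched as a simple heuristic. Only the 10,000 cap comes from the release notes; the records-per-partition constant and function name here are arbitrary illustrations.

```python
MAX_PARTITIONS = 10_000  # cap from v3.2.7, to avoid over-partitioning

def requested_partitions(num_records: int,
                         records_per_partition: int = 50_000) -> int:
    """Choose a Spark partition count from dataset size, capped at 10,000
    so very large datasets don't request an excessive number of partitions."""
    wanted = max(1, num_records // records_per_partition)
    return min(wanted, MAX_PARTITIONS)
```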

Notes

  • Added developer documentation on how to push hlink to PyPI.
  • Cleaned up some old files, and reorganized some test files that were in a confusing place.