Releases: ipums/hlink
v3.5.5
What's Changed
- Support a variable number of columns in the array feature selection transform by @riley-harper in #135
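As an illustration of the new flexibility, a feature selection using the array transform might now list more than two input columns. This is a hypothetical sketch: the key names (`input_columns`, `output_column`) and column names below are assumptions, not verified against the hlink docs.

```toml
[[feature_selections]]
# Hypothetical: combine several name variants into one array column.
# Key names here are assumed, not confirmed against the hlink docs.
transform = "array"
input_columns = ["namefrst_std", "namefrst_orig", "namefrst_alt"]
output_column = "namefrst_array"
```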
Full Changelog: v3.5.4...v3.5.5
v3.5.4
What's Changed
- Document column_mappings transform concat_two_cols by @riley-harper in #126. These new docs are here: https://hlink.docs.ipums.org/column_mappings.html#concat-two-cols.
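The linked docs have the authoritative syntax; as a rough sketch, a `concat_two_cols` mapping might look like the following. The column names are invented, and `column_to_append` is a guess at the option name.

```toml
[[column_mappings]]
column_name = "namefrst"
alias = "namefull"
transforms = [
  # "column_to_append" is an assumed key name -- see the linked docs.
  { type = "concat_two_cols", column_to_append = "namelast" }
]
```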
- Document column mapping overrides by @riley-harper in #129. These let you read two columns with different names from the two input files into a single hlink column. Check out the documentation at https://hlink.docs.ipums.org/column_mappings.html#advanced-usage and the sections that follow.
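For example, if one input file calls a variable `yob` and the other calls it `birth_year`, an override could map both into one hlink column. This sketch assumes the semantics described in the linked docs; the column names are invented.

```toml
[[column_mappings]]
column_name = "birth_year"    # name of the column in datasource B
override_column_a = "yob"     # datasource A uses a different name
```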
- Fix a bug with the `override_column_X` attributes in conf_validations.py by @riley-harper in #131. Previously, config validation raised spurious errors because it didn't take `override_column_a` and `override_column_b` into account.
Full Changelog: v3.5.3...v3.5.4
v3.5.3
Highlights
In this release we start supporting Python 3.12 and remove the ceiling on most of our dependency versions to support this. We also fix a bug with one-hot encoding and add an additional check to config validation that looks for duplicate output columns for some config sections.
What's Changed
- Refactor to use colorama in a simpler way by @jrbalch543 in #115. User-facing functionality should be unchanged.
- Add checks for duplicated comparison features, feature selection, and column mappings by @jrbalch543 in #113. This will cause a validation error when there are duplicated aliases or output columns for these sections.
- Clean up a couple of core modules by @jrbalch543 in #117. These changes are internal refactoring and don't affect functionality.
- Upgrade dependencies by pinning them more loosely and support Python 3.12 by @riley-harper in #119. This removes the upper limit on almost all of our dependencies so that users can more freely pick versions for themselves. Our tests run on the most recent available versions for each particular version of Python. Unpinning these dependencies also let us easily add support for Python 3.12, which we now run in CI/CD.
- Update the docs to include Python 3.12 by @riley-harper in #120
- Revert to `handleInvalid = "keep"` for OneHotEncoder by @riley-harper in #121. This fixes a bug that we introduced in the last release. Although it's not common, our training data sometimes doesn't cover every category present in the matching data. We would rather silently continue and ignore these cases by giving them a coefficient of 0 than error out on them.
- Put the config file name in the script prompt by @riley-harper in #123. This is a small quality-of-life feature that makes it easier to remember which config file you're running during long hlink runs.
Full Changelog: v3.5.2...v3.5.3
v3.5.2
What's Changed
- Fixed zipping issue in Training step 3 by @jrbalch543 in #104
- Fix a bug in Training step 3 for categorical features by @jrbalch543 and @riley-harper in #107. Each categorical feature was getting a single coefficient when each category should get its own coefficient instead.
- Error out on invalid categories in training data instead of creating a new category for them by @riley-harper in #109. This bug fix reduces the number of categories created by hlink by 1. The last category represented missing or invalid data, but these categories were pretty much always unused because hlink creates exhaustive categories whenever possible. Users can still manually mark missing data by creating their own category for it, but hlink will not do this by default anymore. This should help prevent silent errors and confusion with missing data.
- Fix a bug where categorical features created by interaction caused Training step 3 to crash by @riley-harper in #111
- Tweak the format of Training step 3's output by @riley-harper in #112. There are now 3 columns: `feature_name`, `category`, and `coefficient_or_importance`. Feature names aren't suffixed with the category value anymore.
Full Changelog: v3.5.1...v3.5.2
v3.5.1
What's Changed
- Implement a new Training step that replaces Model Exploration step 3 by @jrbalch543 and @riley-harper in #101. This new step replaces the broken "get feature importances" step in Model Exploration, which is now removed. Training step 3 saves model feature importances or coefficients when `training.feature_importances` is set to true in the config file.
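In config-file terms, enabling the new step looks like this. The `feature_importances` setting is named in the release note above; any other keys a real `[training]` section needs are omitted here.

```toml
[training]
# Save model feature importances or coefficients during Training step 3.
feature_importances = true
```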
New Contributors
- @jrbalch543 made their first contribution in #102! 🎉
Full Changelog: v3.5.0...v3.5.1
v3.5.0
What's Changed
- Make the CI Dockerfile more flexible and maintainable by @riley-harper in #92. This allowed us to support Python 3.11 and also cleared up some questions about which versions of Java are supported by hlink and pyspark.
- Support Python 3.11 by @riley-harper in #94. This required upgrading Spark from 3.3 to 3.5. We are now also less strict about the versions of numpy and pandas used.
- Fix 2 small command-line bugs by @riley-harper in #96. One was a typo in some documentation, and the other was a bug where the autocomplete cache was not reloaded consistently. It is now reloaded after each command.
- Deprecate the `interaction_transformer` module by @riley-harper in #97. This is a backport from when we were on Spark 2. Users of hlink should use Spark's `pyspark.ml.feature.Interaction` class instead. The `interaction_transformer` module will be removed in the future.
- Add a new `multi_jaro_winkler_search` comparison feature by @riley-harper in #99. This is a complex comparison feature that supports conditional Jaro-Winkler comparisons between lists of columns with similar names. You can read more in the documentation at https://hlink.docs.ipums.org/comparison_types.html#multi-jaro-winkler-search.
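Since the config schema for this feature is involved, here is only a rough, hypothetical sketch; every key name below is an assumption, so rely on the linked documentation page for the real options.

```toml
[[comparison_features]]
# Hypothetical: search across namefrst_related1..namefrst_related5,
# only counting a match where the corresponding relate{n} columns agree.
# All key names here are assumed, not verified against the docs.
alias = "related_name_match"
comparison_type = "multi_jaro_winkler_search"
num_cols = 5
jw_col_template = "namefrst_related{n}"
jw_threshold = 0.9
equal_and_not_null_templates = ["relate{n}"]
```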
Full Changelog: v3.4.0...v3.5.0
v3.4.0
What's Changed
New Features and Improvements
- Add a new `convert_ints_to_longs` configuration setting by @riley-harper in #87. This configuration setting is especially helpful for reading from CSV files, which don't contain an explicit schema. Documentation for `convert_ints_to_longs` can be found at https://hlink.docs.ipums.org/config.html#data-sources.
- Drop the comment column in the hlink script's `desc` command by @riley-harper in #88. This column was always full of `null`s and was just cluttering up the screen.
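A sketch of how the new setting might be used, assuming it is set per data source; the file path is invented.

```toml
[datasource_a]
file = "data/census_1900.csv"
# Read integer columns as longs, since CSV files carry no schema.
convert_ints_to_longs = true
```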
Documentation Updates
- Add more information to Link Tasks docs page by @riley-harper in #86. See the new and improved page at https://hlink.docs.ipums.org/link_tasks.html!
Developer-Facing Changes
- Pin the Docker image to Debian bullseye by @riley-harper in #84
- Bump the version to 3.4.0 by @riley-harper in #89
Full Changelog: v3.3.1...v3.4.0
v3.3.1
What's Changed
Bug Fixes
- Fix categorical variable bug by @anpumn in #82. This fixes issue #81, which caused comparison features to be marked as categorical even when the user set `categorical = false` in the configuration file.
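To make the fixed behavior concrete, a feature like the one below should never be treated as categorical. The alias, column, and comparison type are illustrative, not taken from a real config.

```toml
[[comparison_features]]
alias = "namelast_jw"
column_name = "namelast"
comparison_type = "jaro_winkler"
# Before the fix in #82, this flag could be ignored.
categorical = false
```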
Documentation Updates
- Update column_mapping_transforms docs page by @riley-harper in #77
- Update docs for present_both_years and neither_are_null by @riley-harper in #79
Developer-Facing Changes
- Don't reload modules for the reload command by @riley-harper in #78. This removes some old developer-facing functionality for hot-reloading hlink modules. Now the `reload` command in the hlink script just reloads the config file.
Full Changelog: v3.3.0...v3.3.1
v3.3.0
Overview
This release contains several new features like separate log files for each run, logging user input, and a loosening of production dependency requirements. It also contains an important bug fix for Jaro-Winkler scores on blank names and many other smaller enhancements.
Changes
- Started writing to a unique log file for each hlink script run. The name of the log file is `{config_name}-{session_id}.log`, where `session_id` is a UUID uniquely generated for the particular run of the script.
- Started logging user input in the main loop. This helps give more context to errors and other logging information.
- Loosened production dependency requirements so that they are not pinned to particular patch versions which may quickly become out of date. Adjusted some development dependency requirements.
- Fixed a bug where the Scala `jw` user-defined function returned a similarity of 1.0 for two empty strings. It now returns 0.0.
- Added syntax highlighting to the TOML example config file in the README (thanks @bollwyvl).
- Documented some previously undocumented comparison types: `not_zero_and_not_equals`, `present_and_matching_categorical`, `caution_comp_3_012`, `caution_comp_4_012`, `sql_condition`, and `present_and_equal_categorical_in_universe`.
- Updated documentation for a few more comparison types: `caution_comp_3`, `caution_comp_4`, and `not_zero_and_not_equals`.
- Updated the Introduction and Installation documentation pages to make them more reader-friendly and helpful.
- Updated the tutorial in examples/tutorial and added some small datasets so that it can be run for real. It can now be run with the commands

  ```
  $ cd examples/tutorial
  $ python tutorial.py
  ```
- Updated and added type hints for the following classes and modules: `Table`, `LinkRun`, `LinkTask`, `LinkStep`, `linking.util`, and `configs.load_config`.
Developer-Facing Changes
- Updated developer instructions for generating the Sphinx docs, adding some more context and tips.
- Renamed some private functions and methods to use a single leading underscore instead of two leading underscores. This should complete the transition from two leading underscores to one leading underscore.
- Allowed the Dockerfile to pull the most recent patch version of Python 3.10 for CI instead of pinning to a particular patch version.
- Moved from setup.py and setup.cfg to pyproject.toml for specifying package metadata. Added and tweaked package metadata for installation and PyPI.
- Started using the `build` package for creating an sdist and wheel. Added a step to the CI to run `python -m build` to generate the sdist and wheel.
- Moved the declaration of `pytest_plugins` to a top-level conftest.py file to allow for running tests with just the command `pytest`. Updated CI and the docs from `pytest hlink/tests` to just `pytest`.
v3.2.7
Overview
This release of hlink contains some bug fixes and maintenance items, along with some tuning of hlink for large datasets. It modifies the `hlink.spark.session.SparkConnection` class to allow easier adjustment of the `spark.driver.memory` configuration setting, and it upgrades hlink from Spark 3.2 to 3.3.
Changes
- Upgraded from Spark 3.2 to 3.3.0. This required only a few internal changes to hlink.
- Fixed a bug where `feature_selections` was always required in the config file. Now it defaults to `[]` as intended.
- Fixed a bug where an error message in `conf_validations` wasn't formatted correctly.
- Added a check to `conf_validations` to confirm that both data sources contain the id column specified in the config file.
- Improved the project README.
- Capped the number of Spark partitions requested at 10,000 to prevent hlink from requesting too many partitions with very large datasets.
- Added driver memory options to `SparkConnection`.
Notes
- Added developer documentation on how to push hlink to PyPI.
- Cleaned up some old files and did some reorganization. Did some work to organize some test files that were in a confusing place.