Using laz-perf as a LAZ compressor? #3074

Open
ryan-salo opened this issue Apr 13, 2022 · 4 comments
Comments

@ryan-salo

I've been looking into storing pointcloud data in TileDB arrays, but one of my hesitations is the larger data volumes relative to LAZ files. I played around with different compressors/levels for coords/attributes, but couldn't get anything close to the original LAZ filesize.

Would it be possible to link in laz-perf and provide laz as an additional compressor?

@stavrospapadopoulos
Member

Hi @ryan-salo, we will soon publish several tutorials on tweaking the TileDB compression for LAZ data. The current defaults are not appropriate, and we are fixing them in the next release, which is imminent.

To achieve even better compression, we are designing a new compressor that will be especially beneficial for the GpsTime field, which is of type double. This field is what currently hurts TileDB relative to LAZ; the rest of the fields compress pretty well with off-the-shelf compressors (like zstd and bzip2). To address this issue, the new compressor:

  1. Sorts on GpsTime within the GpsTime and X, Y and Z tiles (without impacting the rest of the attributes)
  2. Computes and sorts the pairwise XORs of the sorted GpsTime values
  3. Compresses the result with bzip2

In my local experiments, the above achieves massive compression for GpsTime (~10x, versus the 2x we currently achieve with zstd). I believe that will bring TileDB on par with LAZ in terms of data sizes.
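The three steps listed above can be illustrated with a minimal Python sketch. It is not the TileDB compressor: the function names are hypothetical, it handles only a standalone list of GpsTime values (no reordering of X, Y and Z), and it XORs adjacent sorted values without sorting the XORs afterwards, since undoing that second sort would need bookkeeping the thread does not describe.

```python
import bz2
import struct


def xor_delta_compress(gps_times):
    """Sort the doubles, XOR adjacent bit patterns, then bzip2 the result."""
    ordered = sorted(gps_times)
    n = len(ordered)
    # Reinterpret each 64-bit double as an unsigned 64-bit integer.
    bits = struct.unpack(f"<{n}Q", struct.pack(f"<{n}d", *ordered))
    # Keep the first value as-is, then store each value XORed with its predecessor.
    xors = list(bits[:1]) + [a ^ b for a, b in zip(bits, bits[1:])]
    return bz2.compress(struct.pack(f"<{n}Q", *xors))


def xor_delta_decompress(blob):
    """Invert xor_delta_compress; returns the values in sorted order."""
    raw = bz2.decompress(blob)
    n = len(raw) // 8
    xors = struct.unpack(f"<{n}Q", raw)
    bits, prev = [], 0
    for x in xors:
        prev ^= x  # running XOR restores the original bit pattern
        bits.append(prev)
    return list(struct.unpack(f"<{n}d", struct.pack(f"<{n}Q", *bits)))
```

Neighbouring sorted timestamps tend to share their high-order bits, so the XORs contain long runs of zeros, which is the kind of input a general-purpose compressor handles well. The sketch makes no claim about the ratio it reaches on real data.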

The reason why we don't use laz-perf off-the-shelf is that TileDB is a columnar format (like Parquet) and stores the values of each field/attribute in separate files. If we coalesced the fields, then we would hinder the ability to rapidly subselect on a subset of the fields, so performance would be impacted significantly. I believe that the new compressor we are working on will achieve the desired compression ratio.
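To make the columnar argument concrete, here is a toy sketch in which every field is compressed into its own buffer, so reading one field never touches the others. The `write_columns` and `read_column` helpers are hypothetical, and the dict of bzip2 buffers only stands in for per-attribute files; it is not TileDB's on-disk format.

```python
import bz2
import struct


def write_columns(points):
    """Compress each field independently; `points` maps field name -> list of floats."""
    return {
        name: bz2.compress(struct.pack(f"<{len(values)}d", *values))
        for name, values in points.items()
    }


def read_column(store, name):
    """Decompress a single field without touching any other field's buffer."""
    raw = bz2.decompress(store[name])
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


store = write_columns({
    "X": [1.0, 2.0, 3.0],
    "Y": [4.0, 5.0, 6.0],
    "GpsTime": [100.5, 100.75, 101.0],
})
gps = read_column(store, "GpsTime")  # X and Y stay compressed
```

A codec that interleaves all fields of a point into one stream would have to decode every field to recover any one of them, which is the trade-off described above.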

I'll keep you posted on progress on this issue. Thanks for reaching out!

@ryan-salo
Author

Thanks for the response @stavrospapadopoulos! I'll keep my eyes on this repo for the next release. Sounds like some good improvements are coming!

@ryan-salo
Author

Just noticed the floating scaling compressor in the latest release, 2.11. Any thoughts on whether this would improve pointcloud storage/compression?

@stavrospapadopoulos
Member

Hi @ryan-salo, it probably will for X, Y and Z. Please stay tuned though: we are working on another compressor that will further improve pointcloud storage (specifically the GpsTime field). We'll experiment with all the new compressors and select the best defaults in our PDAL ingestor.
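For readers unfamiliar with the idea, the general technique of scaling floating-point coordinates to integers can be sketched as follows. This is plain Python with hypothetical helper names and made-up sample values, not the TileDB filter or its API, and the `scale` and `offset` choices are assumptions for illustration only.

```python
def scale_to_ints(values, scale, offset):
    """Quantize floats to integers: q = round((v - offset) / scale)."""
    return [round((v - offset) / scale) for v in values]


def unscale(ints, scale, offset):
    """Approximate inverse: v is recovered to within scale / 2."""
    return [offset + q * scale for q in ints]


xs = [100.01, 100.02, 100.05]  # illustrative coordinates with centimetre precision
quantized = scale_to_ints(xs, scale=0.01, offset=100.0)
restored = unscale(quantized, scale=0.01, offset=100.0)
```

The small integers produced this way usually compress better than raw doubles, at the cost of precision below the chosen scale. LAS/LAZ files store X, Y and Z as scaled integers in a similar way.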


2 participants