Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exploration: Make Locale configurable for formatting of tags (not to be merged) #538

Open
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

rdvdijk
Copy link

@rdvdijk rdvdijk commented May 21, 2021

As initially discussed in drewnoakes/metadata-extractor-images#35, the current formatting of certain tag descriptions are locale-dependent. The only way to influence the formatting is to set the locale globally, using Locale#setLocale. This can have undesirable effects when integrating metadata-extractor in a larger application.

This pull request explores the idea of making the Locale configurable. A specific Locale is passed to JpegMetadataReader#readMetadata, and all the way down to the Descriptors that need it to format specific values. Two descriptors have been altered to show two specific examples:

  • ExifDescriptorBase : to render the F-number (f/6.0 vs f/6,0)
  • GpsDescriptor : to render the longitude and latitude (54° 59' 22.8" vs 54° 59' 22,8")

See the two added tests in JpegMetadataReaderTest (testConfigurableLocaleEnglish and testConfigurableLocaleDutch).

There is still much left to be done:

  • This pull request only addresses the above-mentioned JPEG descriptors formatting, there are many other cases that can be fixed in a similar way
  • I've left some TODOs with further ideas, I did not want to make the pull request even larger
  • This pull request probably violates the "Keep your PRs short and sweet"-guideline, not sure how to deal with that just yet

I am curious what you think of this, any feedback is welcome.

@Nadahar
Copy link
Contributor

Nadahar commented May 21, 2021

I think this should be handled using the long fabled MetadataContext. Getting this implemented would also open other possibilities, like an updated take on #225.

Even though I've said it before in other places, I'd like to say once again that setting the default Locale isn't a realistic option in most cases. The default Locale is derived from the OS settings and will usually be what you need when presenting numbers, dates, currency and more to the user, and it is also used for case-conversion and setting the UI language. It can't be modified for any application with a user interface that needs this to be set correctly for both formatting and localization.

I edited the comment because I mixed this up with the similar issue of relying on default character encoding, which also causes issues.

Generally, if presenting the information directly to the user, using the default Locale can be what you want. But, as soon as you intend to parse the output further, or that the default Locale is involved in parsing of metadata, it will come down to "luck". I don't know if it is, but remember that the default Locale is used in many places, like every call to String.format(), Sting.toLowerCase() and String.toUpperCase(), not to mention NumberFormatter, DecimalFormatter etc.

@rdvdijk
Copy link
Author

rdvdijk commented May 22, 2021

Thanks for your thoughts, and giving some more context, much appreciated.

I had actually considered creating a Settings class that would contain the Locale, instead of just passing the Locale directly. I decided against that to keep this pull request "to the point". However, such a Settings class is of course very similar to the MetadataContext that was discussed in #225. As was discussed in that pull request, I also made the Locale optional (@Nullable), not to break the current API.

I agree with you that the Locale should be applied in many, many more places throughout the code base. We would have to track down every String operation and Formatter, and pass the Locale to each of those.

What could be a way forward? I am thinking it would be relatively easy to introduce a small MetadataContext interface that supplies the Locale, but nothing more for now. E.g.:

interface MetadataContext {
    Locale getLocale();
}

Future concerns like filtering (#225) could be added to the interface in a separate issue.

However, actually applying the Locale in all places would turn into a huge pull request. Maybe create separate issues to apply the Locale for the separate file types?

@Nadahar
Copy link
Contributor

Nadahar commented May 23, 2021

@rdvdijk I think your plan sounds like a good one. I don't know how much it matters how "big" the PR is when measured in code changes or lines, as I see it, making "one logical change"/feature is as small as it can be. However, it doesn't really matter that much what I think, @drewnoakes is the one that will have to make the decisions.

I don't know if it's preferable to make the "context class", the overloaded API calls and a partial application of Locale or to make it all in one go. The important thing is that it all gets there in the end. The risk is of course, that if you make a PR now with only partial implementation of Locale, "the rest" will never follow.

As I remember it, @drewnoakes wanted to minimize the number of places where such an object would have to be passed on. This is where I didn't see clearly what would be the best solution, as storing it as any kind of "global" introduces potential concurrency traps very fast. I guess that if it could be attached to a "reader" that is used for only a single extraction for example, it would be safe though. I think agreeing on exactly how the "context object" should be made accessible all the places it's needed is essential before a lot of work is invested.

@drewnoakes
Copy link
Owner

This is very interesting. Thanks for picking this up.

I agree with @Nadahar that, given you're going to the effort of threading this through, it'd be worth adding a broader type into which other kinds of state/settings/etc can be added.

I'm not worried about a large PR, so long as the change is clear and consistent and doesn't change the world in some fundamental way. Your idea of breaking it down by file type, for example, makes sense too.

@rdvdijk
Copy link
Author

rdvdijk commented May 28, 2021

I'll introduce the MetadataContext interface in this branch, and create a separate branch to apply it in the JPEG code. That might give us a rough indication on how much work it will be to apply through the entire code base.

@Nadahar wrote above:

As I remember it, @drewnoakes wanted to minimize the number of places where such an object would have to be passed on.

Related to that, I have a concrete implementation question: Shall I add MetadataContext to the TagDescriptor class? Or would the Locale be sufficient there?

Option 1: MetadataContext

public class TagDescriptor<T extends Directory>
{
    @NotNull
    protected final T _directory;
    @NotNull
    protected final MetadataContext _context;

    public TagDescriptor(@NotNull T directory, @NotNull MetadataContext context)
    {
        _directory = directory;
        _context = context;
    }

Option 2: Locale

public class TagDescriptor<T extends Directory>
{
    @NotNull
    protected final T _directory;
    @NotNull
    protected final Locale _locale;

    public TagDescriptor(@NotNull T directory, @NotNull Locale locale)
    {
        _directory = directory;
        _locale = locale;
    }

I am leaning towards Option 2 in this case, but I'm not sure if there is a need for other state/settings/etc here.

@drewnoakes
Copy link
Owner

I'll introduce the MetadataContext interface

Please make it a class. It's easier to version a class over time than an interface. If we add to an interface, it breaks all existing implementations. I don't think this would be an extension point for external code either.

and create a separate branch to apply it in the JPEG code

I'm happy for you to add it in the same branch. It'll be easier to see how it works that way, and change it before it is merged if it needs to.

Shall I add MetadataContext to the TagDescriptor class? Or would the Locale be sufficient there?

I'd suggest adding the whole context. There may be other bits on it we want in future.

@rdvdijk
Copy link
Author

rdvdijk commented May 28, 2021

Thanks for the swift reply. I'll work on this and report back when I think the branch is at a point to discuss some more details.

@rdvdijk
Copy link
Author

rdvdijk commented Jun 25, 2021

Did some work on this. My approach was to start at JpegMetadataReader, and work my way down to all the Readers that were called from there. I passed down the MetadataContext to those Readers, and in turn to their Directory and Descriptor classes. I've tried to find all of the DecimalFormat, String#format and String#toLowerCase/toUppercase calls, and configure those using the locale in the new MetadataContext class.

It's not all done yet, but it's a start. Here is the progress as far as I can tell:

  • JpegMetadataReader.readMetadata
    • Readers[]
      • JpegReader
        • JpegDirectory
        • JpegDescriptor
        • JpegComponent (--> contains two 'simple' String.format calls)
      • JpegCommentReader
        • JpegCommentDirectory
        • JpegCommentDescriptor
      • JfifReader
        • JfifDirectory
      • IccReader
        • IccDirectory
      • JfxxReader
        • JfxxDirectory
      • ExifReader
        • TiffReader
          • exif.ExifTiffHandler
            • AppleMakernoteDirectory
            • CanonMakernoteDirectory
            • CasioType1MakernoteDirectory
            • CasioType2MakernoteDirectory
            • ExifIFD0Directory
            • ExifImageDirectory
            • ExifInteropDirectory
            • ExifSubIFDDirectory
              • ExifSubIFDDescriptor
            • ExifThumbnailDirectory
            • FujifilmMakernoteDirectory
            • GpsDirectory
              • GpsDescriptor
              • GeoLocation
            • KyoceraMakernoteDirectory
            • LeicaMakernoteDirectory
            • LeicaType5MakernoteDirectory
            • NikonType1MakernoteDirectory
            • NikonType2MakernoteDirectory
            • OlympusCameraSettingsMakernoteDirectory
            • OlympusEquipmentMakernoteDirectory
            • OlympusFocusInfoMakernoteDirectory
            • OlympusImageProcessingMakernoteDirectory
            • OlympusMakernoteDirectory
            • OlympusRawDevelopment2MakernoteDirectory
            • OlympusRawDevelopmentMakernoteDirectory
            • OlympusRawInfoMakernoteDirectory
            • PanasonicMakernoteDirectory
            • PanasonicRawIFD0Directory
            • PentaxMakernoteDirectory
            • RicohMakernoteDirectory
            • SamsungType2MakernoteDirectory
            • SanyoMakernoteDirectory
            • SigmaMakernoteDirectory
            • SonyType1MakernoteDirectory
            • SonyType6MakernoteDirectory
      • XmpReader
        • XmpDirectory
        • XmpDescriptor
      • PhotoshopReader
        • PhotoshopDirectory
        • PhotoshopDescriptor
      • DuckyReader
        • DuckyDirectory
      • IptcReader
        • IptcDirectory
        • IptcDescriptor
      • AdobeJpegReader
        • AdobeJpegDirectory
      • JpegDhtReader
        • HuffmanTablesDirectory
        • HuffmanTablesDescriptor
      • JpegDnlReader
        • ErrorDirectory (? --> contains one simple String.format call)

@drewnoakes Can you see if this is going in the right direction?

Things of note:

  • A few tests were added to show that this actually works (GPS coordinates, ICC float array)
  • Some Readers now have (even more) overloaded extract/readMetadata methods, to keep backwards compatibility, but allow the passing of the MetadataContext. You'll find comments there that say TODO document this default context?, I'm interested in your view on that.
  • There are still 63 usages of the TagDescriptor without a MetadataContext, so plenty of work to be done still

@rdvdijk
Copy link
Author

rdvdijk commented Jun 25, 2021

@Nadahar Also curious what you think of the changes so far.

@NotNull
public static Metadata readMetadata(@NotNull InputStream inputStream) throws JpegProcessingException, IOException
{
return readMetadata(inputStream, null);
// TODO document this default context?
Copy link
Author

@rdvdijk rdvdijk Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As explained in the PR discussion: I've added this comment throughout to mark where new MetadataContext() is called, which is done mostly for backwards compatibility.

@@ -55,11 +58,17 @@
// Ev=BV+Sv Sv=log2(ISOSpeedRating/3.125)
// ISO100:Sv=5, ISO200:Sv=6, ISO400:Sv=7, ISO125:Sv=5.32.

// TODO can be removed once context has been added to all sub-classes
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 4 sub-classes of this, that are still on the to-do list, so I expect this constructor to be removed in this PR, eventually.

@@ -59,7 +60,13 @@
{
public ExifTiffHandler(@NotNull Metadata metadata, @Nullable Directory parentDirectory)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor cannot be removed, because it is also used by Png, Tiff and Eps, which are out of scope of this pull request. Eventually it can be removed.

/**
* Locale used to format metadata.
*/
private Locale _locale;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make MetadataContext immutable so that it can be passed around without worrying about threads. Thus, this, and any other fields in this class should be final in my opinion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree with that! Since we only have a Locale for now, I could create a constructor on MetadataContext.

However, using the builder pattern would make this more future proof (albeit more verbose).

@Nadahar
Copy link
Contributor

Nadahar commented Jun 26, 2021

This is quite a lot to look through. Once things I've noticed already is the chosen approach to preserve the existing API. This is only my opinion, and others might not share it, but I would have done it slightly differently.

Instead of requiring the MetadataContext instance in method calls, I would make it @Nullable everywhere, and make the code that use it be prepared that it might be null instead. I would then make sure that as long as it is null no behavior would change at all, the existing code should run as it is when that's the case. I would make overloads of methods (similar to what you've done) where all the existing methods would simply call the corresponding new method with null for MetadataContext.

I don't know if it makes much practical difference, but I prefer not to create objects when they're not needed, and I think using null as "preserve current behavior" would make it easy to make sure that nothing actually changes for those that use it without using MetadataContext.

Here's an example to illustrate what I mean, here is the "pre context version":

public SomeType foo(SomeOtherType bar) {
  ...
}

This is the corresponding "post context version":

public SomeType foo(SomeOtherType bar) {
  return foo(bar, null);
}

public SomeType foo(SomeOtherType bar, @Nullable MetadataContext context) {
  ...
}

Don't make any changes based on my opinion alone though.

@rdvdijk
Copy link
Author

rdvdijk commented Jun 28, 2021

@Nadahar Thanks for the notes, much appreciated.

Instead of requiring the MetadataContext instance in method calls, I would make it @Nullable everywhere, and make the code that use it be prepared that it might be null instead.

I personally disagree with that. I really dislike passing along null everywhere, and would rather avoid that at all costs. (The inventor of "null", Tony Hoare, even called it his "billion-dollar mistake" 😉 )

Allowing null would mean that we'd have to add a null-check in many places throughout the code, and make decisions on what to do when we do not have a context. Having a "sensible default context" made more sense to me.

but I prefer not to create objects when they're not needed

My approach was to create one MetadataContext at the most, just as there is one Metadata instance being passed around. The overhead of such an object instance would be minimal.

@deckerst
Copy link

@rdvdijk I hope this PR will eventually reach maturity and get merged. Is there any blocker?

@rdvdijk
Copy link
Author

rdvdijk commented Jun 16, 2023

@rdvdijk I hope this PR will eventually reach maturity and get merged. Is there any blocker?

I do not think there are any blockers, but discussion here kind of just stalled two years ago.

What do you think, @Nadahar and @drewnoakes ? Should I spend some time to bring this PR back to life?

@rdvdijk
Copy link
Author

rdvdijk commented Oct 19, 2023

Ping @Nadahar and @drewnoakes .

Today we ran into something similar: a difference in system timezone caused unexpected timestamps. This is something that could also be fixed by the changes in this branch, or a subsequent pull request that adds configurable TimeZone support.

I would like to make progress with this PR, but I really need some more feedback if this PR is heading in the right direction. I'd be happy to take another approach if that is preferred.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants