Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor StdDateFormat #1671

Closed
wants to merge 3 commits into from
Closed

Conversation

brenuart
Copy link
Contributor

@brenuart brenuart commented Jun 19, 2017

This PR addresses issues highlighted in DateDeserializationTest test cases + #1667 & #1668

The original version had different code path to handle +hhmm, +hh:mm, Z and not TZ hence producing different and inconsistent results for edge cases (like missing digits or time field).
This new version makes sure the sanitisation process is the same for all cases. For instance, if we detect a Z timezone indicator, it is replaced by +0000 before going through the next steps.

I made sure this new version behaves the same as the original and accepts the same input. Of course it differs for cases that were "wrongly" accepted or refused... The test cases defined in DateDeserilizationTest were very helpful to detect changes in behavior...

Concerning ISO8601, this new version behaves like described by the following BNF:

   date-fullyear   = 4DIGIT
   date-month      = 1*2DIGIT
   date-mday       = 1*2DIGIT
   time-hour       = 1*2DIGIT
   time-minute     = 1*2DIGIT
   time-second     = 1*2DIGIT
   time-secfrac    = "." 1*DIGIT
   time-numoffset  = ("+" / "-") time-hour [[":"] time-minute]
   time-offset     = "Z" / time-numoffset

   partial-time    = time-hour [":" time-minute [":" time-second [time-secfrac]]]
   full-date       = date-fullyear "-" date-month "-" date-mday
   full-time       = partial-time time-offset

   date-time       = full-date "T" full-time

The parser accepts date-time, full-date and partial-time inputs.

Optional Digits
Month, days, hours, minutes and second can be expressed with 1 or 2 digits.

Optional Fields
As described in ISO8601, any number of values may be dropped from any of the date and time representations, but in the order from the least to the most significant.
The parser supports optional hours, minutes and seconds. This means the following forms are now accepted whatever the timezone indicator.

  • 2000-01-02T03:04:05
  • 2000-01-02T03:04
  • 2000-01-02T03
  • 2000-01-02

The original version had support for optional seconds but only if a timezone was present (failed if no timezone).

Fraction of Seconds vs. millis
Cfr. #1668

This new version dropped the millis concept in favour of the more correct fraction of seconds. This means the following forms now produce the correct/expected result:

  • 2000-01-02T03:04:05.1 --> 100 millis
  • 2000-01-02T03:04:05.01 --> 10 millis
  • 2000-01-02T03:04:05.001 --> 1 millis

If more than 3 digits after the dot:

  • 2000-01-02T03:04:05.1234 --> 5 seconds + 0.1234 seconds

Leniency
Leniency is fully supported on all fields except the old "millis" as it is now interpreted as a faction of second. The original version had a hard limit of 2 digits on some fields that somehow reduced the leniency support.
Although not expressed in the BNF above (maybe I should write another for the lenient mode), the following is now accepted:

  • 2000-01-32 --> equivalent to 2000-02-01
  • 2000-01-02T03:04:61 --> equivalent to 2000-01-02T03:05:01

When Leniency is turned off, the parser throws a ParseException when the field is not within the accepted range.

Others
Some other "technical" changes:

  • The error message did not refer to the input string but the version after sanitisation which is potentially different.
  • I managed to refactor part of the code to reduce the amount and the depth of if/then/else conditions and make the code flow easier to follow.
  • Code was duplicated in parse(String)and parse(String, ParsePosition) methods
  • The error index in thrownParseException did not take into account the current position set in the ParsePosition
  • Compared to the previous version, a single StringBuffer is now used through out the sanitisation process reducing the String manipulation overhead

WHAT NOW?
I reviewed the existing test cases so they reflect the new behaviour. Diff against the previous version to see the impact...
I also discovered some of these tests are non-sense, some are now deprecated but some new may have to be written in the light of the proposed BNF. I have not done it yet so the only changes clearly highlight the diff between this version and the previous.

IMHO most of these tests should explicitly test StdDateFormat without Jackson and be hosted in a separated file. The stuff that involve Jackson - like timezone settings, annotations, etc - should be separated and could stay in the current test class.

@cowtowncoder
Copy link
Member

Sounds good; I'll have a look. Forgot to add, that to me combination of improvements is fine, so that:

  1. It's fine to outlaw some types of theoretically valid forms: for example, accepting more than 3 digits for second fractions as JDK Date/Calendar only support millisecond resolution so more digits would need to be discarded anyway (leading to question of truncate vs round)
  2. If some forms are not accepted, exception should indicate failure
  3. Anything accepted should be handled in a sensible way (so if we accept more than 3 fraction digits, make first 3 are used, 4th either discarded or used for rounding)

I assume this is along the ideas you are doing anyway, but thought I'll mention this.


if (looksLikeISO8601(dateStr)) { // also includes "plain"
dt = parseAsISO8601(dateStr, pos, true);
} else {
// Also consider "stringified" simple time stamp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this removed because it is handled at higher level, by deserializer (and not DateFormat implementation)?

Copy link
Contributor Author

@brenuart brenuart Jun 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

}
}

protected int indexOfAnyFromEnd(String s, int startPos, int endPos, char... chars)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like only 2 calls; one with 1 char (which can just use String.lastIndexOf(ch)), another with 2. So maybe write version for second case, to avoid inner loop as well as varargs.


private transient Map<DateFormat, SimpleDateFormat> _clonedFormats = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maps are bit expensive, esp. when key is date format (depending on how equals and hashCode are defined).
What would work better would be EnumMap since that's basically just an array with index. So, create private enum with variants, pass that.

Although alternative could be to just always clone()/re-create instance for every call, and completely do away with recycling.

And yet third possibility would be to use ThreadLocal, with container object. That would probably work best as that would allow reuse across calls, without possibility of race conditions. The only (?) concern would be that of how to clear settings if they are changes. So maybe that doesn't work after all. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum is a good option indeed - I think we could go that way.
ThreadLocal is another I thought off. We could even build something like a DateFormatCache/factory from which we would ask for pre-configured instances. This cache could be used at different places: by the StdDateFormat but maybe also by the DateBasedDeserializer for when it creates Contextual instances. I would love to add something like the baseline vs. local DateFormat instances like what you did here initially. I planned to look at that possibility just after...

@cowtowncoder
Copy link
Member

Ok, I think this would be a solid improvement.

Given amount of change, I think this would make most sense for 2.9; leaving 2.8 as is.
I know handling is sub-optimal, but at the same time I think there is still non-trivial risk for breakage.
Risk is low enough that I think 2.9 is fine (esp. due to improved testing, thank you for the new tests!), but 2.8 patch version is pretty far along.
I am open to being convinced otherwise if you disagree here.

One other thing I was to suggest, was just this: it is probably fine to make use of java.util.regex.Pattern for both matching valid patterns for ISO-8601 and extracting parts, since that would simplify code (matching valid inputs; slicing and dicing parts; give better error messages).
Doing that might even lead to perhaps avoiding use of SimpleDateFormat for some parts, or, maybe even completely? Since we don't really need any actual configurability, but mostly just chopping of numeric values, and passing them to Calendar... well, that could perhaps actually let us fully solve concurrency problems (race condition vs locking).

What do you think? Apologies for adding more work here, but I think that perhaps existing code just should be mostly killed, at least wrt 8601-handling - with - SimpleDateFormat aspects....

@brenuart
Copy link
Contributor Author

brenuart commented Jun 19, 2017

Given amount of change, I think this would make most sense for 2.9; leaving 2.8 as is.

I agree with you.
I was even wondering if we should offer some opt-out option. Say the new version would be the default in 2.9 but you could somehow configure the mapper to use the old one. This could help people if they have serious issue with the new one...
Unfortunately, DateBasedDeserializer optimizes itself when it detects the DateFormat is a StdDateFormat instance. We would have to define a common ancestor for the old and new versions and/or extract some sort of interfaces the DateBasedDeserializer could rely on.

it is probably fine to make use of java.util.regex.Pattern

Maybe... This would certainly make the code more readable - but I'm not sure about performance. We should do some benchmarking to have some figures I guess...

The current code could have been made simpler if we drop support for the optional time fields. I kept that feature because the previous version tried to make both minutes and seconds optional. If that is not a desired feature - then we would have less to parse from the input string and would delegate more to the SimpleDateFormat.

Writing a tolerant parser is not a trivial task... There are so many weird cases to take into account. Honestly, I think we should validate the proposed BNF first. Possibly with additional comments like the actual behavior with +3 secfrac digits... Maybe remove some not so desired flexibility, etc. Once we are fine with it - then we can think about another implementation and the see if we can make it without the SimpleDateFormat...

Apologies for adding more work here

That's fine - I was expecting your comments. It helps me to understand your vision and expectations too... Expect some more commits on this PR after your comments.

@cowtowncoder
Copy link
Member

@brenuart JDK's regex Pattern is pretty efficient, and it easily supports optional fields. I have done some benchmarking (when I wrote https://github.com/salesforce/gorp -- JDK impl was faster than google's re2j for example, esp. for simple cases), so I am not particularly worried when using basic RE features.

I suspect SimpleDateFormat does not use regexps may be since it was included much earlier (JDK 1.0 I think), whereas RE was added in 1.4.

Support for optional fields is, at least with basic regexp grouping, simple to add. Validation then is just ensuring that input string matches, after which fields can be extracted ("missing" optional field is expressed as empty String for matching group).
I don't know how familiar you are with java.util.regex; I can help if need be.

@cowtowncoder
Copy link
Member

One alternate suggestion: would it be possible to refactor new tests into 2 sections:

  1. Tests that pass with existing implementation
  2. Tests that fail with existing implementation

latter would be added under .....failing package, which does not block regular build.
I realized that code added was made to pass by using values currently produced (but wrong); it'd be nice to only have actual real expected values, as that makes it bit easier to see when progress is made.

Plus most importantly we could add these test cases before patching implementation to pass more failing tests.

@cowtowncoder
Copy link
Member

I decided to go bit different route, as per: #1678, and made some changes to tests.
All tests pass in a way that makes sense to me: handling is bit stricter than before, but allows certain things:

  1. seconds are optional (in addition to second fractions aka milliseconds)
  2. 1-3 digit second fractions accepted; longer not. We could accept longer, and define truncate/rounding rule, but I am not sure there's much point (truncate would be easier as there's no need to roll up changes)
  3. timezone offset: colon is optional for timezone offset, as are minutes (but hours must be two-digit)

I will also see if I can rewrite serialization; changes so far checked are for deserialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants