The Perils of DateTime.Parse()

Chris Oldwood from The OldWood Thing

The error message was somewhat flummoxing, largely because it was so generic, but also because the data all came from a database extract rather than manual input:

Input string was not in a correct format.

Naturally I looked carefully at all the various decimal and date values, as I knew this was the kind of message you get when parsing those kinds of values and they’re incorrectly formed, but none of them appeared to be at fault. The DateTime error message is actually slightly different [1] but I’d forgotten that at the time and so I eyeballed the dates as well as the decimal values just in case.

Then I remembered that empty string values also caused this error, but lo and behold I was not missing any optional decimals or dates in my table either. Time to hit the debugger and see what was going on here [2].

The Plot Thickens

I changed the settings for the FormatException error type to break on throw, sent in my data to the service, and waited for it to trip. It didn’t take long before the debugger fired into life and I could see that the code was trying to parse a decimal value as a double but the string value was “0100/04/01”, i.e. the 1st April in the year 100. WTF!

I immediately went back to my table and checked my data again, aware that a date like this would have stood out a mile first time around, but I was happy to assume that I could have missed it. This time I used some regular expressions just to be sure my eyes were not deceiving me.

The thing was, I knew which column the parser thought the value was in, but I didn’t entirely trust that I hadn’t mucked up the file structure and added or removed an errant comma in the CSV input file. I didn’t appear to have done that, so the value that seemed to be causing the problem was the decimal number “100.04”, but how?

None of this made any sense and so I decided to debug the client code, right from reading in the CSV data file through to sending it across the wire to the service, to see what was happening. The service was invoked via a fairly simple WCF client assembly and as I stepped into that code I came across a method called NormaliseDate()...

The Mist Clears

What this method did was to attempt to parse the input string value as a date and if it was successful it would rewrite it in an unusual (to me) “universal” format – YYYY/MM/DD [3].

The first two parsing attempts were very specific, i.e. it used DateTime.ParseExact() to match the intended output format and the “sane” local date format of DD/MM/YYYY. So far, so good.

However the third and last attempt, for whatever reason, just used DateTime.Parse() in its no-frills form, and that was happy to take a decimal number like “100.04” and treat it as a date in the format YYY.MM! At first I wondered if it was treating it as a serial or OLE date of some kind, but I think it’s just more liberal in its choice of separators than the author of our method intended [4].
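
To make the behaviour concrete, here is a minimal sketch of what a method like this might look like. Only the name NormaliseDate() and the three-stage parsing strategy are taken from the description above; the culture, the exact format strings and the use of the Try variants are assumptions for illustration.

using System;
using System.Globalization;

static class DateNormaliser
{
  // Hypothetical reconstruction of a NormaliseDate()-style method, based
  // purely on the behaviour described in the text above.
  public static string NormaliseDate(string value)
  {
    var culture = CultureInfo.GetCultureInfo("en-GB"); // assumed locale

    // First attempt: the value is already in the "universal" output format.
    if (DateTime.TryParseExact(value, "yyyy/MM/dd", culture,
                               DateTimeStyles.None, out var date))
      return date.ToString("yyyy/MM/dd", culture);

    // Second attempt: the "sane" local format of DD/MM/YYYY.
    if (DateTime.TryParseExact(value, "dd/MM/yyyy", culture,
                               DateTimeStyles.None, out date))
      return date.ToString("yyyy/MM/dd", culture);

    // Third and final attempt: the no-frills DateTime.Parse() fallback,
    // which is far more liberal about separators - this is how a decimal
    // like "100.04" can end up being rewritten as "0100/04/01".
    if (DateTime.TryParse(value, culture, DateTimeStyles.None, out date))
      return date.ToString("yyyy/MM/dd", culture);

    return value; // not recognisably a date, so leave it alone
  }
}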

Naturally there are no unit tests for this code, or any kind of regression test suite that shows what scenarios this method was intended to support. Due to a lack of knowledge about how the client library was deployed and used in the wild, I was forced to pad the values in the input file with trailing zeroes in the short term to work around the issue, yuck! [5]

JSON Parsers

This isn’t the first time I’ve had a run-in with a date parser. When I was working on REST APIs I always got frustrated by how permissive the JSON parser would be in attempting to coerce a string value into a date (and time). All we ever wanted was to keep it simple and only allow ISO-8601 format timestamps in UTC unless there was a genuine need to support other formats.

Every time I started writing the acceptance tests for timestamp validation, though, I’d find that I could never quite configure the JSON parser to reject everything but the desired format. In the earlier days of my time with ASP.Net even getting it to stop accepting local times was a struggle, and it caused us a real problem when we discovered a US/UK date format confusion error which the parser had been hiding from us.

In the end we resorted to creating our own Iso8601DateTime type which used the .Net DateTimeOffset type under the covers but effectively allowed us to use our own custom JSON serializer methods to support only the exact format we wanted.
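
As an illustration, a wrapper along these lines would do the job. The Iso8601DateTime name comes from the text above, but the struct shape, the format string and the converter wiring below are assumptions rather than our actual implementation.

using System;
using System.Globalization;
using Newtonsoft.Json;

// Sketch of a strict ISO-8601 UTC timestamp wrapper built on DateTimeOffset.
public struct Iso8601DateTime
{
  private const string Format = "yyyy-MM-dd'T'HH:mm:ss'Z'";

  public DateTimeOffset Value { get; }

  public Iso8601DateTime(DateTimeOffset value)
  {
    Value = value;
  }

  public static Iso8601DateTime Parse(string text)
  {
    // ParseExact with a single format rejects local times, offsets and any
    // other "helpful" variations the framework might otherwise accept.
    var value = DateTimeOffset.ParseExact(text, Format,
        CultureInfo.InvariantCulture,
        DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);
    return new Iso8601DateTime(value);
  }

  public override string ToString()
  {
    return Value.ToString(Format, CultureInfo.InvariantCulture);
  }
}

// A matching converter so the serializer round-trips exactly this format.
// Note: DateParseHandling.None is needed in the serializer settings so the
// raw string reaches ReadJson rather than a pre-coerced DateTime.
public class Iso8601DateTimeConverter : JsonConverter
{
  public override bool CanConvert(Type objectType)
  {
    return objectType == typeof(Iso8601DateTime);
  }

  public override object ReadJson(JsonReader reader, Type objectType,
           object existingValue, JsonSerializer serializer)
  {
    return Iso8601DateTime.Parse((string)reader.Value);
  }

  public override void WriteJson(JsonWriter writer, object value,
           JsonSerializer serializer)
  {
    writer.WriteValue(value.ToString());
  }
}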

More recently JSON.Net has gotten better at letting you control the format and parsing of dates, but it’s still not perfect, and there are unit tests in past codebases that show variants that would unexpectedly pass despite using the strictest settings. I wouldn’t be surprised if our Iso8601DateTime type was still in use, as I can only assume everyone else is far less pedantic about the validation of datetimes, and that those who are have taken a similar route to ensure they control the parsing.
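
For reference, the “strictest settings” in question are along these lines; this is a sketch of the knobs JSON.Net exposes rather than a specific recommended configuration.

using Newtonsoft.Json;

// The built-in settings JSON.Net provides for tightening up date handling;
// even with these the odd variant can still slip through, hence the custom
// type described above.
var settings = new JsonSerializerSettings
{
  // Stop the reader silently coercing date-like strings into DateTime.
  DateParseHandling = DateParseHandling.None,
  // The format expected when reading, and used when writing, date values.
  DateFormatString = "yyyy-MM-dd'T'HH:mm:ss'Z'",
  DateTimeZoneHandling = DateTimeZoneHandling.Utc
};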

A Dangerous Game

One should not lose sight, though, of the real issue here, which is the attempt to classify string values by trying to parse them. If you limit yourself to a single locale you might get away with it, but when you try to do it across arbitrary locales you’re just asking for trouble.

 

[1] “String was not recognized as a valid DateTime.”

[2] This whole fiasco falls squarely in the territory I’ve covered before in my Overload article “Terse Exception Messages”. Fixing this went to the top of my backlog, especially after I discovered it was a problem for our users too.

[3] Why they didn’t just pick THE universal format of ISO-8601 is anyone’s guess.

[4] I still need to go back and read the documentation for this method because it clearly caters for scenarios I just don’t come across in my usual locale or user base.

[5] That’s what happens with tactical solutions, no one ever quite gets around to documenting anything because they never think it’ll survive for very long...

Network Saturation

Chris Oldwood from The OldWood Thing

The first indication that we seemed to have a problem was when some of the background processing jobs failed. The support team naturally looked at the log files where the jobs had failed and discovered that the cause was an inability to log in to the database during process start-up. They then tried to log in themselves using SQL Server Management Studio, or to run a simple “SELECT GetDate();” style query via SQLCMD, and hit a similar problem.

Initial Symptoms

With the database appearing to be up the spout they raised a priority 1 ticket with the DBA team to investigate further. Whilst this was going on I started digging around the grid computation services we had built to see if any more light could be shed on what might be happening. This being the Windows Server 2003 era I had to either RDP onto a remote desktop or use PSEXEC to execute remote commands against our app servers. What surprised me was that these were behaving very erratically too.

This now started to look like some kind of network issue and so a ticket was raised with the infrastructure team to find out if they knew what was going on. In the meantime the DBAs came back and said they couldn’t find anything particularly wrong with the database, although the transaction log consumption was much higher than usual at this point.

Closing In

Eventually I managed to remote onto our central logging service [1] and found that the day’s log file was massive by comparison and eating up disk space fast. Tailing the central log file I discovered page upon page of the same error about some internal calculation that had failed on the compute nodes. At this point it was clearly time to pull the emergency cord and shut the whole thing down, as no progress was being made for the business and very little was being made in diagnosing the root of the problem.

With the tap now turned off I was able to easily jump onto a compute node and inspect its log. What I discovered there was that every Monte Carlo simulation of every trade it was trying to value was failing immediately in some set-up calculation. The “best efforts” error handling approach meant that the error was simply logged and the valuation continued with the remaining simulations – rinse and repeat.

Errors at Scale

Of course what compounded the problem was the fact that there were approaching 100 compute nodes all sending any non-diagnostic log messages, i.e. all warnings and errors, across the network to one central service. This service would in turn log any error level messages in the database’s “error log” table.

Consequently, with each compute node failing rapidly (see “Black Hole - The Fail Fast Anti-Pattern”) and flooding the network with thousands of log messages per second, the network eventually became saturated. Those processes which had long-lived network connections (we used a high-performance messaging product for IPC) would continue to receive and generate traffic, albeit slowly, but establishing new connections usually resulted in some form of timeout being hit instead.

The root cause of the compute node set-up calculation failure was later traced back to some bad data which itself had resulted from poor error handling in some earlier initial batch-level calculation.

Points of Failure

This all happened just before Michael Nygard published his excellent book Release It! Some months later when I finally read it I found myself frequently nodding my head as his tales of woe echoed my own experiences.

One of the patterns he talks about in his book is the use of bulkheads to stop failures “jumping the cracks”. On the compute nodes the poor error handling strategy meant that the same error occurred over and over needlessly instead of the node failing once. The use of a circuit breaker could also have mitigated the volume of errors generated and triggered some kind of cooling-off period.
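
As a rough illustration of the latter, a bare-bones circuit breaker might look something like this; the thresholds, timings and API shape are invented for the example rather than taken from the system in question.

using System;

// Minimal circuit breaker sketch: after too many consecutive failures it
// "opens" and short-circuits further attempts until a cooling-off period has
// elapsed, rather than hammering a known-bad dependency and flooding the
// network and logs with the same error.
public class CircuitBreaker
{
  private readonly int _failureThreshold;
  private readonly TimeSpan _coolingOffPeriod;
  private int _consecutiveFailures;
  private DateTime _openedAt;

  public CircuitBreaker(int failureThreshold, TimeSpan coolingOffPeriod)
  {
    _failureThreshold = failureThreshold;
    _coolingOffPeriod = coolingOffPeriod;
  }

  public T Execute<T>(Func<T> operation)
  {
    if (_consecutiveFailures >= _failureThreshold &&
        DateTime.UtcNow - _openedAt < _coolingOffPeriod)
    {
      // Fail immediately while the breaker is open.
      throw new InvalidOperationException("Circuit breaker is open.");
    }

    try
    {
      var result = operation();
      _consecutiveFailures = 0; // success closes the breaker again
      return result;
    }
    catch
    {
      if (++_consecutiveFailures >= _failureThreshold)
        _openedAt = DateTime.UtcNow; // (re)open for another cooling-off period
      throw;
    }
  }
}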

Duplicating the operational log data in the same database as the business data might have been a sane thing to do when the system was tiny and handling manual requests, but as the system became more automated and scaled out this kind of data should have been moved elsewhere where it could be used more effectively.

One of the characteristics of a system like this is that there are a lot of calculations forming a pipeline, so garbage-in, garbage-out means something might not go pop right away but some time later, once the error has compounded. In this instance an error return value of –1 was persisted as if it were normal data instead of being detected. Later stages could do sanity checks on the data to avoid poisoning the whole thing before it’s too late. It should also have been fairly easy to run a dummy calculation on the core inputs before opening the flood gates, to mitigate a catastrophic failure due to bad input data at least.
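
For example, a simple guard between stages could have caught the sentinel before it was persisted; the –1 convention is the one mentioned above, while the method itself is purely illustrative.

using System;

static class StageGuards
{
  // Hypothetical check applied to a stage's output before it is persisted.
  public static double RequireValidResult(double value)
  {
    if (value == -1.0) // the upstream error sentinel masquerading as data
      throw new InvalidOperationException(
          "Upstream calculation returned its error sentinel (-1); " +
          "refusing to persist it as a real result.");

    return value;
  }
}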

Aside from the impedance mismatch in the error handling of different components there was also a disconnect in the error handling in the code that was biased towards one-off trader and support calculations, where the user is present, versus batch processing where the intention is for the system to run unattended. The design of the system needs to take both needs into consideration and adjust the error handling policy as appropriate. (See “The Generation, Management and Handling of Errors” for further ideas.)

Although the system had a monitoring page it only showed the progress of the entire batch – you needed to know the normal processing speed to realise something was up. A dashboard needs a variety of different indicators to show elevated error rates and other anomalous behaviour, ideally with automatic alerting when things start heading south. Before you can do that though you need the data to work from, see “Instrument Everything You Can Afford To”.

The Devil is in the (Non-Functional) Details

Following Gall’s Law to the letter, this particular system had grown over many, many years from a simple ad-hoc calculation tool to a full-blown grid-based compute engine. In the meantime some areas around stability and reliability had been addressed, but ultimately the focus was generally on adding support for more calculation types rather than on operational stability. The non-functional requirements are always the hardest to get buy-in for on an internal system, but without them it can all come crashing down and end in tears over some dodgy inputs.

 

[1] Yes, back then everyone built their own logging libraries and their own equivalents of tools like Splunk.

Stack Overflow With Custom JsonConverter

Chris Oldwood from The OldWood Thing

[There is a Gist on GitHub that contains a minimal working example and summary of this post.]

We recently needed to change our data model so that what was originally a list of one type, became a list of objects of different types with a common base, i.e. our JSON deserialization now needed to deal with polymorphic types.

Naturally we googled the problem to see what support, if any, Newtonsoft’s JSON.Net had. Although it has some built-in support, like many built-in solutions it stores fully qualified type names, which we didn’t want in our JSON; we just wanted simple technology-agnostic type names like “cat” or “dog” that we would be happy to map manually somewhere in our code. We didn’t want to write all the deserialization logic manually, but were happy to give the library a leg-up with the mapping of types.

JsonConverter

Our searching quickly led to the following question on Stack Overflow: “Deserializing polymorphic json classes without type information using json.net”. The lack of type information mentioned in the question refers to the exact .Net type (i.e. name, assembly, version, etc.), and so the answer describes how to do it when you can infer the resulting type from one or more attributes in the data itself. In our case it was a field unsurprisingly called “type” that held a simplified name as described earlier.

The crux of the solution involves creating a JsonConverter and implementing the two methods CanConvert and ReadJson. If we follow that Stack Overflow post’s top answer we end up with an implementation something like this:

public class CustomJsonConverter : JsonConverter
{
  public override bool CanConvert(Type objectType)
  {
    return typeof(BaseType).IsAssignableFrom(objectType);
  }

  public override object ReadJson(JsonReader reader,
           Type objectType, object existingValue,
           JsonSerializer serializer)
  {
    // Read the whole object up front so we can inspect its "type" field...
    JObject item = JObject.Load(reader);

    // ...then deserialize it into the concrete type that field names.
    if (item.Value<string>("type") == "Derived")
    {
      return item.ToObject<DerivedType>();
    }
    else
    . . .
  }
}

This all made perfect sense and even agreed with a couple of other blog posts on the topic that we unearthed. However when we plugged it in we ended up with an infinite loop in the ReadJson method that resulted in a StackOverflowException. Some more googling and checking of the Newtonsoft JSON.Net documentation didn’t point out our “obvious” mistake, and so we resorted to the time-honoured technique of fumbling around with the code to see if we could get this (seemingly promising) solution working.

A Blind Alley

One avenue that appeared to fix the problem was manually adding the JsonConverter to the list of Converters in the JsonSerializerSettings object instead of using the [JsonConverter] attribute on the base class. We went back and forth with some unit tests to prove that this was indeed the solution and even committed this fix to our codebase.
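
For clarity, the two ways of wiring up the converter that we flip-flopped between look like this, using the same placeholder BaseType and CustomJsonConverter names as above.

// Option 1: decorate the base class with the converter attribute.
[JsonConverter(typeof(CustomJsonConverter))]
public abstract class BaseType
{
}

// Option 2: leave the class undecorated and register the converter via the
// serializer settings instead.
var settings = new JsonSerializerSettings();
settings.Converters.Add(new CustomJsonConverter());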

However I was never really satisfied with this outcome and so decided to write this incident up. I started to work through the simplest possible example to illustrate the behaviour, but when I came to reproduce it I found that neither approach worked – attribute or serializer settings – I always got into an infinite loop.

Hence I questioned our original diagnosis and continued to see if there was a more satisfactory answer.

ToObject vs Populate

I went back and re-read the various hits we got with those additional keywords (recursion, infinite loop and stack overflow) to see if we’d missed something along the way. The two main candidates were “Polymorphic JSON Deserialization failing using Json.Net” and “Custom inheritance JsonConverter fails when JsonConverterAttribute is used”. Neither of these explicitly references the answer we initially found and what might be wrong with it – they give a different answer to a slightly different question.

However, these answers suggest deserializing the object in a different way: instead of using ToObject<DerivedType>() to do all the heavy lifting, they suggest creating the uninitialized object yourself and then using Populate() to fill in the details, like this:

{
  JObject item = JObject.Load(reader);

  if (item.Value<string>("type") == "Derived")
  {
    // Create the concrete object ourselves and let the serializer fill in
    // its members; unlike ToObject() this doesn't go back through the
    // converter for the object itself.
    var @object = new DerivedType();
    serializer.Populate(item.CreateReader(), @object);
    return @object;
  }
  else
    . . .
}

Plugging this approach into my minimal example worked, and for both the converter techniques too: attribute and serializer settings.
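
For completeness, the minimal example exercises it along these lines; the JSON payload and the list element type are illustrative, reusing the placeholder BaseType and DerivedType names from earlier.

using System.Collections.Generic;
using Newtonsoft.Json;

var json = @"[ { ""type"": ""Derived"" } ]";

// Works whether the converter is registered via the attribute on BaseType
// or added to the serializer settings, now that Populate() is used.
var settings = new JsonSerializerSettings();
settings.Converters.Add(new CustomJsonConverter());

var items = JsonConvert.DeserializeObject<List<BaseType>>(json, settings);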

Unanswered Questions

So I’ve found another technique that works, which is great, but I still lack closure around the whole affair. For example, how come the answer in the original Stack Overflow question “Deserializing polymorphic json classes” didn’t work for us? That answer has plenty of up-votes and so should be considered pretty reliable. Has there been a change to Newtonsoft’s JSON.Net library that has somehow caused this answer to now break for others? Is there a new bug that we’ve literally only just discovered (we’re using v10)? Why don’t the JSON.Net docs warn against this if it really is an issue, or are we looking in the wrong part of the docs?

As described right at the beginning I’ve published a Gist with my minimal example and added a comment to the Stack Overflow answer with that link so that anyone else on the same journey has some other pieces of the jigsaw to work with. Perhaps over time my comment will also acquire up-votes to help indicate that it’s not so cut-and-dried. Or maybe someone who knows the right answer will spot it and point out where we went wrong.

Ultimately though this is probably a case of not seeing the wood for the trees. It’s so easy when you’re trying to solve one problem to get lost in the accidental complexity and not take a step back. Answers on Stack Overflow generally carry a large degree of gravitas, but they should not be assumed to be infallible. All documentation can go out of date even if there are (seemingly) many eyes watching over it.

When your mind-set is one that always assumes the bugs are of your own making unless the evidence is overwhelming, those times when you might actually not be entirely at fault feel all the more embarrassing when you realise the answer was probably there all along, but you discounted it too early because your train of thought was elsewhere.