Posts Tagged xml

Hexadecimal value 0x is an invalid character


MIT LICENSED

Ever get a System.Xml.XmlException that says:

“Hexadecimal value 0x[whatever] is an invalid character”

…when trying to load an XML document using one of the .NET XML API objects like XmlReader, XmlDocument, or XDocument? Was “0x[whatever]” by chance one of these characters?

0x00

0x01

0x02

0x03

0x04

0x05

0x06

0x07

0x08

0x0B

0x0C

0x0E

0x0F

0x10

0x11

0x12

0x13

0x14

0x15

0x16

0x17

0x18

0x19

0x1A

0x1B

0x1C

0x1D

0x1E

0x1F

The problem that causes these XmlExceptions is that the data being read or loaded contains characters that are illegal according to the XML specification. Almost always, these characters are in the ASCII control character range (think wacky characters like null, bell, backspace, etc.). These characters have no business being in XML data; they should be removed. They usually find their way into the data through file format conversions, like when someone creates an XML file from Excel data, or exports data to XML from a format that may be stored as binary.

The decimal range for ASCII control characters is 0 – 31, plus 127. Or, in hex, 0x00 – 0x1F, plus 0x7F. (The control character 0x7F is not disallowed, but its use is “discouraged” to avoid compatibility issues.) If the string or stream that contains the XML data contains any of the disallowed control characters, an XmlException will be thrown by whatever System.Xml or System.Xml.Linq class (e.g. XmlReader, XmlDocument, XDocument) is trying to load the XML data. As a bonus, if the data contains the bell character ‘\a’ (0x07) and you write it to a console while debugging, your machine may actually beep.

There are a few exceptions though: the formatting characters ‘\n’, ‘\r’, and ‘\t’ are not illegal in XML, per the 1.0 and 1.1 specifications, and therefore do not cause this XmlException. Thus, if you’re encountering XML data that causes an XmlException because the data “contains invalid characters”, the feeds you’re processing need to be sanitized: characters that are illegal per the XML 1.0 specification (which is what System.Xml conforms to, not XML 1.1) should be removed. The methods below will accomplish this:

/// <summary>
/// Remove illegal XML characters from a string.
/// </summary>
public string SanitizeXmlString(string xml)
{
	if (xml == null)
	{
		throw new ArgumentNullException("xml");
	}
	
	StringBuilder buffer = new StringBuilder(xml.Length);
	
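	foreach (char c in xml)
	{
		// Note: each char is one UTF-16 code unit. Surrogate halves
		// (0xD800-0xDFFF) fail the check below, so characters above
		// U+FFFF are dropped along with the illegal ones.
		if (IsLegalXmlChar(c))
		{
			buffer.Append(c);
		}
	}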
		
	return buffer.ToString();
}

/// <summary>
/// Whether a given character is allowed by XML 1.0.
/// </summary>
public bool IsLegalXmlChar(int character)
{
	return
	(
		 character == 0x9 /* == '\t' == 9   */          ||
		 character == 0xA /* == '\n' == 10  */          ||
		 character == 0xD /* == '\r' == 13  */          ||
		(character >= 0x20    && character <= 0xD7FF  ) ||
		(character >= 0xE000  && character <= 0xFFFD  ) ||
		(character >= 0x10000 && character <= 0x10FFFF)
	);
}

Useful as these methods are, don’t go off pasting them into your code just anywhere. Create a class instead. Here’s why: let’s say you use the routine to sanitize a string in one section of code. Then another section of code uses that same string. How does the other section positively know that the string doesn’t contain any control characters anymore, without checking? It doesn’t. Who knows where that string has been (or whether it’s been sanitized) before it gets to a different routine further down the processing pipeline. Program defensively and agnostically. If the sanitized string isn’t a plain string but a distinct type that represents sanitized strings, you can guarantee that its contents include no illegal characters.

Now, if the strings that need to be sanitized are being retrieved from a Stream, via a TextReader, for example, we can create a custom StreamReader class that will skip over illegal characters. Let’s say that you’re retrieving XML like so:

string xml;

using (WebClient downloader = new WebClient())
{
	using (TextReader reader =
		new StreamReader(downloader.OpenRead(uri)))
	{
		xml = reader.ReadToEnd();
	}
}

// Do something with xml...

You could use the sanitizing methods above like this:

string xml;

using (WebClient downloader = new WebClient())
{
	using (TextReader reader =
		new StreamReader(downloader.OpenRead(uri)))
	{
		xml = reader.ReadToEnd();
	}
}

// Sanitize the XML

xml = SanitizeXmlString(xml);

// Do something with xml...

But creating a class that inherits from StreamReader avoids the costly string-building operation performed by SanitizeXmlString() and is much more efficient. The class will have to override a couple of methods, but once it’s finished, a Stream could be consumed and sanitized like this instead:

string xml;

using (WebClient downloader = new WebClient())
{
	using(XmlSanitizingStream reader =
		new XmlSanitizingStream(downloader.OpenRead(uri)))
	{
		xml = reader.ReadToEnd();
	}
}

// xml contains no illegal characters

The declaration for this XmlSanitizingStream, along with the IsLegalXmlChar() method that we’ll need, looks like:

public class XmlSanitizingStream : StreamReader
{
	// Pass 'true' to automatically detect encoding using BOMs.
	// BOMs: http://en.wikipedia.org/wiki/Byte-order_mark

	public XmlSanitizingStream(Stream streamToSanitize)
		: base(streamToSanitize, true)
	{ }

	/// <summary>
	/// Whether a given character is allowed by XML 1.0.
	/// </summary>
	public static bool IsLegalXmlChar(int character)
	{
		return
		(
			 character == 0x9 /* == '\t' == 9   */          ||
			 character == 0xA /* == '\n' == 10  */          ||
			 character == 0xD /* == '\r' == 13  */          ||
			(character >= 0x20    && character <= 0xD7FF  ) ||
			(character >= 0xE000  && character <= 0xFFFD  ) ||
			(character >= 0x10000 && character <= 0x10FFFF)
		);
	}

	// ...

To get this XmlSanitizingStream working correctly, we’ll first need to override two methods integral to the StreamReader: Peek(), and Read(). The Read method should only return legal XML characters, and Peek() should skip past a character if it’s not legal.

	private const int EOF = -1;

	public override int Read()
	{
		// Read each char, skipping ones XML has prohibited

		int nextCharacter;

		do
		{
			// Read a character

			if ((nextCharacter = base.Read()) == EOF)
			{
				// If the char denotes end of file, stop
				break;
			}
		}

		// Skip char if it's illegal, and try the next

		while (!XmlSanitizingStream.
		        IsLegalXmlChar(nextCharacter));

		return nextCharacter;
	}

	public override int Peek()
	{
		// Return next legal XML char w/o reading it 

		int nextCharacter;

		do
		{
			// See what the next character is 
			nextCharacter = base.Peek();
		}
		while
		(
			// If it's illegal, skip over 
			// and try the next.

			!XmlSanitizingStream
			.IsLegalXmlChar(nextCharacter) &&
			(nextCharacter = base.Read()) != EOF
		);

		return nextCharacter;

	}

Next, we’ll need to override the other Read* methods (the buffer-filling Read overload, ReadToEnd, ReadLine, ReadBlock). These all use Peek() and Read() to derive their returns. If they are not overridden, calling them on XmlSanitizingStream will invoke them on the underlying base StreamReader. That StreamReader will then use its own Peek() and Read() logic, not the XmlSanitizingStream’s, resulting in unsanitized characters making their way through.

To make life easy and avoid writing these other Read* methods from scratch, we can disassemble the TextReader class using Reflector, and copy its versions of the other Read* methods, without having to change more than a few lines of code related to ArgumentExceptions.

The complete version of XmlSanitizingStream can be downloaded here. Rename the file extension to “.cs” from “.doc” after downloading.


Serializing Exceptions to XML


Exceptions are fundamental to languages like Java and C#. They’re supposed to make error handling easier than dealing with return codes, which are more common in earlier languages like C or C++. But many times, Exceptions will arise that were unanticipated, and will need to be reviewed by a developer, who may have to change the responsible code. It is therefore common practice to instrument some logging mechanism to record Exceptions where necessary.

Since XML has become so ubiquitous, it is an obvious choice for representing the data structure of an Exception. Any other format like JSON may work just fine, but more people are familiar with XML. Thus, recording an Exception as XML is a common means of capturing unanticipated errors in programs.

But what’s the easiest way to serialize an Exception into XML? If you’ve ever tried using the .NET 2.0 XML serializer—XmlSerializer class in System.Xml.Serialization—you’ll quickly find out that it’s not possible without a workaround. Any object implementing IDictionary, or any object with a member that implements IDictionary (e.g. the property Exception.Data) cannot be serialized. For example, try this in a console application:

new XmlSerializer(typeof(Exception))
    .Serialize(Console.Out, new Exception());

And you’ll see:

An unhandled exception of type ‘System.InvalidOperationException’ occurred in System.Xml.dll
Additional information: There was an error reflecting type ‘System.Exception’.

(Why isn’t IDictionary serializable? Maybe someone knows. Or, JFGI with q=”idictionary+serialize“.)

Even if XmlSerializer could serialize an Exception with its IDictionary Data property, the XML it generates isn’t necessarily what we’d want. Take, for example, this simple class:

public class Message
{
	public string Sender = "Chris";
	public DateTime Timestamp = DateTime.Now;
}

Serialize it to Console.Out using XmlSerializer, and the resulting output is a full-fledged XML document:

<?xml version="1.0" encoding="IBM437"?>
<Message
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Sender>Chris</Sender>
  <Timestamp>2008-09-10T10:14:00.117-07:00</Timestamp>
</Message>

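If the Exception is simply being serialized to be added to an existing document, or we’re just interested in the XML element that would represent the Exception, the declaration and its namespaces are extraneous. Also, what the hell is that encoding?! UTF-8 isn’t good enough for you, XmlSerializer? (To be fair, the serializer takes the encoding from the TextWriter it is handed, and Console.Out reports the console’s code page, which is IBM437 on a typical US-English Windows console.)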

Instead of using XmlSerializer, the new System.Xml.Linq API (new as of .NET 3.5) can be used very easily. By rolling our own serialization method using the new XML API, we can also control what information gets serialized—and what doesn’t. Capture all the important information, and don’t capture any of the unimportant information. For most situations, the “important information” is a short list. It’s typically the Exception’s:

  • Type
  • .Message
  • .StackTrace
  • .InnerException
  • .Data collection

In our XML data of the serialized Exception then, we’ll only include these members as XML elements, and if one of these members is missing from an Exception, instead of writing an empty element (e.g. <Message />), we’ll omit the node to keep the data small and tight.

The XML we’re aiming for should look something like this:

<System.ArgumentException>
	<Message>URI is relative.</Message>
	<StackTrace>
		at ConsoleApp...
	</StackTrace>
</System.ArgumentException>

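Using the new XML API is straightforward. Since we want the root element to be the Exception’s Type, we create an XElement with the Type’s name as its name. (This works as long as the type name is a valid XML element name; names of generic or nested types, which contain characters like ` or +, would be rejected by XElement.)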

XElement root = new XElement(exception.GetType().ToString());

To add the Exception’s Message as a child:

if (exception.Message != null)
{
	root.Add(new XElement("Message", exception.Message));
}

Note that if there are any unsanitary characters (characters that need escaping: <, >, &, ', and ") in exception.Message, the new XElement automatically escapes them.

Next, the StackTrace:

if (exception.StackTrace != null)
{
	root.Add(new XElement("StackTrace", exception.StackTrace));
}

One may wonder, “Why is the StackTrace checked for null?” Here’s why: Exceptions that are thrown will always have a StackTrace, but if an Exception is instantiated but not thrown, the StackTrace has no value (is null):

[Test] // Passes
public void NewExceptionHasNullStackTrace()
{
	Assert.IsNull(new Exception().StackTrace);
}

Now for that pesky Data member whose IDictionary Type is so loathed by XmlSerializer.

// Data is never null; it's empty if there is no data

if (exception.Data.Count > 0)
{
	root.Add
	(
		new XElement("Data",
			from entry in exception.Data.Cast<DictionaryEntry>()
			let key = entry.Key.ToString()
			let value = (entry.Value == null) ?
							"null" : entry.Value.ToString()
			select new XElement(key, value))
	);
}

If there are any items in the Data collection, a Data element is created, and each entry of the collection is added to the Exception’s XML as a child element of Data:

<System.Exception>
	...
	<Data>
		<Uri>/assets/images/logo1.jpg</Uri>
		<StatusCode>404</StatusCode>
	</Data>
	...
</System.Exception>

The last element to add to this serialized Exception is its InnerException. To do this, we’ll recursively run the InnerException through the same process that the original Exception goes through. If you take a look at the full source code at the bottom of this post, you’ll see that the small snippets throughout this post are taken from a class that encapsulates these procedures and strongly types the resulting XML data as an ExceptionXElement, inheriting from XElement. (The omitStackTrace flag in the snippet below is a constructor parameter of that class.)

if (exception.InnerException != null)
{
	root.Add
	(
		new ExceptionXElement
			(exception.InnerException, omitStackTrace)
	);
}

Once we have finished populating our root XElement with subelements, the XML markup can be retrieved using the ToString() method:

Console.WriteLine(root.ToString());

And the output:

<System.ArgumentException>
	<Message>URI is relative.</Message>
	<StackTrace>
		at ConsoleApp...
	</StackTrace>
	<Data>
		<Uri>/assets/images/logo1.jpg</Uri>
	</Data>
</System.ArgumentException>

There you have it. Check out how this is all wrapped up in the full source below.

using System;
using System.Collections;
using System.Linq;
using System.Xml.Linq;

/// <summary>Represent an Exception as XML data.</summary>
public class ExceptionXElement : XElement
{
	/// <summary>Create an instance of ExceptionXElement.</summary>
	/// <param name="exception">The Exception to serialize.</param>
	public ExceptionXElement(Exception exception)
		: this(exception, false)
	{ }

	/// <summary>Create an instance of ExceptionXElement.</summary>
	/// <param name="exception">The Exception to serialize.</param>
	/// <param name="omitStackTrace">
	/// Whether or not to serialize the Exception.StackTrace member
	/// if it's not null.
	/// </param>
	public ExceptionXElement(Exception exception, bool omitStackTrace)
		: base(new Func<XElement>(() =>
		{
			// Validate arguments

			if (exception == null)
			{
				throw new ArgumentNullException("exception");
			}

			// The root element is the Exception's type

			XElement root = new XElement
				(exception.GetType().ToString());

			if (exception.Message != null)
			{
				root.Add(new XElement("Message", exception.Message));
			}

			// StackTrace can be null, e.g.:
			// new ExceptionXElement(new Exception())

			if (!omitStackTrace && exception.StackTrace != null)
			{
				root.Add
				(
					new XElement("StackTrace",
						from frame in exception.StackTrace.Split('\n')
						// Substring(6) strips the leading "   at " of each frame
						let prettierFrame = frame.Substring(6).Trim()
						select new XElement("Frame", prettierFrame))
				);
			}

			// Data is never null; it's empty if there is no data

			if (exception.Data.Count > 0)
			{
				root.Add
				(
					new XElement("Data",
						from entry in
							exception.Data.Cast<DictionaryEntry>()
						let key = entry.Key.ToString()
						let value = (entry.Value == null) ?
							"null" : entry.Value.ToString()
						select new XElement(key, value))
				);
			}

			// Add the InnerException if it exists

			if (exception.InnerException != null)
			{
				root.Add
				(
					new ExceptionXElement
						(exception.InnerException, omitStackTrace)
				);
			}

			return root;
		})())
	{ }
}


Local XMLHttpRequest Debugging


Disconnected XMLHttpRequest Debugging

A recent project has me working closely with the browser XMLHttpRequest object. Like any good little dev, I’ve been writing tests for various Ajax routines. These tests are simple .htm files run from a local directory on my computer; they do not reside on an HTTP server. Consequently, after an XMLHttpRequest has its send() fired, no actual HTTP request is made. Instead, the browser simply reads the file from the hard drive via I/O operations, not over HTTP. During my testing with a local setup, I’ve noticed a few quirks that deserve attention. If you’re a developer also debugging in a disconnected environment, you may find these notes useful.

Just a note before I get started: when I refer to “Mozilla”, I’m referring to Mozilla 5, and cannot speak for other versions.

As another aside, shouldn’t “XMLHttpRequest” have been called simply “HttpRequest”? Think outside the (XML) box, people.

Cross-browser new XMLHttpRequest()

Now, with that out of the way: as many already know, XMLHttpRequest has to be accessed through ActiveX in IE, and there are more than a few different versions of it. To instantiate a new XMLHttpRequest without having to worry about whether the client browser is IE or another browser, I was using the snippet below in the global (window) scope:

if (typeof(XMLHttpRequest) == "undefined") {
	function XMLHttpRequest() {
		try { return new ActiveXObject("MSXML3.XMLHTTP") }catch(e){}
		try { return new ActiveXObject("MSXML2.XMLHTTP.3.0") }catch(e){}
		try { return new ActiveXObject("Msxml2.XMLHTTP") }catch(e){}
		try { return new ActiveXObject("Microsoft.XMLHTTP") }catch(e){}
		return null;
	};
}
// var httpRequest = new XMLHttpRequest();
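
The fallback chain above can also be written as a loop over a list of ProgIDs. This is only a sketch: the function name createFromProgIds and the idea of passing the ActiveX constructor in as an argument are assumptions made for illustration (the argument lets the logic be exercised with a stub outside IE), not part of the snippet above.

```javascript
// Try each ProgID in order and return the first object that can be created.
// progIds:     array of ProgID strings, most preferred first.
// ActiveXCtor: the constructor to call (hypothetical parameter; in IE this
//              would be ActiveXObject, elsewhere a stub for testing).
function createFromProgIds(progIds, ActiveXCtor) {
	for (var i = 0; i < progIds.length; i++) {
		try {
			return new ActiveXCtor(progIds[i]);
		} catch (e) {
			// This version could not be created; try the next one.
		}
	}
	// None of the listed versions is available.
	return null;
}
```

In IE, a call might look like createFromProgIds(["MSXML2.XMLHTTP.3.0", "Msxml2.XMLHTTP", "Microsoft.XMLHTTP"], ActiveXObject).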

One can see that if the browser is IE, the latest commonly distributed version is tried first, then an older one, and so on down the line. An interesting thing happens when you call open() for a local file or the URL “#” on an XMLHttpRequest created with only MSXML3.XMLHTTP: it doesn’t work, and an Error is thrown:

// Throws Error, just like if "asset.xml" was "#"
var httpRequest = new ActiveXObject("MSXML3.XMLHTTP");
httpRequest.open("GET", "asset.xml", true);

// Or any other version from the code block above this one
var httpRequest = new ActiveXObject("MSXML2.XMLHTTP.3.0");
// No Error, responseXML is loaded if asset.xml is valid XML
httpRequest.open("GET", "asset.xml", true);

// Mozilla
var httpRequest = new XMLHttpRequest();
// No Error, responseXML is loaded if asset.xml is valid XML
httpRequest.open("GET", "asset.xml", true);

Local files cannot be accessed using the MSXML3.XMLHTTP library. Don’t waste your time.

XMLHttpRequest.status

To determine if an XMLHttpRequest successfully retrieved a response, code should check for the “done” readyState, 4, and an HTTP response status code, status, of 200, meaning OK:

// Pre-defined constants are used here
// READYSTATE.DONE == 4
// STATUSCODE.OK == 200
if (httpRequest.readyState == READYSTATE.DONE
     && httpRequest.status == STATUSCODE.OK) {
	successCallback(httpRequest);
}

In both Mozilla and IE, however, if the request is for a local file, the status never changes from 0, although the readyState does. Thus, the snippet above becomes:

// STATUSCODE.DEFAULT == 0
if (httpRequest.readyState == READYSTATE.DONE &&
   (httpRequest.status == STATUSCODE.DEFAULT ||
    httpRequest.status == STATUSCODE.OK)) {
	successCallback(httpRequest);
}
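
As a rough sketch, the check above can be pulled into a small function that takes the values as arguments, which makes it easy to exercise outside a browser. The name isRequestSuccessful and the isLocalFile flag are assumptions for illustration. The flag is there because a status of 0 can also show up when a real HTTP request fails, so it seems safer to accept 0 only when the page is known to be running from disk.

```javascript
// Hypothetical constants mirroring the pre-defined ones used above.
var READYSTATE = { DONE: 4 };
var STATUSCODE = { DEFAULT: 0, OK: 200 };

// Returns true when the request has finished and its status looks like success.
// isLocalFile is an assumed flag: pass true when the page is loaded from disk.
function isRequestSuccessful(readyState, status, isLocalFile) {
	if (readyState !== READYSTATE.DONE) {
		return false;
	}
	if (status === STATUSCODE.OK) {
		return true;
	}
	// Local file access never sets a status, so 0 counts as success there only.
	return isLocalFile === true && status === STATUSCODE.DEFAULT;
}
```

In a browser, the flag could be derived from window.location.protocol == "file:".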

Response Headers

Try to get a specific header value such as “content-type” using getResponseHeader("content-type") in Mozilla or IE, and an empty string will be the result. The same goes for getAllResponseHeaders(): nothing but an empty string. That’s logical, of course, since HTTP isn’t actually used.

responseXML’s Value

With Mozilla, whether an XMLHttpRequest grabs a local file or actually makes an HTTP request, if the data it processes is valid XML, the data will be loaded into a DOM object and made accessible in responseXML. If the file or response cannot be loaded into the DOM, responseXML is null:

// Mozilla example 
if (httpRequest.readyState == READYSTATE.DONE &&
   (httpRequest.status == STATUSCODE.DEFAULT ||
    httpRequest.status == STATUSCODE.OK)) {
	
		if (httpRequest.responseXML == null) {
			// Failed; didn't load data into DOM object
			return;
		}
		// Success
		successCallback(httpRequest.responseXML);
}

IE, on the other hand, creates an empty DOM object and places it in the responseXML regardless of what the file or response is. Then, if the response data is valid XML, it will be loaded into the responseXML DOM object. Checking if responseXML is null, therefore, will tell you nothing. What you want to check for is the DOM root. If responseXML.documentElement (the root element) is null, then the DOM object is empty—the XML data was not loaded into it:

// IE-only example
if (httpRequest.readyState == READYSTATE.DONE &&
   (httpRequest.status == STATUSCODE.DEFAULT ||
    httpRequest.status == STATUSCODE.OK)) {

		if (httpRequest.responseXML.documentElement == null) {
			// Failed; didn't load data into DOM object
			return;
		}
		// Success
		successCallback(httpRequest.responseXML);
}
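
The two browser-specific checks above can be folded into one helper. This is a minimal sketch, and the name getLoadedDocument is made up for illustration. It only inspects the object it is given, so it makes no attempt to repair the local-file case in IE.

```javascript
// Returns the DOM object if XML was actually loaded into it, otherwise null.
// Covers both behaviours described above:
//   - Mozilla: responseXML itself is null when loading failed
//   - IE: responseXML is an empty DOM object whose documentElement is null
function getLoadedDocument(responseXML) {
	if (responseXML == null) {
		return null;
	}
	if (responseXML.documentElement == null) {
		return null;
	}
	return responseXML;
}
```

A caller would then test the return value once, e.g. pass it to successCallback when it is not null.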

With local files, IE once again behaves unbecomingly. If the XMLHttpRequest object grabs an “.xml” file or a file with valid XML, the DOM object will always be empty. My guess is that the XMLHttpRequest checks for the “content-type” header and loads the DOM only if the response is “text/xml”. And since there are no headers because no HTTP request is actually made for local files, XMLHttpRequest isn’t smart enough to load the responseXML object. If one is doing local testing and getting a bunk DOM object, simply load the responseText into a DOM object using ActiveX:

// Using...

if (typeof(XMLHttpRequest) == "undefined") {
	function XMLHttpRequest() {
		try { return new ActiveXObject("MSXML3.XMLHTTP") }catch(e){}
		try { return new ActiveXObject("MSXML2.XMLHTTP.3.0") }catch(e){}
		try { return new ActiveXObject("Msxml2.XMLHTTP") }catch(e){}
		try { return new ActiveXObject("Microsoft.XMLHTTP") }catch(e){}
		return null;
	};
}

// Cross-browser, handles IE and Mozilla cases

var httpRequest = new XMLHttpRequest();
httpRequest.open("GET", "assets.xml", true);

httpRequest.onreadystatechange = function() {
	// Wait until the request is done before reading status
	if (httpRequest.readyState != READYSTATE.DONE)
		return;

	var isLocal = (httpRequest.status == STATUSCODE.DEFAULT);

	if (httpRequest.status == STATUSCODE.OK || isLocal) {

		if (httpRequest.responseXML == null)
			// Mozilla failure, local or HTTP
			failure();
		else if (httpRequest.responseXML.documentElement == null) {
			// IE
			if (!isLocal)
				// HTTP failure
				failure();
			else {
				// Local failure--always happens, try
				// using Microsoft.XMLDOM to load
				var xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
				xmlDoc.async = false;
				xmlDoc.loadXML(httpRequest.responseText);

				if (xmlDoc.documentElement != null)
					success(xmlDoc);
				else
					failure();
			}
		}
		else
			success(httpRequest.responseXML);
	}
};

// Nothing is requested until send() is called
httpRequest.send(null);
