One of the more subtle aspects of converting (n)varchar or (n)text data to XML is the fact that XML is choosy about which characters are permitted and (n)/varchar/(n)text is not. Any T-SQL programmer who runs conversions of this type is likely to run into this issue. Here's a code block that resolves the issue.
The characters in question are what are commonly called "lower-order ASCII" characters, those below CHAR(32). Of these, only the TAB (CHAR(9)), LF (CHAR(10)), and CR(CHAR(13) are valid within XML. This solution uses trigger code to call a user-defined function to scrub the nvarchar columns, and a loop within the trigger to an ntext column.
Here's the UDF code:
Here's the trigger code:
The trigger code is built to maximize performance in that the NULLIF tests in the UPDATE statement will only run the (relatively expensive) UDF if the inserted and deleted images of a particular column differ (if they don't differ, we can guarantee that the value has already been scrubbed). The UDF and the loop in the trigger for the ntext SupplementDescription column employ the same basic strategy of looping through the source value looking for any invalid character and replacing it with a new character (NCHAR(164)) until the last invalid character is found.
This code was developed for a SQL Server 2000 environment. It would function in a SQL Server 2005 environment, but better performance would likely be had with a CLR-based UDF.
Note also that if you're interested in translating the reserved characters <>& etc., you can do that with a series of nested REPLACE statements.
A recent discussion with several colleagues reminded me of a hard-won insight I've been meaning to share...
Feedback is solicited on a programming issue
A find shared by one friend leads to correspondence from another.. The redoubtable Adam Machanic left
thanks!, worked for me when receiving char #x0001
Instead you widen the valid characters, add this to the header of your xml document!
<?xml version="1.0" encoding="ISO-8859-1"?>
also remove the following characters with this python script...
file = open('C:/where your file is located','r')
myfile = file.readlines()
file = open('C:\root\where you want to save your file\location','w')
for r in myfile:
r = r.replace("\r\n", "");
r = r.replace("\r", "");
r = r.replace("\n", "");
r = r.replace("\\r\\n", "");
r = r.replace("\\r", "");
r = r.replace("\\n", "");
r = r.replace("\u0085", "");
r = r.replace("\u000A", "");
r = r.replace("\u000B", "");
r = r.replace("\u000C", "");
r = r.replace("\u000D", "");
r = r.replace("\u2028", "");
r = r.replace("\u2029", "");
r = r.replace("\\\"", "\\\\\\\"");
Allows for a wider range of valid characters.