Ward Pond's SQL Server blog

Ruminating on issues pertinent to the design and development of sound databases and processes under Microsoft SQL Server 2008, SQL Server 2005, and SQL Server 2000 (while reserving the right to vent about anything else that's on my mind)

A Solution For Stripping Invalid XML Characters From varchar/text Data Structures

A Solution For Stripping Invalid XML Characters From varchar/text Data Structures

  • Comments 6
  • Likes

One of the more subtle aspects of converting (n)varchar or (n)text data to XML is the fact that XML is choosy about which characters are permitted and (n)/varchar/(n)text is not.  Any T-SQL programmer who runs conversions of this type is likely to run into this issue.  Here's a code block that resolves the issue.

The characters in question are what are commonly called "lower-order ASCII" characters, those below CHAR(32).  Of these, only the TAB (CHAR(9)), LF (CHAR(10)), and CR(CHAR(13) are valid within XML.  This solution uses trigger code to call a user-defined function to scrub the nvarchar columns, and a loop within the trigger to an ntext column.

Here's the UDF code:

CREATE FUNCTION fnStripLowAscii (@InputString nvarchar(4000))
RETURNS nvarchar(4000)
AS
BEGIN
IF
@InputString IS NOT NULL
BEGIN
  DECLARE @Counter int, @TestString nvarchar(40)

  SET
@TestString = '%[' + NCHAR(0) + NCHAR(1) + NCHAR(2) + NCHAR(3) + NCHAR(4) + NCHAR(5) + NCHAR(6) +
NCHAR(7) + NCHAR(8) + NCHAR(11) + NCHAR(12) + NCHAR(14) + NCHAR(15) + NCHAR(16) + NCHAR(17) + NCHAR(18) + NCHAR(19) + NCHAR(20) + NCHAR(21) + NCHAR(22) + NCHAR(23) + NCHAR(24) + NCHAR(25) + NCHAR(26) + NCHAR(27) + NCHAR(28) + NCHAR(29) + NCHAR(30) + NCHAR(31) + ']%'

  SELECT
@Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)

  WHILE @Counter <> 0
  BEGIN
    SELECT @InputString = STUFF(@InputString, @Counter, 1, NCHAR(164))
    SELECT @Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)
  END
END
RETURN
(@InputString)
END

Here's the trigger code:

CREATE TRIGGER dbo.trSupplement_IU
ON dbo.Supplement
FOR INSERT, UPDATE
AS
SET
NOCOUNT ON
SET
TEXTSIZE 1048576

-- first do the text column. given the limitations of function calls, this must be done "brute force"

DECLARE
@SupplementId int, @OldSupplementId int, @ptrval binary(16), @Pointer int, @OldPointer int, @Replace nchar(1), @TestString nvarchar(40), @TitleUpdated nvarchar(20), @PathUpdated nvarchar(20), @MDRUpdated nvarchar(20)

SELECT @OldSupplementId = -1, @Replace = NCHAR(164), @TestString = '%[' + NCHAR(0) + NCHAR(1) + NCHAR(2) + NCHAR(3) + NCHAR(4) + NCHAR(5) + NCHAR(6) + NCHAR(7) + NCHAR(8) + NCHAR(11) + NCHAR(12) + NCHAR(14) + NCHAR(15) + NCHAR(16) + NCHAR(17) + NCHAR(18) + NCHAR(19) + NCHAR(20) + NCHAR(21) + NCHAR(22) + NCHAR(23) + NCHAR(24) + NCHAR(25) + NCHAR(26) + NCHAR(27) + NCHAR(28) + NCHAR(29) + NCHAR(30) + NCHAR(31) + ']%'

WHILE EXISTS ( SELECT SupplementId FROM inserted WHERE SupplementId > @OldSupplementId)
BEGIN
  SELECT @SupplementId = MIN(SupplementId) FROM inserted WHERE SupplementId > @OldSupplementId
  -- first check the SupplementDescription..
  IF EXISTS (SELECT SupplementId FROM dbo.Supplement s (nolock) WHERE SupplementId = @SupplementId AND SupplementDescription IS NOT NULL)
  BEGIN
    SET @Pointer = NULL
    SELECT @ptrval = TEXTPTR(SupplementDescription),
          
@Pointer = PATINDEX (@TestString, SupplementDescription COLLATE Latin1_General_BIN)
   
FROM dbo.Supplement (nolock)
   
WHERE SupplementID = @SupplementId

    WHILE (@Pointer <> 0)
    BEGIN
      SET @Pointer = @Pointer - 1
      UPDATETEXT dbo.Supplement.SupplementDescription @ptrval @Pointer 1 @Replace

      SELECT @Pointer = PATINDEX (@TestString, SupplementDescription COLLATE Latin1_General_BIN)
     
FROM dbo.Supplement (nolock)
      
WHERE SupplementID = @SupplementId
    END
  END
END

-- now do the character columns. this we can do with a join and a function call..
UPDATE s
SET SupplementTitle = CASE 
   
NULLIF(i.SupplementTitle, d.SupplementTitle)
       
WHEN NULL THEN s.SupplementTitle
        ELSE dbo.fnStripLowAscii(s.SupplementTitle)
   
END,
    FileAttachmentPath = CASE
    NULLIF(i.FileAttachmentPath, d.FileAttachmentPath)
       
WHEN NULL THEN s.FileAttachmentPath
       
ELSE dbo.fnStripLowAscii(s.FileAttachmentPath)
   
END,
    MachineDataReference = CASE 
    
NULLIF(i.MachineDataReference, d.MachineDataReference)
       
WHEN NULL THEN s.MachineDataReference
       
ELSE dbo.fnStripLowAscii(s.MachineDataReference)
   
END
FROM
inserted i
JOIN dbo.Supplement s
ON i.SupplementId = s.SupplementId
LEFT OUTER JOIN deleted d
ON i.SupplementId = d.SupplementId

The trigger code is built to maximize performance in that the NULLIF tests in the UPDATE statement will only run the (relatively expensive) UDF if the inserted and deleted images of a particular column differ (if they don't differ, we can guarantee that the value has already been scrubbed).  The UDF and the loop in the trigger for the ntext SupplementDescription column employ the same basic strategy of looping through the source value looking for any invalid character and replacing it with a new character (NCHAR(164)) until the last invalid character is found.

This code was developed for a SQL Server 2000 environment.  It would function in a SQL Server 2005 environment, but better performance would likely be had with a CLR-based UDF.

Comments
  • Note also that if you're interested in translating the reserved characters <>& etc., you can do that with a series of nested REPLACE statements.

  • A recent discussion with several colleagues reminded me of a hard-won insight I've been meaning to share...

  • Feedback is solicited on a programming issue

  • A find shared by one friend leads to correspondence from another.. The redoubtable Adam Machanic left

  • thanks!, worked for me when receiving char #x0001

  • Instead you widen the valid characters, add this to the header of your xml document!

    <?xml version="1.0" encoding="ISO-8859-1"?>

    also remove the following characters with this python script...

    def clean():

       file =  open('C:/where your file is located','r')

       myfile = file.readlines()

       file.close()

       file =  open('C:\root\where you want to save your file\location','w')

       for r in myfile:

           r = r.replace("\r\n", "");

           r = r.replace("\r", "");

           r = r.replace("\n", "");

           r = r.replace("\r", "");

           r = r.replace("\\r\\n", "");

           r = r.replace("\\r", "");

           r = r.replace("\\n", "");

           r = r.replace("\u0085", "");

           r = r.replace("\u000A", "");

           r = r.replace("\u000B", "");

           r = r.replace("\u000C", "");

           r = r.replace("\u000D", "");

           r = r.replace("\u2028", "");

           r = r.replace("\u2029", "");

           r = r.replace("\\\"", "\\\\\\\"");

           file.write(r)

       file.close()

    --------------python 3.2

    Allows for a wider range of valid characters.

Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment