How to Retrieve a Very Long Xml-String from an SQL Database with R

How to retrieve a very long XML-string from an SQL database with R?

I've written to the package author and have finally received the following answer:

Your inability to read is not my problem, nor is it a reasonable excuse.
The manual says
'\item[Character types] Character types can be classified three ways:
fixed or variable length, by the maximum size and by the character

set used. The most commonly used types\footnote{the SQL names for

these are \code{CHARACTER VARYING} and \code{CHARACTER}, but these

are too cumbersome for routine use.} are \code{varchar} for short

strings of variable length (up to some maximum) and \code{char} for

short strings of fixed length (usually right-padded with spaces).

The value of `short' differs by DBMS and is at least 254, often a

few thousand---often other types will be available for longer

character strings. There is a sanity check which will allow only

strings of up to 65535 bytes when reading: this can be removed by

recompiling \pkg{RODBC}.'

This manual can be found in the doc directory of the RODBC package. This information is not contained within the reference manual.

As in the meantime I've found a good solution to retrieve my data without using RODBC, I haven't tried to recompile this package. But I hope this answer will be helpful for those having trouble with the same issue.

XML string too large to download with dyplr & dbgetquery(). Is there a way to parse XML without downloading the XML text?

UPDATE: I was able to use substring in my query to create multiple columns with subsets of the XML string. I then ran the query with dyplr and downloaded the results. Next, I used the unite function in tidyr to concatenate the strings to a complete XML string. Finally, I was then able to parse the XML with R's read_xml.
My query looked something like this:

Select 
  (substr(xmlText,1,30000))'xmlText1',
  (substr(xmlText,30001,30000))'xmlText2',
  (substr(xmlText,60001,30000))'xmlText3',
  (substr(xmlText,90001,30000))'xmlText4',
  (substr(xmlText,120001,30000))'xmlText5'
From
  myXMLtables

Export XML data from SQL Server to R

we can use:

library(xml2)
df <- read_xml(x) %>% as_list %>% sapply(rbind) %>% as.data.frame

df
# TotalPurchaseYTD DateFirstPurchase   BirthDate MaritalStatus YearlyIncome Gender TotalChildren NumberChildrenAtHome Education Occupation HomeOwnerFlag NumberCarsOwned CommuteDistance
# 1              -31       2003-11-01Z 1962-07-26Z             S      0-25000      M             1                    0 1 Graduate Degree     Manual             0               0       0-1 Miles

Extracting data from xml into SQL server table

Here's a super-duper hack that's not limited to a set number of "columns":

-- Data mock-up.
DECLARE @value VARCHAR(500) = '[My company customer detail - Account ID <3116131311616116>, Subscriber Name <Jon>, Age <52>, Phone <>, Payment<CC>]';

-- Construct the dynamic SQL field/value list.
DECLARE @sql VARCHAR(MAX);
SELECT @sql = ISNULL( @sql, '' ) 
    + ', ' + QUOTENAME( SUBSTRING( value, CHARINDEX( '<', value ) + 1, CHARINDEX( '>', value ) - CHARINDEX( '<', value ) - 1 ), '''' )
    + ' AS [' + LTRIM( RTRIM( LEFT( value, CHARINDEX( '<', value ) - 1 ) ) ) + ']'
FROM STRING_SPLIT( REPLACE( REPLACE( @value, ']', '' ), '[', '' ), ',' )

-- Complete the dynamic SQL.
SET @sql = 'SELECT ' + STUFF( @sql, 1, 2, '' ) + ';';

-- Print the resulting dynamic SQL.
PRINT @sql;

-- Execute the dynamic SQL.
EXEC( @sql );

PRINT displays:

SELECT '3116131311616116' AS [My company customer detail - Account ID], 'Jon' AS [Subscriber Name], '52' AS [Age], '' AS [Phone], 'CC' AS [Payment];

EXEC returns:

+-----------------------------------------+-----------------+-----+-------+---------+
| My company customer detail - Account ID | Subscriber Name | Age | Phone | Payment |
+-----------------------------------------+-----------------+-----+-------+---------+
|                        3116131311616116 | Jon             |  52 |       | CC      |
+-----------------------------------------+-----------------+-----+-------+---------+

Read an XML file in R and create a dataframe when nodes have different lengths and content

The variable number of "BLOC_COMPETENCES" does complicate the problem. In the script below, parsed out all of the fiche nodes then loop through each node to retrieve the desired codes and libelles. Also needed to check for zero length codes and libelles. Since the list of fiche is very long, the lapply statement will take awhile to complete.

library(xml2)
library(dplyr)

#read the document
page <- read_xml("export_fiches_RNCP_2020-08-03.xml")

#read all fiches nodes
fiches <- xml_find_all(page, "//FICHE")

#parse each fiches
dfs <-lapply(fiches, function(node){
   id <- node %>% xml_find_first(".//ID_FICHE")  %>% xml_text()
   codes <-  node %>% xml_find_all(".//BLOC_COMPETENCES/CODE") %>% xml_text()
   libelles <- node %>% xml_find_all(".//BLOC_COMPETENCES/LIBELLE")%>% xml_text()
   #correct for codes which don't exist
   if (length(codes) <1 ) {codes = NA}
   if (length(libelles) <1 ) {libelles = NA}

   df<- data.frame(id, codes, libelles, stringsAsFactors = FALSE)
})

#merge all of the data frames
answer <- bind_rows(dfs)

R: convert XML data to data frame

It may not be as verbose as the XML package but xml2 doesn't have the memory leaks and is laser-focused on data extraction. I use trimws which is a really recent addition to R core.

library(xml2)

pg <- read_xml("http://www.ggobi.org/book/data/olive.xml")

# get all the <record>s
recs <- xml_find_all(pg, "//record")

# extract and clean all the columns
vals <- trimws(xml_text(recs))

# extract and clean (if needed) the area names
labs <- trimws(xml_attr(recs, "label"))

# mine the column names from the two variable descriptions
# this XPath construct lets us grab either the <categ…> or <real…> tags
# and then grabs the 'name' attribute of them
cols <- xml_attr(xml_find_all(pg, "//data/variables/*[self::categoricalvariable or
                                                      self::realvariable]"), "name")

# this converts each set of <record> columns to a data frame
# after first converting each row to numeric and assigning
# names to each column (making it easier to do the matrix to data frame conv)
dat <- do.call(rbind, lapply(strsplit(vals, "\ +"),
                                 function(x) {
                                   data.frame(rbind(setNames(as.numeric(x),cols)))
                                 }))

# then assign the area name column to the data frame
dat$area_name <- labs

head(dat)
##   region area palmitic palmitoleic stearic oleic linoleic linolenic
## 1      1    1     1075          75     226  7823      672        NA
## 2      1    1     1088          73     224  7709      781        31
## 3      1    1      911          54     246  8113      549        31
## 4      1    1      966          57     240  7952      619        50
## 5      1    1     1051          67     259  7771      672        50
## 6      1    1      911          49     268  7924      678        51
##   arachidic eicosenoic    area_name
## 1        60         29 North-Apulia
## 2        61         29 North-Apulia
## 3        63         29 North-Apulia
## 4        78         35 North-Apulia
## 5        80         46 North-Apulia
## 6        70         44 North-Apulia

UPDATE

I'd prbly do the last bit this way now:

library(tidyverse)

strsplit(vals, "[[:space:]]+") %>% 
  map_df(~as_data_frame(as.list(setNames(., cols)))) %>% 
  mutate(area_name=labs)

Convert all XML Special Characters back to Regular Characters (Within SQL)

Update: leaving my previous solution available below but came up with a better one based on what Jeremy Posted.

New solution:

DECLARE @xml XML = 'abc & xyz ><';

SELECT newstring = ((SELECT @xml FOR XML PATH(''), TYPE).value('.', 'varchar(8000)'));

Returns:

abc & xyz ><

OLD SOLUTION (still viable):

I have a couple functions for this type of thing. First you need rangeAB and CharMapAB

RangeAB

CREATE FUNCTION dbo.rangeAB
(
  @low  bigint, 
  @high bigint, 
  @gap  bigint,
  @row1 bit
)
/****************************************************************************************
[Purpose]:
 Creates up to 531,441,000,000 sequentia integers numbers beginning with @low and ending 
 with @high. Used to replace iterative methods such as loops, cursors and recursive CTEs 
 to solve SQL problems. Based on Itzik Ben-Gan's getnums function with some tweeks and 
 enhancements and added functionality. The logic for getting rn to begin at 0 or 1 is 
 based comes from Jeff Moden's fnTally function. 

 The name range because it's similar to clojure's range function. The name "rangeAB" as 
 used because "range" is a reserved SQL keyword.

[Author]: Alan Burstein

[Compatibility]: 
 SQL Server 2008+ and Azure SQL Database

[Syntax]:
 SELECT r.RN, r.OP, r.N1, r.N2
 FROM dbo.rangeAB(@low,@high,@gap,@row1) AS r;

[Parameters]:
 @low  = a bigint that represents the lowest value for n1.
 @high = a bigint that represents the highest value for n1.
 @gap  = a bigint that represents how much n1 and n2 will increase each row; @gap also
         represents the difference between n1 and n2.
 @row1 = a bit that represents the first value of rn. When @row = 0 then rn begins
         at 0, when @row = 1 then rn will begin at 1.

[Returns]:
 Inline Table Valued Function returns:
 rn = bigint; a row number that works just like T-SQL ROW_NUMBER() except that it can 
      start at 0 or 1 which is dictated by @row1.
 op = bigint; returns the "opposite number that relates to rn. When rn begins with 0 and
      ends with 10 then 10 is the opposite of 0, 9 the opposite of 1, etc. When rn begins
      with 1 and ends with 5 then 1 is the opposite of 5, 2 the opposite of 4, etc...
 n1 = bigint; a sequential number starting at the value of @low and incrimentingby the
      value of @gap until it is less than or equal to the value of @high.
 n2 = bigint; a sequential number starting at the value of @low+@gap and  incrimenting 
      by the value of @gap.

[Dependencies]:
N/A

[Developer Notes]:

 1. The lowest and highest possible numbers returned are whatever is allowable by a 
    bigint. The function, however, returns no more than 531,441,000,000 rows (8100^3). 
 2. @gap does not affect rn, rn will begin at @row1 and increase by 1 until the last row
    unless its used in a query where a filter is applied to rn.
 3. @gap must be greater than 0 or the function will not return any rows.
 4. Keep in mind that when @row1 is 0 then the highest row-number will be the number of
    rows returned minus 1
 5. If you only need is a sequential set beginning at 0 or 1 then, for best performance
    use the RN column. Use N1 and/or N2 when you need to begin your sequence at any 
    number other than 0 or 1 or if you need a gap between your sequence of numbers. 
 6. Although @gap is a bigint it must be a positive integer or the function will
    not return any rows.
 7. The function will not return any rows when one of the following conditions are true:
      * any of the input parameters are NULL
      * @high is less than @low 
      * @gap is not greater than 0
    To force the function to return all NULLs instead of not returning anything you can
    add the following code to the end of the query:

      UNION ALL 
      SELECT NULL, NULL, NULL, NULL
      WHERE NOT (@high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0)

    This code was excluded as it adds a ~5% performance penalty.
 8. There is no performance penalty for sorting by rn ASC; there is a large performance 
    penalty for sorting in descending order WHEN @row1 = 1; WHEN @row1 = 0
    If you need a descending sort the use op in place of rn then sort by rn ASC. 

Best Practices:
--===== 1. Using RN (rownumber)
 -- (1.1) The best way to get the numbers 1,2,3...@high (e.g. 1 to 5):
 SELECT RN FROM dbo.rangeAB(1,5,1,1);
 -- (1.2) The best way to get the numbers 0,1,2...@high-1 (e.g. 0 to 5):
 SELECT RN FROM dbo.rangeAB(0,5,1,0);

--===== 2. Using OP for descending sorts without a performance penalty
 -- (2.1) The best way to get the numbers 5,4,3...@high (e.g. 5 to 1):
 SELECT op FROM dbo.rangeAB(1,5,1,1) ORDER BY rn ASC;
 -- (2.2) The best way to get the numbers 0,1,2...@high-1 (e.g. 5 to 0):
 SELECT op FROM dbo.rangeAB(1,6,1,0) ORDER BY rn ASC;

--===== 3. Using N1
 -- (3.1) To begin with numbers other than 0 or 1 use N1 (e.g. -3 to 3):
 SELECT N1 FROM dbo.rangeAB(-3,3,1,1);
 -- (3.2) ROW_NUMBER() is built in. If you want a ROW_NUMBER() include RN:
 SELECT RN, N1 FROM dbo.rangeAB(-3,3,1,1);
 -- (3.3) If you wanted a ROW_NUMBER() that started at 0 you would do this:
 SELECT RN, N1 FROM dbo.rangeAB(-3,3,1,0);

--===== 4. Using N2 and @gap
 -- (4.1) To get 0,10,20,30...100, set @low to 0, @high to 100 and @gap to 10:
 SELECT N1 FROM dbo.rangeAB(0,100,10,1);
 -- (4.2) Note that N2=N1+@gap; this allows you to create a sequence of ranges.
 --       For example, to get (0,10),(10,20),(20,30).... (90,100):
 SELECT N1, N2 FROM dbo.rangeAB(0,90,10,1);
 -- (4.3) Remember that a rownumber is included and it can begin at 0 or 1:
 SELECT RN, N1, N2 FROM dbo.rangeAB(0,90,10,1);

[Examples]:
--===== 1. Generating Sample data (using rangeAB to create "dummy rows")
 -- The query below will generate 10,000 ids and random numbers between 50,000 and 500,000
 SELECT
   someId    = r.rn,
   someNumer = ABS(CHECKSUM(NEWID())%450000)+50001 
 FROM rangeAB(1,10000,1,1) r;

--===== 2. Create a series of dates; rn is 0 to include the first date in the series
 DECLARE @startdate DATE = '20180101', @enddate DATE = '20180131';

 SELECT r.rn, calDate = DATEADD(dd, r.rn, @startdate)
 FROM dbo.rangeAB(1, DATEDIFF(dd,@startdate,@enddate),1,0) r;
 GO

--===== 3. Splitting (tokenizing) a string with fixed sized items
 -- given a delimited string of identifiers that are always 7 characters long
 DECLARE @string VARCHAR(1000) = 'A601225,B435223,G008081,R678567';

 SELECT
   itemNumber = r.rn, -- item's ordinal position 
   itemIndex  = r.n1, -- item's position in the string (it's CHARINDEX value)
   item       = SUBSTRING(@string, r.n1, 7) -- item (token)
 FROM dbo.rangeAB(1, LEN(@string), 8,1)  r;
 GO

--===== 4. Splitting (tokenizing) a string with random delimiters
 DECLARE @string VARCHAR(1000) = 'ABC123,999F,XX,9994443335';

 SELECT
   itemNumber = ROW_NUMBER() OVER (ORDER BY r.rn), -- item's ordinal position 
   itemIndex  = r.n1+1, -- item's position in the string (it's CHARINDEX value)
   item       = SUBSTRING
               (
                 @string,
                 r.n1+1,
                 ISNULL(NULLIF(CHARINDEX(',',@string,r.n1+1),0)-r.n1-1, 8000)
               ) -- item (token)
 FROM dbo.rangeAB(0,DATALENGTH(@string),1,1) r
 WHERE SUBSTRING(@string,r.n1,1) = ',' OR r.n1 = 0;
 -- logic borrowed from: http://www.sqlservercentral.com/articles/Tally+Table/72993/

--===== 5. Grouping by a weekly intervals
 -- 5.1. how to create a series of start/end dates between @startDate & @endDate
 DECLARE @startDate DATE = '1/1/2015', @endDate DATE = '2/1/2015';
 SELECT 
   WeekNbr   = r.RN,
   WeekStart = DATEADD(DAY,r.N1,@StartDate), 
   WeekEnd   = DATEADD(DAY,r.N2-1,@StartDate)
 FROM dbo.rangeAB(0,datediff(DAY,@StartDate,@EndDate),7,1) r;
 GO

 -- 5.2. LEFT JOIN to the weekly interval table
 BEGIN
  DECLARE @startDate datetime = '1/1/2015', @endDate datetime = '2/1/2015';
  -- sample data 
  DECLARE @loans TABLE (loID INT, lockDate DATE);
  INSERT @loans SELECT r.rn, DATEADD(dd, ABS(CHECKSUM(NEWID())%32), @startDate)
  FROM dbo.rangeAB(1,50,1,1) r;

  -- solution 
  SELECT 
    WeekNbr   = r.RN,
    WeekStart = dt.WeekStart, 
    WeekEnd   = dt.WeekEnd,
    total     = COUNT(l.lockDate)
  FROM dbo.rangeAB(0,datediff(DAY,@StartDate,@EndDate),7,1) r
  CROSS APPLY (VALUES (
    CAST(DATEADD(DAY,r.N1,@StartDate) AS DATE), 
    CAST(DATEADD(DAY,r.N2-1,@StartDate) AS DATE))) dt(WeekStart,WeekEnd)
  LEFT JOIN @loans l ON l.lockDate BETWEEN  dt.WeekStart AND dt.WeekEnd
  GROUP BY r.RN, dt.WeekStart, dt.WeekEnd ;
 END;

--===== 6. Identify the first vowel and last vowel in a along with their positions
 DECLARE @string VARCHAR(200) = 'This string has vowels';

 SELECT TOP(1) position = r.rn, letter = SUBSTRING(@string,r.rn,1)
 FROM dbo.rangeAB(1,LEN(@string),1,1) r
 WHERE SUBSTRING(@string,r.rn,1) LIKE '%[aeiou]%'
 ORDER BY r.rn;

 -- To avoid a sort in the execution plan we'll use op instead of rn
 SELECT TOP(1) position = r.op, letter = SUBSTRING(@string,r.op,1)
 FROM dbo.rangeAB(1,LEN(@string),1,1) r
 WHERE SUBSTRING(@string,r.rn,1) LIKE '%[aeiou]%'
 ORDER BY r.rn;

---------------------------------------------------------------------------------------
[Revision History]:
 Rev 00 - 20140518 - Initial Development - Alan Burstein
 Rev 01 - 20151029 - Added 65 rows to make L1=465; 465^3=100.5M. Updated comment section
                   - Alan Burstein
 Rev 02 - 20180613 - Complete re-design including opposite number column (op)
 Rev 03 - 20180920 - Added additional CROSS JOIN to L2 for 530B rows max - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS 
(
  SELECT 1
  FROM (VALUES
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0)) T(N) -- 90 values 
),
L2(N)  AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT  
  r.RN,
  r.OP,
  r.N1,
  r.N2
FROM
(
  SELECT
    RN = 0,
    OP = (@high-@low)/@gap,
    N1 = @low,
    N2 = @gap+@low
  WHERE @row1 = 0
  UNION ALL -- COALESCE required in the TOP statement below for error handling purposes
  SELECT TOP (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))
    RN = i.rn,
    OP = (@high-@low)/@gap+(2*@row1)-i.rn,
    N1 = (i.rn-@row1)*@gap+@low,
    N2 = (i.rn-(@row1-1))*@gap+@low
  FROM iTally AS i
  ORDER BY rn
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0;

CharMapAB

CREATE FUNCTION dbo.charmapAB
(
  @asciiOnly BIT,
  @xmlCheck  BIT
) 
/*****************************************************************************************
[Purpose]:
 Generates a table containing the numbers 1 through 65535 along with the
 corrsponding CHAR(N) value (e.g. CHAR(65) = "A") and/or UNICODE value (e.g. 
 NCHAR(324) = "ń", aka the Latin minuscule: ń. 

 The ascii_xml_special and unicode_xml_special columns at bits that indicate if 
 the character is an ASCII or UNICODE Reserved XML character. The ascii_xml and 
 unicode_xml columns show what will be displayed when the character is output as
 in XML format (e.g. SELECT CAST('>' AS XML) will return ">". 

 is_ascii_whitespace indicates if the character is a "whitespace character" (such
 as CHAR(9), CHAR(32) and CHAR(160)). abin is the character's ascii binary value 
 and ubin is the characters unicode binary value. 

[Developer Notes]:
 1. Have not determined UNICODE whitespace characters. 

[Examples]:
--===== Get a list of ASCII whitespace characters
  SELECT cm.* -- WhiteSpaceCharacters = 'CHAR('+CAST(n AS varchar(3))+')'
  FROM   dbo.CharmapAB(0,0) AS cm;

  SELECT cm.* -- WhiteSpaceCharacters = 'CHAR('+CAST(n AS varchar(3))+')'
  FROM   dbo.CharmapAB(1,1) AS cm;

  SELECT cm.* -- WhiteSpaceCharacters = 'CHAR('+CAST(n AS varchar(3))+')'
  FROM  dbo.CharmapAB(0,1) AS cm
  WHERE cm.char_nbr IN (9,10,13,32,38,60,62);
-----------------------------------------------------------------------------------------
[Revision History]:
 Rev 00 - May 2015 - Initial Development - Alan Burstein
 Rev 01 - 20150819 changed whitespace val, column names, added quoted_val
        - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH rowz(N) AS (SELECT CASE @asciiOnly WHEN 0 THEN 255 ELSE 65535 END)
SELECT
char_nbr        = i.RN, 
ascii_val       = CHAR(cs.RN),
unicode_val     = u.unicode_val,
quoted_val      = uq.quoted_val,
is_unicode_only = SIGN(i.RN&256),
is_acsii_ws     = CASE WHEN cs.RN IN ((2),(9),(10),(13),(32),(160)) THEN 1 ELSE 0 END,
is_ascii_blank  = CASE WHEN cs.RN BETWEEN 28  AND 31 
                         OR cs.RN BETWEEN 129 AND 159 THEN 1 ELSE 0 END,
unicode_xml_val = x.unicode_xml_val,
bin             = CAST(NCHAR(cs.RN) AS varbinary)
FROM rowz
CROSS APPLY dbo.rangeAB(1,rowz.N,1,1)       AS i
CROSS APPLY (VALUES(CHECKSUM(i.RN)))        AS cs(RN)
CROSS APPLY (SELECT TOP (@xmlCheck*1) NCHAR(cs.RN) 
             WHERE @xmlCheck = 1 
             FOR XML PATH(''))              AS x(unicode_xml_val)
CROSS APPLY (VALUES(NCHAR(cs.RN)))          AS u(unicode_val)  
CROSS APPLY (VALUES('"'+u.unicode_val+'"')) AS uq(quoted_val);

CharmapAB will help you identify which characters are XML:

If you run this query you can identify which ASCII characters are "XML Protected"

SELECT cm.*
FROM  dbo.CharmapAB(0,1) AS cm;

Returns (truncated for brevity)

char_nbr  ascii_val unicode_val quoted_val is_unicode_only      is_acsii_ws is_ascii_blank unicode_xml_val      bin
--------- --------- ----------- ---------- -------------------- ----------- -------------- -------------------- ------
1                             ""        0                    0           0                             0x0100
2                             ""        0                    1           0                             0x0200
....
32                              " "        0                    1           0                              0x2000
33        !         !           "!"        0                    0           0              !                    0x2100
34        "         "           """        0                    0           0              "                    0x2200
35        #         #           "#"        0                    0           0              #                    0x2300
36        $         $           "$"        0                    0           0              $                    0x2400
37        %         %           "%"        0                    0           0              %                    0x2500
38        &         &           "&"        0                    0           0              &                0x2600
39        '         '           "'"        0                    0           0              '                    0x2700
...

My experience has been that the first 31 characters are never used except char(9),char(10) and char(13) (tab carriage return and line returns). As well as char(32),char(38),char(60) and char(62) which are: space, ampersand (&), then greater than and less than ("<" and ">"). This query will likely be enough to get you the characters you need:

DECLARE @yourstring VARCHAR(8000) = 'ABC&123<xxx>'

SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@yourstring,
  '	', CHAR(9)),
  '
', CHAR(10)),
  '', CHAR(13)),
  ' ', CHAR(32)),
  '&', CHAR(38)),
  '<', CHAR(60)),
  '>', CHAR(62));

Returns: ABC&123

You can use CharMapAB to update this as needed.

string literal too long - how to assign long xml data to clob data type in oracle 11g r2

One approach is to use sqlldr. First, create a small holding table:

create table tstclob
(
id number,
doc clob
);

Assuming your large document is the file "c:\data\test_doc.txt", create a sqlldr control file ("test_doc.ctl") to load it:

load data
infile *
replace 
into table tstclob
fields terminated by ','
(
 ID char(1),
 lob_file FILLER char,
  DOC LOBFILE(lob_file) TERMINATED BY EOF
 )
begindata
1,c:\data\test_doc.txt

Then run sqlldr (in this case, from c:\data directory):

sqlldr control=test_doc.ctl userid=someuser@somedb/somepass

You can then update whatever table you want using tstclob table.

How to Retrieve a Very Long Xml-String from an SQL Database with R