Openxml Tag Search

OpenXML tag search

The problem with searching for tags is that text is not always stored in the underlying XML the way it appears in Word. For example, in your sample XML the <!TAG1!> tag is split across multiple runs like this:

<w:r>
    <w:rPr>
        <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>&lt;!TAG1</w:t>
</w:r>
<w:proofErr w:type="gramEnd"/>
<w:r>
    <w:rPr>
        <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>!&gt;</w:t>
</w:r>

As pointed out in the comments, this is sometimes caused by the spelling and grammar checker, but that's not the only cause. Having different styles on parts of the tag would also split it, for example.

One way of handling this is to take the InnerText of each Paragraph and match your Regex against that. The InnerText property returns the plain text of the paragraph, without any formatting or other XML in the underlying document getting in the way.
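To illustrate the point, here is a minimal sketch using only LINQ to XML (no Open XML SDK required). The class and method names are made up for this example, and the paragraph XML is a cut-down version of the sample above. It shows that no single run contains the whole tag, while the concatenated w:t text, which is roughly what InnerText gives you, does:

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class SplitTagDemo
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical paragraph in which the tag is split across two runs.
    public const string ParagraphXml =
        @"<w:p xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
            <w:r><w:t>&lt;!TAG1</w:t></w:r>
            <w:proofErr w:type=""gramEnd""/>
            <w:r><w:t>!&gt;</w:t></w:r>
          </w:p>";

    // Counts the runs whose own text contains a complete tag.
    public static int CountRunsWithWholeTag(XElement paragraph, Regex regex) =>
        paragraph.Elements(W + "r")
            .Count(run => regex.IsMatch(
                string.Concat(run.Elements(W + "t").Select(t => t.Value))));

    // Concatenates all w:t text in document order.
    public static string ParagraphText(XElement paragraph) =>
        string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));
}
```

Matching per run finds nothing here, whereas matching against ParagraphText(XElement.Parse(SplitTagDemo.ParagraphXml)) finds the tag.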

Once you have found your tags, replacing the text is the next problem. For the reasons above, you can't just replace the InnerText with some new text, because it wouldn't be clear which parts of the text belong in which Run. The easiest way round this is to remove the existing Runs and add a new Run with a Text element containing the new text.

The following code finds the tags and replaces them in a single pass, rather than the two passes you suggest in your question; that just keeps the example simpler, to be honest. It should show everything you need.

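//needs these usings at the top of the file; the two methods go inside a class
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

private static void ReplaceTags(string filename)
{
    //match <!...!> lazily and capture the tag name in group 1
    Regex regex = new Regex("<!(.*?)!>", RegexOptions.Compiled);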

    using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filename, true))
    {
        //grab the header parts and replace tags there
        foreach (HeaderPart headerPart in wordDocument.MainDocumentPart.HeaderParts)
        {
            ReplaceParagraphParts(headerPart.Header, regex);
        }
        //now do the document
        ReplaceParagraphParts(wordDocument.MainDocumentPart.Document, regex);
        //now replace the footer parts
        foreach (FooterPart footerPart in wordDocument.MainDocumentPart.FooterParts)
        {
            ReplaceParagraphParts(footerPart.Footer, regex);
        }
    }
}

private static void ReplaceParagraphParts(OpenXmlElement element, Regex regex)
{
    foreach (var paragraph in element.Descendants<Paragraph>())
    {
        //read the text before the child runs are removed otherwise
        //paragraph.InnerText will be empty
        string text = paragraph.InnerText;
        if (regex.IsMatch(text))
        {
            //create a new run and set its value to the correct text
            Run newRun = new Run();
            //regex.Replace swaps every tag in the paragraph, not just the first match
            newRun.AppendChild(new Text(regex.Replace(text, "some new value")));
            //remove any child runs
            paragraph.RemoveAllChildren<Run>();
            //add the newly created run
            paragraph.AppendChild(newRun);
        }
    }
}

One downside of the above approach is that any styles you had will be lost. These could be copied from the existing Runs, but if there are multiple Runs with differing properties you'll need to work out which ones to copy where. There's nothing to stop you creating multiple Runs in the above code, each with different properties, if that's what is required. Other elements inside the removed Runs would also be lost (e.g. any symbols), so those would need to be accounted for too.
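As a rough sketch of the "copy the styles" idea, the variant below reuses the properties of the first Run that has any. That choice is an assumption made for the example (as are the class and method names); a real document may need a smarter rule. It requires the DocumentFormat.OpenXml package:

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class StylePreservingReplace
{
    // Replaces every tag in the paragraph with one new run that carries
    // a copy of the first available RunProperties.
    public static bool ReplaceKeepingFirstRunProperties(
        Paragraph paragraph, Regex regex, string replacement)
    {
        // Read the text before any runs are removed.
        string text = paragraph.InnerText;
        if (!regex.IsMatch(text))
        {
            return false;
        }

        // Find the first run that has properties (may be null).
        var properties = paragraph.Elements<Run>()
            .Select(run => run.RunProperties)
            .FirstOrDefault(p => p != null);

        var newRun = new Run();
        if (properties != null)
        {
            // Clone, because an element that still has a parent
            // cannot be appended to another element.
            newRun.AppendChild((RunProperties)properties.CloneNode(true));
        }

        // The evaluator form inserts the replacement literally, so a "$"
        // in it is not treated as a regex substitution.
        newRun.AppendChild(new Text(regex.Replace(text, _ => replacement))
        {
            Space = SpaceProcessingModeValues.Preserve
        });

        paragraph.RemoveAllChildren<Run>();
        paragraph.AppendChild(newRun);
        return true;
    }
}
```

It could be called from ReplaceParagraphParts in place of the body of the foreach loop. Symbols and other non-text children of the removed Runs are still lost.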

How to handle smartTag nodes using OpenXML

The solution is to use LINQ to XML (or the System.Xml classes if you like those better) to remove the w:smartTag elements, as shown in the following code:

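using System.IO;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using Xunit;

public class SmartTagTests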
{
    private const string Xml =
        @"<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
<w:body>
    <w:p>
        <w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
            <w:r w:rsidRPr=""00BF444F"">
                <w:rPr>
                    <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                    <w:b/>
                    <w:bCs/>
                    <w:sz w:val=""40""/>
                    <w:szCs w:val=""40""/>
                </w:rPr>
                <w:t>ST</w:t>
            </w:r>
        </w:smartTag>
        <w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
            <w:r w:rsidRPr=""00BF444F"">
                <w:rPr>
                    <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                    <w:b/>
                    <w:bCs/>
                    <w:sz w:val=""40""/>
                    <w:szCs w:val=""40""/>
                </w:rPr>
                <w:t>AR</w:t>
            </w:r>
        </w:smartTag>
        <w:r w:rsidRPr=""00BF444F"">
            <w:rPr>
                <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                <w:b/>
                <w:bCs/>
                <w:sz w:val=""40""/>
                <w:szCs w:val=""40""/>
            </w:rPr>
            <w:t xml:space=""preserve"">T</w:t>
        </w:r>
    </w:p>
</w:body>
</w:document>";

    [Fact]
    public void CanStripSmartTags()
    {
        // Say you have a WordprocessingDocument stored on a stream (e.g., read
        // from a file).
        using Stream stream = CreateTestWordprocessingDocument();

        // Open the WordprocessingDocument and inspect it using the strongly-
        // typed classes. This shows that OpenXmlUnknownElement instances
        // are found and only a single Run instance is recognized.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
        {
            MainDocumentPart part = wordDocument.MainDocumentPart;
            Document document = part.Document;

            Assert.Single(document.Descendants<Run>());
            Assert.NotEmpty(document.Descendants<OpenXmlUnknownElement>());
        }

        // Now, open that WordprocessingDocument to make edits, using Linq to XML.
        // Do NOT use the strongly typed classes in this context.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
        {
            // Get the w:document as an XElement and demonstrate that this
            // w:document contains w:smartTag elements.
            MainDocumentPart part = wordDocument.MainDocumentPart;
            string xml = ReadString(part);
            XElement document = XElement.Parse(xml);

            Assert.NotEmpty(document.Descendants().Where(d => d.Name.LocalName == "smartTag"));

            // Transform the w:document, stripping all w:smartTag elements and
            // demonstrate that the transformed w:document no longer contains
            // w:smartTag elements.
            var transformedDocument = (XElement) StripSmartTags(document);

            Assert.Empty(transformedDocument.Descendants().Where(d => d.Name.LocalName == "smartTag"));

            // Write the transformed document back to the part.
            WriteString(part, transformedDocument.ToString(SaveOptions.DisableFormatting));
        }

        // Open the WordprocessingDocument again and inspect it using the 
        // strongly-typed classes. This demonstrates that all Run instances
        // are now recognized.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
        {
            MainDocumentPart part = wordDocument.MainDocumentPart;
            Document document = part.Document;

            Assert.Equal(3, document.Descendants<Run>().Count());
            Assert.Empty(document.Descendants<OpenXmlUnknownElement>());
        }
    }

    /// <summary>
    /// Recursive, pure functional transform that removes all w:smartTag elements.
    /// </summary>
    /// <param name="node">The <see cref="XNode" /> to be transformed.</param>
    /// <returns>The transformed <see cref="XNode" />.</returns>
    private static object StripSmartTags(XNode node)
    {
        // We only consider elements (not text nodes, for example).
        if (!(node is XElement element))
        {
            return node;
        }

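        // Strip w:smartTag elements by only returning their transformed
        // children, so nested w:smartTag elements are stripped as well.
        // The w:smartTagPr child is dropped because it is only valid
        // inside a w:smartTag.
        if (element.Name.LocalName == "smartTag")
        {
            return element.Elements()
                .Where(e => e.Name.LocalName != "smartTagPr")
                .Select(e => StripSmartTags(e));
        }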

        // Perform the identity transform.
        return new XElement(element.Name, element.Attributes(),
            element.Nodes().Select(StripSmartTags));
    }

    private static Stream CreateTestWordprocessingDocument()
    {
        var stream = new MemoryStream();

        using var wordDocument = WordprocessingDocument.Create(stream, WordprocessingDocumentType.Document);
        MainDocumentPart part = wordDocument.AddMainDocumentPart();
        WriteString(part, Xml);

        return stream;
    }

    #region Generic Open XML Utilities

    private static string ReadString(OpenXmlPart part)
    {
        using Stream stream = part.GetStream(FileMode.Open, FileAccess.Read);
        using var streamReader = new StreamReader(stream);
        return streamReader.ReadToEnd();
    }

    private static void WriteString(OpenXmlPart part, string text)
    {
        using Stream stream = part.GetStream(FileMode.Create, FileAccess.Write);
        using var streamWriter = new StreamWriter(stream);
        streamWriter.Write(text);
    }

    #endregion
}

You could also use the PowerTools for Open XML, which provide a markup simplifier that directly supports the removal of w:smartTag elements.
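A minimal sketch of that route, assuming the OpenXmlPowerTools package and its MarkupSimplifier class with a RemoveSmartTags setting (check the names against the version you install). The wrapper class and method are made up for the example:

```csharp
using System.IO;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static class SmartTagSimplifier
{
    // Opens the document for editing and asks the markup simplifier
    // to remove w:smartTag elements, leaving other markup alone.
    public static void RemoveSmartTags(Stream stream)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, true))
        {
            var settings = new SimplifyMarkupSettings
            {
                RemoveSmartTags = true
            };
            MarkupSimplifier.SimplifyMarkup(doc, settings);
        }
    }
}
```

The settings class has further switches for other kinds of markup, so only enable the ones you actually want applied.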

c# OpenXML search and replace text not saving

I very luckily stumbled across the answer at 18:52 into the OpenXmlRegex YouTube video (https://youtu.be/rDGL-i5zRdk?t=18m52s). I needed to call the PutXDocument() method on the MainDocumentPart for the changes to take effect (instead of the doc.Save() I was trying to use):

doc.MainDocumentPart.PutXDocument();
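For context, this is roughly how the call fits into a search and replace. It is a sketch that assumes the OpenXmlPowerTools package (GetXDocument, PutXDocument, OpenXmlRegex.Replace and the W name class); the wrapper class and method names are made up for the example:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static class PowerToolsReplace
{
    // Replaces matches in all paragraphs of the main document part and
    // returns the number of replacements reported by OpenXmlRegex.
    public static int ReplaceAndPersist(Stream stream, Regex regex, string replacement)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, true))
        {
            MainDocumentPart part = doc.MainDocumentPart;

            // GetXDocument returns the XDocument that PowerTools keeps
            // for the part; edits happen on this in-memory copy.
            XDocument xDoc = part.GetXDocument();
            List<XElement> paragraphs = xDoc.Descendants(W.p).ToList();

            int count = OpenXmlRegex.Replace(paragraphs, regex, replacement, null);

            // Without this call the edited XDocument is never written
            // back to the part, so the file appears unchanged.
            part.PutXDocument();
            return count;
        }
    }
}
```

The edits are made on an XDocument held alongside the part, which is why saving through the strongly-typed classes does not pick them up.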
