#Bash one liner: Remove all occurrences of a tag block from an XSD or XML file using #sed #xmlstarlet

By | April 1, 2022

Sometimes you get XSD schemas that are full of additional annotations and comments you want to strip.

Annotations and comments are nice, but they get in the way when you compare two incremental versions of a schema and try to pinpoint just the differences.

Below is a script that removes all comments and every <xs:annotation> block from an XSD file.

A sample of such a schema is given below:

    <!--    AccountIdentification4Choice1 -->
    <xs:complexType name="AccountIdentification4Choice1">
        <xs:annotation>
            <xs:documentation source="Name" xml:lang="EN">AccountIdentification4Choice__1</xs:documentation>
            <xs:documentation source="Definition" xml:lang="EN">Specifies the unique identification of an account as assigned by the account servicer.</xs:documentation>
        </xs:annotation>
        <xs:choice>
            <xs:element name="IBAN" type="IBAN2007Identifier">
                <xs:annotation>
                    <xs:documentation source="Name" xml:lang="EN">IBAN</xs:documentation>
                    <xs:documentation source="Definition" xml:lang="EN">International Bank Account Number (IBAN) - identifier used internationally by financial institutions to uniquely identify the account of a customer. Further specifications of the format and content of the IBAN can be found in the standard ISO 13616 "Banking and related financial services - International Bank Account Number (IBAN)" version 1997-10-01, or later revisions.</xs:documentation>
                </xs:annotation>
            </xs:element>
            <xs:element name="Othr" type="GenericAccountIdentification11">
                <xs:annotation>
                    <xs:documentation source="Name" xml:lang="EN">Other</xs:documentation>
                    <xs:documentation source="Definition" xml:lang="EN">Unique identification of an account, as assigned by the account servicer, using an identification scheme.</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:choice>
    </xs:complexType>

The following script does the clean-up for every .xsd schema file in the current directory.

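#!/bin/bash

# Write a cleaned copy of every .xsd file in the current directory.
for file in *.xsd
do
    # Stages: delete xs:annotation elements, strip comments, drop blank lines.
    # File names are quoted so that names containing spaces still work.
    xmlstarlet ed -P -d "//*/xs:annotation" "$file" \
    | sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' \
    | grep -zv '^<!--' \
    | tr -d '\0' \
    | sed '/^[[:space:]]*$/d' \
    >> "${file%.xsd}-clean.xsd"
done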

Where:

xmlstarlet ed -P -d "//*/xs:annotation" "$file" \

Parses the XSD/XML and deletes (-d) every node that matches the given XPath, the <xs:annotation> blocks in my case. The -P option preserves the original formatting of the file.
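The command above relies on XMLStarlet resolving the xs: prefix used in the XPath. A minimal sketch, assuming the schema uses the standard http://www.w3.org/2001/XMLSchema namespace, that binds the prefix explicitly with -N. The strip_annotations function name and the sample schema are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print one schema file with every xs:annotation removed.
# -N binds the xs prefix used in the XPath to the XML Schema namespace.
strip_annotations() {
    xmlstarlet ed -P \
        -N xs="http://www.w3.org/2001/XMLSchema" \
        -d "//xs:annotation" "$1"
}

# Tiny sample schema in a temporary directory (illustrative only).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.xsd" <<'EOF'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="IBAN" type="xs:string">
    <xs:annotation>
      <xs:documentation>International Bank Account Number</xs:documentation>
    </xs:annotation>
  </xs:element>
</xs:schema>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
    stripped=$(strip_annotations "$tmpdir/sample.xsd")
    printf '%s\n' "$stripped"
else
    echo "xmlstarlet is not installed; skipping" >&2
fi
rm -rf "$tmpdir"
```

The xs:element line should survive while the whole annotation block disappears. With -P the whitespace around the deleted block stays behind, which is why the full script removes blank lines at the end.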

| sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' \
| grep -zv '^<!--' \
| tr -d '\0' \

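Removes all the comment blocks from the XSD/XML. The sed command puts a NUL byte before every <!-- and after every -->, grep -z then treats the NUL-separated chunks as records and -v drops the ones that start with <!--, and tr finally deletes the NUL bytes. Because the records are split on NUL instead of newline, comments that span several lines are removed as well.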
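To see the three comment-stripping stages in isolation, here is a small sketch that runs them on an inline sample. It assumes GNU sed (for the \x0 escape) and a grep that supports -z. The strip_comments function name and the sample text are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper wrapping the three comment-stripping stages of the script.
strip_comments() {
    sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' \
    | grep -zv '^<!--' \
    | tr -d '\0'
}

# Sample with a single-line, a trailing and a multi-line comment.
sample='<a>
<!-- single-line comment -->
<b/><!-- trailing comment -->
<!-- multi-line
     comment -->
<c/>
</a>'

# The last sed removes the blank lines the deleted comments leave behind.
result=$(printf '%s\n' "$sample" | strip_comments | sed '/^[[:space:]]*$/d')
printf '%s\n' "$result"
```

With GNU sed and grep this should print only the <a>, <b/>, <c/> and </a> lines.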

| sed '/^[[:space:]]*$/d' \

Removes all lines that are empty or contain only whitespace. These are artefacts left behind by the previous two remove operations.

>> "${file%.xsd}-clean.xsd"

Writes the result to a new file. If the initial file was myschema1.xsd, the result file will be myschema1-clean.xsd. Because >> appends, delete any earlier -clean.xsd files before running the script again.
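The output name comes from bash parameter expansion: ${file%.xsd} strips the shortest match of .xsd from the end of the value. A short sketch of just that step, where the clean_name function and the file names are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: derive the output name for one input file name.
# ${1%.xsd} removes a trailing ".xsd"; "-clean.xsd" is then appended.
clean_name() {
    printf '%s\n' "${1%.xsd}-clean.xsd"
}

for file in myschema1.xsd payments.v2.xsd; do   # made-up file names
    printf '%s -> %s\n' "$file" "$(clean_name "$file")"
done
```

Only the final .xsd is removed, so a dot elsewhere in the name is left alone.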

Resources:

Apart from XMLStarlet, the script uses only common command line tools (sed, grep, tr) that can be found on other flavours of Unix, not just Linux. Note that the \x0 escape in sed and the -z option of grep are not part of POSIX, so on some systems you may need the GNU versions of these tools.

The only additional software is the wonderful open source tool XMLStarlet, which has releases for Linux, Windows and Solaris.

Basically, the above script can be applied with small changes on any other supported OS.
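One possible small change for systems without GNU sed and grep is to let XMLStarlet delete the comments too, using the XPath node test comment(). This is a different technique from the sed/grep/tr stages of the script, sketched here under the assumption that xmlstarlet is on the PATH. The strip_comments_xml function name and the sample file are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print one XML/XSD file with all comment nodes deleted.
# comment() matches comment nodes, so no namespace prefix is involved.
strip_comments_xml() {
    xmlstarlet ed -P -d "//comment()" "$1" \
    | sed '/^[[:space:]]*$/d'    # drop the blank lines left behind
}

# Illustrative input written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<root>
  <!-- a comment
       spanning two lines -->
  <item/><!-- trailing comment -->
</root>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
    cleaned=$(strip_comments_xml "$sample")
    printf '%s\n' "$cleaned"
else
    echo "xmlstarlet is not installed; skipping" >&2
fi
rm -f "$sample"
```

A second -d "//comment()" action could in principle be added to the xmlstarlet call of the original script, leaving only the final blank-line sed, which uses nothing beyond POSIX character classes.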
