December 22, 2019

How to generate an XML file from Excel with Python

Introduction

In this post, we will take a look into how we can generate Extensible Markup Language (XML) files from an Excel file with Python. We will be using the Yattag package to generate our XML file and the OpenPyXL package for reading our Excel data.

Packages

Yattag

Yattag is described in its documentation as following:

Yattag is a Python library for generating HTML or XML in a pythonic way.

That pretty much sums Yatttag up, I find it as a simple, easy to use library that just works. I had been searching for this kind of library in order to more easily generate different XML files.

To install Yattag with pip: pip install yattag

Using Yattag

Adding a tag with Yattag is as easy as using the With keyword:

with tag('h1'):

    text('Hello world!')

Tags are automatically closed. To use Yattag we need to import Doc from Yattag and create our Doc, tag and text with Doc().tagtext().

from yattag import Doc

doc, tag, text = Doc().tagtext()

with tag('h1'):

    text('Hello world!')

doc.getvalue()

Output:
'<h1>Hello world!</h1>'

OpenPyXL

OpenPyXL is a library for interacting with Excel 2010 files. OpenPyXL can read and write to .xlsx and .xlsm files.

To install OpenPyXL with pip: pip install openpyxl

Using OpenPyXL

To load an existing workbook in OpenPyXl we need to use the load_workbook method. We also need to select the sheet we are reading the data from. In our example, we are using popular baby names in New York City. You can access the dataset from the link at the bottom of this post.

I have created a workbook named NY_baby_names.xlsx with one sheet of data, Sheet1. The worksheet has the following headers: Year of Birth, Gender, Child's First Name, Count, Rank. You can download the Excel file from my website here.

To access the data with OpenPyXL, do:

from openpyxl import load_workbook

wb = load_workbook("NY_baby_names.xlsx")

ws = wb.worksheets[0]

for row in ws.iter_rows(min_row=1, max_row=2, min_col=1, max_col=4):

    print([cell.value for cell in row])

Output:
['Year of Birth', 'Gender', "Child's First Name", 'Count']
[2011, 'FEMALE', 'GERALDINE', 13]

First, we load the workbook with load_workbook, and then select the first worksheet. We then iterate through the first two rows with the iter_rows method.

Generating the XML from Excel

After the imports, we load the workbook and the worksheet. We then create our Yattag document. We fill the headers with Yattags asis() method. The asis method enables us to input any string as the next line.

We then create our main tag, Babies. We loop through our sheet with the iter_rows method. The iter_rows method returns a generator with all the cells. We use a list comprehension to get all the values from the cells.

Next, we are adding the babies. Notice the use of the With tag and text. When we are finished we indent our result with Yattags indent method.

Finally, we save our file. The output should look like below. Notice that I only included two babies in the output.

<?xml version="1.0" encoding="UTF-8"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema>

<Babies>

    <Baby>

        <Name>

            GERALDINE

        </Name>

        <Gender>

            FEMALE

        </Gender>

        <year>

            2011

        </year>

        <count>

            13

        </count>

        <rank>

            75

        </rank>

    </Baby>

    <Baby>

        <Name>

            GIA

        </Name>

        <Gender>

            FEMALE

        </Gender>

        <year>

            2011

        </year>

        <count>

            21

        </count>

        <rank>

            67

        </rank>

    </Baby>

</Babies>

Complete code

from openpyxl import load_workbook

from yattag import Doc, indent

wb = load_workbook("NY_baby_names.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()

xml_header = '<?xml version="1.0" encoding="UTF-8"?>'
xml_schema = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema>'

doc.asis(xml_header)
doc.asis(xml_schema)

with tag('Babies'):
# Use ws.max_row for all rows
for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=5):
row = [cell.value for cell in row]
with tag("Baby"):
with tag("Name"):
text(row[2])
with tag("Gender"):
text(row[1])
with tag("year"):
text(row[0])
with tag("count"):
text(row[3])
with tag("rank"):
text(row[4])

result = indent(
doc.getvalue(),
indentation = ' ',
indent_text = True
)

with open("baby_names.xml", "w") as f:
f.write(result)

Data sets

Python, Excel