There is a way to iterate over Excel tables with OpenPyXL. The functionality has changed in newer...
How to generate an XML file from Excel with Python
Introduction
In this post, we will take a look into how we can generate Extensible Markup Language (XML) files from an Excel file with Python. We will be using the Yattag package to generate our XML file and the OpenPyXL package for reading our Excel data.
Packages
Yattag
Yattag is described in its documentation as following:
Yattag is a Python library for generating HTML or XML in a pythonic way.
That pretty much sums Yatttag up, I find it as a simple, easy to use library that just works. I had been searching for this kind of library in order to more easily generate different XML files.
To install Yattag with pip: pip install yattag
Using Yattag
Adding a tag with Yattag is as easy as using the With
keyword:
with tag('h1'):
text('Hello world!')
Tags are automatically closed. To use Yattag we need to import Doc from Yattag and create our Doc, tag and text with Doc().tagtext().
from yattag import Doc doc, tag, text = Doc().tagtext() with tag('h1'): text('Hello world!') doc.getvalue()
Output:
'<h1>Hello world!</h1>'
OpenPyXL
OpenPyXL is a library for interacting with Excel 2010 files. OpenPyXL can read and write to .xlsx and .xlsm files.
To install OpenPyXL with pip: pip install openpyxl
Using OpenPyXL
To load an existing workbook in OpenPyXl we need to use the load_workbook
method. We also need to select the sheet we are reading the data from. In our example, we are using popular baby names in New York City. You can access the dataset from the link at the bottom of this post.
I have created a workbook named NY_baby_names.xlsx with one sheet of data, Sheet1. The worksheet has the following headers: Year of Birth, Gender, Child's First Name, Count, Rank. You can download the Excel file from my website here.
To access the data with OpenPyXL, do:
from openpyxl import load_workbook wb = load_workbook("NY_baby_names.xlsx") ws = wb.worksheets[0] for row in ws.iter_rows(min_row=1, max_row=2, min_col=1, max_col=4): print([cell.value for cell in row])
Output:
['Year of Birth', 'Gender', "Child's First Name", 'Count']
[2011, 'FEMALE', 'GERALDINE', 13]
First, we load the workbook with load_workbook
, and then select the first worksheet. We then iterate through the first two rows with the iter_rows
method.
Generating the XML from Excel
After the imports, we load the workbook and the worksheet. We then create our Yattag document. We fill the headers with Yattags asis()
method. The asis method enables us to input any string as the next line.
We then create our main tag, Babies. We loop through our sheet with the iter_rows
method. The iter_rows
method returns a generator with all the cells. We use a list comprehension to get all the values from the cells.
Next, we are adding the babies. Notice the use of the With tag and text. When we are finished we indent our result with Yattags indent method.
Finally, we save our file. The output should look like below. Notice that I only included two babies in the output.
<?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema> <Babies> <Baby> <Name> GERALDINE </Name> <Gender> FEMALE </Gender> <year> 2011 </year> <count> 13 </count> <rank> 75 </rank> </Baby> <Baby> <Name> GIA </Name> <Gender> FEMALE </Gender> <year> 2011 </year> <count> 21 </count> <rank> 67 </rank> </Baby> </Babies>
Complete code
from openpyxl import load_workbook from yattag import Doc, indent wb = load_workbook("NY_baby_names.xlsx") ws = wb.worksheets[0] # Create Yattag doc, tag and text objects doc, tag, text = Doc().tagtext() xml_header = '<?xml version="1.0" encoding="UTF-8"?>' xml_schema = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema>' doc.asis(xml_header) doc.asis(xml_schema) with tag('Babies'): # Use ws.max_row for all rows for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=5): row = [cell.value for cell in row] with tag("Baby"): with tag("Name"): text(row[2]) with tag("Gender"): text(row[1]) with tag("year"): text(row[0]) with tag("count"): text(row[3]) with tag("rank"): text(row[4]) result = indent( doc.getvalue(), indentation = ' ', indent_text = True ) with open("baby_names.xml", "w") as f: f.write(result)
Data sets
Popular Baby Names dataset: https://catalog.data.gov/dataset/most-popular-baby-names-by-sex-and-mothers-ethnic-group-new-york-city-8c742