Ever found yourself stuck with a PowerPoint presentation full of tables brimming with vital data? You might have tried manually copying the data from these tables, only to find the process tedious, time-consuming, and prone to mistakes.
But wait, if you've got Python by your side, you don't have to fret about any of this. With its rich arsenal of libraries, Python gives you an efficient way to handle the job. This blog post is your handy guide to extracting table data from PowerPoint presentations using Python's python-pptx and pandas libraries.
What's in Our Toolkit?
Before we dive in, let's ensure we have all the tools we need in our Python toolkit:
- collections.abc: A standard-library module of abstract base classes for containers. We aren't using it directly in our code; the script simply imports it ahead of pptx, which older python-pptx releases on Python 3.10+ have been reported to need.
- pptx: This nifty Python library (installed as python-pptx) lets us create, read, and modify PowerPoint (.pptx) files. It's our main tool for this job.
- pandas: This heavyweight champion of Python libraries is a data scientist's best friend. We'll use it to store and fiddle with the data we yank out from the PowerPoint file.
- sys and json: These standard-library modules are our go-to for system-specific parameters and functions (sys) and for playing around with JSON data (json).
Only python-pptx and pandas need installing, since the rest ship with Python. A simple pip command does the trick:
pip install python-pptx pandas
Let's dive right into the code:
import collections.abc  # imported before pptx; not used directly
import json
import sys

import pandas as pd
from pptx import Presentation


def read_ppt(filename):
    """Return one pandas DataFrame per table found in the presentation."""
    presentation = Presentation(filename)
    tables = []
    for slide in presentation.slides:
        for shape in slide.shapes:
            if shape.has_table:
                table = shape.table
                table_data = []
                for row in table.rows:
                    row_data = []
                    for cell in row.cells:
                        # Join the text of every run in every paragraph of the cell
                        cell_text = ''
                        for paragraph in cell.text_frame.paragraphs:
                            for run in paragraph.runs:
                                cell_text += run.text
                        row_data.append(cell_text)
                    table_data.append(row_data)
                df = pd.DataFrame(table_data)
                tables.append(df)
    return tables


tables = read_ppt('mypresentation.pptx')

# Print every extracted table as a JSON string
if tables:
    table_list = []
    for table in tables:
        table_list.append(table.to_json(orient='columns'))
    print(table_list)
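The script's last step calls to_json(orient='columns') on each DataFrame. As a rough illustration of what that produces, here is a small sketch on a made-up two-row table (the sample values are not from any real presentation). With this orientation, pandas nests the data as column label, then row label, then value, and the integer labels come back as strings once the JSON is parsed.

```python
import json

import pandas as pd

# Made-up stand-in for one extracted table (all cell values are strings).
df = pd.DataFrame([["Region", "Sales"], ["North", "120"]])

# orient="columns" nests the data as {column label: {row label: value}}.
as_json = df.to_json(orient="columns")
print(as_json)

# Parsing the string back shows the nesting as plain Python dicts.
parsed = json.loads(as_json)
print(parsed["0"])  # first column, keyed by row label
```

If a row-by-row layout suits you better, orient='records' is another documented option worth trying.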
Deciphering the Code
Now, let's get our hands dirty and see what our code does. The heart of our script is the read_ppt() function. It takes the name of a PowerPoint file and spits out a list of pandas DataFrames. Each data frame is a table from the PowerPoint file, neatly extracted and ready for us to work with.
Here's how it pulls off this magic trick:
- Our function kicks off by opening the PowerPoint file using the Presentation class from the pptx library.
- It then takes a leisurely stroll through each slide in the presentation.
- On each slide, it looks at every shape (anything you see on the slide, like a text box, table, or image). If it finds a table (checked using shape.has_table), it gets ready to extract the data from the table.
- To yank out the data from a table, it goes row by row, cell by cell. For each cell, it pulls out the text and stashes it in a list. This list is like a digital version of the row from our table. After it's been through every cell in the row, it adds the list (our row) to a bigger list (our table).
- Once it's done with all rows, it converts this big list (the digital avatar of our table) into a pandas DataFrame and adds it to an even bigger list, which will hold all our tables.
- After it's had its fill of slides and tables, it finally returns the list of DataFrames (tables).
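To see the row-by-row, cell-by-cell walk in isolation, here is a minimal sketch that runs without a .pptx file. It relies on stand-in objects built with SimpleNamespace; fake_cell and fake_table are hypothetical helpers invented for this illustration, and they only mimic the attribute names the script touches (rows, cells, text_frame, paragraphs, runs, text).

```python
from types import SimpleNamespace as NS


def cell_text(cell):
    """Concatenate the text of every run in every paragraph of a cell."""
    return "".join(
        run.text
        for paragraph in cell.text_frame.paragraphs
        for run in paragraph.runs
    )


def table_to_rows(table):
    """Turn a table-like object into a list of rows, each a list of strings."""
    return [[cell_text(cell) for cell in row.cells] for row in table.rows]


def fake_cell(*runs):
    """Hypothetical stand-in for a table cell holding one paragraph of runs."""
    paragraph = NS(runs=[NS(text=t) for t in runs])
    return NS(text_frame=NS(paragraphs=[paragraph]))


# Hypothetical two-row table; the last cell is split into two runs.
fake_table = NS(rows=[
    NS(cells=[fake_cell("Region"), fake_cell("Sales")]),
    NS(cells=[fake_cell("North"), fake_cell("12", "0")]),
])

rows = table_to_rows(fake_table)
print(rows)  # [['Region', 'Sales'], ['North', '120']]
```

Notice that the two runs "12" and "0" are glued together with no separator. The same goes for text in separate paragraphs of a cell, so multi-paragraph cells come out as one unbroken string.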
Ready, Set, Go!
Now that you know what's happening under the hood, it's time to put our read_ppt() function to work:
tables = read_ppt('mypresentation.pptx')
Once you run this code, tables will be your list of pandas DataFrames. Each data frame is a table that our function diligently extracted from the PowerPoint file.
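Because each DataFrame is built from a plain list of rows, its columns are labelled 0, 1, 2 and so on, and any header text sits in the first data row. If your tables do start with a header row (an assumption, since not every slide table has one), a small helper along these lines can promote it to column labels. promote_header is a hypothetical name, and the sample data is made up.

```python
import pandas as pd


def promote_header(df):
    """Return a copy whose first row becomes the column labels."""
    if df.empty:
        return df.copy()
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = df.iloc[0].tolist()
    return body


# Made-up stand-in for one extracted table: header text lives in row 0.
raw = pd.DataFrame([["Region", "Sales"], ["North", "120"], ["South", "95"]])

clean = promote_header(raw)
print(clean)
```

The values are still strings at this point, so numeric columns will probably need a conversion step such as pd.to_numeric before you do any arithmetic on them.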
You can find the entire code here.
So, there you have it - Python's prowess at automating the extraction of data from PowerPoint presentations. With python-pptx and pandas at your disposal, extracting table data from .pptx files is no longer a chore. It's not just a timesaver; it also slashes the risk of the errors that creep in during manual extraction.
So, the next time you're faced with a PowerPoint presentation loaded with tables, you know Python's got your back. With this handy tool in your Python arsenal, even the biggest and most complex PowerPoint presentations won't make you break a sweat!