Unstructured Document Parsing With OpenAI
What happens when the primary champion of your company’s data (namely, grant applications) steps away unexpectedly?
The answer: a big headache.
Here’s the business problem:
I have a large set of unstructured files, in a folder structure that kind of makes sense. Loosely, here’s an example of what one of the folders may look like:
ClientName/
├── 📄 Document1.pdf
├── 📊 Spreadsheet1.xlsx
├── 📝 Notes.docx
├── 📂 2011/
│ ├── 📄 Invoice_Jan.pdf
│ └── 📝 Report_2011.docx
├── 📂 2012/
│ └── 📊 Summary.xlsx
├── 📂 2013/
│ └── 📄 Budget_2013.pdf
├── ⋯
├── 📂 2025/
│ ├── 📄 Invoice_May.pdf
│ └── 📝 Meeting_Notes.docx
├── 📁 Reports/
│ ├── 📄 Annual_Report_2023.pdf
│ └── 📝 Q1_2024_Report.docx
├── 📁 Grant Proposal/
│ ├── 📝 Grant_Draft.docx
│ └── 📁 Supporting_Docs/
│ └── 📊 Budget.xlsx
└── 📁 Misc/
└── 📄 Reference_Materials.pdf
Now, multiply this structure by over 100 clients, and without extensive experience with (or knowledge of) the structure of this repository, you’re going to have a bad time… Unless you have ChatGPT assist you with writing a Python parser for that content, and then use OpenAI’s API to analyze the content of each document 😄
In this post, I’ll be going over my approach to solving this problem, the issues I ran into, and the benefits I found in using AI to analyze all of these documents.
With an understanding of the desired outcome, and armed with a large set of unstructured data, I got to work. Naturally, as it’s 2025, I used ChatGPT to assist with getting me off the ground. I know some Python, but not nearly enough to begin writing this from scratch. Here’s my initial prompt to ChatGPT:
I want you to assist me in developing a python script that will iterate over a working directory. This working directory contains .docx, .xlsx files, and .pdf files. At a high level, these files relate to grant writing proposals. I want this Python script to examine each file and output relevant information (grant award winner, grant award amount, etc.) about the grant (separated by client name and year, this can be determined by folder structure) to a CSV. Please propose and develop this script. I’m thinking the best way to accomplish this is to feed each individual file into ChatGPT (programmatically) and capture the information from the file itself using the API, but I am open to any suggestions you may have. Please ask for any clarifying questions that will help you create this more effectively.
Sidebar — I’ve been working with “AI” since ChatGPT’s release. Bit of an (off topic) ramble ahead:
- It’s AMAZING to see how far it’s come. This turned something that was likely a week’s worth of work into just a few days for me.
- If you have a good understanding of prompt engineering (namely, what you expect to get out of the response), it’s immensely helpful and should be used daily, or whenever you’re working on solving a problem.
- In my day job, I sometimes run ‘art of the possible’ sessions, where I demonstrate the capabilities of Copilot from an enterprise standpoint. It’s remarkable how users will interact with the tool, just based on its ability to understand natural language.
- It’s better at some things than others. If I wanted to rewrite this entire thing in PowerShell (not that I would want to), I’d likely run into more issues trying to get it to work, and I’d imagine it’d be less performant than Python.
ChatGPT returned a high-level expected JSON object and asked some clarifying questions about the folder structure. After answering, we were off to the races with the first draft of a script:
Feel free to skip this if you’re not interested in the “in the weeds” logic.
import os
import re
import csv
import time
import argparse
from datetime import datetime
from pathlib import Path
import openai
import pandas as pd
from docx import Document
from PyPDF2 import PdfReader
from openpyxl import load_workbook
# ──────────────────────────────────────────────────────────────────────────────
# CONFIGURATION
# ──────────────────────────────────────────────────────────────────────────────
# 1) Set your OpenAI API key here or as an environment variable.
openai.api_key = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY_HERE")
# 2) The prompt template you’ll use. We ask ChatGPT to output JSON.
PROMPT = """
You are an assistant that extracts grant metadata.
From the text below, return a JSON object with these keys:
- award_winner (string)
- award_amount (string, include currency symbol)
- grant_title (string)
- submission_date (string, ISO format if available)
- award_date (string, ISO format if available)
If a field isn't present, return an empty string.
Respond with _only_ the JSON object.
Text:
\"\"\"
{doc_text}
\"\"\"
"""
# ──────────────────────────────────────────────────────────────────────────────
# HELPERS: text extraction by file type
# ──────────────────────────────────────────────────────────────────────────────
def extract_from_docx(path):
    doc = Document(path)
    return "\n".join(p.text for p in doc.paragraphs if p.text.strip())

def extract_from_xlsx(path):
    wb = load_workbook(path, data_only=True)
    texts = []
    for sheet in wb.worksheets:
        for row in sheet.iter_rows(values_only=True):
            for cell in row:
                if isinstance(cell, str):
                    texts.append(cell)
    return "\n".join(texts)

def extract_from_pdf(path):
    reader = PdfReader(path)
    texts = []
    for page in reader.pages:
        texts.append(page.extract_text() or "")
    return "\n".join(texts)
# ──────────────────────────────────────────────────────────────────────────────
# Determine client & year from path
# ──────────────────────────────────────────────────────────────────────────────
YEAR_RE = re.compile(r'^(19|20)\d{2}$')
def infer_client_and_year(file_path, root_dir):
    rel = Path(file_path).relative_to(root_dir)
    parts = rel.parts[:-1]  # drop the filename
    year = None
    client = None
    # find any 4-digit year folder
    for p in parts:
        if YEAR_RE.match(p):
            year = p
            break
    # client is the folder immediately above the year, or the deepest folder if no year
    if year:
        idx = parts.index(year)
        client = parts[idx-1] if idx > 0 else "Unknown"
    elif parts:
        client = parts[-1]
    else:
        client = "Unknown"
    # fallback to file mtime for year
    if not year:
        ts = os.path.getmtime(file_path)
        year = datetime.fromtimestamp(ts).year
    # strip any date suffix from client (e.g. "AcmeCorp_2023" → "AcmeCorp")
    client = re.sub(r'[_\-\s](19|20)\d{2}$', '', client)
    return client, str(year)
# ──────────────────────────────────────────────────────────────────────────────
# Call OpenAI to parse text → JSON
# ──────────────────────────────────────────────────────────────────────────────
def parse_with_openai(text):
    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "You extract grant data."},
                  {"role": "user", "content": PROMPT.format(doc_text=text)}],
        temperature=0.0,
        max_tokens=300,
    )
    # assume the assistant returns pure JSON
    return response.choices[0].message.content.strip()
# ──────────────────────────────────────────────────────────────────────────────
# MAIN
# ──────────────────────────────────────────────────────────────────────────────
def main(root_dir, out_csv):
    rows = []
    for dirpath, _, filenames in os.walk(root_dir):
        for fn in filenames:
            ext = fn.lower().split('.')[-1]
            if ext not in ('docx', 'xlsx', 'pdf'):
                continue
            full = os.path.join(dirpath, fn)
            try:
                if ext == 'docx':
                    text = extract_from_docx(full)
                elif ext == 'xlsx':
                    text = extract_from_xlsx(full)
                else:  # pdf
                    text = extract_from_pdf(full)
            except Exception as e:
                print(f"⚠️ Failed to extract {full}: {e}")
                continue
            client, year = infer_client_and_year(full, root_dir)
            # send to OpenAI
            try:
                parsed = parse_with_openai(text)
            except Exception as e:
                print(f"⚠️ OpenAI API error on {full}: {e}")
                continue
            # build row
            row = {
                "client": client,
                "year": year,
                "file_path": full,
            }
            # merge in the JSON fields
            try:
                data = pd.read_json(parsed, typ='series')
                row.update(data.to_dict())
            except Exception:
                print(f"⚠️ Could not parse JSON from API for {full}")
                continue
            rows.append(row)
            # optional throttle
            time.sleep(0.2)
    # write out
    df = pd.DataFrame(rows)
    df.to_csv(out_csv, index=False)
    print(f"✅ Wrote {len(df)} records to {out_csv}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Extract grant info from .docx/.xlsx/.pdf via OpenAI")
    parser.add_argument("root_dir", help="Top-level working directory")
    parser.add_argument("out_csv", help="Destination CSV file path")
    args = parser.parse_args()
    main(args.root_dir, args.out_csv)

Initial Code Block from ChatGPT
Aaaaand.. we’re done!
Just kidding.
This is an amazing start, especially for someone who is more skilled with Power Platform and PowerShell, but there were (more than) a few problems to solve first:
- What happens when a PDF is just a bunch of scanned images? We need OCR (Tesseract).
- The OpenAI call was outdated, pre-1.0 SDK syntax... that isn’t gonna fly. (A quick sketch of the newer call style follows this list.)
- Some of these documents are really old, and Windows doesn’t want to open them. How do we gracefully handle a timeout?
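For context, here’s a minimal sketch of the newer (1.0+) call style, assuming the current Python SDK; the prompt and model here are placeholders, not the exact call from my script:

# Pre-1.0 style (what the first draft above used):
#   resp = openai.ChatCompletion.create(model="gpt-4o-mini", messages=[...])
# 1.0+ style (roughly what the reworked script uses):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You extract grant data."},
        {"role": "user", "content": "Return a JSON object describing this grant..."},
    ],
    temperature=0.0,
)
print(resp.choices[0].message.content)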
Add in a few JSON parsing errors, some corrupted files, and a few more issues... After a bit more back and forth with ChatGPT, this is what I came up with:

I’ll be splitting these code blocks into snippets and talking through how they work (and why they’re there, if you’re interested). If not, feel free to skip down to the conclusion at the bottom.
# ──────────────────────────────────────────────────────────────────────────────
# CONFIGURATION
# ──────────────────────────────────────────────────────────────────────────────
FIXER_SYSTEM = '''
You are a JSON-fixing assistant. A user tried to parse the following text as JSON and failed.
Return a valid JSON object by correcting any trailing commas, mismatched quotes, or missing brackets.
Respond with only the corrected JSON, nothing else.
'''
PROMPT = '''
You are an assistant that extracts grant metadata.
From the text below, return a JSON object with these keys:
- award_winner (string)
- award_winner_organization (string; if grant award mentions person(s) and organization by name.)
- award_amount (string, include currency symbol; Please note that there may be multiple mentions of monies. The one we are concerned about will likely be prefixed with award, amount requested, etc.)
- funding_type (string; determined based on if the description of monies in the metadata is restricted or unrestricted. Please use your best judgement. Respond with 'Restricted', 'Unrestricted', or 'Unsure *with your best guess*')
- program_area (string; tags of what the grant monies are applied to. There are 5 possible tags: Climate Justice, Environmental Justice, Air and Water Quality, Regenerative Ecosystems, Pesticides and Communities. Please use your best judgement. Respond with one of the five possible tags.)
- grant_title (string)
- submission_date (string, ISO format if available)
- award_date (string, ISO format if available)
- summary (string; 1-2 sentence high-level summary of what this document is about. Begin summary by stating if the document appears as a grant award or not.)
- scraped_emails (string; semicolon delimited of all emails found in the document)
If a field isn't present, return an empty string.
Respond with _only_ the JSON object.
Text:
"""
{doc_text}
"""The above is pretty straightforward, as you might imagine. This is the prompt fed to ChatGPT, with the doc_text at the bottom.
Pro tip: you can use a reasoning model (o3, o4-mini, etc.) to evaluate your prompt and make suggestions to improve it.
I also used ChatGPT to build a Counter-Strike betting analytics engine, and along the way I learned that you can feed a function definition into the API to guarantee a JSON object is returned to you, saving on tokens (you don’t need to say GIVE ME JSON when it already knows to return JSON) and ensuring consistency.
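Here’s a rough sketch of that function-calling approach with the current SDK; the function name and schema are hypothetical, just to show the shape, not what I actually used:

import json
from openai import OpenAI

client = OpenAI()

# Hypothetical schema for illustration; your keys would mirror the prompt above.
grant_tool = {
    "type": "function",
    "function": {
        "name": "record_grant_metadata",
        "description": "Record extracted grant metadata.",
        "parameters": {
            "type": "object",
            "properties": {
                "award_winner": {"type": "string"},
                "award_amount": {"type": "string"},
                "grant_title": {"type": "string"},
            },
            "required": ["award_winner", "award_amount", "grant_title"],
        },
    },
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract grant metadata from: ..."}],
    tools=[grant_tool],
    tool_choice={"type": "function", "function": {"name": "record_grant_metadata"}},
)

# The arguments come back as a JSON string matching the schema.
args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
print(args)

In this script I stuck with the prompt-plus-JSON-fixer approach below, but structured output like this is what I’d reach for next time.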
# ──────────────────────────────────────────────────────────────────────────────
# TEXT EXTRACTION
# ──────────────────────────────────────────────────────────────────────────────
def extract_from_docx(path):
    doc = Document(path)
    return "\n".join(p.text for p in doc.paragraphs if p.text.strip())

def extract_from_doc(path):
    try:
        import win32com.client
        import pythoncom  # needed for CoInitialize below
        pythoncom.CoInitialize()
        word = win32com.client.Dispatch("Word.Application")
        word.Visible = False
        doc = word.Documents.Open(path)
        temp_dir = tempfile.gettempdir()
        temp_docx = os.path.join(temp_dir, os.path.basename(path) + ".docx")
        doc.SaveAs(temp_docx, FileFormat=16)  # wdFormatDocumentDefault
        doc.Close()
        word.Quit()
        document = Document(temp_docx)
        text = "\n".join(p.text for p in document.paragraphs if p.text.strip())
        return text.strip()
    except Exception as e:
        print(f"⚠️ Failed to convert or extract .doc from {path}: {e}")
        return ""
def extract_from_xlsx(path):
    wb = load_workbook(path, data_only=True)
    texts = []
    for sheet in wb.worksheets:
        for row in sheet.iter_rows(values_only=True):
            for cell in row:
                if isinstance(cell, str):
                    texts.append(cell)
    return "\n".join(texts)
def extract_from_pdf(path):
    try:
        reader = fitz.open(path)
        if reader.is_encrypted:
            print(f"⚠️ Skipping password-protected PDF: {path}")
            return ""
    except Exception as e:
        print(f"⚠️ Could not open PDF {path}: {e}")
        return ""
    pages = []
    for page in reader:
        try:
            pages.append(page.get_text() or "")
        except Exception as e:
            print(f"⚠️ Error extracting text for PDF {path}: {e}")
    text = "\n".join(pages).strip()
    if len(text) < 100:
        ocr_text = []
        for page in reader:
            try:
                pix = page.get_pixmap(dpi=200)
                img = Image.open(io.BytesIO(pix.tobytes("png")))
                ocr_text.append(pytesseract.image_to_string(img))
            except Exception as oe:
                print(f"⚠️ OCR failed for PDF {path} page: {oe}")
        text = "\n".join(ocr_text)
    reader.close()
    return text

Here’s our block of document extractors. Notable items in this block:
- I ran into issues with .doc files — to get around it, we just resave the file as .docx (via Word automation) and then feed it into the prompt.
- Some of the PDFs were images, or password-protected. Because I wanted this to run unattended, we just skip the password-protected documents. For the images: if the length of text found in the PDF was less than 100 characters, we use pytesseract to convert each page image into a string and feed that into the prompt.
# ──────────────────────────────────────────────────────────────────────────────
# TIMEOUT WRAPPER
# ──────────────────────────────────────────────────────────────────────────────
def run_with_timeout(func, *args, timeout=25):
    """Run a function with a timeout."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        future = executor.submit(func, *args)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            print(f"⚠️ Function {func.__name__} timed out after {timeout} seconds.")
            return None
I ran into some issues with corrupted or weird documents. Enter run_with_timeout. This allows for:
- Multithreading (we ended up being limited by OpenAI’s rate limits, but that’s no big deal. I probably should’ve used batching, where you send a bunch of requests off to OpenAI and then wait for them to be completed, but this seemed like a more straightforward approach. If I ever have to do this on a larger scale, I’ll look into that capability; there’s a rough sketch of the Batch API flow after this list).
- Auto function kill — if it hangs on a document, it has 25 seconds until it gives up and moves on to the next one. I learned this after leaving it overnight and finding it hung halfway through. :D
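For reference, here’s roughly what that Batch API flow looks like; requests.jsonl and its contents are hypothetical, just to show the shape:

from openai import OpenAI

client = OpenAI()

# Hypothetical: requests.jsonl holds one chat.completions request per line,
# each with a unique custom_id, e.g.
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [...]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Poll until the batch finishes, then download the results file.
status = client.batches.retrieve(batch.id)
if status.status == "completed":
    results = client.files.content(status.output_file_id)
    print(results.text)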
# ──────────────────────────────────────────────────────────────────────────────
# CLIENT/YEAR INFERENCE
# ──────────────────────────────────────────────────────────────────────────────
YEAR_RE = re.compile(r'^(19|20)\d{2}$')

def infer_client_and_year(file_path, root_dir):
    rel = Path(file_path).relative_to(root_dir)
    parts = rel.parts
    client = parts[1] if len(parts) > 1 else "Unknown"  # Second folder level
    year = next((p for p in parts if YEAR_RE.match(p)), None)
    if not year:
        year = str(datetime.fromtimestamp(os.path.getmtime(file_path)).year)
    return client, year
Because the folder structure isn’t consistent, we use regex to pull the year out of the path. If a year isn’t found, we fall back to the file’s last-modified date.
We also pull the client name out of the second level of the folder structure (this worked well).
def parse_with_openai(text):
    def _call_openai(model, prompt, fixer_system=None):
        try:
            resp = openai.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You read through extracted text and do your best to extract requested information from it."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.0,
                max_tokens=400,
            )
            candidate = resp.choices[0].message.content.strip()
            # extract_json_substring is a small helper (in the full script, not shown here)
            # that trims any stray text around the JSON object.
            json_sub = extract_json_substring(candidate)
            return json.loads(json_sub), candidate
        except openai.BadRequestError as e:
            # Check for context length error
            if "context length" in str(e).lower() or "maximum context length" in str(e).lower():
                return None, "CONTEXT_LENGTH_EXCEEDED"
            raise
        except json.JSONDecodeError:
            if fixer_system:
                try:
                    fix = openai.chat.completions.create(
                        model=model,
                        messages=[
                            {"role": "system", "content": fixer_system},
                            {"role": "user", "content": candidate}
                        ],
                        temperature=0.0,
                        max_tokens=400,
                    ).choices[0].message.content.strip()
                    fixed_str = extract_json_substring(fix)
                    return json.loads(fixed_str), candidate
                except Exception as e:
                    print(f"⚠️ JSON still invalid after fix: {e}")
                    return None, candidate
            else:
                return None, candidate

    # Try with gpt-4o-mini first
    prompt = PROMPT.format(doc_text=text)
    result, candidate = _call_openai("gpt-4o-mini", prompt, FIXER_SYSTEM)
    if candidate == "CONTEXT_LENGTH_EXCEEDED":
        # Try with gpt-4-1106-preview (128k context window)
        try:
            result, candidate = _call_openai("gpt-4-1106-preview", prompt, FIXER_SYSTEM)
        except Exception as e:
            print(f"⚠️ Failed with larger context model: {e}")
            return None, None
    return result, candidate

Here’s our function that sends our prompt and text to OpenAI. Some notes:
- Temperature is 0.0 — meaning the output will be almost entirely deterministic. Accuracy is important here.
- Some documents were huge, so gpt-4o-mini would sometimes hit its context limit (the CONTEXT_LENGTH_EXCEEDED path above). To get around this, we fall back to gpt-4-1106-preview, a model with a much larger (128k) context window. Now, I’ve since spent more time working with the API, and I realize that trimming or chunking the extracted text to fit the context window would’ve probably worked as well, but that’s for another day. (A rough sketch of trimming follows this list.)
# ──────────────────────────────────────────────────────────────────────────────
# MAIN LOOP
# ──────────────────────────────────────────────────────────────────────────────
def main(root_dir, out_csv):
    processed_files_log = f"{out_csv}.processed"  # Log file to track processed files
    processed_files = set()
    # Load already processed files if resuming
    if os.path.exists(processed_files_log):
        with open(processed_files_log, "r") as log:
            processed_files = set(log.read().splitlines())
    # Ensure the CSV file exists and has headers
    if not os.path.exists(out_csv):
        with open(out_csv, "w", newline="") as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=["client", "year", "file", "file_name", "is_grant_award", "funding_type", "program_area"] + EXPECTED_KEYS)
            writer.writeheader()
    total_files = sum(
        len([f for f in files if f.lower().rsplit('.', 1)[-1] in ("doc", "docx", "xlsx", "pdf") and '~' not in f])
        for _, _, files in os.walk(root_dir)
    )
    processed_count = 0
    for dirpath, _, files in os.walk(root_dir):
        for fn in files:
            if '~' in fn:  # Skip files with '~' in their names (Office temp/lock files)
                continue
            ext = fn.lower().rsplit('.', 1)[-1]
            if ext not in ("doc", "docx", "xlsx", "pdf"):
                continue
            full = os.path.join(dirpath, fn)
            file_name = os.path.basename(full)
            # Skip already processed files
            if full in processed_files:
                print(f"Skipping already processed file: {file_name}")
                processed_count += 1
                continue
            processed_count += 1
            print(f"Processing file {processed_count}/{total_files}: {file_name}")
            try:
                if ext == "docx":
                    text = run_with_timeout(extract_from_docx, full)
                elif ext == "doc":
                    text = run_with_timeout(extract_from_doc, full)
                elif ext == "xlsx":
                    text = run_with_timeout(extract_from_xlsx, full)
                else:
                    text = run_with_timeout(extract_from_pdf, full)
                if text is None:  # Skip to the next file if timeout occurred
                    continue
            except Exception as e:
                print(f"⚠️ Extraction failed for {full}: {e}")
                continue
            client, year = infer_client_and_year(full, root_dir)
            parsed, candidate = parse_with_openai(text)
            if parsed is None:
                with open("failed_json.log", "a", encoding='utf-8') as flog:
                    flog.write(f"--- {full}\n{candidate}\n\n")
                continue
            is_grant_award = "TRUE" if parsed.get("summary", "").startswith("The document appears as a grant") else "FALSE"
            row = {
                "client": client,
                "year": year,
                "file": full,
                "file_name": file_name,
                "is_grant_award": is_grant_award,
                "funding_type": parsed.get("funding_type", ""),
                "program_area": parsed.get("program_area", "")
            }
            for k in EXPECTED_KEYS:
                row[k] = parsed.get(k, "")
            # Append the row to the CSV file
            with open(out_csv, "a", newline="", encoding='utf-8') as csvfile:
                writer = csv.DictWriter(csvfile, fieldnames=row.keys())
                writer.writerow(row)
            # Log the processed file
            with open(processed_files_log, "a", encoding='utf-8') as log:
                log.write(full + "\n")
    print(f"✅ Processed {processed_count} files. Results saved to {out_csv}")

Here’s the main big boy that puts it all together. Final notes (we’re almost there, I promise):
- You’ll notice that there’s a processed_files_log. I didn’t want to waste compute or API credits by resending data in the event of the script hanging, so we log each processed file by path and skip it if it’s already been analyzed. We also write each row to the CSV as it’s processed instead of holding everything in memory (because that would be dumb).
- Here’s the excerpt of the headers written to the CSV: the base columns (client, year, file, file_name, is_grant_award, funding_type, program_area) plus the model-extracted keys in EXPECTED_KEYS:
EXPECTED_KEYS = [
    "award_winner", "award_winner_organization", "award_amount",
    "grant_title", "submission_date", "award_date",
    "summary", "scraped_emails",
]
So, there you have it. Using OpenAI and Python, I took a daunting ‘dataset’, and turned it into something that’s actually digestible by someone who may be coming into this role fresh. I hope you enjoyed the read as much as I enjoyed solving the problem.
Oh, also, you may be wondering how much this cost (beyond my time, of course). From an API perspective, it was less than $5:

— Ian Gallimore
You made it! Thanks for your attention. If anything you read here today interests you, or you’d like to discuss some problems you’d like to solve, please feel free to add me on LinkedIn or send me an email:
https://www.linkedin.com/in/ianmgallimore/
ianmgallimore@gmail.com