I recently had to migrate a stack of web pages from Adobe Experience Manager (AEM) into Sitecore. Migrating a website from one platform to another usually involves exporting content out of the old system and importing it into the new one. Nine times out of ten, that content comes in a structured format like JSON, which simplifies the import process. But this time, I was presented with a unique challenge: exporting content out of Microsoft Word document files (.docx)!
Content in AEM Franklin is authored in documents (think Google Docs or Word) and rendered as web pages via “blocks” like tables, columns, and tabs. While Franklin leans into document-based authoring, Sitecore is built and used in a completely different way!
This posed the question: If this content is in .docx files, how do I migrate this into Sitecore?!
Then, like a bolt of lightning, the answer hit me: why not use AI? I’d been looking for an excuse to leverage AI for some of the more time-consuming aspects of web development, and this was the perfect use case. I would use AI to convert the .docx files into a JSON array of pages, extracting all of the content based on whatever structure was available in the documents.
Why .docx to JSON?
Franklin/Edge Delivery is happy with document sources, but I needed to reuse the same content in Sitecore, which meant transforming each Word page into structured JSON for bulk import. The upside: once you have JSON, it’s easy to parse, import, validate, and adjust if content changes.
Benefits I saw:
- Time saved: I did not need to manually parse or enter content.
- Fewer mistakes: since I didn’t have to make any manual updates, there were fewer chances for user error.
- Easy updates: thankfully, I didn’t have to deal with content changing all that much. In a scenario where authors were still actively updating content on the AEM live site, this would make it much easier to keep content up to date on the new Sitecore site.
Using AI to Convert Data
I zipped up all of the .docx files and asked AI to create a JSON array where each element in the array represented a page.
I provided some basic parameters, such as how I wanted the data of each page structured:
- slug
- title
- content[] (headings, paragraphs)
- metadata (title/description/tags/image)
- special blocks:
  - columns: { title, leftColumn, rightColumn }
  - peopleCards: ["/m/mary-martinez", ...]
  - hero: { image }
  - officeInfo.websiteLink: { url, linkText }
  - secondaryNavigation
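For illustration, a single page under this schema might look like the following (the values here are hypothetical, apart from the people card path and image path, which come from the real data):

{
  "slug": "albuquerque",
  "title": "Albuquerque",
  "content": [
    { "type": "heading", "text": "Welcome to Albuquerque" },
    { "type": "paragraph", "text": "Intro copy for the page..." }
  ],
  "metadata": {
    "title": "Albuquerque",
    "description": "Our Albuquerque office.",
    "tags": ["office"],
    "image": "/OfficeImagesColumns/albuquerque/hero.jpeg"
  },
  "columns": {
    "title": "Academy",
    "leftColumn": "...intro text...",
    "rightColumn": "/OfficeImagesColumns/albuquerque/academy-columns-right.jpeg"
  },
  "peopleCards": ["/m/mary-martinez"],
  "hero": { "image": "/OfficeImagesColumns/albuquerque/hero.jpeg" },
  "officeInfo": {
    "websiteLink": { "url": "https://example.com/albuquerque", "linkText": "Visit our site" }
  },
  "secondaryNavigation": []
}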
For tables, I taught AI how to map document tables into objects. A row like:
| Section Metadata |            |
| Style            | Opacity 30 |
became:
"sectionMetadata": { "style": "Opacity 30" }
And when a table’s header is “Columns”, I converted:
| Columns          |                                                             |
| Academy          |                                                             |
| ...intro text... | /OfficeImagesColumns/albuquerque/academy-columns-right.jpeg |
to:
"columns": {
"title": "Academy",
"leftColumn": "...intro text...",
"rightColumn": "/OfficeImagesColumns/albuquerque/academy-columns-right.jpeg"
}
Getting the Images
Most images lived inside the Word files (like the “right” image in the Columns block above). When they didn’t, I asked AI to fall back to the live page:
- Primary: extract images embedded in the .docx.
- Fallback: fetch the page’s <meta property="og:image"> and use that as:
  - metadata.image (if missing)
  - columns.rightColumn (if missing)
This was reliable and simpler than DOM scraping because every page I checked set og:image to the correct hero/visual.
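Both paths are simple enough to sketch yourself. Here’s a rough PowerShell equivalent of what I asked AI to do (the file names and page URL below are hypothetical):

# Primary: a .docx is just a zip archive, so embedded images live under word\media.
Copy-Item ".\docs\albuquerque.docx" ".\docs\albuquerque.zip"
Expand-Archive -Path ".\docs\albuquerque.zip" -DestinationPath ".\extracted\albuquerque"
Get-ChildItem ".\extracted\albuquerque\word\media"

# Fallback: fetch the live page and pull the og:image URL out of the markup.
# (The regex assumes a simple property/content attribute order.)
$page = Invoke-WebRequest -Uri "https://example.com/m/albuquerque" -UseBasicParsing
if ($page.Content -match '<meta property="og:image" content="([^"]+)"') {
    Invoke-WebRequest -Uri $Matches[1] -OutFile ".\images\albuquerque\hero.jpeg"
}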
Once all of the images were gathered and their relative paths added to the JSON data, I asked AI to zip up those images in the same folder structure. This allowed me to simply upload the whole zip file to the Sitecore Media Library and have Sitecore auto-extract it, in the exact structure that I needed!
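The zipping step is a one-liner if you ever need to reproduce it yourself (the folder name below is hypothetical):

# Compress the image folder, preserving its structure, for a Media Library upload.
Compress-Archive -Path ".\OfficeImagesColumns" -DestinationPath ".\media-import.zip"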
Gotchas I Hit
It took some refinement to make sure that everything was exported smoothly and accurately. Here are a few gotchas I ran into:
Table Name Inconsistencies
Some docs had tables named slightly differently. For example, 90% of documents had a “People Cards” table, but in the other 10%, it was called “People Cards (Phone)”.
I solved this by instructing AI to treat variants like these as the same property type.
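If you were doing the mapping in code instead of via a prompt, the same idea is just a lookup table of aliases (a sketch; $rawTableName is a hypothetical variable holding the table’s header text):

# Normalize known block-name variants before mapping tables to JSON.
$blockAliases = @{
    "People Cards (Phone)" = "People Cards"
}
$blockName = $rawTableName
if ($blockAliases.ContainsKey($blockName)) {
    $blockName = $blockAliases[$blockName]
}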
Image Detection
In some cases, the images weren’t extracting properly from the .docx files. I believe this was due to the way the images were initially copied into the documents, which likely left them formatted differently.
I solved this by instructing AI to find the live web page for the document, extract its og:image URL, and download the image.
Importing into Sitecore
At this point, I was almost done with the tough stuff! The last thing to do was to write a PowerShell script to read the JSON data and create Sitecore items. Thankfully, there wasn’t too much data on each individual page to import (around 15–20 fields each), so my script was quite simple to write.
I’ve provided a basic sample script below if you’re curious how to convert JSON data into Sitecore items:
# Your JSON file path.
$AllMarketsJsonPath = [System.Web.Hosting.HostingEnvironment]::MapPath("~/App_Data/AllMarkets.json")

# Read the JSON data.
$markets = Get-Content -LiteralPath $AllMarketsJsonPath -Raw | ConvertFrom-Json

# Loop through each object in the JSON array.
foreach ($market in $markets.markets) {
    # Create market item...
    # Update market item...
    # etc...
}
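For the “create” and “update” steps, Sitecore PowerShell Extensions does the heavy lifting. Here’s a minimal sketch of the loop body, assuming a hypothetical “Market Page” template and content root (field names are assumptions too; map them to your own template):

# Create the item under a parent path (template and paths are hypothetical).
$item = New-Item -Path "master:/sitecore/content/Home/Markets" -Name $market.slug -ItemType "Project/Market Page"

# Populate fields from the JSON.
$item.Editing.BeginEdit()
$item["Title"] = $market.title
$item["Meta Description"] = $market.metadata.description
$item.Editing.EndEdit()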
Final Thoughts
I was really happy that this method proved successful, because otherwise this could have been a very time-intensive migration. I’m glad to now know that AI can reliably convert data sets like .docx files into something much more developer-friendly. I will definitely be using this methodology going forward for any complicated migrations!
Thanks for reading, and if you have any questions or have any commentary to provide, don’t hesitate to drop a comment below!
Until next time, Happy Sitecoreing!
