Microsoft Word 2003 and prior used the DOC file format to save documents. This is a proprietary binary file format. PowerGREP can convert DOC files to plain text so that you can search through them.
Microsoft Word 2007 and later use the DOCX file format. DOCX files are technically ZIP archives that contain XML and assorted other files. Although DOCX is an open format, the XML it uses is complex. PowerGREP can convert DOCX files to plain text so you can search through them without having to deal with the XML. PowerGREP can also convert the plain text back into the original DOCX file, so you can search-and-replace through DOCX files as well.
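To make the "ZIP archive that contains XML" structure concrete, here is a minimal Python sketch that is independent of PowerGREP. It builds a tiny DOCX-like archive in memory and reads it back. The sample is an assumption made for illustration: it holds only two parts and is not a complete document that Word would open. The helper names `build_sample_docx`, `list_parts`, and `read_part` are hypothetical.

```python
import io
import zipfile

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_docx():
    """Build a tiny DOCX-like ZIP archive in memory (illustrative, not a full document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr(
            "word/document.xml",
            '<w:document xmlns:w="' + W_NS + '">'
            "<w:body><w:p><w:r><w:t>Hello world</w:t></w:r></w:p></w:body>"
            "</w:document>",
        )
    return buffer.getvalue()


def list_parts(docx_bytes):
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


def read_part(docx_bytes, name):
    """Return one stored file, decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read(name).decode("utf-8")


sample = build_sample_docx()
parts = list_parts(sample)
body_xml = read_part(sample, "word/document.xml")
print(parts)
print(body_xml)
```

In a real DOCX file the body text lives in `word/document.xml`, alongside many other parts for styles, settings, and metadata.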
To be able to search through Word documents as if they were plain text documents, you need to set the “file formats to convert to plain text” on the File Selector panel to a configuration that converts Word documents to plain text. In the configuration, the option “Use PowerGREP’s built-in decoder to convert files to plain text” should be turned on for the file formats “Microsoft Word 95 to 2003 (DOC)” and “Microsoft Word 2007 to 2016”. Default configurations that use these options are “proprietary formats”, “all formats”, “attachments & proprietary formats”, and “attachments & all formats”.
If you want to search only through Word documents, enter the file mask *.do[ct];*.do[ct][xm] in the “include files” box on the File Selector panel. If you leave the “include files” and “exclude files” boxes blank, then PowerGREP searches through the plain text conversion of all file formats enabled by the configuration, as well as through the raw contents of all files that are not recognized as one of those file formats.
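As a rough illustration of which file names the mask `*.do[ct];*.do[ct][xm]` selects, the Python sketch below evaluates it with the standard `fnmatch` module. It assumes that the semicolon separates alternative masks and that matching is case insensitive. Both are assumptions of this sketch, which approximates the idea and does not reproduce PowerGREP's own matching code. The function name `matches_mask` is hypothetical.

```python
from fnmatch import fnmatchcase

MASK = "*.do[ct];*.do[ct][xm]"


def matches_mask(filename, mask=MASK):
    """Return True if the file name matches any of the semicolon-delimited masks."""
    # Lowercase the name so the comparison ignores case (an assumption of this sketch).
    name = filename.lower()
    return any(fnmatchcase(name, pattern) for pattern in mask.split(";"))


names = [
    "report.doc", "template.dot", "letter.docx", "macro.docm",
    "form.dotx", "form.dotm", "notes.txt", "archive.docs",
]
matched = [name for name in names if matches_mask(name)]
print(matched)
```

The first mask covers documents and templates in the old binary format (`.doc`, `.dot`). The second covers the four newer extensions (`.docx`, `.docm`, `.dotx`, `.dotm`).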
This file selection is available in the PowerGREP5.pgl library as “Office: Search through Word documents”.
To indicate which Word documents to search through, click on the folder that contains them in the “folders and files” tree. Then select either “Include File or Folder” or “Include Folder and Subfolders” from the File Selector menu.
Finally, prepare and execute your search on the Action panel.
When PowerGREP converts Word documents to plain text, you can only search through the body text of the documents. The conversion does not include any metadata, so you can’t search through that. For DOC files, the plain text conversion is the only way to search.
If you are familiar with the XML format used by DOCX files, you can tell PowerGREP to search through the raw XML instead. This allows you to search for anything in the files, as long as you know how it is represented in the XML. To do so, select a file format configuration on the File Selector panel that uses the option “search through the individual files inside the compound document” for the “Microsoft Word 2007 to 2016” file format. Default configurations that do so are “compound documents”, “compound documents & proprietary formats”, and “compound documents & writable proprietary formats”. Choose “compound documents & proprietary formats” if you want to search through the plain text conversion of DOC files in addition to searching through the XML inside DOCX files. The other two configurations skip DOC files.
If you want to search only through DOCX files, enter the file mask *.do[ct][xm] in the “include files” box on the File Selector panel.
This file selection is available in the PowerGREP5.pgl library as “Office: Search through the raw XML inside DOCX files”.
To effectively work with the XML, you will likely want to use file sectioning. This makes it easy to restrict the main part of the action to the contents of specific XML tags. To search through the body text of DOCX files, for example, take these steps on the Action panel:
Note that in DOCX files, paragraphs with mixed formatting (bold, italics, etc.) are broken up into multiple <w:t> tags, one for each block of text with contiguous formatting. This means that the PowerGREP action above processes each contiguously formatted part of a paragraph as a separate section. The action will not find search terms that span more than one section.
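The effect of this splitting can be shown with a small Python sketch that is independent of PowerGREP. The paragraph XML below is a hand-written sample, assumed for illustration, in which the word “PowerGREP” is split across two runs because its second half is bold. The regular expression and the helper `text_sections` are hypothetical. They only mimic the idea of treating each <w:t> element as its own section.

```python
import re

# Hand-written sample paragraph: "Power" is plain, "GREP" is bold.
PARAGRAPH_XML = (
    "<w:p>"
    "<w:r><w:t>Power</w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>GREP</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> searches files</w:t></w:r>'
    "</w:p>"
)

# Matches <w:t> with or without attributes, but not other tags such as <w:tab/>.
TEXT_RUN = re.compile(r"<w:t(?:\s[^>]*)?>(.*?)</w:t>", re.DOTALL)


def text_sections(xml):
    """Return the contents of each <w:t> element as a separate section."""
    return TEXT_RUN.findall(xml)


sections = text_sections(PARAGRAPH_XML)

# Searching each section on its own misses a term that straddles two runs.
found_per_section = any("PowerGREP" in section for section in sections)

# Joining the sections first shows that the term is present in the paragraph.
found_joined = "PowerGREP" in "".join(sections)

print(sections)
print(found_per_section, found_joined)
```

If you need to find terms that cross formatting boundaries, searching the plain text conversion described earlier avoids the problem, because that conversion does not expose the XML tags.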
This action is available in the PowerGREP5.pgl library as “Office: Search printable text in the raw XML inside DOCX files”.