With its default settings, PowerGREP does a very good job of automatically handling all Unicode text files. When you have files in a variety of legacy encodings that cannot be auto-detected, you can use a text encoding configuration to make sure PowerGREP always shows you the correct text. PowerGREP is also very flexible at handling files that contain bytes that aren’t strictly valid for their encoding. Even when searching and replacing through such files, PowerGREP preserves any invalid bytes in the files.
But many other applications aren’t as flexible. Many scripting languages, for example, raise an error that crashes your script when it reads a file as UTF-8 and the file contains even one byte that is not part of a valid UTF-8 sequence. This example shows how you can disable PowerGREP’s smart handling of text file encodings and instead look at the raw bytes in UTF-8 files. Then you can search for bytes that aren’t part of valid UTF-8 sequences.
After running this example, the Results panel will show you all the bytes that PowerGREP found that aren’t part of valid UTF-8 sequences. This allows you to fix those files manually, or to determine why they are not valid UTF-8.
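For comparison, here is a minimal Python sketch of the same idea outside PowerGREP: it scans raw bytes and reports every run of bytes that is not part of a valid UTF-8 sequence. The function name `find_invalid_utf8` is hypothetical, and what counts as invalid here is decided by Python's strict UTF-8 decoder, which may not agree in every edge case with the regular expression used by the PowerGREP action.

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, bytes) for each run that is not valid UTF-8.

    Hypothetical helper for illustration; it relies on Python's strict
    UTF-8 decoder to decide which bytes are invalid.
    """
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means no more invalid bytes.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start = pos + err.start
            end = pos + err.end
            bad.append((start, data[start:end]))
            pos = end  # resume scanning right after the invalid run
    return bad
```

To check a file, you could pass it the result of `pathlib.Path(name).read_bytes()`. The repeated slicing keeps the sketch short but copies data on every error, so it is not meant for very large files with many invalid bytes.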
If you just want to get a list of files that aren’t valid UTF-8 without seeing the individual bytes, load the action “Encodings: Find files with bytes that are not part of valid UTF-8 sequences” from the library instead. This action uses the “list files” rather than the “search” action type. This is faster because PowerGREP moves on to the next file as soon as it finds one invalid byte.
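The same early-exit idea can be sketched in Python: read each file in chunks and stop at the first decoding error instead of collecting every invalid byte. The names `has_invalid_utf8` and `list_invalid_files` and the chunk size are assumptions made for this illustration, not part of PowerGREP.

```python
import codecs
from pathlib import Path


def has_invalid_utf8(path, chunk_size=65536):
    """Return True as soon as one invalid UTF-8 byte is found (hypothetical helper)."""
    # The incremental decoder copes with sequences split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            try:
                decoder.decode(chunk)
            except UnicodeDecodeError:
                return True  # stop reading this file immediately
    try:
        # final=True reports a multi-byte sequence cut off at end of file.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return True
    return False


def list_invalid_files(folder):
    """List files under folder that contain bytes that are not valid UTF-8."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and has_invalid_utf8(p)
    )
```

Because `has_invalid_utf8` returns on the first error, a large file with an invalid byte near its start is only partly read.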
If you want to remove the offending bytes, load the action “Encodings: Delete bytes that are not part of valid UTF-8 sequences” from the library. This uses the “search and delete” action type to delete all bytes matched by the regular expression.
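As a rough Python equivalent of deleting the offending bytes, you can decode with the `ignore` error handler and encode again. The function name `strip_invalid_utf8` is hypothetical, and this sketch follows Python's definition of invalid UTF-8, which may differ in edge cases from the bytes matched by the PowerGREP action.

```python
def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop bytes that are not part of valid UTF-8 sequences (hypothetical helper)."""
    # errors="ignore" silently skips every undecodable byte;
    # re-encoding reproduces the valid sequences unchanged.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


cleaned = strip_invalid_utf8(b"abc\xffdef")
```

Work on a copy of your files when trying this, since the deleted bytes cannot be recovered from the cleaned output.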