JpnIHDS.dat - File Format Details and an Introduction to the Parser

Updated: 2021.07.09

.header

Recently, during one of our penetration tests, we spent some time extracting all possible information from a machine we had gained access to.
We noticed that one of the system files contained user-entered text. Although this behavior is not hidden
(the GUI has an option to disable it), some users may consider it a privacy issue.

.text

If you use a Windows PC with Japanese localization, the default input method is Microsoft IME.

While there are several ways to enter text containing Kanji, most users rely on Romaji input.
This means that you type the sounds and let the system guess which Kanji you intended to enter (see picture 1).

Picture 1. IME hint example.

At the same time, the system tries to remember the choices you made in the past, so the most frequent results appear at the top of the list.
This is expected behavior, and there are settings to control it.

Picture 2. Settings that control the Predictive input.

Some people have already asked how to clear it programmatically
on [StackOverflow].
However, I could not find any notes on the contents of this file, so I went looking inside.

Predictive input history file location:

%UserProfile%\AppData\Roaming\Microsoft\InputMethod\Shared\JpnIHDS.dat
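
For scripting, the location above can be resolved from the environment. The following is a minimal Python sketch, not part of the original tooling; the names `RELATIVE_PATH` and `history_path` are made up for illustration, and it assumes the `USERPROFILE` variable points at the profile of the user of interest.

```python
import os
from pathlib import PureWindowsPath
from typing import Mapping, Optional

# Location of the history file relative to the user profile, as given above.
RELATIVE_PATH = PureWindowsPath(
    "AppData", "Roaming", "Microsoft", "InputMethod", "Shared", "JpnIHDS.dat"
)


def history_path(env: Optional[Mapping[str, str]] = None) -> PureWindowsPath:
    """Build the full path to JpnIHDS.dat from the USERPROFILE variable.

    `env` defaults to the process environment; passing a mapping makes it
    possible to resolve the path for another profile (hypothetical helper).
    """
    env = os.environ if env is None else env
    profile = env.get("USERPROFILE")
    if not profile:
        raise KeyError("USERPROFILE is not set; pass the profile directory explicitly")
    return PureWindowsPath(profile) / RELATIVE_PATH
```

`PureWindowsPath` is used so the path can also be built on a non-Windows analysis machine, for example when working from a disk image.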

The file contains information about the most recent auto-substitution choices for the Japanese IME keyboard (the default Japanese input method on Windows 10).

I expected to see a generic mapping from sounds (Katakana, Hiragana, Romaji) to Kanji,
but instead found a full log of the text the user had entered, with timestamps (we later identified the conditions under which records are created).

The format of this file is very simple: blocks of mappings bound to a timestamp.

It can be parsed into the following output:

Picture 3. Mapping-like representation.

But if you are not interested in the mappings, you can easily convert them into readable sentences:

Picture 4. Chat log-like representation.
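
The two representations can be sketched in Python once the records have been parsed. The `Record` class, its field names, and the sample data below are hypothetical: they describe an in-memory form of a record (a timestamp plus pairs of typed text and substitution result), not the on-disk layout.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class Record:
    """Hypothetical in-memory form of one parsed record (not the file layout)."""
    timestamp: datetime
    pairs: List[Tuple[str, str]]  # (text as typed, substitution result)


def to_mapping_lines(record: Record) -> List[str]:
    """Mapping-like view: one 'typed -> chosen' line per pair."""
    return [f"{typed} -> {chosen}" for typed, chosen in record.pairs]


def to_chat_line(record: Record) -> str:
    """Chat log-like view: join the substitution results into one sentence."""
    sentence = "".join(chosen for _typed, chosen in record.pairs)
    return f"[{record.timestamp:%Y-%m-%d %H:%M:%S}] {sentence}"


# Made-up sample data, for illustration only.
sample = Record(
    timestamp=datetime(2021, 7, 9, 10, 30, 0, tzinfo=timezone.utc),
    pairs=[("きょう", "今日"), ("は", "は"), ("あめ", "雨")],
)
chat_line = to_chat_line(sample)  # "[2021-07-09 10:30:00] 今日は雨"
```

Keeping the pairs and deriving the sentence from them on demand preserves both views without storing the text twice.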

It seems that every time you type text and use the prediction system, the system creates a record with a timestamp, the entered text, and all the substitution choices you made.

Interestingly, it remembers the whole sentence and not just the places where you accepted a substitution, which makes this log extremely informative.
You are looking at what is almost a chat log.

We checked the contents of this file on different computers and were able to extract records dating back years.
However, the file seems to be limited to about 512 KB in size, so with heavy input only recent text remains.
Not only Japanese input is recorded: English text entered while in "Hiragana/Katakana mode" (picture 5), for example by holding the Shift key, is recorded as well.

Picture 5. Japanese Hiragana Predictive input mode (top) - WILL be logged, while pure English input (bottom) - WILL NOT be logged.

The structure of this file is pretty simple.
The file starts with a header containing the timestamp of the last time the file was updated and the number of sentence records
(think of a sentence as the text you typed before you started making substitutions).

Each sentence is divided into pairs of words (the one the user entered and the substitution result).

At the end of the file there is also slack space, which seems to contain more records.
Simple logic can be applied to extract them: I made assumptions about the time span of the Timestamp field so that most invalid records are filtered out, and I also display the unparsed space after UTF-16LE decoding, so nothing important is lost.
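
The slack-space filtering described above could look roughly like the following Python sketch. It rests on assumptions that the text does not confirm: the timestamp is treated as a 64-bit Windows FILETIME (100 ns ticks since 1601-01-01 UTC), the lower bound of the plausible window is arbitrarily set to the year 2000, and all function names are made up for illustration.

```python
from datetime import datetime, timezone

# ASSUMPTION: timestamps are Windows FILETIME values (100 ns ticks since 1601).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def to_filetime(moment: datetime) -> int:
    """Convert a timezone-aware datetime to FILETIME ticks (integer math only)."""
    delta = moment - FILETIME_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10


# ASSUMPTION: nothing older than 2000-01-01 can be a genuine record.
EARLIEST = to_filetime(datetime(2000, 1, 1, tzinfo=timezone.utc))


def is_plausible(raw: int, latest: datetime) -> bool:
    """Accept a candidate timestamp only if it falls inside the expected window.

    The comparison is done on raw integers, so garbage values far outside
    the datetime range are rejected without raising OverflowError.
    """
    return EARLIEST <= raw <= to_filetime(latest)


def decode_leftover(data: bytes) -> str:
    """Decode unparsed bytes as UTF-16LE, replacing invalid sequences."""
    return data.decode("utf-16-le", errors="replace")
```

In practice `latest` would be the acquisition time of the file (or the current time), so records "from the future" are discarded along with zeroed or random data.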

If you are interested in the code that creates and updates this file, check the following library:

C:\Windows\System32\IME\IMEJP\IMJPPRED.DLL

.file_structures

The format of this file is very simple, so instead of reversing the code, I tried to understand it from the contents I saw.

Here is the simple description I came up with:

Picture 6. File format diagram

.bof_module

Since this functionality is enabled by default on Japanese PCs, we wrote a BOF module for the [Cobalt Strike] framework to parse this file, so that pentesters can quickly check recent user input.

If needed, you can specify the path to the JpnIHDS.dat file as an argument.
By default, the BOF tries to read the file from the current user's folder.

Compile this BOF with:

VisualStudio: cl.exe /c /GS- /TP BOF_read_jpn_pred.c /Foread_jpn_pred.o

MinGW: x86_64-w64-mingw32-gcc -c BOF_read_jpn_pred.cpp -o read_jpn_pred.o