Publishing house Radiotekhnika

"Publishing house Radiotekhnika":
scientific and technical literature.
Books and journals of publishing houses: IPRZHR, RS-PRESS, SCIENCE-PRESS

Tel.: +7 (495) 625-9241


Comparison of natural and machine languages


A. V. Balakin, D. A. Edel

The gap completion problem has not been solved so far. Several methods address it, but none offers a complete solution. The most effective methods work as signature detectors or are at least based on them. According to the Intel IA-32 specification, every sequence of bytes longer than 16 bytes may be interpreted as a command. If disassembly is performed from different starting positions, the effect of command superposition appears: one byte can be part of several different commands at the same time. As a result, almost any data variable can be represented as commands. This paper proposes to address the problem by comparing words from natural and machine languages, using a language model of executable code.

The language model is based on homogeneous Markov chains. Let $\{X_n\}$ be a sequence of discrete random commands for which $P(X_{n+1}=i_{n+1} \mid X_n=i_n, X_{n-1}=i_{n-1}, \ldots, X_0=i_0) = P(X_{n+1}=i_{n+1} \mid X_n=i_n)$, that is, the next command depends only on the one preceding command. Then $\{X_n\}$ is a homogeneous Markov chain. Let $\{O_j\}$, $j = 1, \ldots, L$, be the tested sequence of commands to be classified in the gap completion task. Then the probability that it belongs to a certain model $M$ is $P_M(O) = P_M(O_1) \times P_M(O_2 \mid O_1) \times P_M(O_3 \mid O_2) \times \ldots \times P_M(O_L \mid O_{L-1})$.

To solve the task in question, building experiment-based models of executable and non-executable command sequences, we used a file set for model training. The file set contains Windows PE files and non-executable files. The test file set consists of only 30 files. The classification method is identical to methods that identify the language of a given text. In the experiments, the training file set contained 23126 executable and 62883 non-executable files. To improve classification, the file sets were divided into 13 executable classes and 3 non-executable classes. The experiments showed that the best classification results are obtained when the sequence of commands is longer than 30 bytes. Each model contains 342 unique commands.
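The probability that a tested command sequence belongs to a model M, as defined above, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that commands are represented as mnemonic strings, that probabilities are estimated from raw counts, that log-probabilities are summed instead of multiplying raw probabilities, and that zero probabilities are replaced by a small `floor` value so the logarithm stays finite. All function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_markov_model(sequences):
    """Estimate initial and transition probabilities from command sequences."""
    initial_counts = Counter()
    transition_counts = defaultdict(Counter)
    for seq in sequences:
        if not seq:
            continue
        initial_counts[seq[0]] += 1
        # Count every (previous command, current command) pair.
        for prev, cur in zip(seq, seq[1:]):
            transition_counts[prev][cur] += 1
    total_initial = sum(initial_counts.values())
    initial = {c: n / total_initial for c, n in initial_counts.items()}
    transitions = {
        prev: {cur: n / sum(row.values()) for cur, n in row.items()}
        for prev, row in transition_counts.items()
    }
    return initial, transitions


def log_probability(model, observed, floor=1e-6):
    """Return log P_M(O) = log P(O_1) + sum over j of log P(O_j | O_{j-1}).

    `floor` stands in for unseen commands and transitions; it is an
    assumption of this sketch, not something stated in the source.
    """
    initial, transitions = model
    if not observed:
        raise ValueError("observed sequence must not be empty")
    logp = math.log(max(initial.get(observed[0], 0.0), floor))
    for prev, cur in zip(observed, observed[1:]):
        p = transitions.get(prev, {}).get(cur, 0.0)
        logp += math.log(max(p, floor))
    return logp


def classify(models, observed):
    """Return the name of the model with the highest log-probability."""
    return max(models, key=lambda name: log_probability(models[name], observed))


# Toy, made-up training data (hypothetical, for illustration only).
code_model = train_markov_model(
    [["push", "mov", "call", "ret"], ["push", "mov", "mov", "ret"]]
)
data_model = train_markov_model(
    [["add", "add", "add", "add"], ["add", "inc", "add", "add"]]
)
models = {"executable": code_model, "non-executable": data_model}
label = classify(models, ["push", "mov", "ret"])
```

The same scoring scheme is what language identification by text commonly uses, with characters or words in place of commands. In that analogy each class from the experiments would get its own model, and `classify` would pick among all of them.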
If a built model has fewer commands than the others, it is padded with zeros to equal size. The experiments showed that this method of gap completion gives good results only when finding executable code; in the general case it requires an additional file set and selection of the commands most essential for building the model of each class.
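The zero padding mentioned above is not specified in detail. One possible reading, sketched below in Python, is that every model's transition table is expanded to the union of commands seen by any model, with missing entries set to 0.0. The function name, the nested-dictionary layout, and the toy tables are assumptions made for illustration.

```python
def pad_to_common_vocabulary(tables):
    """Expand each transition table to the union of all commands.

    `tables` maps a model name to {previous command: {next command: probability}}.
    Entries a model never observed are filled with 0.0.
    """
    vocabulary = sorted({
        cmd
        for table in tables.values()
        for prev, row in table.items()
        for cmd in (prev, *row)
    })
    padded = {
        name: [
            [table.get(prev, {}).get(cur, 0.0) for cur in vocabulary]
            for prev in vocabulary
        ]
        for name, table in tables.items()
    }
    return vocabulary, padded


# Toy tables with made-up probabilities (hypothetical).
tables = {
    "executable": {"push": {"mov": 1.0}, "mov": {"ret": 0.5, "mov": 0.5}},
    "non-executable": {"add": {"add": 1.0}},
}
vocabulary, padded = pad_to_common_vocabulary(tables)
```

After padding, every model is a square matrix over the same command order, so rows and columns can be compared directly across classes.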
