Comparison of the natural and machine languages | Publishing house Radiotekhnika


Journal Neurocomputers, No. 10, 2010

Article in issue:

Comparison of the natural and machine languages

Keywords: machine code, Markov chain, gap completion

Authors:

A. V. Balakin, D. A. Edel

Abstract:

The gap completion problem has not yet been solved. Several methods address it, but none offers a definitive solution. The most effective methods work as signature detectors or are at least based on them. According to the Intel IA-32 specification, any sequence of bytes longer than 16 bytes can be interpreted as a command. If disassembly is performed from different starting positions, a command superposition effect appears: one byte can belong to several different commands at the same time. As a result, almost all data variables can be interpreted as commands. To address this problem, the present paper proposes a comparison of words from natural and machine languages based on a language model of executable code.

This language model is based on homogeneous Markov chains. Let $\{X_n\}$ be a sequence of discrete random commands for which $P(X_{n+1}=i_{n+1} \mid X_n=i_n, X_{n-1}=i_{n-1}, \dots, X_0=i_0) = P(X_{n+1}=i_{n+1} \mid X_n=i_n)$, that is, each command depends only on the one preceding command. Then $\{X_n\}$ is a homogeneous Markov chain. Let $\{O_j\}$, $j = 0, \dots, L$, be the tested sequence of commands to be classified in the gap completion task. Then the probability that it belongs to a given model $M$ is $P_M(O) = P_M(O_1) \times P_M(O_2 \mid O_1) \times P_M(O_3 \mid O_2) \times \dots \times P_M(O_L \mid O_{L-1})$.

To solve the task in question, namely building experiment-based models of executable and non-executable command sequences, we used a file set for model training. The file set contains Windows PE files and non-executable files. The test file set consists of only 30 files. The classification method is identical to the methods used to identify the language of a given text. In the experiments, the training file set contained 23126 executable and 62883 non-executable files. To improve classification, the file sets were divided into 13 executable classes and 3 non-executable classes. The experiments showed that the best classification results are obtained when the sequence of commands is longer than 30 bytes. Each model contains 342 unique commands. If a built model has fewer commands than the others, it is padded with zeros so that all models are the same size. The experiments showed that this gap completion method gives good results only when detecting executable code; in the general case it requires an additional file set and a selection of the commands most essential for building the model of each class.
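
The Markov-chain classification described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the toy mnemonic sequences, and the class labels "code" and "data" are hypothetical. Additive smoothing is an assumption introduced here so that unseen transitions do not produce log(0); the abstract itself only says that smaller models are padded with zeros. Probabilities are summed in log space, which is equivalent to the product P_M(O) but avoids numeric underflow on long sequences.

```python
import math


def train_markov_model(sequences, vocabulary, smoothing=1.0):
    """Estimate first-order Markov chain log-probabilities from command sequences.

    `smoothing` is an additive (Laplace) constant: an assumption of this
    sketch, not something stated in the abstract.
    """
    vocab = sorted(vocabulary)
    # Every model is defined over the same command set, so all models
    # end up the same size regardless of which commands they observed.
    init_counts = {c: smoothing for c in vocab}
    trans_counts = {a: {b: smoothing for b in vocab} for a in vocab}

    for seq in sequences:
        if not seq:
            continue
        init_counts[seq[0]] += 1
        for prev, cur in zip(seq, seq[1:]):
            trans_counts[prev][cur] += 1

    init_total = sum(init_counts.values())
    log_init = {c: math.log(n / init_total) for c, n in init_counts.items()}
    log_trans = {}
    for a, row in trans_counts.items():
        row_total = sum(row.values())
        log_trans[a] = {b: math.log(n / row_total) for b, n in row.items()}
    return log_init, log_trans


def log_probability(model, seq):
    """log P_M(O) = log P_M(O_1) + sum over j of log P_M(O_j | O_{j-1})."""
    log_init, log_trans = model
    score = log_init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        score += log_trans[prev][cur]
    return score


def classify(models, seq):
    """Return the name of the model under which `seq` is most probable."""
    return max(models, key=lambda name: log_probability(models[name], seq))


# Hypothetical toy training data; real models would be trained on
# disassembled commands from large file sets.
VOCABULARY = {"push", "mov", "call", "ret"}
TRAINING = {
    "code": [["push", "mov", "call", "ret"],
             ["push", "mov", "mov", "call", "ret"]],
    "data": [["ret", "ret", "push", "push"],
             ["call", "ret", "ret", "push"]],
}
MODELS = {name: train_markov_model(seqs, VOCABULARY)
          for name, seqs in TRAINING.items()}

print(classify(MODELS, ["push", "mov", "call", "ret"]))
```

With one model per class (the abstract mentions 13 executable and 3 non-executable classes), the same `classify` call would simply take the maximum over more entries in `MODELS`.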

Pages: 63-67
