ImpELF: Unmasking Linux Malware with a Novel Imphash Approach for ELF Binaries

As someone who primarily does Linux security research, I was frustrated that there was no equivalent of the imphash for Linux ELF binaries. So, I decided to make one myself. Introducing ImpELF.

ImpELF is a Python-based ELF (Executable and Linkable Format) hashing utility that generates fingerprints for ELF binaries from their imported functions and libraries, aiding in malware analysis and similarity detection.

Imported Symbols and Libraries

On Linux, ELF is the standard format for executables, shared libraries, and object files. ELF binaries contain symbol tables, which hold information about symbols (functions, variables, etc.) defined within the binary, as well as those imported from external libraries.

When a program uses an imported symbol, the dynamic linker (ld.so) resolves the symbol's address at runtime by locating it in the appropriate shared library. This mechanism allows multiple programs to share the same library code, reducing the overall memory footprint and binary size.

For example, if you are analyzing an ELF binary that uses the "printf" function from the C standard library, then "printf" is an imported symbol: it is defined in libc, not within the ELF binary itself.
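To make "imported symbol" concrete, here is a minimal sketch that lists the undefined symbols in a dynamic symbol table. It assumes a 64-bit little-endian ELF and assumes you already have the raw bytes of the .dynsym and .dynstr sections in hand; the function names are hypothetical and this is not necessarily how impelf.py does it (a real tool would more likely lean on an ELF parsing library).

```python
import struct

SHN_UNDEF = 0  # section index meaning "not defined in this file"
# Elf64_Sym, little-endian: st_name, st_info, st_other, st_shndx, st_value, st_size
SYM_FORMAT = "<IBBHQQ"
SYM_SIZE = struct.calcsize(SYM_FORMAT)  # 24 bytes per entry


def read_cstring(table, offset):
    """Return the NUL-terminated string that starts at offset in a string table."""
    end = table.index(b"\x00", offset)
    return table[offset:end].decode("utf-8", errors="replace")


def imported_symbols(dynsym, dynstr):
    """Return the names of undefined (imported) symbols in raw .dynsym bytes.

    Assumption: dynsym and dynstr are the raw contents of the .dynsym and
    .dynstr sections of a 64-bit little-endian ELF file.
    """
    names = []
    for offset in range(0, len(dynsym) - SYM_SIZE + 1, SYM_SIZE):
        st_name, _info, _other, st_shndx, _value, _size = struct.unpack_from(
            SYM_FORMAT, dynsym, offset
        )
        name = read_cstring(dynstr, st_name)
        # An undefined symbol with a name must be resolved from another object.
        if st_shndx == SHN_UNDEF and name:
            names.append(name)
    return names
```

The check on st_shndx is the important part: a symbol whose section index is SHN_UNDEF has no definition in the file, so the dynamic linker has to find it elsewhere.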

The libraries that we sort and add to the hash are the shared libraries that these imported symbols come from.
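An ELF binary records the shared libraries it depends on as DT_NEEDED entries in its dynamic section. The sketch below reads them, assuming a 64-bit little-endian ELF and assuming the raw bytes of the .dynamic and .dynstr sections are already available. The function name is hypothetical, and whether impelf.py reads exactly these entries is an assumption on my part.

```python
import struct

DT_NULL = 0    # marks the end of the dynamic array
DT_NEEDED = 1  # d_val is a .dynstr offset naming a required shared library
DYN_FORMAT = "<qQ"  # Elf64_Dyn, little-endian: d_tag, d_val
DYN_SIZE = struct.calcsize(DYN_FORMAT)  # 16 bytes per entry


def needed_libraries(dynamic, dynstr):
    """Return library names from DT_NEEDED entries, in the order they appear.

    Assumption: dynamic and dynstr are the raw contents of the .dynamic and
    .dynstr sections of a 64-bit little-endian ELF file.
    """
    libs = []
    for offset in range(0, len(dynamic) - DYN_SIZE + 1, DYN_SIZE):
        d_tag, d_val = struct.unpack_from(DYN_FORMAT, dynamic, offset)
        if d_tag == DT_NULL:
            break  # nothing after the terminator is meaningful
        if d_tag == DT_NEEDED:
            end = dynstr.index(b"\x00", d_val)
            libs.append(dynstr[d_val:end].decode("utf-8", errors="replace"))
    return libs
```

The names come back in file order, which is why sorting them before hashing matters: two builds that list the same libraries in a different order should still hash the same.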

Implementation

Run the ImpELF script on an ELF binary:

python impelf.py /path/to/your/elf_file

The script will output the ImpELF hash for the given ELF binary.

By analyzing an ELF binary's dynamic symbols (imported functions) and libraries, we can create a hash similar to the imphash used for PE files. Suppose we have an ELF binary with the following imported symbols and libraries:

Imported symbols:

  • printf
  • malloc
  • strcpy
  • strcmp

Libraries:

  • libc.so.6
  • libm.so.6

In impelf.py, the get_imported_symbols_and_libraries function extracts the imported symbols and libraries from the ELF binary and returns them as two separate lists.

After obtaining the lists of imported symbols and libraries, the create_hash function is called with these two lists as arguments. In this function, the symbols and libraries are first sorted:

Sorted imported symbols:

  • malloc
  • printf
  • strcmp
  • strcpy

Sorted libraries:

  • libc.so.6
  • libm.so.6

Then, the sorted imported symbols list is concatenated with the sorted libraries list to create a single string:

Example concatenated string: mallocprintfstrcmpstrcpylibc.so.6libm.so.6

Finally, the concatenated string is hashed using the MD5 hashing algorithm (or another algorithm of your choice) to create the final ELF hash:

ImpELF hash: 4e4d4d4e8f8a96d30b9dab9d6deac8b3

Keep in mind that the hash value shown here is only illustrative and might not match what the script prints; the actual output depends on the specific ELF binary being analyzed.
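The sort, concatenate, and hash steps above can be sketched in a few lines. The name create_hash comes from the walkthrough, but the signature and body here are my assumptions rather than a copy of impelf.py; the algorithm parameter is a hypothetical convenience for swapping MD5 for something else.

```python
import hashlib


def create_hash(symbols, libraries, algorithm="md5"):
    """Sort both lists, join them without separators, and hash the result."""
    joined = "".join(sorted(symbols)) + "".join(sorted(libraries))
    return hashlib.new(algorithm, joined.encode("utf-8")).hexdigest()


symbols = ["printf", "malloc", "strcpy", "strcmp"]
libraries = ["libc.so.6", "libm.so.6"]

digest = create_hash(symbols, libraries)
print(digest)
```

Because both lists are sorted first, the order in which the linker happened to emit the imports does not affect the result. Passing algorithm="sha256" would give a longer digest from the same input string.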

Final Thoughts

ImpELF, like imphashing, has its drawbacks that you should understand and be aware of.

Limited scope: These techniques only focus on imported symbols and libraries, which capture just one aspect of a binary's characteristics. They do not account for the rest of the data in a binary, much of which may also be relevant for analysis and comparison.

Evasion: Attackers may manipulate the imported symbols and libraries to modify the resulting hash, making it harder to identify malicious binaries. This can be done by using different function names, adding irrelevant functions or libraries, or even statically linking the libraries instead of dynamically linking them.

False positives: Since ImpELF and imphashes focus on imported symbols and libraries, unrelated binaries that happen to import the same set of symbols and libraries may produce the same hash. This can lead to false positives when comparing or identifying binaries.

Incomplete information: If a binary is packed, obfuscated, or encrypted, it may not be possible to accurately extract the imported symbols and libraries. This could result in an incorrect or misleading hash value.
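The evasion and false-positive points are easy to see with a toy example. This sketch uses a stand-in hash function (impelf_like_hash, a hypothetical name) that follows the sort, concatenate, and MD5 recipe described earlier; the symbol lists are made up for illustration.

```python
import hashlib


def impelf_like_hash(symbols, libraries):
    """Stand-in for the ImpELF recipe: sort, concatenate, MD5."""
    joined = "".join(sorted(symbols)) + "".join(sorted(libraries))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


base_symbols = ["printf", "malloc", "strcpy", "strcmp"]
libraries = ["libc.so.6", "libm.so.6"]

original = impelf_like_hash(base_symbols, libraries)

# Evasion: importing one extra, unused function is enough to change the hash.
padded = impelf_like_hash(base_symbols + ["getpid"], libraries)

# False positive: an unrelated program with the same imports (in any order)
# produces the same hash.
unrelated = impelf_like_hash(
    ["strcmp", "strcpy", "printf", "malloc"], ["libm.so.6", "libc.so.6"]
)

print("extra import changes the hash:", original != padded)
print("identical imports collide:", original == unrelated)
```

This is why an ImpELF match is best treated as a lead to investigate alongside other indicators, not as proof that two binaries are related.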

GitHub Repo: https://github.com/signalblur/impelf