0.9.6 is out: Kesakode malware identification!

Sun 26 May 2024 malcat team news

Today we are happy to announce the release of version 0.9.6. This releases introduces Kesakode, a remote hash lookup service exclusive to Malcat users and tightly integrated inside Malcat's UI. It can be used to match known functions, strings and constant sets against a database of known clean, malware and library files. The Kesakode service can be used in various situation, such as:

  • identify unpacked (e.g. a memory/sandbox dump) malware samples
  • show similarities shared between malware families
  • assist in the creation of better Yara rules
  • speed up reverse engineering by identifying know libraries / runtime code

During the whole process, only hashes are sent to our platform, your sample never leaves your computer.

Note: to use the Kesakode service, you need a running license of Malcat

If you want to see the complete list of improvements, have a look at the changelog, at the bottom of this article. This release note will focus on Kesakode.

How does it work?

While there is some technicality behind the scene, the main idea fueling Kesakode is rather simple and will be summarized below.

Indexing process

We have built over the last months a big library of 300+ of the most recent malware families, alongside a million of unique clean programs and libraries. Add to that the impressive 2000+ malware families corpus of Malpedia, and you get a reasonable training dataset. For each sample, three sets of features are extracted:

  • Hashes of every interesting function found in the sample, with their absolute offsets masked out (to cope with code relocation)
  • Hashes of every interesting string found in the sample (we use Malcat's scoring system there)
  • A single fuzzy hash computed over the list of interesting code immediates and data constants identified by our constant scanner

Hashes are stored in a huge relational database, and linked to their corresponding sample.

How Kesakode computes similarities for your file
Figure 1: How Kesakode computes similarities for your file

At query time

When you make a Kesakode query from within the Kesakode view, the same three sets of hashes are computed and sent to our matching service.

For function and string hashes, our cloud service will simply query the database and try to recollect if and where we have seen this hash before. Then a simple decision tree is used:

  1. If we have seen a function hash in a library, it will be labelled as LIBRARY code and the name of the library will be returned
  2. Otherwise if we have seen the string/function hash in a clean program, we label the hash as CLEAN
  3. Finally, if the hash was only seen in malware, it is labelled as MALICIOUS and Kesakode returns all the malware family names where it was found

For code immediates and data constants, we take a different approach though, as indexing every constant/immediate found in programs would not be technically achievable. Instead, we only focus on malicious samples there, and store for each sample a single fuzzy hash summarizing the whole constant/immediate sets in 128 bytes. Fuzzy hashes allow for fast fixed-time comparisons of two constants sets and greatly improves performance. At query time, the fuzzy hash is compared to its nearest neighbours and all malware families having a similarity score greater than 80% are returned.

The new Kesakode view in Malcat
Figure 2: The new Kesakode view in Malcat

All results are finally display in the Kesakode view, where you can see every matching function and string. A global likelihood score is also computed for matching malware families to help you make the final call.

And does it work?

It works well! Malware are notably tricky and try to camouflage themselves by using deception techniques such as code obfuscation or data/string encryption. By using three different sets of features (functions, strings and constants), Kesakode is still able to identify the vast majority of the malware samples one way or another. In fact, the only limiting factor is our database. But with your help, this factor can be alleviated by submitting FP/FN samples:

Submitting false positives/negatives
Figure 3: Submitting false positives/negatives

Regarding performances, you can expect your typical lookup query to take between 1 and 4 seconds. This can of course vary depending on the number of functions and strings found in your program. If you want to learn more, have a look at a few use cases of the Kesakode service below.

Use cases

Whether you are a malware analyst, detection engineer or a more casual reverse engineer, Kesakode lookups can help you save time in a few situations. We'll see how below.

Malware identification

The main and most obvious usage than can be made of Kesakode is malware identification, i.e answering the question: to which malware family does this sample belong?

While there exist a lot of public Yara rules on the Internet that can identify malware families, finding good Yara rules is another story. And Yara rules sets can be incomplete and/or can get outdated quickly if the rules are poorly written. Kesakode can help you there, as it can identify samples using more patterns than your standard Yara rule. And if we don't have a particular family in our database, submitting FP/FN samples is just a couple of clicks away and require less effort than writing your own Yara rule! Note that Kesakode will only work on unpacked/dumped samples, as attributing packers/crypters to a single malware family is often not possible. You can for instance run Kesakode on process dumps issued by the Triage sandbox.

Tria.ge Sandbox dumps are good targets for precise malware identification using Kesakode
Figure 4: Tria.ge Sandbox dumps are good targets for precise malware identification using Kesakode

Kesakode lookups can also help you spot similarities between malware families, information that Yara rules often fail to provide. You'll be able to see which function and/or string of your sample is used in other malware families, and use this information to deduce the family tree of the malware family.

Detection engineering

If you're a detection engineer and have to write Yara rules to detect malware, Kesakode can be rather handy too! The hard part in writing a good Yara rule is finding portions of code and strings which are unique to the malware. Kesakode can help you there, even if we don't have the malware in our database!

Data and strings coloring in Malcat after a kesakode request
Figure 5: Data and strings coloring in Malcat after a kesakode request

By displaying UNKNOWN and MALICIOUS functions/strings inside the Kesakode view, you'll be able to spot artifacts that have never been seen in any library/clean program. These make very good candidates for your new Yara rules. The coloring scheme (LIBRARY, MALICIOUS and CLEAN labels) is also applied to Malcat's strings view, hexadecimal view, structure view and disassembly view if you prefer to build your Yara rule from there. Combined with Malcat's Yara editor, this makes writing detection rules easier.

Faster reverse engineering

Even if malware are not your cup of tee and you just want to reverse engineer a standard application, Kesakode can help you identify low-value functions in your program. Indeed, chances are that you'll want to focus your efforts on the portions of code that are unique to your program: complex algorithms, unique strings, etc.

Color helpers in disassembly view after a kesakode request
Figure 6: Color helpers in disassembly view after a kesakode request

After every Kesakode database lookup, the disassembly view in Malcat will label/color every known function found in either CLEAN programs or LIBRARY (with the name of the library). This can speedup the reversing process a lot: just ignore the boring part and focus on the unique parts of your program!

Frequently asked questions

Where does the name "Kesakode" comes from?

The word Kesakode is a combination of "Kesako" (from the occitan "qu'es aquò", meaning "what is it?") and the word "code", so basically: What is this code?

How can I use this service? How much does it cost?

This service comes for free with any full/pro license. You can access it within Malcat's Kesakode view. Note that you'll need a running license of Malcat, i.e you must be within the 1 year update period. See below for questions regarding your monthly quota.

How many queries can I make?

There are two limits built into Kesakode. The first limit is in the Web server, which will block you if you make more than 60 queries / hour. Note that you should not reach this limit under normal usage.

The second limit is your montly quota. This makes sure things stay civilized and every user has access to the service. Currently, the service is in beta-test and we want to put the server under stress. Your monthly quota is thus currently 160 requests / month for full users and 320 queries / month for pro users.

Depending on the result of the stress test, we'll set a more reasonable quota in the future, most likely 40 requests for full users and 80 for pro users. But this is not written in stone and will depend on actual usage of the service.

Kesakode is great, but I need more quota!

We are happy that you like the new detection service. If you need more quota, we can always setup a server just for you. Note that this requires some effort, time and money. The easiest way is to just contact us and we will work something out.

Kesakode sucks, it can't identify my malware!

Nowadays, 99% of the malware found in the wild are packed/obfuscated. Kesakode, works only on unpacked/plain-text malware. To unpack your malware, there are many solutions. The simplest one is to download dumps issued by the Triage sandbox or AnyRun, chances are your unpacked malware is in one of them.

If you are indeed analysing a plain-text malware, then maybe we just don't have it in our database, it happens. You can always share the sample with us and we will make sure it gets detected in the next indexation.

How does it compare against Intezer, Threatray, Glimps or other similar services?

Kesakode shares similar goals with all these solutions, that is: identify malware families and known artifacts in programs. But there are of course differences.

On the plus side, Kesakode is fast (between 1 and 4 seconds query time for most malware). It also tries to identify malware on three different levels (code, strings and immediates/constants). As far as we know it's the only solution to do that.

On the minus side, most of the aforementioned solutions offer more: a bigger dataset, threat-intelligence capabilities (i.e explore their online sample database) and most of the time they also have a small sandbox to try to unpack the sample for you. .

Note that I only have incomplete information on these solutions, as I don't have the budget to afford any of them. So take these points with a grain of salt. The best way to compare things is always to do your own test!

Can I include it in my malware analysis pipeline?

Yes, in a very near future! Thanks to Malcat's GUI-less python module, you can already analyse files directly from your python program. In the next release, we will also add python bindings to programmatically query the Kesakode service. Note that you'll most likely need an OEM license for Malcat: don't hesitate to contact us.

Why can't I read the word AI anywhere?

There are problematic where AI is not well suited, and in our opinion this is one of them. The dataset on the (unpacked) malware side is small (a few thousands families is considered small for machine learning), but is very large on the clean side: this is far from ideal for proper machine learning.

Additionally, ML/DL models are prone to false positives, which is particularly dangerous when doing malware attribution. We believe that a good algorithm and solid optimisation will take your farther, but just see for yourself!

Thanks and credits

We would like to thank Dr. Plohmann from FKIE for the inspiration (MCrit showed us that malware similarity at scale is possible) and more generally all the people behind Malpedia. Without access to Malpedia's corpus during training, Kesakode wouldn't be able to detect as many malware.

Also a big thank you to all the great minds behind the PostgreSQL database, that definitely know how to write performant software!

Other changes

This release also comes with a few additional changes:

  • Your Tri.age account can now be added into the list of threat intelligence providers
  • If Malcat fails to extract a password-protected file from an archive using one of the standard passwords, it will now ask the users to provide one (only in the GUI)
  • We have improved the install_api.py script: it can now install and activate Malcat at the same time
  • We have added python bindings to access identified constants
  • ... and the usual suspects: new anomalies, Yara rules and bug fixes

Here is the complete changelog of this release:

● Kesakode:
    - Added Kesakode hash lookup service! (see website for more details)
    - Added Kesakode view
    - Made several views display Kesakode information
● Analysis:
    - User highlights now also get added to the symbols
    - Improved stackstring detection for x86/x64
    - Improved performances in function & loop discovery algorithms
● Parsers:
    - Added support for Golang 1.20+ symbol parsing
    - Added support .p7x certificate files
● Intelligence:
    - Add an option in Preferences>Intelligence>Misc. options to disable SSL checks for HTTPS requests
    - Added Triage as threat intel source
● Scripting:
    - Added python bindings for the identified constants (analysis.constants)
    - Added documentation for the analysis.constants object
    - Added attribute Analysis.entrypoint
    - Added a script to remap PE physical section starts to their virtual counterparts (useful for badly dumped samples)
    - [WINDOWS] install_api.py now immediately throws an error when called from the wrong python interpreter version
    - install_api.py now also offers you to activate malcat for an easy 1-step deployment on online or offline GUI-less servers
● User interface:
    - You can now enter a custom password when extracting virtual files (and the default password does not work)
    - Virtual files are now extracted in a background thread to avoid GUI freezes on large files
    - Added options for _minimum_ column number in structure and hexadecimal views
    - Added "Download and analyse" context menu option to the source code view
    - Improved address context menu with dereference sub-menus for [VA], [RVA] and [OFFSET]
    - Annotation having a danger level (anomalies, yara matches, recognized function / libs) can now be shown using the corresponding danger color
    - Ctrl-E (Goto Entrypoint) now goes to the *best* entry point (i.e the .NET EP for .NET files instead of the PE one)
    - Switching to one of the code view if you've never moved around in the file once now automatically jumps to the entry point of the file
    - Jumping into a code view (disas or proximity) now automatically align the view to the start of the CPU instruction
● Anomalies:
    - Added TableExternalLink anomaly for OLE Word documents (external URL found in Table stream)
    - Added BssNonEmpty anomaly 
    - Added PumpedOverlay anomaly (large overlays with very low entropy)
    - Added PasswordInScript anomaly for InnoSetup installers
    - Added ImportByHash anomaly (know API hash constant were found)
    - Removed RegionHighEntropy anomaly
    - Removed HugeResource anomaly
    - Improved performances of CPU-intensive anomalies
● Transforms:
    - Added GCM mode for aes
● Bug fixing:
    - Fixed file passwords in _unicode_ InnoSetup installers need to be utf16-le encoded
    - Fixed RAR4 and RAR5 file names can be UTF8
    - Fixed a regression where the window title was not updated properly after a project save
    - Fixed a regression where coloring for overlapping annotation would in some rare cases not be applied correctly
    - Fixed some %ls/%s confusion under Linux leading to abberviated names in a few displays
    - Fixed issue in VB p-code disassembler where disassembling an invalid operand would in some rare cases lead to an exception
    - Fixed wrong computation of malcat.Function.num_unique_immediate_bytes
    - Fixed Scintilla editor in Yara view would trash the file on save errors. Go back to the last edit instead.
    - [WINDOWS] In the Yara editor, wxWidgets Scintilla control insisted on saving things as latin1 instead of UTF-8