Data carving is the process of identifying and extracting forensic artifacts from digital evidence using file signatures. For example, files using the Joint Photographic Experts Group (JPEG) compression standard begin with the two-byte header 0xFF0xD8 and end with the two-byte footer 0xFF0xD9. These are also known as the Start-of-Image (SOI) and End-of-Image (EOI) markers, respectively.

The following tutorial will demonstrate how these markers can be used to carve a JPEG-based file from a provided forensic image.

Tools required

  • wget, unzip, file, md5sum, sha256sum
  • xxd, grep, cut, tr
  • echo, bc, dd

Download a forensic image

To start, download “image #11” provided by the National Institute of Standards and Technology (NIST) located here. Then, unzip it and move into its main directory. Next, rename the image, verify its file-type, and record its MD5 & SHA256 hash values. I personally recommend renaming the image for simplicity. Doing so does not have an impact on the integrity of our evidence (try hashing the image before and after renaming it to see my point).

cd 11-carve-fat/
mv 11-carve-fat.dd raw.dd	# rename image for simplicity
file raw.dd 			# verify file-type (DOS/MBR boot sector)
md5sum raw.dd			# get MD5 hash of image
sha256sum raw.dd		# get SHA256 hash of image
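
If you want to see the renaming point for yourself without touching the evidence, here is a small sketch that runs the experiment on a throwaway file. The scratch directory and file names are placeholders of my own, not part of the NIST download. Since a hash is computed from a file's contents and not its name, the two digests should match.

```shell
# Hypothetical scratch file standing in for the forensic image.
tmpdir=$(mktemp -d)
printf 'example evidence bytes\n' > "$tmpdir/original.dd"

# Hash, rename, then hash again.
before=$(sha256sum "$tmpdir/original.dd" | cut -d ' ' -f 1)
mv "$tmpdir/original.dd" "$tmpdir/raw.dd"
after=$(sha256sum "$tmpdir/raw.dd" | cut -d ' ' -f 1)

if [ "$before" = "$after" ]; then
    echo "hash unchanged after rename"
else
    echo "hash changed"
fi
rm -r "$tmpdir"
```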

Running both the MD5 and SHA256 algorithms at the beginning of our investigation gives us a baseline for verifying the integrity of the evidence as it’s processed. Using two algorithms also reduces the risk posed by a hash collision (where two different inputs generate the same output), since an altered image would have to collide under both at once. Forensically speaking, evidence should not be trusted if another image is able to produce the same hash value as the one you are analyzing. Otherwise, how would you know if it’s been modified?

The downloaded image should produce the hashes below.

MD5: 0069813c892a462f88dc6d376624f7d9
SHA256: 83585232e908529286f1ff04c43b4d858604875c733183a9e3b44a07ff818d26
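
Comparing 64-character digests by eye is error-prone, so it may help to let the shell do the comparison. The sketch below is one way to do it; verify_sha256 is a helper name I made up, and it assumes sha256sum prints the digest as the first space-separated field (which is how the GNU version behaves).

```shell
verify_sha256() {
    # $1 = file to check, $2 = expected SHA256 digest in lower-case hex
    actual=$(sha256sum "$1" | cut -d ' ' -f 1)
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1"
        return 1
    fi
}

# Example call against the downloaded image (commented out so the sketch
# runs without it):
# verify_sha256 raw.dd 83585232e908529286f1ff04c43b4d858604875c733183a9e3b44a07ff818d26
```

The same helper can be re-run at the end of the investigation to confirm the image is still untouched.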

Find possible JPEG SOI markers and their off-sets

We begin our search for JPEG markers by using xxd and grep. xxd generates a text-based hexdump of its input, which in our case is the downloaded image. grep (Globally search for a Regular Expression and Print) will be used to find the markers we are looking for in said hexdump. By the way, -g1 tells xxd to group its output one byte (two hexadecimal digits) at a time, so every byte is separated by a space.

xxd -g1 raw.dd | grep 'ff d8'

As you may find, using 0xFF0xD8 as our only search criterion will produce a lot of false positives. So, to help narrow down the possibilities, we will include 0xFF0xE0, the APP0 marker that opens a JPEG File Interchange Format (JFIF) header segment. As our friends at StackOverflow explained here, JPEG is the algorithm used to compress/encode data within a file while JFIF is one of the most commonly used JPEG file formats.

xxd -g1 raw.dd | grep 'ff d8 ff e0'
00820a00: ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 01 2c  ......JFIF.....,
009a0a00: ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 01 2c  ......JFIF.....,
00a14a00: ff d8 ff e0 00 10 4a 46 49 46 00 01 02 01 00 48  ......JFIF.....H
00a14b80: 00 00 00 01 00 00 00 48 00 00 00 01 ff d8 ff e0  .......H........
00a15b60: 5f 00 18 00 01 ff d8 ff e0 00 10 4a 46 49 46 00  _..........JFIF.

The output here suggests we have five artifacts (JPEGs) to recover. Yet, our focus from this point forward will only be the second finding. Said finding begins with 009a0a00 - this is called an off-set, and it tells us how far into the image that line of the hexdump starts. Since we only need this bit of information and it must be upper-case (I’ll explain in a moment), we will re-run the previous command while cutting out and re-formatting the first eight characters of each line.

xxd -g1 raw.dd | grep 'ff d8 ff e0' | cut -c 1-8 | tr a-z A-Z
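
One limitation worth knowing: xxd prints 16 bytes per line, so grep cannot match a marker that is split across two lines, and a match in the middle of a line (like the last two findings above) still needs its position counted by hand. The sketch below is an alternative that prints the exact decimal off-set of every match. find_marker is a helper name of my own, and it leans on od and awk, which are not in the tool list above. It walks the image one byte at a time, so expect it to be slower than grep on large images.

```shell
find_marker() {
    # $1 = file to search
    # $2 = marker as lower-case hex bytes separated by spaces, e.g. "ff d8 ff e0"
    # od prints every byte as two hex digits (-v stops it from collapsing
    # repeated lines); tr then puts each byte on its own line for awk.
    od -An -v -tx1 "$1" | tr -s ' ' '\n' | awk -v pat="$2" '
        BEGIN { n = split(pat, want, " "); pos = 0 }
        $1 != "" {
            ring[pos % n] = $1              # remember the last n bytes seen
            pos++
            if (pos < n) next
            hit = 1
            for (k = 0; k < n; k++)
                if (ring[(pos - n + k) % n] != want[k + 1]) { hit = 0; break }
            if (hit) print pos - n          # off-set of the first marker byte
        }'
}

# Example call (commented out so the sketch runs without the image):
# find_marker raw.dd 'ff d8 ff e0'
```

Because the off-sets come out in decimal, they can be passed to xxd -s or dd without any conversion.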

Now, our next step will be to take the off-set we are focusing on and feed it into a tool called bc. We will use this tool to convert the hexadecimal off-set into a decimal number of bytes, which tells us how deep into the image our JPEG begins (the SOI sits at the very start of its hexdump line, so no adjustment is needed). To do this, we’re going to use echo and include another parameter, ibase=16;, which tells bc the base of our input is hexadecimal. As a comparison, we would specify ibase=10; if our input was decimal. Lastly, bc requires hexadecimal digits to be upper-case because it treats lower-case letters as variable names (ibase itself is a variable in this context).

echo 'ibase=16;009A0A00' | bc
10095104 # we must go 10095104 bytes deep to find our target artifact
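
If bc is not installed, or you would rather skip the upper-case step, the shell's own printf can do the same conversion. This is only a sketch of an alternative, and hex_to_dec is a name I picked for illustration. It relies on printf accepting a 0x-prefixed number for %d.

```shell
hex_to_dec() {
    # $1 = hexadecimal off-set without a 0x prefix, upper or lower case
    printf '%d\n' "0x$1"
}

hex_to_dec 009a0a00   # prints 10095104
```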

Find a corresponding EOI marker and off-set

Using the SOI we found as our seek (or start) point, we will now look for an EOI. The -s option tells xxd to skip that many bytes before it starts dumping. Yet, since 0xFF0xD9 will also generate a lot of false positives like before, we will use -m1 to stop grep after the first match.

xxd -g1 -s 10095104 raw.dd | grep -m1 'ff d9'
009a7eb0: 9e 94 d2 bb 4f 07 bd 46 7b 7d 2b ff d9 00 00 00  ....O..F{}+.....

As before, we can run an extended command sentence so only the needed off-set is displayed as output.

xxd -g1 -s 10095104 raw.dd | grep -m1 'ff d9' | cut -c 1-8 | tr a-z A-Z

Then, we will run the value through bc. Keep in mind, though, that the off-set we currently have is not the position of the EOI itself; it is the position of the 16-byte line where the EOI resides. Therefore, we must count the bytes from the start of that line up to and including 0xFF0xD9. In this instance, 0xFF is the 12th byte on the line and 0xD9 is the 13th, so we need to add 13 bytes to whatever bc says.

echo 'ibase=16;009A7EB0' | bc
10124976 # + 13 bytes means our EOI is 10124989 bytes deep
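
Counting bytes by eye is easy to get wrong by one, so here is a sketch that does the counting for a single line of xxd -g1 output. marker_end_offset is a hypothetical helper of mine. It assumes the line is in the -g1 layout (one byte per group, with two spaces before the ASCII column) and that both marker bytes sit on the same line.

```shell
marker_end_offset() {
    # $1 = one line of `xxd -g1` output
    # $2 = first marker byte, $3 = second marker byte (lower-case hex)
    line=$1
    base=$((0x${line%%:*}))      # line off-set, converted from hex to decimal
    rest=${line#*: }             # drop the "off-set: " prefix
    hex=${rest%%  *}             # keep the hex columns, drop the ASCII column
    i=0
    prev=
    for byte in $hex; do
        if [ "$prev" = "$2" ] && [ "$byte" = "$3" ]; then
            echo $((base + i + 1))   # off-set just past the second marker byte
            return 0
        fi
        prev=$byte
        i=$((i + 1))
    done
    return 1                     # marker not found on this line
}

marker_end_offset '009a7eb0: 9e 94 d2 bb 4f 07 bd 46 7b 7d 2b ff d9 00 00 00  ....O..F{}+.....' ff d9
# prints 10124989
```

The printed value is the number of bytes from the start of the image through the end of the EOI, which is what the size calculation in the next step needs.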

Calculate the size

We can now calculate how big our JPEG will be by subtracting the SOI off-set from the EOI off-set.

echo '10124989-10095104' | bc
29885 # nearly 30 kilobytes

Extract & verify

Finally, using the forensic image downloaded earlier, we can feed all of the clues we collected during our analysis into dd. In this step, we are asking dd to carve out an artifact from raw.dd, skipping the first 10095104 bytes and copying the 29885 bytes that follow. bs=1 sets the block-size to a single byte so that skip and count are measured in bytes, which lets us cut out exactly 29885 of them.

dd if=raw.dd of=artifact.jpg bs=1 skip=10095104 count=29885
md5sum artifact.jpg


This manually carved JPEG should produce the hash below (the same as provided online).

MD5: 37a49f97ed279832cd4f7bd002c826a2
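
With bs=1, dd reads and writes one byte at a time, which is fine for a 30 kilobyte file but gets slow for bigger artifacts. The sketch below shows an alternative built on tail and head, plus a quick sanity check that the carved file starts with an SOI and ends with an EOI. Both function names are my own, and head -c is an assumption here: it is widely available but not guaranteed on every system.

```shell
carve() {
    # $1 = source image, $2 = start off-set (bytes to skip)
    # $3 = length in bytes, $4 = output file
    # tail -c +N starts at the Nth byte (counting from 1), hence the +1.
    tail -c +"$(( $2 + 1 ))" "$1" | head -c "$3" > "$4"
}

has_jpeg_markers() {
    # Succeeds when the file starts with ff d8 and ends with ff d9.
    first=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    last=$(tail -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first" = "ffd8" ] && [ "$last" = "ffd9" ]
}

# Example calls (commented out so the sketch runs without the image):
# carve raw.dd 10095104 29885 artifact.jpg
# has_jpeg_markers artifact.jpg && echo "markers look right"
```

Passing the marker check does not prove the file is a valid image; the MD5 comparison above remains the real test.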

Again, to prove nothing was done to our original image file, we should be able to achieve the same MD5/SHA256 hash values as before.

md5sum raw.dd
sha256sum raw.dd
MD5: 0069813c892a462f88dc6d376624f7d9
SHA256: 83585232e908529286f1ff04c43b4d858604875c733183a9e3b44a07ff818d26

Other considerations

The steps covered within this post are only basic data carving techniques. I recommend trying more than one EOI off-set if you are unable to successfully carve an expected image the first time around. For example, I personally was only able to extract a second image after using the third or fourth off-set with grep 'ff d9'. Advanced data carving techniques may also be required when addressing fragmented files. Lastly, it is worth researching & including segment markers during analysis. These describe the portions between a file’s SOI and EOI.
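
As a starting point for the segment markers mentioned above, here is a rough sketch that lists the header segments of a carved JPEG. After the SOI, each header segment starts with 0xFF, a marker byte, and a two-byte big-endian length that counts itself but not the marker. The sketch follows those lengths until it reaches 0xFF0xDA (Start of Scan), where the compressed image data begins. byte_at and list_segments are names I made up, and the sketch assumes a well-formed header with no padding bytes between segments.

```shell
byte_at() {
    # $1 = file, $2 = off-set; prints that byte in decimal (empty past the end)
    dd if="$1" bs=1 skip="$2" count=1 2>/dev/null | od -An -tu1 | tr -d ' \n'
}

list_segments() {
    # $1 = carved JPEG; prints "off-set marker length" for each header segment
    file=$1
    pos=2                                    # the two SOI bytes come first
    while :; do
        # every marker starts with ff (255)
        [ "$(byte_at "$file" "$pos")" = "255" ] || return 1
        marker=$(byte_at "$file" $((pos + 1)))
        hi=$(byte_at "$file" $((pos + 2)))
        lo=$(byte_at "$file" $((pos + 3)))
        [ -n "$marker" ] && [ -n "$hi" ] && [ -n "$lo" ] || return 1
        len=$((hi * 256 + lo))
        printf '%d ff%02x %d\n' "$pos" "$marker" "$len"
        [ "$marker" = "218" ] && return 0    # ff da: image data follows
        pos=$((pos + 2 + len))               # jump to the next marker
    done
}

# Example call (commented out so the sketch runs without a carved file):
# list_segments artifact.jpg
```

If the walk fails partway through, that can be a hint the carve started at the wrong off-set or the file is fragmented.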