
MATLAB Simulated Datasets for Channel-Resilient RF Fingerprinting


Download Datasets:
For convenience, the large Day1 and Day2 datasets are each divided into 10 parts. Each part contains the signals from one virtual radio and is 4.8 GB. Please use the following links to download the datasets:
Dataset#1: TxData
Dataset#2: Day1 - part1 , part2 , part3 , part4 , part5 , part6 , part7 , part8 , part9 , part10
Dataset#3: Day2 - part1 , part2 , part3 , part4 , part5 , part6 , part7 , part8 , part9 , part10

Note: The datasets are released in SigMF format and must be parsed from binary into float64 values before use.

These datasets were created for the task of RF fingerprinting in a MATLAB simulated environment. They were used in "More Is Better: Data Augmentation for Channel-Resilient RF Fingerprinting," IEEE Communications Magazine, 2020. Any use of this dataset that results in a publication with a bibliography section should include a citation to our paper. Here are the PDF and the reference for the paper:

Paper PDF
Nasim Soltani, Kunal Sankhe, Jennifer Dy, Stratis Ioannidis, and Kaushik Chowdhury, "More Is Better: Data Augmentation for Channel-Resilient RF Fingerprinting," in IEEE Communications Magazine, 58 (10), pp. 66-72, 2020.



Problem:
RF fingerprinting involves identifying characteristic transmitter-imposed variations within a wireless signal. Deep Neural Networks (DNNs) that do not rely on handcrafted features have proven to be remarkably effective in fingerprinting tasks, as long as the channel remains invariant. However, DNNs trained at a specific location and time perform poorly on datasets collected under different channel conditions. This paper proposes a data augmentation step within the training pipeline that exposes the DNN to many simulated channel and noise variations that are not present in the original dataset. We have two schemes for data augmentation: one that can be applied to the transmitter data (when the transmitter data is accessible before it passes through the channel) and another that can be applied to the received raw IQ samples (when only a passive received dataset is available). To simulate these experimental scenarios, we create one dataset collected at the transmitter side (TxData) and two datasets collected at the receiver side (Day1 and Day2).

Data Description:
Here, we describe the three datasets mentioned above for the task of RF fingerprinting. The datasets are created in a MATLAB simulation environment where we use RF impairments to model the fingerprints of radios. Virtual radios are distinguished by unique pairs of amplitude and phase imbalances. The amplitude imbalances are 10 distinct values in the range of 1 dB to 5.5 dB with steps of 0.5 dB, and the phase imbalances are 10 distinct values in the range of 1 degree to 82 degrees with steps of 9 degrees. For example, radio1 has an amplitude imbalance of 1 dB and a phase imbalance of 1 degree, radio2 has an amplitude imbalance of 1.5 dB and a phase imbalance of 10 degrees, and so on. These virtual radios transmit WiFi 802.11a packets that are collected either at the transmitter side or at the receiver side.
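The pairing of impairments to radios described above can be sketched in a few lines of Python. This is only an illustration of the stated ranges and step sizes, assuming the index-wise pairing implied by the radio1 and radio2 examples; the function name radio_impairments is hypothetical and is not part of the dataset tooling.

```python
# Sketch: enumerate the (amplitude, phase) imbalance pair of each virtual radio.
# Assumes radio<k> takes the k-th amplitude value and the k-th phase value.
NUM_RADIOS = 10


def radio_impairments(index):
    """Return (amplitude_imbalance_dB, phase_imbalance_deg) for radio<index> (1-based)."""
    if not 1 <= index <= NUM_RADIOS:
        raise ValueError("radio index must be in 1..10")
    amplitude_db = 1.0 + 0.5 * (index - 1)  # 1 dB to 5.5 dB in 0.5 dB steps
    phase_deg = 1 + 9 * (index - 1)         # 1 degree to 82 degrees in 9 degree steps
    return amplitude_db, phase_deg


impairments = {f"radio{i}": radio_impairments(i) for i in range(1, NUM_RADIOS + 1)}
```

With this mapping, radio10 ends up at the upper end of both ranges (5.5 dB and 82 degrees), which matches the limits given in the description.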

TxData (2 GB)
The TxData dataset contains 10 radios, each transmitting 2042 packets. These packets are recorded at the transmitter side and hence contain the virtual radio fingerprint but no channel or noise distortion. There are 10 transmissions in this dataset, each stored in a separate file.

Day1 (Rx-Raw) (45 GB)
The Day1 dataset contains 10 radios, each generating 2042 packets. These packets are then passed through the WLANTGn channel model in MATLAB. The process is repeated 16 times for 16 different SNR levels in the range of -10 dB to 20 dB with steps of 2 dB. Therefore, the dataset contains 160 transmissions, each stored in a separate file. Since the same channel instance (channel seed) is used throughout each transmission, this dataset can be construed as capturing one "day" of data collection and is called "Day1".
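As a quick check of the transmission count, the SNR grid and the radio/SNR combinations can be enumerated as below. This is a sketch only: the ordering of the list is arbitrary and is not claimed to match the transmission_index stored in the meta-data files.

```python
# Sketch: enumerate the SNR levels and the radio/SNR combinations of Day1.
# The order of `transmissions` is illustrative, not the dataset's own indexing.
SNR_LEVELS_DB = list(range(-10, 21, 2))         # -10, -8, ..., 18, 20
RADIOS = [f"radio{i}" for i in range(1, 11)]    # radio1 ... radio10

transmissions = [(radio, snr) for radio in RADIOS for snr in SNR_LEVELS_DB]

print(len(SNR_LEVELS_DB), "SNR levels,", len(transmissions), "transmissions")
```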

Day2 (Rx-Raw) (45 GB)
The Day2 dataset is generated through the same process as "Day1"; however, a new set of channel seeds was used, which represents data collected on "Day2".

Figure 1 shows where in the transmitting-receiving chain our datasets are recorded.

Figure 1. Transmitter and receiver chain. TxData is collected before the wireless channel. Day1 and Day2 contain raw IQ samples and are collected after the imposed noise at the Rx side.


In the SigMF format, data sequences are stored in binary form in .bin files. In our datasets, each .bin file is a flat representation of one transmission in the form of interleaved I and Q values. Our data type is float64, and hence each I or Q value takes 8 bytes, as shown in Figure 2.

Figure 2. An example of interleaved IQ values in a .bin file.
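The interleaved layout can be parsed with the Python standard library alone. The following is a minimal sketch, assuming little-endian float64 values (the example meta-data lists the data type as cf64_le); the function name read_iq and its start/count parameters are hypothetical helpers, not part of the released dataset.

```python
import array
import sys

BYTES_PER_VALUE = 8                      # one float64 I or Q value
BYTES_PER_SAMPLE = 2 * BYTES_PER_VALUE   # one interleaved (I, Q) pair


def read_iq(path, start=0, count=None):
    """Read `count` complex samples starting at sample index `start`.

    With count=None, everything from `start` to the end of the file is read.
    """
    with open(path, "rb") as f:
        f.seek(start * BYTES_PER_SAMPLE)
        raw = f.read() if count is None else f.read(count * BYTES_PER_SAMPLE)
    # Drop a trailing partial sample, if any, so the buffer holds whole pairs.
    usable = len(raw) - len(raw) % BYTES_PER_SAMPLE
    values = array.array("d")
    values.frombytes(raw[:usable])
    if sys.byteorder == "big":
        values.byteswap()                # assumed little-endian on disk
    # Even positions are I values, odd positions are Q values.
    return [complex(i, q) for i, q in zip(values[0::2], values[1::2])]
```

Reading a slice rather than a whole file keeps memory use modest, since a single transmission can hold many millions of samples. A numerical library could be substituted for the list-based conversion if one is available.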


Each .bin file is accompanied by a .json file that contains the meta-data for that .bin file. In what follows, we describe the details of our meta-data files.

Meta Data Description
As mentioned before, for each transmission recorded in the .bin file, there is a .json file with the same name as the .bin file. The .json file contains meta-data for that specific transmission. Inside the .json file, we have a set of key/value pairs that we will describe below:

  • global:
    • dataset_name: name of the dataset
    • version: version of SigMF
    • sample_rate: sample rate of signal in Hz
    • total_transmissions: total number of transmissions in this dataset
    • description: a short description of what this file is
    • record_date: date that the current dataset was created
    • data_type: data type of the samples, e.g. float32 or float64
  • captures:
    • number_of_packets: number of packets in this transmission
    • sample_start: the index in the file where the samples start
    • carrier_frequency: carrier frequency in Hz
    • sample_count: number of IQ samples in the transmission
    • transmission_index: the index of this transmission, it is a number between 0 and total_transmissions-1
  • annotations:
    • dataset_type: type of dataset, simulated or passed over the air
    • protocol: transmission protocol used in this transmission
    • waveform_type: type of waveform in MATLAB, e.g. NonHT, HT, or VHT
    • channel_bandwidth: channel bandwidth in Hz
    • environment: the simulated medium used for transmission
    • mcs: modulation-coding scheme used to modulate the data
    • antenna_technology: SISO, MIMO, etc.
    • transmitter:
      • impairment_type: type of impairment, IQ imbalance, phase noise, etc.
      • amplitude_imbalance: amplitude imbalance value for this transmission
      • phase_imbalance: phase imbalance value for this transmission
      • virtual_radio: name of virtual radio, radio1, radio2, etc.


It should be noted that some of the keys are not applicable to some of the datasets. For example, for the TxData dataset, whose signals have not passed through a wireless channel, "channel_bandwidth" is not a valid entry. In such cases, the value "N/A" is assigned to that specific key.

Snapshot of an example meta-data file from Day1 dataset

{
  "global": {
    "core:dataset_name": "Day1",
    "core:version": "0.0.1",
    "core:sample_rate": 5000000,
    "core:total_transmissions": 160,
    "core:description": "This is the meta file for a specific transmission in the Day1 dataset",
    "core:record_date": "September 7, 2020",
    "core:datatype": "cf64_le"
  },
  "captures": {
    "core:number_of_packets": 2042,
    "core:sample_start": 0,
    "core:carrier_frequency": 2400000000,
    "core:sample_count": 19603200,
    "core:transmission_index": 29
  },
  "annotations": {
    "core:dataset_type": "simulated",
    "core:protocol": "WiFi 802.11a",
    "core:waveform_type": "NonHT",
    "core:channel_bandwidth": 5000000,
    "core:environment": "MATLAB WLAN TGn Channel",
    "core:mcs": 3,
    "core:anntena_technology": "SISO",
    "core:SNR": 20,
    "transmitter": {
      "core:impairment_type": "IQ Imbalance",
      "core:amplitude_imbalance": "1.5dB",
      "core:phase_imbalance": "10 degree",
      "core:virtual_radio": "radio2"
    }
  }
}
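A meta-data file of this shape can be read with Python's json module. The sketch below assumes that each .json file is a valid JSON object whose "global", "captures", and "annotations" sections are objects with "core:"-prefixed keys, as in the snapshot; the helper names load_meta, get_field, summarize, and expected_bin_bytes are hypothetical. It treats the "N/A" placeholder as a missing value.

```python
import json


def load_meta(json_path):
    """Load the meta-data that accompanies a .bin file."""
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)


def get_field(meta, section, key, default=None):
    """Look up "core:<key>" in a section, treating "N/A" as missing."""
    value = meta.get(section, {}).get(f"core:{key}", default)
    return default if value == "N/A" else value


def summarize(meta):
    """Collect the fields most often needed to label a transmission."""
    tx = meta.get("annotations", {}).get("transmitter", {})
    return {
        "radio": tx.get("core:virtual_radio"),
        "snr_db": get_field(meta, "annotations", "SNR"),
        "sample_rate_hz": get_field(meta, "global", "sample_rate"),
        "sample_count": get_field(meta, "captures", "sample_count"),
        "channel_bandwidth_hz": get_field(meta, "annotations", "channel_bandwidth"),
    }


def expected_bin_bytes(meta):
    """Expected .bin size if every IQ sample is two float64 values (16 bytes)."""
    count = get_field(meta, "captures", "sample_count")
    return None if count is None else count * 16
```

For the snapshot above, the sample count of 19603200 would correspond to 313651200 bytes under the 16-bytes-per-sample assumption, which can serve as a sanity check on a downloaded file.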