Path Expansion#

Source data are often stored in an organizational structure where some of the metadata of the experiment, such as the subject and session IDs, are included in the folder names and file paths of the data. The path_expansion module allows a user to specify this path pattern so that NeuroConv can identify all matching data and automatically extract the relevant metadata.

Local Path Expander#

Use the LocalPathExpander to find matching paths in your local filesystem. This class requires a source data specification, in which you can list multiple data interfaces by name, and for each provide a base_directory path and a “file_path” or “folder_path” argument in f-string format. The path expander will find all matching paths and automatically extract the specified metadata.

from pathlib import Path
from typing import Dict

from neuroconv.tools.path_expansion import LocalPathExpander

# Specify source data
source_data_spec = {
    "spikeglx": {
        "base_directory": "/path/to/raw_data",
        "file_path": "{subject_id}/{session_id}/{session_id}_g0_imec0/{session_id}_g0_imec0.ap.bin"
    },
    "phy" : {
        "base_directory": "/path/to/processed_data"
        "folder_path": "{subject_id}/{session_id}/phy"
    }
}

# Instantiate LocalPathExpander
path_expander = LocalPathExpander()

# Expand paths and extract metadata
metadata_list = path_expander.expand_paths(source_data_spec)

# Print the results
for metadata in metadata_list:
    print(metadata)

The expand_paths method returns a list of DeepDict objects that contain two dictionaries: source_data and metadata. The source_data dictionary contains the resolved path of each interface, while the metadata dictionary contains the metadata extracted from the filepaths. Currently, only subject_id, session_id, and session_start_time are supported, but this approach could in principle be extended to support extraction of more metadata.

Specifying Metadata Format#

The f-string format allows you to constrain the search for more precise metadata matching using the Format Specification Mini-Language and the 1989 C standard format codes for datetimes. Below are some common examples.

Length#

For example, you might have a data path where the subject_id and session_id are next to each other, and they can be disambiguated because the subject_id is always 4 characters and the session_id is always 5 characters. This can be expressed as "{subject_id:4}{session_id:5}".

Character type#

If the subject_id is always numeric, you could use n, e.g. "{subject_id:n}", which will match only digits.

Datetimes#

If your session start time is present in your data path, you can indicate this following the 1989 C standard format codes for datetimes. For example, "{session_start_time:%Y-%m-%d}" will match "2021-01-02" and evaluate it to datetime.datetime(2021, 1, 2).

Example Usage#

Below are some full examples of how this feature can be used on some organizational patterns taken from real datasets.

Example 1: Allen Institute Visual Coding Dataset #

The Allen Institute’s Visual Coding dataset contains, among other data, motion-corrected videos of each experimental session, with the directory structure shown below.

allen-brain-observatory/
¦   visual-coding-2p/
¦   +-- ophys_movies/
¦   ¦   +-- ophys_experiment_496908818.h5
¦   ¦   +-- ophys_experiment_496934409.h5
¦   ¦   +-- ophys_experiment_496935917.h5
¦   ¦   +-- ...

The video files are all stored in the directory ophys_movies/, and their file names follow the pattern ophys_experiment_ plus a 9-digit session ID. We can use LocalPathExpander to find all of these ophys_movies files and extract their session IDs with the following code block.

source_data_spec = {
    "allen-visual-coding": {
        "base_directory": "/allen-brain-observatory/visual-coding-2p",
        "file_path": "ophys_movies/ophys_experiment_{session_id}.h5"
    }
}
path_expander = LocalPathExpander()
metadata_list = path_expander.expand_paths(source_data_spec)

The metadata_list now contains the information extracted for each matching file found by LocalPathExpander. The information for the first file is shown below.

{
    "source_data": {
        "allen-visual-coding": {
            "file_path": "/allen-brain-observatory/visual-coding-2p/ophys_movies/ophys_experiment_496908818.h5"
        }
    },
    "metadata": {
        "NWBFile": {
            "session_id": "496908818"
        }
    }
}

Example 2: Buszaki Lab SenzaiY Dataset #

The Buszaki Lab’s SenzaiY dataset contains spiking and LFP data from mouse V1 with the directory structure shown below. Sorted unit spiking data are stored in the .res.1 and .clu.1 files, while the LFP data are stored in the .eeg files.

SenzaiY/
¦   YMV01/
¦   +-- YMV01_170818/
¦   ¦   +-- YMV01_170818.eeg
¦   ¦   +-- YMV01_170818.res.1
¦   ¦   +-- YMV01_170818.clu.1
¦   ¦   +-- ...
¦   YMV02/
¦   +-- YMV02_170815/
¦   ¦   +-- YMV01_170815.eeg
¦   ¦   +-- YMV01_170815.res.1
¦   ¦   +-- YMV01_170815.clu.1
¦   ¦   +-- ...
¦   ...

The data are organized into folders first by subject (YMV01, YMV02, etc.) and then by session start times in the format yymmdd (170818, 170815, etc). We can use LocalPathExpander to find both the LFP data files and the sorted unit spiking and extract their corresponding subject IDs and session start times. For the sorted unit spiking, we’ll search for a matching folder_path instead of a file_path, as neuroconv interfaces for such data, like NeuroScopeSortingInterface, expect a folder_path as input.

source_data_spec = {
    "SenzaiY_LFP": {
        "base_directory": "/SenzaiY/",
        "file_path": "{subject_id}/{subject_id}_{session_start_time:%y%m%d}/{subject_id}_{session_start_time:%y%m%d}.eeg"
    },
    "SenzaiY_Spiking": {
        "base_directory": "/SenzaiY/",
        "folder_path": "{subject_id}/{subject_id}_{session_start_time:%y%m%d}/"
    }
}
path_expander = LocalPathExpander()
metadata_list = path_expander.expand_paths(source_data_spec)

The metadata_list now contains the information extracted for each matching file and directory found by LocalPathExpander. The information for the first file is shown below.

{
    "source_data": {
        "SenzaiY_LFP": {
            "file_path": "/SenzaiY/YMV01/YMV01_170818/YMV01_170818.eeg"
        }
    },
    "metadata": {
        "NWBFile": {
            "session_start_time": datetime.datetime(2017, 8, 18, 0, 0)
        },
        "Subject": {
            "subject_id": "YMV01"
        }
    }
}

The information found for the first matching directory is similar.

{
    "source_data": {
        "SenzaiY_Spiking": {
            "folder_path": "/SenzaiY/YMV01/YMV01_170818/"
        }
    },
    "metadata": {
        "NWBFile": {
            "session_start_time": datetime.datetime(2017, 8, 18, 0, 0)
        },
        "Subject": {
            "subject_id": "YMV01"
        }
    }
}

Example 3: IBL Brain Wide Map Data #

The IBL’s Brain Wide Map features data from several labs of mice performing a visual decision-making task. Some experimental sessions, such as those from the Steinmetz Lab, include video recordings of the experiments from three cameras, stored in the following directory structure.

steinmetzlab/
¦   Subjects/
¦   +-- NR_0017/
¦   ¦   +-- 2022-03-22/
¦   ¦   ¦   +-- 001/
¦   ¦   ¦   ¦   +-- raw_video_data/
¦   ¦   ¦   ¦   ¦   +-- _iblrig_leftCamera.raw.6252a2f0-c10f-4e49-b085-75749ba29c35.mp4
¦   ¦   ¦   ¦   ¦   +-- ...
¦   ¦   ¦   ¦   +-- ...
¦   +-- NR_0019/
¦   ¦   +-- 2022-04-29/
¦   ¦   ¦   +-- 001/
¦   ¦   ¦   ¦   +-- raw_video_data/
¦   ¦   ¦   ¦   ¦   +-- _iblrig_leftCamera.raw.9041b63e-02e2-480e-aaa7-4f6b776a647f.mp4
¦   ¦   ¦   ¦   ¦   +-- ...
¦   ¦   ¦   ¦   +-- ...
¦   ...

We can use LocalPathExpander to find these left camera video files and extract the subject ID, the session start time (formatted as yyyy-mm-dd), and a session number (001 for both files shown).

source_data_spec = {
    "IBL_video": {
        "base_directory": "/steinmetzlab/",
        "file_path": "Subjects/{subject_id}/{session_start_time:%Y-%m-%d}/{session_id}/raw_video_data/_iblrig_leftCamera.raw.{}.mp4"
    }
}
path_expander = LocalPathExpander()
metadata_list = path_expander.expand_paths(source_data_spec)

The metadata_list now contains the information extracted for each matching file found by LocalPathExpander. The information for the first file is shown below.

{
    "source_data": {
        "IBL_video": {
            "file_path": "/steinmetzlab/Subjects/NR_0017/2022-03-22/001/raw_video_data/_iblrig_leftCamera.raw.6252a2f0-c10f-4e49-b085-75749ba29c35.mp4"
        }
    },
    "metadata": {
        "NWBFile": {
            "session_id": "001",
            "session_start_time": datetime.datetime(2022, 3, 22, 0, 0)
        },
        "Subject": {
            "subject_id": "NR_0017"
        }
    }
}

If you would like to experiment locally with LocalPathExpander, we provide a helper method in neuroconv.tools.testing that partially replicates the directory structure of the IBL data with dummy files on your machine.

from neuroconv.tools.testing import generate_path_expander_demo_ibl

generate_path_expander_demo_ibl(folder_path="path/to/generate/dummy/files")

Non-local Path Expansion#

Note that LocalPathExpander expands file paths locally, so it can only expand file paths that are on the same system as the code. Other types of path expanders could be implemented to support different platforms, such as Google Drive, Dropbox, or S3. These tools have not yet been developed, but would extend from the AbstractPathExpander