inaSpeechSegmenter Wrapper (v2.0)

About this version

  • Submitter: keighrim
  • Submission Time: 2024-05-07T03:45:14+00:00
  • Prebuilt Container Image: ghcr.io/clamsproject/app-inaspeechsegmenter-wrapper:v2.0
  • Release Notes

    Major update

    • TimeFrame properties are updated to match new CLAMS vocab specification
    • noEnergy label from INA segmenter is stored as silence in MMIF for clarity
    • added parameter silenceRatio to configure silence detection threshold
    • renamed minDuration parameter to minTFDuration for clarity
    • bugfixes in cli.py

About this app (See raw metadata.json)

inaSpeechSegmenter is a CNN-based audio segmentation toolkit. The original software can be found at https://github.com/ina-foss/inaSpeechSegmenter .

Inputs

(Note: “*” as a property value means that the property is required but can be any value.)

One of the following is required: [

]

Configurable Parameters

(Note: Multivalued means the parameter can have one or more values.)

  • minTFDuration: optional, defaults to 0

    • Type: integer
    • Multivalued: False

    minimum duration of a TimeFrame in milliseconds

  • silenceRatio: optional, defaults to 3

    • Type: integer
    • Multivalued: False

    percentage ratio (0-100) of audio energy used to determine silence, relative to the mean energy of the input audio.

  • pretty: optional, defaults to false

    • Type: boolean
    • Multivalued: False
    • Choices: false, true

    The JSON body of the HTTP response will be re-formatted with 2-space indentation
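
As a rough illustration of how the parameters above might be passed at runtime, the sketch below encodes them into a query string. It assumes the app is served over HTTP and reads its parameters from the URL query of the request that carries the input MMIF. The address `http://localhost:5000` and the helper `build_request_url` are hypothetical names introduced for this example, and the range checks are inferred from the parameter descriptions rather than taken from the app's code.

```python
from urllib.parse import urlencode

# Hypothetical address of a locally running container; adjust to your setup.
APP_URL = "http://localhost:5000"


def build_request_url(base_url, min_tf_duration=0, silence_ratio=3, pretty=False):
    """Return base_url with the runtime parameters encoded as a query string.

    Defaults mirror the documented ones (minTFDuration=0, silenceRatio=3,
    pretty=false). The validation is an assumption based on the descriptions.
    """
    if min_tf_duration < 0:
        raise ValueError("minTFDuration must be a non-negative number of milliseconds")
    if not 0 <= silence_ratio <= 100:
        raise ValueError("silenceRatio must be a percentage between 0 and 100")
    params = {
        "minTFDuration": min_tf_duration,
        "silenceRatio": silence_ratio,
        # Booleans are written in lowercase to match the listed choices.
        "pretty": str(pretty).lower(),
    }
    return f"{base_url}?{urlencode(params)}"


url = build_request_url(APP_URL, min_tf_duration=1000, silence_ratio=5, pretty=True)
print(url)  # http://localhost:5000?minTFDuration=1000&silenceRatio=5&pretty=true
```

The resulting URL would then be the target of a POST request whose body is the input MMIF file.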

Outputs

(Note: “*” as a property value means that the property is required but can be any value.)

(Note: Not all output annotations are always generated.)

  • http://mmif.clams.ai/vocabulary/TimeFrame/v5
    • timeunit = “milliseconds”
    • labelset = a list of [“silence”, “speech”, “noise”, “music”]

    The INA segmenter uses 5-way classification ([‘noEnergy’, ‘female’, ‘male’, ‘noise’, ‘music’]), and this wrapper remaps the labels to [‘silence’, ‘speech’, ‘noise’, ‘music’] by 1) renaming noEnergy to silence and 2) collapsing female and male into speech (keeping the original value in an additional gender property). Note that the time frame annotations do not exhaustively cover the input audio; they cover only the detected segments.
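
The remapping described above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: it assumes the segmenter yields `(label, start, end)` tuples with times in seconds, that a frame shorter than `minTFDuration` is dropped, and that the properties are named `label`, `start`, `end`, and `gender`. The helper names `remap_segment` and `to_time_frames` are hypothetical.

```python
# 5-way INA labels mapped to the 4-way labelset documented above.
LABEL_MAP = {
    "noEnergy": "silence",
    "female": "speech",
    "male": "speech",
    "noise": "noise",
    "music": "music",
}


def remap_segment(label, start_sec, end_sec):
    """Convert one (label, start, end) segment into TimeFrame-like properties."""
    props = {
        "label": LABEL_MAP[label],
        # The TimeFrame timeunit is milliseconds; input is assumed to be seconds.
        "start": round(start_sec * 1000),
        "end": round(end_sec * 1000),
    }
    if label in ("female", "male"):
        # Keep the original distinction in an extra property (name assumed).
        props["gender"] = label
    return props


def to_time_frames(segments, min_tf_duration=0):
    """Remap all segments and drop those shorter than min_tf_duration (ms).

    Whether the real comparison is inclusive is an assumption of this sketch.
    """
    frames = [remap_segment(*segment) for segment in segments]
    return [f for f in frames if f["end"] - f["start"] >= min_tf_duration]


segments = [("noEnergy", 0.0, 0.5), ("male", 0.5, 3.0), ("music", 3.0, 3.2)]
for frame in to_time_frames(segments, min_tf_duration=400):
    print(frame)
```

With a 400 ms minimum, the 200 ms music segment in this example is filtered out, which is one way the output can end up not covering the whole input audio.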