A solution for extracting deep insights from audio and video using machine learning models.


AI use-case Capabilities

Cognitive services to generate knowledge from your data



Tag extraction

Celebrity recognition

Brand recognition



Key phrase extraction

Language detection

Sentiment analysis

Location entity extraction



Text to speech

Speech to text

Speech translation

Video Insights

Face detection

Detects and groups faces appearing in the video.

Shot detection

Determines when a shot changes in video based on visual cues.

Thumbnail extraction for faces ("best face")

Automatically identifies the best captured face in each group of faces and extracts it as an image asset.

Visual text recognition (OCR)

Extracts text that is visually displayed in the video.

Visual content moderation

Detects adult and/or racy visuals.

Label identification

Identifies visual objects and actions displayed.

Celebrity identification

Audio Insights

Automatic language detection

Automatically identifies the dominant spoken language.

Audio transcription

Converts speech to text in 12 languages and allows extensions.

Speaker enumeration

Maps and understands which speaker spoke which words and when.

Speaker statistics

Provides statistics for speakers' speech ratios.

Textual content moderation

Detects explicit text in the audio transcript.

Audio effects

Identifies audio effects such as hand claps, speech, and silence.

Emotion detection

Identifies emotions based on speech (what is being said) and voice tonality (how it is being said). The emotion could be joy, sadness, anger, or fear.


Translates the audio transcript into 54 different languages.
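
As an illustration of the speaker statistics capability listed above, the following minimal sketch shows one way a speech ratio could be computed once speaker enumeration has produced timed segments. The function name `speech_ratios`, the `(speaker, start, end)` segment layout, and the sample values are all assumptions made for illustration; they are not the service's actual API or output format.

```python
from collections import defaultdict


def speech_ratios(segments):
    """Return each speaker's share of the total spoken time.

    `segments` is a hypothetical iterable of (speaker_id, start_seconds,
    end_seconds) tuples, such as a speaker enumeration step might produce.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        if end < start:
            raise ValueError(f"segment ends before it starts: {start} > {end}")
        totals[speaker] += end - start

    spoken = sum(totals.values())
    if spoken == 0:
        return {}  # no speech at all, so no ratios to report
    return {speaker: duration / spoken for speaker, duration in totals.items()}


# Illustrative input only: two speakers taking turns over ten seconds.
segments = [
    ("Speaker 1", 0.0, 6.0),
    ("Speaker 2", 6.0, 8.0),
    ("Speaker 1", 8.0, 10.0),
]
ratios = speech_ratios(segments)
for speaker, ratio in sorted(ratios.items()):
    print(f"{speaker}: {ratio:.0%} of spoken time")
```

Ratios here are relative to total spoken time, so silence is excluded; dividing by the full media duration instead would be an equally reasonable choice.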