For professors, lecturers, and other teaching staff who produce and manage educational video materials for courses and lectures, implementing closed captioning guidelines is the key to creating accurate and compliant captions.
It helps ensure full access to the content for students who are hard of hearing, non-native English speakers, or who prefer to learn through reading.
Any institution looking to integrate accessibility, particularly closed captions, must follow certain rules to ensure caption quality. Failure to do so can result in:
- Lower enrollment rates, as students opt for institutions with better captioning facilities
- More manual work and one-on-one handling when accuracy is poor
- Loss of federal funding for programs and research
- Reduced student participation in their studies
- Complaints to the Office for Civil Rights (OCR) and legal action under the Americans with Disabilities Act (ADA)
So, if you want to safeguard your accessibility goals, follow the closed captioning guidelines discussed in this blog and make every student feel part of your institution.
Deep Dive into Closed Captioning
Closed captions are text displayed on the screen that renders spoken words and nonverbal information, such as sound effects, speaker identification, and music, in easy-to-read form.
The fundamental purpose of captioning is to replicate the hearing listener’s aural experience and make videos understandable to students who cannot hear the audio.
Source: digital.nsw.gov.au
It helps deaf or hard-of-hearing students access educational videos and make sense of the visuals by reading the captions, and it is especially beneficial for visual learners.
For example, the Oregon State University Ecampus Research Unit surveyed 2,124 students from 15 different universities and colleges. When asked why they use closed captions, almost two-thirds of participants said captions helped them focus, retain information, and understand the audio.
Source: er.educause.edu
And students comfortable listening to the content can turn closed captions off by simply clicking the “CC” icon found on all popular video platforms like YouTube and Facebook.
But closed captions are not limited to students with disabilities; even those with no disability use them, as the graph below, an interesting finding from the same survey, shows.
About 49% of respondents, primarily those with no hearing difficulty, revealed they too use captions at least some of the time.
Source: Educause
The advantages of closed captions extend beyond academic settings. They also help:
- Language learners understand and follow spoken content in a foreign language
- Viewers watch content silently in sound-sensitive locations where audio is restricted, such as libraries, hospitals, or public transportation
- Students with learning disabilities learn by visually reading text from the screen
- Anyone decipher confusing dialogue or unfamiliar accents, especially while watching movies or videos
Types of closed captions
- Pop-on captions: Created in advance for prerecorded and online content, this style helps viewers recognize who is saying what.
- Roll-up captions: Generated live for broadcasts or events, these are challenging to produce and may suffer slight delays or errors.
- Auto-generated captions: Created by speech recognition software, these are less accurate, though quality keeps improving as the technology advances.
- Paint-on captions: In this style, each word or phrase appears on the screen individually for a set duration.
Key Elements of Accurate Closed Captions
Accuracy
Findings from an ACM survey on the significance of caption accuracy from students’ perspectives showed that errors are among the most important factors affecting their experience of watching videos.
Source: ACM
To ensure your captions are of the highest possible accuracy, consider following these closed captioning guidelines.
- Match the spoken words of dialogue and song lyrics in their original language and in the correct order.
- Include all words in the order spoken, without paraphrasing, except where time constraints force condensing.
- Use correct spelling, homophones, capitalization, singular/plural forms, punctuation, tense, and number style (numerals vs. words).
- Preserve intentional errors or slang where they occur in the speech.
- Always show non-observable information such as speaker identity, sound effects, music, and reactions.
- Caption text must be large enough with sufficient spacing.
Source: Maria Veronica Valdivia, National Coordinator for English Advance at Andres Bello University
Maria Veronica Valdivia also advocates keeping caption accuracy in check at all times to avoid causing students discomfort in learning.
Synchronization
Synchronization plays a critical role in reducing the split-attention effect: the extra effort deaf students expend dividing their attention between the captions and the visuals in the frame.
Not syncing captions with the audio and speakers leads to poor attention, fatigue, and information gaps.
To keep these issues from impacting students’ learning, your captions must (see the checking sketch after this list):
- Start to appear and end right when the corresponding audio begins and ends.
- Sync with the audio to the highest extent possible.
- Remain on screen long enough to be read in full.
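A rough way to enforce the first two rules is to compare caption timestamps against the audio segments they transcribe. Below is a minimal sketch in plain Python; the `Cue` structure, the 0.2-second tolerance, and the audio-segment input are all assumptions for illustration, not part of any captioning standard.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    """A parsed caption cue; times are in seconds."""
    start: float
    end: float
    text: str

TOLERANCE = 0.2     # assumed acceptable drift from the audio, in seconds
MIN_DURATION = 2.0  # shortest display time that is comfortably readable

def check_sync(cues, audio_segments):
    """Flag cues that drift from their (start, end) audio segments
    or disappear before they can be read."""
    issues = []
    for cue, (a_start, a_end) in zip(cues, audio_segments):
        if abs(cue.start - a_start) > TOLERANCE:
            issues.append(f"cue at {cue.start:.2f}s starts {cue.start - a_start:+.2f}s off the audio")
        if abs(cue.end - a_end) > TOLERANCE:
            issues.append(f"cue at {cue.start:.2f}s ends {cue.end - a_end:+.2f}s off the audio")
        if cue.end - cue.start < MIN_DURATION:
            issues.append(f"cue at {cue.start:.2f}s is shown for only {cue.end - cue.start:.2f}s")
    return issues

# Example: the second cue starts half a second after its audio does.
cues = [Cue(1.0, 3.5, "Welcome to the lecture."), Cue(4.5, 6.5, "Let's begin.")]
audio = [(1.0, 3.5), (4.0, 6.5)]
print(check_sync(cues, audio))
```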
Completeness
Closed captioning guidelines also emphasize completeness, meaning captioning the entire length of a video.
Captions that disappear before the end of the video effectively deny accessibility to students with hearing disabilities, even if captioning appears to be available.
Proper Placement
FCC’s closed captioning guidelines state proper placement is important because it helps readers identify and distinguish speakers easily.
Bottom-center caption placement, in particular, is not ideal for content with multiple speakers because it fails to associate each caption with the correct speaker.
Source: Seanzdenek
The problem worsens when speakers talk fast, interrupt, or overlap one another, at which point this placement clearly violates the guidelines.
As recommended by The Captioning Key and WebVTT, the solution is to place the captions underneath their respective speakers.
“When possible, use caption placement to identify an onscreen speaker by placing the caption under the speaker.” ~ The Captioning Key
Source: Seanzdenek
You can also place captions next to the speaker, as Sean Zdenek, author of *Reading Sounds: Closed-Captioned Media and Popular Culture*, suggests.
To further clarify speaker identification, you can use a separate color for each speaker or change the visual style.
Source: Seanzdenek
It is especially common in the UK, where captions are displayed on a black background using a specific color palette in order of priority (white, yellow, cyan, and green).
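On the web, WebVTT’s cue settings (`position`, `line`, `align`) make this kind of placement straightforward, and its `<v>` voice tags label speakers. Below is a minimal sketch that writes a two-speaker .vtt file in Python; the timings, names, and percentages are illustrative only and would need tuning to the actual shot.

```python
# A minimal sketch that writes a two-speaker WebVTT file. The `position`
# cue setting shifts each caption toward its speaker's side of the frame,
# and <v> voice tags identify the speakers.
vtt = """WEBVTT

00:00:01.000 --> 00:00:03.500 position:25% align:center
<v Professor>Let's begin with today's topic.

00:00:04.000 --> 00:00:06.000 position:75% align:center
<v Student>Could you repeat the last point?
"""

with open("lecture.vtt", "w", encoding="utf-8") as f:
    f.write(vtt)
```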
Must-Follow Legal and Regulatory Compliance for Closed Captions
Federal Communications Commission Guidelines
Key closed captioning guidelines from the FCC:
- All new non-exempt TV programs must be captioned.
- 75% of older "pre-rule" non-exempt TV programs must be captioned.
- Internet video content that aired on TV with captions must be captioned online, subject to specific compliance dates.
- Full-length pre-recorded, live, and near-live programs should have captions when shown online.
- Separate rules apply to "straight lift" clips, montages, and live/near-live video clips.
- Captions should match spoken words accurately with proper spelling, grammar, and punctuation.
- Captions should stay in sync with the spoken words.
- Captioning should run from the video's start to its end.
- Captions should not block important visual content.
- Internet captions must match the quality of TV captions for the same program.
- Late-night programs and advertisements under five minutes do not require captioning.
- Students have the right to file complaints about captioning issues with the FCC or video programming distributor within 60 days of the problem.
ADA Requirements
Enacted in 1990, the Americans with Disabilities Act aims to provide equal learning opportunities to deaf or hard-of-hearing students, particularly in academics.
Its Title II and Title III require public and private entities, respectively, to provide captioned video content in:
- Public meeting halls
- Airports
- Public transportation
- Public venues
- State and local government bodies (e.g., police stations)
- Private colleges
- Corporate settings
WCAG Standards
Formulated by the World Wide Web Consortium (W3C), the Web Content Accessibility Guidelines (WCAG) are a set of specialized accessibility measures accepted worldwide.
WCAG 2.2, released in October 2023, is the latest version, building on the earlier standards without altering them.
These closed captioning guidelines rest on four pillars for presenting information digitally:
Perceivable
Students must be able to perceive information through at least one sense:
- Visual
- Auditory
- Touch (e.g., braille displays and other tactile devices)
You should also provide alternatives like video captions, alt text for images, and easy-to-differentiate color contrast.
Operable
This principle ensures websites are easy to operate regardless of how an individual uses them, whether via:
- Keyboard
- Mouse
- Touchscreen
- Joystick
- Other input devices
Additional mandates under this rule include easy navigation, no flashing content that could cause seizures, descriptive anchor text, and sufficient time allowances.
Understandable
Your content has to be understandable, meaning:
- Smooth navigation
- No overly complex terms
- Simple, to-the-point wording
- The same terms for identical functions (e.g., either "buy" or "add to cart," not both)
- Easy-to-fill forms
- No automatic language switches
- Logically organized information
Robust
This principle states that information must keep working as technologies evolve, and mandates:
- Device compatibility (computers, tablets, smartphones, etc)
- Browser compatibility (Firefox, Edge, Chrome, etc)
- Assistive tool support (screen readers, magnifiers, etc.)
- Proper visual layout
- No reliance on specific input methods (e.g., right-click)
- All functions are to be accessible via multiple means
Best Practices for Caption Formatting and Presentation
Font and Color
Using recommended fonts and colors improves students' ability to understand the content much more clearly.
- Use a white sans-serif font, such as Arial or Helvetica, for optimal readability
- Use mixed-case text (uppercase and lowercase letters) for standard dialogue
- Reserve ALL CAPS for indicating shouting, screaming, or speaker identification
- Keep the font size large enough to be easily read on different screen sizes
Line Length and Timing
Appropriate caption length and timing give students an easy-to-read, clean, and clutter-free experience, and the rules are mechanical enough to check automatically (see the sketch after this list).
- Keep a maximum of 30-37 characters per line to avoid eye strain
- Show a maximum of two lines of text per caption frame to keep the screen clutter-free
Source: r/accessibility (Reddit)
- Display for 2-7 seconds, depending on the text complexity and length
- Aim for a reading speed of no more than 180 words per minute (3 words per second)
- Synchronize captions precisely with the audio without losing the pace between sound and visuals
- Break sentences between pauses
- Start new sentences on a new line
- Avoid splitting phrases or clauses across caption frames, as this makes the continuation hard to understand out of context
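Because these limits are mechanical, they are easy to lint automatically. Here is a minimal sketch in plain Python; the thresholds mirror the rules above, while the function name and cue representation are my own assumptions.

```python
MAX_CHARS_PER_LINE = 37
MAX_LINES = 2
MIN_SECONDS, MAX_SECONDS = 2.0, 7.0
MAX_WPM = 180

def lint_cue(start, end, text):
    """Return readability problems for one caption cue (times in seconds)."""
    problems = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines (max {MAX_LINES})")
    for line in lines:
        if len(line) > MAX_CHARS_PER_LINE:
            problems.append(f"line is {len(line)} chars (max {MAX_CHARS_PER_LINE})")
    duration = end - start
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"shown for {duration:.1f}s (want {MIN_SECONDS}-{MAX_SECONDS}s)")
    wpm = len(text.split()) / duration * 60
    if wpm > MAX_WPM:
        problems.append(f"reading speed {wpm:.0f} wpm (max {MAX_WPM})")
    return problems

# Example: a cue that is on screen too briefly for its length.
print(lint_cue(12.0, 13.0, "The mitochondria is the powerhouse of the cell"))
```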
Speaker Identification
Knowing who is saying what matters: clear speaker labels let students match each caption to the correct speaker.
- Identify speakers by name in parentheses, e.g. (John)
- Describe unknown speakers, like STUDENT in non-gendered labels, AUDIENCE MEMBER, or PROFESSOR
- Use numbers when multiple unknown speakers are in the frame [STUDENT #1, STUDENT #2]
- Describe speakers' positions or actions through labels like [(Seated Person), (Speaker at Podium)]
- Re-identify speakers whenever a new person speaks or the speaker is off-screen
- Identify speakers by name only after they have been introduced
- Use colons after speaker names in all caps [JOHN: Hello, everyone.]
Non-Speech Elements
No video is complete without its supporting sounds, so make sure you caption them the right way (a small formatting sketch follows the list).
- Use square brackets for sound effects, music, and noises [applause], [dog barking]
- Express off-screen sound effects in italics to distinguish them from on-screen sounds
- Use punctuation to indicate the pacing of a sound: [dog barking] woof... woof... woof
- Describe music styles or moods when required [ethereal classical music]
- Use musical note symbol (♪) at the beginning and end of song lyrics
- Show speaker emotions or tones when not visually apparent in the frame [angrily], [whispering]
- Describe lack of sound or audio issues [silence], [no audio], [muffled]
- Caption background music only if it doesn't interfere with dialogue captions
- Use an ellipsis (...) for longer pauses
- Express non-verbal audio that impacts the story or meaning [laughter], [crowd gasps]
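These conventions are consistent enough to encode as small helpers so every video uses identical formatting. Below is a minimal sketch in plain Python; the helper names are my own, not from any standard.

```python
MUSIC_NOTE = "\u266a"  # the musical note symbol, ♪

def sound_effect(description):
    """Wrap a non-speech sound in square brackets: [dog barking]."""
    return f"[{description}]"

def lyrics(line):
    """Frame a line of song lyrics with musical note symbols."""
    return f"{MUSIC_NOTE} {line} {MUSIC_NOTE}"

print(sound_effect("applause"))                 # -> [applause]
print(lyrics("Twinkle, twinkle, little star"))  # -> ♪ Twinkle, twinkle, little star ♪
```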
Additional Considerations
- Use proper grammar, spelling, and punctuation
- Use verbal bridges (e.g., "um," "uh") only when necessary
- Use numerals instead of spelling out numbers for mathematical content
- In non-mathematical contexts, spell out numbers one through ten and use numerals for 11 and above
- Show dialects and accents in captions to clarify the speaker's native voice
- For non-English speech, caption the actual foreign words when possible, or use descriptions like [speaking French]
Tools and Techniques for Creating Accurate Captions
Professional Captioning Services
In-house captioning can get complicated: while trying to follow the guidelines, accuracy and synchronization are often compromised.
And even if you manage all that, captioning multiple videos at speed becomes a challenge when you prioritize quality, potentially delaying students' access to the content.
This is where professional captioning services like speech-to-text systems can help. Here is how a typical automatic caption-generating system works:
Step 1: The broadcaster transmits the audio to the vendor’s server, which generates the text in real-time.
Step 2: This text is edited based on the closed captioning guidelines the provider follows.
Step 3: The final edit is packaged in the captioning format required by the broadcaster.
Step 4: The broadcaster then transmits the captions to viewers.
Every vendor has its own proprietary process, but the basic principle is mostly similar.
Source: springer.com
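To make the four steps concrete, here is a heavily simplified sketch of such a pipeline in Python. Every function is a hypothetical placeholder, not a real vendor API; actual systems are proprietary and run in real time.

```python
# Hypothetical end-to-end flow of an automatic captioning pipeline.
# None of these functions correspond to a real vendor API; they only
# mirror the four steps described above.

def recognize_speech(audio_stream):
    """Step 1: ASR on the vendor's server turns audio into raw text."""
    ...

def apply_guidelines(raw_text):
    """Step 2: edit the raw text per the provider's captioning guidelines
    (punctuation, speaker labels, line breaks, non-speech cues)."""
    ...

def package_captions(edited_text, fmt="vtt"):
    """Step 3: wrap the edited text in the broadcaster's required format."""
    ...

def caption_pipeline(audio_stream):
    raw = recognize_speech(audio_stream)
    edited = apply_guidelines(raw)
    captions = package_captions(edited)
    return captions  # Step 4: the broadcaster transmits these to viewers
```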
EL.AI is one such speech transcription platform that has helped Harvard University, California State University, UCF, and many more create 99.8% accurate ADA-compliant closed captions, including detailed speaker labels, technical jargon, and sound effects.
You can caption up to 5,000 hours of content per week automatically, in just 3 clicks and within a 24-hour turnaround, and save time by integrating with popular platforms like:
- Anthology
- Blackboard
- Panopto
- Moodle
- Kaltura
- Canvas
Automated Tools with Human Oversight
Unlike completely automated captioning tools, those involving some degree of human oversight have better accuracy and output.
A recent experiment done by the University of Texas Rio Grande Valley in collaboration with Schoolworks confirmed this.
Between 2018 and 2022, they carried out four tests, analyzing approximately 17,000 captions, that is, 798 minutes of live captions, using the NER model.
A total of 388 minutes were created by human captioners manually using respeaking and stenography, and 410 minutes by ASR.
In conclusion, human-produced captions showed greater average quality than automatically generated ones.
This suggests choosing tools that offer some manual control over the captions if your priority is to provide students with the highest possible quality.
Source: [scholarworks.utrgv.edu](https://scholarworks.utrgv.edu/cgi/viewcontent.cgi?article=1114&context=wls_fac)
Quality Control Processes
Over the decades, accessibility stakeholders and organizations have devised metrics to define and measure caption quality.
Here are the four most widely used quality control models:
The WER model
Also called the simple word error rate, this is the most standard model, originally designed to measure transcript accuracy from automatic speech recognition tools.
Its formula is

$$\text{WER} = \frac{S + D + I}{N}$$

where S is the number of substitutions (incorrect words shown in place of correct words), D is the number of deletions, I is the number of insertions of unspoken words, and N is the total number of words spoken.
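As a quick worked example (with invented numbers), a 100-word clip captioned with 3 substitutions, 2 deletions, and 1 insertion gives

$$\text{WER} = \frac{3 + 2 + 1}{100} = 0.06$$

that is, a 6% word error rate, or 94% accuracy.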
NCAM’s WWER
The weighted word error rate, developed by the National Center for Accessible Media (NCAM), aimed to create an industry-standard caption quality measure and automated accuracy assessment tools for live programs.
This advanced version of WER rates errors on a severity weighting basis based on user surveys.
Here are NCAM’s 17 basic error types:
| # | Caption Error | Example (caption/actual) |
|---|---|---|
| 1 | Substitute singular/plural | man/men |
| 2 | Substitute wrong tense | run/ran |
| 3 | Substitute pronoun (nominal) for name | this man/Proper Name |
| 4 | Substitute punctuation | period instead of question mark |
| 5 | Split compound word/contraction | foot note/footnote, did not/didn’t |
| 6 | Two words from one (one wrong) | might yes/mighty |
| 7 | Duplicate word or insertion | criticism criticism |
| 8 | Word order | would I/I would |
| 9 | Correction by steno | disznits–dissidents |
| 10 | Dropped word – 1 or 2 | “you know” |
| 11 | Dropped word(s) – 3+ | “figure out what the best options are going to be” |
| 12 | Nearly same sound but wrong word(s)/homophone | sail/sale or work/werk |
| 13 | Substitute wrong word | blogger/hunger |
| 14 | Phonetic similarities/not a valid word | Human milating/humiliating |
| 15 | Random letters (gibberish) | Igbavboa |
| 16 | Word boundary error (also “stacking error”) | paying backpack Stan/paying back Pakistan |
| 17 | Transmission errors/garbling | GM sto/GM stock |
The NER Model
Developed by Pablo Romero-Fresco and Juan Martinez of the University of Roehampton, NER measures live caption accuracy produced using the respeaking method.
The goal was to create a functional, easy-to-apply model that incorporates basic word error rate principles, using the formula below.
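$$\text{Accuracy} = \frac{N - E - R}{N} \times 100\%$$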
N: The number of words in the respoken text, including punctuation marks (which the respeaker needs to read out), identification of speakers, etc. (Note that this differs from the WER model, where N stands for the number of words in the original audio.)
E: ‘Edition’ errors, introduced by the respeaker. (The most common of these is that the respeaker omits information due to the speed of the dialogue.)
Note: Compare the respoken text and the original text to calculate both N and E. Classify errors as serious, standard, or minor, scoring them 1, 0.5, and 0.25, respectively.
R: ‘Recognition’ errors, caused by mispronunciation or the software not recognizing a word.
CE: Correct editions refer to the instances in which the respeaker’s editing has not led to a loss of information, including the omission of redundancies and hesitations. You can calculate it by comparing the respoken text with the original text.
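As a quick worked example (with invented numbers): a respoken text of N = 1,000 words, with edition errors weighted to E = 5 and recognition errors weighted to R = 10, scores

$$\frac{1000 - 5 - 10}{1000} \times 100\% = 98.5\%$$

comfortably above the 98% accuracy rate commonly cited as the minimum acceptable under the NER model.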
Automated Caption Evaluation (ACE and ACE 2)
Designed by Sushant Kafle and Matt Huenerfauth of Rochester Institute of Technology, NY, USA, ACE helps assess caption quality automatically.
The process calculates two main aspects.
- Word importance to identify the keywords in the text based on their context.
- Semantic accuracy measures the extent of meaning change when errors occur.
Unlike other evaluation methods, ACE and ACE 2 don’t rely on predetermined weights for different error types. Instead, they assess the impact of errors automatically.
The CRIM Method
The Centre de Recherche Informatique de Montréal (CRIM) attempted to adapt traditional WER methods to the specifics of respeaking.
It suggested adding a step to the word accuracy method in which a human operator goes through the automatically aligned original and respoken texts and examines whether each deletion has caused any loss of information.
This allows proper accountability for both verbatim and edited respeaking.
Conclusion
All the closed captioning guidelines discussed above are extremely important for giving deaf and hard-of-hearing students equal access to education.
Any institution that wants to successfully integrate digital accessibility features like closed captions must comply with these rules.
And if you want to be extra sure about your caption quality, EL.AI's ADA-compliant captioning will help you caption videos with 99.8% accuracy.
Book a demo today, and let's bring every student closer to knowledge together.