Discussions in the WHATWG and W3C over several months have led up to the announcement of a new <track> element and the WebSRT format. WebSRT is intended to be mostly compatible with existing SRT content and software, in order to hitch a free ride on the popularity of SRT.
Unfortunately, there was never a proper SRT parsing specification, so all media players implement their own parsers and error handling, much as was the case with HTML before HTML5. If these media players are going to support any of the new features in WebSRT, they will have to do so by modifying existing SRT parsers, as there’s nothing to differentiate SRT and WebSRT. Interoperability would be helped if they were able to converge towards the same parsing algorithm, but they can only do that if WebSRT handles existing content as well as or better than current algorithms. If we cannot achieve that, it might be better to invent a format that has no legacy compatibility constraints.
There’s been some testing of existing media players, but not much analysis of existing content. I asked OpenSubtitles if they could help out, upon which they very kindly provided me with the latest 10000 uploaded SRT* files. I wrote a Python script to analyze them, and I think the results are interesting.
First, a note on character encoding. Only 666 files were valid UTF-8, and of those 472 were pure 7-bit ASCII. That leaves 194 files, so deliberate use of UTF-8 doesn’t even reach 2%. Since WebSRT assumes UTF-8, little existing content can be reused as-is.
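A check like this can be sketched in a few lines of Python. This is only an illustration, not the script used for the analysis; the function name `classify_encoding` and the three labels are made up for the example.

```python
def classify_encoding(data: bytes) -> str:
    """Classify raw file bytes as 'ascii', 'utf-8' or 'other'.

    Hypothetical helper: 'utf-8' here means valid UTF-8 that contains
    at least one non-ASCII byte, i.e. "deliberate" use of UTF-8.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Most likely a legacy 8-bit encoding such as Windows-1252.
        return "other"
```

Note that a file starting with a UTF-8 byte order mark would be counted as "utf-8" by this sketch even if the rest is ASCII; whether the original numbers treat that case the same way is not known.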
This is the typical structure of SRT (source):
1
00:00:10,000 --> 00:00:16,000
The Conceited General

2
00:01:08,520 --> 00:01:10,240
The general returns victorious
I’ll use WebSRT terminology: above are 2 cues, each with 3 lines for identifier, timings and the cue text followed by a blank line. Unfortunately, assuming that a blank line separates cues turns out to be unreliable, as 241 files at some point omitted that blank line. In my code, I let a timing line start a new cue even if not preceded by a blank line. I’m not sure what the best general approach is.
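The idea of letting a timing line start a new cue can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual analysis code: `split_cues`, `TIMING_RE` and the dictionary layout are hypothetical, and the non-blank line directly before a timing line is always taken to be the identifier.

```python
import re

# Matches the start of a timing line such as "00:00:10,000 --> 00:00:16,000".
TIMING_RE = re.compile(
    r"^\d\d:\d\d:\d\d[,.]\d\d\d\s*-->\s*\d\d:\d\d:\d\d[,.]\d\d\d"
)

def split_cues(text):
    """Split SRT text into cues; a timing line always starts a new cue."""
    cues = []
    pending = []  # lines seen since the last timing line
    for line in text.splitlines():
        line = line.strip()
        if TIMING_RE.match(line):
            # Assumption: a non-blank line right before the timings is the identifier.
            identifier = pending.pop() if pending and pending[-1] else None
            if cues:
                # Whatever is left belongs to the previous cue's text.
                cues[-1]["text"] = [l for l in pending if l]
            cues.append({"id": identifier, "timing": line, "text": []})
            pending = []
        else:
            pending.append(line)
    if cues:
        cues[-1]["text"] = [l for l in pending if l]
    return cues
```

The weakness of this approach is visible in the code: if a file has no identifiers and also omits the blank line, the last line of cue text will be mistaken for the next cue's identifier.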
The identifier line is mostly useless and has been made optional in WebSRT. I defined any line preceding a timing line as being the identifier. Under this assumption, 571 files had identifiers that didn’t increase by 1 per cue and 55 files had identifiers which weren’t numbers at all. This doesn’t seem to matter to existing players.
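As an illustration, the two identifier checks could look something like this in Python. The function name `check_identifiers` and the returned keys are hypothetical, and treating a missing identifier as non-numeric is an assumption of this sketch, not something the numbers above necessarily reflect.

```python
import re

def check_identifiers(identifiers):
    """Report whether identifiers are all numbers and increase by 1 per cue.

    `identifiers` is a list of strings (or None for a missing identifier),
    one per cue, in file order.
    """
    numbers = []
    for ident in identifiers:
        # Assumption: only plain ASCII digit strings count as numbers.
        if ident is None or not re.fullmatch(r"[0-9]+", ident):
            return {"numeric": False, "sequential": False}
        numbers.append(int(ident))
    sequential = all(b - a == 1 for a, b in zip(numbers, numbers[1:]))
    return {"numeric": True, "sequential": sequential}
```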
The timings are a bit more interesting. No fewer than 1707 files had overlapping cues. Most existing players handle this by simply showing (only) the next cue when it begins, so such overlap goes unnoticed. However, the WebSRT parser makes no such adjustments, intending that overlapping cues be shown simultaneously. This will almost certainly be a problem if existing content is reused. Also worth noting is that only 4 files consistently used a period (.) to separate seconds and milliseconds, 2 files mixed the two (apparently typos) and all the rest used only commas (,). Only 1 file used the SubRip X1: ... syntax and 38 files had something else trailing the timings. This was mostly trailing punctuation (.,?), a missing newline before the cue text, or random typos.
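To make the overlap problem concrete, here is a rough Python sketch that parses timing lines, detects overlap, and mimics the "show only the next cue" behaviour of existing players. All names are hypothetical, and "overlap" is assumed to mean that a cue starts before the previous cue in file order has ended, which may not be exactly the definition behind the 1707 figure.

```python
import re

TIMESTAMP_RE = re.compile(r"(\d\d):(\d\d):(\d\d)[,.](\d\d\d)")

def parse_timing(line):
    """Return (start_ms, end_ms); anything trailing the timings is ignored."""
    stamps = []
    for m in TIMESTAMP_RE.finditer(line):
        h, mnt, s, ms = (int(g) for g in m.groups())
        stamps.append(((h * 60 + mnt) * 60 + s) * 1000 + ms)
    if len(stamps) < 2:
        raise ValueError("not a timing line: %r" % line)
    return stamps[0], stamps[1]

def has_overlap(timings):
    """True if any cue starts before the previous cue has ended."""
    return any(nxt[0] < prev[1] for prev, nxt in zip(timings, timings[1:]))

def clip_overlaps(timings):
    """End each cue no later than the start of the next one."""
    clipped = []
    for i, (start, end) in enumerate(timings):
        if i + 1 < len(timings):
            next_start = timings[i + 1][0]
            # Never let a cue end before it starts, even if cues are out of order.
            end = max(start, min(end, next_start))
        clipped.append((start, end))
    return clipped
```

A parser that wanted to match what existing players do could apply something like `clip_overlaps` before display, at the cost of making intentionally simultaneous cues impossible.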
What remains is the cue text itself. Markup, which I defined as anything matching the regular expression '<(\w+)>' or the string '<font', was surprisingly common, occurring in 5525 files. The most common are <i> (5273), <b> (937), <font ...> (346) and <u> (71). The WebSRT parser handles italic, bold and ruby markup, ignoring the rest. The fact that markup is so common means that any robust SRT (not just WebSRT) parser must handle it in some way, even if only by ignoring it.
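The markup definition above translates fairly directly into Python. The following is a sketch rather than the original script; `markup_tags` is a hypothetical name, and lowercasing tag names is an assumption made here so that `<I>` and `<i>` are counted together.

```python
import re

# Same pattern as in the text: a bare tag such as <i> or <b>.
TAG_RE = re.compile(r"<(\w+)>")

def markup_tags(text):
    """Return the set of tag names that appear in one file's text."""
    tags = {name.lower() for name in TAG_RE.findall(text)}
    # <font ...> usually carries attributes, so it is matched as a plain string.
    if "<font" in text:
        tags.add("font")
    return tags
```

Counting how many files use each tag is then a matter of feeding each file's set into a `collections.Counter`. Closing tags such as `</i>` are not matched by the pattern, so they do not affect the result.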
That’s what I could gather from the data I have. If there’s something you want me to check, just leave a comment. Many thanks to OpenSubtitles for providing the data.
*They noted that this regular expression was used to identify SRT files: /^\d\d:\d\d:\d\d[,.]\d\d\d\s*-->\s*\d\d:\d\d:\d\d[,.]\d\d\d\s*(X1:\d+\s+X2:\d+\s+Y1:\d+\s+Y2:\d+)?\s*$/m This means that very broken files won’t have been included.
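For reference, the same filter can be expressed in Python roughly like this. The pattern is copied from the note above; the names `SRT_TIMING_RE` and `looks_like_srt` are hypothetical, and the `/m` flag is assumed to correspond to `re.MULTILINE`.

```python
import re

SRT_TIMING_RE = re.compile(
    r"^\d\d:\d\d:\d\d[,.]\d\d\d\s*-->\s*\d\d:\d\d:\d\d[,.]\d\d\d\s*"
    r"(X1:\d+\s+X2:\d+\s+Y1:\d+\s+Y2:\d+)?\s*$",
    re.MULTILINE,
)

def looks_like_srt(text):
    """True if at least one line is a well-formed SRT timing line."""
    return SRT_TIMING_RE.search(text) is not None
```

Since a single matching line is enough, a file can pass this filter and still contain malformed timing lines elsewhere.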