Experiences using Zencoder for Live Transcoding

Feb 2, 2014 - By Rusty Conover

During the past few months I've been able to work very closely with Zencoder for Live Transcoding for both NY Fashion Week and the AKC Championship show in December 2013. Millions of viewers have watched these shows since they were broadcast live. There are, however, a few difficulties that keep me from calling Zencoder the solution for live transcoding.

Benefits and Downsides of EC2 for Live Transcoding

Zencoder runs on Amazon EC2. This is fantastic because it means the service is scalable, reliable, and available across many geographies. The downside is that there is no dedicated hardware, so jobs can suffer CPU and I/O starvation from busy neighbors or simply from oversubscription of Zencoder's own instances. Zencoder does not launch an EC2 instance for each transcoding job; it runs larger instances, places individual transcoding jobs on them as needed so that each job gets a slice of a particular instance, and launches new instances when capacity is insufficient.

We have observed individual outputs with excessive latency when Zencoder has many video on demand transcoding jobs scheduled. They have stated that they are addressing this issue, but I would recommend that you test all of your live transcoding outputs and run a parallel backup transcoding job in another AWS region, so that if the first job runs into latency you can use the outputs of the other job (see the sketch below). The show simply must go on, and you can't stop everyone from watching because one or two outputs have latency that has gone to the other side of the moon.
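
As a sketch of that redundancy setup, the two jobs can be created with Zencoder's v2 job API and pinned to different regions. The endpoint and the "live_stream" and "region" options reflect Zencoder's API documentation of the time, but the region values and the output specification here are simplified assumptions; check the current docs before relying on them.

    import requests

    ZENCODER_JOBS = "https://app.zencoder.com/api/v2/jobs"
    HEADERS = {"Zencoder-Api-Key": "your-api-key"}  # placeholder key

    def start_live_job(region, label):
        # One live transcoding job pinned to a region. The output spec is
        # trimmed down to a single HLS rendition for brevity.
        job = {
            "live_stream": True,
            "region": region,  # assumed region identifiers; verify in the docs
            "outputs": [{
                "label": label + "-hls-720p",
                "type": "segmented",
                "size": "1280x720",
                "url": "s3://my-bucket/" + label + "/720p/playlist.m3u8",
            }],
        }
        resp = requests.post(ZENCODER_JOBS, json=job, headers=HEADERS)
        resp.raise_for_status()
        return resp.json()  # includes the job id and the RTMP ingest point

    # Push the same source to both ingest points; if the primary's outputs
    # develop runaway latency, point players at the backup's outputs.
    primary = start_live_job("us-oregon", "primary")
    backup = start_live_job("us-virginia", "backup")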

Latency in Live Transcoding

When a live transcoding job runs it takes in source video, converts it to another format or resolution, and sends it to the destination. Zencoder needs to do all of this with timing that is consistent across the various outputs, and it needs to produce output frames at a rate matching or exceeding the input frame rate. If Zencoder can't produce frames fast enough the streams start to fall behind the input; they "lag", for lack of a better term. Some lag is acceptable, but if you have adaptive bitrate clients the lag must be consistent between streams, otherwise viewers will experience time shifts when they change streams.

If you're producing three different output streams from one input for live playback, the output rates of those three streams must be very nearly synchronized. Otherwise, when a viewer switches from one bitrate to another, their stream will appear to shift in time rather than just resolution. If a user goes from a low bitrate stream that is processing 12 seconds behind the live input to a high bitrate stream that is 124 seconds behind the live input, things will look better, but they will have jumped about two minutes behind where they were watching. This leads to confused viewers.

Zencoder places transcoding jobs on EC2 instances in a fashion that makes it opaque to the user which particular instance is producing their stream. We've experienced situations where Zencoder put too many transcoding jobs on the same instance, leaving too little CPU to produce consistent output at a frame rate matching the input. This meant that some of the larger streams would be encoded at ~20 frames per second rather than 30; after 10 seconds that large stream is about 3.33 seconds behind the others, and it will continue to fall further behind for as long as the live stream goes on. This FPS measure is not the frame rate of the video itself; it is the rate at which the encoding process produces frames.
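
To make that arithmetic concrete, here is a small Python calculation of how far an output falls behind when its encoding rate is below the input frame rate, using the numbers from the example above:

    def lag_seconds(input_fps, encode_fps, elapsed_seconds):
        # In elapsed_seconds of wall-clock time the input delivers
        # input_fps * elapsed_seconds frames, but the encoder only finishes
        # encode_fps * elapsed_seconds of them. The shortfall, expressed in
        # seconds of video, is the accumulated lag.
        frames_behind = (input_fps - encode_fps) * elapsed_seconds
        return frames_behind / input_fps

    print(lag_seconds(30, 20, 10))   # ~3.33 seconds behind after 10 seconds
    print(lag_seconds(30, 20, 600))  # 200 seconds behind after 10 minutes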

For video on demand transcoding the latency of an output does not matter, because the job is not complete until all outputs have finished processing. The need to produce output consistently in real time is one of the things that makes live transcoding difficult.

Zencoder's interface does not directly show the rate at which frames are being produced for an output; it lists only encoder latency and upload latency in seconds. They should show the current FPS being produced for each output, because that would help identify settings that demand more CPU than a job is given, meaning the job can never run without falling behind real time.
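
Until such a figure exists you can approximate it yourself: if the encoder latency they do report keeps growing, the output is producing fewer frames per second than the input delivers. A rough monitoring sketch; get_encoder_latency() is a placeholder for however you read that latency number out of Zencoder's interface or API, not a documented call.

    import time

    def get_encoder_latency(job_id, output_label):
        # Placeholder: return the encoder latency (in seconds) that Zencoder
        # reports for this output, by whatever means you have available.
        raise NotImplementedError

    def watch_output(job_id, output_label, input_fps=30.0, interval=30):
        # Flag an output whose latency keeps growing, i.e. whose effective
        # encoding rate has dropped below the input frame rate.
        previous = get_encoder_latency(job_id, output_label)
        while True:
            time.sleep(interval)
            current = get_encoder_latency(job_id, output_label)
            growth = current - previous  # extra seconds of lag accumulated
            effective_fps = input_fps * (1 - growth / interval)
            if growth > 0:
                print("%s is falling behind: ~%.1f fps vs a %.1f fps input"
                      % (output_label, effective_fps, input_fps))
            previous = current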

From my experience the most intensive operation at Zencoder is scaling a live video stream. For HD input, it seems they use interpolation methods that are too expensive for the amount of CPU they allocate to each job. Scaling from 1920x1080@29.97fps to 1280x720@29.97fps cannot be performed in real time if you're using HLS with normal settings for iOS devices. It would be much better if they allowed the interpolation/scaling method to be chosen, or simply increased the amount of CPU given to live transcoding jobs.
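
Zencoder does not expose the scaler choice, but as an illustration of how cheap that knob would be to offer, here is how the interpolation method is selected with ffmpeg (the file names are placeholders); the cheaper swscale flags make a noticeable CPU difference on a 1080p to 720p resize:

    import subprocess

    def transcode_720p(src, dst, scaler="fast_bilinear"):
        # "fast_bilinear" and "lanczos" are standard libswscale flag values;
        # the scaler is picked per invocation via the scale filter's flags.
        subprocess.run([
            "ffmpeg", "-i", src,
            "-vf", "scale=1280:720:flags=%s" % scaler,
            "-r", "29.97",
            dst,
        ], check=True)

    # Benchmark a cheap scaler against an expensive one on your own input.
    transcode_720p("input_1080p.mp4", "out_fast.mp4", "fast_bilinear")
    transcode_720p("input_1080p.mp4", "out_lanczos.mp4", "lanczos")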

Startup Time

Zencoder is not yet at the point where it has spare capacity at all times in all AWS data centers, that is, EC2 instances ready to start running your job immediately. It needs to boot EC2 instances, which takes time.

I've found that when I start a live transcoding job in Oregon there is a five to seven minute wait before output is produced. While Zencoder is launching instances it buffers the video input. Once an instance has launched and the transcoding job for a particular output is running, it sets about transcoding the buffered data. If one output is faster than the others, say one can produce output at 100fps while another can only do 30fps, the 100fps output will catch up to real time faster than the slower one. This inconsistency in output rate leads to the same problem as the latency issue discussed above: if a viewer changes outputs (with an adaptive bitrate client, for instance) the stream will shift in time rather than just bitrate.

The workaround I've found is to launch the job, then wait and monitor the Zencoder interface until outputs are being produced. If outputs start being produced right away (notably sooner than 7 minutes), that transcoding job will work well. If the job's outputs only start being produced about 7 minutes after the job is started, it's best to stop that live transcoding job and launch it again. Cancelling the first live transcoding job is immediate, and because Zencoder does not shut down the EC2 instance you may be able to reuse it for the relaunched job, if you're lucky.
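
A sketch of that launch-and-verify loop, assuming Zencoder's v2 endpoints for creating and cancelling jobs; has_output_activity() is a placeholder for whatever signal you trust (the dashboard, the progress API, or segments appearing at the output destination):

    import time
    import requests

    API = "https://app.zencoder.com/api/v2"
    HEADERS = {"Zencoder-Api-Key": "your-api-key"}  # placeholder key

    def create_live_job(job_spec):
        resp = requests.post(API + "/jobs", json=job_spec, headers=HEADERS)
        resp.raise_for_status()
        return resp.json()["id"]

    def cancel_job(job_id):
        # Cancellation is immediate; Zencoder keeps the warm EC2 instance
        # around, so a relaunched job often starts producing output sooner.
        requests.put("%s/jobs/%s/cancel.json" % (API, job_id), headers=HEADERS)

    def has_output_activity(job_id):
        # Placeholder: return True once the job's outputs are actually
        # being produced.
        raise NotImplementedError

    def start_with_retry(job_spec, deadline=7 * 60, poll=15, attempts=3):
        for _ in range(attempts):
            job_id = create_live_job(job_spec)
            started = time.time()
            while time.time() - started < deadline:
                if has_output_activity(job_id):
                    return job_id      # output came up quickly; keep this job
                time.sleep(poll)
            cancel_job(job_id)         # slow start; cancel and relaunch
        raise RuntimeError("live job never produced output promptly")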

Lack of RTMP Metadata

Zencoder doesn't inject metadata into the RTMP output in a way that persists to the viewer of the RTMP stream. I believe this makes players take longer than they should to display the live streams. You can verify the lack of metadata by using the Akamai debug player.

HLS with S3 and CloudFront Permissions Race Condition

Zencoder offers a flag called "public" that marks uploaded resources as available to the public.

Using the public flag with HLS and CloudFront causes some serious problems. CloudFront is Amazon's CDN offering, and it is primarily set up as a CDN in front of an S3 bucket. When using the public flag, Zencoder first uploads the .ts file and then makes a second call to S3 to mark the file's permissions as public. The problem is that there is a race condition between when the file is uploaded and when its permissions change to public. If a request comes in via CloudFront for the file before its permissions have been updated, the permission-denied reply is cached at CloudFront.

To work around this you should always specify a custom header on the output with the name "x-amz-acl" and the value "public-read". Zencoder will then upload the file to S3 with public permissions already set, without needing a second call, which avoids the race condition.
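
For example, the output section of the job request would carry the ACL as an upload header instead of the public flag (a sketch; the headers option is the one Zencoder documents for S3 uploads, but confirm the current syntax):

    # The single S3 PUT already makes the segment publicly readable, so no
    # follow-up permissions call is needed and CloudFront can never cache a
    # permission-denied response for it.
    output = {
        "type": "segmented",
        "url": "s3://my-bucket/live/playlist.m3u8",
        # instead of: "public": True
        "headers": {
            "x-amz-acl": "public-read",
        },
    }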

Conclusion

Zencoder is great once you put in the time to work around its quirks and bugs. And if you're really lucky and run your live transcoding jobs in an AWS region where other Zencoder customers aren't competing for your CPU and I/O, everything will be peachy.

Look for more solutions in this space to emerge, as cloud transcoding is the way of the future.