<div dir="ltr">Hello readers,<br><br>I am curious about some algorithmic and numerical aspects of decoding (not encoding) an AC3 or AAC stream. Let's assume that all sample rates are 48000 Hz and that all audio is mono.<br><br><b>Main section 1</b>. Is my outline of the decoding process correct? Which points are wrong? Some of the points are from <a href="https://libav.org/documentation/doxygen/master/group__lavc__encdec.html">https://libav.org/documentation/doxygen/master/group__lavc__encdec.html</a>. <br><br>1. When decoding an audio stream, we segment the stream into an ordered list of packets, e.g. P[0], P[1], ..., P[999]. (Assume 1000 packets.)<br>2. This segmentation relies on syncwords to guard against total data corruption in case 1 byte is lost. If 1 byte is lost, then usually only 1 or 2 packets are affected.<br>3. For AAC, the packets have different numbers of bytes. AC3 files usually have a constant packet size.<br>4. In the C process, a decoding object D is initialized.<br>5. We pass packet P[0] to the D.avcodec_send_packet() method, returning output Y[0]. This effectively passes a small binary data string on the order of 500 bytes.<br>6. Since I'm assuming everything is mono audio, this method returns a 1-d array of floats. This method may change the internal state of D. We then pass packet P[1], then P[2], ..., P[999]. These successive calls return Y[1], Y[2], ..., Y[999], respectively. Because of the possible state change, it is important to pass the packets in a specific order.<br>7. This array has length (always?) 1024 for AAC and 1536 for AC3.<br>8. This page claims that frames stand alone. Does that mean that packets are decoded independently, or does it just mean that 1024-sample frames are encoded independently, or am I misunderstanding?<br> <a href="https://wiki.multimedia.cx/index.php/Understanding_AAC">https://wiki.multimedia.cx/index.php/Understanding_AAC</a><br>9. 
(less important for me) If the packet timestamps of the stream are very uniform, then we will simply concatenate all of the returned arrays Y[0], ..., Y[999] into the full array, and this is the decoded array. If the packets have nonuniform timestamps, then we still might concatenate all of the arrays, or maybe insert zero samples, depending on the other parameters of the FFmpeg call.<br><br>--<br><br><b>Main section 2.</b> Let's suppose that my outline in section 1 is accurate. If not, then the rest of my message might be moot.<br><br>Let's suppose we have an initial decoder object D, either the AAC or the AC3 codec, and packets P[0], P[1], ..., P[999]. Assuming that the decoder state matters a lot, I'd like to consider 3 orders of passing the packets to D.<br><b><br>Order 1</b>: The same order as the packets. P[0], P[1], ..., P[999]<br><b>Order 2</b>: We remove P[0] completely.  P[1], P[2], ..., P[999]<br><b>Order 3</b>: We replace P[0] with an arbitrary packet, P_new. (e.g. P_new = P[1], but P_new could be an arbitrary packet not in the list.) P_new, P[1], ..., P[999]<br><br>In order 1, suppose that the output arrays are Y[0], Y[1], ..., Y[999].<br>In order 2, since the state may matter, we can't say that the first array output is Y[1]. Instead, we use different symbols Y2[1], Y2[2], ..., Y2[999]. (Indexing starts from 1, so this output list has 999 elements.)<br>In order 3, suppose that the output arrays are Y3[0], Y3[1], ..., Y3[999]. (1000 elements.)<br><br>My main questions are: Is the state of D flushed fairly quickly, or is the state so persistent that any sequence 'mutation' will significantly change it, or is it somewhere in between? Although the lists Y, Y2, and Y3 are clearly similar waveforms perceptually, are they completely different at a low level, or do they converge?<br><br>If hypothetically the state of D is flushed after 50 packets, then would Y[n], Y2[n], Y3[n] be approximately equal length-1024 float arrays for n >= 51? Is there any such value of n? 
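<br><br>To make "approximately equal" and "such a value of n" concrete, here is a minimal Python sketch of the kind of comparison that could be run on the decoded outputs. It uses plain lists of floats as stand-ins for decoded frames; the names frames_close and convergence_index and the tolerance value are hypothetical choices for illustration, not part of FFmpeg or PyAV, and the sample values are toy data rather than real decoder output.

<pre>
```python
import math

def frames_close(a, b, tol=1e-6):
    """True if two decoded frames have the same length and agree sample by sample."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b))

def convergence_index(ref, other, tol=1e-6):
    """Smallest n such that ref[k] ~ other[k] for every k from n to the end.

    Returns None if even the last compared frames differ.
    """
    n = None
    # Walk backwards from the last common index until the frames stop matching.
    for k in range(min(len(ref), len(other)) - 1, -1, -1):
        if not frames_close(ref[k], other[k], tol):
            break
        n = k
    return n

# Toy stand-ins for Order 1 and Order 3 outputs (hypothetical values).
Y = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
Y3 = [[9.0, 9.0], [0.2, 0.9], [0.4, 0.5], [0.6, 0.7]]
print(convergence_index(Y, Y3))
```
</pre>

For Order 2 the two lists would first have to be aligned by packet, e.g. by comparing Y[1], Y[2], ... against Y2[1], Y2[2], ..., before calling the function.<br><br>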
Or maybe the state of D depends on how many packets have been decoded and is otherwise flushed after 50 packets? If so, is Y[n] ~ Y3[n] for n >= 51 but Y[n] != Y2[n] for any large n, because the decoder processed n packets before outputting Y[n] but only n-1 packets before Y2[n]?<br><br>Note that I have experimented with PyAV and I suspect that for the AC3 codec and a deletion mutation, there is no such value of n: the decoder states will always be different. I do not know about a substitution mutation or the AAC codec, or whether I am doing my PyAV analysis correctly. I don't know for sure, and I would be obliged if a reader knows. I have only experimented with PyAV since I am not used to using C. <br><div><br></div><div>Sincerely,</div><div>Bobby</div></div>