Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 BOM not accounted for in JsonLocation.getByteOffset() #533

Closed
fabienrenaud opened this issue Apr 23, 2019 · 5 comments
Closed

UTF-8 BOM not accounted for in JsonLocation.getByteOffset() #533

fabienrenaud opened this issue Apr 23, 2019 · 5 comments
Milestone

Comments

@fabienrenaud
Copy link
Contributor

Version: Jackson 2.9.8

parser.getCurrentLocation().getByteOffset() returns the wrong byte offset for the underlying byte array when the payload start with a BOM.
The json parser processes well such json payloads but the JsonLocation it returns ignores the offset introduced by the BOM.

Full standalone repro:

package test;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.util.Arrays;

public class Test {

    private static final JsonFactory JSON_FACTORY;

    static {
        JSON_FACTORY = new JsonFactory();
        JSON_FACTORY.setCharacterEscapes(null);
        JSON_FACTORY.disable(JsonFactory.Feature.INTERN_FIELD_NAMES);
    }

    static byte[] extract(byte[] bytes, JsonParser parser, JsonToken token) throws IOException {
        switch (token) {
            case START_OBJECT:
                int startIndex = (int) parser.getCurrentLocation().getByteOffset() - 1;
                parser.skipChildren();
                int endIndex = (int) parser.getCurrentLocation().getByteOffset();
                return Arrays.copyOfRange(bytes, startIndex, endIndex);
        }
        throw new RuntimeException();
    }

    static byte[] parseAndExtract(byte[] bytes) throws IOException {
        JsonParser parser = JSON_FACTORY.createParser(bytes);
        System.out.println("parser type: " + parser.getClass().getCanonicalName());

        parser.nextToken(); // skip start object
        switch (parser.nextToken()) {
            case FIELD_NAME:
                if ("payload".equalsIgnoreCase(parser.getCurrentName())) {
                    return extract(bytes, parser, parser.nextToken());
                }
                break;
        }
        throw new RuntimeException();
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"payload\":{\"name\":\"foo\"}}";
        byte[] bytes = json.getBytes();

        System.out.println("UTF-8 no BOM:");
        byte[] result = parseAndExtract(bytes);
        System.out.println(new String(result));
        System.out.println();

        System.out.println("UTF-8 no BOM, with leading characters:");
        result = parseAndExtract(("\r\n\t\n\t\t" + json).getBytes());
        System.out.println(new String(result));
        System.out.println();

        byte[] newBytes = new byte[bytes.length + 3];
        // write BOM
        newBytes[0] = (byte) 0xEF;
        newBytes[1] = (byte) 0xBB;
        newBytes[2] = (byte) 0xBF;
        System.arraycopy(bytes, 0, newBytes, 3, bytes.length);

        System.out.println("UTF-8 BOM:");
        result = parseAndExtract(newBytes);
        System.out.println(new String(result));
        System.out.println();
    }
}

Output:

UTF-8 no BOM:
parser type: com.fasterxml.jackson.core.json.UTF8StreamJsonParser
{"name":"foo"}

UTF-8 no BOM, with leading characters:
parser type: com.fasterxml.jackson.core.json.UTF8StreamJsonParser
{"name":"foo"}

UTF-8 BOM:
parser type: com.fasterxml.jackson.core.json.UTF8StreamJsonParser
d":{"name":"fo

You can see the result for the BOM payload gets shifted to the left by exactly 3 bytes while other payloads with or without padding characters are handled as expected.

@fabienrenaud fabienrenaud changed the title UTF-8 BOM not accounted for in JsonLocation.getByteOffset() UTF-8 BOM not accounted for in JsonLocation.getByteOffset() Apr 23, 2019
@cowtowncoder
Copy link
Member

Thank you for reporting this and doing troubleshooting. Sounds like a bug.

fabienrenaud pushed a commit to fabienrenaud/jackson-core that referenced this issue Apr 24, 2019
UTF8StreamJsonParser tracks read pointer (offset) and bytes processed
separately and uses those to generate JsonLocation. When the byte
payload starts with a UTF BOM, ByteSourceJsonBootstrapper processes a
few bytes ahead of the parser, moves/increases the offset and passes the
newly computed offset to the parser without telling it some bytes have
been pre-processed.
With this change, the number of bytes pre-processed for encoding
detection is passed to the parser. JsonLocation instances returned by
the parser now point to the correct byte offset when payload has a BOM.

Issue: FasterXML#533
fabienrenaud added a commit to fabienrenaud/jackson-core that referenced this issue Apr 24, 2019
UTF8StreamJsonParser tracks read pointer (offset) and bytes processed
separately and uses those to generate JsonLocation. When the byte
payload starts with a UTF BOM, ByteSourceJsonBootstrapper processes a
few bytes ahead of the parser, moves/increases the offset and passes the
newly computed offset to the parser without telling it some bytes have
been pre-processed.
With this change, the number of bytes pre-processed for encoding
detection is passed to the parser. JsonLocation instances returned by
the parser now point to the correct byte offset when payload has a BOM.

Issue: FasterXML#533
@fabienrenaud
Copy link
Contributor Author

I submitted a PR a couple of weeks ago. Please let me know of any feedback or necessary changes.

cowtowncoder added a commit that referenced this issue May 16, 2019
fabienrenaud added a commit to fabienrenaud/jackson-core that referenced this issue May 21, 2019
UTF8StreamJsonParser tracks read pointer (offset) and bytes processed
separately and uses those to generate JsonLocation. When the byte
payload starts with a UTF BOM, ByteSourceJsonBootstrapper processes a
few bytes ahead of the parser, moves/increases the offset and passes the
newly computed offset to the parser without telling it some bytes have
been pre-processed.
With this change, the number of bytes pre-processed for encoding
detection is passed to the parser. JsonLocation instances returned by
the parser now point to the correct byte offset when payload has a BOM.

Issue: FasterXML#533
fabienrenaud added a commit to fabienrenaud/jackson-core that referenced this issue May 21, 2019
UTF8StreamJsonParser tracks read pointer (offset) and bytes processed
separately and uses those to generate JsonLocation. When the byte
payload starts with a UTF BOM, ByteSourceJsonBootstrapper processes a
few bytes ahead of the parser, moves/increases the offset and passes the
newly computed offset to the parser without telling it some bytes have
been pre-processed.
With this change, the number of bytes pre-processed for encoding
detection is passed to the parser. JsonLocation instances returned by
the parser now point to the correct byte offset when payload has a BOM.

Issue: FasterXML#533
@cowtowncoder
Copy link
Member

Looks good. I can merge this in master, backport in 2.10.

The only thing we now need is the Contributor License Agreement, found from https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf
I think we don't yet have one for you? Only needs to be sent once for any and all contributions.
The simplest way is usually to print, fill & sign, scan, email to info at fasterxml dot com.
Once I receive it I will merge this patch.

Thank you in advance!

@fabienrenaud
Copy link
Contributor Author

Just submitted contributor agreement.
Will this make it to 2.9 as well?

@cowtowncoder cowtowncoder added 2.10 and removed 2.9 labels May 29, 2019
@cowtowncoder cowtowncoder added this to the 2.10.0 milestone May 29, 2019
cowtowncoder added a commit that referenced this issue May 29, 2019
@cowtowncoder
Copy link
Member

@fabienrenaud I think I'd rather backport it only in 2.10 (had to do that manually). But 2.10.0 should be out within month.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants