This repository has been archived by the owner on Jan 22, 2019. It is now read-only.

Two double quotes in a column cause Unexpected character exception #151

Open
youribonnaffe opened this issue Aug 15, 2017 · 7 comments

Comments

@youribonnaffe

I have a CSV file with the following content (just a limited extract here):

route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color
OCE669711,OCESN,"",""Cars Réguliers ""L 11""  (Nantes - St Gilles Croix de Vie)"",,3,,,

Parsing this CSV content with CsvMapper causes the following error:

com.fasterxml.jackson.core.JsonParseException: Unexpected character ('C' (code 67)): Expected separator ('"' (code 34)) or end-of-line
 at [Source: java.io.StringReader@279ad2e3; line: 2, column: 23]

	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1702)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:558)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:456)
	at com.fasterxml.jackson.dataformat.csv.CsvParser._reportUnexpectedCsvChar(CsvParser.java:1089)
	at com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextQuotedString(CsvDecoder.java:838)
	at com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:601)
	at com.fasterxml.jackson.dataformat.csv.CsvParser._handleNextEntry(CsvParser.java:678)
	at com.fasterxml.jackson.dataformat.csv.CsvParser.nextFieldName(CsvParser.java:575)
	at com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringKeyMap(MapDeserializer.java:505)
	at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:362)
	at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:27)
	at com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:277)
	at com.fasterxml.jackson.databind.MappingIterator.readAll(MappingIterator.java:317)
	at com.fasterxml.jackson.databind.MappingIterator.readAll(MappingIterator.java:303)

Here is a unit test to reproduce the issue:

    @Test
    public void doubleQuotes() throws Exception {
        String content =
                "route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color\n" +
                        "OCE669711,OCESN,\"\",\"\"Cars Réguliers \"\"L 11\"\"  (Nantes - St Gilles Croix de Vie)\"\",,3,,,";

        CsvSchema schema = CsvSchema.emptySchema().withHeader();
        MappingIterator<Map<String, String>> it = new CsvMapper().readerFor(Map.class)
                .with(schema)
                .readValues(content);

        assertEquals(1, it.readAll().size());
    }

Is there a way to configure the parser to be more flexible about this usage of quotes?
Unfortunately, the CSV file is not under my control and I won't be able to change its format.

Parsing this file with OpenCSV worked, but I was hoping to switch to Jackson for better performance.

@cowtowncoder
Member

@youribonnaffe Thank you for reporting this problem. From the code and the example, it seems to me this should just work as is.

Just one question: which version of Jackson are you using? Latest stable versions are 2.9.0 / 2.8.9.

@youribonnaffe
Author

I'm using 2.8.9

@cowtowncoder
Member

Thank you for confirming. That sounds odd as I am pretty sure this functionality has been around and tested for a long time.

@cowtowncoder
Member

Hmmh. Actually, I am not sure this is a bug after all.

The problem is that the first double-quote is taken to mean that the column value is quoted.
This leaves the second quote, which is taken as the end quote because it is NOT doubled -- for proper behavior here, there should be 3 double-quotes, which would be interpreted as expected.
So it would seem that the code that generated this CSV did not handle this aspect properly, based on my understanding of CSV.

Having said that, CSV "specification" is quite loose, as there isn't really an official specification.
So I would be interested in finding if something was said of this behavior. It is possible that I have not considered some corner case.
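
The quoting behavior described above can be illustrated with a small standalone sketch. This is not Jackson's actual `CsvDecoder` code; `StrictCsvLine` and `split` are hypothetical names, and the sample lines are shortened ASCII versions of the row from the report. It assumes a comma separator and a single-line record, and only shows why a lone second quote closes the field while a tripled quote at the start yields a literal quote inside a quoted field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of strict quote handling; not Jackson's implementation.
final class StrictCsvLine {

    /** Splits one comma-separated line, treating "" inside a quoted field as a literal quote. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        int i = 0;
        int n = line.length();
        while (true) {
            StringBuilder sb = new StringBuilder();
            if (i < n && line.charAt(i) == '"') {
                i++; // consume the opening quote
                while (true) {
                    if (i >= n) {
                        throw new IllegalArgumentException("Unterminated quoted field");
                    }
                    char c = line.charAt(i++);
                    if (c != '"') {
                        sb.append(c);
                    } else if (i < n && line.charAt(i) == '"') {
                        sb.append('"'); // doubled quote: one literal quote
                        i++;
                    } else {
                        break; // lone quote: end of the quoted field
                    }
                }
                // After the closing quote only a separator or end of line is allowed.
                if (i < n && line.charAt(i) != ',') {
                    throw new IllegalArgumentException(
                            "Unexpected character '" + line.charAt(i) + "' after closing quote");
                }
            } else {
                while (i < n && line.charAt(i) != ',') {
                    sb.append(line.charAt(i++));
                }
            }
            fields.add(sb.toString());
            if (i >= n) {
                break;
            }
            i++; // skip the separator
        }
        return fields;
    }

    public static void main(String[] args) {
        // Three quotes at the start: opening quote plus one escaped quote.
        System.out.println(split("OCESN,\"\",\"\"\"Cars \"\"L 11\"\" (Nantes)\"\"\",3"));
        try {
            // Two quotes at the start: read as an empty quoted field followed by 'C'.
            split("OCESN,\"\",\"\"Cars \"\"L 11\"\" (Nantes)\"\",3");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With two quotes at the start of the column, the sketch reads an empty quoted field and then rejects the `C` that follows, which mirrors the error in the report.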

@cowtowncoder
Member

Ok, reading RFC 4180, I see:

   5.  Each field may or may not be enclosed in double quotes (however
       some programs, such as Microsoft Excel, do not use double quotes
       at all).  If fields are not enclosed with double quotes, then
       double quotes may not appear inside the fields.  For example:

       "aaa","bbb","ccc" CRLF
       zzz,yyy,xxx

   6.  Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

which I think spells out why the test case is invalid -- the field must be quoted (as it contains double-quotes itself) and each double-quote within it must itself be doubled.
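
As a rough sketch of what rules 6 and 7 ask of a CSV producer, the snippet below quotes a field when it contains a comma, a double quote, or a line break, and doubles every embedded double quote. `CsvFieldWriter` and its methods are hypothetical names used only for illustration; they are not part of Jackson or of the tool that generated the file in question.

```java
// Hypothetical helper showing RFC 4180 style escaping on the writing side.
final class CsvFieldWriter {

    /** Wraps the value in double quotes and doubles each embedded double quote (rule 7). */
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Quotes only when the value contains a comma, a double quote, or a line break (rule 6). */
    static String quoteIfNeeded(String value) {
        boolean needsQuoting = value.indexOf('"') >= 0
                || value.indexOf(',') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        return needsQuoting ? quote(value) : value;
    }

    public static void main(String[] args) {
        // A value that itself starts and ends with a double quote.
        System.out.println(quoteIfNeeded("\"Cars \"L 11\" (Nantes)\""));
        // The RFC's own example value b"bb.
        System.out.println(quoteIfNeeded("b\"bb"));
    }
}
```

Under these rules a value that begins with a literal double quote is written with three quotes at the start: one to open the field and two for the escaped quote.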

@youribonnaffe
Author

I agree, the value is probably malformed according to the RFC. Still, do you think there would be interest in supporting such usage if it could be done without breaking the existing implementation?

@cowtowncoder
Member

@youribonnaffe If that could be supported (perhaps via an optional CsvParser.Feature), it could be useful. I have no objections to such support.
