Mimetype for CSV Sparql Query Results should use correct encoding as defined in the Specification #4856

pajoma · 2023-12-13T12:37:58Z

Current Behavior

The query results are encoded in UTF-8:

public static final TupleQueryResultFormat CSV = new TupleQueryResultFormat("SPARQL/CSV", List.of("text/csv"),
     StandardCharsets.UTF_8, List.of("csv"), SPARQL_RESULTS_CSV_URI, NO_RDF_STAR);

The specification says:

Systems providing these formats should note that the content types for CSV is text/csv and for TSV text/tab-separated-values. Being text/*, the default character set is US-ASCII. The charset parameter should be used in conjunction with SPARQL Results; UTF-8 is recommended: text/csv; charset=utf-8 and text/tab-separated-values; charset=utf-8.

But the MIME type exposed by RDF4J is "text/csv", without a charset parameter (in SparqlMimeTypes):

public static final String CSV_VALUE = "text/csv";

UTF-8 is obviously the correct choice, but standard clients such as the Python requests library assume "ISO-8859-1" for the Content-Type "text/csv" when no charset parameter is given.
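
To illustrate the effect on a client, here is a minimal sketch using only the Python standard library. The sample CSV row is made up for illustration; it shows what happens when UTF-8 bytes are decoded with the ISO-8859-1 default that a client falls back to for a bare "text/csv".

```python
# Hypothetical CSV result containing a non-ASCII character.
csv_text = "name\nZoë\n"

# The server writes the result as UTF-8 bytes.
payload = csv_text.encode("utf-8")

# A client that assumes ISO-8859-1 for "text/csv" decodes the same bytes differently.
decoded_as_latin1 = payload.decode("iso-8859-1")
decoded_as_utf8 = payload.decode("utf-8")

print(repr(decoded_as_latin1))
print(repr(decoded_as_utf8))
```

Only the UTF-8 decoding round-trips to the original text; the ISO-8859-1 decoding turns each multi-byte character into two unrelated characters.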

I can modify the REST controllers so that they do not use the standard RDF4J MIME types, e.g.:

    @PostMapping(value = "/query", consumes = {MediaType.TEXT_PLAIN_VALUE, SparqlMimeTypes.SPARQL_QUERY_VALUE},
            produces = {SparqlMimeTypes.JSON_VALUE, SparqlMimeTypes.CSV_VALUE + ";charset=UTF-8"}
    )
    @ResponseStatus(HttpStatus.OK)
    Flux<BindingSet> queryBindingsPost(@RequestBody String query) {...}

but then I have to map "text/csv;charset=UTF-8" back to "text/csv" everywhere else to get the correct ResultWriters.

Expected Behavior

public static final TupleQueryResultFormat CSV = new TupleQueryResultFormat("SPARQL/CSV", List.of("text/csv"), StandardCharsets.UTF_8, List.of("csv"), SPARQL_RESULTS_CSV_URI, NO_RDF_STAR);

The MIME type in this definition should be "text/csv;charset=utf-8".

If "text/csv" remains included, the SPARQLResultsCSVWriter should use "ISO-8859-1" as its encoding (perhaps with a warning?).

Steps To Reproduce

Expose a SPARQL endpoint using the standard MIME types defined in RDF4J.
Call it with the Python requests library and see that it decodes the result as "ISO-8859-1":

            response = requests.post(
                url=f"...",
                data=query.encode("utf-8"),
                headers={
                    "X-API-KEY": api_key,
                    "Content-Type": "text/plain",
                    "Accept": "text/csv",
                    "X-Application": scope,
                },
            )
   
            enc = response.encoding  # reports "ISO-8859-1", but the body is actually UTF-8
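
Until the server sends a charset parameter, a client-side workaround could look roughly like the sketch below. The helper name `effective_encoding` is hypothetical and not part of requests or RDF4J. It assumes that SPARQL CSV/TSV results without a charset are UTF-8 (as the specification recommends), and it falls back to ISO-8859-1 for everything else, mirroring the default described above.

```python
from email.message import Message

# Media types that the SPARQL CSV/TSV results specification recommends serving as UTF-8.
SPARQL_TEXT_RESULT_TYPES = {"text/csv", "text/tab-separated-values"}


def effective_encoding(content_type: str) -> str:
    """Pick a decoding charset from a Content-Type header value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter always wins.
    charset = msg.get_content_charset()
    if charset is not None:
        return charset

    # Assumption: SPARQL CSV/TSV results without a charset are really UTF-8.
    if msg.get_content_type() in SPARQL_TEXT_RESULT_TYPES:
        return "utf-8"

    # Otherwise keep the ISO-8859-1 default that clients apply to other text types.
    return "iso-8859-1"


print(effective_encoding("text/csv"))
print(effective_encoding("text/csv; charset=ISO-8859-1"))
```

With requests, the result could be assigned to `response.encoding` before reading `response.text`, for example `response.encoding = effective_encoding(response.headers.get("Content-Type", ""))`.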

Version

4.3.8

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

No response


pajoma added the 🐞 bug issue is a bug label Dec 13, 2023

hmottestad added the specification issues related to compliance to standards and external specs label Dec 13, 2023
