Dynamically Bootstrap Named Analysed Fields for Searching and Boosting #69

Open
ahagenbruch opened this issue Apr 30, 2015 · 3 comments

@ahagenbruch

Hi @agazzarini,
the current schema in SolRDF is mostly focused on the use case as a SPARQL endpoint, i.e. its object literals are indexed into unanalysed string fields. To accommodate a more common use case, where we also want analysed field searching and per-field boosting, we could write object literals into named fields derived from the QNames. As Solr provides the mechanism of dynamic fields, we propose the following enhancement:

Transform the QName and optional datatype and language information into a field name of the following structure:

prefix_predicateName[_datatype][_lang]
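
A minimal sketch of how such a field name could be derived, in Python for brevity. The function name `field_name` and its parameters are hypothetical; it only assumes that the datatype is given as a QName such as `xsd:string` and that a plain literal is passed in with `xsd:string` if it should match the `*_xsd_string` patterns.

```python
from typing import Optional


def field_name(prefix: str, local_name: str,
               datatype: Optional[str] = None,
               lang: Optional[str] = None) -> str:
    """Build prefix_predicateName[_datatype][_lang] for a predicate QName.

    Hypothetical helper, not part of SolRDF.
    """
    parts = [prefix, local_name]
    if datatype:
        # "xsd:integer" becomes "xsd_integer"
        parts.append(datatype.replace(":", "_"))
    if lang:
        parts.append(lang)
    return "_".join(parts)


title_field = field_name("dcterms", "title", "xsd:string")
german_label_field = field_name("skos", "prefLabel", "xsd:string", "de")
```

Here `title_field` is `dcterms_title_xsd_string` and `german_label_field` is `skos_prefLabel_xsd_string_de`.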

Use abstract heuristics to provide a basic search schema, which can then be adapted to the actual requirements of the dataset. We make the general assumption that all fields can have multiple values:

Map untyped and language-less literals to text_general:
<dynamicField name="*_xsd_string" type="text_general" indexed="true" stored="true" multiValued="true"/>

Map literals with language information to corresponding language text fields:
<dynamicField name="*_xsd_string_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
...

Map typed literals with datatypes to corresponding fields:
xsd:integer => <dynamicField name="*_xsd_integer" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:nonPositiveInteger => <dynamicField name="*_xsd_nonPositiveInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:negativeInteger => <dynamicField name="*_xsd_negativeInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:long => <dynamicField name="*_xsd_long" type="tlong" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedLong => <dynamicField name="*_xsd_unsignedLong" type="tlong" indexed="true" stored="true" multiValued="true"/>
xsd:int => <dynamicField name="*_xsd_int" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedInt => <dynamicField name="*_xsd_unsignedInt" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:short => <dynamicField name="*_xsd_short" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedShort => <dynamicField name="*_xsd_unsignedShort" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:byte => <dynamicField name="*_xsd_byte" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedByte => <dynamicField name="*_xsd_unsignedByte" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:nonNegativeInteger => <dynamicField name="*_xsd_nonNegativeInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:positiveInteger => <dynamicField name="*_xsd_positiveInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:float => <dynamicField name="*_xsd_float" type="tfloat" indexed="true" stored="true" multiValued="true"/>
xsd:decimal => <dynamicField name="*_xsd_decimal" type="tfloat" indexed="true" stored="true" multiValued="true"/>
xsd:double => <dynamicField name="*_xsd_double" type="tdouble" indexed="true" stored="true" multiValued="true"/>
xsd:boolean => <dynamicField name="*_xsd_boolean" type="boolean" indexed="true" stored="true" multiValued="true"/>
xsd:string => <dynamicField name="*_xsd_string" type="text_general" indexed="true" stored="true" multiValued="true"/>
xsd:hexBinary => <dynamicField name="*_xsd_hexBinary" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:base64Binary => <dynamicField name="*_xsd_base64Binary" type="binary" indexed="true" stored="true" multiValued="true"/>
xsd:anyURI => <dynamicField name="*_xsd_anyURI" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:QName => <dynamicField name="*_xsd_QName" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:NOTATION => <dynamicField name="*_xsd_NOTATION" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:normalizedString => <dynamicField name="*_xsd_normalizedString" type="text_general" indexed="true" stored="true" multiValued="true"/>
xsd:token => <dynamicField name="*_xsd_token" type="text_general" indexed="true" stored="true" multiValued="true"/>
xsd:language => <dynamicField name="*_xsd_language" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:IDREFS => <dynamicField name="*_xsd_IDREFS" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:IDREF => <dynamicField name="*_xsd_IDREF" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:ENTITIES => <dynamicField name="*_xsd_ENTITIES" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:ENTITY => <dynamicField name="*_xsd_ENTITY" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:NMTOKENS => <dynamicField name="*_xsd_NMTOKENS" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:Name => <dynamicField name="*_xsd_Name" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:NCName => <dynamicField name="*_xsd_NCName" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:ID => <dynamicField name="*_xsd_ID" type="string" indexed="true" stored="true" multiValued="true"/>

Map date and dateTime types to a date field and supplement the missing values (e.g. "2015" => "2015-01-01T00:00:00Z"):
xsd:date => <dynamicField name="*_xsd_date" type="tdate" indexed="true" stored="true" multiValued="true"/>
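
Supplementing the missing values could look roughly like this. It is only a sketch: the function name `to_solr_date` is hypothetical, it accepts just `YYYY`, `YYYY-MM` and `YYYY-MM-DD`, it does not check month or day ranges, and it ignores any time zone an xsd:date might carry.

```python
import re

# Year, optionally followed by month, optionally followed by day.
_PARTIAL_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")


def to_solr_date(value: str) -> str:
    """Pad a partial date to the full UTC timestamp a Solr date field expects."""
    match = _PARTIAL_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported date literal: {value!r}")
    year = match.group(1)
    month = match.group(2) or "01"  # missing month defaults to January
    day = match.group(3) or "01"    # missing day defaults to the 1st
    return f"{year}-{month}-{day}T00:00:00Z"
```

With this, `to_solr_date("2015")` gives `2015-01-01T00:00:00Z`, matching the example above.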

Map duration to a string field:
xsd:duration => <dynamicField name="*_xsd_duration" type="string" indexed="true" stored="true" multiValued="true"/>

Map Gregorian date fields to a string field:
xsd:gYearMonth => <dynamicField name="*_xsd_gYearMonth" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gYear => <dynamicField name="*_xsd_gYear" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gMonthDay => <dynamicField name="*_xsd_gMonthDay" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gDay => <dynamicField name="*_xsd_gDay" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gMonth => <dynamicField name="*_xsd_gMonth" type="string" indexed="true" stored="true" multiValued="true"/>
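
Since the declarations above all follow one pattern, they could also be generated from a type table when adapting the schema to a dataset. The following is just an illustration; `XSD_TO_SOLR` (a small excerpt of the mapping above) and `dynamic_field` are hypothetical names.

```python
# Excerpt of the xsd datatype -> Solr field type mapping listed above.
XSD_TO_SOLR = {
    "integer": "tint",
    "long": "tlong",
    "double": "tdouble",
    "boolean": "boolean",
    "gYear": "string",
}


def dynamic_field(xsd_type: str, solr_type: str, multi_valued: bool = True) -> str:
    """Render one <dynamicField/> declaration for the *_xsd_<type> pattern."""
    return (
        f'<dynamicField name="*_xsd_{xsd_type}" type="{solr_type}" '
        f'indexed="true" stored="true" '
        f'multiValued="{str(multi_valued).lower()}"/>'
    )


schema_lines = [dynamic_field(xsd, solr) for xsd, solr in XSD_TO_SOLR.items()]
```

The `multi_valued` switch is there for the case where a field is known to be single valued in a particular dataset.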

@agazzarini agazzarini self-assigned this Apr 30, 2015
@agazzarini
Member

Hi @ahagenbruch, this sounds really interesting. Many thanks for such a detailed proposal.
I introduced the "Hybrid" mode for mixing Solr and plain RDF features, so this could be something that goes in that direction. I strongly agree with you that StrFields have limited power in terms of querying capabilities.

I have to read your proposal again and then investigate what impact it would have on the existing code. In the meantime, a question: let's suppose we changed the schema in this way. What kind of queries would you issue to SolRDF? I think that, using plain SPARQL, you won't get any benefit from such a schema. Do you want to use Solr's built-in query parsers and get the results back as SPARQL results?

Thanks again


BTW: I created a user list on Google. Feel free to join us if you like; we could also discuss this with the other (few at the moment) users.

@agazzarini agazzarini added this to the Release 1.0 milestone May 9, 2015
@agazzarini
Member

@ahagenbruch I'm moving the discussion back here as these are concrete implementation details. Two doubts:

Field name

You said, in your proposal:

prefix_predicateName[_datatype][_lang] 

What about the prefix? In your schema example we have a skos:notation and ok, skos is a widely used / standard namespace. But what about custom namespaces? It doesn't sound good to index something like:

pippo_mynote_xsd_string 

because "pippo" might be known only at index time; at query time you might not be aware of the prefixes that were used for indexing, or you could use the same namespace mapped to a different prefix (e.g. pluto:mynote at query time and pippo:mynote at index time, where pippo and pluto point to the same namespace URI).
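
To make the concern concrete, here is a small sketch in Python. The namespace URI and the helper `field_for` are made up for illustration; the point is only that the same predicate URI ends up under two different field names.

```python
# Hypothetical namespace URI, bound to different prefixes at index and query time.
NOTES_NS = "http://example.org/notes#"

index_time_prefixes = {"pippo": NOTES_NS}
query_time_prefixes = {"pluto": NOTES_NS}


def field_for(prefix: str, local_name: str, suffix: str = "xsd_string") -> str:
    """Derive a field name from the prefix, as in the proposal."""
    return f"{prefix}_{local_name}_{suffix}"


indexed_field = field_for("pippo", "mynote")  # pippo_mynote_xsd_string
queried_field = field_for("pluto", "mynote")  # pluto_mynote_xsd_string

# Both prefixes resolve to the same predicate URI...
same_predicate = (
    index_time_prefixes["pippo"] + "mynote"
    == query_time_prefixes["pluto"] + "mynote"
)
# ...but the derived field names differ, so the query misses the indexed field.
fields_match = indexed_field == queried_field
```

Here `same_predicate` is true while `fields_match` is false.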

Multivalued fields

You said

We make the general assumption that all fields can have multiple values

Why? Each triple (i.e. each document) will have exactly one value for the object field, regardless of the schema we use. Am I missing something about your proposal?

@ahagenbruch
Author

On 18.05.15 at 15:11, Andrea Gazzarini wrote:

Hi Andrea,

You said, in your proposal:

prefix_predicateName[_datatype][_lang]

What about the prefix? In your schema example we have a skos:notation
and ok, skos is a widely used / standard namespace. But what about
custom namespaces? It doesn't sound good to index something like:

pippo_mynote_xsd_string

because "pippo" could be known only at index time; at query time you
couldn't be aware about prefixes I previously used in indexing or, you
could use the same namespace mapped with a different prefix (e.g.
pluto:mynote at query time and pippo:mynote at index time, where pippo
and pluto points to the same namespace URI)

I see your point, but I had these two use cases in mind when I wrote the
proposal:

  • Fielded search: The user wants to search on a specific field instead
    of on an aggregated field for 'simple search'. In most cases this would
    be done in an advanced search form in the front end where the user
    doesn't have to know about the actual field name in the index but sees a
    field name for general consumption (e.g. 'dcterms_title_xsd_string' vs.
    'Title'). The same would hold true if you exposed the document search
    via an API. It would be your responsibility to document the field names
    (possibly having a mapping in your API to more readable names to make
    them more developer friendly).
  • Weighted fields in a request handler: If you expose your 'simple
    search' via a request handler that has for instance an eDismax query
    parser you put your field names and boost values into the qf parameter
    and are thus in control of what fields will be searched and how they
    contribute to the overall score. If you don't do this you will probably
    feed your fields to an overall search field via copy fields in your
    schema. In either case you know the field names...
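
As an illustration of the second use case, the request parameters for such a handler could be assembled like this. `defType`, `q` and `qf` are standard Solr parameters; the helper name, the field names (derived with the naming scheme above) and the boost values are only assumptions for the example.

```python
from urllib.parse import urlencode


def edismax_params(query: str, boosts: dict) -> dict:
    """Build eDismax request parameters with per-field boosts in qf."""
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {"defType": "edismax", "q": query, "qf": qf}


params = edismax_params(
    "statistics",
    {
        "dcterms_title_xsd_string": 3,   # title matches count most
        "rdfs_label_xsd_string_en": 2,   # English labels count a bit less
    },
)
query_string = urlencode(params)  # ready to append to the handler URL
```

The same values could just as well be fixed as defaults of the request handler in solrconfig.xml, so that clients only send `q`.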

Multivalued fields

You said

We make the general assumption that all fields can have multiple values

Why? Each triple (i.e. each document) will have exactly one value for
the object field, regardless the schema we will use. Am I missing
something about your proposal?

By document I mean the subject URI as the document ID, the predicates as
field names and the object literals as their values. As we can't know in
advance which of our predicates might hold a list of objects*, the safe
way seems to be to make all fields multi-valued in the most general schema I
proposed. If (as in my other two example schemas) you tailor the fields
more to your dataset's needs, you probably don't want to make fields that
you know to be single-valued multi-valued...

* e.g.

<thsys/72180>
a skos:Concept, zbwext:Thsys ;
rdfs:label "Statistics"@en, "Statistik"@de ;
...
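
A rough sketch of that document model in Python: one document per subject, with every predicate collecting a list of literals. The function `to_documents` is hypothetical, and the datatype and language suffixes of the field names are left out to keep it short.

```python
def to_documents(triples):
    """Group (subject, predicate, literal) triples into one document per subject."""
    docs = {}
    for subject, predicate, literal in triples:
        # The subject URI serves as the document ID.
        doc = docs.setdefault(subject, {"id": subject})
        # Every field is a list, since any predicate may occur more than once.
        doc.setdefault(predicate, []).append(literal)
    return list(docs.values())


docs = to_documents([
    ("thsys/72180", "rdfs_label", "Statistics"),
    ("thsys/72180", "rdfs_label", "Statistik"),
])
```

This yields a single document for `thsys/72180` whose `rdfs_label` field holds both labels.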

Cheers,

Andre

@agazzarini agazzarini removed this from the Release 1.0 milestone Jun 28, 2015