WS285 data missing for Affymetrix chip GPL19230 #238

Open
MagdalenaZZ opened this issue Jun 20, 2022 · 8 comments

@MagdalenaZZ

Among the 813514 Microarray_results objects in WS285, 178124 objects have no gene mapping, including all the probes from the Affymetrix chip GPL19230.

Would you take a look and find out what went wrong? Although this does not affect the WormBase website, I rely on the probe-gene mapping to generate SPELL.

It probably has been going on for a long time. I only noticed it today because it only affects a small percentage of datasets in SPELL.

@MagdalenaZZ
Author

Hi @sdiamantakis , fyi

@MagdalenaZZ
Author

These are the counts of probes missing genes, per Microarray:
31255 Microarray "GPL21109"
30158 Microarray "Affymetrix_C.elegans_Genome_Array"
29547 Microarray "GPL14372"
26430 Microarray "GPL19516"
9238 Microarray "GPL14145"
8505 Microarray "GPL8303"
8053 Microarray "GPL11346"
3420 Microarray "GPL8304"
2840 Microarray "GPL8673"
2194 Microarray "WashU_GSC_C.elegans_Genome_Array"
1942 Microarray "GPL9450"
1828 Microarray "GPL14143"
1660 Microarray "GPL3518"
1492 Microarray "Agilent_C.elegans_Oligo_Microarray"
1015 Microarray "GPL14142"
814 Microarray "SMD2"
688 Microarray "Affymetrix1"
500 Microarray "UCSF_C.elegans_20K_PCR_array"
380 Microarray "GPL14146"
364 Microarray "GPL14144"
343 Microarray "SMD1"
232 Microarray "GPL13164"
187 Microarray "GPL13914"
187 Microarray "GPL13394"
114 Microarray "GPL8200"
108 Microarray "GPL9815"
8 Microarray "Buck_Institute_C.elegans_0.94k"
1 Microarray "GPL8209"
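
A tally like the one above could be reproduced from an ACE dump of the Microarray_results class. The following is a minimal sketch, not the script actually used: it assumes records are separated by blank lines and that each tag sits on its own line, as in the records quoted in this thread. The record "EXAMPLE_probe_2" and the function names are hypothetical.

```python
import re
from collections import Counter

# Toy input only. The first two records echo objects quoted in this thread;
# "EXAMPLE_probe_2" is a made-up name used purely for illustration.
SAMPLE_ACE = '''
Microarray_results : "SMD_0C24D10.5"
CDS C24D10.5
Gene WBGene00016056
Transcript C24D10.5.1

Microarray_results : "GPL19230_18574193"
Species "Caenorhabditis elegans"
Oligo_set "GPL19230_18574193"
Microarray "Affymetrix_C.elegans_Genome_Array"

Microarray_results : "EXAMPLE_probe_2"
Microarray "GPL14372"
'''


def parse_ace(text):
    """Yield (object_name, {tag: [values]}) for each Microarray_results record."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines or not lines[0].startswith("Microarray_results"):
            continue
        name = lines[0].split(":", 1)[1].strip().strip('"')
        tags = {}
        for ln in lines[1:]:
            parts = ln.split(None, 1)  # tag, then the rest of the line
            value = parts[1].strip().strip('"') if len(parts) > 1 else ""
            tags.setdefault(parts[0], []).append(value)
        yield name, tags


def unmapped_by_microarray(text):
    """Count records that have no Gene tag, grouped by their Microarray tag."""
    counts = Counter()
    for _name, tags in parse_ace(text):
        if "Gene" not in tags:
            for microarray in tags.get("Microarray", ["(no Microarray tag)"]):
                counts[microarray] += 1
    return counts


if __name__ == "__main__":
    for microarray, n in unmapped_by_microarray(SAMPLE_ACE).most_common():
        print(f'{n} Microarray "{microarray}"')
```

Records with no Microarray tag at all are grouped under a placeholder label, so they stay visible in the tally instead of being dropped.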

@MagdalenaZZ
Author

We create these mappings during the build:
cat BUILD/elegans/acefiles/microarray_mappings.ace | more

Microarray_results : "SMD_0C24D10.5"
CDS C24D10.5
Gene WBGene00016056
Transcript C24D10.5.1

Microarray_results : "SMD_0C26F1.3"
CDS C26F1.3
Gene WBGene00016148
Transcript C26F1.3.1

@MagdalenaZZ
Author

Using the script map_microarray.pl

@MagdalenaZZ
Author

The probes belong to the genome array:
Microarray_results : "GPL19230_18574193"
Species "Caenorhabditis elegans"
Oligo_set "GPL19230_18574193"
Microarray "Affymetrix_C.elegans_Genome_Array"

@markquintontulloch
Contributor

My reply to Wen:
"I believe this is a problem with the data that we receive from Caltech in the citace dump. For the Microarray that you mentioned, GPL19230, there does not appear to be any Microarray_results linked to it. Possibly related to this, there are 15,472 Microarray_results objects that do not have the Microarray tag populated. As you suspected, this appears to be an old issue - I’ve looked as far back as WS275 and seem to be seeing the same thing.

During the build we first map CDSs, transcripts, and pseudogenes to the Oligo_set and PCR_product objects. We then map these results onto the Microarray_results objects using the xref from Oligo_set/PCR_product to Microarray_results and add the Gene annotation based on xrefs from the mapped entities.

Looking at the citace dump used for the latest build, I cannot see any link between a Microarray_results object and the Microarray GPL19230. As such, I believe that this is the root cause of the problem. Hopefully that all makes sense, please let me know if you have any more questions or if you still think that the problem lies in the build process."
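
The two-step mapping described in the reply above can be sketched roughly as follows. This is an illustration of the idea only, not the logic of map_microarray.pl: the function name, the dictionary layout, and the identifiers "OLIGO_A" and "RESULT_A" are hypothetical, and the CDS/Gene pair is taken from the example record earlier in this thread.

```python
def propagate_gene_mappings(oligo_to_features, feature_to_gene, oligo_to_results):
    """Carry feature and Gene annotations from probe sets to Microarray_results.

    oligo_to_features: Oligo_set/PCR_product name -> mapped CDS, transcript
                       or pseudogene names (step 1 of the build mapping)
    feature_to_gene:   feature name -> Gene ID (xref from the mapped entity)
    oligo_to_results:  Oligo_set/PCR_product name -> Microarray_results names
    """
    annotated = {}
    for oligo, features in oligo_to_features.items():
        for result in oligo_to_results.get(oligo, []):
            entry = annotated.setdefault(result, {"features": set(), "genes": set()})
            for feature in features:
                entry["features"].add(feature)
                gene = feature_to_gene.get(feature)
                if gene is not None:
                    entry["genes"].add(gene)
    return annotated


# Toy data. "OLIGO_A" and "RESULT_A" are made-up names for illustration.
oligo_to_features = {"OLIGO_A": ["C24D10.5"]}
feature_to_gene = {"C24D10.5": "WBGene00016056"}
oligo_to_results = {
    "OLIGO_A": ["RESULT_A"],
    # This probe set has no entry in oligo_to_features, i.e. it was never
    # mapped to any feature, so its result cannot receive a Gene.
    "GPL19230_18574193": ["GPL19230_18574193"],
}

annotated = propagate_gene_mappings(oligo_to_features, feature_to_gene, oligo_to_results)
all_results = {r for results in oligo_to_results.values() for r in results}
without_gene = sorted(all_results - set(annotated))
print(without_gene)
```

In this sketch a Microarray_results object ends up without a Gene whenever its probe set has no mapped features, regardless of which Microarray it points to.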

@markquintontulloch
Contributor

From Wen:
"Dear Mark,

I did not know that the names of the Microarray_results must match the platform name. All GPL19230 Microarray_results link to Microarray "Affymetrix_C.elegans_Genome_Array", which is the original/generic name for all platforms by Affymetrix. You cannot see Microarray_results under Microarray. There is no XREF in the schema.

20 years ago Affymetrix only had one major platform called GPL200 (in GEO), which we named "Affymetrix_C.elegans_Genome_Array". Sometimes people custom-made a new chip using the same probes, which gets a new platform name in GEO, but WormBase does not create a new Microarray object for that; instead we group all of them into the same generic platform in WormBase.

But GPL19230 is a platform with all new probes totally unrelated to GPL200. We created a new Microarray object for it, so it makes sense to point it to its own platform. I can update this in WS286 so that they point to the Microarray object "GPL19230."

Looking at the rest of the Microarray_results without a Gene link, I saw that 841 GPL200 objects (named *_at) have no Gene link, while the remaining 21732 GPL200 Microarray_results have Gene links. All of them point to Microarray "Affymetrix_C.elegans_Genome_Array". So do these 841 objects indeed map nowhere in the genome?

And the Pristionchus pacificus microarray platform GPL14372 has 29547 Microarray_results. All of them point to Microarray "GPL14372", but none got a Gene link.

This is really strange because I am pretty sure that they used to be fine. I will dig into the earlier archives of WormBase to see when we lost them.

Wen"

My reply:
"Hi Wen,

After further investigation, I've realised that the issue with the GPL19230 array is that we have not mapped the probesets against the genome. This is done outside of the normal build cycle and it looks like it hasn't been done since December 2018, so we will need to investigate if there are additional arrays that have not been mapped. We'll work on remedying this before the next build starts.

Regarding the Pristionchus microarray GPL14372 - although this has ~29,500 Microarray_results that don't have any mapped gene, it has ~62,000 that are mapped to genes. If this hasn't always been the case and some gene mappings have been lost then there must be a separate issue causing the discrepancy.

As for the GPL200 (*_at) objects without a gene link, it looks to me like the vast majority of these are mapped to the genome. Is it possible that these simply do not overlap genes, or do those numbers not make sense to you?

Let me know if you have any further questions / thoughts.

Thanks,
Mark."

@markquintontulloch
Contributor

I have emailed Affymetrix regarding the missing probe sequences and mismatching information. Kevin emailed them about the same issues several years ago, but second time lucky, perhaps!

"We, at WormBase, would like to align the probe sequences for the GPL19230 dataset against the latest version of the C. elegans assembly. However, we have run into a number of issues.

On the GEO page (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL19230) the pgf file, which contains probe sequences, refers to a different probe set to the mps file. It is the latter file, which does not contain sequence information, that contains the probe set that we want. Confusingly, the probe set annotation file linked to from the Thermofisher page for the GeneChip C. elegans Gene 1.0 ST Array (https://www.thermofisher.com/order/catalog/product/902160) matches the probe set in the pgf file, while the annotation table on the GEO page matches the probe set in the mps file.

Furthermore, because we really need the probe sequences themselves, I tried extracting them from the specified assembly (UCSC.ce6) using the chromosomal coordinates given in the GEO page table, but a number of the coordinates lie outside the chromosome limits for that assembly.

Would it be possible for you to send us the probe sequences for this array?"
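
The out-of-range coordinates mentioned in the email can be detected with a simple bounds check before any sequence extraction is attempted. The sketch below assumes 1-based inclusive coordinates and takes chromosome lengths from a FASTA index in the `.fai` layout written by `samtools faidx` (name and length in the first two tab-separated columns). The probe IDs, coordinates and lengths are toy values, not real ce6 or GPL19230 data, and the function names are hypothetical.

```python
def lengths_from_fai(lines):
    """Read chromosome lengths from .fai-style lines (name<TAB>length<TAB>...)."""
    lengths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def out_of_bounds(probes, chrom_lengths):
    """Return (probe_id, reason) for probes that cannot be extracted.

    probes: iterable of (probe_id, chromosome, start, end), assumed to be
            1-based inclusive coordinates.
    """
    problems = []
    for probe_id, chrom, start, end in probes:
        length = chrom_lengths.get(chrom)
        if length is None:
            problems.append((probe_id, "unknown chromosome"))
        elif start < 1 or start > end:
            problems.append((probe_id, "invalid interval"))
        elif end > length:
            problems.append((probe_id, "past chromosome end"))
    return problems


# Toy values only: these are not real assembly lengths or probe coordinates.
fai_lines = ["chrI\t1000\t6\t60\t61\n", "chrII\t500\t1030\t60\t61\n"]
probes = [
    ("p1", "chrI", 10, 34),
    ("p2", "chrI", 990, 1014),
    ("p3", "chrIII", 5, 29),
]

chrom_lengths = lengths_from_fai(fai_lines)
problems = out_of_bounds(probes, chrom_lengths)
print(problems)
```

A report of this kind, listing which probes fall outside which chromosome, would also show whether the table coordinates refer to a different assembly than the one stated.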
