I have been doing some basic code optimization on desmume, plus a few changes to gain a few more frames per second.
All the changes here were made on the 0.9.4 branch and tested with the GTA Chinatown ROM.
I really don't know whether all the changes actually improve anything. I'd like to hear your opinions.
Basic tips:
(*) Compile with opengl2d support.
(*) Use framedrop.
I'm using 2.
Code:
(*) Automatic code optimization (gcc and friends)
Edit Makefiles (in src, cli, gtk, ..)
Change CFLAGS and CXXFLAGS
CFLAGS = -O3 -march=arch -minline-all-stringops -funroll-loops -msse -ffast-math
CXXFLAGS = -O3 -march=arch -minline-all-stringops -funroll-loops -msse -ffast-math
Replace arch with your architecture name: http://www.linuxjournal.com/article/7269
(*) Change INSTRUCTIONS_PER_BATCH and CYCLES_TO_WAIT_FOR_IRQ in NDS_exec (NDSSystem.cpp)
const int INSTRUCTIONS_PER_BATCH = 12;
const int CYCLES_TO_WAIT_FOR_IRQ = 1200;
(*) Decrease sound quality:
Change audiofmt.freq in SNDSDLInit (sndsdl.cpp)
audiofmt.freq = 22050;
chan->sampinc in adjust_channel_timer (SPU.cpp)
chan->sampinc = (((double)ARM7_CLOCK) / (22050 * 2)) / (double)(0x10000 - chan->timer);
samples_per_frame (SPU.cpp)
static const double samples_per_frame = time_per_frame * 22050;
fmt.rate in WavWriter::open (SPU.cpp)
fmt.rate = 22050;
(*) Use fullscreen in desmume SDL cli:
Replace this line:
sdl_videoFlags |= SDL_RESIZABLE;
with this:
sdl_videoFlags |= SDL_FULLSCREEN;
Additionally, I have re-coded MatrixMultVec4x4, MatrixMultVec3x3, MatrixMultiply, MatrixTranslate and MatrixScale (matrix.cpp) using SSE (my CPU doesn't support SSE2), in case anyone needs them.
Thanks!
Offline
If it does still work the way I coded it, instructions per batch is something that SHOULDN'T be changed without a huge amount of testing, which you didn't seem to do. Probably the same applies about the IRQ stuff. In my opinion, making it configurable per game, or other hacky stuff, shouldn't be done, as it opens the door to a hackish world. The architecture optimizations aren't useful across ports, so they're pretty useless. Decreasing the sound quality is an option: a bad and stupid "optimization", but valid nonetheless.
Pretty much the only thing I would consider useful, as an experienced coder (keep in mind that I'm no longer working actively on desmume), is the SSE matrix operations, which you left out of the post.
Offline
My ears can't tell the difference between "good" and "bad" sound quality, but my eyes can see more frames per second, so decreasing sound quality is an optimization for me. :-D
Here are my SSE matrix functions (they are a little naive).
/* Needs <xmmintrin.h>. matrix and vecPtr must be 16-byte aligned,
   because they are accessed through __m128 pointers. */
void MATRIXFASTCALL MatrixMultVec4x4 (const float *matrix, float *vecPtr)
{
    __m128 t,m;
    __m128* p = (__m128*) matrix;
    __m128* result = (__m128*) vecPtr;
    /* Copy the input vector first: *result aliases vecPtr. */
    float x = vecPtr[0];
    float y = vecPtr[1];
    float z = vecPtr[2];
    float w = vecPtr[3];
    /* result = column0*x + column1*y + column2*z + column3*w */
    m = _mm_set_ps1(x);
    *result = _mm_mul_ps(*p,m);
    p++;
    m = _mm_set_ps1(y);
    t = _mm_mul_ps(*p,m);
    *result = _mm_add_ps(*result,t);
    p++;
    m = _mm_set_ps1(z);
    t = _mm_mul_ps(*p,m);
    *result = _mm_add_ps(*result,t);
    p++;
    m = _mm_set_ps1(w);
    t = _mm_mul_ps(*p,m);
    *result = _mm_add_ps(*result,t);
}
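void MATRIXFASTCALL MatrixMultVec3x3 (const float *matrix, float *vecPtr)
{
    __m128 t,m;
    __m128* p = (__m128*) matrix;
    __m128* result = (__m128*) vecPtr;
    float x = vecPtr[0];
    float y = vecPtr[1];
    float z = vecPtr[2];
    /* The 128-bit stores below also write vecPtr[3], so vecPtr must hold
       four floats. Save the fourth one so it can be restored at the end. */
    float w = vecPtr[3];
    /* result = column0*x + column1*y + column2*z */
    m = _mm_set_ps1(x);
    *result = _mm_mul_ps(*p,m);
    p++;
    m = _mm_set_ps1(y);
    t = _mm_mul_ps(*p,m);
    *result = _mm_add_ps(*result,t);
    p++;
    m = _mm_set_ps1(z);
    t = _mm_mul_ps(*p,m);
    *result = _mm_add_ps(*result,t);
    /* A 3x3 multiply should leave the fourth component untouched. */
    vecPtr[3] = w;
}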
/* Needs <string.h> for memcpy. matrix must be 16-byte aligned. */
void MATRIXFASTCALL MatrixMultiply (float *matrix, const float *rightMatrix)
{
    __attribute__((aligned (16))) float tmpMatrix[16];
    __m128* cols = (__m128*) matrix;
    __m128* result = (__m128*) tmpMatrix;
    int i, j;
    for (i=0; i<4; i++)
    {
        /* Result column i = sum over j of (left column j) * rightMatrix[i*4+j]. */
        __m128 acc = _mm_set_ps1(0.f);
        for (j=0; j<4; j++)
        {
            __m128 m = _mm_set_ps1(rightMatrix[i*4+j]);
            acc = _mm_add_ps(acc, _mm_mul_ps(cols[j],m));
        }
        result[i] = acc;
    }
    /* The product is built in a temporary because matrix is also an input. */
    memcpy (matrix, tmpMatrix, sizeof(float)*16);
}
void MATRIXFASTCALL MatrixTranslate (float *matrix, const float *ptr)
{
    int i;
    __m128 t,m;
    __m128* p = (__m128*) matrix;
    __m128* result = (__m128*) matrix;
    /* Column 3 accumulates column0*ptr[0] + column1*ptr[1] + column2*ptr[2]. */
    result=result+3;
    for (i=0; i<3; i++)
    {
        m = _mm_set_ps1(ptr[i]);
        t = _mm_mul_ps(*p,m);
        *result = _mm_add_ps(*result,t);
        p++;
    }
}
void MATRIXFASTCALL MatrixScale (float *matrix, const float *ptr)
{
    int i;
    __m128 m;
    __m128* result = (__m128*) matrix;
    /* Scale each of the first three columns by ptr[0], ptr[1], ptr[2]. */
    for (i=0; i<3; i++)
    {
        m = _mm_set_ps1(ptr[i]);
        *result = _mm_mul_ps(*result,m);
        result++;
    }
}
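One thing I'm not sure about is alignment: the functions above cast the float pointers to __m128*, which only works if the data is 16-byte aligned. As a rough sketch (the function name MatrixMultVec4x4_unaligned is made up for this example and is not in matrix.cpp), the same 4x4 multiply can be written with unaligned loads and stores, which trades a little speed for not faulting on unaligned pointers:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical variant of MatrixMultVec4x4: same math, but it uses
   _mm_loadu_ps/_mm_storeu_ps so the pointers need no 16-byte alignment. */
void MatrixMultVec4x4_unaligned(const float *matrix, float *vecPtr)
{
    __m128 r;
    assert(matrix != 0 && vecPtr != 0);

    /* Column-major layout: each group of four floats is one column. */
    r = _mm_mul_ps(_mm_loadu_ps(matrix + 0), _mm_set_ps1(vecPtr[0]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(matrix + 4),  _mm_set_ps1(vecPtr[1])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(matrix + 8),  _mm_set_ps1(vecPtr[2])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(matrix + 12), _mm_set_ps1(vecPtr[3])));

    /* All inputs were read before this store, so writing in place is safe. */
    _mm_storeu_ps(vecPtr, r);
}
```

Whether the unaligned version is measurably slower depends on the CPU, so it would need profiling before replacing the aligned one.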
Offline
If it does still work the way I coded it, instructions per batch is something that SHOULDN'T be changed without a huge amount of testing, which you didn't seem to do. Probably the same applies about the IRQ stuff.
We've run into a lot of problems with this lately. It is a great speedup, but nowadays changing it to = 1 (effectively disabling it) is one of the best ways to fix crashing games. I used to have an explanation for why it causes problems but I've forgotten it. Something to do with things happening in the middle of batches. It is such a great speedup, I should add, that we have been leaving it on even though there are definitely games it crashes.
Soon I am going to rewrite the emulation loop as a schedule-based system, and then it should be irrelevant, I hope, since the batch size will automatically be as large as possible, right up to the next event. We shall see how that goes.
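To make the idea concrete, here is a minimal sketch of "batch right up to the next event". Everything in it (the Scheduler struct, MAX_EVENTS, cycles_until_next_event) is hypothetical and only illustrates the principle; it is not the planned DeSmuME code:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_EVENTS 8  /* hypothetical capacity */

/* Hypothetical scheduler state: absolute cycle stamps of the pending
   events (timers, IRQs, DMA, ...). Not DeSmuME's real data structure. */
typedef struct {
    uint64_t event_cycle[MAX_EVENTS];
    int count;
} Scheduler;

/* Cycles the CPU may execute from `now` before the earliest pending event.
   Returns 0 when an event is already due. */
uint64_t cycles_until_next_event(const Scheduler *s, uint64_t now)
{
    uint64_t earliest = UINT64_MAX;
    int i;
    assert(s->count >= 0 && s->count <= MAX_EVENTS);
    for (i = 0; i < s->count; i++)
        if (s->event_cycle[i] < earliest)
            earliest = s->event_cycle[i];
    return earliest > now ? earliest - now : 0;
}
```

The emulation loop would then run the CPU for at most that many cycles, handle the event, and ask again, so nothing can happen in the middle of a batch.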
Offline
audiofmt.freq = 22050;
Submit a patch with this parameterized by a #define and I think we'll use it.
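Roughly what I have in mind, as a sketch only: the macro name SPU_SAMPLE_RATE, the helper function names, the 44100 default and the ARM7 clock value are assumptions made for this example and should be checked against the real code in sndsdl.cpp and SPU.cpp.

```c
#include <assert.h>

/* Hypothetical macro name; override at build time with
   -DSPU_SAMPLE_RATE=22050 to get the lower-quality, faster setting. */
#ifndef SPU_SAMPLE_RATE
#define SPU_SAMPLE_RATE 44100  /* assumed default */
#endif

/* Assumed ARM7 clock in Hz; use the tree's own ARM7_CLOCK in real code. */
#define ARM7_CLOCK_HZ 33513982

/* Mirrors the sampinc expression quoted from adjust_channel_timer. */
double channel_sample_increment(unsigned timer)
{
    assert(timer < 0x10000);
    return (((double)ARM7_CLOCK_HZ) / (SPU_SAMPLE_RATE * 2))
           / (double)(0x10000 - timer);
}

/* Mirrors the samples_per_frame constant, with the rate parameterized. */
double spu_samples_per_frame(double time_per_frame)
{
    return time_per_frame * SPU_SAMPLE_RATE;
}
```

The same macro would also feed audiofmt.freq and fmt.rate, so the four places mentioned in the first post could never disagree.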
The matrix stuff is an organizational headache as someone who cares a lot about the linux builds will have to figure out how to conditionally compile all that stuff....
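For what it's worth, one possible shape for the conditional compilation is sketched below. GCC and Clang define __SSE__ when SSE code generation is enabled (other compilers use different macros); MATRIX_USE_SSE and MatrixScale_portable are names invented for this example.

```c
#include <assert.h>

/* __SSE__ is predefined by GCC/Clang when SSE is enabled (e.g. -msse). */
#if defined(__SSE__)
#include <xmmintrin.h>
#define MATRIX_USE_SSE 1  /* hypothetical switch name */
#else
#define MATRIX_USE_SSE 0
#endif

/* Scales the first three columns of a column-major 4x4 matrix by
   ptr[0..2], using SSE when available and plain C otherwise. */
void MatrixScale_portable(float *matrix, const float *ptr)
{
    int i;
#if MATRIX_USE_SSE
    assert(matrix != 0 && ptr != 0);
    for (i = 0; i < 3; i++) {
        __m128 col = _mm_loadu_ps(matrix + 4 * i);
        _mm_storeu_ps(matrix + 4 * i, _mm_mul_ps(col, _mm_set_ps1(ptr[i])));
    }
#else
    int j;
    assert(matrix != 0 && ptr != 0);
    for (i = 0; i < 3; i++)
        for (j = 0; j < 4; j++)
            matrix[4 * i + j] *= ptr[i];
#endif
}
```

Both branches give the same result, so ports without SSE keep working and the Makefiles only decide whether -msse is passed.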
Offline
I have a noob question and a suggestion:
If we remove the -g compilation flag, I think it could improve execution speed, because I think the GDB debugging information would no longer be put inside the executable.
Am I right about this?
Offline